Most of last year I worked on my thesis, which attempted to answer one simple question: can an autonomous agent learn the rules of a game by observing people play it?
As we started working on the problem, we realized how difficult even the first part was. Effectively, we had to solve the computer vision problem: extracting information about the world from visual input, just like humans do. To build a model of the world, you need to separate the players in the video from the background, the players from the pieces, and the pieces from the board. In short, separate entities need to be identified separately; that is, you need to group pixels together to form ‘discrete’ entities or objects (image segmentation). The next step is tracking the objects you have identified (video tracking). Lighting changes, occluding objects, cluttered environments and shape-changing objects are only some of the issues you would face while tracking objects in a scene. Crudely put, all of the above is still low-level vision.
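To make the segmentation step concrete, here is a minimal toy sketch (not our actual pipeline, which worked on a Kinect stream): given a binary ‘frame’ where 1s are foreground pixels, group connected pixels into discrete objects via flood fill.

```python
from collections import deque

def segment(grid):
    """Group foreground pixels (1s) into discrete objects via flood fill.

    Returns a label grid where each 4-connected component of 1s gets a
    distinct positive integer label; background (0) stays 0.
    """
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1 and labels[r][c] == 0:
                next_label += 1          # found a new, unlabelled object
                queue = deque([(r, c)])
                labels[r][c] = next_label
                while queue:             # flood-fill its connected pixels
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and labels[ny][nx] == 0):
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
    return labels, next_label

# A toy 4x5 "frame" containing two separate blobs:
frame = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
]
labels, count = segment(frame)
print(count)  # 2
```

Real segmentation on camera input is far messier, of course: the hard part is producing that clean binary foreground mask in the first place, which is exactly where lighting changes and clutter bite.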
The next step is to keep track of how the relationships between objects change. That is high-level vision. Simply segmenting pixels and tracking them is not enough. This part is extremely tough given that the very concept of a ‘relationship’ between objects is not easily described to a machine. For instance, say you have a glass. A person pours water into it and drinks it. You might teach a toddler, or better yet, he might learn from observation, that a glass is something that can hold a liquid and can then be used for transferring liquids. But then one fine day, the toddler observes that an upside-down glass can’t be used for storing liquids. The water just spills. But that glass is not totally useless: it can be used for keeping something on it, i.e. providing support. So the relationship between objects is not a bland static thing – it changes depending on their relative orientation, among other factors (the theory of affordances). High-level vision is where a lot of ‘intelligence’ should come in. The problem is how to formalize all this for an autonomous agent to comprehend. In fact, the problem of linking pixel-level processing to object- and event-level processing is a topic of active research in both neuroscience and computer vision (mid-level vision).
Some assumptions and lines of code later, we had a system that looks at a Kinect stream of a game being played and generalizes the rules of simple games like Towers of Hanoi and Peg Solitaire from the visual observations. We also learnt the spatial structure of the game being played using some heuristics. A couple of thoughts on the game rule learning system:
- The rules we learnt are severely limited by the logical framework of the world provided to the system. Some simple concepts need to be fed in, like what a board is, what up and down are, what backward and forward are, etc. We used an inductive logic programming framework to learn the rules, and rules can be learnt only in terms of concepts already present in the system. In fact, learning new concepts about the world is itself an aspect of intelligence. Think about how the language you think in limits the thoughts you have. To be more specific, the way one represents the world in one’s own head decides how one is going to think and act. For instance, a C++ programmer would probably think in terms of for loops to solve a problem, while someone in MATLAB would always look for ways to avoid for loops for the same problem. It is a two-fold problem: representation of learnt concepts and addition of new concepts. Heck, imagine an autonomous agent learning a new concept that changes the way it represents its old concepts!
- The problem we attempted to solve was making sense of the world from visual observation. It is in essence what humans/scientists are doing all the time. Physicists are attempting to find the rules that our world and the objects in it follow. Newton coming up with the Laws of Motion from his experiments is a perfect example. He made observations of the world around him and made a rational guess about the rules governing it (physics). Thankfully for the rest of us, he realized how inefficient our representation system (maths) was and gave us calculus. A lot of hard work (see this page from his notebook, for example) went into developing calculus. He used his old concepts to learn something new, and one of the new concepts he learnt was just how much his (and the rest of the world’s) old representation system sucked. Newton is what a really intelligent agent should aspire to be. (Yeah, I get the impending joke. A guy who calculates so rigorously is already a robot.)
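To give a flavour of the inductive-logic-programming idea mentioned above without reproducing our actual system, here is a toy sketch. Observed moves are described by ground facts (the predicate names here are made up for illustration); the learner keeps the conjunction of conditions shared by every legal move and checks that it rejects the illegal ones.

```python
def induce_rule(positives, negatives):
    """Toy rule induction: generalize to the most specific conjunction
    covering all positive examples, then verify it excludes negatives.

    Each example is a frozenset of ground facts describing a candidate move.
    Returns the learnt rule body as a set, or None if it overgeneralizes.
    """
    rule = set(positives[0])
    for example in positives[1:]:
        rule &= example          # keep only conditions true in every legal move
    for example in negatives:
        if rule <= example:      # the rule also covers an illegal move
            return None
    return rule

# Hypothetical observations of Peg Solitaire moves, as ground facts:
legal = [
    frozenset({"jump_over_peg", "lands_on_empty", "straight_line"}),
    frozenset({"jump_over_peg", "lands_on_empty", "straight_line", "near_edge"}),
]
illegal = [
    frozenset({"jump_over_peg", "straight_line"}),   # lands on a full hole
    frozenset({"lands_on_empty", "straight_line"}),  # nothing jumped over
]
print(sorted(induce_rule(legal, illegal)))
# ['jump_over_peg', 'lands_on_empty', 'straight_line']
```

Note how the learnt rule drops `near_edge` (it was not true of every legal move) but can never invent a predicate that was not fed in: exactly the limitation described above.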
An agent that can play all games (a General Game Playing system) has real-world applications like carrying out search operations and strategizing in military operations and electronic commerce. But the system needs to know the rules before it can start strategizing, so it needs its complement: an agent that can learn the rules of any game presented to it. Many real-world problems can be represented in the form of games (nudge nudge, Game Theory). A truly intelligent system would be able to learn all the games presented to it, i.e. become a General Game Learner. And to do that, an agent should be able to learn new representations, assimilate new concepts and come up with ‘legibly’ elegant solutions.
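Once rules are learnt, strategizing reduces to consulting them. As a minimal sketch (one plausible encoding, not our system’s), the learnt rule of Towers of Hanoi becomes a legality predicate a game-playing agent could query:

```python
def legal_hanoi_move(pegs, src, dst):
    """The learnt Towers of Hanoi rule as a legality check: move the top
    disc of `src` onto `dst`, allowed only if `dst` is empty or its top
    disc is larger.

    `pegs` maps peg name -> list of disc sizes, bottom of the stack first.
    """
    if not pegs[src]:
        return False                      # nothing to move
    moving = pegs[src][-1]
    return not pegs[dst] or pegs[dst][-1] > moving

state = {"A": [3, 2, 1], "B": [], "C": []}
print(legal_hanoi_move(state, "A", "B"))  # True: smallest disc onto empty peg
print(legal_hanoi_move(state, "B", "C"))  # False: peg B is empty
```

A General Game Playing system would search over such predicates to plan; a General Game Learner would have to produce them from observation in the first place.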
How we get there is a tough question to answer. Are we stuck with the wrong hardware to come up with true intelligence? How does a computer realize motivation? Can we even define motivation for a being that has no life? Is vision necessary for intelligence, or can we come up with better sensors for agents to learn about the world? The list never ends.