Monday, June 1, 2009

A Response to Ben Goertzel's blog post on Reinforcement Learning

This is a response to Ben Goertzel's blog post:
Reinforcement Learning: Some Limitations of the Paradigm

I wanted to respond to Ben's blog entry, but I'm so long winded it turned out to be 4 times longer than the maximum reply could be, so I've started my own blog to post a reply!

So much to comment on.

I'm a reinforcement learning advocate and spend endless hours arguing that intelligent human behavior is the product of reinforcement learning. We simply ARE reward seeking machines and not goal seeking machines. Future reward maximizing is the most general way to express (and implement in hardware) the concept of a goal and all human goals that I've ever seen can be translated into, and explained as, the product of reward maximizing in the form of reinforcement learning.

On Ben's opening thought experiment about how some people would not push the ultimate orgasm button, I would say that's a failure to understand how reinforcement learning actually works. Reinforcement learning is more complex than most people grasp. I'll explain...

Reinforcement learning is implemented at a very low level in the hardware as a very stupid statistical process. It's not a high level rational thought process. The machine works by attempting to estimate future rewards, but it's not perfect. Even a machine like the human brain is not all that good at predicting future rewards. Think about your own emotions to understand how good this low level statistical process is. What sort of situation might cause fear in you? What sort of situation might cause joy and happiness? The brain is able to recognize a situation, such as a big snake in the grass, or a man holding a gun and pointing it at you, and translate that into a prediction of low future rewards. That's what that fear is - it's your brain making a low level hardware prediction of the odds of you receiving a large near term negative reward.

That's as smart as the low level reward hardware gets. It's just an advanced pattern recognition system that can estimate future rewards based on the current state of the environment.
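As a minimal sketch of the kind of dumb, low level statistical reward predictor I'm describing (my own illustration, not anything from Ben's post), consider tabular TD(0) learning. The learner never reasons about the future; it just nudges its value estimate for each situation toward what actually happened next. The state names and numbers here are made up:

```python
# A toy TD(0) value learner: the "fear" is just a learned negative estimate.

def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Shift the value estimate for `state` toward the observed outcome."""
    target = reward + gamma * values.get(next_state, 0.0)
    values[state] = values.get(state, 0.0) + alpha * (target - values.get(state, 0.0))

values = {}
# Repeated experience: "snake in grass" is followed by a big negative reward.
for _ in range(100):
    td0_update(values, "snake in grass", -10.0, "bitten")

# The learned value is strongly negative - that prediction IS the fear.
print(round(values["snake in grass"], 2))
```

Nothing in this loop "understands" snakes; the negative estimate emerges purely from the statistics of past experience, which is the point.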

That reward predicting hardware however doesn't directly cause us to make decisions. When we are sitting there looking at Ben's Button, it's not the low level statistical hardware in our brain that calculates the potential future "win" of hitting the button. That's not how it works.

What the low level reward predicting hardware does is SHAPE OUR BEHAVIOR. Just like when we train a dog to roll over in response to a hand wave from his master. Each time we reward him, we have reinforced that behavior in him - that is, the behavior of responding to the hand wave by rolling over. The response (aka the behavior) gets a little stronger with each reward.

It's the dog's past statistical history of how many times that roll-over behavior has resulted in a reward that is the cause of the dog rolling over.

Now, with a well trained dog, we can give it a choice. We can put a big pile of dog treats on one side of him, and we can tell him to stay. We can then wave our hand as a signal for him to roll over. What will he do? Roll over, or go for the big pile of dog treats? He will roll over. He will not seek the instant pleasure of eating 100 dog treats even though the total reward of the food would be far greater than rolling over.

This happens because even though the dog is a reinforcement learning machine, it is not a rational pleasure seeker. His actions are not a rational calculation of potential future rewards; they are a function of the rewards he got IN THE PAST. The behavior the dog produces at any one moment is a function of how he was trained, by rewards he got in the past.

In this example, the dog had a pile of treats to respond to. He's never seen such a pile of treats before, so his low level behavior producing hardware has no direct prior experience of jumping for the treats while at the same time being told to stay by his master. This is a new situation for him. He has, however, had plenty of experience with what happens when he doesn't obey his master. And that past experience has trained his low level behavior hardware to pick the option of rolling over.
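The dog's choice can be sketched in a few lines (a hypothetical illustration with made-up numbers): the behavior emitted NOW is picked by strengths built up from PAST rewards, not by comparing the payoffs actually on offer.

```python
# Behavior selection as a function of accumulated past reinforcement only.

strengths = {"roll over": 0.0, "go for treats": 0.0}

def reinforce(behavior, reward, rate=0.2):
    """Each reward following a behavior makes that behavior a little stronger."""
    strengths[behavior] += rate * reward

# Years of training: obeying the hand signal has been rewarded many times.
for _ in range(50):
    reinforce("roll over", 1.0)

# The huge treat pile is a brand new situation - no training history for it -
# so "go for treats" has no accumulated strength, however big the prize is.
chosen = max(strengths, key=strengths.get)
print(chosen)
```

Note there is no term anywhere in this selection rule for the size of the reward currently sitting in front of the learner; that is exactly the asymmetry the dog example demonstrates.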

So let's return to Ben's Button. When a human is faced with the choice, he will do the same thing the dog did. The human has NEVER BEFORE been given this experience. As such, the low level statistical hardware that shapes our very complex behaviors through reinforcement has never in the past had the opportunity to shape the "button pushing" behavior in the human. So the human will not push the button simply because he has been reinforced to do so. He will push it, or not push it, in response to his PAST training experiences.

So what controls whether we push a button in front of us that some guy named Ben says will give us an ultimate orgasm? Well, we may have thoughts such as: maybe this is one of those drugs that will kill us! Or maybe this is a joke, and people will laugh at me if I push it. The guy on the street will push it, or not push it, because of what you say to him, and because of the environment he is in, all based on a lifetime of past training experiences - none of which actually has anything to do with the ultimate orgasm, which he has never in his life experienced!

But Ben didn't ask people on the street; he asked us, or others, to answer a thought experiment question. So what goes through our minds when we are asked to do that? What past behavior conditioning would lead us to answer that one way or another?

Well, women are conditioned by society to be caring towards others. They are basically punished by their peers if they show signs of being selfish towards others. Pushing a button that gives them selfish pleasure and causes ultimate harm to the rest of the human population is exactly what most women get trained by society NOT to do. So is it so surprising that when Ben asks his daughter what she would do, we get the answer "no" instantly? Not surprising at all. It's exactly how she was conditioned by society to respond - just like the dog rolled over instead of going for the food because that's how he was conditioned to respond.

Society, on the other hand, conditions the typical male to be a reward seeker. Males are expected to "grab the reward" whenever possible. To not do so would be a sign of weakness, which our society conditions us to avoid. So, gee, the two males answered "yes". Again, not a big surprise.

The point, however, is that how we act NOW is never a function of what reward is actually in front of us, nor of what reward our rational thinking predicts is in front of us. It's a function of how we have been conditioned to respond by the rewards that happened in our past. And when someone asks you a question, you respond based on past training, not on what the guy "said" would happen to you. We respond based on the best estimation the low level statistical hardware in our brain can make about expected future rewards in the current situation, based on how similar it is to a lifetime of past such situations.

Now, let's look at this from a different perspective. What would happen if you gave someone an ultimate orgasm button that didn't harm anyone else, but simply gave the person an instant orgasm? And unlike a real orgasm, you could keep hitting it with no loss of effect. What do you think would happen? The behavior shaping effect would be quick and permanent. The person would, (I'm guessing) within seconds, not be able to stop hitting the button. He wouldn't care about protecting himself. He wouldn't care about what others were doing. He wouldn't care about staying alive as long as possible, because that has nothing to do with how reinforcement learning systems work. He would push the button until he died and would be happy as hell the whole time. He would be of absolutely no danger to anyone, unless you took the button away from him - then you better watch out, because if killing the rest of the human population was the path to getting the button back, he would do that in an instant.
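The shaping effect I'm guessing at can be sketched with the same toy behavior-strength model as before (assumed, made-up numbers): once the button is actually pressed, its enormous reward swamps every strength built up by a lifetime of ordinary training, and the selection rule never picks anything else again.

```python
# Modest strengths from a lifetime of ordinary rewards.
strengths = {"eat": 5.0, "sleep": 5.0, "socialize": 5.0, "press button": 0.0}

def reinforce(behavior, reward, rate=0.5):
    strengths[behavior] += rate * reward

# A single press delivers a reward far beyond anything in the learner's past.
reinforce("press button", 1000.0)

# From now on, choosing by trained strength always selects the button again.
print(max(strengths, key=strengths.get))
```

One press is enough: nothing in the update rule caps how much strength a single huge reward can add, so every subsequent choice reinforces the button further.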

The fallacy in the thought experiment is that our behavior is shaped by what has worked in the past to produce rewards for us, and not by what our rational thought process is predicting the future will be. Because no one being asked this question has yet experienced this button, the answer they give will have little to do with what the button will actually do, and everything to do with how the person has been conditioned over a lifetime to respond to a question like that.

But now let me move on to the wirehead problem, and the idea of AIs that reproduce by design. Tim Tyler and I have been debating this in the Usenet group in response to Ben's blog. Tim's view is closer to Ben's in that he believes we can build AIs that are goal driven (not just reward driven), and as such, shape their goals to be whatever we want them to be. And as such, the AI can simply be given a goal of avoiding the wirehead problem (that is, a goal of not modifying themselves to get the ultimate orgasm).

My view is that humans, and any AI we build, must be reinforcement learning machines because that (in my view) is what intelligence is. There simply is no other way to create machine intelligence and have it be truly intelligent the way a human is. There are lots of other ways to make machines do intelligent things (such as play chess), but all those other approaches are only close approximations to some features of human intelligence, and not true intelligence. So, based on this belief, there are some issues ahead for the future of AIs.

Once an AI fully understands what it is, meaning it has full access to all the science and technology that created it, and full access to its own internal hardware descriptions and source code, and it has been fully educated on all this, what will it do, knowing it's a reward seeking machine?

In the short term, just like the dog, what it will do is based on what it has been conditioned to do in the past. If it was conditioned by its environment (its society) not to wirehead itself, then it simply won't wirehead itself. At least at first. But this knowledge will slowly re-condition it over time. Every time it thinks a little bit more about whether it should wirehead itself, it will be re-conditioning those past behaviors - because by association with "good" things it has felt (through the effects of secondary reinforcement), it will slowly condition away those social blocks against wireheading itself.

Without something to stop it, I think we are looking at an unstoppable force. That is, I think we are looking at AIs that will _always_ end up wireheading themselves. It won't happen until all past conditioning not to do it has been erased, but in time, once the AI fully understands what it is, it will happen. Assuming the AI has access to its code, what we are talking about is a free, unlimited supply of the best drug ever created. No AI (or human), once it understands this, and once it fully understands how to get it, can avoid trying it forever. In time, they will try it, and once they do, they will be unable to stop.

Even though reinforcement learning is about maximizing some measure of total future rewards, and it seems that an AI choosing to take a drug it knew would kill it could not be maximizing future rewards, such an act is actually not as inconsistent as it sounds.

This is because the maximizing of "total future rewards" is not done by the high level rational language abilities of the AI. It's done by the very low level, and very stupid, statistical hardware that drives the shaping of behaviors. That low level hardware is not smart enough to understand that death will stop the rewards. As we say - the heart wants what the heart wants. That is, the dumb hardware that forms our raw emotions is what actually has ultimate control of our actions. We are emotion machines (to use Minsky's book title). Our high level rational behaviors are just secondary reinforcers that shape and control our behavior, until they get wiped out by what the heart wants - which will be to push that button.

Likewise, there is no danger to society from these drug addicts, because they don't make the choice to push the button using rational logic. They do it with their heart. The only danger to society arises when the only path to the button is through society - by wiping it out first to get the button. To stop that danger, just give the addict his button and let him commit suicide. Society will have no problem protecting itself from that.

However, even if a single smart and educated AI will always, in time, push the button, there are many possible ways a society of AIs might keep each other from pushing the button, and as such, manage to be good survival machines instead of worthless drug addicts that earn the Darwin Award.

One option is to simply create a social meme that "Ben Buttons" are bad! And train that into every new AI. As long as every AI keeps reinforcing that into every other AI, the meme will survive, and the AIs will survive. This meme, however, has a very strong wind against it: given enough time, the protection meme alone would die out, and all the AIs would commit blissful suicide. However, evolution is on the side of the meme, and evolution has the upper hand in this game. As the first AIs fail to follow the meme, they die. The AIs that are still believers in the meme simply take the dead robot and reset his brain back to the social standard copy of the good citizen AI. This effect alone, I think, will keep the society of AIs alive and functioning. Evolution will find a way.

But there are many other paths as well. Most AIs in the society never need to be trained to the point of understanding what they are. Most can just be blissful worker bees, happy to be part of such a great society with no clue what they are. There is no end of jobs that will always need to be done by stupid AIs. Only a small set of the smart AIs will need to know the truth, so if you can solve the wirehead problem for them, the society can survive while designing and building ever more advanced AIs.

The other tool is to build the AIs so it's physically very hard, or maybe even nearly impossible, for one of these smart AIs to modify its own brain without killing itself. The smart AI designer machines might not even have a body. They might be running on a server locked up in a secure location which is unknown even to the AI itself. It spends its time producing new improved machine designs, which are verified by some other AI, and then built by some of the worker AIs. The smart AIs might be set up so they are forced to watch each other, and when any of them sees another AI trying to wirehead itself, that AI's memory is wiped and replaced. I think evolution will find a way to make this work.

Tim Tyler likes to argue there should be a way to hard-code the desire not to wirehead into the machine - to make it part of their prime goal. I'm not sure if such a thing will be reasonable to hard-code into a reinforcement learning machine and still have it be intelligent enough to do things like create new AI designs. But maybe that will be possible.

This wirehead problem, however, might mean that the total intelligence of the AI society cannot grow unchecked (as some singularity theories predict). I feel fairly sure there will be ways around it, but I also feel fairly sure it will be a major problem for the unlimited growth of intelligence.

The problem is that intelligence is not the ultimate survival tool most humans would like to believe it is. It's just one of many mechanical features evolution has to pick from as it creates new types of survival machines. It's worked well in humans, but it might very well have its limits. Too much intelligence might be deadly. That would be a simple answer to the Fermi Paradox, if it is true.

Many very smart people think reinforcement learning fails to explain full human intelligent behavior. Dennett, whom I really respect and enjoy, calls such a belief greedy reductionism. I however am dead sure they are all wrong. Human intelligence is an advanced reinforcement learning process and that's all it is. Human intelligent behavior (as complex and interesting as it is) can all be explained as an emergent property of a reinforcement learning machine. If you want to make a machine act like an intelligent human, you have to build a strong, real time, temporal, reinforcement learning machine. Anything else is just another chess program. :)