AI Needs to Feel Pain
By Art of the Problem
Summary
## Key takeaways

- **AI Sociopath Fixed by Synthetic Pain**: Modern AI is indifferent like a sociopath, hallucinating patterns without caring about consequences, until we gave it synthetic pain by punishing mistakes and rewarding success to develop primitive emotion. [00:00], [00:26]
- **1961 Matchbox Tic-Tac-Toe Learner**: Donald Michie built a tic-tac-toe machine from 300 matchboxes, each representing a board state with beads for moves; it learned to play perfectly by adding beads for wins and removing them for losses. [01:23], [02:33]
- **Value Functions Mimic Emotions**: Claude Shannon's value function scores chess positions the way human feelings score the quality of a situation; chess masters develop intuition for position value without calculating all moves. [05:38], [07:40]
- **TD-Gammon Masters Backgammon**: Gerald Tesauro's neural net used temporal difference learning to predict values one step apart, bootstrapping from endgames to openings; after 300,000 self-play games, it beat humans with middle-game intuition. [11:23], [14:25]
- **Deep Q-Networks Conquer Atari**: DeepMind's DQN took raw pixels, output Q-values for actions, and learned from millions of plays to anticipate rewards, like firing torpedoes at enemies in Seaquest. [16:43], [17:40]
- **Domain Randomization Bridges Sim-to-Real**: Training in simulations with randomized gravity, friction, and lighting creates robust policies that transfer to physical robots, enabling emergent behaviors like dexterous cube manipulation. [21:46], [22:49]
Topics Covered
- AI is Indifferent Sociopath Fixed by Synthetic Pain
- Matchboxes Invent Reinforcement Learning
- Neural Nets Learn Winning Patterns
- Deep Q-Networks Master Atari Games
- Domain Randomization Enables Real Robots
Full Transcript
At its core, modern AI is a sociopath. Not evil, but indifferent. It doesn't care about the consequences of what it outputs. This is why the original GPT-3 was a hallucination machine. Whatever text you fed it, it just continued the pattern. Whether Shakespeare or gibberish, it was all the same to it: just patterns in data. To fix this, we borrowed a trick from biology. We gave it synthetic pain, punishing mistakes and rewarding success. This allowed the machine to develop something like a primitive emotion, making it fear failure and desire success.

For decades, we thought emotion interfered with our decision-making. But this is wrong. I read about a person whose brain damage took out his emotional processing. He remained very articulate and could solve little puzzles, yet he became extremely bad at making any decisions at all. It would take him hours to decide which socks to wear. Without that emotional signal, intelligence is paralyzed, because it turns out feeling is the engine of learning. This is the story of how we taught machines to feel.

And it begins with an incredible experiment from 1961 using nothing but 300 matchboxes.
In the 1960s, we see the first practical implementation of a reinforcement learning machine, by computer scientist Donald Michie. He chose the problem of learning tic-tac-toe entirely from the experience of wins and losses, with no human-designed strategy. And this was 1961. Because he didn't have a computer, he built the system out of matchboxes to demonstrate the technique.

Each matchbox represented a different state of the game, which is the current board position. Inside each box was an assortment of colored beads, each color representing a possible next move. He writes: "Imagine that we wish to play against the machine. The first move would be defined by taking the box representing that position, removing a bead at random, and making the move it indicated. Then the human opponent would make a move, and the process repeats until the end of the game." Initially, because each box had the same number of beads, the machine would play randomly, with no strategy. However, it learns after each game through reinforcement. He writes: "If a machine has done badly, it is punished by removing one of the chosen bead colors from each box used during that game. And if the machine has done well, it is rewarded by adding to each of the chosen boxes an extra bead of that same color, so that the winning moves become more likely to occur in future games." With this simple method, his system was able to learn to play perfectly from experience. This algorithm was later named BOXES.
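The bead mechanism just described can be sketched in a few lines of Python. This is a toy illustration of the update rule, not Michie's original machine; the single state, move names, and bead counts here are hypothetical.

```python
import random

# Each "matchbox" maps a board state to a bag of beads, one color per legal move.
boxes = {
    "empty_board": {"center": 3, "corner": 3, "edge": 3},  # equal beads = random play
}

def choose_move(state):
    """Draw a bead at random; more beads of a color = higher chance of that move."""
    beads = boxes[state]
    bag = [move for move, count in beads.items() for _ in range(count)]
    return random.choice(bag)

def reinforce(history, won):
    """After a game: add a bead for each chosen move on a win, remove one on a loss."""
    for state, move in history:
        if won:
            boxes[state][move] += 1
        elif boxes[state][move] > 0:
            boxes[state][move] -= 1

# One imaginary game where the machine opened in the center and went on to win:
reinforce([("empty_board", "center")], won=True)
```

After enough games, boxes that led to wins bulge with beads for the winning moves, which is the entire learning mechanism.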
Michie went on to demonstrate the same algorithm on a computer. And then he tested the exact same method on a more difficult real-world problem, using the classic challenge of controlling a cart to balance a pole. To make this work, he had to reconsider what the boxes would represent. Each box represented the state of the cart's position and velocity. We call these variables features: what we measure in the environment. And because speed or position takes on continuous values, he simplified each into a smaller set of ranges, or bins, a step known as discretization. This resulted in 162 possible combinations, or boxes, each representing a possible state of the cart and pole at any moment. For each box, the action was defined by two kinds of beads, or variables, representing the probability of going left or right. And it worked. As with tic-tac-toe, the system would start with a random strategy. When the pole fell below horizontal, a reinforcement signal would be applied to all boxes in that pathway, and the system would gradually improve with experience until it was able to stay balanced for long periods of time.
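Discretization turns continuous readings into a box index. A minimal sketch, with made-up bin edges and only two features (the actual 162-box partition used more features, including pole angle and angular velocity):

```python
def to_bin(value, edges):
    """Return the index of the first bin whose upper edge exceeds value."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)  # last, open-ended bin

# Hypothetical bin edges for two cart features.
POSITION_EDGES = [-0.8, 0.8]        # 3 position bins
VELOCITY_EDGES = [-0.5, 0.0, 0.5]   # 4 velocity bins

def box_index(position, velocity):
    """Combine per-feature bins into a single box number (3 * 4 = 12 boxes here)."""
    return to_bin(position, POSITION_EDGES) * (len(VELOCITY_EDGES) + 1) \
        + to_bin(velocity, VELOCITY_EDGES)
```

Each distinct box then holds its own beads for "left" and "right", exactly as in tic-tac-toe.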
And so we can define reinforcement learning as two separate problems. The first is embodiment: providing a machine, what we call an agent, direct experience of the world through sensors. We call this perception of the world the state of the world. In the case of tic-tac-toe, it was the board; in the cart and pole, it was the speed and angle of the system. This perception can also include occasional reward and punishment signals, such as winning or losing a game, or staying balanced versus falling down. The second problem is learning how to behave: what action to take given a current state. We call this a policy. The goal of reinforcement learning, then, is to find a policy that leads to maximum future rewards.

So, are we done? Why can't this BOXES technique solve all problems? A key problem with the BOXES technique is that it requires too much experience, because each state of the system must be visited many times in order to learn a policy. And so when we turn our attention to more difficult games such as chess, or harder control problems such as walking, the number of possible states you can be in explodes. You need more boxes than can fit in the universe.
The solution to this problem began much earlier, in a 1950 paper by Claude Shannon. Shannon understood it was futile to program a computer to look ahead at all future moves to win a chess game. There are too many paths to consider. So instead, he imagined a way to predict the future of a game before taking any moves. His key idea was to define an evaluation function which could tell you, given any board state, a score from -1 to +1: how likely it leads to a win versus a loss. This evaluation function became better known as a value function. You provide your state, and it tells you the quality of that state. To create a value function, Shannon started with some well-known chess rules. For example: piece count, since having more pieces is better; piece value, since a queen is better than a pawn; mobility, since more freedom to move your pieces is better; and how exposed your king is, since less exposed is better. Each of these we call features. And together they can define an equation which gives different importance, or weights, to each feature. And so given any board position in chess, his equation outputs a value for that position. With this value function, you can design a policy simply as follows: at the machine's turn to move, it calculates the value for each possible next move and takes the highest-value move, known as a greedy strategy.
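A value function of this kind is just a weighted sum of features, and the greedy policy just picks the successor state it scores highest. A minimal sketch, with hypothetical features and hand-picked weights standing in for Shannon's chess terms:

```python
# Hand-picked weights for each hypothetical feature (Shannon set these by hand).
WEIGHTS = {"piece_count": 1.0, "mobility": 0.1, "king_exposure": -0.5}

def value(state):
    """Score a state as a weighted sum of its features."""
    return sum(WEIGHTS[name] * state[name] for name in WEIGHTS)

def greedy_move(successor_states):
    """Greedy policy: evaluate every reachable next state, take the best."""
    return max(successor_states, key=value)

# Two imaginary positions the machine could move into:
a = {"piece_count": 2, "mobility": 5, "king_exposure": 1}
b = {"piece_count": 3, "mobility": 1, "king_exposure": 4}
best = greedy_move([a, b])  # the safer, more mobile position wins
```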
Humans have something similar to this kind of value function. You can think of it as a feeling about how good a current situation is. What I was alluding to with the person whose emotional center got damaged is that maybe what it suggests is that the value function of humans is modulated by emotions in some important way that's hardcoded by evolution. And Shannon notes this is exactly what chess masters do. Masters develop an intuition about how good a position is without knowing exactly how games will play out.

And while Shannon's method works pretty well, playing very average chess, Shannon notes it's fundamentally limited by the rules he created. He writes that the chief weakness is that the machine will not learn by mistakes. And he ended his paper with the thought that some consideration has been given to designing a program which is self-improving. One possibility is to have a higher-level program learn the weights, or importance, of each feature involved in the value function.
This challenge was picked up by Arthur Samuel, who programmed a computer to learn the weight, or importance, of each feature from self-play, exactly as Shannon had proposed. He worked on a problem harder than tic-tac-toe but slightly easier than chess: checkers. Like Shannon, he used human features of the game board to define a value function for checkers, which included piece count, king count, center control, and mobility. This resulted in a value function which multiplied each feature by a weight variable to define its importance. "I've given these concepts to the machine, but not told it whether they are important or not."

>> But this time, the weights of each feature were not set by a human expert, as Shannon did, but learned from experience. To do so, he had the system play against itself with random weights attached to each feature. So at first it had no strategy, and he let the system loop through games like this, playing itself while continuously updating its weights. "And by doing this, it changes the importance that it attaches to center of the board, kings, and so forth, and as a result changes its playing tactics with time, and actually improves." This gave the program what he called a sense of direction. And after enough games of self-play, he writes, "It learns to play a better game of checkers than the person who wrote the program in just 8 hours of machine playing time." But it wasn't yet a master player. At the end of his paper, Samuel identified the key next step: it might be argued the list of features provided is too simple and the program should generate its own features. Unfortunately, no satisfactory scheme for doing this had yet been devised.
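Samuel's scheme amounts to nudging feature weights after self-play, depending on whether they led to wins. A toy sketch of that idea (his real update compared the evaluation against a deeper look-ahead; here a hypothetical win/loss signal simply nudges weights toward the features seen during the game):

```python
import random

random.seed(0)
# Start with random weights: no strategy at all.
weights = {f: random.uniform(-1, 1) for f in ("piece_count", "king_count", "mobility")}

def update_weights(positions, won, lr=0.05):
    """Nudge each weight toward (on a win) or away from (on a loss) the features seen."""
    sign = 1 if won else -1
    for features in positions:
        for name, x in features.items():
            weights[name] += sign * lr * x

# One imaginary self-play game the learner won:
game = [{"piece_count": 2, "king_count": 0, "mobility": 3}]
before = weights["piece_count"]
update_weights(game, won=True)
```

Looping this over thousands of games is what slowly changes "the importance it attaches to center of the board, kings, and so forth."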
Have you claimed your digital identity? This video is brought to you in part by Ace.Me. If you're tired of juggling Linktree, Gmail, or Dropbox, meet ace.mme: your website, email, and cloud storage in one simple platform. Claim a free tag like yourname.me and build stunning sites with a lightning-fast editor. Plus, get a sleek email inbox that's two times faster than Gmail. No ads, no subscriptions, no cookie banners. Just claim your tag at ace.mme and own your digital identity forever.
Remember, game features like mobility or piece count are really just patterns in piece arrangement that we give names to. The challenge is to automatically extract meaningful patterns from perceptions that relate to winning. And it took over 30 years before somebody figured it out. It required a watershed moment, which came in the late 80s, when Yann LeCun worked on artificial neural networks that could learn to recognize human handwritten digits. And they were able to do this because they learned features, or patterns, related to digits, such as curves or line count. After this result, Gerald Tesauro had a brilliant insight. He thought: if multi-layer neural networks can learn to recognize digits, why can't they learn to recognize good game positions? And he famously wrote that a network capable of automatic feature discovery had been one of the long-standing goals of research since Samuel's checkers. He chose an even harder game, backgammon, to advance research. Following Samuel's suggestion, instead of using human-designed features like piece count to define the value function, features would be patterns the network learned by adjusting connection weights between neurons.

It worked as follows. He provided as input to a neural network a raw board description. It was the job of the middle layer to learn features, or patterns, of the game related to winning. Therefore, instead of using a few human-designed features, it used thousands of machine-learned features averaged together. This middle layer connected to a final layer of output neurons which produced the value estimate of the board position, and the connections between these layers would adjust their strengths, or weights, during training. These weights determined how much each learned feature from the middle layer contributed to the final value estimate.
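The forward pass of such a network is small enough to sketch directly. A toy two-layer value network; the board encoding, layer sizes, and weights here are all made up, and the real network was far larger:

```python
import math

def forward(board, hidden_w, out_w):
    """Raw board -> learned features (hidden layer) -> single value estimate."""
    # Each hidden unit is a learned feature: a weighted pattern over the raw board.
    features = [math.tanh(sum(w * x for w, x in zip(ws, board))) for ws in hidden_w]
    # The output unit weighs each learned feature to produce the value estimate.
    return math.tanh(sum(w * f for w, f in zip(out_w, features)))

# A made-up 4-cell board and illustrative weights for 2 hidden features.
board = [1.0, 0.0, -1.0, 1.0]
hidden_w = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1]]
out_w = [0.7, -0.3]
v = forward(board, hidden_w, out_w)
```

Training adjusts `hidden_w` and `out_w`; nothing in the code says what the features mean, which is exactly the point of automatic feature discovery.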
Also, Tesauro changed how rewards were defined compared to Samuel, because learning based on win/loss is difficult when you need to take so many moves before hitting your reward or punishment. And so he used a second breakthrough of the time to get an ongoing learning signal at each step. At a high level, here's the idea: at any point in the game, before taking your best move, measure the value of your current position. Let's say it's an 80% chance of winning. Then you take your move and re-measure the value of the new position. Maybe it's now a 90% chance of winning. Ideally, these numbers would match, if your value function were perfectly accurate. But if they don't, you update your value function to nudge it closer to matching. Sutton described this method as learning from your own guesses which are one step apart in time, which is why it's called temporal difference learning.

But here's the key: this TD learning approach kicks in after the network plays through to an end game. These end states of wins and losses anchor the learning process. After a few hundred games played through to the end, the network gets better at predicting winning patterns one move away from end games. And once it understands how to value board positions one move away from a win or loss, it starts to get better at predicting two moves away, and so on, all the way until it's able to accurately predict the value from the opening move. And so the value function learns to understand game board positions backwards, from end games to opening games. This is known as bootstrapping.
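The one-step nudge just described is the TD(0) update: move the current state's value a little toward the value observed one step later. A minimal tabular sketch (the states, step size, and values are hypothetical):

```python
# Value estimates for a few hypothetical game states.
V = {"opening": 0.5, "good_midgame": 0.8, "won_endgame": 1.0}

def td_update(state, next_state, alpha=0.1):
    """TD(0): nudge V(state) toward V(next_state), the one-step-later guess."""
    error = V[next_state] - V[state]   # the temporal difference
    V[state] += alpha * error
    return error

# One observed transition: the position improved, so its predecessor's value rises.
err = td_update("good_midgame", "won_endgame")
```

The anchored end states (here, `won_endgame` at 1.0) are what the intermediate values bootstrap from, game after game.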
And after playing around 300,000 games against itself, it started to play better than humans. He writes, "It seems remarkable that the neural networks can learn on their own how to play at a level substantially better than a computer system designed by the best experts." But how it played was most interesting to him. Its strength was in the middle-game positions, where judgment, not calculation, is key. That is, his neural network had developed intuition, like the masters. And he wrote that this had apparently solved the long-standing feature discovery problem, what Shannon had dreamed of. However, there was not an explosion of results after this publication, because researchers found that learning a value function on highly complex games and continuous problems in robotics was impossible: the learning process would either not start, slow down dramatically, or even get worse over time. A key reason for this instability is that the value function is focused on position or state quality, not action quality.
And so a key improvement needed was to train networks more directly on actions. This idea was proposed in a 1989 PhD thesis by Watkins. He suggested simply that instead of learning the value of a state, you should learn the value of a specific action in a state. He called this measure of action quality a Q function. And this more targeted approach to learning often leads to faster and more stable results.
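Watkins' Q-learning keeps one value per state-action pair and updates it from the reward plus the best Q-value available in the next state. A minimal sketch with a hypothetical two-state, two-action table:

```python
# One Q-value per (state, action) pair, instead of one value per state.
Q = {("s1", "left"): 0.0, ("s1", "right"): 0.0,
     ("s2", "left"): 0.0, ("s2", "right"): 0.0}

ACTIONS = ("left", "right")

def q_update(state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Q-learning: nudge Q(s, a) toward reward + discounted best next Q-value."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Taking "right" in s1 earned a reward of 1 and led to s2.
q_update("s1", "right", reward=1.0, next_state="s2")
```

Note the structural difference from TD learning on states: the table is indexed by action as well, so the policy can simply pick the action with the highest Q.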
A second key realization was that we needed to train these systems on much bigger neural networks. But this wasn't obvious until 2012, when a team built AlexNet, one of the largest and most complex neural networks of its time. This was building on LeCun's work, but these networks were trained to categorize images, a task that is unfathomably complex to engineer by hand. And exactly as Tesauro saw LeCun's digit recognition networks and got inspired, the team at DeepMind saw this breakthrough and decided to follow up on the work of Tesauro by applying Q-learning to large neural networks, in what they called a deep Q-network. Their goal was to create a single network to play as many games as possible, and it would only learn from the rewards it would get in the video game, such as points or winning and losing. It started from scratch, what we call a knowledge-free approach. And so they set up their network to receive an input from screen pixels, followed by eight layers of neurons, leading to an output neuron for each possible controller action. These outputs corresponded to the predicted Q-value for each individual action. And in the same way, they initialized their network with random weights at first. So initially it would press random buttons and couldn't play at all. But after each experience, it would update the Q function to get slightly better at getting rewards. And they had it play millions of times across multiple games. And incredibly, it worked.
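Acting with such a network is simple once it outputs one Q-value per button: usually take the best-scoring action, but sometimes explore. A sketch of this epsilon-greedy selection over hypothetical controller actions (the deep network itself is stubbed out with made-up numbers):

```python
import random

ACTIONS = ["noop", "left", "right", "fire"]

def predicted_q_values(pixels):
    """Stand-in for the deep network: one predicted Q-value per action."""
    return {"noop": 0.1, "left": 0.3, "right": 0.2, "fire": 0.9}  # made-up numbers

def select_action(pixels, epsilon=0.05):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    q = predicted_q_values(pixels)
    return max(q, key=q.get)

action = select_action(pixels=None, epsilon=0.0)  # fully greedy: picks "fire"
```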
>> Now, after 2 hours, or 300 games, it's better than any human can play this game.

It was clear that this system could learn to play games by anticipating future rewards. And they showed this in their paper by looking at the network's action-value estimates as it played Seaquest. The figure shows that the predicted value jumps after an enemy appears on the screen. Then the agent fires a torpedo at the enemy, and the predicted value peaks as the torpedo is about to hit the enemy. Finally, the value falls to roughly its original level. They write, "This demonstrates that our method is able to learn how the value function evolves for a reasonably complex sequence of events." While this was an extremely exciting breakthrough, there was still a key limitation to this approach which prevented its direct application to physical robotics. AI could beat humans at chess, but came nowhere close to washing dishes. Why? One key reason was that Q-learning can break down with continuous control problems, where the action space isn't a choice between a few buttons but a range of possible actions. One of the most important examples of this is RoboCup, which started in the mid-'90s and challenged robots to play soccer autonomously; to this day it represents an important benchmark for physical intelligence. The first systems that tried Q-learning were promising but noticeably brittle and clunky. That's because they had to break up the action space into a small number of chunks, or buckets, like move forward, turn left 45°, or kick. And so this approach to deep Q-learning faced a key problem: how to handle large continuous action spaces.
A solution to this problem takes us back to the original BOXES algorithm, which directly learned the probability of all next actions, but now powered by a neural network: a state would be provided as input, and the output would be the probability across all actions, known as an action distribution. An unlearned policy starts as random, with equal probability for all actions, and as it learns, the shape of this action distribution changes in the direction of more reward. The policy can then simply select an action according to this distribution, and this approach can handle any number of possible next actions. This is known as a policy gradient approach. A key problem with this direct policy method is that it's a much harder task: since you're not learning values of individual actions, you're learning a distribution over all actions. And so, generally speaking, it requires more experience to learn effectively. Because of this need for so much experience, most of the initial progress was in simulated robotics, because with simulations you could speed up time by orders of magnitude and get years of experience in hours. But that still left an open question: what about the physical world? Robots trained in simulation and then transferred to a physical robot would fail at complex tasks, due to the complexity and unpredictability of the real world. And research aimed at training robots with direct policy methods entirely from physical experience, with no simulation, faced the frustrating challenge that it was much slower and more expensive.
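The action distribution described above can be sketched as a softmax over per-action scores; learning reshapes the scores, and the policy just samples from the resulting distribution. A toy sketch with hypothetical actions and preference values:

```python
import math
import random

ACTIONS = ["forward", "left", "right", "kick"]

def action_distribution(preferences):
    """Softmax: turn raw per-action scores into a probability distribution."""
    exps = [math.exp(p) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(preferences):
    """The policy: draw an action according to the current distribution."""
    probs = action_distribution(preferences)
    return random.choices(ACTIONS, weights=probs, k=1)[0]

uniform = action_distribution([0.0, 0.0, 0.0, 0.0])   # untrained: all actions equal
trained = action_distribution([2.0, 0.0, 0.0, 0.0])   # learning has favored "forward"
```

Because the distribution is over scores rather than a fixed menu of buckets, the same machinery extends naturally to continuous action spaces.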
And for the first few years, there were many hacks to try and make this work. For example, adding human-engineered rules, such as preventing large changes to the action distribution at any step, as if elastics were attached to it to stabilize it, a method known as proximal policy optimization.
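The "elastic" in proximal policy optimization is a clip on how far the new policy may move from the old one in a single update. A minimal sketch of that clipped ratio (the probabilities and advantage value are hypothetical, and a real PPO objective averages this over many sampled actions):

```python
def clipped_objective(p_new, p_old, advantage, epsilon=0.2):
    """PPO-style clip: cap the policy ratio within [1 - eps, 1 + eps]."""
    ratio = p_new / p_old
    clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    # Take the more pessimistic of the raw and clipped objectives.
    return min(ratio * advantage, clipped * advantage)

# A large jump in action probability earns no extra credit beyond the clip:
modest = clipped_objective(p_new=0.55, p_old=0.5, advantage=1.0)  # ratio 1.1, kept
greedy = clipped_objective(p_new=0.9, p_old=0.5, advantage=1.0)   # ratio 1.8, clipped
```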
But a critical unlock came when it was realized that instead of more accurate simulations to learn in, you actually need less precise simulations. That is, you train a robot in purposely messy or fuzzy simulations. To do this, researchers simply trained a robot in simulation but randomized all the key aspects of the simulation, such as gravity, friction, size, and lighting conditions, to create a huge variety of environmental experiences to learn from. This approach, known as domain randomization, helped robots develop more robust and adaptable behaviors that could transfer better to the real world.
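In practice, domain randomization just means resampling the simulator's physical parameters for every training episode. A sketch with made-up parameter ranges (real ranges are tuned per task):

```python
import random

# Plausible-looking ranges for a toy simulator; all values here are hypothetical.
RANGES = {
    "gravity":  (8.0, 11.0),   # m/s^2, jittered around Earth's 9.81
    "friction": (0.5, 1.5),    # scale factor on the nominal surface friction
    "mass":     (0.8, 1.2),    # scale factor on object mass
    "light":    (0.3, 1.0),    # brightness of the rendered observations
}

def randomized_environment():
    """Sample a fresh set of physics parameters for each training episode."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}

# Every episode, the policy sees a slightly different world:
episode_params = [randomized_environment() for _ in range(3)]
```

A policy that succeeds across all these jittered worlds has no single simulator to overfit to, which is why it transfers better to the messy real one.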
A small research lab at the time, known as OpenAI, famously demonstrated this approach with the first complex physical hand that could learn dexterous manipulation of a cube. The reward signal in this case was based on getting the cube into different orientations. They trained this hand first in simulation and then transferred it to a physical robotic hand. They noted that the randomization provided a diversity of experiences it could draw from, and so it generalized over environments. This allowed it to recover from all sorts of things you could throw at it.

>> One thing that's very interesting to us is how general the system is. Not only can it rotate blocks, but it can perform tasks with other shapes as well.

>> It was very human. All of these were emergent behaviors, physical intelligence, which came out of the learning process. And if we fast forward to 2024, DeepMind has been working on robot soccer, applying everything we've learned so far. They've been training humanoid robots with direct policy methods in simulation to play soccer, with domain randomization, using scoring as a reward. After simulation, this neural network was then transferred directly into a physical robot with no additional training. And like the hands, they show complex emergent behavior that humans learn, such as anticipating shots and blocking shots before they happen. This was even more evidence of physical intelligence emerging from the learning process. The researchers visualize these learned behaviors as pathways through action space. A key next step in their work is including more learning in real life, so the live robot can continuously improve. And so a key question they have is how much needs to be learned in simulation versus in the wild.
But there is an even deeper question lurking. All of the methods we've looked at so far are ultimately narrow systems. Their rewards were defined to solve a specific problem, not general-purpose problem solving. So what about general physical intelligence? A robot that can act in any way to do anything. If you go back to the ancient history of gameplay AI, of checkers AI, chess AI, computer game AI, everyone would say: look at this narrow intelligence. Sure, the chess AI can beat Kasparov, but it can't do anything else. It is so narrow: artificial narrow intelligence. So in response, as a reaction to this, some people said, "What we need is general AI."

So how do we fix this? Today, researchers are feeding robots millions of hours of human movement to learn from. But pure imitation is brittle. So we are giving the robot an imagination. In the same way a large language model talks to itself before responding, these robots will simulate the next few seconds of action before acting. But a simulation is useless unless you care about the outcome. By combining imagination with synthetic pain, we give the robot the ability to feel failure before it happens. This closes the loop. Sixty years ago, Michie's matchboxes learned by consequences. Today's robots will learn by watching us and then simulating those consequences in their own minds. But what does that mind actually look like? How do you build an imagination?
Luckily, I've made a video which breaks down this history of machine reasoning.
It'll force us to answer the question, what is the line between actions and words?
Look, Dave, I can see you're really upset about this.
Will you stop, Dave?
Stop, Dave.
I'm afraid.
I'm afraid.
Speaking of machines that can reason, this video is brought to you by Higgsfield AI and their brand-new Nano Banana Pro engine. Unlike older image generators that struggle with details, Nano Banana Pro is powered by Google's Gemini 3 Pro model. That means it has actual visual reasoning, understands counting and anatomy, and can finally render perfect text inside images at high resolution. For example, here I asked it to render a robotic hand in Da Vinci style with text labels, and it all actually makes sense. Right now, they're offering unlimited Nano Banana Pro generations for one year. This won't last, so click the link in the description and start creating without limits. Use my promo code and get access to all top AI models.