
AI Needs to Feel Pain

By Art of the Problem

Summary

Key takeaways

  • AI sociopath fixed by synthetic pain: Modern AI is indifferent like a sociopath, hallucinating patterns without caring about consequences, until we gave it synthetic pain by punishing mistakes and rewarding success to develop primitive emotion. [00:00], [00:26]
  • 1961 matchbox tic-tac-toe learner: Donald Michie built a tic-tac-toe machine from 300 matchboxes, each representing a board state with beads for moves; it learned to play perfectly by adding beads for wins and removing them for losses. [01:23], [02:33]
  • Value functions mimic emotions: Claude Shannon's value function scores chess positions the way human feelings score the quality of a situation; chess masters develop intuition for position value without calculating all moves. [05:38], [07:40]
  • TD-Gammon masters backgammon: Gerald Tesauro's neural net used temporal difference learning to predict values one step apart, bootstrapping from endgames to openings; after 300,000 self-play games, it beat humans with middle-game intuition. [11:23], [14:25]
  • Deep Q-networks conquer Atari: DeepMind's DQN took raw pixels, output Q-values for actions, and learned from millions of plays to anticipate rewards, like firing torpedoes at enemies in Seaquest. [16:43], [17:40]
  • Domain randomization bridges sim-to-real: Training in simulations with randomized gravity, friction, and lighting creates robust policies that transfer to physical robots, enabling emergent behaviors like dexterous cube manipulation. [21:46], [22:49]

Topics Covered

  • AI Is an Indifferent Sociopath Fixed by Synthetic Pain
  • Matchboxes Invent Reinforcement Learning
  • Neural Nets Learn Winning Patterns
  • Deep Q-Networks Master Atari Games
  • Domain Randomization Enables Real Robots

Full Transcript

At its core, modern AI is a sociopath. Not evil, but indifferent. It doesn't care about the consequences of what it outputs. This is why the original GPT-3 was a hallucination machine. Whatever text you fed it, it just continued the pattern. Whether Shakespeare or gibberish, it was all the same thing, just patterns and data. To fix this, we borrowed a trick from biology. We gave it synthetic pain, punishing mistakes and rewarding success. This allowed the machine to develop something like a primitive emotion, making it fear failure and desire success.

For decades, we thought emotion interfered with our decision-making. But this is wrong. I read about this person who had some kind of brain damage that took out his emotional processing. You know, he still remained very articulate and he could solve little puzzles, and he became somehow extremely bad at making any decisions at all. It would take him hours to decide on which socks to wear. Without that emotional signal, intelligence is paralyzed, because it turns out feeling is the engine of learning. This is the story of how we taught machines to feel. And it begins with an incredible experiment from 1961 using nothing but 300 matchboxes.

In the 1960s, we see the first practical implementation of a reinforcement learning machine by computer scientist Donald Michie. He chose the problem of learning tic-tac-toe entirely from the experience of wins and losses, with no human-designed strategy. And this was 1961. And because he didn't have a computer, he built the system out of matchboxes to demonstrate the technique.

Each matchbox represented a different state of the game, which is the current board position. Inside each box was an assortment of colored beads, each color representing a possible next move. He writes, "Imagine that we wish to play against the machine. The first move would be defined by removing the box representing that position, removing a random bead, and moving to that position. Then the human opponent would make a move, and the process repeats until the end of the game." Initially, because each box had the same number of beads, the machine would play randomly, with no strategy. However, it learns after each game through reinforcement. He writes, "If a machine has done badly, it is punished by removing one of the chosen bead colors from each box used during that game. And if the machine has done well, it is rewarded by adding to each of the chosen boxes an extra bead of that same color, so that the winning moves become more likely to occur in future games." And with this simple method, his system was able to learn to play perfectly from experience.

This algorithm was later named BOXES.
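As a concrete sketch, the matchbox scheme fits in a few lines of Python. This is an illustrative reconstruction of the idea, not Michie's original MENACE/BOXES implementation; all names and the starting bead count are assumptions of mine.

```python
import random

# Each board state maps to a "matchbox" of colored beads, one color per legal move.
boxes = {}  # state (any hashable board description) -> {move: bead_count}

def choose_move(state, legal_moves, init_beads=3):
    """Draw a bead at random; more beads for a move means higher probability."""
    box = boxes.setdefault(state, {m: init_beads for m in legal_moves})
    beads = [m for m, n in box.items() for _ in range(n)]
    return random.choice(beads)

def reinforce(game_history, won):
    """After the game, reward or punish every (state, move) the machine used."""
    for state, move in game_history:
        box = boxes[state]
        if won:
            box[move] += 1            # add a bead of the chosen color
        elif box[move] > 1:
            box[move] -= 1            # remove one, but never empty a color entirely
```

Playing many games and calling `reinforce` after each one gradually concentrates beads on winning moves, which is the whole learning rule.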

Michie went on to demonstrate the same algorithm on a computer. And then he tested the exact same method on a more difficult real-world problem, using the classic challenge of controlling a cart to balance a pole. To make this work, he had to reconsider what the boxes would represent. Each box represented the state of the cart's position and velocity. We call these variables features, what we measure in the environment. And because the speed or position takes on continuous values, he simplified them into a smaller set of ranges, or bins, known as discretization. This resulted in 162 possible combinations, or boxes, each representing a possible state of the cart and pole at any moment. And for each box, the action was defined by two kinds of beads, or variables, representing the probability of going left or right. And it worked. As with tic-tac-toe, this system would start with a random strategy. When the pole falls below horizontal, a reinforcement signal would be applied to all boxes in that pathway, and it would gradually improve with experience until the system was able to stay balanced for long periods of time.
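The discretization step can be sketched as follows. The bin edges below are illustrative placeholders (Michie's actual thresholds differ), but the bin counts multiply out to the same 162 boxes described above.

```python
import bisect

# Hypothetical bin edges; each variable falls into (len(edges) + 1) ranges.
BIN_EDGES = {
    "position":         [-0.8, 0.8],         # 3 bins
    "velocity":         [-0.5, 0.5],         # 3 bins
    "angle":            [-6, -1, 0, 1, 6],   # 6 bins
    "angular_velocity": [-50, 50],           # 3 bins
}   # 3 * 3 * 6 * 3 = 162 boxes

def box_index(state):
    """Map a continuous state (dict of feature -> value) to a single box number."""
    index, size = 0, 1
    for feature, edges in BIN_EDGES.items():
        index += bisect.bisect(edges, state[feature]) * size
        size *= len(edges) + 1
    return index  # an integer in [0, 161]
```

Every continuous situation the cart can be in now lands in exactly one of 162 matchboxes, so the tic-tac-toe learning rule applies unchanged.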

And so we can define reinforcement learning as two separate problems. The first is embodiment: providing a machine, what we call an agent, direct experience of the world through sensors. We call this perception of the world the state of the world. In the case of tic-tac-toe, it was the board. In the cart and pole, it was the speed and angle of the system. This perception can also include occasional reward and punishment signals, such as winning or losing a game, staying balanced or falling down. The second problem is learning how to behave, what action to take given a current state. We call this a policy. The goal of reinforcement learning, then, is to find a policy that leads to maximum future rewards. So, are we done?

Why can't this box's technique solve all problems? A key problem with this box's

problems? A key problem with this box's technique is that it requires too much experience because each state of the system must be visited many times in

order to learn a policy. And so when we turn our attention to more difficult games such as chess or harder control problems such as walking, the number of

possible states you can be in explodes.

You need more boxes than can fit in the universe. The solution to this problem

universe. The solution to this problem began much earlier in a 1950 paper by Claude Shannon. Shannon understood it

Claude Shannon. Shannon understood it was futile to program a computer to look ahead at all future moves to win a chess game. There's too many paths to

game. There's too many paths to consider. So instead, he imagined a way

consider. So instead, he imagined a way to predict the future of a game before taking any moves. And his key idea was to define an evaluation function which

could tell you given any board state a score from minus1 to plus one how likely it leads to a win versus a loss. This

evaluation function became better known as a value function. You provide your state, it tells you the quality of that state. To create a value function,

state. To create a value function, Shannon started with some well-known chess rules. For example, peace count.

chess rules. For example, peace count.

Having more pieces is better. Peace

value. A queen is better than a pawn.

Mobility. More mobility to move your piece the better. How exposed your king is. Less exposed is better. Each of

is. Less exposed is better. Each of

these we call features. And together

they can define an equation which gives different importance or weights to each feature. And so given any board position

feature. And so given any board position in chess, his equation outputs a value for that position. And so with this value function, you can design a policy

simply as follows. At the machine's turn to move, it calculates the value for each possible next move and takes the highest value move, known as a greedy

strategy. Humans have something similar

strategy. Humans have something similar to this kind of value function. You can

think of it as a feeling about how good a current situation is. What was I alluding to with the person whose

emotional center got um damaged is more that maybe what it suggests is that the value function of humans is modulated by

emotions in some important way that's hardcoded by evolution. And Shannon

notes this is exactly what chess masters do. Masters develop an intuition about

do. Masters develop an intuition about how good a position is without knowing exactly how they will play out.

And while Shannon's method works pretty well playing very average chess, Shannon notes it's fundamentally limited by the rules he created. He writes the chief

weakness is that the machine will not learn by mistakes. and he ended his paper with the quote some thought has been given to designing a program which is self-improving.

One possibility is to have a higher level program learn the weights or importance of each feature involved in the value function.
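Shannon's recipe can be sketched directly: a weighted sum of features, and a greedy policy that scores the position each candidate move leads to. The feature names and weights below are illustrative stand-ins, not Shannon's actual coefficients.

```python
# Hand-picked weights in the spirit of Shannon's 1950 evaluation function.
WEIGHTS = {"piece_count": 1.0, "piece_value": 3.0, "mobility": 0.5, "king_exposure": -2.0}

def value(features):
    """Score a position: weighted sum of its features; higher means better for us."""
    return sum(WEIGHTS[name] * x for name, x in features.items())

def greedy_move(candidate_moves):
    """candidate_moves: {move: resulting position's features}. Pick the best-scoring move."""
    return max(candidate_moves, key=lambda move: value(candidate_moves[move]))
```

Note that the policy never searches deep into the game tree; it just trusts the value function's one-step-ahead judgment, which is exactly the "greedy strategy" described above.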

This challenge was picked up by Arthur Samuel, who programmed a computer to learn the weight, or importance, of each feature from self-play, exactly as Shannon had proposed. He worked on a problem harder than tic-tac-toe but slightly easier than chess: checkers. Like Shannon, he used human features of the game board to define a value function for checkers, which included piece count, king count, center control, and mobility. This resulted in a value function which multiplied each feature by a weight variable to define its importance. "I've given these concepts to the machine, but not told it whether they are important or not."

>> But this time, the weights of each feature were not set by a human expert, as Shannon did, but learned from experience. To do so, he had the system play against itself with random weights attached to each feature. So at first it had no strategy, and he let the system loop through games like this, playing itself while continuously updating its weights. "And by doing this it changes the importance that it attaches to center of the board, kings, and so forth, and as a result changes its playing tactics with time and actually improves." This gave the program what he called a sense of direction. And after enough games of self-play, he writes, "It learns to play a better game of checkers than the person who wrote the program in just 8 hours of machine playing time." But it wasn't yet a master player. At the end of his paper, Samuel identified the key next step. It might be argued the list of features I provided is too simple and the program should generate its own features. Unfortunately, no satisfactory scheme for doing this has yet been devised.
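A loose sketch of the self-play weight-learning loop follows. This is an illustrative update rule of my own, not Samuel's exact procedure (which also compared look-ahead estimates); it only shows the shape of the idea: features present in winning games gain weight.

```python
import random

FEATURES = ["piece_count", "king_count", "center_control", "mobility"]
weights = {f: random.uniform(-1, 1) for f in FEATURES}  # start with no real strategy

def update_weights(game_positions, won, lr=0.01):
    """game_positions: list of feature dicts seen during one self-play game."""
    outcome = 1.0 if won else -1.0
    for features in game_positions:
        for name, x in features.items():
            # features strongly present in wins gain importance, and vice versa
            weights[name] += lr * outcome * x
```

Looping this over thousands of self-play games is what gradually changes "the importance that it attaches to center of the board, kings, and so forth."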

Have you claimed your digital identity? This video is brought to you in part by Ace.Me. If you're tired of juggling Linktree, Gmail, or Dropbox, meet ace.me: your website, email, and cloud storage in one simple platform. Claim a free tag like yourname.me and build stunning sites with a lightning-fast editor. Plus, get a sleek email inbox that's two times faster than Gmail. No ads, no subscriptions, no cookie banners. Just claim your tag at ace.me and own your digital identity forever.

Remember, game features like mobility or piece count are really just patterns in piece arrangement that we give names to. The challenge is to automatically extract meaningful patterns from perceptions that relate to winning. And it took over 30 years before somebody figured it out. It required a watershed moment, which came in the late 80s when Yann LeCun worked on artificial neural networks that could learn to recognize human handwritten digits. And it was able to do this because it learned features, or patterns, related to digits, such as curves or line count.

After this result, Gerald Tesauro had a brilliant insight. He thought: if multi-layer neural networks can learn to recognize digits, why can't they learn to recognize good game positions? And he famously wrote, "a network which is capable of automatic feature discovery is one of the long-standing goals of research since Samuel's checkers." He chose an even harder game, backgammon, to advance research. Following Samuel's suggestion, instead of using human-designed features like piece count to define the value function, features would be patterns the network learned by adjusting connection weights between neurons.

It worked as follows. He provided as input to a neural network a raw board description, which fed into a middle layer of neurons. It was the job of this middle layer to learn features, or patterns, of the game related to winning. Therefore, instead of using a few human-designed features, it used thousands of machine-learned features averaged together. This middle layer connected to a final layer of output neurons, which produced the value estimate of the board position, and the connections between these layers would adjust their strengths, or weights, during training.

These weights determined how much each learned feature from the middle layer contributed to the final value estimate.
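In miniature, the architecture just described is a two-layer network: raw board in, learned features in the middle, a single value out. The sizes and random weights below are toy stand-ins, not TD-Gammon's actual dimensions.

```python
import math
import random

random.seed(0)
N_IN, N_HID = 6, 4   # tiny stand-ins for the real network's input and hidden sizes
W1 = [[random.gauss(0, 0.5) for _ in range(N_IN)] for _ in range(N_HID)]  # board -> features
W2 = [random.gauss(0, 0.5) for _ in range(N_HID)]                          # features -> value

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def board_value(board):
    """board: list of N_IN numbers encoding the raw position."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, board))) for row in W1]  # learned features
    return sigmoid(sum(w * h for w, h in zip(W2, hidden)))                    # value in (0, 1)
```

Training adjusts both `W1` and `W2`, so the "features" in the middle layer are discovered rather than hand-designed, which is exactly the step Samuel said no satisfactory scheme existed for.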

Also, Tesauro changed how rewards were defined compared to Samuel, because learning based on win/loss is difficult since you need to take so many moves before hitting your reward or punishment. And so he used a second breakthrough of the time to get an ongoing learning signal at each step. At a high level, here's the idea. At any point in the game, before taking your best move, measure the value of your current position. Let's say it's an 80% chance of winning. Then you take your move and re-measure the value of the new position. Maybe it's now a 90% chance of winning. Ideally, these numbers would match if your value function was perfectly accurate. But if they don't, you update your value function to nudge it closer to matching. Sutton described this method as learning from your own guesses which are one step apart in time, which is why it's called temporal difference learning.

But here's the key. This TD learning approach kicks in after the network plays through to an end game. These end states of wins and losses anchor the learning process. After a few hundred games played through to the end, the network gets better at predicting winning patterns one move away from end games. And so once it understands how to value board positions one move away from a win or loss, it starts to get better at predicting two moves away, and so on, all the way until it's able to accurately predict the value from the opening move. And so the value function learns how to understand game board positions backwards, from end games to opening games. This is known as bootstrapping.

And after playing around 300,000 games against itself, it started to play better than humans. He writes, "It seems remarkable that the neural networks can learn on their own how to play at a level substantially better than a computer system designed by the best experts." But how it played was most interesting to him. Its strength was in the middle-game positions, where judgment, not calculation, is key. That is, his neural network had developed intuition like the masters. And he wrote that this had apparently solved the long-standing feature discovery problem, what Shannon had dreamed of.

However, there was not an explosion of results after this publication, because researchers found that learning a value function on highly complex games and continuous problems in robotics was impossible, because the learning process would either not start, slow down dramatically, or even get worse over time. A key reason for this instability is that the value function is focused on position or state quality, not action quality.
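In its simplest tabular form, the temporal-difference rule described above looks like this. It is a sketch: TD-Gammon applied the same one-step update to neural network weights rather than a lookup table.

```python
V = {}  # state -> estimated probability of winning (unseen states default to 0.5)

def td_update(state, next_state, reward=0.0, alpha=0.1, terminal=False):
    """One TD(0) step. At the end of a game, `reward` is the actual win (1) or loss (0)."""
    v = V.get(state, 0.5)
    target = reward if terminal else V.get(next_state, 0.5)
    V[state] = v + alpha * (target - v)  # nudge the earlier guess toward the later one
```

Because finished games supply the only ground truth, accurate values first appear one move from the end and then spread backwards toward the opening, which is the bootstrapping the transcript describes.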

And so a key improvement needed was to train networks more directly on actions. This idea was proposed in a 1989 PhD thesis by Watkins. He suggested simply that instead of learning the value of a state, you should learn the value of a specific action in a state. He called this action quality, which gives the Q function its name. And this more targeted approach to learning often leads to faster and more stable results.

A second key realization was that we needed to train these systems on much bigger neural networks. But this wasn't obvious until 2012, when a team built AlexNet, one of the largest and most complex neural networks of its time. This was building on LeCun's work, but instead these networks were trained to categorize images, which is an unfathomably complex task to human-engineer. And just as Tesauro saw LeCun's digit-recognition networks and got inspired, the team at DeepMind saw this breakthrough and decided to follow up on the work of Tesauro by applying Q-learning and large neural networks, or what they called a deep Q-network. Their goal was to create a single network to play as many games as possible, and it would only learn from the rewards it would get in the video game, such as points or winning and losing. It started from scratch, what we call a knowledge-free approach.

And so they set up their network to receive an input from screen pixels, followed by eight layers of neurons leading to an output neuron for each possible controller action. These outputs corresponded to the predicted Q-value for each individual action. And in the same way, they initialized their network with random weights at first. So initially it would press random buttons and couldn't play at all. But after each experience, it would update the Q function to get slightly better at getting rewards. And they had it play millions of times across multiple games. And incredibly, it worked.
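Watkins' tabular Q-learning update, which DQN later approximated with a deep network over pixels, can be sketched as follows. States and actions here are just hashable labels; the learning-rate and discount values are conventional defaults, not anything from the paper.

```python
Q = {}  # (state, action) -> estimated future reward (unseen pairs default to 0.0)

def q_update(state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.99, terminal=False):
    """One Q-learning step: move Q(s, a) toward reward + best value reachable next."""
    q = Q.get((state, action), 0.0)
    best_next = 0.0 if terminal else max(Q.get((next_state, a), 0.0) for a in next_actions)
    Q[(state, action)] = q + alpha * (reward + gamma * best_next - q)

def greedy_action(state, actions):
    """The learned policy: pick the action with the highest predicted Q-value."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))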

>> Now, after 2 hours, or 300 games, it's better than any human can play this game.

It was clear that this system could learn to play games by anticipating future rewards. And they showed this in their paper by looking at the network's action-value estimates as it played Seaquest. The figure shows that the predicted value jumps after an enemy appears on the screen. Then the agent fires a torpedo at the enemy, and the predicted value peaks as the torpedo is about to hit the enemy. Finally, the value falls to roughly its original value. They write, "This demonstrates that our method is able to learn how the value function evolves for a reasonably complex sequence of events."

While this was an extremely exciting breakthrough, there was still a key limitation to this approach which prevented its direct application to physical robotics. AI could beat humans at chess, but was nowhere close to washing dishes. Why? One key reason was that Q-learning can break down with continuous control problems, where the action space isn't a choice between a few buttons but a range of possible actions. One of the most important examples of this is RoboCup, which started in the mid '90s and challenged robots to play soccer autonomously, and to this day it represents an important benchmark for physical intelligence. The first systems that tried Q-learning were promising but noticeably brittle and clunky. That's because they had to break up the action space into a small number of chunks, or buckets, like move forward, turn left 45°, or kick. And so this approach to deep Q-learning faced a key problem: how to handle large, continuous action spaces.

A solution to this problem takes us back to the original BOXES algorithm, which directly learned the probability of all next actions, but now powered by a neural network: a state would be provided as input, but the output would be the probability across all actions, known as an action distribution. An unlearned policy starts as random, with equal probability for all actions, and as it learns, the shape of this action distribution changes in the direction of more reward. Then the policy can simply select an action according to this distribution, and this approach can handle any number of possible next actions. This is known as a policy gradient approach.

A key problem with this direct policy method is that it's a much harder task, since you're not learning values of individual actions; you're learning a distribution over all actions. And so, generally speaking, it requires more experience to learn effectively. And because of this need for so much experience, most of the initial progress was in simulated robotics, because with simulations you could speed up time by orders of magnitude and get years of experience in hours. But that still left an open question: what about the physical world?
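A minimal policy-gradient sketch: the policy is a probability distribution over actions (a softmax over preferences), and actions that earned reward get their probability nudged up. The per-state preference table below stands in for a neural network's outputs; this is a REINFORCE-style illustration of my own, not any specific paper's implementation.

```python
import math
import random

prefs = {}  # state -> list of action preferences (softmax logits)

def action_distribution(state, n_actions):
    """Unlearned policies start uniform; learning reshapes this distribution."""
    logits = prefs.setdefault(state, [0.0] * n_actions)
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(state, n_actions):
    return random.choices(range(n_actions), weights=action_distribution(state, n_actions))[0]

def policy_gradient_update(state, action, reward, lr=0.1):
    """Shift the distribution toward actions that earned reward."""
    probs = action_distribution(state, len(prefs[state]))
    for a in range(len(probs)):
        grad = (1.0 if a == action else 0.0) - probs[a]   # d log(pi) / d logit for a softmax
        prefs[state][a] += lr * reward * grad
```

Because the policy outputs a whole distribution, nothing here depends on the number of actions being small, which is why this family of methods handles continuous control better than bucketed Q-learning.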

Robots trained in simulation and then transferred to a physical robot would fail at complex tasks. And it was due to the complexity and unpredictability of the real world. And research aimed at training robots with direct policy methods entirely from physical experience, with no simulation, faced the frustrating challenge that it was much slower and more expensive to do. And for the first few years, there were many hacks to try and make this work. For example, adding human-engineered rules, such as preventing large changes to the action distribution at any step, as if elastics were attached to it to stabilize it, known as proximal policy optimization.

But a critical unlock came when it was realized that instead of more accurate simulations to learn in, you actually need less precise simulations. That is, you train a robot in purposely messy or fuzzy simulations. To do this, researchers simply trained a robot in simulation but randomized all the key aspects of the simulation, such as gravity, friction, size, and lighting conditions, to create a huge variety of environmental experiences to learn from. This approach, known as domain randomization, helped robots develop more robust and adaptable behaviors that could transfer better to the real world.
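Domain randomization is simple to express: re-draw the simulator's physical constants for every training episode, so the policy never gets to rely on one exact world. The parameter names and ranges below are illustrative, not those used in any particular paper.

```python
import random

def randomized_sim_params():
    """One fresh, randomly perturbed world per training episode."""
    return {
        "gravity":  random.uniform(8.0, 11.6),   # m/s^2, jittered around 9.81
        "friction": random.uniform(0.5, 1.5),    # scale on nominal surface friction
        "mass":     random.uniform(0.8, 1.2),    # scale on nominal object mass
        "lighting": random.uniform(0.3, 1.0),    # brightness, for vision inputs
    }

def train(policy_update, episodes):
    """Each episode sees a different world; the policy must work in all of them."""
    for _ in range(episodes):
        params = randomized_sim_params()
        policy_update(params)  # placeholder for one simulated rollout + learning step
```

A policy that earns reward across this whole family of simulated worlds tends to treat the one real world as just another sample from the distribution.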

A small research lab at the time, known as OpenAI, famously demonstrated this approach with the first physical complex hand which could learn dexterous manipulation of a cube. The reward signal in this case was based on getting the cube into different orientations. And they trained this hand first in simulation and then transferred it to a physical robotic hand. They noted that the randomization provided a diversity of experiences it could draw from, and so it generalized over environments. This allowed it to recover from all sorts of things you could throw at it.

>> One thing that's very interesting to us is how general the system is. Not only can it rotate blocks, but it can perform tasks with other shapes as well.

>> It was very human. All of these were emergent behaviors, physical intelligence, which came out of the learning process.

And if we fast forward to 2024, DeepMind has been working on robot soccer, applying everything we've learned so far. They've been training humanoid robots with direct policy methods in simulation to play soccer, with domain randomization, using scoring as a reward. After simulation, this neural network was then transferred directly into a physical robot with no additional training. And like the hands, they show complex emergent behavior that humans learn, such as anticipating shots and blocking shots before they happen. This was even more evidence of physical intelligence emerging from the learning process. The researchers visualize these learned behaviors as pathways through action space. A key next step in their work is including more in-life learning, so the live robot can continuously improve. And so a key question they have is how much needs to be learned in simulation versus in the wild.

But there is an even deeper question lurking. All of the methods we've looked at so far are ultimately narrow systems. Their rewards were defined to solve a specific problem, not general-purpose problem solving. And so what about general physical intelligence? A robot that can act in any way to do anything.

If you go back to the ancient history of gameplay AI, of checkers AI, chess AI, computer game AI, everyone would say, look at this narrow intelligence. Sure, the chess AI can beat Kasparov, but it can't do anything else. It is so narrow: artificial narrow intelligence. So in response, as a reaction to this, some people said, "What we need is general AI."

So how do we fix this? Today, researchers are feeding robots millions of hours of human movement to learn from. But pure imitation is brittle. So we are giving the robot an imagination. These robots will simulate the next few seconds of action. In the same way a large language model talks to itself before responding, these robots will simulate the next five seconds of action before acting. But a simulation is useless unless you care about the outcome. By combining imagination with the synthetic pain, we give the robot the ability to feel failure before it happens. This closes the loop. Sixty years ago, Michie's matchboxes learned by consequences. Today's robots will learn by watching us and then simulating those consequences in their own minds. But what does that mind actually look like? How do you build an imagination?

Luckily, I've made a video which breaks down this history of machine reasoning.

It'll force us to answer the question, what is the line between actions and words?

Look, Dave, I can see you're really upset about this.

Will you stop, Dave?

Stop Dave.

I'm afraid.

I'm afraid.

Speaking of machines that can reason, this video is brought to you by Higgsfield AI and their brand new Nano Banana Pro engine. Unlike older image generators that struggle with details, Nano Banana Pro is powered by Google's Gemini 3 Pro model. That means it has actual visual reasoning, understands counting and anatomy, and can finally render perfect text inside images at high resolution. For example, here I asked it to render a robotic hand in Da Vinci style with text labels, and it all actually makes sense. Right now, they're offering unlimited Nano Banana Pro generations for 1 year. This won't last, so click the link in the description and start creating without limits. Use my promo code and get access to all top AI models.
