
Embodied Spatial Intelligence: Bridging Perception, Reasoning, and Action

By Frontiers of Engineering Management

Summary

## Key takeaways

- **Construction's Deadly Stagnation**: Construction has the highest occupational injuries and fatalities of any industry worldwide, and its labor productivity has remained stagnant for nearly 100 years while rising in other non-farm sectors. [03:06], [04:02]
- **Unique Construction Challenges**: Construction sites are unstructured and dynamic unlike manufacturing, require extensive material manipulation unlike self-driving cars, and face constant change, featureless areas, repetitive structures, occlusion, and data sparsity. [06:00], [07:27]
- **Embodied Spatial Intelligence Defined**: Embodied spatial intelligence means understanding and interacting with real-world space through physical actions, not just in virtual worlds; it is essential for enabling construction automation and robotics. [13:06], [13:50]
- **Four M's Framework**: The keys to embodied spatial intelligence are mapping (representing the world from egocentric observations), moving (navigation and decision-making), making (building in the physical world), and mingling (multi-agent collaboration). [19:13], [19:26]
- **AI Experts' Failed Predictions**: John von Neumann, Herbert Simon, and Marvin Minsky all made overoptimistic predictions decades ago about machines doing human work such as construction, which we still cannot achieve. [16:56], [18:21]
- **Mobile Manipulator Necessity**: Construction equipment acting as mobile manipulators must perform simultaneous navigation and environment modification, unlike fixed robotic arms in manufacturing. [10:40], [11:00]

Topics Covered

  • Construction Productivity Stagnates for Decades
  • Construction Sites Defy Manufacturing Automation
  • Embodied Spatial Intelligence Enables Wall-E Robots
  • Four Ms Unlock Construction Robotics
  • AI Robots Engineer Termite-Level Building

Full Transcript

The topic I'm sharing today is construction automation and robotics. In particular, I'll summarize the research needs in this field, and my group's research focus over the past six or seven years, under the term embodied spatial intelligence, and I'll try to make the connection between the two topics.

Before going into the details, let me very briefly introduce my lab, AI4CE. The lab's mission is to address the following challenge: to develop novel algorithms and systems for intelligent agents to accurately and efficiently understand and interact with materials and humans in dynamic and unstructured environments. We believe this is the fundamental research need in construction automation and robotics, which must be met before technologies in this field can become really useful at a large scale. To achieve this mission, we need to adopt a multidisciplinary research methodology and the so-called use-inspired research paradigm, which in turn requires expertise from a multidisciplinary team. In my group, we have students working in fundamental robotics and AI fields such as computer vision, robotics, and machine learning.

More specifically, we focus on visual and LiDAR-based localization, mapping, and navigation, as well as 3D vision, perception, and learning. As for problem domains, we are mainly interested in problems originating from the construction and manufacturing fields, as well as smart cities, which means transportation and connected and autonomous vehicles.

As mentioned, today's main theme is construction automation and robotics, and there is a strong need for this globally; I'm sure colleagues in China have noticed the recent shift. Society faces two important problems in the construction sector, and for those of you who have been in the construction field for years, the following two figures will be familiar. One of them shows the high injuries and fatalities on construction sites. The figure here is from the US, but the same trend applies to pretty much every country worldwide: construction typically has the largest number of occupational injuries compared with other industry sectors, and not only the absolute number but also the rate is significant. We also know that construction has long been accused of stagnant labor productivity.

If you compare the labor productivity of construction against all other non-farm industries over the past several decades, almost 100 years, then no matter how you calculate the index, you always get a stagnant curve for construction against a rising curve for the other non-farm industries. So there is strong demand for bringing new technology into construction to address these two important problems. And these are not the only two problems facing construction: the industry also lacks young, aspirational talent. Many in the younger generation do not think of the construction industry as a rewarding field, so not many people are willing to enter it. How do we address all of this as a whole? Of course,

there are different ways to address it. One of the many ways, and also my personal and my group's research focus, is to look to high technology, namely automation and robotics: can we learn from the manufacturing and transportation fields, which have already benefited significantly from high tech, especially robotics? Companies like Tesla and Google have been attracting some of the really bright new talent into their fields with their robotic factories and self-driving cars.

But beyond this analogy between construction and manufacturing or transportation, I want to point out several unique challenges for construction robotics. Unlike manufacturing environments, construction environments are usually unstructured and dynamic. Unlike self-driving cars, construction robots have to perform much more material manipulation, and they typically have to collaborate with human workers on site. So fundamental research is required for construction robots; we cannot just apply existing techniques from automation, robotics, machine learning, or even AI and expect them to magically solve the problems we face in construction and engineering management. Let me illustrate some of the more specific technical challenges construction folks face when we try to bring automation and robotics onto the site.

The first is the constantly changing job site. Here is a dataset and challenge hosted by Stanford's civil engineering department, called the Nothing Stands Still challenge. Basically, they went to a construction job site and scanned it in 3D, in great detail, almost every week, and you notice that nothing stands still on a construction site: things change every week, if not every day. This brings a lot of challenges to robotics, as we will discuss later. Another issue specific to construction is the great difficulty of localizing and navigating agents, including autonomous equipment and even construction workers. While it is being built, much of a construction site is featureless and full of repetitive structures, and it is difficult even for humans to know where they are if they are unfamiliar with the site, especially as the site gets large and complicated.

Last but not least, just as in transportation and self-driving cars, construction sites also face severe occlusion and data sparsity issues. If construction equipment observes something from far away, the number of data points you can get from either a LiDAR scanner or a camera is limited: only a few pixels, or a few LiDAR points, fall on a particular distant object. Not to mention the heavy occlusion we all know about when dealing with framing operations or really complex, confined spaces on a mega site. Occlusion is another issue when you constantly need to check whether things are going to collide with each other, especially if you are going to bring autonomous or semi-autonomous equipment, trucks, excavators, and cranes that need to operate on their own, into this kind of complex site with heavy occlusion. This is even more problematic because we want to use these techniques to make the site safer: if you cannot address these technical challenges, you are actually bringing more danger onto the construction site.

So construction vehicles are becoming autonomous these days. As you can see in the image here, this is not just happening in academia; it is happening in the field, among heavy equipment manufacturers, companies like Volvo and Boston Dynamics, and even in China, where companies like XCMG and Sany Heavy Industry are all investing in autonomous equipment technology.

In robotics terms, we can call these construction vehicles or pieces of equipment mobile manipulators. Regardless of the specific construction jobs they have to perform, from a fundamental robotics perspective they all need to perform simultaneous navigation and environment modification. They need to move in order to do their jobs: unlike on the manufacturing side, you can rarely have a robotic arm fixed in one place, just waiting for things to be transported into its vicinity. The robot actually has to move around, and while moving it has to modify its environment. This is what we call a mobile manipulator in robotics, and it is an active research field with rising interest, because it challenges existing theories and algorithms for perception, localization, planning, control, and coordination.

On the left-hand side I'm showing a very famous children's storybook called Mighty, Mighty Construction Site. This book basically highlights a vision, a future we may want to see, where all the equipment becomes autonomous and collaborates coherently. By the way, if there is any question, feel free to unmute yourself and interrupt me at any time, or you can always type it in the chat.

All right. In my past research, I tend to call these mobile manipulators Wall-Es, just because Wall-E is more widely known to the general public, to people who are not in robotics; if you talk about Wall-E, people immediately get the idea. So how do we make these Wall-Es safer and more efficient when we bring them onto construction sites? The fundamental research I believe is needed is what I call embodied spatial intelligence. This is a term I coined that combines two very popular terms nowadays: embodied intelligence and spatial intelligence. We think this is the secret sauce that will fundamentally enable construction automation and robotics. What do I mean by embodied spatial intelligence? Fundamentally, we need to understand and interact with the space we reside in, and importantly, we need to form this understanding and interaction through physical actions. We are not just doing things in an imaginary world or a purely digital virtual world such as BIM. To achieve embodied spatial intelligence, your equipment, your vehicles, must actually move and work on a real job site.

And this is not something special or unique to construction; it exists in mother nature. As you can see in these figures, many animals, in the cities or in the wild, have this capability. They know how and where to find food. For example, very interestingly, if you ever visit cities in North America you'll notice squirrels living in the city that have learned to cross roads with a certain level of safety, so they can avoid being hit by cars, a very interesting phenomenon. And recently PNAS published a work that compares this kind of collaborative intelligence in ants versus humans. The task they gave the ants was to move a structure through some interesting geometry, and they ran the same test with human participants: collaboratively move this red object through a confined space with geometric obstacles. Even humans need multiple tries, and, believe it or not, we are no more efficient than the ants. This is very interesting, but most importantly, to the interest of construction folks, we want to engineer those capabilities back into machines. Not only do we want autonomous vehicles, autonomous delivery, and material handling systems; we also want them to be able to interact with and move objects, and to work in factories and on construction sites.

Now, talking about this, you probably wonder how far we are from engineering such intelligence, given all the great excitement and progress in the broad field of AI, especially with GPT-5 released just yesterday. Here I want to quote some famous experts known to the general public. This figure shows John von Neumann, widely regarded as a father of modern computers. Many decades ago he said: "If you will tell me precisely what it is that a machine cannot do, then I can always make a machine which will do just that." Think about what we can and cannot do now. There are still so many things that I can precisely tell you I want a machine to do, but it cannot, for example, a machine that can do household jobs for us, or a machine that can construct.

Here is another very famous scholar, a Turing Award winner who later also won a Nobel Prize: Herbert Simon, a professor at Carnegie Mellon University. In 1965 he said that "machines will be capable, within twenty years, of doing any work a man can do." Clearly we know that is not true even today. And one more very famous figure in artificial intelligence, Professor Marvin Minsky, won his Turing Award in 1969. One year after receiving it, he said that "in from three to eight years we will have a machine with the general intelligence of an average human being." Are we there yet? Clearly not. The reason I'm showing these quotes from these famous people is, first, that such predictions are very difficult to make, even for some of the brightest minds in human history; and second, that this field has so much potential for young and aspirational minds like you. There are so many interesting things to work on, so let's work on them together. From my lab's perspective, and personally, what I think are the keys to embodied spatial intelligence, which will eventually lead to construction automation and robotics, are the following four M's: mapping, moving, making, and mingling. I'm going to very quickly explain what these four words mean before diving into the details.

Mapping really means representing the world: how do robots build maps, basically representations of their surrounding environments? A robot has to use what we call egocentric observations. It does not necessarily depend on cameras or sensors installed in the environment; it has to rely on sensors installed on its own body, which move together with it, to turn its observations into representations of the environment. This is what we call mapping. As some of the animated figures in this slide, which highlight my group's work, show, the most well-known form of mapping is metric mapping. The idea is to recover accurate geometry, so that you can measure distances, angles, volumes, and so on in the map.
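As a minimal illustration of a metric map (the function names and grid sizes here are my own, not from the talk), an egocentric 2D occupancy grid can be sketched like this: the robot integrates range returns into a discretized world representation from which metric queries become simple array operations.

```python
import numpy as np

def integrate_scan(grid, robot_xy, hit_points):
    """Mark sensor hit points as occupied in a 2D occupancy grid.

    grid: HxW array, 0 = free/unknown, 1 = occupied.
    robot_xy: (row, col) of the robot; its own cell stays free.
    hit_points: list of (row, col) cells where returns landed.
    """
    for r, c in hit_points:
        if 0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]:
            grid[r, c] = 1
    grid[robot_xy] = 0  # the robot's own cell is traversable
    return grid

# A 5x5 site with two observed obstacles.
grid = np.zeros((5, 5), dtype=int)
grid = integrate_scan(grid, (2, 2), [(0, 1), (4, 3)])
```

Once the grid exists, distances, areas, and volumes are just arithmetic over occupied cells, which is exactly what makes a map "metric."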

But mapping is not only metric; it also requires topological understanding, and topological mapping is another very important form of maps. Moreover, mapping is not just about parsing the data we observe; it is also about reasoning in these complex environments.

The second keyword is moving. We need to teach robots how to decide where to go; if they cannot, then the whole Wall-E story, the mobile manipulation we hope construction robots can perform, will not be possible. To enable robots to decide where to go, one of the most fundamental problems is localization: they need to understand where they are in a complex 3D, or simplified 2D, environment. And they need to know more than that. The figure here, from our group's very recent work, shows state-of-the-art six-DoF localization of a camera: we recover both the 3D position and the orientation of the camera in space from camera images alone, without requiring GPS or any other sensors.
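To make "six-DoF" concrete (a generic sketch, not the talk's actual localization method): a pose combines three rotational and three translational degrees of freedom, often packed into a 4x4 homogeneous matrix that can be composed and inverted.

```python
import numpy as np

def make_pose(R, t):
    """Build a 4x4 homogeneous pose from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def invert_pose(T):
    """Invert a rigid pose analytically: R^T, -R^T t."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

# A 30-degree yaw plus a translation: 3 rotational + 3 translational DoF.
a = np.deg2rad(30.0)
R = np.array([[np.cos(a), -np.sin(a), 0.0],
              [np.sin(a),  np.cos(a), 0.0],
              [0.0,        0.0,       1.0]])
T = make_pose(R, np.array([1.0, 2.0, 0.5]))
```

Estimating `T` for a camera from images alone is the localization problem the talk refers to; the matrix form is simply the standard way robotics represents the answer.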

But even knowing this is not enough for a robot to navigate complex environments. The robot could be a self-driving car; it could also be an excavator on a job site. It needs to understand the surrounding situation. What is happening? Is there any signal preventing me from taking certain actions? When can I move across the field in front of me, and when can I not? All of these questions go beyond localization; they are decision-making problems.

Of course, the part that is most interesting and most useful to construction folks is the making part. We actually need to enable these autonomous machines to build in the physical world: in addition to localizing and moving themselves as self-driving cars do, they need to move, do things, and make things happen. The animation here is from the well-known Wall-E movie; I'm sure many of you, or your kids, have watched it. But I want to remind you of the amazing capabilities of mother nature. Here is a termite mound found in Africa, and I'm sure you can find similar structures in other parts of the world. Animal architects, termites and others, can build structures, which is an amazing capability. Even nowadays, after thousands of years of technological progress in construction, we still don't have an autonomous system that can accomplish this kind of quote-unquote simple task that even a termite or an ant can do. That means some theoretical challenges lie beneath these applications.

A couple of years ago, we greatly simplified this construction problem, in a very theoretical and non-realistic manner, in order to study the theoretical challenges of construction by mobile agents. We simplified things into a grid-like environment, just like the game of Go; we know that a few years ago Google's AlphaGo was able to beat the human champion at Go. If you study this mobile construction problem in a grid world, you face something like this. Even though this abstracts away many realistic, real-world construction challenges, the simplified environment already exhibits a theoretical challenge, which, for those interested in theory, is what we call a POMDP: a partially observable Markov decision process.

When you face a POMDP, theoreticians will know it is a super hard, super challenging problem, even for pure roboticists or machine learning and AI researchers. So it is no wonder that construction robotics and construction automation remain among the biggest challenges in robotics.

Here we also show some figures of the engineering tricks we use to circumvent some of these theoretical challenges. On the top is a mobile 3D printing platform my team developed, to take 3D printing beyond the gantry-based systems you typically see, whether in a tabletop plastic 3D printer or a concrete 3D printer on a construction site. Currently these almost always depend on a gantry-based system or a fixed robot arm to deliver the material. But we think that if you put the print head on a mobile robot, and let mobile robots collaborate on the construction site, you can essentially print much faster, in a collaborative manner, and you can print something bigger than the printer itself. On the bottom we show autonomous terrain-grading experiments happening in our lab. This is an ongoing project; we hope that researching these kinds of topics will one day enable robots to go to extraterrestrial environments like the Moon or Mars to build bases for us. So this is the making part.

The last keyword is mingling, which really means multi-agent collaboration. Again, in mother nature we see all kinds of collaboration among different agents. Here is an example with monkeys: they talk in their own way and communicate with each other to tell other monkeys where food is and where danger may be. And of course this goes beyond monkeys; we see these behaviors in ants, in birds, in many animals. This inspires us to ask how we can enable autonomous agents, such as robots or construction equipment, to work together so they can see and act better. My group has actually pioneered this direction and developed a research subfield called co-perception, or collaborative/cooperative perception.

Due to the lack of data from construction sites, we started by collaborating with researchers in transportation. In transportation we already have the possibility to either simulate or collect real-world multi-agent interaction data, where vehicles near each other are all equipped with sensors and can talk and communicate with one another. Essentially, if you allow these agents to share what they see, then collectively they see further, they see better, and they see through occlusions, which makes the whole system fundamentally safer than a single-agent perception system. And of course, we want to go beyond collaborative perception alone.
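The sharing idea can be sketched very simply (a minimal illustration, not the group's actual pipeline): each agent detects obstacles in its own ego frame; given each agent's pose, everything is transformed into a common frame and fused, so one agent effectively "sees" what another observed behind an occlusion.

```python
import numpy as np

def to_world(points_ego, yaw, position):
    """Transform 2D ego-frame detections into the shared world frame."""
    R = np.array([[np.cos(yaw), -np.sin(yaw)],
                  [np.sin(yaw),  np.cos(yaw)]])
    return points_ego @ R.T + position

# Agent A at the origin sees one obstacle; agent B, offset and rotated,
# sees another that is occluded from A's viewpoint.
a_pts = to_world(np.array([[2.0, 0.0]]), yaw=0.0, position=np.array([0.0, 0.0]))
b_pts = to_world(np.array([[1.0, 0.0]]), yaw=np.pi / 2, position=np.array([5.0, 5.0]))

fused = np.vstack([a_pts, b_pts])  # the collaborative picture: both obstacles
```

Real co-perception systems also have to handle bandwidth limits, pose error, and what representation to share (raw points, features, or detections), but the frame-alignment-and-fusion step above is the core of why agents collectively see further.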

When we bring this onto construction sites, we want the agents to be able to build collaboratively, like the two mobile robots shown earlier performing 3D printing tasks together. Eventually, we want these robots to collaborate just as we humans do now, with other humans and with construction equipment, to collectively complete construction tasks together.

I know we have about 40 minutes for the presentation, and I actually have more details on any one of these keywords; depending on the audience's interest, I can go deeper into any of them. But maybe in the interest of time, because Professor Ma told me I should finish in about 40 minutes so that we have...

>> Don't worry about that; you can give more detail on these.

>> Okay, not a problem. Then I'll have a lot more time; I still have about 20 minutes, so I can go into more detail. It really depends on the interest of the audience; I can go faster or slower on any of those topics.

All right, so let me talk about mapping first. Many colleagues working in FEM and engineering management are familiar with the term point cloud. Many times you need to go out to your job site, take your favorite terrestrial laser scanner, set it on a tripod, let it scan the environment at different locations on the site, and eventually perform a registration task, which is what we call mapping, or point cloud mapping. Here is the mathematical summary of the point cloud mapping problem. You don't really need to worry about it; I don't want to put you to sleep on a Saturday night. What I want you to know is that this problem is not unique to construction. Even in autonomous driving we need to solve exactly the same problem: cars move around the city, scanning the local geometry from their own poses at different times, and eventually all those scans must be registered together to build a map of the entire city. This is no different from the construction-site point cloud registration problem.
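For readers who do want a taste of the math: with known correspondences, the core of pairwise point-cloud registration has a classical closed-form solution (the Kabsch/Procrustes algorithm). This sketch illustrates that generic building block, not the speaker's system:

```python
import numpy as np

def register(P, Q):
    """Find rotation R and translation t minimizing ||(P @ R.T + t) - Q||
    for corresponding point sets P, Q of shape (N, 3)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)          # cross-covariance of centered clouds
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:           # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cq - R @ cp
    return R, t

# Synthetic check: rotate/translate a cloud, then recover the transform.
rng = np.random.default_rng(0)
P = rng.normal(size=(50, 3))
angle = np.deg2rad(40.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.5, -1.0, 2.0])
Q = P @ R_true.T + t_true
R_est, t_est = register(P, Q)
```

The hard part in practice, and what systems like ICP and learned methods address, is that real scans do not come with correspondences; this closed-form step is what gets iterated once putative matches are found.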

In the mapping part, my group has done some work trying to make this process more efficient and less labor-intensive. I know this field has grown rapidly, and with the latest services and software from companies like Leica or Trimble, the process is already much smoother than before. But if you are talking about scanning a site, handing all the data to the software, and letting it automatically register all the point clouds without any human interference or attendance, I believe we still cannot do that 100% today; you always run into issues here and there that require some human attention. So we want to make this whole pipeline as autonomous as possible, because as construction engineers or project managers we don't really care about the underlying technology; what we want is a fast scan and understanding of the current status of the job site, so we can do any quantitative work as soon as possible.

So we developed this work called DeepMapping, one of the most accurate metric mapping systems. It can take all the point clouds you scan, no matter whether they come from an autonomous driving car or a job-site scanning robot, and register them autonomously. As you see here, we plot the sensor pose in each frame of a city-scale scan, and the system automatically registers all the poses until they align very accurately; the sensor poses trace out the street structure of a city. At the beginning the estimate is very noisy, and eventually the system converges, without any human attendance, at city scale. Here we show both the sensor trajectory and the scanned laser points projected onto a 2D plane, and our system comes very close to the ground truth, a high-accuracy reference that itself requires some human attendance to build.

In the interest of time I will skip the technical details; I don't want to put you to sleep on a Saturday evening. The important thing is that we developed an AI system, basically a deep neural network, and we made the mapping problem equivalent to training that network. By training the network, the problem gets solved; we don't even need to deploy the trained network later. Training the network is itself equivalent to solving the mapping problem, and then we can leverage all the good techniques in AI to make this problem autonomous.
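This "training is the mapping" idea can be illustrated with a toy sketch. This is my own illustration, not the actual DeepMapping architecture (which optimizes a deep occupancy network with a self-supervised loss): here a single frame's 2D pose is the trainable parameter, and plain gradient descent on an alignment loss registers the scan.

```python
import numpy as np

# Toy sketch of "registration as optimization" -- NOT the actual DeepMapping
# network, just the core idea: a frame's pose is a trainable parameter, and
# gradient descent on an alignment loss "trains" the pose until it registers.

rng = np.random.default_rng(0)
ref = rng.uniform(-1.0, 1.0, size=(50, 2))            # reference scan (2D points)
true_theta = 0.3                                      # ground-truth rotation (rad)
true_t = np.array([0.5, -0.2])                        # ground-truth translation
Rt = np.array([[np.cos(true_theta), -np.sin(true_theta)],
               [np.sin(true_theta),  np.cos(true_theta)]])
scan = ref @ Rt.T + true_t                            # observed, transformed scan

theta, t = 0.0, np.zeros(2)                           # trainable "pose parameters"
lr = 0.5
for _ in range(300):
    c, s = np.cos(theta), np.sin(theta)
    pred = ref @ np.array([[c, -s], [s, c]]).T + t    # warp by current pose guess
    err = pred - scan                                 # per-point residual
    dR = np.array([[-s, -c], [c, -s]])                # d(rotation)/d(theta)
    g_theta = 2.0 * np.mean(np.sum(err * (ref @ dR.T), axis=1))
    g_t = 2.0 * err.mean(axis=0)                      # gradients of the MSE loss
    theta -= lr * g_theta
    t -= lr * g_t

print(round(theta, 3), np.round(t, 3))                # converges to the true pose
```

In the real system the optimization runs self-supervised over many frames at once; the toy keeps only the training-equals-mapping structure.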

We can do mapping not only from point clouds but also from images. This is typically lower cost, because laser scanners are still pretty expensive and often bulky and heavy. If you have a pocket camera like your cell phone, or a 360 camera, you can very easily walk around the job site, and then you wonder whether you can build a map from those images. This is a very well-known problem in computer vision and robotics called structure from motion; in robotics, we call it visual SLAM. There are different ways people perform traditional SfM, and more recently people use neural networks to do learning-based metric mapping or learning-based topological mapping. For those of you with the mind of a theoretician, there is a very abstract mathematical form that summarizes all of these mapping-from-images problems. But for practitioners, here is something that may be of interest to you: a work we published two years ago at CVPR, one of the top AI conferences.

It is named VoxFormer, a camera-based 3D semantic scene completion system. The problem was originally popularized by Tesla in 2022. Basically, when a camera moves through the environment, we want an end-to-end learning-based AI system that generates a voxel-like representation of that environment, which not only tells me the occupied space versus free space in the 3D world with high accuracy, but also tells me what things are: the different color codes show different parts of the environment, such as the road surface, the vehicles, and the buildings. This is a very interesting and challenging problem, and we were able to achieve the state of the art at that time by developing a novel transformer-based neural network architecture that generates, from a single-view image, results that match the ground truth data very closely, whereas the previous baselines all made more mistakes. Okay.
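The occupied-versus-free voxel representation behind this kind of scene completion can be sketched in a few lines. This is a toy illustration of the representation only, not VoxFormer's transformer (which completes geometry and semantics from a single camera image): here range rays carve a grid into free, occupied, and unknown cells.

```python
import numpy as np

# Toy occupancy-carving sketch -- not VoxFormer itself, just the underlying
# representation: a grid where each cell is unknown (-1), free (0), or
# occupied (1), filled in by marching along simulated range rays.

N, res = 20, 0.1                       # 20x20 grid, 0.1 m cells, sensor at center
grid = -np.ones((N, N), dtype=int)     # everything starts as unknown
origin = np.array([N // 2, N // 2])

def carve(angle, rng_m):
    """March along one range ray: free space up to the hit, occupied at it."""
    direction = np.array([np.cos(angle), np.sin(angle)])
    n_steps = int(rng_m / res)
    for k in range(n_steps + 1):
        cell = (origin + direction * k).astype(int)
        if not (0 <= cell[0] < N and 0 <= cell[1] < N):
            return
        grid[cell[0], cell[1]] = 1 if k == n_steps else 0

for a in np.linspace(0, 2 * np.pi, 72, endpoint=False):
    carve(a, rng_m=0.6)                # pretend every ray hits a wall at 0.6 m

# counts of occupied / free / still-unknown cells
print((grid == 1).sum(), (grid == 0).sum(), (grid == -1).sum())
```

The semantic part of the real system goes one step further: each occupied voxel also gets a class label (road, vehicle, building), which this sketch omits.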

We can also go beyond metric mapping. As I mentioned, the multi-view scene graph work that our group published last year is a so-called topological scene representation. The reason we want methods in addition to metric mapping is that, for those of you who have used terrestrial laser scanners, you would know that if the scene changes during your scan, it can create a lot of problems, not only for registering multiple point cloud scans but also for recognition: later on you need to figure out where a particular thing is, which things happened first and last, and how to register a new scan into the existing one when the scene has already changed significantly. So if we move beyond the metric mapping requirement, which forces us to build a global map that is metrically accurate everywhere, we can relax that condition and say we just need to build a graph that preserves the topological connectivity among those images, or the objects in them. Then we can maintain such a graph much more quickly and easily even when the environment changes, because if anything changes, we just replace that image: remove the corresponding object node and replace it with the new image and the new objects inside it.

Right? So this is the fundamental idea of the topological representation beyond metric mapping, and this kind of representation will be super useful for tasks like asset management. For those of you who have managed a mega project, you would know how challenging it can be when the job site's assets are not properly managed: you constantly face the problem of finding where a particular useful and unique asset is, and who used it last. We believe this kind of topological mapping is a very flexible and robust way of addressing that problem.

Here are just some of the applications that can be enabled by this kind of topological map. For example, the embodied question answering problem, as an application of topological maps, is a great example of resolving the asset management challenge. When you build such a topological map of the environment from videos, say the surveillance video you have of the site, you can basically interact with it just like you interact with GPT: you can ask questions to such a model, like where a particular object is, in which place you last saw it, and at what time. What's the last time you saw that object?
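A minimal sketch of this kind of queryable topological asset map follows. The names and data are hypothetical and this is my own toy structure, not the published multi-view scene graph model: image nodes carry timestamped object observations, edges carry connectivity, and a scene change is handled by swapping one node rather than re-registering a global metric map.

```python
from dataclasses import dataclass, field

# Toy topological scene graph -- hypothetical names, not the published system:
# nodes are camera views with timestamped object observations, and a change in
# the scene is handled by replacing a single node while keeping the topology.

@dataclass
class ImageNode:
    objects: dict                          # object name -> timestamp seen here
    neighbors: set = field(default_factory=set)

graph = {
    "hallway_cam": ImageNode({"ladder": 10, "toolbox": 10}, {"storage_cam"}),
    "storage_cam": ImageNode({"generator": 12}, {"hallway_cam"}),
}

def last_seen(obj):
    """Embodied-QA style query: where and when was this object last observed?"""
    hits = [(node.objects[obj], name) for name, node in graph.items()
            if obj in node.objects]
    return max(hits) if hits else None

def replace_node(name, new_objects, timestamp):
    """Scene changed at one viewpoint: swap the node's contents, keep the graph."""
    graph[name].objects = {o: timestamp for o in new_objects}

print(last_seen("ladder"))                   # seen at t=10 by the hallway camera
replace_node("storage_cam", ["ladder"], 20)  # the ladder was moved to storage
print(last_seen("ladder"))                   # now last seen at t=20 in storage
```

The point of the design is locality: an update touches one node, so the map stays cheap to maintain on a constantly changing job site.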

I will skip the technical details of how we implement such an AI system, but I want to move on to the next keyword, moving, with a bit more detail. So we want to teach a robot how to move, or basically how to navigate an environment. One example that can be useful for construction, especially in the operational phase rather than the construction phase, is the exploration problem: in a new environment that the robot has never been to before, how do you autonomously navigate the robot so that it can explore every single part of that environment, again without human attendance?

If a building manager, when first taking charge of a building, can just send in a robot dog that autonomously goes everywhere, then they can leverage the information collected by these robots to build the maps we talked about before. But first you have to resolve the problem of how to tell the robot to go to all the places in the building, because remember, this is a new place the robot has never been to before, and you don't have a map. Maybe you have a BIM, but the BIM never reflects the actual conditions; it is always as-designed. The as-built BIM, as far as I know, is still far from being implemented in the real world, and even where it is implemented, it doesn't capture the real-world conditions, textures, and details. So it would still be good for a robot to be able to autonomously navigate and explore such an environment without missing any key places.

On this slide we're showing something called DeepExplorer, which we published two years ago at a top robotics conference, RSS (Robotics: Science and Systems). We equipped this robot with only one 360 camera; the images you see are the equirectangular images it captures. So you get the surround view very easily with this one sensor, without relying on more complicated sensors like LiDAR. We developed a system that, just like a human would, autonomously navigates an environment by memorizing which places it has already visited. Therefore, it doesn't require much revisiting, and it can spend more time on the unexplored, unfamiliar regions. We were able to train this whole system in a simulated environment and then deploy it in a completely different real-world environment. In this animation you're seeing an actual real-world deployment in our university building.

I will again skip the technical details of how we achieve this. The key is to develop a neural network architecture that addresses both the high-level task planning and the low-level motion planning through imitation learning, so that it imitates a kind of oracle expert at this task and eventually is able to autonomously explore the environment.
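The visit-and-avoid-revisiting behavior the policy learns can be sketched as a classic frontier walk. This is a hand-rolled toy on a grid floor plan, not DeepExplorer's learned policy (which works from 360-degree images via imitation learning):

```python
from collections import deque

# Toy exploration-with-memory sketch -- not DeepExplorer itself, just the
# behavior it learns: remember visited places, keep expanding toward
# unexplored ones, and stop when nothing unexplored remains.

floor_plan = [
    "#######",
    "#.....#",
    "#.###.#",
    "#.....#",
    "#######",
]

def explore(start=(1, 1)):
    """Breadth-first walk that never re-expands a remembered (visited) cell."""
    visited, frontier = {start}, deque([start])
    while frontier:                      # done once no unexplored cell is left
        r, c = frontier.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if floor_plan[nxt[0]][nxt[1]] != "#" and nxt not in visited:
                visited.add(nxt)         # memory of where we have already been
                frontier.append(nxt)
    return visited

free_cells = sum(row.count(".") for row in floor_plan)
covered = explore()
print(len(covered), "/", free_cells)     # full coverage, no cell expanded twice
```

The learned system replaces the hand-coded grid and frontier queue with a policy that recognizes "already seen" directly from imagery, but the coverage-with-memory logic is the same.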

I want to highlight a recent work from my group called CityWalker, published this year at CVPR, which uses web-scale videos for urban navigation. We believe that if we can solve the urban navigation problem, say navigating a robot dog from NYU to Columbia without any human attendance, relying only on the information available to an ordinary person, like a cell phone and Google Maps, then it becomes easier to bring a robot to a construction job site, because many of the problems you encounter in urban navigation are similar: you face dynamic environments, dense traffic, and different construction conditions. But the key to this kind of embodied AI is always the data.

Where do you get the data? We never had a large enough set of robot-dog navigation recordings available to researchers or to anyone in industry. What we were able to do is leverage the thousands of hours of videos people post on YouTube, which include humans either walking with a camera or driving through different kinds of cities. So we curated over 2,000 hours of YouTube videos that show very different, very dynamic situations across different seasons and weather conditions.

And then we were able to develop a scalable AI system that consumes this data and implicitly learns all the rules humans follow when walking or driving in those environments. Very interestingly, we were able to discover a scaling law: without changing the neural network architecture, just by increasing the amount of training data from a couple hundred hours to a couple thousand hours, we saw the error decrease, which is the scaling law. We also showed that including different modalities improves performance: even if your task is to walk, having watched enough driving videos helps you walk as well. These are all very interesting discoveries.
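A scaling law of this kind is usually checked by fitting a power law, that is, a straight line in log-log space. Here is a sketch with made-up numbers; the real CityWalker measurements are in the paper, not here.

```python
import numpy as np

# Sketch of how a scaling law is verified: error vs. training hours should be
# a straight line in log-log space, err ~ a * hours^(-b). The numbers below
# are synthetic, chosen only to make the fitting procedure concrete.

hours = np.array([200.0, 500.0, 1000.0, 2000.0])
err = 5.0 * hours ** -0.25             # synthetic errors following a power law

# a least-squares line fit in log-log space recovers the exponent b
slope, intercept = np.polyfit(np.log(hours), np.log(err), 1)
print(f"scaling exponent: {-slope:.2f}")   # error shrinks as hours^-0.25 here
```

When real measurements land on such a line, you can extrapolate how much more data would buy a given error reduction, which is what makes the observation practically useful.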

Let me skip ahead through some of these later slides. This shows our work on using AI to teach robots how to do things without requiring the human to ever know robot programming. We essentially use a vision-language model to translate a video into robot code, which makes robot programming easier. We think this is important for industries like construction and manufacturing, so that we don't need robotics experts to program things for us. And this is one of our early demonstrations under the making keyword.

We are also aware of human-robot collaboration, which is increasingly discussed in construction, where human workers work with exoskeletons: you wear this kind of exoskeleton robot and it provides physical assistance. The key challenge is making these robots smart enough to understand human intention. Here we demonstrate a capability: with our own dataset, and an AI trained on it, we enable a robot arm to collaborate with a human by predicting, one second in advance, where the human's hand is reaching during pick-and-place operations. This one-second head start is very important, because robots need time to execute their actions, and we need to act proactively rather than just passively observing. We were able to do this with reasonable accuracy, enabling better human-robot collaboration.
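For intuition, here is a zero-learning baseline for that one-second-ahead prediction. It is my own toy, with a hypothetical table layout, not the trained model from the talk: extrapolate the tracked hand at constant velocity and snap to the nearest candidate target.

```python
import numpy as np

# Toy baseline for reach-target prediction -- not the actual learned model,
# just the task shape: given a short hand track, extrapolate one second ahead
# and pick the nearest candidate object on the table.

targets = {"bolt_bin": np.array([0.60, 0.20]),
           "wrench":   np.array([0.10, 0.50])}       # hypothetical layout (m)

def predict_reach(track, dt, horizon=1.0):
    """Constant-velocity extrapolation of the last step, 'horizon' s ahead."""
    vel = (track[-1] - track[-2]) / dt               # latest hand velocity
    future = track[-1] + vel * horizon               # position 1 s from now
    return min(targets, key=lambda k: np.linalg.norm(targets[k] - future))

# hand moving right along the table, sampled at 10 Hz
track = np.array([[0.00, 0.2], [0.05, 0.2], [0.10, 0.2]])
print(predict_reach(track, dt=0.1))                  # heading for the bolt bin
```

A learned model earns its keep where this baseline fails: curved reaches, hesitation, and context cues that a straight-line extrapolation cannot capture.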

And this is the collaborative 3D printing that I already explained before, so I'll skip it, along with the autonomous terrain grading and the collaborative perception idea that I've also explained; I'll skip those technical details. So let me bring you to the end of my talk.

There are so many open problems in embodied AI, or embodied spatial intelligence. I list some of them here, such as lifelong learning, learning under uncertainty, energy efficiency, safety, and so forth. In the end, why do these matter? Well, we believe embodied spatial intelligence powers the future. It is very important for construction automation and robotics to really work: instead of building ad hoc solutions, the theoretical or fundamental science needs to be done in order to enable that future.

And here I show one example, a piece of translational research, basically a startup spin-off from my lab, that exemplifies this vision. This is a student startup company called Building Diagnostic Robotics, where we built a robotic system that carries all kinds of sensors, like LiDAR, camera, and radar, to scan building rooftops like a Roomba, as you see in this video. It then produces a line-by-line scan with AI-based classification of moisture damage on the rooftop. Of course, you can also repurpose this for rebar detection, which is another service the company offers right now.

And the cool thing is that the whole process is very efficient. Within 24 or 48 hours of a scan, we can deliver the entire report to engineers, and the scanning itself is fast: typically within one or two hours, depending on the size of a commercial building's rooftop, we can finish the scan fully autonomously, and then you get this kind of rooftop moisture map.

With that, I will stop here. I would like to thank all my students, collaborators, and funding agencies, without whom none of this could have been achieved. I really want to highlight some of our great students, who have received prestigious scholarships and fellowships and have had the courage to take this research into real-world startups, as well as our industry collaborators.

With that, I'll stop here, and I'm happy to answer any questions the audience may have. Thank you.
