Embodied Spatial Intelligence: Bridging Perception, Reasoning, and Action
By Frontiers of Engineering Management
Summary
## Key takeaways
- **Construction's Deadly Stagnation**: Construction has the highest rate of occupational injuries and fatalities of any industry worldwide, and its labor productivity has stagnated for nearly 100 years while other non-farm sectors have risen. [03:06], [04:02]
- **Unique Construction Challenges**: Construction sites are unstructured and dynamic unlike manufacturing, require extensive material manipulation unlike self-driving cars, and face constant change, featureless areas, repetitive structures, occlusion, and data sparsity. [06:00], [07:27]
- **Embodied Spatial Intelligence Defined**: Embodied spatial intelligence means understanding and interacting with real-world space through physical actions, not just in virtual worlds; it is essential for enabling construction automation and robotics. [13:06], [13:50]
- **Four M's Framework**: The keys to embodied spatial intelligence are mapping (representing the world from egocentric observations), moving (navigation and decision-making), making (building in the physical world), and mingling (multi-agent collaboration). [19:13], [19:26]
- **AI Experts' Failed Predictions**: John von Neumann, Herbert Simon, and Marvin Minsky made overoptimistic predictions decades ago about machines doing human work such as construction, yet we still cannot achieve it. [16:56], [18:21]
- **Mobile Manipulator Necessity**: Construction equipment acting as mobile manipulators must perform simultaneous navigation and environment modification, unlike fixed robotic arms in manufacturing. [10:40], [11:00]
Topics Covered
- Construction Productivity Stagnates for Decades
- Construction Sites Defy Manufacturing Automation
- Embodied Spatial Intelligence Enables Wall-E Robots
- Four Ms Unlock Construction Robotics
- AI Robots Engineer Termite-Level Building
Full Transcript
Yeah. So the topic that I'm sharing today is related to construction automation and robotics. In particular, I summarize the research needs, and my group's research focus over the past six or seven years, as embodied spatial intelligence, and I will try to make the connection between the two topics. Before going into the details, let me very briefly introduce my lab, AI4CE. The lab's mission is to address the following challenge in the field: to develop novel algorithms and systems for intelligent agents to accurately and efficiently understand and interact with materials and humans in dynamic and unstructured environments. We believe this is the fundamental research need in the field of construction automation and robotics, before we can make technologies in this field really useful at a large scale. In order to achieve this mission, we need to adopt a multidisciplinary research methodology and the so-called use-inspired research paradigm, and for that we need expertise from a multidisciplinary team. So in my group we have students working in fundamental robotics and AI fields such as computer vision, robotics, and machine learning.
More specifically, we focus on visual or LiDAR-based localization, mapping, and navigation, 3D vision or perception, and learning. As for the problem domain, we are mainly interested in problems originating from the construction and manufacturing fields as well as smart cities, which means transportation and connected and autonomous vehicles.
So we mentioned today's main theme is construction automation and robotics, and there is a strong need in this whole field globally. I'm sure the colleagues in China will have noticed this shift recently. Basically, society faces two important problems in the construction sector. For those of you who have been in the construction field for years, you will not be unfamiliar with the following two figures that I'm showing. One of them demonstrates the high injuries and fatalities on construction sites. Here is a figure from the US, but I think the same trend pretty much applies to every country worldwide. It basically shows that construction typically has the largest number of occupational injuries compared to other industry sectors, right? And not only the absolute number; the rate is also quite significant. We also know that construction has always been accused of having stagnant labor productivity. If you compare the labor productivity of construction versus all other non-farm industries over the past several decades, almost 100 years, then no matter how you calculate the index, you always get this stagnant curve for construction compared to the rising curve in the other non-farm industries. So we said there is a strong demand for bringing new technology into construction to address these two important problems. And these are not the only two problems faced by construction. We are also facing a lack of young talent: aspiring young generations do not tend to think of the construction industry as a rewarding field, so not many people are willing to enter it. So how do we address all of this as a whole? Of course, there are different ways to address it. One of the many things we can try, which is also my personal and my group's research focus, is to look at high technology, at automation and robotics: can we learn from the manufacturing or transportation fields, which have already benefited significantly from high tech, especially robotics? Companies like Tesla and Google have been attracting some of the really bright new talents into their fields with their robotic factories and self-driving cars.
But besides making this analogy between construction and manufacturing or transportation, I want to point out several unique challenges for construction robotics. Unlike manufacturing, construction environments are usually unstructured and dynamic. And unlike self-driving cars, construction robots have to perform more material manipulation, and they typically have to collaborate with human workers on the site. So fundamental research is required for construction robots; we cannot just apply existing techniques in automation, robotics, machine learning, or even AI to magically solve the problems that we face in construction and engineering management. So let me illustrate some of the more specific technical challenges faced by construction folks when we try to bring automation and robotics onto the site.
One of the first is the constantly changing job site. This is a dataset and challenge hosted by Stanford's civil engineering department; they call it the "Nothing Stands Still" challenge. Basically, they go to a construction job site and scan it in 3D in great detail almost every week, and then you notice that nothing stands still on a construction site, right? Things change every week, if not every day, and this brings a lot of challenges to robotics, as we will discuss later. Another thing specific to construction is that you typically face great difficulties in localizing and navigating agents, including autonomous equipment or even construction workers, because a lot of a construction site, while it is being built, is featureless. It has many repetitive structures, and it is difficult even for humans to know where they are if they are not familiar with the site, especially when the site gets large and complicated.
Last but not least, just like in transportation and self-driving cars, construction sites also face severe occlusion and data sparsity issues. If construction equipment is observing something from far away, the number of data points you can get from either a LiDAR scanner or a camera is limited: you only have limited pixels, or limited LiDAR points, focused on a particular distant object, right? Not to mention the heavy occlusion we all know when dealing with frame operations or really complex, confined spaces on a mega site. Occlusion is an issue where you constantly need to check whether things are going to collide with each other or not, especially if you are going to bring autonomous or semi-autonomous equipment, like trucks, excavators, and cranes that need to operate on their own, into this kind of complex site with heavy occlusion. That makes it even more problematic, right? Not to mention, we want to use these techniques to make the site safer; if you cannot address these technical challenges, you are actually bringing more danger onto the construction site.
So basically, construction vehicles are becoming autonomous these days, as you can see in the image here. This is not just happening in academia; it is happening in the field, at heavy-equipment manufacturers, companies like Volvo and Boston Dynamics, and different kinds of construction heavy-equipment makers. Even in China, I know companies like XCMG and Sany Heavy Industry are all investing in this autonomous equipment technology. So basically, in robotics terms, we can call these construction vehicles or equipment mobile manipulators. Regardless of the specific construction jobs these mobile manipulators have to perform, from a fundamental robotics perspective they all need to perform simultaneous navigation and environment modification. They need to move in order to perform their jobs. Unlike on the manufacturing side, you can rarely have a robotic arm fixed in one place that just does its work, waiting for things to be transported to its vicinity. The robot actually has to move around, and while moving around, it has to make modifications to its environment. This is what we call a mobile manipulator in robotics, and it is an active research field, even within robotics, with rising interest, because it poses fundamental challenges. It is challenging the existing theories and algorithms for perception, localization, planning, control, and coordination.
On the left-hand side I show a very famous children's story book called "Mighty, Mighty Construction Site." If you look at this story book, it is basically highlighting a vision, a future we may want to see, where all the equipment becomes autonomous and collaborates coherently with each other. Is there any question? By the way, if there is any question, feel free to unmute yourself and interrupt me at any time, or you can always type in the chat.
All right. Anyway, in my past research, I tend to call these mobile manipulators "Wall-Es," just because Wall-E is more widely known to the general public, to people who are not in the field of robotics. If you talk about Wall-E, people immediately get the idea. So how do we make these Wall-Es safer and more efficient when we try to bring them onto construction sites? Well, the fundamental research that I believe is needed is called embodied spatial intelligence. This is a term that I coined, combining two very popular terms nowadays: embodied intelligence and spatial intelligence (each also well known by its Chinese name). So what do I mean by embodied spatial intelligence? Fundamentally, we need to understand and interact with the space that we reside in, and importantly, we need to form this kind of understanding and interaction through physical actions. We are not just doing things in an imaginary world or a digital virtual world such as BIM alone, right? In order to achieve embodied spatial intelligence, your equipment, your vehicles, actually move and work on a real job site. And this is not something special or unique to construction; it resides in mother nature. As you can see in these figures, many animals, in the cities or even in the wild, have this capability. They know how and where to find food. For example, very interestingly, if you ever visit cities in North America, you will notice squirrels living in the city, and they have learned to cross roads with a certain level of safety, so they can avoid being hit by a car. This is a very interesting phenomenon. And recently PNAS published a work that compares this kind of collaborative intelligence in ants versus humans. The task they gave the ants was to move a structure through some interesting geometry, and they ran the same test with human participants: collaboratively moving this red object through a confined space with geometric obstacles. Even humans need multiple tries, and, believe it or not, we are not more efficient than the ants, right? So this is very interesting. But most importantly, to the interest of construction folks, we want to engineer those capabilities back into machines. Not only do we want to achieve autonomous vehicles or autonomous delivery and material-handling systems; we also want them to be able to interact with and move objects, and work in factories and on construction sites.
Now, talking about this, you probably wonder: how far are we from engineering such intelligence, given all the great excitement and progress in the broad field of artificial intelligence, especially with GPT-5 released just yesterday? Here I want to quote some famous experts who are known to the general public. This figure shows John von Neumann, widely considered a father of modern computers. Decades ago he said: "If you will tell me precisely what it is that a machine cannot do, then I can always make a machine which will do just that." Think about what we can and cannot do now. There are still so many things that I can precisely tell you I want a machine to be able to do, but we cannot build it; for example, a machine that can do household chores for us, or a machine that can construct.
Here is another very famous scholar, a Turing Award winner who later also won a Nobel Prize: Herbert Simon, a professor at Carnegie Mellon University. He said in 1965 that "machines will be capable, within 20 years, of doing any work a man can do." And clearly we know that is not true even nowadays, right? One more very famous figure in artificial intelligence is Professor Marvin Minsky, who won his Turing Award in 1969. One year later, he said that "in from three to eight years we will have a machine with the general intelligence of an average human being." And are we there yet? Clearly not.
The reason I am showing these quotes from these famous people is, first, that it is very difficult to make this kind of prediction, even for some of the brightest minds in human history. It also shows that this field has so much potential for young and aspiring minds like you to join. There are so many interesting things to work on; let's work on them together. So, from my lab's perspective, and personally, what I think are the keys to embodied spatial intelligence, which will eventually lead to construction automation and robotics, are the following four M's: mapping, moving, making, and mingling. I'm going to very quickly explain what these four words mean before diving into the details.
So mapping really means representing the world. How do robots build maps, basically representations of their surrounding environments? A robot has to use what we call egocentric observations: it cannot just depend on cameras or sensors installed in the environment; it has to rely on sensors installed on its own body, which move together with it, to build representations of the observed environment. This is what we call mapping. As some of the animated figures in this slide, which highlight my group's work, show, the most well-known form of mapping is metric mapping. The idea is to get accurate geometry, so that you can measure distances, angles, volumes, and so on in the map.
But mapping is not just metric; it also requires topological understanding. Topological mapping is another very important form of maps. And mapping is not just about parsing the data that we observe; it is also about reasoning in these complex environments. Okay, so the second keyword is moving. We need to teach robots how to decide where to go, because if they cannot do this, then the whole Wall-E story, the mobile manipulation we hope construction robots can do, will not be possible, right? In order to enable robots to decide where to go, one of the most fundamental problems is localization: they need to understand where they are in a complex 3D, or simplified 2D, environment. And they need to know more than that. Shown here in this figure is our group's very recent work that achieves state-of-the-art six-DoF localization of a camera. That means we know both the 3D position and the orientation of the camera in space, from camera images alone, without requiring GPS or any other sensors.
But even if you know this, it is not enough for a robot to navigate in those complex environments. The robot could be a self-driving car; it could also be an excavator on a job site.
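To make "six-DoF" concrete: a camera pose bundles a 3D position with a 3D orientation, and is conveniently handled as a single 4x4 homogeneous transform. Here is a minimal sketch of that idea; the yaw-only rotation and all numeric values are illustrative assumptions, not from the talk:

```python
import numpy as np

def pose_matrix(yaw: float, position: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a yaw angle (rad) and a
    3D position. A full 6-DoF pose would use all three rotation angles;
    one axis is enough to show the idea."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0],
                          [s,  c, 0],
                          [0,  0, 1]])  # rotation about the z-axis
    T[:3, 3] = position                 # translation (x, y, z)
    return T

# A hypothetical camera at (2, 0, 1.5), rotated 90 degrees about z.
T = pose_matrix(np.pi / 2, np.array([2.0, 0.0, 1.5]))

# A point 1 m ahead of the camera (camera frame) mapped into the world frame.
p_cam = np.array([1.0, 0.0, 0.0, 1.0])   # homogeneous coordinates
p_world = T @ p_cam                       # ≈ (2, 1, 1.5) in world frame
```

Estimating exactly this transform, for every image, is what the localization system has to do.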
Right? They need to understand the surrounding situation. What is happening? Is there any signal that prevents me from doing certain actions? When can I move across this field in front of me, and when can I not? All of these go beyond localization; they are decision-making problems. Of course, the part that is most interesting and most useful to construction folks is making. We actually need to enable these autonomous machines to build in the physical world: in addition to localizing and moving themselves like self-driving cars do, they need to move, do things, and make things happen, right? The animation here is from the well-known Wall-E movie; I'm sure many of you, or your kids, have watched it before. But I also want to remind you of the amazing capabilities found in mother nature. Here is a termite mound that people found in Africa, and I'm sure you can find similar structures in other parts of the world. If you think about animal architects, this includes termites and goes beyond them, right? They can build structures. This is an amazing capability. Even nowadays, after thousands of years of technology development in construction, we still do not have an autonomous system that can do this kind of quote-unquote simple task that even a termite or an ant can do. So that means there are theoretical challenges lying fundamentally beneath these applications.
So a couple of years ago we greatly simplified this construction problem, in a very theoretical and non-realistic manner, in order to study the theoretical challenge of doing construction with mobile agents. We simplified things into a grid-like environment, just like the game of Go; we know that a few years ago Google's AlphaGo AI was able to beat the human champion at Go. If you study the mobile construction problem in such a grid world, you will be facing something like this. Even though this abstracts away a lot of the realistic, real-world construction challenges, this simplified environment already exhibits a theoretical challenge, and that challenge is what we call, for those who are interested in theory, a POMDP: a partially observable Markov decision process.
When you are facing a POMDP, theoreticians will know this is a super hard, super challenging problem, even for pure roboticists or machine learning and AI researchers. So it is no wonder that construction robotics and construction automation remain among the biggest challenges for robotics. Okay. We are also showing some figures here where we try to use engineering tricks to circumvent some of these theoretical challenges. For example, on the top, we are showing a mobile 3D printing platform that my team developed, to enable 3D printing beyond the gantry-based systems that you typically see, either in a tabletop plastic 3D printer or in a concrete 3D printer on a construction site. Currently, those almost always depend on a so-called gantry-based system or a fixed robot arm to deliver the material. But we think that if you can put the print head onto a mobile robot, and let mobile robots collaborate on the construction site, then you can essentially print much faster, in a collaborative manner, and you can print something bigger than the printer itself. On the bottom, we are showing autonomous terrain-grading experiments happening in our lab. This is an ongoing project; we hope that researching this kind of topic will one day enable robots to go to extraterrestrial environments like the Moon or Mars to build bases for us.
So this is the making part. And the last keyword is mingling. This really means multi-agent collaboration. Again, in mother nature we see all kinds of collaboration among different agents. This is an example of monkeys: they talk to each other in their own way, and they communicate with each other to tell other monkeys where food is and where dangers may be, right? And this of course goes beyond monkeys: we see these behaviors in ants, in birds, in many animals. So this inspires us to think: how can we enable autonomous agents, like robots or construction equipment, to work together so they can see and act better? My group has actually pioneered this direction and developed a research subfield called co-perception, or collaborative or cooperative perception. Due to the lack of data on construction sites, we started by collaborating with researchers in transportation. In transportation, we already have the possibility to either simulate or collect real-world multi-agent interaction data, where you have vehicles near each other, all equipped with sensors, and they can talk and communicate with each other. Essentially, if you allow these agents to share what they see with each other, then collectively they see further, they see better, and they see through occlusions, right? This makes the whole system fundamentally safer than a single-agent perception system. And of course we want to go beyond just collaborative perception.
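The core of "sharing what they see" can be sketched in a few lines: each agent expresses its scan in a common world frame using its own pose, and the clouds are simply concatenated. The 2D scans, poses, and values below are hypothetical illustrations, not data from any real co-perception system:

```python
import numpy as np

def to_world(points: np.ndarray, yaw: float, xy: np.ndarray) -> np.ndarray:
    """Transform 2D points from an agent's sensor frame into a shared
    world frame, given the agent's heading (rad) and position."""
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s], [s, c]])
    return points @ R.T + xy

# Hypothetical scans from two vehicles (values are illustrative only).
scan_a = np.array([[1.0, 0.0], [2.0, 0.0]])   # agent A sees two points ahead
scan_b = np.array([[1.0, 0.0]])               # agent B sees one point ahead

world_a = to_world(scan_a, 0.0,       np.array([0.0, 0.0]))
world_b = to_world(scan_b, np.pi / 2, np.array([5.0, 0.0]))  # B faces +y

# Sharing observations: the fused map contains what neither agent sees
# alone -- the essence of collaborative (co-)perception.
fused = np.vstack([world_a, world_b])
print(fused.shape)   # → (3, 2)
```

Real systems must additionally handle bandwidth (sharing features instead of raw points), pose uncertainty, and time synchronization, but the frame-alignment-then-fusion step above is the common backbone.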
When we bring this onto construction sites, we want the robots to be able to collaboratively build, as in the two mobile robots shown previously performing a 3D printing task together. Eventually, we want these robots to collaborate just like we humans do now, with humans and with construction equipment; they need to collectively, collaboratively complete construction tasks together. Right? So, I know that we have about 40 minutes for the presentation. I actually have more details for any one of these keywords, and depending on the audience's interest, I can go deeper into any one of them.
But maybe, in the interest of time, because Professor Ma told me that I should finish this in about 40 minutes, so that we have
>> You don't worry about that; you can give more detail on this.
>> Okay. Yeah, not a problem. So then I will have a lot more time; I'll still have about 20 minutes, so I can go into more detail. It really depends on the interest of the audience; I can go faster or slower on any of those topics. All right. So maybe let me talk about mapping first. Many colleagues working in FEM, in engineering management, will not be unfamiliar with the term point cloud. Many times you actually need to go out to your job site, take your favorite terrestrial laser scanner, set it on a tripod, and let it scan the environment at different locations on the job site; eventually you need to perform a registration task, which is what we call mapping, or point cloud mapping. And this is the mathematical summary of the point cloud mapping problem; you don't really need to worry about it, I don't want to put you to sleep on a Saturday night. But what I want you to know is that this problem is not unique to construction. Even in autonomous driving, we need to solve the exact same problem: you have cars moving around the city, each scanning the local geometry from its own pose at different times, and eventually all these scans need to be taken together to build a map of the entire city. This is no different from the construction-site point cloud registration problem.
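At its core, pairwise registration finds the rigid rotation and translation that best aligns one scan to another. Here is a minimal sketch using the classic Kabsch (SVD) solution, under the assumption that point correspondences are already known; real pipelines such as ICP must also estimate the correspondences, and all the scan values below are made up for illustration:

```python
import numpy as np

def register(src: np.ndarray, dst: np.ndarray):
    """Least-squares rigid registration (Kabsch algorithm) between two
    point sets with known one-to-one correspondences. Returns R, t such
    that R @ src[i] + t ≈ dst[i]."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)          # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t

# Two "scans" of the same corner, the second taken from a moved tripod.
scan1 = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1.0]])
theta = np.pi / 4
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1.0]])
scan2 = scan1 @ Rz.T + np.array([2.0, -1.0, 0.5])

R, t = register(scan1, scan2)
aligned = scan1 @ R.T + t
print(np.allclose(aligned, scan2))   # → True
```

Chaining such pairwise alignments across many scan positions (and then jointly refining them) is what turns individual tripod scans into one consistent site map.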
Okay. So in the mapping part, my group has done some work that tries to make this process more efficient and less labor-intensive, even though I know this field has grown rapidly, and with the latest services or software that you may get from companies like Leica or Trimble, the process is already much smoother than before. But still, if you are talking about scanning a site, giving all the data to the software, and letting it automatically register all the point clouds without any human interference or attendance, I believe we still cannot do that 100% today; you may always run into issues here and there that require some human attendance. So we want to make this whole thing as autonomous as possible, because again, as construction engineers or project managers, we don't really care about the underlying technology. What we want is a fast scan and understanding of the current status of the job site, so we can do any quantitative work as soon as possible.
So basically we developed this work called DeepMapping. It is one of the most accurate metric mapping systems, and it can take all the point clouds you scan, no matter whether they come from an autonomous driving car or a job-site scanning robot, and register them autonomously. As you see here, we are plotting the sensor pose in each frame; this is a city-scale scan, and the system automatically registers all the poses so that eventually they align very accurately. The sensor poses here trace out the structure of a street in a city: you see that at the beginning it is very noisy, and eventually the system converges, without any human attendance, at city scale, autonomously. And here we show both the trajectory of the sensor and the scanned laser points projected onto a 2D plane; our system is very close to the high-accuracy ground truth, which itself requires some human attendance to build.
So in the interest of time I will skip the technical details; again, I don't want to put you to sleep on a Saturday evening. The important thing is that we developed an AI system, basically a deep neural network, and we made this mapping problem equivalent to training that network. By training the deep neural network, the problem gets solved; we don't even need to deploy the trained network later. The training process itself is equivalent to solving the mapping problem, and then we can leverage all the good techniques in AI to make this problem autonomous. Okay.
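To make that "mapping as training" idea concrete, here is a minimal 2D sketch (our own illustration, not the actual DeepMapping code): the sensor pose is the only learnable parameter, and gradient descent on a registration loss recovers it.

```python
import math

# Toy "mapping as training" sketch: the 2D sensor pose (theta, tx, ty) is the
# learnable parameter, and we minimize a point-registration loss by gradient
# descent (numerical gradients for brevity).

scan = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0)]   # points in the sensor frame
true_pose = (0.5, 2.0, -1.0)                   # ground-truth theta, tx, ty

def transform(pts, theta, tx, ty):
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in pts]

target = transform(scan, *true_pose)           # the "map" the scan must match

def loss(theta, tx, ty):
    moved = transform(scan, theta, tx, ty)
    return sum((mx - gx) ** 2 + (my - gy) ** 2
               for (mx, my), (gx, gy) in zip(moved, target)) / len(scan)

pose, lr, eps = [0.0, 0.0, 0.0], 0.1, 1e-5
for _ in range(500):                           # "training" is itself the solving
    grads = []
    for i in range(3):
        up, down = pose.copy(), pose.copy()
        up[i] += eps
        down[i] -= eps
        grads.append((loss(*up) - loss(*down)) / (2 * eps))
    pose = [p - lr * g for p, g in zip(pose, grads)]

print([round(p, 3) for p in pose])             # converges toward the true pose
```

The point of the sketch is the equivalence the talk describes: there is no separate "inference" step, because once the loss is minimized, the optimized parameters are the map.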
We can do mapping not only from point clouds but also from images. This is typically lower cost, because laser scanners are still pretty expensive, and they are often bulky and heavy. If you have a pocket camera like your cell phone, or a 360 camera, you can very easily walk around the job site, and then you wonder whether you can build a map from those images. This is a very well-known problem in computer vision and robotics called structure from motion; in robotics, we call it visual SLAM. There are different ways people perform traditional SfM, and more recently people use neural networks to do learning-based metric mapping or learning-based topological mapping. For those of you with the mind of a theoretician, you may be interested in the very abstract mathematical form that summarizes all the mapping-from-images problems. But for practitioners,
here is something that may be of interest to you. This is a work we published two years ago at CVPR, one of the top AI conferences, named VoxFormer, which addresses camera-based 3D semantic scene completion. The problem was originally popularized by the company Tesla around 2022. Basically, the problem is this: as a camera moves through the environment, we want to develop an end-to-end, learning-based AI system that generates a voxel-like representation of the environment, one that not only tells me the occupied space versus the free space with high accuracy in the 3D world, but also tells me what things are; the different color codes show different parts of the environment: the road surface, the vehicles, the buildings. So this is a very interesting and challenging problem, and we were able to achieve the state of the art at that time by developing a novel transformer-based neural network architecture that generates, from a single-view image, results that match the ground truth data very closely, whereas the previous baselines all make more mistakes. Okay.
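For a concrete picture of what a semantic scene completion output looks like, here is a toy sketch (our own illustration; VoxFormer's real output is a dense voxel grid predicted by a transformer): each voxel is either free space or carries a semantic class label.

```python
# Toy semantic occupancy grid: occupancy ("is this voxel filled?") and
# semantics ("what is it?") are answered by a single lookup. Illustrative
# only; the real model predicts this densely for every voxel.

FREE, ROAD, CAR, BUILDING = 0, 1, 2, 3
CLASS_NAMES = {ROAD: "road", CAR: "car", BUILDING: "building"}

grid = {}                                  # (x, y, z) -> class id; absent = free

for x in range(4):                         # a strip of road surface
    grid[(x, 0, 0)] = ROAD
grid[(1, 0, 1)] = CAR                      # a car sitting on the road
grid[(3, 1, 2)] = BUILDING                 # part of a building facade

def query(voxel):
    # free vs occupied, plus the semantic label, in one lookup
    return CLASS_NAMES.get(grid.get(voxel, FREE), "free")

print(query((1, 0, 1)), query((0, 3, 0)))  # → car free
```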
So we can also go beyond metric mapping. As I mentioned, the multi-view scene graph work that our group published last year is a so-called topological scene representation. The reason we want to develop methods in addition to metric mapping is this: for those of you who have used terrestrial laser scanners, you would know that if the scene changes during your scan, it can create a lot of problems, not only for the registration of multiple point cloud scans but also for recognition, where later on you need to figure out where a thing is, which thing happened first and which last, and how to register a new scan into the existing scan when the scene has already changed significantly. If we move beyond the metric mapping requirement, which requires us to build a global map that is metrically accurate everywhere, we can relax that condition by saying we just need to build a graph that preserves the topological connectivity information among those images, or the objects in those images. Then we can more quickly and easily maintain such a graph even when the environment changes, because for any change we just need to remove the corresponding image and object nodes and replace them with the new image and the new objects inside it. Right? So this is the fundamental idea of the topological representation, beyond metric mapping, and this kind of representation will be super useful for tasks like asset management. For those of you who have managed mega projects, you would know how challenging it can be if the job site's assets are not properly managed. You would constantly face the problem of needing to find a particular useful and unique asset: where it is, and who used it last. We believe this kind of topological mapping is a very flexible and robust way of addressing that problem.
Here are just some of the applications that can be enabled by this kind of topological map. For example, the embodied question answering application of topological maps is a great example of resolving the asset management challenge. When you build such a topological map of the environment from the videos, the surveillance video you have of the site, you can then, just like how you interact with GPT, ask questions of such a model: where is a given object? What is the last time and place you saw that object?
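As a rough sketch of how a topological map can back such an asset-management query, here is a minimal illustration (our own toy code, not the published multi-view scene graph system): image nodes hold timestamped object observations, a scene change just swaps a node while the connectivity is kept, and "when did you last see X" becomes a lookup.

```python
from collections import defaultdict

# Minimal topological scene graph: nodes are images with timestamps and
# detected objects; edges record topological connectivity, not metric poses.

class TopoMap:
    def __init__(self):
        self.nodes = {}                  # node_id -> {"t": time, "objects": set}
        self.edges = defaultdict(set)    # node_id -> connected node ids

    def add_image(self, node_id, t, objects, neighbors=()):
        self.nodes[node_id] = {"t": t, "objects": set(objects)}
        for n in neighbors:
            self.edges[node_id].add(n)
            self.edges[n].add(node_id)

    def replace_image(self, node_id, t, objects):
        # scene changed: swap the image/object node, keep the connectivity
        self.nodes[node_id] = {"t": t, "objects": set(objects)}

    def last_seen(self, obj):
        # embodied-QA style query: "where and when did you last see this?"
        hits = [(v["t"], k) for k, v in self.nodes.items() if obj in v["objects"]]
        return max(hits) if hits else None

m = TopoMap()
m.add_image("hall", t=1, objects={"ladder", "generator"})
m.add_image("bay", t=2, objects={"ladder"}, neighbors=["hall"])
m.replace_image("hall", t=3, objects={"generator"})   # the ladder moved away
print(m.last_seen("ladder"))   # → (2, 'bay')
```

Because only the stale node is replaced, the map stays cheap to maintain under change, which is exactly the advantage over a globally consistent metric map.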
I will skip the technical details of how we implement such an AI system, but I want to move on to the next keyword, moving, with a bit more detail. So, say we want to teach a robot how to move, or basically how to navigate an environment. One example that can be useful for construction, especially in the operational phase rather than the construction phase, is the exploration problem. The problem basically says: in a new environment that the robot has never been to before, how do you autonomously navigate the robot so that it can explore every single aspect of this environment, again without human attendance?
For example, when a building manager first takes charge of a building, if they can just send a robot dog that autonomously goes everywhere, they can leverage the information scanned or collected by these robots to build the maps we talked about before. But before that, you have to resolve the problem of how to tell the robot to go to all the places in this building, because remember, this is a new place that the robot has never been to before, and you don't have a map. Maybe you have a BIM, but the BIM never has the actual conditions; it is always as-designed. The as-built BIM, as far as I know, is still far from being implemented in the real world, and even where it is implemented, it doesn't have the real-world conditions: you don't have the real-world texture, the real-world details. So it would still be good for a robot to be able to autonomously navigate and explore such an environment without missing any key places.
On this slide we're showing something called DeepExplorer, which we published, again two years ago, at a top robotics conference called RSS, Robotics: Science and Systems. We enable a robot equipped with only one 360 camera; the images you see are the equirectangular images captured by that camera, so you get the surround view very easily with this one sensor, without relying on more complicated sensors like LiDAR. We developed a system that, just like how a human does it, can autonomously navigate an environment by memorizing which places it has already visited. Therefore, it doesn't require too much revisiting, and it can spend more time on the unexplored, unfamiliar regions. And we were able to train this whole system in a simulated environment and then deploy it in a completely different real-world environment. On this slide, in this animation, you're seeing an actual real-world deployment in our university building.
I will again skip the technical details of how we achieve this. Again, the key is to develop a neural network architecture that addresses both the high-level task planning and the low-level motion planning through imitation learning, so that it imitates some kind of oracle expert to perform this task and is eventually able to autonomously explore the environment.
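To illustrate the exploration behavior in its simplest classical form, here is a frontier-style sketch on a toy grid (our own simplification; DeepExplorer itself learns this behavior with a neural network from equirectangular images): the robot remembers visited cells and always heads for the nearest unvisited one.

```python
from collections import deque

# Frontier-style exploration on a toy 4x4 floor plan: keep a memory of
# visited cells, repeatedly plan (BFS) to the nearest unvisited cell,
# and stop when nothing is left to explore.

free = {(x, y) for x in range(4) for y in range(4)}   # traversable cells
visited, pos = {(0, 0)}, (0, 0)

def neighbors(cell):
    x, y = cell
    return [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if (x + dx, y + dy) in free]

def path_to_nearest_unvisited(start):
    # breadth-first search: shortest path to any cell not yet visited
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] not in visited:
            return path
        for n in neighbors(path[-1]):
            if n not in seen:
                seen.add(n)
                queue.append(path + [n])
    return None                        # everything reachable is explored

while True:
    path = path_to_nearest_unvisited(pos)
    if path is None:
        break
    for cell in path[1:]:              # walk the plan, marking cells visited
        pos = cell
        visited.add(cell)

print(len(visited))
```

The memory of visited places is what keeps revisiting low, which is the same property the talk attributes to the learned system.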
I want to highlight a recent work called CityWalker from my group, published this year at CVPR. It is about using web-scale videos for urban navigation. We believe that if we can solve the urban navigation problem, such as navigating a robot dog, say, from NYU to Columbia without any human attendance, relying only on the information available to an ordinary person, like a cell phone and Google Maps, then it will be easier for us to bring a robot to a construction job site, because many of the problems you encounter in urban navigation are similar to those on a construction site: you face a dynamic environment, you face dense traffic, you face different construction conditions, right? But the key to this kind of embodied AI is always the data. Where do you get the data? We never had a large enough set of robot-dog navigation recordings available to researchers, or to anyone in a company. What we were able to do is leverage the thousands of hours of videos people have posted on YouTube, either of a human walking through different kinds of cities while holding a camera, or of driving in different kinds of cities. So we curated over 2,000 hours of YouTube videos that show very different, very dynamic situations across different seasons and weather conditions. We were then able to develop a scalable AI system that can consume this data and implicitly learn all the rules that humans follow while walking or driving in those environments. Very interestingly, we were able to discover the scaling law: without changing the neural network architecture, just by increasing the amount of training data from a couple hundred hours to a couple thousand hours, we saw the error decrease, which shows this scaling law. We also showed that including different modalities improves the performance: even if your task is to walk, if you have watched enough driving videos, it will help you walk as well. So these are all very interesting discoveries.
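A scaling law of this kind is usually checked by fitting a power law, error ≈ a · hours^(−b), as a straight line in log-log space. The sketch below shows such a fit with made-up numbers (illustrative only, not the CityWalker results):

```python
import math

# Fit error ~ a * hours^(-b) by ordinary least squares in log-log space.
# The data points below are hypothetical, chosen only to show the method.

hours = [200, 500, 1000, 2000]
errors = [0.40, 0.30, 0.24, 0.19]      # hypothetical validation error

xs = [math.log(h) for h in hours]
ys = [math.log(e) for e in errors]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
a = math.exp(my - slope * mx)
b = -slope                             # positive b means error falls with data

print(round(b, 2))
```

A roughly constant b across the fitted range is what justifies calling the trend a scaling law.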
Let me skip some of these later slides. They show our work on using AI to teach robots how to do things without requiring the human to ever know robot programming. We essentially use a vision-language model to translate a video into robot code, which makes robot programming easier. We think this is important for industries like construction and manufacturing, so that we don't need robotics experts to program things for us. This is one of our early demonstrations for the making keyword.
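The video-to-robot-code idea can be pictured as a two-stage pipeline: assuming a vision-language model has already parsed a demonstration video into (action, object) steps, the remaining translation into robot code is simple templating. All names below are our own placeholders, not the actual system's API.

```python
# Hypothetical second stage of a video-to-robot-code pipeline: turn
# VLM-extracted (action, object) steps into robot API calls by templating.
# The step list and the robot.* call names are illustrative placeholders.

steps = [("pick", "brick"), ("move", "wall_A"), ("place", "brick")]

TEMPLATES = {
    "pick": "robot.grasp('{obj}')",
    "move": "robot.goto('{obj}')",
    "place": "robot.release('{obj}')",
}

program = [TEMPLATES[act].format(obj=obj) for act, obj in steps]
print(program[0])   # → robot.grasp('brick')
```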
We are also aware of human-robot collaboration, which is increasingly discussed in construction, where human workers are working with exoskeletons: you wear this kind of exoskeleton robot and it provides assistance to you. But the key challenge here is how to make these robots smart enough to understand the human's intention. So here we demonstrate a capability: using our own dataset, and an AI model trained on that dataset, we enable a robot arm to collaborate with a human by predicting where the human's hand is reaching, one second in advance, in these pick-and-place operations. That one-second advance notice is very important, because robots need time to execute their actions, and we need to do this in a proactive manner instead of a passive, observational manner. We are able to do this with reasonable accuracy, which can enable better human-robot collaboration.
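As a stand-in for the learned reach-prediction model (the talk does not detail its architecture), a constant-velocity extrapolation of the wrist track illustrates why a one-second look-ahead is actionable for the robot:

```python
# Toy reach-target prediction: extrapolate the last observed hand velocity
# one second ahead and pick the nearest candidate target. The real system
# uses a learned model; the track and bin positions here are made up.

track = [(0.00, 0.10, 0.20), (0.10, 0.14, 0.24), (0.20, 0.18, 0.28)]  # (t, x, y)

(t0, x0, y0), (t1, x1, y1) = track[-2], track[-1]
vx, vy = (x1 - x0) / (t1 - t0), (y1 - y0) / (t1 - t0)
horizon = 1.0                                  # predict one second ahead
pred = (x1 + vx * horizon, y1 + vy * horizon)

targets = {"bin_A": (0.6, 0.7), "bin_B": (0.2, 0.1)}   # candidate pick bins
nearest = min(targets,
              key=lambda k: (targets[k][0] - pred[0]) ** 2
                            + (targets[k][1] - pred[1]) ** 2)
print(nearest)   # → bin_A
```

With the target known a second early, the robot can start moving toward it instead of reacting after the hand arrives.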
And this is the collaborative 3D printing that I have already explained before, so I'll skip it, along with the autonomous terrain grading and the collaborative perception idea that I've explained; I'll skip those technical details. So let me bring you to the end of my talk. Basically, there are so many open problems in embodied AI, or embodied spatial intelligence. I list some of them here, such as lifelong learning, learning under uncertainty, energy efficiency, safety, and so forth. In the end, why do these matter? Well, we believe that embodied spatial intelligence powers the future. It is very important if construction automation and robotics are to really work. Instead of building ad hoc solutions, the theoretical or fundamental science needs to be done in order to enable that future.
And here I show one example, a piece of translational research, basically a startup spin-off from my lab, that exemplifies this vision. This is a student startup company called Building Diagnostic Robotics, where we build a robotic system that carries all kinds of sensors, like LiDAR, camera, and radar, to scan building rooftops like a Roomba, as you see in this video. It is then able to produce a line-by-line scan with AI-based classification of moisture damage on the rooftop. Of course, you can also repurpose this for rebar detection, which is another service that this company is offering right now.
And the cool thing is that this whole process is very efficient. Within 24 or 48 hours of a scan, we can produce the entire report for engineers, and the scanning itself is fast: typically within one or two hours, depending on the size of a commercial building's rooftop, we can finish the scan fully autonomously, and then you get this kind of rooftop moisture map. So with that, I will stop here. I would like to thank all my students, collaborators, and funding agencies, without whom none of this could be achieved, and I really want to highlight some of our great students who have received all kinds of prestigious scholarships and fellowships and have the courage to take this research into real-world startups, as well as our industry collaborators.
With that, I'll stop here, and I'm happy to answer any questions the audience may have. Thank you.