
Building AI Agents that actually automate Knowledge Work - Jerry Liu, LlamaIndex

By AI Engineer

Summary

Topics Covered

  • RAG Fails Knowledge Work Automation
  • Build Document Toolbox Beyond RAG
  • Excel Agents Surpass Human Accuracy
  • Automation Agents Backend Assistants
  • Agents Automate Financial Due Diligence

Full Transcript

[Music] Okay, hey everyone. I'm Jerry, co-founder and CEO of LlamaIndex. It's great to be here, and today the talk title is building AI agents that actually automate knowledge work.

So basically a big promise of AI agents these days is making knowledge workers more efficient. I'm sure you've heard the high-level business speak of this. I copy and pasted a bunch of B2B SaaS vendors on the right in terms of screenshots: increase operational efficiency, better decision-making through more data. But what does this actually mean? Does knowledge work automation actually just mean building RAG chatbots? And if not, what is the stack, and what are the use cases where AI agents can actually automate knowledge work?

So for us, a lot of our use cases and core focus areas are about automating knowledge work over unstructured data. 90% of enterprise data lives in the form of documents, whether that's PDFs, PowerPoints, Word files, and, as you'll soon see, Excel. Humans have historically needed to read and write these types of docs: an investment banker, or someone on the customer support side, reviewing a lot of unstructured data and using that documentation to make decisions and take actions. For the first time, AI agents can actually reason and act over massive amounts of unstructured context tokens: do analysis, do research, synthesize insights, and actually take actions end to end.

And so when we think about the use cases and the types of agents for automating knowledge work, they really fall into two main categories. There are what we call assistive agents: those that are more like a standard chat interface and help humans get more information faster. And then there are automation-type agents: agents that automate routine tasks, can run in the background, maybe require a little less human in the loop, and can take actions that automate the routine operational stuff.

When we think about the stack that's required to build either the assistive or automation-type agents, there are two main components: really nice tools, and a really nice agent architecture. With MCP and A2A these days, a lot of people are thinking about how to build really nice tools that let agents interface with the external world, to surface relevant context and let the agent take external actions. And on the agent architecture side, there are very general reasoning loops as well as more constrained loops; it's basically how do I encode the business logic through an agentic workflow to help achieve the task.

So for the purposes of this talk, we'll cover three main things. There's a lot to cover, so I'll probably pick up my clock speed a little bit. One is building a document toolbox: how do I build really nice tools that let AI agents interact with massive amounts of unstructured documents? Two is agent design patterns: thinking at a high level about the two categories of agents, from assistants to automation. And three is bringing it together in terms of document agent use cases. So the first step is building a document toolbox.

If you think about agents interacting with tools, as LLMs get better you're going to have these very general front-end interfaces like Claude or ChatGPT, and agents need access to the right tools to interface with the external world. For massive amounts of unstructured enterprise data, they need the right toolbox to interact with that data. It's a generalization beyond naive RAG. RAG is just retrieval (I know this is a RAG workshop), but naive RAG is just retrieval and then one-shot synthesis. A lot of what agents can do over your documents includes retrieval, but also other operations like file-based search, manipulation, and more. And one of the points I'm trying to make is that to create these tool interfaces in the first place, you need a really nice pre-processing layer. You need actual data connectors to your data sources that sync data from your data source into a format your agents can access. It could be SharePoint, Google Drive, S3, or Confluence, and it needs to sync permissions and the right metadata too.

You need the right document parsing and extraction piece. More on this in just a bit, but you need actually good understanding of your documents: tables, charts, and more. And of course, if you have a large collection of docs, you need to index it in some way. It could be vector indexing into vector search, it could be indexing into a SQL table, it could be graph DBs, it could be anything. So to ensure the data is high quality, you need this layer to actually process and structure your documents and expose the right tool interfaces.

In terms of the right tool interfaces, this is where I want to define a term: the document MCP server. Again, it's a generalization of the idea of RAG. If RAG is just one-shot vector retrieval, you instead need a set of tools to equip an AI agent with, so it can understand and manipulate different types of documents. It could be semantic search to fuzzy-find the relevant source of data. It could be file lookup to look up the right file metadata. It could be manipulation, to actually do operations on top of the files. And it could be structured querying: querying a more structured database to get aggregate insights over the types of data that you've extracted out.
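To make the shape of such a document MCP server concrete, here is a minimal sketch of those four tool types over an in-memory store. All names here (`DocumentToolbox`, `semantic_search`, and so on) are illustrative assumptions rather than a real MCP or LlamaIndex API, and the "semantic" search is stubbed with keyword overlap:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)

class DocumentToolbox:
    """Sketch of a document MCP server's tool surface (hypothetical)."""

    def __init__(self, docs):
        self.docs = {d.doc_id: d for d in docs}

    def semantic_search(self, query: str, top_k: int = 2):
        """Fuzzy-find relevant docs; here approximated by keyword overlap."""
        def score(d):
            return sum(t in d.text.lower() for t in query.lower().split())
        ranked = sorted(self.docs.values(), key=score, reverse=True)
        return [d.doc_id for d in ranked[:top_k] if score(d) > 0]

    def file_lookup(self, doc_id: str) -> dict:
        """Look up metadata for a specific file."""
        return self.docs[doc_id].metadata

    def structured_query(self, key: str):
        """Aggregate a metadata field across the corpus (stand-in for SQL)."""
        return [d.metadata.get(key) for d in self.docs.values()]
```

In a real system each method would be registered as a tool and backed by a vector store, a file system, and a SQL engine respectively, rather than by Python lists.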

One top consideration when actually building this type of toolbox is complex documents. For those of you who follow our socials, we talk a lot about this issue: a lot of human knowledge lives in the form of really complicated PDFs and other formats too. Embedded tables, charts, images, irregular layouts, headers, footers. This is typically stuff designed for human consumption, not machine consumption. And so if the documents are not processed correctly, no matter how good your LLM is, it will fail.

We were probably among the first to realize that LLMs and LVMs could be used for document understanding. In contrast to more traditional techniques, where you use hand-tuned, task-specific ML models to parse a specific class of documents, LLMs have a much more general level of accuracy that you can use to your advantage in understanding and inhaling any type of document, with any type of complexity. Obviously the baseline these days is that you can just screenshot a PDF and feed it into ChatGPT or Claude; it doesn't actually give you amazing accuracy, but it's a good start. And so one of the secret-sauce magic tricks we found was figuring out how to interleave LLMs and LVMs with more traditional parsing techniques, and adding test-time tokens in terms of agentic validation and reasoning, to really get a higher level of accuracy.
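A rough sketch of that interleaving idea, with hypothetical function names rather than the actual service: run a cheap traditional parse first, spend extra test-time compute on a validation pass, and only fall back to the heavier vision-model path when validation fails:

```python
def parse_with_validation(page, fast_parse, vlm_parse, llm_validate):
    """Interleave a traditional parser with model-based validation.

    All three callables are stand-ins: fast_parse is a cheap rule-based
    parser, vlm_parse a slower vision-model path, llm_validate an
    agentic check on the parse output.
    """
    result = fast_parse(page)        # traditional, cheap parser first
    if llm_validate(page, result):   # test-time validation of the output
        return result
    return vlm_parse(page)           # escalate to the LVM path on failure
```

The design point is that the expensive model is only invoked on the pages where the cheap path demonstrably broke, which is how the extra test-time tokens buy accuracy without parsing every page with an LVM.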

And so we have a cloud service that does document parsing, and it's a core step of this document toolbox. We benchmarked our modes, where we adapt Sonnet 3.5 and 4.0, Gemini 2.5 Pro, and GPT-4.1 from OpenAI, and it outperforms all existing parsing benchmarks and tools out there, from open source to proprietary.

Some of you might know us as a RAG framework; that's basically how we started. For those of you who don't know, we also have a managed platform that is basically this AI-native document toolbox. It contains a lot of the operations you need on top of your docs, such as document parsing and document extraction; it uses some of those capabilities I just mentioned and lets you parse, extract, and index data for the set of tools I just mentioned.

One of the special releases I want to highlight today, which we just announced in a blog post a few hours ago, is Excel capabilities to complement this document toolbox. A lot of knowledge work happens in Microsoft Excel, and also Google Sheets and Numbers; basically, spreadsheets. But it's been unsolved by LLMs. If you look at the document to the right, neither RAG nor text-to-CSV techniques will actually work over this, because it's not really a structured 2D table: there are gaps in the rows and gaps in the columns.

So we built an Excel agent that's capable of taking unnormalized Excel spreadsheets and transforming them into a normalized 2D format, and that also lets you do agentic QA over both the unnormalized and normalized versions of the spreadsheet. It's a pretty cool capability; I'll describe how it works in just a bit, but it's going to complement our toolbox in terms of more traditional document parsing, extraction, and indexing, and it's available in early preview. If you take a look at the video, which is also in our blog post, we uploaded that example synthetic data set, transformed it into a 2D table, and you can also ask questions over it to get insights. It's really doing the heavy lifting of deeply understanding the semantic structure of the Excel spreadsheet, and then plugging that in as specialized tools to an AI agent.

The best baseline is not really RAG or text-to-CSV; those both suck. It's really just an LLM being able to write code, so an LLM with a code interpreter tool is a reasonable baseline. That gets you to 70-75% accuracy. Over a private data set of synthetic Excel sheets, we were able to get this up to 95%, which actually surpasses the human baseline of 90% for a human going in and doing the data transformation by hand.

A brief note on how it works; it's a little bit technical, but more details are in the blog post. First we do some structural understanding of the Excel spreadsheet. We do a little bit of RL, reinforcement learning: we adapt dynamically to the specific format of the document and learn a semantic map of the sheet. By learning a semantic map, we can then translate it into a set of specialized tools that you provide to an agent. From an abstract perspective, an agent could just write code from scratch, and as LLMs get better that will certainly become a higher-performing baseline. But in the meantime, we're helping it out by providing a set of specialized tools over the semantic map, so the agent can reason over an Excel spreadsheet.
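As a toy illustration of the "semantic map, then specialized tools" idea: the real system learns the map with RL and LLM reasoning, but here we simply assume the first non-empty row is the header, and a normalization tool uses the map to turn a gap-filled grid into clean records:

```python
def build_semantic_map(grid):
    """Toy semantic map: assume the first non-empty row is the header."""
    for r, row in enumerate(grid):
        if any(cell is not None for cell in row):
            return {"header_row": r, "data_start": r + 1}
    raise ValueError("empty sheet")

def normalize(grid, sem_map):
    """Specialized tool: drop gap rows/columns, emit list-of-dict records."""
    header = grid[sem_map["header_row"]]
    cols = [i for i, h in enumerate(header) if h is not None]
    records = []
    for row in grid[sem_map["data_start"]:]:
        if all(row[i] is None for i in cols):  # skip fully-empty gap rows
            continue
        records.append({header[i]: row[i] for i in cols})
    return records
```

The agent then calls tools like `normalize` instead of writing grid-manipulation code from scratch, which is where the accuracy gain over the code-interpreter baseline comes from.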

Great. The next piece: we talked about a document toolbox and a lot of operations that make that toolbox really good and comprehensive. So now that you've plugged it into an agent, what are the different agent architectures, and what use cases are implied by them? As many of you probably know from building agents yourselves, agent orchestration ranges from more constrained architectures to unconstrained architectures. Constrained means you more explicitly define the control flow. Unconstrained is like a ReAct loop, function calling, CodeAct, whatever: you basically give it a set of tools and let it run. Deep research is kind of the same thing.
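The unconstrained side can be sketched as a small ReAct-style loop, where a model (stubbed here as any callable; a real implementation would call an LLM) alternates between picking a tool and emitting a final answer:

```python
def react_loop(model, tools, task, max_steps=5):
    """Minimal ReAct-style loop: model picks tools until it answers.

    `model` is a stand-in for an LLM: it reads the history and returns
    an (action, argument) pair, where action is a tool name or "answer".
    """
    history = [("task", task)]
    for _ in range(max_steps):
        action, arg = model(history)       # model decides the next step
        if action == "answer":
            return arg                     # final answer terminates the loop
        observation = tools[action](arg)   # run the chosen tool
        history.append((action, observation))
    raise RuntimeError("step budget exhausted")
```

The constrained alternative would replace the open-ended `for` loop with an explicitly ordered sequence of steps, which is the trade-off the talk describes.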

For us, we've noticed there are two main categories of UXs. There are more assistant-based UXs that help a human surface information or produce some unit of knowledge work, usually through a chat-based interface. It's usually chat oriented, the input is natural language, and the architecture is a little more unconstrained: basically a ReAct loop over some set of tools. It's inherently unconstrained but also has a higher degree of human in the loop. The expectation is that the human guides and coaxes the agent along the steps of the process to achieve the task at hand.

I'm sure many of you have built these types of use cases, so this is just a very small subset, but it's basically a generalization of a RAG chatbot.

There's a second category of use cases that I think is interesting, and that a lot of folks are starting to build toward, which is the automation interface. Instead of providing an assistant or co-pilot to help a human get more information, it processes routine tasks in a multi-step, end-to-end manner, and usually the architecture is a little different. It takes in some batch of inputs; it can run in the background or be triggered ad hoc by a human. The architecture is a little more constrained, which makes sense: if you want this thing to run end to end, you need it to not go off the rails. There's usually less human in the loop at every step of the process, and usually some sort of batch review at the end. The output is structured results, integration with APIs, decision-making; after approval, it routes to the downstream systems. Some of the use cases here include financial data normalization, data sheet extraction, invoice reconciliation, contract review, and more.
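The constrained, automation-style shape described here can be sketched as a fixed pipeline over a batch of inputs, with records that fail business rules queued for batch human review at the end. All names are illustrative, not a specific product API:

```python
def run_batch(docs, extract, rules):
    """Constrained automation pipeline: extract, validate, split results.

    `extract` turns a raw doc into a structured record; `rules` is a list
    of business-logic predicates. Failing records go to human review.
    """
    approved, review_queue = [], []
    for doc in docs:
        record = extract(doc)                    # structured extraction step
        if all(rule(record) for rule in rules):  # explicit business checks
            approved.append(record)
        else:
            review_queue.append(record)          # batch human-in-the-loop
    return approved, review_queue
```

Note there is no open-ended reasoning loop here; the control flow is fixed, which is exactly what keeps a background agent from going off the rails.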

I'll skip this video, but there are some fun examples of community-based open-source repos we built in this area, like the invoice reconciler by Laurie Voss.

A general idea that has emerged, and that we've noticed as a pattern, is that oftentimes the automation agents can serve as a backend: they run in the background, do the data ETL and transformation (still with a human in the loop), and handle the work of processing and structuring a lot of data and making decisions in the background. Assistant agents are more front-end facing. So automation agents can structure and process your data and provide the right tool interfaces for assistant agents. Not every tool depends on agentic reasoning, but for a lot of these use cases, like a very generalized data pipeline where you're processing a lot of unstructured context, you might have automation agents go in and process your data and provide the right tools for some more user-facing research interface.
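One way to picture that backend/front-end split, as a hypothetical sketch: the automation agent structures the raw data in a batch pass, and the structured output is then exposed as a query tool for the assistant agent at chat time:

```python
def build_assistant_tools(raw_docs, structure):
    """Automation pass structures the data; assistants get a query tool.

    `structure` is a stand-in for the automation agent's ETL step.
    """
    table = [structure(d) for d in raw_docs]  # backend batch pass

    def query(field, value):                  # tool the assistant calls live
        return [r for r in table if r.get(field) == value]

    return {"query": query}
```

The assistant never touches the raw documents; it reasons over the structured table the automation pass produced, which is the division of labor the talk describes.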

So we talked about building a document toolbox and about the different categories of agentic architectures; now let's put it together. Here are some real-world use cases of document agents: examples of agents that actually help automate different types of knowledge work. One of our favorite examples is a combination of both automation and assistant UXs for financial due diligence. Carlyle is one of our favorite customers and partners. They used some of our core capabilities to build an end-to-end leveraged buyout agent. It requires an automation interface to inhale massive amounts of unstructured public and private financial data (Excel sheets, PDFs, PowerPoints), run it through some bespoke extraction algorithms with human-in-the-loop review, and then, once that data is structured in the right format, provide a co-pilot interface for the analyst teams to both get insights and generate reports over that data.

Any enterprise search use case typically falls within the assistant UX. SEMX is one of our favorite customers in this space: defining a lot of different collections over different sources of data and providing more task-specific, specialized agentic RAG chatbots over your data. It's basically RAG, but you add an agentic reasoning layer on top so you can break down user queries, do research, and answer the question at hand.

And on the pure automation UX side, we've noticed a lot of use cases popping up around automation and efficiency. One example is technical data sheet ingestion. We're working with a global electronics company that has a lot of data sheets that need to be automatically processed and reviewed, and historically that has taken a lot of human effort. By creating the right end-to-end automation agent, you can encode the business-specific logic for parsing these types of documents, extracting out the right pieces of information, matching them against specific rules, and outputting the structured data into SQL. There's human-in-the-loop review, but if we're able to do this end to end, it transforms weeks of technical-writer work into an automated extraction interface.

So that's basically it. For those of you who are less familiar, LlamaIndex is the most accurate, customizable platform for automating your document workflows with agentic AI. Our mission statement has evolved a little over the past few years from being a very broad horizontal framework, oftentimes focused on RAG. If you're interested in some of these capabilities, come talk to us, and please come check us out at Booth G11. Thank you.

[Music]
