Traditional RAG
The traditional way to do RAG is to find information relevant to a query - and then incorporate it into the LLM prompt together with the question we want it to answer. Something like:
Given that:
Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming.
Please answer the following question:
Can I write object-oriented code in Python?
The main problem with that approach is that there is no error-correction mechanism: if the retrieved information is wrong or insufficient, there is no second chance to fix it. Adaptive RAG, Corrective RAG and similar techniques improve on this by letting the LLM decide if the retrieved information is enough, and repeating the search (with potentially changed parameters - like an expanded context size or different sources) if needed.
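In pseudocode terms, that corrective loop looks roughly like the sketch below - `retrieve`, `is_sufficient` and `answer` are hypothetical callables standing in for your retrieval backend and LLM calls, not functions from any of those papers:

```python
from typing import Callable

def corrective_rag(
    question: str,
    retrieve: Callable[[str, int], str],        # (question, top_k) -> context text
    is_sufficient: Callable[[str, str], bool],  # LLM judge: is the context enough?
    answer: Callable[[str, str], str],          # LLM: answer the question from context
    max_attempts: int = 3,
) -> str:
    top_k = 3
    context = ""
    for _ in range(max_attempts):
        context = retrieve(question, top_k)
        if is_sufficient(question, context):
            break
        top_k *= 2  # widen the search and try again
    return answer(question, context)  # best effort on the last retrieved context
```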
The ReAct Approach: Integrating LLMs with Search and other tools
The earlier “ReAct” paper proposed a different, and more promising, procedure: it asked the LLM to use a search engine and create the search queries itself, then evaluate the results in an iterative loop of looking for the right information. I believe there are numerous advantages to that approach:
Of the two technologies in the RAG pair, the LLM is the more general one - relying on it more will make the whole system more general.
It is an elegant architecture. You can plug in all kinds of tools in the same way - code execution, web browsing, various API calls to get weather or other information - making the whole even more general.
By relying more on LLMs we can take advantage of their explosive evolution.
We can reuse the same information tools, procedures and heuristics that people already use to find the information they need - no need to write new vector databases or migrate data. Writing connectors from LLMs to these human-centred tools will be a lot less complicated than writing whole new tools.
Semantic search can be a useful tool - but deciding if a piece of information is relevant to a given question involves enough reasoning that the search tool would have to be a language model in itself.
To make an LLM solve a given task, we can analyze how humans do it - what information resources they access, how they search, what calculations they do - write these procedures down, give them to the LLM and let it use the same tools that humans do.
The core of this approach is the simple loop:
Think - reflect on the question at hand and the available information. Answer if possible.
Decide what new information is needed.
Gather that new information - by searching, browsing and accessing other external tools.
Go back to thinking.
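A minimal sketch of that loop in Python, under loud assumptions: `call_llm` is a hypothetical function that returns either a final answer or a tool request, and `tools` is a dict of callables - none of these names come from the ReAct paper.

```python
def react_loop(question: str, call_llm, tools: dict, max_steps: int = 10) -> str:
    # `call_llm` is assumed to return ("answer", text) when it can answer,
    # or ("tool", name, args) when it needs more information.
    transcript = [("question", question)]
    for _ in range(max_steps):
        step = call_llm(transcript)                 # Think: reflect on what we know
        if step[0] == "answer":
            return step[1]                          # enough information - answer
        _, name, args = step                        # Decide: which tool, what arguments
        transcript.append(("action", name, args))
        observation = tools[name](**args)           # Gather: search, browse, call an API
        transcript.append(("observation", observation))  # ...and go back to thinking
    return "No answer found within the step budget."
```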
We can add all kinds of additional sub-procedures - like disambiguating the question, splitting it into simpler parts, writing plans, etc. - without breaking this overall structure. All of the additional information can be recorded in the Gather phase and accessed in the Think and Decide phases. We can also modify the loop - for example by adding hardcoded Gather steps, or splitting the work across many agents, etc.
No free lunch
There are also problems with this approach:
The human-centric tools provide different context than what is optimal for language models. You encounter this problem immediately when you try, for example, to give the LLM a full web page as context: a single page used to exceed the context limits, and although those limits are much bigger now, there are reports that LLMs still don't work very well if you cram too much information into the context. But when looking for some specific information, humans don't read whole web pages in one go either - they skim paragraphs, read the table of contents and jump to interesting sections, use indexes, etc. All of these techniques can be used by machines too. In answerbot (my experiment with a reimplementation of ReAct) we have operations for searching Wikipedia, reading a chunk from the current position, following a link, and searching for keywords - and LLMs use these capabilities skillfully. Soon we shall be adding more - like something approximating skimming paragraphs.
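To illustrate, such navigation primitives can be exposed as small tools over a cursor into the document. The sketch below is my own illustration, not answerbot's actual API:

```python
from dataclasses import dataclass

CHUNK_SIZE = 2000  # characters shown to the LLM at a time (an arbitrary choice)

@dataclass
class DocumentCursor:
    """A human-style window into a long document: read it piece by piece,
    or jump to a keyword, instead of cramming the whole page into context."""
    text: str
    pos: int = 0

    def read_chunk(self) -> str:
        # Return the next chunk from the current position and advance.
        chunk = self.text[self.pos:self.pos + CHUNK_SIZE]
        self.pos += CHUNK_SIZE
        return chunk

    def lookup(self, keyword: str) -> str:
        # Jump to the first occurrence of a keyword and read from there.
        found = self.text.find(keyword, self.pos)
        if found == -1:
            return f"'{keyword}' not found"
        self.pos = found
        return self.read_chunk()
```

Exposed through function calling, these methods let the model see one window of a page at a time, much like a person skimming.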
Using an entirely external mechanism to ground the LLM reduces hallucinations; when the mechanism relies more and more on the LLM itself, this effect diminishes.
LLMs tend to be unreliable when doing long processing sequences.
Function calling gives us an occasion to revisit ReAct
All of that is true - but there are ways to mitigate these problems. I suspect that the actual reason ReAct was mostly abandoned is that the paper was published before the wide adoption of function-calling mechanisms: to get structured output from the LLM, the authors had to design their own mini-language, teach the model to follow it, and write a parser for it. The language was very simplistic, so they did not have to devise character escaping or other techniques for inserting external data into the conversation correctly. Parsing is hard, especially if you are trying to build something practical and not just a toy example. But now that function calling is widely supported, it is time to revisit ReAct.
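For a concrete picture of what the function-calling mechanism buys us, here is a minimal sketch using the OpenAI Python client - the model name and the tool schema are illustrative choices, not anything prescribed by ReAct:

```python
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "search_wikipedia",
        "description": "Search Wikipedia and return the top results.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Can I write object-oriented code in Python?"}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    # The arguments arrive as a JSON string - no hand-written parser needed.
    args = json.loads(call.function.arguments)
    print(call.function.name, args)
else:
    print(message.content)  # the model answered directly
```

The structured `tool_calls` field replaces the fragile custom syntax the ReAct authors had to parse themselves.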