The documentation says that the `parameters` element in the tool definition structure should be a JSON Schema object. The example given there is a minimal schema:
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA",
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
},
},
"required": ["location"],
}
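For context, here is a sketch of how this `parameters` object sits inside a complete tool definition passed to the chat completions endpoint (the `get_current_weather` name and the weather question follow the documentation's own example):

```python
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4",  # any tool-calling-capable model
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
    tools=tools,
)
```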
But tools like pydantic generate extended schemas which, for example, include `title` fields. Both flavors are technically valid: the spec says the `title` field is optional. I assume that when the schemas are consumed by software the `title` fields are simply ignored (they carry no new information), but they can be helpful for people trying to understand what is going on; that is why they are there.
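For comparison, here is a sketch of the extended flavor as pydantic generates it, assuming pydantic v2's `model_json_schema()` (in v1 the method was called `schema()`):

```python
from typing import Literal

from pydantic import BaseModel, Field

class WeatherParams(BaseModel):
    location: str = Field(description="The city and state, e.g. San Francisco, CA")
    unit: Literal["celsius", "fahrenheit"] = "celsius"

# pydantic adds "title": "WeatherParams" at the top level and a "title"
# entry ("Location", "Unit") for every property -- exactly the fields
# that the minimal schema above does not have.
print(WeatherParams.model_json_schema())
```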
The extended schemas should be accepted by the LLM, but there are two caveats:

1. GPT models' compliance with the spec is not ideal, so it makes sense to optimize for it. The additional fields can either help the LLM or confuse it. Most probably the LLMs work best with the kind of schemas they were trained/fine-tuned on, and that is what OpenAI put in the documentation, so it is prudent to assume that the minimal schema will work better. Ideally, though, the question should be settled by experiment.
2. The extra fields also mean extra tokens (a rough comparison sketch follows below).
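One rough way to quantify the second caveat is to count tokens in the serialized schemas with tiktoken. This is only an approximation: the real overhead depends on how the API renders tools into the prompt, which is not public.

```python
import json

import tiktoken

minimal = {
    "type": "object",
    "properties": {"location": {"type": "string"}},
    "required": ["location"],
}
extended = {
    "title": "WeatherParams",
    "type": "object",
    "properties": {"location": {"title": "Location", "type": "string"}},
    "required": ["location"],
}

enc = tiktoken.encoding_for_model("gpt-4")
for name, schema in [("minimal", minimal), ("extended", extended)]:
    # Token count of the JSON-serialized schema as a proxy for prompt cost.
    print(name, len(enc.encode(json.dumps(schema))))
```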
Currently there is no official way to generate the minimal schemas from pydantic models. Many libraries remove the `title` fields, which might be an indicator that minimal schemas are better.
I don’t like removing the `title` fields after the schema is generated. Either you remove all `title` fields from the schema, and then you have to forbid users from having `title` fields in their own structures, or you resort to tricks like removing them only when they are adjacent to a `type` field (sketched below), which can also fail in many ways. But for now it looks like the only reasonable way to use nested structures for function calling, and they are very powerful (see https://minimaxir.com/2023/12/chatgpt-structured-data/#nested-schema, or many of the examples from https://jxnl.github.io/instructor/examples/).

I wrote my own simple tool for generating schemas from function parameter annotations, but it is probably a dead end: I don’t have time to write code for the nested-structure cases. So I have now rewritten it for functions with exactly one argument whose type is a subclass of pydantic's BaseModel.
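Here is a sketch of that "adjacent to a `type` field" trick, including one of the ways it can fail:

```python
def strip_titles(schema):
    """Recursively drop "title" keys, but only from dicts that also carry
    a "type" key -- the heuristic for titles added by the schema generator.

    Failure mode: if a user model has fields named both "type" and "title",
    the "properties" dict itself matches the heuristic, and the user's
    entire "title" property gets deleted.
    """
    if isinstance(schema, dict):
        if "type" in schema:
            schema.pop("title", None)
        for value in schema.values():
            strip_titles(value)
    elif isinstance(schema, list):
        for item in schema:
            strip_titles(item)
    return schema
```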
A related note: the instructor library, which is very powerful for such a simple extension of the openai-python library, currently cannot generate a list with more than one tool for the chat completion call. The proposed workaround is to use the Union type. That is fine as a quick workaround, but it generates a more complicated schema, and in my experience the LLM struggles to comply with it. The library catches such failures and retries the query with additional prompting, and on the second pass the LLM usually does fine, but that is one additional LLM call. This is why I still work on my own libraries.
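For reference, a sketch of the Union workaround, assuming instructor's patch-style API from late 2023 (the `Weather` and `Search` models here are made up for illustration):

```python
from typing import Union

import instructor
from openai import OpenAI
from pydantic import BaseModel

class Weather(BaseModel):
    location: str

class Search(BaseModel):
    query: str

client = instructor.patch(OpenAI())

# A single Union response_model stands in for a list of tools; the
# generated schema nests both models under anyOf, which is the more
# complicated shape the LLM then has to comply with.
action = client.chat.completions.create(
    model="gpt-4",
    response_model=Union[Weather, Search],
    messages=[{"role": "user", "content": "What's the weather in Boston?"}],
)
```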
In conclusion: extended schemas, while advantageous for human understanding, may affect LLM performance and token usage. For now, minimal schemas appear to be the more efficient choice, but further guidance from OpenAI, or additional experiments, is needed to properly understand the trade-offs.