# Comprehensive Analysis of Structured Data Parsing with OpenAI
Chapter 1: Introduction to Structured Data Parsing
Parsing structured data from Large Language Models (LLMs) can be challenging, especially beyond simple examples. However, effectively extracting LLM outputs into specified formats is essential for the integration of LLMs into software systems and generative AI applications. OpenAI has paved the way by introducing GPT function calling and JSON mode. Nonetheless, these features necessitate comprehensive prompt engineering, efficient parsing, retries, and effective error management to ensure reliable performance in real-world applications.
In this article, I will discuss some of the challenges I've encountered while parsing structured data using LLMs. It's worth noting that this content was entirely crafted by a human, supplemented by Grammarly's grammar checker, which has been my writing assistant since 2019.
Section 1.1: Key Challenges in Structured Data Parsing
- Classification: LLMs must strictly conform to a predefined list of acceptable classes, which can range from dozens to hundreds in practical scenarios. When faced with more than a few classes, LLMs tend to generate outputs that include unauthorized classes.
- Named Entity Recognition (NER): The LLM should only identify entities explicitly mentioned in the text, which may be organized in a multi-level nested structure, such as User → Address → City. LLMs often struggle with accurately pinpointing these nested elements, resulting in missed entries or the generation of non-existent ones.
- Synthetic Data Generation: Similar to NER, the generation of synthetic data may also require a multi-level nested structure, presenting analogous challenges.
Fortunately, several open-source initiatives aim to address these issues, although my experiences with them have yielded mixed results in complex, real-world scenarios. Thus, I decided to systematically evaluate three open-source frameworks I've utilized: Instructor, Fructose, and Langchain, to determine the most effective framework for the aforementioned tasks in challenging environments. Spoiler alert: Fructose emerged as the top contender!
Experiment Design and Methodology
I aimed to assess the out-of-the-box performance of each framework with minimal modifications based on their official documentation. The evaluation involved running the same input multiple times (20 iterations) to track the number of successful data parsing instances. Additionally, I compared the parsed output against expected results for accuracy measurement. The tests focused on three complex tasks:
- Extreme Multilabel Classification: Handling scenarios with 60 potential classes.
- Deeply Nested NER: Parsing data structured in a three-level hierarchy.
- Three-Level Nested Synthetic Data Generation: Similar to NER, requiring a nested structure.
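The repeated-run harness described above can be sketched as a small decorator. This is my own reconstruction, not the author's exact code: the `experiment` name, signature, and return shape are assumptions.

```python
import functools

def experiment(expected, n_runs=20):
    """Hypothetical harness: call the wrapped function n_runs times,
    counting runs that parse without raising and runs whose output
    matches the expected result."""
    def wrap(fn):
        @functools.wraps(fn)
        def run(*args, **kwargs):
            parsed, correct = 0, 0
            for _ in range(n_runs):
                try:
                    result = fn(*args, **kwargs)
                except Exception:
                    continue  # a failed parse counts against reliability
                parsed += 1
                if result == expected:
                    correct += 1
            return {"parsed": parsed, "accuracy": correct / n_runs}
        return run
    return wrap
```

Applied as `@experiment(expected=...)`, each decorated call then reports how many of the 20 runs parsed successfully and how many matched the expected answer.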
To begin, establish a new Mamba environment and install the necessary libraries. Mamba serves as a drop-in alternative to Conda. If you prefer Conda, simply replace mamba with conda in the following code snippets.
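A setup along these lines should work; the environment name, Python version, and exact package names are my assumptions based on each project's install instructions.

```shell
# Create and activate an isolated environment (swap mamba for conda if preferred)
mamba create -n structured-parsing python=3.11 -y
mamba activate structured-parsing

# Install the three frameworks plus the OpenAI client
pip install openai instructor fructose langchain langchain-openai
```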
Next, open a notebook and import all required libraries.
#### Extreme Multilabel Classification
Achieving high reliability in multilabel classification tasks with many classes can be challenging, especially when compared to simpler, toy problems. For my testing, I utilized the Alexa Intents Classification dataset, which includes 60 classes and is available under a cc-by-4.0 license.
First, we will define the allowed classes and provide sample text for classification. We will then outline the data structures for LLM output parsing. Fructose and Instructor can utilize Enums, while Langchain requires a Pydantic V1 model.
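A minimal sketch of those definitions, using a small illustrative subset in place of the dataset's full 60-class label list (the class names and sample utterance here are placeholders, not the actual Alexa Intents data):

```python
from enum import Enum

class Intent(str, Enum):
    """Illustrative subset of the 60 intent classes."""
    ALARM_SET = "alarm_set"
    ALARM_REMOVE = "alarm_remove"
    PLAY_MUSIC = "play_music"
    WEATHER_QUERY = "weather_query"
    CALENDAR_SET = "calendar_set"

# The constrained label space handed to the LLM
ALLOWED_CLASSES = [i.value for i in Intent]

SAMPLE_TEXT = "Wake me up at seven tomorrow and tell me if it will rain."
```

Fructose and Instructor can consume the `Intent` enum directly, while the Langchain variant wraps the labels in a Pydantic V1 model instead.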
Instructor
Setting up Instructor for multilabel classification is straightforward, requiring just a single line of code as shown in the documentation. The desired output format is passed into the response_model argument. I will annotate the Instructor code with my experiment decorator and run it 20 times against the expected response.
Fructose
Configuring Fructose for multilabel classification involves creating a dummy function annotated with the ai decorator. The function’s type hints inform the LLM of the expected output format, while the docstring prompts the LLM about the task. I will also annotate this function with my experiment decorator for accuracy calculations.
Langchain
For multilabel classification with Langchain, it's necessary to create a prompt template that informs the LLM of the task, establish a Pydantic V1 model for expected output, and utilize the with_structured_output method of the LLM class with the Pydantic model. I will add my experiment decorator to run the code 20 times with the expected response for accuracy evaluation.
#### Deeply Nested Named Entity Recognition
LLMs excel at zero-shot NER extraction but often struggle to parse text into deeply nested structures, which are common in real-world applications.
To evaluate the reliability of the three libraries, I will first prepare the sample text and expected responses.
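For illustration, a hypothetical sample text and expected three-level structure (User → Address → City); the names and addresses here are invented placeholders, not the article's actual test data:

```python
SAMPLE_TEXT = (
    "Jane Doe moved to 221B Baker Street in London last year, "
    "while her colleague John Smith still lives in Manchester."
)

# Expected output, nested three levels deep: users -> address -> city
EXPECTED = {
    "users": [
        {"name": "Jane Doe",
         "address": {"street": "221B Baker Street",
                     "city": {"name": "London"}}},
        {"name": "John Smith",
         "address": {"street": None,
                     "city": {"name": "Manchester"}}},
    ]
}
```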
Instructor
The Instructor library supports any Pydantic model, including nested models, for the desired output format. I will define the nested Pydantic models and create a function that executes the Instructor code using the OpenAI API with an additional response_model parameter. This function will also be annotated with the experiment decorator for repeated runs and accuracy calculations.
Fructose
The Fructose library allows any Dataclass model, including nested structures, as the output format. I will define the nested Dataclasses and set up a dummy Python function with the ai decorator to guide the LLM in expected output format. This function will also feature the experiment decorator for accuracy measurements.
Langchain
Using Langchain for NER requires creating a prompt template to inform the LLM of the task, establishing a Pydantic V1 model for expected output, using it in the with_structured_output method, and tying it to the LLM with LCEL. I will also add my experiment decorator to evaluate accuracy through 20 runs.
#### Deeply Nested Synthetic Data Generation
Generating synthetic data is one of the exciting applications of LLMs. Although LLMs can produce rich synthetic data, they often struggle when parsing this information into complex nested structures.
Instructor
Similar to the previous sections, the Instructor library can utilize nested Pydantic models for the output format. I will define these models and implement a function that runs the Instructor code, asking for details about a fictitious individual. This function will be annotated for repeated testing.
Fructose
The Fructose library can also leverage nested Dataclasses for output formats. I will create a dummy function with the ai decorator and annotate it for repeated runs and metrics calculation.
Langchain
Utilizing Langchain for synthetic data generation is more complex, and I will reproduce the documentation code here for thoroughness. However, despite following the documented approach, I ran into problems: after the first iteration, the results were consistently poor.
Summary of Findings
This analysis compares the performance of Instructor, Fructose, and Langchain across three tasks: Extreme Multilabel Classification, Deeply Nested NER, and Synthetic Data Generation.
In conclusion, my findings suggest:
- Instructor: Offers concise code, utilizes Pydantic for data structure definitions, and excels in extreme multilabel classification tasks.
- Fructose: Also features succinct code and performs well with deeply nested fields for NER and synthetic data generation, alongside multilabel classification. However, it employs Dataclasses, which I find less preferable.
- Langchain: While I wish to endorse Langchain, it struggles with more complex challenges like nested fields. Debugging with LCEL is cumbersome, and the documentation can be unclear.
Ultimately, Fructose stands out as the most versatile option for Multilabel Classification, NER, and Synthetic Data Generation. Instructor is a solid alternative for those preferring Pydantic over Dataclasses, but I would not recommend Langchain for these tasks.
Chapter 2: Further Exploration of OpenAI's Structured Outputs
The first video, "OpenAI Structured Output - All You Need to Know," delves deeper into structured output generation techniques and best practices with OpenAI.
The second video, "Structured Output From OpenAI (Clean Dirty Data)," discusses methods to manage and parse structured outputs effectively, focusing on cleaning and organizing data for practical applications.