Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/1rgs/jsonformer
A Bulletproof Way to Generate Structured JSON from Language Models
Last synced: 3 days ago
- Host: GitHub
- URL: https://github.com/1rgs/jsonformer
- Owner: 1rgs
- License: mit
- Created: 2023-04-29T23:25:54.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-02-24T22:48:05.000Z (11 months ago)
- Last Synced: 2025-01-02T01:02:23.246Z (10 days ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 1.09 MB
- Stars: 4,506
- Watchers: 25
- Forks: 159
- Open Issues: 42
Metadata Files:
- Readme: README.md
- License: license.txt
Awesome Lists containing this project
- AiTreasureBox - 1rgs/jsonformer - A Bulletproof Way to Generate Structured JSON from Language Models (Repos)
- jimsghstars - 1rgs/jsonformer - A Bulletproof Way to Generate Structured JSON from Language Models (Jupyter Notebook)
- awesome-ai-repositories - jsonformer
- awesome-interpretability - jsonformer
README
# Jsonformer: A Bulletproof Way to Generate Structured JSON from Language Models.
### Problem: Getting models to output structured JSON is hard
### Solution: Only generate the content tokens and fill in the fixed tokens
[![colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/1rgs/jsonformer/blob/main/Jsonformer_example.ipynb)
![cover](img/cover4.png)
Generating structured JSON from language models is a challenging task. The generated JSON must be syntactically correct, and it must conform to a schema that specifies the structure of the JSON.

Current approaches to this problem are brittle and error-prone. They rely on prompt engineering, fine-tuning, and post-processing, but they still fail to generate syntactically correct JSON in many cases.
Jsonformer is a new approach to this problem. In structured data, many tokens are fixed and predictable. Jsonformer is a wrapper around Hugging Face models that fills in the fixed tokens during the generation process, and only delegates the generation of content tokens to the language model. This makes it more efficient and more robust than existing approaches.
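The idea of filling in fixed tokens and delegating only content tokens can be sketched in a few lines. This is a simplified illustration, not Jsonformer's actual implementation: `fill`, `stub_model`, and the `generate_value` callback are hypothetical names, and the stub returns canned values where a real language model would be called.

```python
import json
from typing import Any, Callable

# Hypothetical stand-in for a model call: receives the JSON text written so
# far and the schema of the next value, and returns a Python value.
ValueFn = Callable[[str, dict], Any]


def fill(schema: dict, generate_value: ValueFn, prefix: str = "") -> str:
    """Build JSON text for `schema`, writing structural tokens directly."""
    kind = schema["type"]
    if kind == "object":
        out = "{"
        for i, (key, sub) in enumerate(schema["properties"].items()):
            if i:
                out += ", "
            out += json.dumps(key) + ": "  # fixed tokens: key name and colon
            out += fill(sub, generate_value, prefix + out)
        return out + "}"
    if kind == "array":
        # Simplification for this sketch: always emit exactly one element.
        return "[" + fill(schema["items"], generate_value, prefix + "[") + "]"
    # Content tokens: the only place the "model" is consulted.
    return json.dumps(generate_value(prefix, schema))


def stub_model(prefix: str, schema: dict) -> Any:
    """Canned values by type; a real model would condition on `prefix`."""
    return {"string": "example", "number": 1.0, "boolean": True}[schema["type"]]


schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "is_student": {"type": "boolean"},
        "courses": {"type": "array", "items": {"type": "string"}},
    },
}

text = fill(schema, stub_model)
parsed = json.loads(text)  # always parses: braces, keys, and commas are fixed
```

Because the braces, keys, colons, and commas never come from the model in this sketch, the result parses regardless of what the value callback returns, as long as each value is JSON-serializable.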
Jsonformer currently supports a subset of JSON Schema. The supported schema types are:
- number
- boolean
- string
- array
- object

## Example
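The example schemas that follow use only the types listed above. As a quick pre-check, a schema can be scanned for anything outside that list. The helper below is a hypothetical sketch: `SUPPORTED_TYPES` and `unsupported_types` are illustrative names, not part of the Jsonformer API, and the check covers only the `type`, `properties`, and `items` keywords.

```python
from typing import List

# The schema types listed as supported above.
SUPPORTED_TYPES = {"number", "boolean", "string", "array", "object"}


def unsupported_types(schema: dict, path: str = "$") -> List[str]:
    """Return one 'path: type' entry per schema node outside SUPPORTED_TYPES."""
    problems = []
    kind = schema.get("type")
    if kind not in SUPPORTED_TYPES:
        problems.append(f"{path}: {kind!r}")
    if kind == "object":
        # Recurse into each declared property.
        for key, sub in schema.get("properties", {}).items():
            problems += unsupported_types(sub, f"{path}.{key}")
    elif kind == "array" and "items" in schema:
        # Recurse into the element schema.
        problems += unsupported_types(schema["items"], f"{path}[]")
    return problems


candidate = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "id": {"type": "integer"},  # not in the list above
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

issues = unsupported_types(candidate)
```

An empty list means every node in the schema uses a listed type.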
```python
from jsonformer import Jsonformer
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b")
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b")

json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "is_student": {"type": "boolean"},
        "courses": {
            "type": "array",
            "items": {"type": "string"}
        }
    }
}

prompt = "Generate a person's information based on the following schema:"
jsonformer = Jsonformer(model, tokenizer, json_schema, prompt)
generated_data = jsonformer()

print(generated_data)
```

### Jsonformer works on complex schemas, even with tiny models

Here is an example of a schema with nested objects and arrays, followed by the output generated by a 3B parameter model.
```python
{"type": "object", "properties": {"car": {"type": "object", "properties": {"make": {"type": "string"}, "model": {"type": "string"}, "year": {"type": "number"}, "colors": {"type": "array", "items": {"type": "string"}}, "features": {"type": "object", "properties": {"audio": {"type": "object", "properties": {"brand": {"type": "string"}, "speakers": {"type": "number"}, "hasBluetooth": {"type": "boolean"}}}, "safety": {"type": "object", "properties": {"airbags": {"type": "number"}, "parkingSensors": {"type": "boolean"}, "laneAssist": {"type": "boolean"}}}, "performance": {"type": "object", "properties": {"engine": {"type": "string"}, "horsepower": {"type": "number"}, "topSpeed": {"type": "number"}}}}}}}, "owner": {"type": "object", "properties": {"firstName": {"type": "string"}, "lastName": {"type": "string"}, "age": {"type": "number"}}}}}
```

```python
{
  "car": {
    "make": "audi",
    "model": "model A8",
    "year": 2016.0,
    "colors": [
      "blue"
    ],
    "features": {
      "audio": {
        "brand": "sony",
        "speakers": 2.0,
        "hasBluetooth": True
      },
      "safety": {
        "airbags": 2.0,
        "parkingSensors": True,
        "laneAssist": True
      },
      "performance": {
        "engine": "4.0",
        "horsepower": 220.0,
        "topSpeed": 220.0
      }
    }
  },
  "owner": {
    "firstName": "John",
    "lastName": "Doe",
    "age": 40.0
  }
}
```

## Features
- Bulletproof JSON generation: Jsonformer ensures that the generated JSON is always syntactically correct and conforms to the specified schema.
- Efficiency: By generating only the content tokens and filling in the fixed tokens, Jsonformer is more efficient than generating a full JSON string and parsing it.
- Flexible and extendable: Jsonformer is built on top of the Hugging Face transformers library, making it compatible with any model that supports the Hugging Face interface.

## Installation
```bash
pip install jsonformer
```

## Development
[Poetry](https://python-poetry.org/docs/#installation) is used for dependency management.
```bash
poetry install
```

```bash
poetry run python -m jsonformer.example
```

## License
Jsonformer is released under the MIT License. You are free to use, modify, and distribute this software for any purpose, commercial or non-commercial, as long as the original copyright and license notice are included.