Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/finic-ai/doctran


https://github.com/finic-ai/doctran

Last synced: 3 months ago
JSON representation

Awesome Lists containing this project

README

        


🐛 Doctran


Document transformation framework - use LLMs to process complex strings with natural language instructions.


License


Issues


There are certain applications that require documents to be parsed where human-level judgement matters more than speed. E.g. labelling transactions, or extracting semantic information from texts. In these cases, RegEx can be too inflexible, but LLMs are ideal. One way to think of Doctran is a LLM-powered black box where messy strings go in and nice, clean, labelled strings come out. Another way to think about it is a modular, declarative wrapper over OpenAI's functional calling feature that significantly improves the developer experience.

Doctran is (lightly) maintained by [jasonwcfan](https://github.com/jasonwcfan).

## Examples
Clone or download [`examples.ipynb`](/examples.ipynb) for interactive demos.

#### Doctran converts messy, unstructured text
```

J u l y 1 , 2 0 2 3
Updates and Discussions on Various Topics;

Dear Team,


I hope this email finds you well. In this document, I would like to provide you with some important updates and discuss various topics that require our attention. Please treat the information contained herein as highly confidential.

Security and Privacy Measures

As part of our ongoing commitment to ensure the security and privacy of our customers' data, we have implemented robust measures across all our systems. We would like to commend John Doe (email: [email protected]) from the IT department for his diligent work in enhancing our network security. Moving forward, we kindly remind everyone to strictly adhere to our data protection policies and guidelines. Additionally, if you come across any potential security risks or incidents, please report them immediately to our dedicated team at [email protected].

HR Updates and Employee Benefits

Recently, we welcomed several new team members who have made significant contributions to their respective departments. I would like to recognize Jane Smith (SSN: 0 4 9 - 4 5 - 5 9 2 8) for her outstanding performance in customer service. Jane has consistently received positive feedback from our clients. Furthermore, please remember that the open enrollment period for our employee benefits program is fast approaching. Should you have any questions or require assistance, please contact our HR representative, Michael Johnson (phone: 4 1
...
```

#### Into semi-structured documents that are optimized for vector search.

```
{
"topics": ["Security and Privacy", "HR Updates", "Marketing", "R&D"],
"summary": "The document discusses updates on security measures, HR, marketing initiatives, and R&D projects. It commends John Doe for enhancing network security, welcomes new team members, and recognizes Jane Smith for her customer service. It also mentions the open enrollment period for employee benefits, thanks Sarah Thompson for her social media efforts, and announces a product launch event on July 15th. David Rodriguez is acknowledged for his contributions to R&D. The document emphasizes the importance of confidentiality.",
"contact_info": [
{
"name": "John Doe",
"contact_info": {
"phone": "",
"email": "[email protected]"
}
},
{
"name": "Michael Johnson",
"contact_info": {
"phone": "418-492-3850",
"email": "[email protected]"
}
},
{
"name": "Sarah Thompson",
"contact_info": {
"phone": "415-555-1234",
"email": ""
}
}
],
"questions_and_answers": [
{
"question": "What is the purpose of this document?",
"answer": "The purpose of this document is to provide important updates and discuss various topics that require the team's attention."
},
{
"question": "Who is commended for enhancing the company's network security?",
"answer": "John Doe from the IT department is commended for enhancing the company's network security."
}
]
}
```

## Getting Started
`pip install doctran`

```python
from doctran import Doctran

doctran = Doctran(openai_api_key=OPENAI_API_KEY)
document = doctran.parse(content="your_content_as_string")
```

## Chaining transformations
Doctran is designed to make chaining document transformations easy. For example, you may want to first redact all PII from a document before sending it over to OpenAI to be summarized.

Ordering is important when chaining transformations - transformations that are invoked first will be executed first, and its result will be passed to the next transformation.

```python
document = await document.redact(entities=["EMAIL_ADDRESS", "PHONE_NUMBER"]).extract(properties).summarize().execute()
```

## Doctransformers

### Extract
Given any valid JSON schema, uses OpenAI function calling to extract structured data from a document.

```python
from doctran import ExtractProperty

properties = ExtractProperty(
name="millenial_or_boomer",
description="A prediction of whether this document was written by a millenial or boomer",
type="string",
enum=["millenial", "boomer"],
required=True
)
document = await document.extract(properties=properties).execute()
```

### Redact
Uses a spaCy model to remove names, emails, phone numbers and other sensitive information from a document. Runs locally to avoid sending sensitive data to third party APIs.

```python
document = await document.redact(entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"]).execute()
```

### Summarize
Summarize the information in a document. `token_limit` may be passed to configure the size of the summary, however it may not be respected by OpenAI.

```python
document = await document.summarize().execute()
```

### Refine
Remove all information from a document unless it's related to a specific set of topics.

```python
document = await document.refine(topics=['marketing', 'meetings']).execute()
```

### Translate
Translates text into another language

```python
document = await document.translate(language="spanish").execute()
```

### Interrogate
Convert information in a document into question and answer format. End user queries often take the form of a question, so converting information into questions and creating indexes from these questions often yields better results when using a vector database for context retrieval.

```python
document = await document.interrogate().execute()
```

## Contributing
Doctran is open to contributions! The best way to get started is to contribute a new document transformer. Transformers that don't rely on API calls (e.g. OpenAI) are especially valuable since they can run fast and don't require any external dependencies.

### Adding a new doctransformer
Contributing new transformers is straightforward.

1. Add a new class that extends `DocumentTransformer` or `OpenAIDocumentTransformer`
2. Implement the `__init__` and `transform` methods
3. Add corresponding methods to the `DocumentTransformationBuilder` and `Document` classes to enable chaining