Pre-Processing of PDF documents for the "Parla" project
https://github.com/technologiestiftung/parla-document-processor
- Host: GitHub
- URL: https://github.com/technologiestiftung/parla-document-processor
- Owner: technologiestiftung
- Created: 2023-11-08T10:23:09.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-04-13T00:19:04.000Z (8 months ago)
- Last Synced: 2024-04-13T21:45:03.114Z (8 months ago)
- Language: TypeScript
- Homepage:
- Size: 3.3 MB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 8
Metadata Files:
- Readme: README.md
![](https://img.shields.io/badge/Built%20with%20%E2%9D%A4%EF%B8%8F-at%20Technologiestiftung%20Berlin-blue)
[![All Contributors](https://img.shields.io/badge/all_contributors-2-orange.svg?style=flat-square)](#contributors-)
# parla-document-processor
This repository contains scripts for pre-processing PDF files for later use in the explorational project _Parla_. It offers a generic way of importing / registering and processing PDF documents. For the use case of _Parla_, the publicly accessible PDF documents of "Schriftliche Anfragen" and "Hauptausschussprotokolle" are used.
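The register-then-process flow described above can be sketched with a pair of interfaces. The names and shapes below are illustrative assumptions, not the repository's actual API:

```typescript
// Illustrative sketch of the register-then-process flow described above.
// All names (RegisteredDocument, DocumentImporter, registerAll) are
// assumptions for illustration, not the repository's actual types.

interface RegisteredDocument {
  downloadUrl: string; // publicly accessible PDF URL
  metadata: Record<string, string>;
}

interface DocumentImporter {
  // Find relevant documents in a data source and return them for registration.
  import(): Promise<RegisteredDocument[]>;
}

class SchriftlicheAnfragenImporter implements DocumentImporter {
  async import(): Promise<RegisteredDocument[]> {
    // A real importer would crawl its data source; here we return a fixture.
    return [
      {
        downloadUrl: "https://example.org/anfrage.pdf",
        metadata: { type: "Schriftliche Anfrage" },
      },
    ];
  }
}

async function registerAll(
  importers: DocumentImporter[],
): Promise<RegisteredDocument[]> {
  const docs: RegisteredDocument[] = [];
  for (const importer of importers) {
    docs.push(...(await importer.import()));
  }
  // In the real project, the documents would now be stored in the database.
  return docs;
}
```

Each data source gets its own importer, so adding a new source only means implementing one interface.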
## Prerequisites / External Services
- Running and accessible Supabase database with the schema defined in https://github.com/technologiestiftung/parla-api
- [OpenAI](https://platform.openai.com/docs/overview) account and API key
- [LLamaParse](https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/) account and API key

## Features
- Register relevant documents from various data sources, see [./src/importers](./src/importers). Registering documents means storing their download URL and possible metadata in the database.
- Process registered documents by:
1. Downloading the PDF
2. Extracting text (Markdown) content from the PDF via [LLamaParse API](https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/)
3. Generating a summary of the PDF content via OpenAI
4. Generating a list of tags describing the PDF content via OpenAI
5. Generating embedding vectors of each PDF page via OpenAI
- Regenerate embeddings both for chunks and summaries. This is particularly useful when the LLM provider (we use OpenAI) introduces a new embedding model, as happened in January 2024 (https://openai.com/blog/new-embedding-models-and-api-updates). Regeneration is done in the `run_regenerate_embeddings.ts` script and performs the following steps:
- For each chunk in `processed_document_chunks`, generate embedding with the (new) model set in env variable `OPENAI_EMBEDDING_MODEL` and store in column `embedding_temp`.
- For each summary in `processed_document_summaries`, generate embedding with the (new) model set in env variable `OPENAI_EMBEDDING_MODEL` and store in column `summary_embedding_temp`.
- After doing so, the API (https://github.com/technologiestiftung/parla-api) must be changed to use the new model as well.
- The final migration must happen simultaneously with the API changes by renaming the columns:
```
ALTER TABLE processed_document_chunks rename column embedding to embedding_old;
ALTER TABLE processed_document_chunks rename column embedding_temp to embedding;
ALTER TABLE processed_document_chunks rename column embedding_old to embedding_temp;
```
and
```
ALTER TABLE processed_document_summaries rename column summary_embedding to summary_embedding_old;
ALTER TABLE processed_document_summaries rename column summary_embedding_temp to summary_embedding;
ALTER TABLE processed_document_summaries rename column summary_embedding_old to summary_embedding_temp;
```
- After swapping the columns, the indices must be regenerated, see section [**Periodically regenerate indices**](#periodically-regenerate-indices).

## Limitations
- Only PDF documents are supported
- The download URL of the documents must be publicly accessible

## Environment variables
See [.env.sample](.env.sample)
```
# Supabase Configuration
SUPABASE_URL=
SUPABASE_SERVICE_ROLE_KEY=
SUPABASE_DB_CONNECTION=

# OpenAI Configuration
OPENAI_API_KEY=
OPENAI_MODEL=
OPENAI_EMBEDDING_MODEL=

# Directory for processing temporary files
PROCESSING_DIR=
ALLOW_DELETION=false

# Max limit for the number of pages to process (with fallback strategy)
MAX_PAGES_LIMIT=5000

# Limit for the number of pages to process with LLamaParse
MAX_PAGES_FOR_LLM_PARSE_LIMIT=128

# LLamaParse Token (get via LlamaParse Cloud)
LLAMA_PARSE_TOKEN=

# Max number of documents to process in one run (for limiting the maximum runtime)
MAX_DOCUMENTS_TO_PROCESS_IN_ONE_RUN=

# Max number of documents to import in one run (for limiting the maximum runtime)
MAX_DOCUMENTS_TO_IMPORT_PER_DOCUMENT_TYPE=
```
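A minimal sketch of reading the numeric and boolean variables above with fallbacks (the helper functions are illustrative assumptions, not the project's actual code):

```typescript
// Illustrative sketch: read the environment variables listed above with
// fallbacks. envInt/envFlag are assumptions, not the project's actual helpers.

function envInt(
  env: Record<string, string | undefined>,
  name: string,
  fallback: number,
): number {
  const raw = env[name];
  const parsed = raw === undefined || raw === "" ? NaN : Number(raw);
  return Number.isFinite(parsed) ? parsed : fallback;
}

function envFlag(env: Record<string, string | undefined>, name: string): boolean {
  // Only the literal string "true" enables the flag.
  return env[name] === "true";
}

const settings = {
  maxPagesLimit: envInt(process.env, "MAX_PAGES_LIMIT", 5000),
  maxPagesForLlmParse: envInt(process.env, "MAX_PAGES_FOR_LLM_PARSE_LIMIT", 128),
  allowDeletion: envFlag(process.env, "ALLOW_DELETION"),
};
```

Parsing and defaulting in one place keeps a missing or malformed `.env` entry from silently becoming `NaN` deeper in the pipeline.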
## Run locally
**⚠️ Warning: Running those scripts on many PDF documents will result in significant costs. ⚠️**
- Setup `.env` file based on `.env.sample`
- Run `npm ci` to install dependencies
- Run `npx tsx ./src/run_import.ts` to register the documents
- Run `npx tsx ./src/run_process.ts` to process all unprocessed documents

## Periodically regenerate indices
The indices on the `processed_document_chunks` and `processed_document_summaries` tables need to be regenerated when new data arrives, because the `lists` parameter of the indices should be adjusted as the number of rows grows (see https://github.com/pgvector/pgvector). To do this, we use the `pg_cron` extension (https://github.com/citusdata/pg_cron) and schedule two jobs which call functions defined in the API and database definition (https://github.com/technologiestiftung/parla-api):
```
select cron.schedule (
'regenerate_embedding_indices_for_chunks',
'30 5 * * *',
$$ SELECT * from regenerate_embedding_indices_for_chunks() $$
);

select cron.schedule (
'regenerate_embedding_indices_for_summaries',
'30 5 * * *',
$$ SELECT * from regenerate_embedding_indices_for_summaries() $$
);
```

## Related repositories
- API and database definition: https://github.com/technologiestiftung/parla-api
- _Parla_ frontend: https://github.com/technologiestiftung/parla-frontend

## Contributors ✨
Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):
Fabian Morón Zirfas
💻 🤔
Jonas Jaszkowic
💻 🤔 🚇
This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!
## Credits
Made by
A project by
Supported by
## Related Projects
- https://github.com/technologiestiftung/parla-frontend
- https://github.com/technologiestiftung/parla-api