https://github.com/technologiestiftung/parla-document-processor

Pre-Processing of PDF documents for the "Parla" project
![](https://img.shields.io/badge/Built%20with%20%E2%9D%A4%EF%B8%8F-at%20Technologiestiftung%20Berlin-blue)

[![All Contributors](https://img.shields.io/badge/all_contributors-2-orange.svg?style=flat-square)](#contributors-)

# parla-document-processor

This repository contains scripts for pre-processing PDF files for later use in the exploratory project _Parla_. It offers a generic way of importing/registering and processing PDF documents. For the use case of _Parla_, the publicly accessible PDF documents of "Schriftliche Anfragen" (written parliamentary inquiries) and "Hauptausschussprotokolle" (main committee minutes) are used.
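The register-then-process flow can be sketched as follows. This is a minimal illustration only: the `Importer` interface, the `RegisteredDocument` fields, and the class name are assumptions for this sketch, not the repository's actual API (the real importers live in [./src/importers](./src/importers)).

```typescript
// Illustrative only: field and interface names are assumptions, not the real code.

type RegisteredDocument = {
  downloadUrl: string;          // must be publicly accessible (see Limitations)
  documentType: string;         // e.g. "Schriftliche Anfrage"
  metadata: Record<string, string>;
};

interface Importer {
  // Discovers documents at a data source and returns them for registration.
  fetchNewDocuments(): Promise<RegisteredDocument[]>;
}

// A toy importer standing in for a real data source.
class StaticImporter implements Importer {
  async fetchNewDocuments(): Promise<RegisteredDocument[]> {
    return [
      {
        downloadUrl: "https://example.org/anfrage.pdf",
        documentType: "Schriftliche Anfrage",
        metadata: { reference: "19/12345" },
      },
    ];
  }
}
```

Registering a document then amounts to persisting each returned record (URL plus metadata) in the database; processing happens in a later, separate step.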

## Prerequisites / External Services

- A running and accessible Supabase database with the schema defined in https://github.com/technologiestiftung/parla-api
- [OpenAI](https://platform.openai.com/docs/overview) account and API key
- [LLamaParse](https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/) account and API key

## Features

- Register relevant documents from various data sources (see [./src/importers](./src/importers)). Registering a document means storing its download URL and any available metadata in the database.
- Process registered documents by:

1. Downloading the PDF
2. Extracting text (Markdown) content from the PDF via [LLamaParse API](https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/)
3. Generating a summary of the PDF content via OpenAI
4. Generating a list of tags describing the PDF content via OpenAI
5. Generating embedding vectors of each PDF page via OpenAI

- Regenerate embeddings for both chunks and summaries. This is particularly useful when the LLM provider (we use OpenAI) introduces a new embedding model, as happened in January 2024 (https://openai.com/blog/new-embedding-models-and-api-updates). Regeneration is done in the `run_regenerate_embeddings.ts` script and performs the following steps:

- For each chunk in `processed_document_chunks`, generate an embedding with the (new) model set in the environment variable `OPENAI_EMBEDDING_MODEL` and store it in the column `embedding_temp`.
- For each summary in `processed_document_summaries`, generate an embedding with the (new) model set in the environment variable `OPENAI_EMBEDDING_MODEL` and store it in the column `summary_embedding_temp`.
- After doing so, the API (https://github.com/technologiestiftung/parla-api) must be changed to use the new model as well.
- The final migration must happen simultaneously with the API changes by renaming the columns:
```
ALTER TABLE processed_document_chunks rename column embedding to embedding_old;
ALTER TABLE processed_document_chunks rename column embedding_temp to embedding;
ALTER TABLE processed_document_chunks rename column embedding_old to embedding_temp;
```
and
```
ALTER TABLE processed_document_summaries rename column summary_embedding to summary_embedding_old;
ALTER TABLE processed_document_summaries rename column summary_embedding_temp to summary_embedding;
ALTER TABLE processed_document_summaries rename column summary_embedding_old to summary_embedding_temp;
```
- After swapping the columns, the indices must be regenerated; see section [Periodically regenerate indices](#periodically-regenerate-indices).
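The per-document processing pipeline (steps 1 to 5 above) can be sketched roughly as follows. All names are illustrative and the stages are stubbed; the real code calls the LLamaParse API for extraction and OpenAI for summaries, tags, and embeddings.

```typescript
// Rough sketch of the processing pipeline; stubs stand in for external API calls.

type ProcessedDocument = {
  markdown: string[];     // one Markdown string per PDF page (step 2)
  summary: string;        // step 3
  tags: string[];         // step 4
  embeddings: number[][]; // one embedding vector per page (step 5)
};

async function downloadPdf(url: string): Promise<Uint8Array> {
  return new Uint8Array([0x25, 0x50, 0x44, 0x46]); // "%PDF" header, stubbed
}
async function extractMarkdown(pdf: Uint8Array): Promise<string[]> {
  return ["# Page 1"]; // real code calls the LLamaParse API
}
async function summarize(pages: string[]): Promise<string> {
  return pages.join("\n").slice(0, 100); // real code calls OPENAI_MODEL
}
async function generateTags(pages: string[]): Promise<string[]> {
  return ["stub-tag"]; // real code calls OPENAI_MODEL
}
async function embed(page: string): Promise<number[]> {
  return [page.length]; // real code calls OPENAI_EMBEDDING_MODEL
}

async function processDocument(url: string): Promise<ProcessedDocument> {
  const pdf = await downloadPdf(url);                        // step 1
  const markdown = await extractMarkdown(pdf);               // step 2
  const summary = await summarize(markdown);                 // step 3
  const tags = await generateTags(markdown);                 // step 4
  const embeddings = await Promise.all(markdown.map(embed)); // step 5
  return { markdown, summary, tags, embeddings };
}
```

The results are persisted per page (chunks with embeddings) and per document (summary and tags), which is what the regeneration script above later iterates over.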

## Limitations

- Only PDF documents are supported
- The download URL of the documents must be publicly accessible

## Environment variables

See [.env.sample](.env.sample)

```
# Supabase Configuration
SUPABASE_URL=
SUPABASE_SERVICE_ROLE_KEY=
SUPABASE_DB_CONNECTION=

# OpenAI Configuration
OPENAI_API_KEY=
OPENAI_MODEL=
OPENAI_EMBEDDING_MODEL=

# Directory for processing temporary files
PROCESSING_DIR=

ALLOW_DELETION=false

# Max limit for the number of pages to process (with fallback strategy)
MAX_PAGES_LIMIT=5000

# Limit for the number of pages to process with LLamaParse
MAX_PAGES_FOR_LLM_PARSE_LIMIT=128

# LLamaParse Token (get via LlamaParse Cloud)
LLAMA_PARSE_TOKEN=

# Max number of documents to process in one run (for limiting the maximum runtime)
MAX_DOCUMENTS_TO_PROCESS_IN_ONE_RUN=

# Max number of documents to import in one run (for limiting the maximum runtime)
MAX_DOCUMENTS_TO_IMPORT_PER_DOCUMENT_TYPE=

```
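The numeric limits above might be read with safe fallbacks along these lines. The variable names match `.env.sample`; the helper function and the default values are illustrative, not the repository's actual code.

```typescript
// Illustrative helper for reading numeric limits from the environment.

declare const process: { env: Record<string, string | undefined> };

function intFromEnv(name: string, fallback: number): number {
  const raw = process.env[name];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  // Fall back when the variable is unset or not a number.
  return Number.isNaN(parsed) ? fallback : parsed;
}

const limits = {
  maxPages: intFromEnv("MAX_PAGES_LIMIT", 5000),
  maxPagesForLlmParse: intFromEnv("MAX_PAGES_FOR_LLM_PARSE_LIMIT", 128),
};
```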

## Run locally

**⚠️ Warning: Running those scripts on many PDF documents will result in significant costs. ⚠️**

- Setup `.env` file based on `.env.sample`
- Run `npm ci` to install dependencies
- Run `npx tsx ./src/run_import.ts` to register the documents
- Run `npx tsx ./src/run_process.ts` to process all unprocessed documents

## Periodically regenerate indices

The indices on the `processed_document_chunks` and `processed_document_summaries` tables need to be regenerated when new data arrives, because the `lists` parameter of the index should be adjusted to the number of rows (see https://github.com/pgvector/pgvector). To do this, we use the `pg_cron` extension: https://github.com/citusdata/pg_cron. To schedule the regeneration of indices, we create two jobs which use functions defined in the API and database definition: https://github.com/technologiestiftung/parla-api.

```
select cron.schedule (
'regenerate_embedding_indices_for_chunks',
'30 5 * * *',
$$ SELECT * from regenerate_embedding_indices_for_chunks() $$
);

select cron.schedule (
'regenerate_embedding_indices_for_summaries',
'30 5 * * *',
$$ SELECT * from regenerate_embedding_indices_for_summaries() $$
);
```

## Related repositories

- API and database definition: https://github.com/technologiestiftung/parla-api
- _Parla_ frontend: https://github.com/technologiestiftung/parla-frontend

## Contributors ✨

Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):



- Fabian Morón Zirfas: 💻 🤔
- Jonas Jaszkowic: 💻 🤔 🚇

This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!

## Credits



Made by

A project by

Supported by