{"id":19816932,"url":"https://github.com/flyingfathead/pdf-translator-openai-api","last_synced_at":"2025-07-14T10:34:14.918Z","repository":{"id":213959241,"uuid":"735333277","full_name":"FlyingFathead/PDF-translator-OpenAI-API","owner":"FlyingFathead","description":"Python-based PDF/plaintext translator framework for OpenAI API","archived":false,"fork":false,"pushed_at":"2024-07-07T19:32:47.000Z","size":654,"stargazers_count":7,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-22T17:17:53.614Z","etag":null,"topics":["machine-translation","nlp","nlp-parsing","openai","openai-api","python","python3","translation","translation-management","translator"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FlyingFathead.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-24T14:28:24.000Z","updated_at":"2025-03-28T18:08:56.000Z","dependencies_parsed_at":"2024-02-28T19:48:36.050Z","dependency_job_id":"c65efc9b-e51a-44ba-b9b2-17fa37bf4501","html_url":"https://github.com/FlyingFathead/PDF-translator-OpenAI-API","commit_stats":{"total_commits":40,"total_committers":2,"mean_commits":20.0,"dds":"0.17500000000000004","last_synced_commit":"79aeb8877bd90f0dfa0caf53164a0a4875806edd"},"previous_names":["flyingfathead/pdf-translator-openai-api"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FlyingFathead%2FPDF-translator-OpenAI-API","tags_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FlyingFathead%2FPDF-translator-OpenAI-API/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FlyingFathead%2FPDF-translator-OpenAI-API/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FlyingFathead%2FPDF-translator-OpenAI-API/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FlyingFathead","download_url":"https://codeload.github.com/FlyingFathead/PDF-translator-OpenAI-API/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251860579,"owners_count":21655770,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-translation","nlp","nlp-parsing","openai","openai-api","python","python3","translation","translation-management","translator"],"created_at":"2024-11-12T10:11:06.463Z","updated_at":"2025-05-01T10:33:19.155Z","avatar_url":"https://github.com/FlyingFathead.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PDF-translator-OpenAI-API\nExperimental Python-based PDF/plaintext translator that utilizes the OpenAI API\n\n![GUI Screenshot](https://github.com/FlyingFathead/PDF-translator-OpenAI-API/blob/main/gui-screenshot.png)\n\n- Can be used to dump PDF content into JSON and further on to local databases or i.e. 
LLM supplementation with RAG\n\n- **NOTE: this is a highly experimental WIP pipeline for dumping PDFs into plaintext and getting them translated through the OpenAI API.**\n\n- **I do NOT recommend running it without first studying the code since the program is just an early trial at this point.**\n\n# Prerequisites\n- Text extraction from PDF files requires `pdfminer.six` -- install with: `pip install -U pdfminer.six`\n- Token counting (to calculate estimated API costs) requires `transformers` -- install with `pip install -U transformers`\n- The translation module, when using the OpenAI API, requires the `openai` package (`pip install -U openai`) and a functioning OpenAI API key.\n- Put the OpenAI API key into your environment variables as `OPENAI_API_KEY` or as a single-line entry in `api_token.txt` in the program directory.\n\n# Functionalities so far / processing order\n\n0) `pdfget.py \u003cdirectory\u003e` uses `fitz` (PyMuPDF) to dump the text in a natural reading order by approximating the position on the page. The current version adds a page separator and page counter between each page and dumps the plaintext files to the `txt_raw` subdirectory. Then, `page_fixing.py \u003cdirectory\u003e` can be used on the `txt_raw` directory to dump the formatting per page into a more concise format, keeping the page splits. The output directory is `txt_processed`. Keep in mind that all of these are trial-and-error type approaches that may not be applicable to all use case scenarios.\n\n1) `pdf_reader_splitter.py \u003cpdf file\u003e` to dump to splits by page straight from the PDF. Also supports a command-line option for setting the split on chars. WIP, as usual.\n2) `openai_api_auto_translate.py \u003cdirectory name\u003e` to translate an entire directory (the one you dumped your splits into with `pdf_reader_splitter.py`). 
Edit `config.ini` to set your own parameters for translation.\n3) `combine_translation.py \u003cdirectory name\u003e` to combine the splits back into one piece.\n4) `post_process.py \u003ctextfile\u003e` for final touches, e.g. adding an empty line between paragraphs that lack one and trimming multiple empty lines.\n\n## Text parsing with `spacy` (for specific use case scenarios only)\n- `pip install spacy` and then your needed packages like:\n- `python -m spacy download \u003cyour spacy package\u003e`\n\n# WIP\n- `gui-translator.py` - an early alpha GUI for side-by-side / A/B type comparison with a graphical user interface.\n\n# Other stuff\n\n- `pdfmine.py your_file.pdf` to dump the text layer of a PDF to plaintext.\n- `tokencounter.py` to estimate the number of tokens in the text file for a rough token usage estimate.\n- `splitter.py textfile.txt` to split the text file into pieces that are more suitable for LLMs such as GPT-3.5 or GPT-4. It splits at 5000 chars at newline by default, but can be adjusted from the `char_limit` variable.\n- `splitter.py` also tries to auto-sanitize the PDF dump at the moment -- this might not be suitable for your use case scenario, so again -- look at the split dumps first before you run it through an LLM translation -- GIGO (garbage in, garbage out) applies to NLP translations as well.\n- (Coming soon) pipeline to automate the actual translation process.\n\n# Changelog\n- v0.14 - added `token_count_estimator.py` to run a token count estimate (with `spacy` and `tokenizer`)\n- v0.13 - added `pdfget.py` for natural reading order extraction using fitz (PyMuPDF)\n- v0.12 - early alpha test for the GUI; `gui-translator.py`\n- v0.11 - bugfixes\n- v0.10 - translation combining via `combine_translation.py`\n- v0.09 - token handling, naming policy\n- v0.08 - more changes to the API call functionality\n- v0.07 - API call updated and fixed for openai \u003ev1.0\n- v0.06 - fixes to the API call\n- v0.05 - calculate the cost 
approximation\n- v0.04 - calculate both tokens and chars\n- v0.03 and earlier: rudimentary sketches\n\n# Todo\n- More streamlined automation for the translation process\n- Perhaps an optional GUI with a PDF reader\n- Looking into PDF file layers to see if we could replace the contents in-place (get text block layer from PDF page =\u003e sanitize =\u003e LLM translate =\u003e insert back in-place)\n\n# About\n\n- Started as a Grindmas (= Code-Grinding Christmas) project for [Skrolli magazine](https://skrolli.fi)\n- [FlyingFathead](https://github.com/FlyingFathead) w/ code whispers from ChaosWhisperer\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fflyingfathead%2Fpdf-translator-openai-api","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fflyingfathead%2Fpdf-translator-openai-api","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fflyingfathead%2Fpdf-translator-openai-api/lists"}