# AI Contract Summarizer
### Summary
This project uses open-source LLM libraries to summarize .pdf and .docx documents containing contract information: tender data, dates, lists of goods and services, award data, etc. Its purpose is to help individuals and organizations process large numbers of contracts in a systematic way: the summary of each document follows the same structure or template, so it can be easily loaded into a database.

We defined the template as a list of key:value pairs with these options:
```
contractName: ""
id: ""
tender.name: ""
tender.procedure.law: ""
tender.address: ""
tender.phone: ""
tender.endDate: ""
tender.validity: ""
tender.fundsSource: ""
tender.completionPeriod: ""
tender.email: ""
tender.goods.description: ""
tender.goods.quantity: 0
tender.goods.unit: ""
award.name: ""
award.price: 0
award.currency: ""
```
It's important to mention that this template can be adapted or modified to fit the needs of your project.
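Because every summary follows the same dotted key:value template, it can be parsed into a nested record before loading it into a database. Below is a minimal, hypothetical sketch of such a parser (the function name and the quoting rules it assumes are ours, not part of the project):

```python
import json

def parse_summary(text: str) -> dict:
    """Parse 'dotted.key: value' lines into a nested dict."""
    record: dict = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, _, raw = line.partition(":")
        raw = raw.strip()
        # Template values are either quoted strings or bare numbers.
        try:
            value = json.loads(raw) if raw else ""
        except json.JSONDecodeError:
            value = raw  # fall back to the raw text
        node = record
        parts = key.strip().split(".")
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return record

summary = 'contractName: "Road works"\naward.price: 125000\naward.currency: "USD"'
print(parse_summary(summary))
# {'contractName': 'Road works', 'award': {'price': 125000, 'currency': 'USD'}}
```

The nested dict maps directly onto a document store or can be flattened back into table columns.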
One of the main goals of the project was to provide code for fine-tuning, testing and processing that can run on consumer-grade hardware, such as an RTX 2000/3000/4000 series graphics card; this means using as little GPU memory as possible. For that we selected models of different sizes that fit into 8/12 GB of VRAM.
If you have a newer, bigger GPU or a cloud server, this code can be scaled up to use larger LLM models with more parameters (billions instead of millions).

## How to use the code

### Folder structure
Inside the /src directory there are several folders:
- **common:** utility classes used to preprocess documents, clean up outputs, unify results, connect to a database, etc.
- **facebook / google folders:** each one contains the code to fine-tune and test a different model. In our case we found the best balance between **model size & accuracy** in **google_flan_t5_base**, and that's the model we will talk about from now on.

### Libraries needed
This is a Python 3 project. After creating a new virtual environment, install the libraries with ``` pip install -r requirements.txt ```. It is also recommended to install the latest CUDA libraries in order to use the GPU instead of the CPU.

### Train, test and process in bulk

#### Train
- Prepare a directory with the training data: each .pdf and .docx needs a .txt file with the same name containing its summarization. It's important that you copy & paste from the original file, including special characters.

- Execute ``` 00_google_flan_T5_base_train_dates.py ```. In our case we got good results and low training time using the scripts from the **google_flan_t5_base** folder. This step re-trains the original model to improve how it parses dates.

- Execute ``` 01_google_flan_T5_base_train_model.py ``` to re-train the model from the previous step. This process reads the pairs of documents + texts to learn how to summarize using our template.

#### Test
- We can test single documents by executing ``` 01_google_flan_T5_base_test_one.py ``` or ``` 01_google_flan_T5_base_test_model.py ```. We can change parameters like chunk_size, temperature, etc. and test which combination produces the best results.
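One way to compare parameter combinations systematically is a small grid search. This is an illustrative sketch only: `summarize_document` is a hypothetical stand-in for whatever inference call the test scripts expose, and the candidate values are arbitrary:

```python
from itertools import product

# Hypothetical stand-in for the project's inference call.
def summarize_document(path: str, chunk_size: int, temperature: float) -> str:
    return f"summary of {path} (chunk={chunk_size}, temp={temperature})"

def parameter_grid(chunk_sizes, temperatures):
    """Return every (chunk_size, temperature) combination to try."""
    return list(product(chunk_sizes, temperatures))

grid = parameter_grid([256, 340, 512], [0.3, 0.7])
for chunk_size, temperature in grid:
    print(summarize_document("sample_contract.pdf", chunk_size, temperature))
```

Comparing the outputs against a reference summary then tells you which combination to hard-code into the scripts.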
#### Bulk process
- By executing the script ``` 01_google_flan_T5_base_process_directory.py ``` we can summarize all .pdf and .docx documents in a directory.
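A directory-processing loop of this kind, writing one `<original>_result.txt` per document and skipping files that already have one, can be sketched as follows (`summarize_file` and `process_directory` are illustrative names, not the script's actual API):

```python
from pathlib import Path

# Illustrative stand-in for the model-backed summarizer.
def summarize_file(path: Path) -> str:
    return f"summary of {path.name}"

def process_directory(directory: str) -> list[str]:
    """Summarize every .pdf/.docx, skipping already-processed files."""
    processed = []
    for path in sorted(Path(directory).iterdir()):
        if path.suffix.lower() not in {".pdf", ".docx"}:
            continue
        result_file = path.with_name(path.stem + "_result.txt")
        if result_file.exists():
            continue  # already processed on a previous run
        result_file.write_text(summarize_file(path), encoding="utf-8")
        processed.append(path.name)
    return processed
```

Re-running the loop after an interruption only processes the documents that have no result file yet.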
---
# FLAN-T5 Fine-tuning & Inference Toolkit

This repository contains a **two-stage training pipeline** and a set of **helper utilities** for fine-tuning the open-source **[`google/flan-t5-base`](https://huggingface.co/google/flan-t5-base)** model on domain documents (PDF / DOCX) and generating structured results.
Everything can run **locally** or inside a **CUDA-enabled Docker container**.

---

## ✨ Key Features

* **Stage-0 pre-training on dates** – teaches the model to normalize natural-language dates to a single canonical format.
* **Stage-1 domain fine-tuning** – ingests your own corpus of PDFs/DOCXs and expected outputs.
* **Single-file, curated-batch, or full-directory inference** modes.
* **GPU ready** – all scripts auto-detect `torch.cuda.is_available()`.
* Minimal external dependencies; only `transformers`, `datasets`, and common Python data libraries.

---

## 🗂️ Repository Layout

```
.
├── 00_google_flan_T5_base_train_dates.py       # Stage-0 training
├── 01_google_flan_T5_base_train_model.py       # Stage-1 training
├── 01_google_flan_T5_base_test_one.py          # Inference on one file
├── 01_google_flan_T5_base_test_model.py        # Inference on a curated list
├── 01_google_flan_T5_base_process_directory.py # Batch inference over a folder
├── requirements.txt
```

---

## 🔧 Prerequisites

| Purpose                 | Package                                                 |
| ----------------------- | ------------------------------------------------------- |
| Core ML stack           | **PyTorch ≥ 2.0** (with CUDA if you have an NVIDIA GPU) |
| Transformers & datasets | `transformers ≥ 4.40`, `datasets`                       |
| Data helpers            | `pandas`, `python-docx`, `PyPDF2`                       |
| Your project utils      | `src.common.*` must be importable                       |

Install locally:

```bash
pip install torch transformers datasets pandas python-docx PyPDF2
```

---

## 🏋️ Training Workflow

### 1. Stage 0 – Date normalization

```bash
python 00_google_flan_T5_base_train_dates.py
```

* Reads `Merged_and_Shuffled_Dates.csv`, which contains two columns:
  `date_input` → a natural-language date, `expected_output` → the normalized date. This is a sample file autogenerated with different date representations and the expected output as DD/MM/YYYY.
* Fine-tunes the base model for 5 epochs and saves to
  `./fine-tuned-model_flan_t5_base_step_0`.

### 2. Stage 1 – Domain fine-tuning

```bash
python 01_google_flan_T5_base_train_model.py
```

* Loads the Stage-0 checkpoint (`fine-tuned-model_flan_t5_base_step_0`).
* Calls `prepare_training_tuples_from_directory()` to build training pairs from
  `DB_DIRECTORY`.
* Trains for 10 epochs (batch = 3) and saves to
  `./fine-tuned-model_flan_t5_base_step_1_<max_len>`.

> **Tip:** Increase `max_length_param` (default 512 tokens) for longer context; if it exceeds 1024, the script automatically enables gradient checkpointing.
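A file like `Merged_and_Shuffled_Dates.csv` can be regenerated synthetically. The sketch below is one possible generator, keeping the two column names described above; the set of input formats, the seed, and the row count are arbitrary choices of ours:

```python
import csv
import random
from datetime import date, timedelta

def make_date_rows(n: int, seed: int = 0) -> list[dict]:
    """Create (date_input, expected_output) pairs, expected as DD/MM/YYYY."""
    rng = random.Random(seed)
    formats = ["%B %d, %Y", "%d %B %Y", "%Y-%m-%d", "%m/%d/%Y"]
    rows = []
    for _ in range(n):
        d = date(2000, 1, 1) + timedelta(days=rng.randrange(365 * 30))
        rows.append({
            "date_input": d.strftime(rng.choice(formats)),
            "expected_output": d.strftime("%d/%m/%Y"),
        })
    rng.shuffle(rows)
    return rows

with open("Merged_and_Shuffled_Dates.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["date_input", "expected_output"])
    writer.writeheader()
    writer.writerows(make_date_rows(1000))
```

Because each row is derived from a single `date` object, input and expected output are guaranteed to refer to the same day.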
---

## 🔍 Inference Modes

| Script                                            | Use case                                         | How to run                                           |
| ------------------------------------------------- | ------------------------------------------------ | ---------------------------------------------------- |
| **`01_google_flan_T5_base_test_one.py`**          | Quick sanity check on one file (path hard-coded) | `python 01_google_flan_T5_base_test_one.py`          |
| **`01_google_flan_T5_base_test_model.py`**        | Evaluate on a curated list of test paths         | `python 01_google_flan_T5_base_test_model.py`        |
| **`01_google_flan_T5_base_process_directory.py`** | Batch-process every PDF/DOCX in a folder         | `python 01_google_flan_T5_base_process_directory.py` |

All three scripts:

* Expect the Stage-1 checkpoint at
  `.../fine-tuned-model_flan_t5_base_step_1_<max_len>`.
  Update the path or pass it via environment variables if you reorganise.
* Slice long files into chunks of `chunk_size_characters` (~⅔ of `max_length_param`).
* Use beam search + temperature sampling (`num_beams = 2`, `temperature = 0.7`).

The directory processor stores results as `<original>_result.txt` and **skips already-processed files**.
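The chunking step can be sketched as follows, using the ~⅔ ratio mentioned above; the function name is illustrative and a real implementation might prefer to split on sentence boundaries rather than at fixed character offsets:

```python
def chunk_text(text: str, max_length_param: int = 512) -> list[str]:
    """Split text into chunks of roughly 2/3 * max_length_param characters."""
    chunk_size_characters = (2 * max_length_param) // 3
    return [
        text[i:i + chunk_size_characters]
        for i in range(0, len(text), chunk_size_characters)
    ]

chunks = chunk_text("x" * 1000, max_length_param=512)
print(len(chunks), len(chunks[0]))  # 3 341
```

Each chunk is then summarized independently and the partial results are merged into one template.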
---

## 🛠️ Customising & Extending

| What you want                        | Where to change                                               |
| ------------------------------------ | ------------------------------------------------------------- |
| **Different training data location** | `DB_DIRECTORY` (Stage-1)                                      |
| **Different CSV for Stage-0**        | `CSV_FILE` in Stage-0 script                                  |
| **Hyper-parameters**                 | `num_train_epochs`, `learning_rate`, `max_length_param`, etc. |
| **Prompt engineering**               | Update the `prompt` variable (all inference scripts)          |
| **Generation strategy**              | `num_beams`, `temperature`, `do_sample` in inference scripts  |

Because each setting is a **Python constant near the top** of the script, you can also refactor to `argparse` if you prefer CLI flags.

---

## 🐛 Troubleshooting

| Symptom                             | Fix                                                                                                |
| ----------------------------------- | -------------------------------------------------------------------------------------------------- |
| `No module named 'src.HC...'`       | Add the project root to `PYTHONPATH` or `pip install -e .` your utilities.                         |
| CUDA present but script runs on CPU | Confirm you invoked Docker with `--gpus all` and that `torch.cuda.is_available()` prints **True**. |
| `FileNotFoundError` on PDFs/DOCX    | Adjust `DB_DIRECTORY`, input paths, or mount the right host folders into Docker.                   |

---

## 🤝 Contributing

Pull requests are welcome! Please open an issue first to discuss significant changes such as:

* Porting scripts to `argparse` or `typer`
* Adding automated evaluation metrics
* Supporting additional document formats

Remember to run `black` and `ruff` before submitting code.

---

## 📄 License

Apache-2.0 license – see `LICENSE` for details.