{"id":25911414,"url":"https://github.com/copyleftdev/pdf_ai_poc","last_synced_at":"2025-03-03T09:17:27.089Z","repository":{"id":249212225,"uuid":"830782347","full_name":"copyleftdev/pdf_ai_poc","owner":"copyleftdev","description":null,"archived":false,"fork":false,"pushed_at":"2024-07-19T01:59:36.000Z","size":18,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-07-19T10:32:10.810Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/copyleftdev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-19T01:51:33.000Z","updated_at":"2024-07-19T10:32:22.539Z","dependencies_parsed_at":"2024-07-19T10:42:41.771Z","dependency_job_id":null,"html_url":"https://github.com/copyleftdev/pdf_ai_poc","commit_stats":null,"previous_names":["copyleftdev/pdf_ai_poc"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/copyleftdev%2Fpdf_ai_poc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/copyleftdev%2Fpdf_ai_poc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/copyleftdev%2Fpdf_ai_poc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/copyleftdev%2Fpdf_ai_poc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/copyleftdev","download_url":"https://codeload.github.com/copyleftdev/pdf_ai_poc/tar.gz/refs/he
ads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241637273,"owners_count":19994946,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-03T09:17:26.472Z","updated_at":"2025-03-03T09:17:27.080Z","avatar_url":"https://github.com/copyleftdev.png","language":"Python","readme":"# PDF Data Extractor\n\nThis project extracts data from a PDF file using OpenAI's API and maps it to a predefined JSON schema.\n\n## Project Structure\n\n```\npdf_extractor/\n│\n├── config/\n│   └── config.py\n│\n├── data/\n│   └── schema.json\n│\n├── src/\n│   ├── __init__.py\n│   ├── pdf_utils.py\n│   ├── openai_utils.py\n│   └── mapper.py\n│\n├── generate_sample_pdf.py\n├── main.py\n├── Pipfile\n├── Pipfile.lock\n├── ruff.toml\n└── README.md\n```\n\n## Setup\n\n### Prerequisites\n\n1. **Python 3.9 or higher**: Ensure you have Python 3.9+ installed.\n2. **Pipenv**: Ensure you have `pipenv` installed.\n\n### Installation\n\n1. Clone the repository:\n\n   ```bash\n   git clone https://github.com/yourusername/pdf_extractor.git\n   cd pdf_extractor\n   ```\n\n2. Install dependencies using `pipenv`:\n\n   ```bash\n   pipenv install\n   pipenv install reportlab  # For generating the sample PDF\n   pipenv install --dev ruff  # For linting\n   ```\n\n3. 
Set up the OpenAI API key:\n\n   - Create a `.env` file in the root directory and add your OpenAI API key:\n\n     ```\n     OPEN_API_KEY=your-openai-api-key\n     ```\n\n### Generate Sample PDF\n\nGenerate a sample PDF with test data to use for extraction:\n\n```bash\npipenv run python generate_sample_pdf.py\n```\n\n### Running the Code\n\nTo extract data from the PDF and map it to the JSON schema:\n\n```bash\npipenv run python main.py\n```\n\n### Linting with Ruff\n\nTo check your code for linting errors with `ruff`, run:\n\n```bash\npipenv run ruff check .\n```\n\nTo automatically fix linting errors with `ruff`, run:\n\n```bash\npipenv run ruff check --fix .\n```\n\n## How It Works\n\n1. **Configuration**: The `config/config.py` file loads configuration settings and the OpenAI API key from environment variables.\n\n2. **PDF Generation**: The `generate_sample_pdf.py` script generates a sample PDF with email addresses, dates, and phone numbers.\n\n3. **PDF Text Extraction**: The `src/pdf_utils.py` file contains the `extract_text_from_pdf` function, which extracts text from the PDF.\n\n4. **Data Extraction Using OpenAI**: The `src/openai_utils.py` file contains the `extract_data_with_openai` function, which uses OpenAI's API to extract data from the extracted text based on predefined prompts.\n\n5. **Mapping Data to JSON Schema**: The `src/mapper.py` file contains the `load_json_schema` and `map_to_json_schema` functions, which load the JSON schema and map the extracted data to the schema.\n\n6. 
**Main Script**: The `main.py` script orchestrates the entire process: it loads the JSON schema, extracts text from the PDF, uses OpenAI's API to extract data, maps the data to the JSON schema, and prints the mapped data as JSON.\n\n## Example Output\n\nAfter running `main.py`, the output should be a JSON object containing the extracted email addresses, dates, and phone numbers from the sample PDF:\n\n```json\n{\n    \"email\": [\n        \"example1@example.com\",\n        \"example2@example.com\"\n    ],\n    \"date\": [\n        \"01/01/2023\",\n        \"02/02/2023\"\n    ],\n    \"phone\": [\n        \"(123) 456-7890\",\n        \"(987) 654-3210\"\n    ]\n}\n```\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcopyleftdev%2Fpdf_ai_poc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcopyleftdev%2Fpdf_ai_poc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcopyleftdev%2Fpdf_ai_poc/lists"}