{"id":25231485,"url":"https://github.com/notshrirang/data-extractor-app","last_synced_at":"2026-05-19T04:33:56.466Z","repository":{"id":220454697,"uuid":"750774085","full_name":"NotShrirang/Data-Extractor-App","owner":"NotShrirang","description":"This Python script is designed to extract structured data from PDF files containing information such as Company Identification Number (CIN), email addresses, PAN (Permanent Account Number), phone numbers, dates, and websites. The script utilizes the PyPDF2 library for PDF processing and multiprocessing for efficient extraction from multiple PDFs.","archived":false,"fork":false,"pushed_at":"2024-02-02T06:25:53.000Z","size":1661,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-16T05:13:43.292Z","etag":null,"topics":["multiprocessing","pypdf2","selenium"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NotShrirang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-01-31T09:44:11.000Z","updated_at":"2024-02-03T12:48:34.000Z","dependencies_parsed_at":"2024-02-02T06:26:59.431Z","dependency_job_id":"da4f3dcc-701f-48f7-87fe-f636305584a9","html_url":"https://github.com/NotShrirang/Data-Extractor-App","commit_stats":null,"previous_names":["notshrirang/data-extractor-app"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/NotShrirang/Data-Extractor-App","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NotShrirang%2FData-Extractor-App","tags_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NotShrirang%2FData-Extractor-App/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NotShrirang%2FData-Extractor-App/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NotShrirang%2FData-Extractor-App/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NotShrirang","download_url":"https://codeload.github.com/NotShrirang/Data-Extractor-App/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NotShrirang%2FData-Extractor-App/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33201923,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-18T09:27:30.708Z","status":"online","status_checked_at":"2026-05-19T02:00:06.763Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["multiprocessing","pypdf2","selenium"],"created_at":"2025-02-11T12:28:49.384Z","updated_at":"2026-05-19T04:33:56.436Z","avatar_url":"https://github.com/NotShrirang.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Extractor App\n\nThis Python script is designed to extract structured data from PDF files containing information such as Company Identification Number (CIN), email addresses, PAN (Permanent Account Number), phone numbers, dates, and websites. 
The script utilizes the PyPDF2 library for PDF processing and multiprocessing for efficient extraction from multiple PDFs.\n\n## Table of Contents\n\n- [Prerequisites](#prerequisites)\n- [Installation](#installation)\n- [Usage](#usage)\n- [Configuration](#configuration)\n\n## Prerequisites\n\n- Python 3.x\n- Required Python libraries (install via `pip install -r requirements.txt`):\n  - `selenium`\n  - `PyPDF2`\n\n## Installation\n\n1. Clone the repository:\n\n   ```bash\n   git clone https://github.com/NotShrirang/Data-Extractor-App.git\n   ```\n\n2. Navigate to the project directory:\n\n   ```bash\n   cd Data-Extractor-App\n   ```\n\n3. Install the required dependencies:\n\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n## Usage\n\n1. Edit the `config.json` file to configure URLs for PDFs.\n\n2. Run the main script:\n\n   ```bash\n   python main.py\n   ```\n\n   To run with `multiprocessing`:\n\n   ```bash\n   python main.py multiprocessing\n   ```\n\n3. The extracted data will be saved as `output.json` in the project directory.\n\n## Configuration\n\n- **config.json**: This file contains the configuration for the script. It includes the list of PDF URLs and `page_count`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnotshrirang%2Fdata-extractor-app","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnotshrirang%2Fdata-extractor-app","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnotshrirang%2Fdata-extractor-app/lists"}