{"id":29129052,"url":"https://github.com/metalshanked/pdf-extractor","last_synced_at":"2026-02-12T13:39:03.283Z","repository":{"id":300271030,"uuid":"1005738968","full_name":"metalshanked/pdf-extractor","owner":"metalshanked","description":"A Streamlit web application that extracts and displays metadata and text content from PDF files.","archived":false,"fork":false,"pushed_at":"2025-06-24T21:56:49.000Z","size":252,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-30T02:43:28.635Z","etag":null,"topics":["ai","document","llm","pdf","streamlit"],"latest_commit_sha":null,"homepage":"https://pdf-miner.streamlit.app/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/metalshanked.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-20T18:19:19.000Z","updated_at":"2025-06-24T21:56:52.000Z","dependencies_parsed_at":null,"dependency_job_id":"674fadcc-bec3-40b0-958f-71e5e50122b5","html_url":"https://github.com/metalshanked/pdf-extractor","commit_stats":null,"previous_names":["metalshanked/pdf-extractor"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/metalshanked/pdf-extractor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metalshanked%2Fpdf-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metalshanked%2Fpdf-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metalshanked%2Fpdf-extractor/releases
","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metalshanked%2Fpdf-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/metalshanked","download_url":"https://codeload.github.com/metalshanked/pdf-extractor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metalshanked%2Fpdf-extractor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29367404,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-12T08:51:36.827Z","status":"ssl_error","status_checked_at":"2026-02-12T08:51:26.849Z","response_time":55,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","document","llm","pdf","streamlit"],"created_at":"2025-06-30T02:37:59.032Z","updated_at":"2026-02-12T13:39:03.272Z","avatar_url":"https://github.com/metalshanked.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PDF Extractor\n\nA Streamlit web application that extracts and displays metadata and text content from PDF files.\n\n![PDF Extractor Screenshot](screenshot1.png)\n\n## Features\n\n- **Upload Multiple PDFs**: Upload one or more PDF files through a simple interface\n- **Extract Metadata**: Automatically extract all available metadata from each PDF\n- **View PDF Content**: View the full text content of each PDF\n- **Tab Navigation**: Easily 
navigate between multiple PDFs using tabs\n- **Export to CSV**: Export all metadata to a CSV file for further analysis\n- **Clean UI**: Streamlined user interface with custom styling\n\n## Installation\n\n### Prerequisites\n\n- Python 3.7 or higher\n- pip (Python package installer)\n\n### Setup\n\n1. Clone this repository:\n   ```\n   git clone https://github.com/metalshanked/pdf-extractor.git\n   cd pdf-extractor\n   ```\n\n2. Install the required dependencies:\n   ```\n   pip install -r requirements.txt\n   ```\n\n## Usage\n\n### Running the Application\n\nRun the application with the following command:\n\n```\nstreamlit run pdf_extractor.py\n```\n\nFor deployment with a custom base URL path:\n\n```\nstreamlit run pdf_extractor.py --server.baseUrlPath=\"/pdf\"\n```\n\n### Using the Application\n\n1. **Upload PDF Files**:\n   - Click the \"Choose PDF files\" button in the sidebar\n   - Select one or more PDF files from your computer\n\n2. **View Metadata**:\n   - The application will automatically extract and display metadata for each PDF\n   - Navigate between PDFs using the tabs at the top\n\n3. **View PDF Content**:\n   - Click the \"PDF DATA\" expander to view the full text content of the PDF\n\n4. **Export Metadata**:\n   - Use the \"Export Metadata\" button in the sidebar to download a CSV file\n   - Optionally include the full PDF text content in the export\n\n## Docker Support\n\nA Dockerfile is included for containerized deployment:\n\n```\ndocker build -t pdf-extractor .\ndocker run -p 8501:8501 pdf-extractor\n```\n\nTo run the application with a custom base URL path in Docker:\n\n```\ndocker run -p 8501:8501 -e BASE_URL_PATH=\"/pdf\" pdf-extractor\n```\n\nThe BASE_URL_PATH environment variable is optional. 
If not specified, the application will run at the root path.\n\n## Technical Details\n\n### Dependencies\n\n- **streamlit**: Web application framework\n- **pdfminer.six**: PDF parsing and text extraction\n- **pandas**: Data manipulation and CSV export\n\n### Code Structure\n\n- **pdf_extractor.py**: Main application file containing:\n  - PDF metadata extraction functions\n  - Text content extraction\n  - Streamlit UI components\n  - CSV export functionality\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmetalshanked%2Fpdf-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmetalshanked%2Fpdf-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmetalshanked%2Fpdf-extractor/lists"}