{"id":13585643,"url":"https://github.com/metachris/pdfx","last_synced_at":"2025-09-28T14:31:00.338Z","repository":{"id":41456403,"uuid":"44324461","full_name":"metachris/pdfx","owner":"metachris","description":"Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.","archived":true,"fork":false,"pushed_at":"2023-06-15T04:37:39.000Z","size":1816,"stargazers_count":1069,"open_issues_count":27,"forks_count":115,"subscribers_count":39,"default_branch":"master","last_synced_at":"2025-09-25T16:12:51.741Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://www.metachris.com/pdfx","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/metachris.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2015-10-15T14:49:37.000Z","updated_at":"2025-09-25T05:44:44.000Z","dependencies_parsed_at":"2024-01-14T11:07:29.297Z","dependency_job_id":null,"html_url":"https://github.com/metachris/pdfx","commit_stats":{"total_commits":84,"total_committers":7,"mean_commits":12.0,"dds":0.25,"last_synced_commit":"9e6864c5f9bcc8801e12c63a64d6efdfd1960494"},"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"purl":"pkg:github/metachris/pdfx","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metachris%2Fpdfx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metachris%2Fpdfx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metachris%2Fpdfx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/metachris%2Fpdfx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/metachris","download_url":"https://codeload.github.com/metachris/pdfx/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/metachris%2Fpdfx/sbom","scorecard":{"id":637599,"data":{"date":"2025-08-11","repo":{"name":"github.com/metachris/pdfx","commit":"9e6864c5f9bcc8801e12c63a64d6efdfd1960494"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3.4,"checks":[{"name":"Code-Review","score":1,"reason":"Found 4/25 approved changesets -- score normalized to 1","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Maintained","score":0,"reason":"project is archived","details":["Warn: Repository is archived."],"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive 
permissions","details":["Warn: no topLevel permission defined: .github/workflows/lint-and-test.yml:1","Warn: no topLevel permission defined: .github/workflows/publish-to-pypi.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/lint-and-test.yml:21: update your workflow using https://app.stepsecurity.io/secureworkflow/metachris/pdfx/lint-and-test.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/lint-and-test.yml:24: update your workflow using https://app.stepsecurity.io/secureworkflow/metachris/pdfx/lint-and-test.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish-to-pypi.yml:12: update your workflow using https://app.stepsecurity.io/secureworkflow/metachris/pdfx/publish-to-pypi.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish-to-pypi.yml:14: update your workflow using https://app.stepsecurity.io/secureworkflow/metachris/pdfx/publish-to-pypi.yml/master?enable=pin","Warn: third-party GitHubAction not pinned by hash: .github/workflows/publish-to-pypi.yml:26: update your workflow using https://app.stepsecurity.io/secureworkflow/metachris/pdfx/publish-to-pypi.yml/master?enable=pin","Warn: pipCommand not pinned by hash: .github/workflows/lint-and-test.yml:30","Warn: pipCommand not pinned by hash: .github/workflows/lint-and-test.yml:31","Warn: pipCommand not pinned by hash: .github/workflows/lint-and-test.yml:32","Warn: pipCommand not pinned by hash: .github/workflows/publish-to-pypi.yml:21","Info:   0 out of 
  4 GitHub-owned GitHubAction dependencies pinned","Info:   0 out of   1 third-party GitHubAction dependencies pinned","Info:   0 out of   4 pipCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: Apache License 2.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs 
release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Vulnerabilities","score":9,"reason":"1 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: PYSEC-2024-48 / GHSA-fj7x-q9j7-g6q6"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 9 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code 
analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-21T09:39:09.455Z","repository_id":41456403,"created_at":"2025-08-21T09:39:09.455Z","updated_at":"2025-08-21T09:39:09.455Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":277065547,"owners_count":25754431,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-26T02:00:09.010Z","response_time":78,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T15:05:03.547Z","updated_at":"2025-09-28T14:31:00.027Z","avatar_url":"https://github.com/metachris.png","language":"Python","funding_links":[],"categories":["Python","PYTHON","others","HarmonyOS","[](#table-of-contents) Table of contents"],"sub_categories":["Windows Manager","[](#routinedata-extraction-automation)Routine/Data Extraction Automation"],"readme":"# PDFx\n\n![Build status for master branch](https://github.com/metachris/pdfx/workflows/Lint%20and%20test/badge.svg)\n[![image](https://badge.fury.io/py/pdfx.svg)](https://pypi.python.org/pypi/pdfx)\n[![image](https://img.shields.io/badge/license-Apache-blue.svg)](https://github.com/metachris/pdfx/blob/master/LICENSE)\n\n## Introduction\n\nExtract references (pdf, url, doi, arxiv) and metadata from a PDF.\nOptionally download all referenced PDFs and check for broken links.\n\n**Features**\n\n-   
Extract references and metadata from a given PDF\n-   Detect PDF, URL, arXiv and DOI references\n-   **Fast, parallel download of all referenced PDFs**\n-   **Find broken hyperlinks** (using the `-c` flag)\n    ([more](https://www.metachris.com/2016/03/find-broken-hyperlinks-in-a-pdf-document-with-pdfx/))\n-   Output as text or JSON (using the `-j` flag)\n-   Extract the PDF text (using the `--text` flag)\n-   Use as a command-line tool or Python package\n-   Compatible with Python 2 and 3\n-   Works with local and online PDFs\n\n## Getting Started\n\nGrab a copy of the code with `easy_install` or `pip`, and run it:\n\n    $ sudo easy_install -U pdfx\n    ...\n    $ pdfx \u003cpdf-file-or-url\u003e\n\nRun `pdfx -h` to see the help output:\n\n    $ pdfx -h\n    usage: pdfx [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE]\n                [--version]\n                pdf\n\n    Extract metadata and references from a PDF, and optionally download all\n    referenced PDFs. Visit https://www.metachris.com/pdfx for more information.\n\n    positional arguments:\n      pdf                   Filename or URL of a PDF file\n\n    optional arguments:\n      -h, --help            show this help message and exit\n      -d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY\n                            Download all referenced PDFs into specified directory\n      -c, --check-links     Check for broken links\n      -j, --json            Output infos as JSON (instead of plain text)\n      -v, --verbose         Print all references (instead of only PDFs)\n      -t, --text            Only extract text (no metadata or references)\n      -o OUTPUT_FILE, --output-file OUTPUT_FILE\n                            Output to specified file instead of console\n      --version             show program's version number and exit\n\n## Examples\n\nLet's take a look at this paper,\n\u003chttps://weakdh.org/imperfect-forward-secrecy.pdf\u003e:\n\n    $ pdfx 
https://weakdh.org/imperfect-forward-secrecy.pdf\n    Document infos:\n    - CreationDate = D:20150821110623-04'00'\n    - Creator = LaTeX with hyperref package\n    - ModDate = D:20150821110805-04'00'\n    - PTEX.Fullbanner = This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1\n    - Pages = 13\n    - Producer = pdfTeX-1.40.14\n    - Title = Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice\n    - Trapped = False\n    - dc = {'title': {'x-default': 'Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice'}, 'creator': [None], 'description': {'x-default': None}, 'format': 'application/pdf'}\n    - pdf = {'Keywords': None, 'Producer': 'pdfTeX-1.40.14', 'Trapped': 'False'}\n    - pdfx = {'PTEX.Fullbanner': 'This is pdfTeX, Version 3.1415926-2.5-1.40.14 (TeX Live 2013/Debian) kpathsea version 6.1.1'}\n    - xap = {'CreateDate': '2015-08-21T11:06:23-04:00', 'ModifyDate': '2015-08-21T11:08:05-04:00', 'CreatorTool': 'LaTeX with hyperref package', 'MetadataDate': '2015-08-21T11:08:05-04:00'}\n    - xapmm = {'InstanceID': 'uuid:4e570f88-cd0f-4488-85ad-03f4435a4048', 'DocumentID': 'uuid:98988d37-b43d-4c1a-965b-988dfb2944b6'}\n\n    References: 36\n    - URL: 18\n    - PDF: 18\n\n    PDF References:\n    - http://www.spiegel.de/media/media-35533.pdf\n    - http://www.spiegel.de/media/media-35513.pdf\n    - http://www.spiegel.de/media/media-35509.pdf\n    - http://www.spiegel.de/media/media-35529.pdf\n    - http://www.spiegel.de/media/media-35527.pdf\n    - http://cr.yp.to/factorization/smoothparts-20040510.pdf\n    - http://www.spiegel.de/media/media-35517.pdf\n    - http://www.spiegel.de/media/media-35526.pdf\n    - http://www.spiegel.de/media/media-35519.pdf\n    - http://www.spiegel.de/media/media-35522.pdf\n    - http://cryptome.org/2013/08/spy-budget-fy13.pdf\n    - http://www.spiegel.de/media/media-35515.pdf\n    - http://www.spiegel.de/media/media-35514.pdf\n    - 
http://www.hyperelliptic.org/tanja/SHARCS/talks06/thorsten.pdf\n    - http://www.spiegel.de/media/media-35528.pdf\n    - http://www.spiegel.de/media/media-35671.pdf\n    - http://www.spiegel.de/media/media-35520.pdf\n    - http://www.spiegel.de/media/media-35551.pdf\n\nYou can use the `-v` flag to output all references instead of just the\nPDFs.\n\n**Download all referenced PDFs** with `-d` (short for `--download-pdfs`) to the\nspecified directory (e.g. `/tmp/`):\n\n    $ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -d /tmp/\n    ...\n\nTo **extract text**, you can use the `-t` flag:\n\n    # Extract text to console\n    $ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -t\n\n    # Extract text to file\n    $ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -t -o pdf-text.txt\n\nTo **check for broken links**, use the `-c` flag:\n\n    $ pdfx https://weakdh.org/imperfect-forward-secrecy.pdf -c\n\n[Example (with video) of checking for broken\nlinks](https://www.metachris.com/2016/03/find-broken-hyperlinks-in-a-pdf-document-with-pdfx/).\n\n## Usage as Python library\n\n    \u003e\u003e\u003e import pdfx\n    \u003e\u003e\u003e pdf = pdfx.PDFx(\"filename-or-url.pdf\")\n    \u003e\u003e\u003e metadata = pdf.get_metadata()\n    \u003e\u003e\u003e references_list = pdf.get_references()\n    \u003e\u003e\u003e references_dict = pdf.get_references_as_dict()\n    \u003e\u003e\u003e pdf.download_pdfs(\"target-directory\")\n\n## Dev \u0026 Contributing\n\n```bash\n# Set up and activate a venv\npython3 -m venv venv\n. venv/bin/activate\n\n# Install PDFx and dev deps\npip install -e .\npip install -r requirements_dev.txt\n\n# Run tests and checks\nmake test\nmake lint\nmake check\n\n# Format the code (with black)\nmake format\n```\n\n### Releasing\n\n* Update version number in `setup.py` and `pdfx/__init__.py`\n* Create a git tag starting with `v` (e.g. 
`git tag v1.5.9`)\n* Push the tag to GitHub: `git push --tags`\n\nGitHub Actions then publishes the release to PyPI.\n\n\n## Various\n\n- Author: Chris Hager [twitter.com/metachris](https://twitter.com/metachris)\n- Homepage: https://www.metachris.com/pdfx\n- License: Apache 2.0\n\nFeedback, ideas and pull requests are welcome!\n\n\n## Improvement Ideas\n\nPossible:\n\n- Timeout (see [#43](https://github.com/metachris/pdfx/issues/43))\n- Cuts off links that span two lines [#40](https://github.com/metachris/pdfx/issues/40)\n- Include Check-Links Results in Output [#39](https://github.com/metachris/pdfx/issues/39)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmetachris%2Fpdfx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmetachris%2Fpdfx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmetachris%2Fpdfx/lists"}