{"id":13625888,"url":"https://github.com/0xabu/pdfannots","last_synced_at":"2025-04-16T10:33:37.726Z","repository":{"id":39620621,"uuid":"80269237","full_name":"0xabu/pdfannots","owner":"0xabu","description":"Extracts and formats text annotations from a PDF file","archived":false,"fork":false,"pushed_at":"2025-01-08T11:09:04.000Z","size":1489,"stargazers_count":573,"open_issues_count":12,"forks_count":100,"subscribers_count":14,"default_branch":"main","last_synced_at":"2025-04-11T12:14:54.023Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/0xabu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-01-28T05:55:06.000Z","updated_at":"2025-04-10T19:50:13.000Z","dependencies_parsed_at":"2023-01-29T18:30:40.709Z","dependency_job_id":"b8ce25ec-762b-4a3e-8bdc-31a74f4a32e9","html_url":"https://github.com/0xabu/pdfannots","commit_stats":{"total_commits":160,"total_committers":7,"mean_commits":"22.857142857142858","dds":"0.11875000000000002","last_synced_commit":"57fd55da0aa2b984ed7ea6f1443d1d206516e53d"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xabu%2Fpdfannots","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xabu%2Fpdfannots/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xabu%2Fpdfannots/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/0xabu%2
Fpdfannots/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/0xabu","download_url":"https://codeload.github.com/0xabu/pdfannots/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249228273,"owners_count":21233852,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T21:02:04.654Z","updated_at":"2025-04-16T10:33:37.408Z","avatar_url":"https://github.com/0xabu.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"## pdfannots\n\n[![Build status](https://github.com/0xabu/pdfannots/actions/workflows/python-checks.yml/badge.svg)](https://github.com/0xabu/pdfannots/actions/workflows/python-checks.yml)\n[![PyPI version](https://img.shields.io/pypi/v/pdfannots)](https://pypi.org/project/pdfannots/)\n\nThis program extracts annotations (highlights, comments, etc.) from a PDF file,\nand formats them as Markdown or exports them to JSON. It is primarily intended\nfor use in reviewing submissions to scientific conferences/journals.\n\n![Sample/demo of pdfannots extracting Markdown from an annotated PDF](doc/demo.png)\n\nFor the default Markdown format, the output is as follows:\n\n * Highlights without an attached comment are output first, as\n   \"highlights\" with just the highlighted text included. 
Note that\n   these are not typically suitable for use in a review, since they're\n   unlikely to have any meaning to the recipient; they are just meant\n   to serve as a reminder to the reviewer.\n\n * Highlights with an attached comment, and text annotations (not\n   attached to any particular text/highlight) are output next, as\n   \"detailed comments\". Typically most comments on a reviewed paper\n   are of this form.\n\n * Underline, strikeout, and squiggly underline annotations are output\n   last, as \"Nits\", with or without an attached comment. The intention\n   of this is to easily separate formatting or grammatical corrections\n   from more substantial comments about the content of the document.\n\nFor each annotation, the page number is given, along with the associated\n(highlighted/underlined) text, if any. Additionally, if the document embeds\noutlines (aka bookmarks), such as those generated by the LaTeX\n[hyperref](https://ctan.org/pkg/hyperref) package, they are printed to help\nidentify to which section in the document the annotation refers.\n\n\n### Installation\n\nTo install the latest released version from PyPI, use a command such as:\n```\npython3 -m pip install pdfannots\n```\n\n\n### Usage\n\nSee `pdfannots --help` (in a source tree: `pdfannots.py --help`) for\noptions and invocation.\n\n\n### Dependencies\n\n * Python \u003e= 3.8\n * [pdfminer.six](https://github.com/pdfminer/pdfminer.six)\n\n\n### Known issues and limitations\n\n * While it is generally reliable, pdfminer (the underlying PDF parser) is\n   not infallible at extracting text from a PDF. 
It has been known to fail\n   in several different ways:\n\n    * Sometimes it misses or misplaces individual characters, resulting in\n      annotations with some or all of the text missing (in the latter case,\n      you'll see a warning).\n\n    * Sometimes the characters are captured, but not spaces between the words.\n      Tweaking the advanced layout analysis parameters (e.g., `--word-margin`)\n      may help with this.\n\n    * Sometimes it extracts all the text but renders it out of order, for\n      example, reporting that text at the top of a second column comes before\n      text at the end of the first column. This causes pdfannots to return the\n      annotations out of order, or to report the wrong outlines (section\n      headings) for annotations. You can mostly work around this issue by using\n      the `--cols` parameter to force a fixed page layout for the document\n      (e.g. `--cols=2` for a typical 2-column document).\n\n * If an annotation (such as a StrikeOut) covers solely whitespace, no text is\n   extracted for the annotation, and it will be skipped (with a warning). This\n   is an artifact of the way pdfminer reports whitespace with only an implicit\n   position defined by surrounding characters.\n\n * When extracting text, we remove all hyphens that immediately precede a line\n   break and join the adjacent words. This usually produces the best results\n   with LaTeX multi-column documents (e.g. \"soft-`\\n`ware\" becomes \"software\"),\n   but sometimes the hyphen needs to stay (e.g. \"memory-`\\n`mapped\", which will be\n   extracted as \"memorymapped\"), and we can't tell the difference. To disable\n   this behaviour, pass `--keep-hyphens`.\n\n\n### FAQ\n\n 1. 
I'd like to change how the output is formatted.\n\n    Some minor tweaks (e.g., word wrap, skipping or reordering output sections)\n    can be accomplished via command-line arguments.\n\n    All of the output comes from the relevant `Printer` subclass; more elaborate\n    changes can be accomplished there. Pull requests to introduce new output\n    formats or variants as printers are welcome.\n\n 2. I think I got a review generated by this tool...\n\n    I hope that it was a constructive review, and that the annotations\n    helped the reviewer give you more detailed feedback so you can improve\n    your paper. This is, after all, just a tool, and it should not be an\n    excuse for reviewer sloppiness.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F0xabu%2Fpdfannots","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F0xabu%2Fpdfannots","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F0xabu%2Fpdfannots/lists"}