{"id":13644020,"url":"https://github.com/jstrieb/paperify","last_synced_at":"2025-04-06T22:08:37.009Z","repository":{"id":192720913,"uuid":"687268920","full_name":"jstrieb/paperify","owner":"jstrieb","description":"Transform any document, web page, or eBook into a research paper (ChatGPT not required)","archived":false,"fork":false,"pushed_at":"2023-09-06T00:12:45.000Z","size":5494,"stargazers_count":374,"open_issues_count":0,"forks_count":17,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-30T20:11:24.937Z","etag":null,"topics":["latex","pandoc","posix-sh","research-paper","shell","shell-script"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jstrieb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-09-05T02:26:00.000Z","updated_at":"2025-03-29T19:00:11.000Z","dependencies_parsed_at":null,"dependency_job_id":"3ab52fd1-0fa5-4a80-a16b-4d6509c82ccf","html_url":"https://github.com/jstrieb/paperify","commit_stats":null,"previous_names":["jstrieb/paperify"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jstrieb%2Fpaperify","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jstrieb%2Fpaperify/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jstrieb%2Fpaperify/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jstrieb%2Fpaperify/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jstrieb","download_ur
l":"https://codeload.github.com/jstrieb/paperify/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247557767,"owners_count":20958047,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["latex","pandoc","posix-sh","research-paper","shell","shell-script"],"created_at":"2024-08-02T01:01:56.562Z","updated_at":"2025-04-06T22:08:36.975Z","avatar_url":"https://github.com/jstrieb.png","language":"Shell","funding_links":[],"categories":["CLIs","Shell"],"sub_categories":[],"readme":"# Paperify\n\nPaperify transforms any document, web page, or ebook into a research paper.\n\nThe text of the generated paper is the same as the text of the original\ndocument, but figures and equations from real papers are interspersed\nthroughout. 
\n\nA paper title and abstract are added (optionally generated by ChatGPT, if you\nprovide an API key), and the entire paper is compiled with the IEEE $\\LaTeX$\ntemplate for added realism.\n\n\u003cdiv align=\"center\"\u003e\n\n![example](https://github.com/jstrieb/paperify/assets/7355528/6233c47e-fbff-4a71-8991-09ba3112f241)\n\n\u003c/div\u003e\n\n\n# Install\n\nFirst, install the dependencies (or [use Docker](#docker)):\n\n- curl\n- Python 3\n- Pandoc\n- jq\n- LaTeX (via TeXLive)\n- ImageMagick (optional)\n\nFor example, on Debian-based systems (_e.g._, Debian, Ubuntu, Kali, WSL):\n\n``` bash\nsudo apt update\nsudo apt install --no-install-recommends \\\n  pandoc \\\n  curl ca-certificates \\\n  jq \\\n  python3 \\\n  imagemagick \\\n  texlive texlive-publishers texlive-science lmodern texlive-latex-extra\n```\n\nThen, clone the repo (or directly pull the script), and execute it.\n\n``` bash\ncurl -L https://github.com/jstrieb/paperify/raw/master/paperify.sh \\\n  | sudo tee /usr/local/bin/paperify\nsudo chmod +x /usr/local/bin/paperify\n\npaperify -h\n```\n\n\n# Examples\n\n- [`examples/cox.pdf`](examples/cox.pdf)\n\n  Convert [Russ Cox's transcript of Doug McIlroy's talk on the history of Bell\n  Labs](https://research.swtch.com/bell-labs) into a paper saved to the `/tmp/`\n  directory as `article.pdf`. \n\n  ```\n  paperify \\\n    --from-format html \\\n    \"https://research.swtch.com/bell-labs\" \\\n    /tmp/article.pdf\n  ```\n\n- [`examples/london.pdf`](examples/london.pdf)\n  \n  Download figures and equations from the 1000 latest computer science papers\n  on `arXiv.org`. Intersperse the figures and equations into Jack London's\n  _Call of the Wild_ with a higher-than-default equation frequency. Use ChatGPT\n  to generate a paper title, author, abstract, and metadata for an imaginary\n  paper on soft body robotics. 
Save the file in the current directory as\n  `london.pdf`.\n\n  ```\n  paperify \\\n    --arxiv-category cs \\\n    --num-papers 1000 \\\n    --equation-frequency 18 \\\n    --chatgpt-token \"sk-[REDACTED]\" \\\n    --chatgpt-topic \"soft body robotics\" \\\n    \"https://standardebooks.org/ebooks/jack-london/the-call-of-the-wild/downloads/jack-london_the-call-of-the-wild.epub\" \\\n    london.pdf\n  ```\n\n## Docker\n\nAlternatively, run Paperify from within a Docker container. To run the first\nexample from within Docker and build to `./build/cox.pdf`:\n\n``` bash\ndocker run \\\n  --rm \\\n  -it \\\n  --volume \"$(pwd)/build\":/root/build \\\n  jstrieb/paperify \\\n    --from-format html \\\n    \"https://research.swtch.com/bell-labs\" \\\n    build/cox.pdf\n```\n\n\n# Usage\n\n```\nusage: paperify [OPTIONS] \u003cURL or path\u003e \u003coutput file\u003e\n\nOPTIONS:\n  --temp-dir \u003cDIR\u003e            Directory for assets (default: /tmp/paperify)\n  --from-format \u003cFORMAT\u003e      Format of input file (default: input suffix)\n  --arxiv-category \u003cCAT\u003e      arXiv.org paper category (default: math)\n  --num-papers \u003cNUM\u003e          Number of papers to download (default: 100)\n  --max-parallelism \u003cPROCS\u003e   Maximum simultaneous processes (default: 32)\n  --figure-frequency \u003cN\u003e      Chance of a figure is 1/N per paragraph (default: 25)\n  --equation-frequency \u003cN\u003e    Chance of an equation is 1/N per paragraph (default: 25)\n  --max-size \u003cBYTES\u003e          Max allowed image size in bytes (default 2500000)\n  --min-equation-length \u003cN\u003e   Minimum equation length in characters (default 5)\n  --max-equation-length \u003cN\u003e   Maximum equation length in characters (default 120)\n  --min-caption-length \u003cN\u003e    Minimum figure caption length in characters (default 20)\n  --chatgpt-token \u003cTOKEN\u003e     ChatGPT token to generate paper title, abstract, etc.\n  --chatgpt-topic 
\u003cTOPIC\u003e     Paper topic ChatGPT will generate metadata for\n  --quiet                     Don't log statuses\n  --skip-downloading          Don't download papers from arXiv.org\n  --skip-extracting           Don't extract equations and captions\n  --skip-metadata             Don't regenerate metadata\n  --skip-filtering            Don't filter out large files or non-diagram images\n```\n\nNote that the `--skip-*` flags are useful when you have already run the script\nonce and do not want to repeat the process of downloading and extracting data.\n\n\n# Known Issues\n\n- Images with query parameters in the `src` URL of some web pages are extracted\n  by Pandoc with the query parameters in the filename, and LaTeX gives errors\n  about \"unknown file extension\" when compiling.\n- Papers may contain images that are not diagrams, such as portraits of the\n  authors or institution logos. Paperify uses a highly imperfect heuristic to\n  remove these if the `convert` command line tool is present: only images with\n  white, nearly-white, or transparent pixels in the top left and bottom right\n  corners are kept. This works surprisingly well, but there are always some\n  false positives and false negatives.\n- Non-ASCII Unicode characters cannot be processed by `pdflatex`, and will be\n  stripped before the PDF is compiled.\n- Paperify uses Markdown as a (purposefully) lossy [intermediate\n  representation](https://en.wikipedia.org/wiki/Intermediate_representation)\n  for documents before they are converted to LaTeX. As a result, information\n  and styling from the original may be stripped.\n- A handful of papers contain huge numbers of images. The ones that do this\n  also tend to have some of the worst images. Images can be manually pruned\n  from the `/tmp/paperify/images` directory, and the same command can be re-run\n  with the `--skip-*` flags to rebuild the paper using new figures and\n  equations.\n- Different systems install different LaTeX packages. 
If you're missing\n  packages, you may want to bite the bullet and `apt install texlive-full`.\n  It's very big, but it's got everything you'll ever need in there.\n- Figure captions usually have nothing to do with figures themselves.\n- No matter how convincing a paper may appear, anyone looking over your\n  shoulder who actually reads the words will know very quickly that something\n  is off.\n- Side effects of reading the code include nausea, dizziness, confusion,\n  bleeding from the eyes, and deep love/hatred for the creators of Unix\n  pipelines.\n\n\n# How to Read the Code\n\nIn general, I'm a proponent of reading (or at least skimming) code before you\nrun it, when possible. Usually, my code is written to be read. In this case,\nnot so much.\n\nApologies in advance to anyone who tries to read the code. It started as four\nvery cursed lines of Bash (without line wrapping) that I attempted to clean up\na little. It is now many more than four lines of Bash, most of which remain\nvery cursed. The small Python portion is particularly hard on the eyes, though\nit may possess a grotesque beauty for true functional programmers.\n\nEverything is in `paperify.sh`. It can be read top-to-bottom or bottom-to-top,\nand there is a fat LaTeX template as a heredoc smack in the middle.\n\n\n# Project Status\n\nStrange as it may sound, this project is complete. I want to live in a world\nwhere working software doesn't always grow until it becomes a Lovecraftian\nspaghetti monster. \n\nI have added every feature that I wanted to add. It does what I wanted it to\ndo, as well as I wanted it to do it. No further development required. \n\nAs such, I will try to address issues opened on GitHub, but I do not expect to\naddress feature requests. 
I may merge pull requests.\n\nEven if there are no recent commits, I'm hopeful that this script will continue\nto work many years from now.\n\n\n# Greetz \u0026 Acknowledgments\n\nGreetz to several unnamed friends who offered helpful commentary prior to\nrelease. \n\nSpecial shout out to the friends who suggested, as a follow-up project, making\na browser extension to transform the current web page into a scientific paper.\nSort of like Firefox reader mode, but for viewing Twitter when someone looking\nover your shoulder expects you to be doing something else.\n\nThanks to [arXiv.org](https://arxiv.org) for hosting tons of papers with LaTeX\nsource to mine. \n\nGreetz to Project Gutenberg, Standard Ebooks, and Alexandra Elbakyan.\n\nLovingly released on Labor Day 2023; dedicated to procrastinating laborers of\nknowledge.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjstrieb%2Fpaperify","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjstrieb%2Fpaperify","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjstrieb%2Fpaperify/lists"}