{"id":26304362,"url":"https://github.com/neuralwork/arxiver","last_synced_at":"2025-10-25T22:50:11.919Z","repository":{"id":260698707,"uuid":"882084609","full_name":"neuralwork/arxiver","owner":"neuralwork","description":"Codebase for the arxiver dataset","archived":false,"fork":false,"pushed_at":"2024-11-29T16:39:59.000Z","size":31,"stargazers_count":14,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-05-12T18:54:50.327Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/neuralwork.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-01T21:11:51.000Z","updated_at":"2025-03-14T10:25:42.000Z","dependencies_parsed_at":"2024-11-01T23:18:18.510Z","dependency_job_id":"637d822e-d53d-4457-bb80-48f04b04882f","html_url":"https://github.com/neuralwork/arxiver","commit_stats":null,"previous_names":["neuralwork/arxiver"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/neuralwork/arxiver","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuralwork%2Farxiver","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuralwork%2Farxiver/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuralwork%2Farxiver/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuralwork%2Farxiver/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/neur
alwork","download_url":"https://codeload.github.com/neuralwork/arxiver/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuralwork%2Farxiver/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281032293,"owners_count":26432755,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-25T02:00:06.499Z","response_time":81,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-15T08:16:05.327Z","updated_at":"2025-10-25T22:50:11.878Z","avatar_url":"https://github.com/neuralwork.png","language":"Python","funding_links":["https://ko-fi.com/Z8Z616R4PF'"],"categories":[],"sub_categories":[],"readme":"# Arxiver\n\nA toolkit for downloading and converting arXiv papers to multi markdown (.mmd) format with Nougat - a neural OCR. Our pipeline can extract LaTeX equations and includes post-processing tools to clean up and merge extracted data. 
See the [arxiver](https://huggingface.co/datasets/neuralwork/arxiver) dataset on Hugging Face Hub for sample results.\n\n## Project Structure\n```\narxiver/\n    arxiv-tools/          # Tools for downloading arXiv papers\n    utils/                # Utility files to check processed data, get article metadata, etc.\n    run_nougat.py         # Batch PDF processing script to extract text in .mmd format\n    job_status_server.py  # Web server to monitor extraction progress\n    postprocess.py        # Post-processing script to clean and merge Nougat outputs\n```\n\n## Downloading arXiv\n\nThe `arxiv-tools` folder contains scripts for downloading arXiv papers and computing useful statistics about the arXiv dataset. For detailed instructions, see the [arxiv-tools README](arxiv-tools/README.md). Downloading and extracting the dataset creates a hierarchical folder structure organized by publication year and month as follows:\n\n```\noutput_dir/\n    2310/           # October 2023\n        paper1.pdf\n        paper2.pdf\n    2311/           # November 2023\n        paper3.pdf\n        paper4.pdf\n```\n\n## Nougat Processing\n\nThe `run_nougat.py` script processes PDF files in batches using the [Nougat](https://arxiv.org/abs/2308.13418) neural OCR model:\n\n```bash\npython run_nougat.py \\\n    --input_dir /path/to/datadir \\\n    --output_dir /path/to/output \\\n    --gpu_id 0 \\\n    --batch_size 8\n```\n\nYou can run Nougat by passing the output directory of the download step as the input argument. Running this script processes PDFs in batches on the specified GPU and logs successful and failed jobs (Nougat is not 100% stable). 
The output maintains the same year-month-based subdirectory structure but saves each page separately:\n```\noutput_dir/\n    2310/\n        paper1_1.mmd    # Paper 1, page 1\n        paper1_2.mmd    # Paper 1, page 2\n        paper2_1.mmd\n    2311/\n        paper3_1.mmd\n        paper3_2.mmd\n        paper4_1.mmd\n```\n\n#### Progress Monitoring\nWe provide an optional script, `job_status_server.py`, that serves a web interface for monitoring processing progress:\n\n```bash\npython job_status_server.py \\\n    --input_dir /path/to/pdf/files \\\n    --output_dir /path/to/output \\\n    --port 8005\n```\n\n\n## Post-Processing\nThe post-processing pipeline includes several steps to validate and clean up the Nougat output. You can optionally check how many of the papers have been fully processed (all pages successfully extracted) by running:\n```bash\ncd utils\npython check_complete_results.py --pdf-dir /path/to/pdf/root/dir --mmd-dir /path/to/mmd/root/dir\n```\n\nYou can use the output .mmd files as they are, or run post-processing to remove headers and references and merge multi-page MMD files into single documents. 
To do this, run the post-processing script:\n```bash\ncd ..\npython postprocess.py --input-dir /path/to/processed-data --output-dir /path/to/output\n```\n\nNote that this script preserves the original hierarchical folder structure organized by publication year and month.\n\n#### Metadata Extraction\nYou can optionally get article metadata by running:\n```bash\ncd utils\npython extract_metadata.py --input-dir /path/to/merged-mmd-folder\n```\n\n## Notes\n- GPU with CUDA support is required for efficient processing\n- Tested on an NVIDIA T4 GPU, processing speed depends on GPU memory and batch size\n- arxiv-tools/ is adapted from the original [repo](https://github.com/armancohan/arxiv-tools)\n\n\u003ca href='https://ko-fi.com/Z8Z616R4PF' target='_blank'\u003e\u003cimg height='36' style='border:0px;height:36px;' src='https://storage.ko-fi.com/cdn/kofi6.png?v=6' border='0' alt='Buy Me a Coffee at ko-fi.com' /\u003e\u003c/a\u003e\n\nFrom [neuralwork](https://neuralwork.ai/) with :heart:\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneuralwork%2Farxiver","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fneuralwork%2Farxiver","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneuralwork%2Farxiver/lists"}