{"id":20165755,"url":"https://github.com/allofphysicsgraph/latex-in-arxiv","last_synced_at":"2025-04-10T01:03:38.043Z","repository":{"id":37267792,"uuid":"267309311","full_name":"allofphysicsgraph/latex-in-arxiv","owner":"allofphysicsgraph","description":"extract math latex from content in arxiv","archived":false,"fork":false,"pushed_at":"2025-04-03T20:14:25.000Z","size":388712,"stargazers_count":4,"open_issues_count":25,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-03T20:33:51.762Z","etag":null,"topics":["latex"],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/allofphysicsgraph.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-27T12:09:22.000Z","updated_at":"2025-04-03T20:14:28.000Z","dependencies_parsed_at":"2025-03-24T01:19:43.001Z","dependency_job_id":"356841ea-8b40-42a2-9514-9ba0593e7ace","html_url":"https://github.com/allofphysicsgraph/latex-in-arxiv","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allofphysicsgraph%2Flatex-in-arxiv","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allofphysicsgraph%2Flatex-in-arxiv/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allofphysicsgraph%2Flatex-in-arxiv/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allofphysicsgraph%2Flatex-in-arxiv/manifests","owner_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/allofphysicsgraph","download_url":"https://codeload.github.com/allofphysicsgraph/latex-in-arxiv/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248137894,"owners_count":21053775,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["latex"],"created_at":"2024-11-14T00:39:00.943Z","updated_at":"2025-04-10T01:03:38.037Z","avatar_url":"https://github.com/allofphysicsgraph.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Overview\n_Goal_: extract math Latex from `.tex` content available from arXiv. \n\n_Caveat when cloning this repo_: Total download size is 640 MB. \n\n## quick start\n\nRead `latex-in-arxiv/src/postings_list/query/README.md`\n\nEverything is containerized, so in this repo (`latex-in-arxiv/`) use\neither `make docker` (for linux) or `make docmac` (for Mac). \n\nTo run the application, within the Docker image run `/opt/scanner.out .`\n\nTo recompile the scanner, within the Docker image run\n```bash\ncd latex-in-arxiv/src/postings_list/query\nmake scanner \nmake read_tf_idf   \n./scanner.out .   
\n./read_tf_idf.out tf_idf    # the vocabulary for TF-IDF uses the tokens from parsed Latex\n                            # TF-IDF is used to identify the most relevant variable in a paper to find the definition for\n```\n\n## so what?\n\nSuppose you have a `.tex` file that contains math, like\n```latex\n\\documentclass{article}\n\\title{test}\n\\begin{document}\n\\maketitle\n\\section{Introduction}\nThis is a great paper.\n\\begin{equation}\n    a+b = c\n\\end{equation}\nWhere $c$ is some variable.\n\\end{document}\n```\nThere's an expression, `a+b=c`, and an inline variable, `c`. \nHow can the expression and the variables be extracted? \n\nThere are a few options for parsing Latex; see \u003chttps://github.com/allofphysicsgraph/latex-in-arxiv/issues/14\u003e.\nThe options that give good-quality results are also slow.\n\nThis repo uses [`ragel`](https://www.colm.net/open-source/ragel/) to quickly parse Latex and find math. \n\n## get data\n\n### an option that's free is a few years of arxiv data\n\u003chttps://www.cs.cornell.edu/projects/kddcup/datasets.html\u003e\n\nIn the directory `latex-in-arxiv/get_sample_data` use\n```bash\nmake get_sample_data\n```\n### arXiv API calls\n```\n# curl http://export.arxiv.org/api/query?search_query=all:rigorous%20derivation  \n```\n\n### bulk processing: another option is the full arxiv data available from an S3 bucket\nFor details, see \u003chttps://arxiv.org/help/bulk_data_s3\u003e\n```bash\n# s3cmd get s3://arxiv/src/arXiv_src_manifest.xml . --requester-pays  \n# s3cmd get s3://arxiv/src/arXiv_src_9912_001.tar . --requester-pays  \n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fallofphysicsgraph%2Flatex-in-arxiv","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fallofphysicsgraph%2Flatex-in-arxiv","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fallofphysicsgraph%2Flatex-in-arxiv/lists"}