{"id":17030718,"url":"https://github.com/vsoch/arxiv-equations","last_synced_at":"2025-03-22T20:28:54.312Z","repository":{"id":141667621,"uuid":"157988494","full_name":"vsoch/arxiv-equations","owner":"vsoch","description":"looking for patterns in equation use in arxiv papers","archived":false,"fork":false,"pushed_at":"2018-11-26T05:20:22.000Z","size":151244,"stargazers_count":2,"open_issues_count":1,"forks_count":1,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-01-28T00:31:00.786Z","etag":null,"topics":["arxiv","doc2vec","equations","word2vec"],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vsoch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-11-17T13:44:43.000Z","updated_at":"2019-06-21T03:14:06.000Z","dependencies_parsed_at":null,"dependency_job_id":"f8420632-79d2-42ff-a74d-3f6b6923dfd6","html_url":"https://github.com/vsoch/arxiv-equations","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vsoch%2Farxiv-equations","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vsoch%2Farxiv-equations/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vsoch%2Farxiv-equations/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vsoch%2Farxiv-equations/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vsoc
h","download_url":"https://codeload.github.com/vsoch/arxiv-equations/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245018366,"owners_count":20547986,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arxiv","doc2vec","equations","word2vec"],"created_at":"2024-10-14T08:08:01.005Z","updated_at":"2025-03-22T20:28:54.290Z","avatar_url":"https://github.com/vsoch.png","language":"Jupyter Notebook","readme":"# Equation Analysis\n\nThis is development of simple analysis to parse a set of papers from [arxiv](https://arxiv.org/help/bulk_data).\nOur goals are the following:\n\n 1. to classify equations into groups based on domains of knowledge (math or methods). We will do this by using equations from wikipedia methods pages as a gold standard, and then word2vec (or similar) to represent an equation as a vector.\n 2. to classify papers into groups based on the equations, the idea being that a paper mapped to a domain of knowledge can help us to understand:\n    - the domains that are assocated with different kinds of math and methods\n    - gaps / potential for working on a method for a domain that hasn't been tried yet\n    - understanding of what kinds of math are used (and to what degree) across domains, to drive development of Penrose\n 3. To develop a visualization (catalog) that can nicely portray how groups of papers map to domains of math / methods, by way of the equations they use (this will be the final result in the [arxiv-catalog](https://www.github.com/vsoch/arxiv-catalog) repository.\n \n\n## Step 1. 
Testing Extraction\nWe will use one *.tar, a collection of papers from a particular month and\nyear, extracted to a local folder `0801` (not included in the repository), meaning\nJanuary of 2008. The arxiv files were obtained in bulk and processed both locally\nand on the Sherlock cluster. The folder was generated from the tar as follows:\n\n```bash\ntar -xvf 0801.tar\n```\n\nMore information about naming is [here](https://arxiv.org/help/bulk_data_s3).\nFor example, the description of the folder name:\n\n```\nTwo digit year and month of items in the tar package. Starts with 9108 for 1991-08, rolls past y2k to 0001 for 2000-01, 1008 for 2010-08 etc.\n```\n\nWithin the single folder, we have over 4000 files!\n\n```bash\n$ ls 0801 | wc -l\n4516\n```\n\nTo test parsing and extraction of a .tar.gz within, please reference the script\n[testExtract.py](testExtract.py). The first section extracts a single tex file,\nmeaning equations, tex, and metadata, and the second section loops over the\nlogic to process the rest. I did the loop extraction for one .tar.gz and it produced\n9009 .tar.gz within, meaning 9009 papers (each with a LaTeX file).\n\n## Step 2. Metadata and Text Extraction\n\nOnce the extraction method was reasonable (meaning that while I didn't get all\nequations with a regular expression, I did get most equations), I wanted to\nrun the analysis on the [Sherlock cluster](https://www.sherlock.stanford.edu/) at Stanford, and I used the scripts\n[clusterExtract.py](clusterExtract.py) and [run_clusterExtract.py](run_clusterExtract.py)\nto do this in parallel for all the .tar.gz. The files were uploaded to the cluster\nwith scp, and then extracted as follows:\n\n```bash\nfor tarfile in *.tar\n    do\n       if [ ! 
-d \"${tarfile%.tar}\" ]; then\n           tar -xf $tarfile\n           echo \"Extracting $tarfile\"\n       fi\ndone\n```\n\nThe run script also runs \n[generatePage.py](generatePage.py) to generate markdown to populate the \n[arxiv catalog](https://vsoch.github.io/arxiv-catalog/). You can see\nexamples of the metadata extracted by looking at any of the markdown files\nin the [posts](https://github.com/vsoch/arxiv-catalog/tree/master/_posts) folder there.\nThe goal of the \"catalog\" is to eventually (visually) present summary metrics for\n each category described by archiv (each manuscript has one or more category \nlabels like `astro-ph`). I haven't decided how I want to do this yet, but will\ndevelop something after I do a first extraction. My thinking is that we\ncan hve simple metrics to describe the articles, for example:\n \n - **journal**: The name of the journal\n - **number of authors**: I don't see a logical reason for this to have any association with equations, but you never know.\n - **length of article**: We would need to normalize the number of equations based on the length.\n\nbut more interesting would be to do a further analysis to classify the equations\nto belong to one or more methods or domains of math. Thus, this is more of a fuzzy clustering. \nThis is the first step toward one of the goals outlined above.\n\n### What else are we interested in?\nWhen the above is done, we would be interested in:\n\n - a total list of categories, and description of metrics by category\n - a breakdown of equations by category\n - what kind of equations cluster together?\n - what groupings (types) of equations are associated with different topics?\n - how do topics compare with respect to equations used?\n\n### Notes about the data\n\n**Some entries are withdrawals**\n\nAnd this corresponds to no latex. For example,\n\n```\n\u003cTarInfo '0801.0528/p-withdraw' at 0x7f4ada4d8cc8\u003e\n```\nwould return `None`, and should thus be skipped. 
Another common pattern\nwas to find a txt file with a note about the paper being withdrawn:\n\n```\n'%auto-ignore\\r\\nThis paper has been withdrawn by the author,\\r\\ndue a publication.'\n```\n\n**Some LaTeX files end in TEX**\n\nThe function should therefore convert filenames to lowercase before any string checking.\n\n## Step 3: Equation Mapping\n\nI realized that in order to map equations to domains of math and methods, we need\nsome kind of gold standard. Why not use Wikipedia? Since Wikipedia pages have\nclearly defined LaTeX equations (much easier to parse than raw LaTeX because they\nare in image tags) *and* a clear title for the page, it would be fairly easy\nto extract word2vec embeddings for the equations in a page, and then associate\nthem with the topic. I call this step the \"equation mapping\" because that is\nexactly what we are doing: mapping topics to equation vectors that will be useful\nin the next step of the arxiv analysis. To help with this, see the\n[wikipedia](wikipedia) folder.\n\n## Step 4: Topic Extraction\n\nOnce we have vector representations of equations for topics, we can use the\nword2vec model to derive vectors for each equation represented in the arxiv papers.\nThe vectors can then be used as features to calculate the similarity of each paper\nto each Wikipedia topic. I hope that we will then be able to say that a particular\npaper has some set of methods / domains of math over-represented at a rate\nunlikely to be due to chance. I haven't thought through the details here, but will\ndo so when I write the code. 
I can use the following steps on Sherlock to\nget an interactive node and load Python modules for development\n(after cloning the repository into the present working directory):\n\n```bash\n# interactive node\nsrun --time 48:00:00 --mem 32000 --pty bash\n\n# python modules\nml python/3.6.1\nml py-pandas/0.23.0_py36\nml py-ipython/6.1.0_py36\n```\n\nI will use IPython to test and run [extractMetrics.py](extractMetrics.py)\n(not written yet).\n\n## Challenges\n\nOne challenge I ran into was being able to extract **all** the equations from a\nparticular LaTeX document. I was able to derive regular expressions that captured most\nof them, but never all. At first this was discouraging, but I realized\nthat we don't need to capture every equation, as long as we can get a large sample\nand use it to classify the paper.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvsoch%2Farxiv-equations","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvsoch%2Farxiv-equations","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvsoch%2Farxiv-equations/lists"}