https://github.com/datawraith/arxiv-frontpage
My personal ArXiv frontpage
- Host: GitHub
- URL: https://github.com/datawraith/arxiv-frontpage
- Owner: DataWraith
- License: mit
- Created: 2025-04-13T08:41:53.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-05T05:32:09.000Z (11 months ago)
- Last Synced: 2025-06-05T07:32:17.995Z (11 months ago)
- Language: HTML
- Size: 24.8 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# arxiv-frontpage
A tool that creates a personalized frontpage of arXiv computer science papers ranked by your interests.
## Demo
My frontpage for today can be viewed here:
## What does this do?
Inspired by , this project fetches new computer science papers from [arXiv](https://arxiv.org) and uses a classifier to infer tags from the paper metadata. Tags are displayed below each paper abstract once the classifier's confidence reaches a threshold.
Each tag has an "interestingness" multiplier. For every paper, the classifier's confidence that a tag applies is multiplied by that tag's multiplier, and these products are summed over all tags to give the paper's score. The frontpage ranks papers by this score, giving you a personalized ranking of fresh papers.
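The scoring rule described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the function names, the tag names, the shape of the `papers` records, and the `0.5` display threshold are all assumptions made for the example.

```python
# Hypothetical multipliers; in the project these come from data/tags.json.
TAG_MULTIPLIERS = {"reinforcement-learning": 2.0, "compilers": 1.0}


def paper_score(confidences, multipliers):
    """Sum of (classifier confidence * interestingness multiplier) over all tags."""
    return sum(conf * multipliers.get(tag, 0.0) for tag, conf in confidences.items())


def rank_papers(papers, multipliers):
    """Return papers sorted from most to least interesting."""
    return sorted(
        papers,
        key=lambda paper: paper_score(paper["confidences"], multipliers),
        reverse=True,
    )


def visible_tags(confidences, threshold=0.5):
    """Tags shown below the abstract: those at or above the (assumed) threshold."""
    return [tag for tag, conf in confidences.items() if conf >= threshold]


if __name__ == "__main__":
    # Made-up papers and confidences, purely for demonstration.
    papers = [
        {"title": "Paper A", "confidences": {"reinforcement-learning": 0.25, "compilers": 0.5}},
        {"title": "Paper B", "confidences": {"reinforcement-learning": 0.75, "compilers": 0.0}},
    ]
    for paper in rank_papers(papers, TAG_MULTIPLIERS):
        print(paper["title"], paper_score(paper["confidences"], TAG_MULTIPLIERS))
```

Note that a tag can contribute to the score even when its confidence is too low for it to be displayed; whether the project does the same is not stated in this README.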
A GitHub Actions workflow automatically pulls new data and regenerates the site once every weekday. If you fork the repo, you may need to change the repository settings to allow Actions to commit and push changes.
## How does it work?
1. **Tag Configuration**: Tags are defined in `data/tags.json` and mapped to their interestingness multiplier.
2. **Training Data**: Each tag must have an associated `.jsonl` file in the `data/train` directory.
3. **Paper Collection**: The system fetches recent papers from arXiv's CS categories via RSS feed.
4. **Classification**: A Probabilistic Label Tree classifier (via [napkinXC](https://napkinxc.readthedocs.io)) determines the relevance of each tag for each paper.
5. **Ranking**: Papers are scored and the frontpage is generated.
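Since steps 1 and 2 require every tag to have a training file, a small consistency check can be handy before training. The sketch below is hypothetical and rests on two assumptions that this README does not confirm: that `data/tags.json` is a flat JSON object mapping tag names to multipliers, and that each training file is named `<tag>.jsonl`.

```python
import json
from pathlib import Path


def load_tags(tags_path):
    """Load the tag -> multiplier mapping (assumed to be a flat JSON object)."""
    with open(tags_path, encoding="utf-8") as f:
        tags = json.load(f)
    return {str(tag): float(multiplier) for tag, multiplier in tags.items()}


def missing_training_files(tags, train_dir):
    """Return the tags that have no '<tag>.jsonl' file (assumed naming) in train_dir."""
    train_dir = Path(train_dir)
    return sorted(tag for tag in tags if not (train_dir / f"{tag}.jsonl").is_file())


if __name__ == "__main__":
    tags = load_tags("data/tags.json")
    missing = missing_training_files(tags, "data/train")
    if missing:
        print("Tags without training data:", ", ".join(missing))
    else:
        print(f"All {len(tags)} tags have training files.")
```

If the real file layout differs, only the path construction in `missing_training_files` should need adjusting.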
The generated frontpage includes a copy button that displays the JSON data you need to put into the training files to improve future classifications.
You can also run the project locally using [uv](https://github.com/astral-sh/uv) -- see the `Justfile` for the available commands.