Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.

https://github.com/mila-iqia/paperoni
Search for scientific papers on the command line
- Host: GitHub
- URL: https://github.com/mila-iqia/paperoni
- Owner: mila-iqia
- License: mit
- Created: 2020-08-19T00:05:13.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2024-09-05T13:24:57.000Z (2 months ago)
- Last Synced: 2024-10-03T06:38:08.168Z (about 1 month ago)
- Language: Python
- Size: 4.67 MB
- Stars: 99
- Watchers: 7
- Forks: 16
- Open Issues: 9
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Paperoni
Paperoni is Mila's tool to collect publications from our researchers and to generate HTML pages or reports from them.
## Install
First clone the repo (`git clone https://github.com/mila-iqia/paperoni`) and, from the repository root, run:
```bash
pip install -e .
```

## Configuration
Create a YAML configuration file named `config.yaml` in the directory where you want to put the data, with the following content:
```yaml
paperoni:
  paths:
    database: papers.db
    history: history
    cache: cache
    requests_cache: requests-cache
    permanent_requests_cache: permanent-requests-cache
  institution_patterns:
    - pattern: ".*\\buniversit(y|é)\\b.*"
      category: "academia"
```

All paths are relative to the configuration file. Institution patterns are regular expressions used to recognize affiliations when parsing PDFs (along with other heuristics).
Make sure to set the `$GIFNOC_FILE` environment variable to the path to that file.
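For example, assuming the configuration file lives at `~/paperoni-data/config.yaml` (a hypothetical location):

```bash
# Tell gifnoc where the paperoni configuration file is (adjust the path to yours)
export GIFNOC_FILE=~/paperoni-data/config.yaml
```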
## Start the web app
To start the web app on port 8888, execute the following command:
```bash
grizzlaxy -m paperoni.webapp --port 8888
```

You can also add this section to the configuration file (same file as the paperoni config):
```yaml
grizzlaxy:
module: paperoni.webapp
port: 8888
```

And then you would just need to run `grizzlaxy` or `grizzlaxy --config config-file.yaml`.
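Since both tools read the same file, the combined `config.yaml` simply contains both top-level sections. A sketch, with the `paperoni` section abbreviated to the one shown above:

```yaml
paperoni:
  paths:
    database: papers.db
    # ... remaining paths and institution_patterns as shown above ...
grizzlaxy:
  module: paperoni.webapp
  port: 8888
```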
Once papers are in the system, the app can be used to validate them or to perform searches. Follow these steps to populate the database:
## Add researchers
* Go to [http://127.0.0.1:8888/author-institution](http://127.0.0.1:8888/author-institution)
* Enter a researcher's name, their role at the institution, and a start date (the end date can be left blank), then click `Add/Edit`
* You can edit a row by clicking on it, changing e.g. the end date and clicking `Add/Edit`
* Then, add IDs on Semantic Scholar: click on the number in the `Semantic Scholar IDs` column, which will open a new window.
* This will query Semantic Scholar with the researcher's name. Each box represents a different Semantic Scholar ID. Select:
* `Yes` if the listed papers are indeed from the researcher. This ID will be scraped for this researcher.
* `No` if the listed papers are not from the researcher. This ID will not be scraped.

Ignore OpenReview IDs for the time being; they might not work properly at the moment.
## Scrape
The scraping currently needs to be done on the command line.
```bash
# Scrape from semantic_scholar
paperoni acquire semantic_scholar

# Get more information for the scraped papers
# E.g. download from arxiv and analyze author list to find affiliations
# It can be wise to use --limit to avoid hitting rate limits
paperoni acquire refine --limit 500

# Merge entries for the same paper; paperoni acquire does not do it automatically
paperoni merge paper_link

# Merge entries based on paper name
paperoni merge paper_name
```

Other merging functions are `author_link` and `author_name` for authors (not papers) and `venue_link` for venues.
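These are invoked the same way as the paper merges; a sketch, assuming the subcommand names listed above:

```bash
# Merge author entries that share a link or a name
paperoni merge author_link
paperoni merge author_name

# Merge venue entries that share a link
paperoni merge venue_link
```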
## Validate
Go to [http://127.0.0.1:8888/validation](http://127.0.0.1:8888/validation) to validate papers. Basically, you click "Yes" if the paper should be in the collection and "No" if it should not, according to your criteria (because it comes from a homonym of the researcher, is in the wrong field, is just not a paper, etc. -- it depends on your use case).