https://github.com/moisutsu/realistic-citation-count-prediction

Official implementation: Realistic Citation Count Prediction Task for Newly Published Papers

Host: GitHub
URL: https://github.com/moisutsu/realistic-citation-count-prediction
Owner: moisutsu
Created: 2025-03-23T13:31:37.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-03-23T13:39:55.000Z (about 1 year ago)
Last Synced: 2025-03-23T14:32:47.298Z (about 1 year ago)
Language: Python
Homepage: https://aclanthology.org/2023.findings-eacl.84/
Size: 23.4 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Realistic Citation Count Prediction Task for Newly Published Papers

This repository is the official implementation of our paper [Realistic Citation Count Prediction Task for Newly Published Papers](https://aclanthology.org/2023.findings-eacl.84/).

## Dataset Construction

### (1) Collect Paper IDs for Target Papers

Collect the paper IDs of the target papers, using an ID type that Semantic Scholar supports. The supported ID types are listed in the [Semantic Scholar API documentation for Paper Data](https://api.semanticscholar.org/api-docs/graph#tag/Paper-Data/operation/get_graph_get_paper).

### (2) Collect Paper IDs for Papers from Semantic Scholar to Retrieve

To calculate the citation counts for each month after a paper's publication, the papers that cite the target papers are needed as well. In addition to the target papers collected in (1), collect from Semantic Scholar the paper IDs of the papers that have cited them.

- **Input format:** A file containing only one paper ID per line
- **Output format:** A file containing only one paper ID per line

```bash
python scripts/make_dataset/fetch_papers_to_calculate_citation_counts.py \
--ids_path \
--output_s2_ids_to_calculate_citation_count \
--output_input_ids_to_s2_ids_path \
--prefix
```

Refer to [Semantic Scholar API documentation for Paper Data](https://api.semanticscholar.org/api-docs/graph#tag/Paper-Data/operation/get_graph_get_paper) for details regarding the `--prefix`.
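As a rough illustration of what this step involves, the sketch below pages through the Semantic Scholar Graph API citations endpoint for one target paper and yields the IDs of the citing papers. It is not the repository's script: the function names, the `get_json` hook, and the page size are assumptions made for this example, and rate limiting and error handling are left out.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator

API_ROOT = "https://api.semanticscholar.org/graph/v1"


def citations_url(paper_id: str, prefix: str = "", offset: int = 0, limit: int = 100) -> str:
    """Build the citations URL for one paper; prefix is an ID type such as "DOI" or "ARXIV"."""
    full_id = f"{prefix}:{paper_id}" if prefix else paper_id
    query = urllib.parse.urlencode({"fields": "paperId", "offset": offset, "limit": limit})
    # Keep ":" and "/" unescaped so that DOI-style IDs stay readable in the path.
    return f"{API_ROOT}/paper/{urllib.parse.quote(full_id, safe=':/')}/citations?{query}"


def http_get_json(url: str) -> dict:
    """Fetch a URL and decode the JSON body (no retries or API key; sketch only)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def iter_citing_ids(
    paper_id: str,
    prefix: str = "",
    get_json: Callable[[str], dict] = http_get_json,
    limit: int = 100,
) -> Iterator[str]:
    """Yield the Semantic Scholar IDs of papers citing `paper_id`, following pagination."""
    offset = 0
    while offset is not None:
        page = get_json(citations_url(paper_id, prefix, offset, limit))
        for item in page.get("data", []):
            citing_id = (item.get("citingPaper") or {}).get("paperId")
            if citing_id:  # some citing papers have no Semantic Scholar ID
                yield citing_id
        offset = page.get("next")  # absent on the last page
```

Passing a custom `get_json` makes it possible to try the pagination logic against canned responses without calling the API.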

### (3) Retrieve Detailed Paper Information from Semantic Scholar

Using the paper IDs collected in (2), retrieve detailed information such as the title and abstract from Semantic Scholar.

```bash
python scripts/make_dataset/fetch_paper_details_from_ids.py \
--ids_path \
--output_path \
--prefix
```
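For orientation, here is a minimal sketch of how paper details could be requested in bulk through the Graph API's `paper/batch` endpoint. It is an illustration rather than the repository's implementation: the helper names, the chosen `FIELDS`, and the chunk size of 500 are assumptions for this example, and retries and API keys are omitted.

```python
import json
import urllib.parse
import urllib.request
from typing import Iterator, List

BATCH_URL = "https://api.semanticscholar.org/graph/v1/paper/batch"
FIELDS = "paperId,title,abstract,publicationDate"  # assumed field selection


def chunked(ids: List, size: int = 500) -> Iterator[List]:
    """Split the ID list into chunks small enough for one batch request."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_batch_request(ids: List[str], prefix: str = "") -> urllib.request.Request:
    """Build a POST request asking for FIELDS of every paper in `ids`."""
    full_ids = [f"{prefix}:{i}" if prefix else i for i in ids]
    url = f"{BATCH_URL}?{urllib.parse.urlencode({'fields': FIELDS})}"
    body = json.dumps({"ids": full_ids}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def fetch_details(ids: List[str], prefix: str = "") -> List[dict]:
    """Fetch details for all IDs, skipping entries the API could not resolve."""
    papers = []
    for chunk in chunked(ids):
        with urllib.request.urlopen(build_batch_request(chunk, prefix)) as response:
            papers.extend(p for p in json.load(response) if p is not None)
    return papers
```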

### (4) Store the Retrieved Paper Information in a Database

Store the paper information retrieved in (3) into a database.

```bash
python scripts/make_dataset/store_fetched_paper_to_database.py \
--input_path \
--database_path
```
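The README does not describe the database format. The sketch below assumes a SQLite file and a hypothetical single-table schema, only to illustrate the idea of loading fetched records into a database so that later steps can query them. The table name, column names, and record keys are all assumptions.

```python
import sqlite3
from typing import Iterable

# Hypothetical schema; the repository's actual tables may differ.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS papers (
    paper_id TEXT PRIMARY KEY,
    title TEXT,
    abstract TEXT,
    publication_date TEXT
)
"""


def store_papers(records: Iterable[dict], database_path: str) -> int:
    """Insert or update paper records and return the number of rows in the table."""
    rows = [
        {
            "paper_id": r["paperId"],
            "title": r.get("title"),
            "abstract": r.get("abstract"),
            "publication_date": r.get("publicationDate"),
        }
        for r in records
    ]
    conn = sqlite3.connect(database_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(CREATE_TABLE)
            conn.executemany(
                "INSERT OR REPLACE INTO papers "
                "VALUES (:paper_id, :title, :abstract, :publication_date)",
                rows,
            )
        return conn.execute("SELECT COUNT(*) FROM papers").fetchone()[0]
    finally:
        conn.close()
```

Using the paper ID as the primary key with `INSERT OR REPLACE` keeps the load idempotent, so re-running it on overlapping input does not create duplicates.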

### (5) Generate the Dataset from the Database

Generate the dataset from the database created in (4).

```bash
YEAR_RANGE=5     # Number of years before the test paper publication to be used for training
TEST_YEAR=2021   # Publication year of the test papers
TEST_MONTH=4     # Publication month of the test papers
N_YEARS_AFTER=1  # Use the citation count N years after publication
# Assumption: OLDEST_TRAIN_YEAR is used below but was not defined in this snippet;
# it is derived here as YEAR_RANGE years before TEST_YEAR.
OLDEST_TRAIN_YEAR=$((TEST_YEAR - YEAR_RANGE))

# Output directory
OUTPUT_DIR="datasets/ccp/biorxiv/${YEAR_RANGE}_years/use_current_citation/test_${TEST_YEAR}-${TEST_MONTH}/${N_YEARS_AFTER}_year_later_citation_complemented"

# Options (a comment cannot follow a line-continuation backslash, so they are listed here):
#   --paper_ids_path          Input file path for paper IDs collected in (1)
#   --convert_to_s2_id_path   Input file path for converting conference IDs to S2 IDs as output in (2)
#   --database_path           Input file path for the database created in (4)
#   --output_dir              Output directory for the dataset
#   --oldest_date_for_train   Publication year and month of the oldest training paper
#   --test_date               Publication year and month of the test papers
#   --n_years_after           Use the citation count N years after publication
#   --mode_for_citation_counts_within_n_years_after_publication
#                             Mode for utilizing recent papers during training
python scripts/make_dataset/create_dataset_from_db.py \
--paper_ids_path datasets/paper_ids/biorxiv/2014_1_17-2022_4_30-doi-plant.txt \
--convert_to_s2_id_path datasets/paper_ids/convert/biorxiv_2014_1_17-2022_4_30-doi-plant_to_s2_ids.json \
--database_path /local2/hirako/s2.db \
--output_dir "$OUTPUT_DIR" \
--oldest_date_for_train "$OLDEST_TRAIN_YEAR" "$TEST_MONTH" \
--test_date "$TEST_YEAR" "$TEST_MONTH" \
--n_years_after "$N_YEARS_AFTER" \
--mode_for_citation_counts_within_n_years_after_publication complement
```

The following options can be specified for `mode_for_citation_counts_within_n_years_after_publication`:

- **use_full:** Utilize all citation counts, including future citation counts that would normally be unavailable.
- **not_use:** Do not use the most recent papers for training at all.
- **complement:** Complement the citation counts of recent papers and use them.
- **no_complement:** Use the citation counts of recent papers without complementing (i.e., use the citation counts as of the test paper’s publication date).
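To make the difference between these modes concrete, the sketch below computes a training label for one paper from the publication dates of its citing papers. It is one interpretation of the descriptions above, not the repository's code: a paper is treated as "recent" when its N-year horizon falls after the test date, the function names are hypothetical, and `complement` is left unimplemented because the README does not specify how the counts are complemented.

```python
from datetime import date
from typing import List, Optional


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def training_label(
    published: date,
    citing_dates: List[date],
    test_date: date,
    n_years_after: int,
    mode: str,
) -> Optional[int]:
    """Return the citation count used as the label, or None if the paper is dropped."""
    horizon = add_years(published, n_years_after)
    is_recent = horizon > test_date  # assumed definition of a "recent" paper
    if not is_recent or mode == "use_full":
        cutoff = horizon  # full N-year count (future information for recent papers)
    elif mode == "not_use":
        return None  # recent papers are excluded from training
    elif mode == "no_complement":
        cutoff = test_date  # only citations observable at the test date
    elif mode == "complement":
        raise NotImplementedError("the complementing method is not described here")
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sum(1 for d in citing_dates if d <= cutoff)
```

With a test date of April 2021 and `n_years_after=1`, a paper published in October 2020 is recent under this definition: `use_full` counts citations up to October 2021, while `no_complement` stops at the test date.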

A reference shell script, `scripts/make_dataset/create_dataset_from_db.sh`, is provided for creating the dataset from the database.

*Note:* The contents of `valid.jsonl` and `test.jsonl` generated by this program are identical. Due to experimental constraints, a development dataset cannot normally be created for each dataset; however, to simplify the implementation of model training and evaluation, a pseudo-development set is generated. Therefore, no tuning should be performed on this development set.

## Model Training & Evaluation

### Running the Program

Running `main.py` trains and evaluates the model on the generated dataset.

Below is an example of how to run the program when using BERT to predict a paper's citation count from its title and abstract.

```bash
# Overrides (a comment cannot follow a line-continuation backslash, so they are listed here):
#   experiment_count  Number of experiments to run with different random seeds
#   batch_size        Batch size
#   experiment_name   Experiment name on MLflow
#   run_name          Run name on MLflow
#   gpus              GPU indices to use
#   bert_model        Model name from HuggingFace
#   dataset_name      Directory name of the dataset to be used for training and evaluation
python main.py \
experiment_count=3 \
batch_size=32 \
experiment_name="$experiment_name" \
run_name="$run_name" \
gpus="$gpus" \
bert_model="$bert_model" \
dataset_name="$dataset_name"
```

Set `dataset_name` to the dataset directory's path relative to `dataset/`. For example, if `train.jsonl`, `valid.jsonl`, and `test.jsonl` are in a directory called `dataset/samples`, specify `samples`.
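The naming convention can be sketched as follows. This is an illustration of the mapping from `dataset_name` to the three split files, assuming the `dataset/` root mentioned above; how `main.py` actually resolves and loads the files is not shown in this README, and both helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def dataset_files(dataset_name: str, root: str = "dataset") -> Dict[str, Path]:
    """Map each split name to its expected JSONL path under the dataset root."""
    base = Path(root) / dataset_name
    return {split: base / f"{split}.jsonl" for split in ("train", "valid", "test")}


def read_jsonl(path: Path) -> List[dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    for split, path in dataset_files("samples").items():
        print(split, path)
```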

For other configurable hyperparameters, please refer to the `configs` directory.

### Checking Experiment Results

View the experiment results using MLflow:

```bash
mlflow ui
```

## Citation

```bibtex
@inproceedings{hirako-etal-2023-realistic,
title = "Realistic Citation Count Prediction Task for Newly Published Papers",
author = "Hirako, Jun and
Sasano, Ryohei and
Takeda, Koichi",
booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-eacl.84/",
doi = "10.18653/v1/2023.findings-eacl.84",
pages = "1131--1141",
}
```
