{"id":26880851,"url":"https://github.com/moisutsu/realistic-citation-count-prediction","last_synced_at":"2025-03-31T14:50:35.827Z","repository":{"id":283987882,"uuid":"953483690","full_name":"moisutsu/realistic-citation-count-prediction","owner":"moisutsu","description":"Official implementation: Realistic Citation Count Prediction Task for Newly Published Papers","archived":false,"fork":false,"pushed_at":"2025-03-23T13:39:55.000Z","size":24,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-23T14:32:47.298Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://aclanthology.org/2023.findings-eacl.84/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/moisutsu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-23T13:31:37.000Z","updated_at":"2025-03-23T13:39:58.000Z","dependencies_parsed_at":"2025-03-23T14:43:38.489Z","dependency_job_id":null,"html_url":"https://github.com/moisutsu/realistic-citation-count-prediction","commit_stats":null,"previous_names":["moisutsu/realistic-citation-count-prediction"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moisutsu%2Frealistic-citation-count-prediction","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moisutsu%2Frealistic-citation-count-prediction/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moisutsu%2Frealistic-citation-count-prediction
/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moisutsu%2Frealistic-citation-count-prediction/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/moisutsu","download_url":"https://codeload.github.com/moisutsu/realistic-citation-count-prediction/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246485890,"owners_count":20785418,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-31T14:50:35.325Z","updated_at":"2025-03-31T14:50:35.815Z","avatar_url":"https://github.com/moisutsu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Realistic Citation Count Prediction Task for Newly Published Papers\n\nThis repository is the official implementation of our paper [Realistic Citation Count Prediction Task for Newly Published Papers](https://aclanthology.org/2023.findings-eacl.84/).\n\n## Dataset Construction\n\n### (1) Collect Paper IDs for Target Papers\n\nFollowing the reference provided at [Semantic Scholar API documentation for Paper Data](https://api.semanticscholar.org/api-docs/graph#tag/Paper-Data/operation/get_graph_get_paper), collect the paper IDs of the target papers that are supported by Semantic Scholar.\n\n### (2) Collect Paper IDs for Papers from Semantic Scholar to Retrieve\n\nIn order to calculate the citation counts for each month after a paper's publication, in addition to the target papers collected in (1), also collect the paper IDs of the papers that have cited the target 
papers from Semantic Scholar.\n\n- **Input format:** A file containing one paper ID per line\n- **Output format:** A file containing one paper ID per line\n\n```bash\npython scripts/make_dataset/fetch_papers_to_calculate_citation_counts.py \\\n    --ids_path \u003cInput file path containing paper IDs collected in (1)\u003e \\\n    --output_s2_ids_to_calculate_citation_count \u003cOutput file path for paper IDs for which to collect paper details\u003e \\\n    --output_input_ids_to_s2_ids_path \u003cOutput file path for converting input paper IDs to Semantic Scholar IDs\u003e \\\n    --prefix \u003cPrefix corresponding to each conference, default is 'ARXIV:'\u003e\n```\n\nRefer to [Semantic Scholar API documentation for Paper Data](https://api.semanticscholar.org/api-docs/graph#tag/Paper-Data/operation/get_graph_get_paper) for details on the `--prefix` option.\n\n### (3) Retrieve Detailed Paper Information from Semantic Scholar\n\nUsing the paper IDs collected in (2), retrieve detailed information such as the title and abstract from Semantic Scholar.\n\n```bash\npython scripts/make_dataset/fetch_paper_details_from_ids.py \\\n    --ids_path \u003cInput file path of paper IDs output from (2)\u003e \\\n    --output_path \u003cOutput file path\u003e \\\n    --prefix \u003cThe same conference-specific prefix as used in (2)\u003e\n```\n\n### (4) Store the Retrieved Paper Information in a Database\n\nStore the paper information retrieved in (3) in a database.\n\n```bash\npython scripts/make_dataset/store_fetched_paper_to_database.py \\\n    --input_path \u003cInput file path of paper details output from (3)\u003e \\\n    --database_path \u003cOutput file path for the database\u003e\n```\n\n### (5) Generate the Dataset from the Database\n\nGenerate the dataset from the database created in (4).\n\n```bash\nYEAR_RANGE=5    # Number of years before the test paper publication to be used for training\nTEST_YEAR=2021  # Publication year of the test 
papers\nTEST_MONTH=4    # Publication month of the test papers\nN_YEARS_AFTER=1 # Use the citation count N years after publication\n# Assumption: OLDEST_TRAIN_YEAR is not defined in the original example; here it is taken to be YEAR_RANGE years before TEST_YEAR. Adjust if your setup differs.\nOLDEST_TRAIN_YEAR=$((TEST_YEAR - YEAR_RANGE))\n\nOUTPUT_DIR=\"datasets/ccp/biorxiv/${YEAR_RANGE}_years/use_current_citation/test_${TEST_YEAR}-${TEST_MONTH}/${N_YEARS_AFTER}_year_later_citation_complemented\" # Output directory\n\n# Option descriptions:\n#   --paper_ids_path         Input file path for paper IDs collected in (1)\n#   --convert_to_s2_id_path  Input file path for converting conference IDs to S2 IDs, as output in (2)\n#   --database_path          Input file path for the database created in (4)\n#   --output_dir             Output directory for the dataset\n#   --oldest_date_for_train  Publication year and month of the oldest training paper\n#   --test_date              Publication year and month of the test papers\n#   --n_years_after          Use the citation count N years after publication\n#   --mode_for_citation_counts_within_n_years_after_publication  Mode for utilizing recent papers during training\npython scripts/make_dataset/create_dataset_from_db.py \\\n    --paper_ids_path datasets/paper_ids/biorxiv/2014_1_17-2022_4_30-doi-plant.txt \\\n    --convert_to_s2_id_path datasets/paper_ids/convert/biorxiv_2014_1_17-2022_4_30-doi-plant_to_s2_ids.json \\\n    --database_path /local2/hirako/s2.db \\\n    --output_dir \"$OUTPUT_DIR\" \\\n    --oldest_date_for_train \"$OLDEST_TRAIN_YEAR\" \"$TEST_MONTH\" \\\n    --test_date \"$TEST_YEAR\" \"$TEST_MONTH\" \\\n    --n_years_after \"$N_YEARS_AFTER\" \\\n    --mode_for_citation_counts_within_n_years_after_publication complement\n```\n\nThe following options can be specified for `mode_for_citation_counts_within_n_years_after_publication`:\n\n- **use_full:** Utilize all citation counts, including future citation counts that would normally be unavailable.\n- **not_use:** Do not use the most recent papers for training at all.\n- **complement:** Complement the citation counts of recent papers and use them.\n- **no_complement:** Use the citation counts of recent papers without complementing (i.e., use the citation counts as of the test paper’s publication date).\n\nA reference shell script, `scripts/make_dataset/create_dataset_from_db.sh`, is provided for creating the dataset from the database.\n\n*Note:* The 
contents of `valid.jsonl` and `test.jsonl` generated by this program are identical. Due to experimental constraints, a development dataset cannot normally be created for each dataset; however, to simplify the implementation of model training and evaluation, a pseudo-development set is generated. Therefore, no tuning should be performed on this development set.\n\n## Model Training \u0026 Evaluation\n\n### Running the Program\n\nRunning `main.py` trains and evaluates the model on the generated dataset.\n\nBelow is an example of how to run the program when using BERT to predict a paper's citation count from its title and abstract.\n\n```bash\n# Option descriptions:\n#   experiment_count  Number of experiments to run with different random seeds\n#   batch_size        Batch size\n#   experiment_name   Experiment name on MLflow\n#   run_name          Run name on MLflow\n#   gpus              GPU indices to use\n#   bert_model        Model name from Hugging Face\n#   dataset_name      Directory name of the dataset to be used for training and evaluation\npython main.py \\\n    experiment_count=3 \\\n    batch_size=32 \\\n    experiment_name=\"$experiment_name\" \\\n    run_name=\"$run_name\" \\\n    gpus=\"$gpus\" \\\n    bert_model=\"$bert_model\" \\\n    dataset_name=\"$dataset_name\"\n```\n\nFor `dataset_name`, for example, if you have `train.jsonl`, `valid.jsonl`, and `test.jsonl` in a directory called `dataset/samples`, you should specify `samples` (i.e., the directory name excluding `dataset/`).\n\nFor other configurable hyperparameters, please refer to the `configs` directory.\n\n### Checking Experiment Results\n\nView the experiment results using MLflow:\n\n```bash\nmlflow ui\n```\n\n## Citation\n\n```bibtex\n@inproceedings{hirako-etal-2023-realistic,\n    title = \"Realistic Citation Count Prediction Task for Newly Published Papers\",\n    author = \"Hirako, Jun  and\n      Sasano, Ryohei  and\n      Takeda, Koichi\",\n    booktitle = \"Findings of the Association for Computational Linguistics: EACL 2023\",\n  
  month = may,\n    year = \"2023\",\n    address = \"Dubrovnik, Croatia\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2023.findings-eacl.84/\",\n    doi = \"10.18653/v1/2023.findings-eacl.84\",\n    pages = \"1131--1141\",\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoisutsu%2Frealistic-citation-count-prediction","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmoisutsu%2Frealistic-citation-count-prediction","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoisutsu%2Frealistic-citation-count-prediction/lists"}