Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/songzy12/citation_count_prediction

https://www.biendata.xyz/competition/Tsinghua_course3/
https://github.com/songzy12/citation_count_prediction

Last synced: about 1 month ago
JSON representation

https://www.biendata.xyz/competition/Tsinghua_course3/

Host: GitHub
URL: https://github.com/songzy12/citation_count_prediction
Owner: songzy12
Created: 2017-04-21T10:17:50.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2023-04-21T12:49:20.000Z (over 1 year ago)
Last Synced: 2024-12-12T16:16:23.471Z (about 1 month ago)
Language: Python
Homepage:
Size: 21.5 KB
Stars: 1
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

## Citation Count Prediction

This is an implementation of paper [1] (not exactlty).
Most of the features used are enlighted by this paper.

## How to run

First set up the directory tree as follows, then put the raw data under `data`.
Just type the following command in your console to reproduce our results:

```
cd src && ./run.sh
```

## Project Structure

### data/

Raw data including author id map, paper information, training set, test set.

* author.txt
* paper.txt
* citation_train.txt
* citation_test.txt

### feature/

All the intermediate features extracted by `feature.py`, cached to accelerate for future use.

### log/

The log files used for debug.

### model/

The model trained for future use.

### result/

Predicted result for the test set, using the model trained on the features and the labeled training set.

### src/

* util: for read input from file, write output into file
* feature: get a bunch of features for specific author
* model: using training set to train the model
* predictor: use the model to predict results

### test/

Some test code to make sure each component works as expected.

## Feature

The feature we used can be classified into the following categories.

### Topic related

We run an LDA model on the abstracts of all the papers and extract 20 topics. Then the features related to topics include:

* The total and average topic perlexity of this author's papers.
* The weighed citation count of this author's average topic.
* The topic rank of this author's average topic.

### Recency

* The total and average paper age of this author's papers

### Venue

* The average venue rank of this author's papers' venues.

### Author

* The h-index of this author based on reference info.
* The rank of this author based on his reference info

### Paper

* The total number of papers of this author
* The total number of references in paper data
* The total number of citations in paper data

### Coauthor

* The total number of coauthors of this author
* The average paper citation count of his coauthors in training data

## Model

We use Random Forest with 100 estimators on the features we got. It performs better than Linear Regression. We also have intended to use SVM, but it just runs endlessly.

The time cost of our last run is around 44 min for feature extraction, 47 min for model training, 17 min for result prediction.

## References

[1] Yan, Rui, et al. "Citation count prediction: learning to estimate future citations for literature." Proceedings of the 20th ACM international conference on Information and knowledge management. ACM, 2011.