https://github.com/vtraag/journal-causal-effect-replication

Replication materials for inferring the causal effect of journals on citations

Host: GitHub
URL: https://github.com/vtraag/journal-causal-effect-replication
Owner: vtraag
Created: 2019-12-18T08:06:27.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2021-03-30T20:27:08.000Z (over 4 years ago)
Last Synced: 2025-04-13T11:45:44.349Z (6 months ago)
Language: Python
Size: 6.84 KB
Stars: 9
Watchers: 1
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md

README

The material in this repository is meant to replicate the results of [1]. It contains the necessary source code to replicate our results. This repository is also archived at Zenodo:

[![DOI](https://zenodo.org/badge/228789557.svg)](https://zenodo.org/badge/latestdoi/228789557)

# Data

The data for replication is available from Zenodo:

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3582974.svg)](https://doi.org/10.5281/zenodo.3582974)

The dataset should be downloaded and extracted to a subdirectory of this repository called `data`. Please see the `README.md` in the data repository for more details regarding the data.

# Source code

All necessary source code is contained in the `src` directory. The code was run with `python 3.6.6`, `pandas 0.23.4` and `pystan 2.18.0.0`.

The data is processed in subsets, per field and year. We first process the data to separate it into suitable subsets using `prepare_data.py`. We then create the `pystan` model using `cit_stan_create.py`, which is pickled to `cit_model.pkl`. The pickled model is then reused by `cit_stan_run.py`, which runs the `pystan` model on one particular subset. This is explained in more detail below.

1. Prepare the subsets. This should be done by executing the following from the `src` directory.

``python prepare_data.py``

This script assumes all data is contained in the `data` directory, and will create a directory in `data/subsets` for each subset that fulfills the conditions (i.e. at least 20 preprints published no sooner than 30 days after being posted on arXiv). The subsets are organised as `[subject]/[journal]/[year]`.
Note that an existing directory is not recreated. If a directory exists but its subset does not meet the criteria, it is removed.
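
As a rough illustration of the selection criteria above, the following sketch checks whether one subset qualifies. It is not the code in `prepare_data.py`: the function name `subset_qualifies` and the `(arxiv_date, publication_date)` record layout are hypothetical, and it assumes the 20-preprint threshold counts only preprints published at least 30 days after posting.

```python
from datetime import date

MIN_PREPRINTS = 20   # README: at least 20 preprints
MIN_DELAY_DAYS = 30  # README: published no sooner than 30 days after posting


def subset_qualifies(records, min_preprints=MIN_PREPRINTS, min_delay_days=MIN_DELAY_DAYS):
    """Return True if enough preprints were published late enough after posting.

    `records` is an iterable of (arxiv_date, publication_date) pairs.
    This layout is an assumption for illustration, not the real data schema.
    """
    late_enough = sum(
        1
        for arxiv_date, publication_date in records
        if (publication_date - arxiv_date).days >= min_delay_days
    )
    return late_enough >= min_preprints


# Hypothetical subset: 20 preprints, each published exactly 30 days after posting.
example = [(date(2004, 1, 1), date(2004, 1, 31))] * 20
print(subset_qualifies(example))
```

Under these assumptions, a subset with 19 such preprints, or with 20 preprints published after only 29 days, would not get a directory.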

2. Create the `pystan` model. This is done by executing the following from the `src` directory.

``python cit_stan_create.py``

This script will create a `cit_model.pkl` file in the current working directory, which contains the pickled `pystan` model. Compiling the model takes quite a bit of time, so we create the model separately and reuse it on each subset.
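
The compile-once, reuse-many-times pattern can be sketched with a small helper. This is a generic illustration, not the contents of `cit_stan_create.py` or `cit_stan_run.py`: the repository splits creation and loading across two scripts, whereas the hypothetical `load_or_build` function below merges them for brevity.

```python
import pickle
from pathlib import Path


def load_or_build(path, build):
    """Return the object pickled at `path`; build and pickle it first if absent.

    `build` is any zero-argument callable. With PyStan 2 it could wrap
    `pystan.StanModel(model_code=...)`, which is the slow compilation step.
    """
    path = Path(path)
    if path.exists():
        # Reuse the previously pickled object instead of rebuilding it.
        with path.open("rb") as handle:
            return pickle.load(handle)
    obj = build()
    with path.open("wb") as handle:
        pickle.dump(obj, handle)
    return obj
```

A call such as `load_or_build("cit_model.pkl", build)` only invokes `build` the first time; later calls read the pickle. Unpickling a compiled model requires `pystan` to be importable in the loading process.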

3. Run the `pystan` model on each subset. This is done by executing the following.

``python cit_stan_run.py [source dir] [data subset dir] [result subset dir]``

The `[source dir]` should refer to the directory in which `cit_model.pkl` is available. If the previous step was run from the `src` directory, and this script is also run from the `src` directory, you can simply indicate the current directory (`.`). The `[data subset dir]` should refer to a specific subset for which you wish to run the `pystan` model, e.g. `../data/subsets/Astrophysics/12375/2004`. The `[result subset dir]` should refer to the directory in which you would like the results to be stored, e.g. `../results/subsets/Astrophysics/12375/2004`. If the directory does not yet exist it will be created (including intermediate directories).

The result consists of two files: `fit.csv` and `stan_summary.txt`. The first contains the samples from the posterior distributions, and the second contains a summary of the samples. Existing files will be overwritten. Note that a [bug](https://github.com/stan-dev/pystan/issues/429) in `pystan` may result in incorrectly aligned summary files.

This setup makes it possible to run the `pystan` model on all 3892 different subsets in parallel. For the original results, all calculations were performed on the Shark cluster of the LUMC.
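
Because each subset is an independent invocation of `cit_stan_run.py`, the full list of commands can be generated from the directory layout. The sketch below is one possible way to do this and is not part of the repository: the helper name `build_commands` is hypothetical, and it assumes every directory exactly three levels below the data root is a `[subject]/[journal]/[year]` subset.

```python
from pathlib import Path


def build_commands(data_root, result_root, source_dir="."):
    """Return one `cit_stan_run.py` argument list per subset directory."""
    data_root = Path(data_root)
    result_root = Path(result_root)
    commands = []
    # Subsets are assumed to sit exactly three levels down: subject/journal/year.
    for subset_dir in sorted(data_root.glob("*/*/*")):
        if not subset_dir.is_dir():
            continue
        # Mirror the subject/journal/year path under the result root.
        result_dir = result_root / subset_dir.relative_to(data_root)
        commands.append(
            ["python", "cit_stan_run.py", str(source_dir), str(subset_dir), str(result_dir)]
        )
    return commands


if __name__ == "__main__":
    # Paths as seen from the `src` directory, matching the examples above.
    for command in build_commands("../data/subsets", "../results/subsets"):
        print(" ".join(command))
```

Each argument list can be passed to `subprocess.run`, or the printed lines can be handed to a cluster scheduler. How jobs were submitted on the Shark cluster is not described here, so that part is left open.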

# References

[1] Traag, V.A. (2020), Inferring the causal effect of journals on citations. Quantitative Science Studies, doi: [10.1162/qss_a_00128](https://doi.org/10.1162/qss_a_00128), arXiv:[1912.08648](https://arxiv.org/abs/1912.08648)
