# ***pyResearchThemes:*** Analyzing research themes from academic publications
:warning: Code is buggy :warning:
### Contents
[**1.0 Introduction**](https://github.com/SarthakJShetty/Bias#10-introduction)
[**2.0 Model Overview**](https://github.com/SarthakJShetty/Bias#20-model-overview)
[**3.0 How it works**](https://github.com/SarthakJShetty/Bias#30-how-it-works)
[**4.0 Installation Instructions**](https://github.com/SarthakJShetty/Bias#40-installation-instructions)
[**5.0 Results**](https://github.com/SarthakJShetty/Bias#50-results)
[**6.0 Citations**](https://github.com/SarthakJShetty/Bias#60-citations)
## 1.0 Introduction:
- The volume of academic publishing has doubled over the past ten years, making it nearly impossible to manually sift through the literature and identify broad areas of research within a discipline.
***Figure 1.1*** Increase in the number of scientific publications in the fields of physics and chemistry [1].
- In order to *understand* such vast volumes of research, there is a need for **automated text analysis tools**.
- However, existing tools are **expensive and lack in-depth analysis of publications**.
- To address these issues, we developed ***pyResearchThemes***, an **open-source, automated text analysis tool** that:
    - **Scrapes** papers from scientific repositories,
    - **Analyzes** meta-data such as date and journal of publication,
    - **Visualizes** themes of research using natural language processing.
- To demonstrate the tool's capabilities, we analyzed research themes from the field of Ecology & Conservation.
### 1.1 About:
This project is a collaboration between Sarthak J. Shetty, from the Department of Aerospace Engineering, Indian Institute of Science and Vijay Ramesh, from the Department of Ecology, Evolution & Environmental Biology, Columbia University.
## 2.0 Model Overview:
- The model is made up of five components:
    1. Scraper: This component scrapes a scientific repository for publications containing a specific combination of keywords.
    2. Cleaner: This component cleans the corpus of text retrieved from the repository and rids it of special characters that creep in during the formatting and submission of manuscripts.
    3. Analyzer: This component collects and measures the frequency of select keywords in the abstracts database.
    4. NLP Engine: This component extracts insights from the collected abstracts using topic modelling.
    5. Visualizer: This component presents the results and data from the Analyzer to the end user.
## 3.0 How it works:

***Figure 3.1*** Diagrammatic representation of the pipeline for collecting papers and generating visualizations.
### 3.1 Scraper:
- The ```Scraper.py``` currently scrapes only the abstracts from Springer, using the BeautifulSoup and urllib packages.
- A default URL is provided in the code. Once the keywords are provided, the URLs are queried, the resultant webpage is parsed with BeautifulSoup, and the ```abstract_id``` is scraped.
- A new ```abstract_id_database``` is prepared for each result page, and is referenced when a new paper is scraped (a minimal sketch of this step appears after Figure 3.2).
- The ```abstract_database``` contains the abstract along with the title, author and a complete URL from which the full text can be downloaded; it is saved as a ```.txt``` file.
- A ```status_logger``` is used to log the sequence of commands in the program.

**Figure 3.2** ```Scraper.py``` script grabbing the papers from Springer.
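A minimal sketch of the scraping step is shown below. The Springer search URL pattern and the ```/article/``` link filter are assumptions for illustration; the actual defaults live in ```Scraper.py```.

```python
# Hedged sketch of the scraping step. The search URL and the "/article/"
# link filter are assumptions, not the exact Scraper.py logic.
from urllib.request import urlopen
from bs4 import BeautifulSoup

def scrape_abstract_ids(keywords, page=1):
    """Fetch one Springer result page and collect candidate paper links."""
    url = f"https://link.springer.com/search/page/{page}?query={keywords}"
    soup = BeautifulSoup(urlopen(url).read(), "html.parser")
    # Each matching link becomes an entry in abstract_id_database,
    # which is later referenced when a new paper is scraped.
    abstract_id_database = [
        a["href"] for a in soup.find_all("a", href=True)
        if a["href"].startswith("/article/")
    ]
    return abstract_id_database
```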
### 3.2 Cleaner:
- The ```Cleaner.py``` cleans the corpus scraped from the repository before the topic models are generated (see the sketch after Figure 3.3).
- This script creates a clean variant of the ```.txt``` corpus file, which is then stored as ```_ANALYTICAL.txt``` for further analysis and modelling.

**Figure 3.3** ```Cleaner.py``` script gets rid of formatting and special characters present in the corpus.
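The cleaning pass can be sketched as below; the file naming follows the description above, but the regular expressions are illustrative assumptions rather than the exact ```Cleaner.py``` rules.

```python
# Illustrative cleaning pass: strip stray special characters, collapse
# repeated whitespace, then save the result as _ANALYTICAL.txt.
import re

def clean_corpus(corpus_file):
    with open(corpus_file, encoding="utf-8") as f:
        text = f.read()
    text = re.sub(r"[^A-Za-z0-9\s.,;:()'-]", " ", text)  # drop special characters
    text = re.sub(r"[ \t]+", " ", text)                   # collapse repeated spaces
    analytical_file = corpus_file.replace(".txt", "_ANALYTICAL.txt")
    with open(analytical_file, "w", encoding="utf-8") as f:
        f.write(text)
    return analytical_file
```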
### 3.3 Analyzer:
- The ```Analyzer.py``` analyzes the frequency of different words used in the abstract, and stores it in the form of a pandas dataframe.
- It serves as an intermediary between the Scraper and the Visualizer, packaging the scraped data into a ```.csv``` file (a minimal sketch appears after Figure 3.4).
- This ```.csv``` file is then passed on to the ```Visualizer.py``` to generate the "Trends" chart.

**Figure 3.4** ```Analyzer.py``` script generates this ```.csv``` file for analysis by other parts of the pipeline.
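A hedged sketch of the frequency count follows; the one-abstract-per-line layout and the output filename are assumptions for illustration.

```python
# Sketch of the Analyzer step: count how often a keyword appears in each
# abstract and persist the counts as a .csv for the Visualizer.
import pandas as pd

def keyword_frequency(analytical_file, keyword):
    counts = []
    with open(analytical_file, encoding="utf-8") as f:
        for abstract in f:                      # assume one abstract per line
            counts.append(abstract.lower().count(keyword.lower()))
    frequency_df = pd.DataFrame({"abstract_index": range(len(counts)),
                                 "frequency": counts})
    frequency_df.to_csv("keyword_frequency.csv", index=False)
    return frequency_df
```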
### 3.4 NLP Engine:
- The NLP Engine is used to generate the topic modelling charts for the [Visualizer.py](https://github.com/SarthakJShetty/Bias/tree/master/Visualizer.py) script.
- The language models are generated from the corpus using the gensim and spaCy packages, which employ the Latent Dirichlet Allocation (LDA) method [2]; a minimal sketch appears at the end of this subsection.
- The corpus and model generated are then passed to the [Visualizer.py](https://github.com/SarthakJShetty/Bias/tree/master/Visualizer.py) script.
- The topic modelling chart can be pulled from [here](https://github.com/SarthakJShetty/Bias/blob/journal/LOGS/LOG_2019-04-24_19_35_East_Melanesian_Islands/Data_Visualization_Topic_Modelling.html).
**Note:** The ```.html``` file linked above has to be downloaded and opened in a JavaScript enabled browser to be viewed.
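A minimal gensim LDA sketch is shown below. Tokenization here is a naive whitespace split, whereas the actual NLP Engine uses spaCy, and the topic count is an assumption.

```python
# Minimal LDA sketch with gensim. Tokenization is deliberately naive;
# NLP_Engine.py's spaCy-based preprocessing is more involved.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def build_topic_model(abstracts, num_topics=10):
    tokenized = [abstract.lower().split() for abstract in abstracts]
    dictionary = Dictionary(tokenized)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=10)
    return lda, corpus, dictionary
```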
### 3.5 Visualizer:
- The ```Visualizer.py``` script is responsible for generating the visualizations associated with a specific search, using gensim and spaCy for the research themes and the matplotlib library for the trends.
- The research theme visualizations are presented under the 5.0 Results section.
- The research themes data visualization is stored as a ```.html``` file in the LOGS directory and can be viewed in the browser; a sketch of the trends plot follows.
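A hedged sketch of the trends chart is shown below; the "year" and "frequency" column names are assumptions about the ```.csv``` produced by ```Analyzer.py```, not its documented schema.

```python
# Hedged sketch of the "Trends" chart. The column names are assumptions
# about the .csv produced by Analyzer.py.
import matplotlib.pyplot as plt
import pandas as pd

def plot_trends(csv_file, keyword):
    trends_df = pd.read_csv(csv_file)
    plt.plot(trends_df["year"], trends_df["frequency"], marker="o")
    plt.xlabel("Year of publication")
    plt.ylabel(f'Frequency of "{keyword}"')
    plt.title(f'Occurrence of "{keyword}" in scraped abstracts')
    plt.savefig("trends.png")
```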
## 4.0 Installation Instructions:
### 4.1 Common instructions:
Note: These instructions are common to both Ubuntu and Windows systems.
1. Clone this repository:
```
E:\>git clone https://github.com/SarthakJShetty/Bias.git
```
2. Change directory to the 'Bias' directory:
```
E:\>cd Bias
```
### 4.2 Virtualenv instructions:
1. Install ```virtualenv``` using ```pip```:
```
user@Ubuntu: pip install virtualenv
```
2. Create a ```virtualenv``` environment called "Bias" in the directory of your project:
```
user@Ubuntu: virtualenv --no-site-packages Bias
```
Note: This step usually takes about 30 seconds to a minute.
3. Activate the virtualenv environment:
```
user@Ubuntu: ~/Bias$ source Bias/bin/activate
```
You are now inside the ```Bias``` environment.
4. Install the requirements from ```ubuntu_requirements.txt```:
```
(Bias) user@Ubuntu: pip3 install -r ubuntu_requirements.txt
```
Note: This step usually takes a few minutes, depending on your network speed.
### 4.3 Conda instructions:
1. Create a new ```conda``` environment:
```
E:\Bias>conda create --name Bias python=3.5
```
2. Enter the new ```Bias``` environment created:
```
E:\Bias>activate Bias
```
3. Install the required packages from ```conda_requirements.txt```:
```
(Bias) E:\Bias>conda install --yes --file conda_requirements.txt
```
Note: This step usually takes a few minutes, depending on your network speed.
To run the code and generate the topic distribution and trend of research graphs:
```
(Bias) E:\Bias>python Bias.py --keywords="Western Ghats" --trends="Conservation"
```
- This command will scrape the abstracts from Springer that are related to "Western Ghats", and calculate the frequency with which the term "Conservation" appears in those abstracts.
## 5.0 Results:
Currently, the results from the various biodiversity runs are stored as tarballs in the LOGS folder, primarily to save space.
To view the logs, topic-modelling results & trends chart from the tarballs, run the following commands:
```
tar zxvf <tarball-name>.tar.gz
```
**Example:**
To view the logs & results generated from the run on "East Melanesian Islands":
```
tar zxvf LOG_2019-04-24_19_35_East_Melanesian_Islands.tar.gz
```
### 5.1 Topic Modelling Results:
The ```NLP_Engine.py``` module creates topic modelling charts such as the one shown below.

***Figure 5.1*** Distribution of topics discussed in publications pulled from 8 conservation and ecology themed journals.
- Circles indicate topics generated from the ```.txt``` file supplied to the ```NLP_Engine.py```, as part of the ```Bias``` pipeline.
- Each topic is made of a number of top keywords that are seen on the right, with an adjustable relevancy metric on top.
- More details regarding the visualizations and the underlying mechanics can be found [here](https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf).
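For reference, the adjustable relevance metric in these charts follows Sievert & Shirley (linked above): for a term $w$ in topic $t$ with tuning parameter $\lambda$,

$$\text{relevance}(w \mid t, \lambda) = \lambda \log \phi_{tw} + (1 - \lambda) \log \frac{\phi_{tw}}{p_w}$$

where $\phi_{tw}$ is the probability of $w$ within topic $t$ and $p_w$ is its overall probability in the corpus.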
### 5.2 Weights and Frequency Results:

***Figure 5.2*** Here, we plot the variation in the weights and frequency of keywords falling under topic one from the chart above.
- Here, "weights" is a proxy for the importance of a specific keyword to a highlighted topic. The weight of a keyword is calculated by: i) absolute frequency and, ii) frequency of occurance with other keywords in the same topic.
- Factors i) and ii) result in variable weights being assigned to different keywords and emphasize it's importance in the topic.
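With a trained gensim model, the per-keyword weights for a topic can be read off as below. This reuses the hypothetical ```build_topic_model``` from the Section 3.4 sketch; the sample abstracts are placeholders.

```python
# Hedged sketch: list the top keywords and their weights for the first
# topic of a trained gensim LdaModel.
abstracts = ["conservation of forest species", "habitat loss drives decline"]
lda, corpus, dictionary = build_topic_model(abstracts, num_topics=2)
for keyword, weight in lda.show_topic(0, topn=5):  # (keyword, weight) pairs
    print(f"{keyword}: {weight:.4f}")
```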
### 5.3 Trends Result *:

***Figure 5.3*** Variation in the frequency of the term "Conservation" over time in the corpus of scraped text.
- Here, abstracts pertaining to the [Eastern Himalayas](https://github.com/SarthakJShetty/Bias/blob/master/LOGS/LOG_2019-02-27_15_23_Eastern_Himalayas.tar.gz) were scraped and the temporal trend of occurrence of "Conservation" was checked.
- The frequency is presented alongside the bubble for each year on the chart.
- * We are still working on how to effectively present the trends and usage variations temporally. This feature is not part of the main package.
## 6.0 Citations:
- **[1]** - Gabriela C. Nunez-Mir, Basil V. Iannone III. *Automated content analysis: addressing the big literature challenge in ecology and evolution*. Methods in Ecology and Evolution. *June, 2016*.
- **[2]** - David Blei, Andrew Y. Ng, Michael I. Jordan. *Latent Dirichlet Allocation*. The Journal of Machine Learning Research. *March, 2003*.