{"id":19347269,"url":"https://github.com/sarthakjshetty/bias","last_synced_at":"2026-06-13T13:31:57.886Z","repository":{"id":75476060,"uuid":"146149799","full_name":"SarthakJShetty/Bias","owner":"SarthakJShetty","description":"Investigating biases in scientific publications","archived":false,"fork":false,"pushed_at":"2021-08-05T10:13:00.000Z","size":136780,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-06T14:35:31.861Z","etag":null,"topics":["ai","biology","conservation","datavisualization","ecology","evolution","nlp","scientific-publications","topic-modelling"],"latest_commit_sha":null,"homepage":"https://github.com/SarthakJShetty/Bias","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SarthakJShetty.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-08-26T03:58:37.000Z","updated_at":"2022-11-17T09:12:50.000Z","dependencies_parsed_at":"2023-06-06T14:45:15.062Z","dependency_job_id":null,"html_url":"https://github.com/SarthakJShetty/Bias","commit_stats":{"total_commits":506,"total_committers":2,"mean_commits":253.0,"dds":0.01976284584980237,"last_synced_commit":"a1da64a39001de11ba992f8386c80d597a2dbb4a"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SarthakJShetty%2FBias","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SarthakJShetty%2FBias/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SarthakJShetty%2FBias/releases","manifests_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SarthakJShetty%2FBias/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SarthakJShetty","download_url":"https://codeload.github.com/SarthakJShetty/Bias/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240457940,"owners_count":19804489,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","biology","conservation","datavisualization","ecology","evolution","nlp","scientific-publications","topic-modelling"],"created_at":"2024-11-10T04:15:14.224Z","updated_at":"2026-06-13T13:31:57.829Z","avatar_url":"https://github.com/SarthakJShetty.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ***pyResearchThemes:*** Analyzing research themes from academic publications\n\n:warning: \u003cstrong\u003eCode is buggy\u003c/strong\u003e :warning:\n\n### Contents\n[**1.0 Introduction**](https://github.com/SarthakJShetty/Bias#10-introduction) \u003cbr\u003e\n\n[**2.0 Model Overview**](https://github.com/SarthakJShetty/Bias#20-model-overview) \u003cbr\u003e\n\n[**3.0 How it works**](https://github.com/SarthakJShetty/Bias#30-how-it-works) \u003cbr\u003e\n\n[**4.0 Installation Instructions**](https://github.com/SarthakJShetty/Bias#40-installation-instructions) \u003cbr\u003e\n\n[**5.0 Results**](https://github.com/SarthakJShetty/Bias#50-results) \u003cbr\u003e\n\n[**6.0 Citations**](https://github.com/SarthakJShetty/Bias#60-citations)\n\n## 1.0 Introduction:\n\n- Academic publishing has risen 2-fold in the past ten 
years, making it nearly impossible to sift through a large number of papers and identify broad areas of research within disciplines.\n\n\u003cdiv style=\"text-align:center\"\u003e\n\t\u003cimg src=\"assets/Increase.png\" alt=\"Increase in number of scientific publications\"\u003e\n\u003c/div\u003e\n\n\u003ci\u003e***Figure 1.1*** Increase in the number of scientific publications in the fields of physics and chemistry [1].\u003c/i\u003e\n\n- In order to *understand* such vast volumes of research, there is a need for **automated text analysis tools**.\n\n- However, existing tools are **expensive and lack in-depth analysis of publications**.\n\n- To address these issues, we developed ***pyResearchThemes***, an **open-source, automated text analysis tool** that:\n\t- **Scrapes** papers from scientific repositories,\n\t- **Analyzes** meta-data such as date and journal of publication,\n\t- **Visualizes** themes of research using natural language processing.\n\n- To demonstrate the ability of the tool, we have analyzed the research themes from the field of Ecology \u0026 Conservation.\n\n### 1.1 About:\n\nThis project is a collaboration between \u003ca title=\"Sarthak\" href=\"https://SarthakJShetty.github.io\" target=\"_blank\"\u003e Sarthak J. 
Shetty\u003c/a\u003e, from the \u003ca title=\"Aerospace Engineering\" href=\"https://aero.iisc.ac.in\" \u003eDepartment of Aerospace Engineering\u003c/a\u003e, \u003ca title=\"IISc\" href=\"https://iisc.ac.in\" target=\"_blank\"\u003e Indian Institute of Science\u003c/a\u003e and \u003ca title=\"Vijay\" href=\"https://evolecol.weebly.com/\" target=\"_blank\"\u003e Vijay Ramesh\u003c/a\u003e, from the \u003ca title=\"E3B\" href=\"http://e3b.columbia.edu/\" target=\"_blank\"\u003eDepartment of Ecology, Evolution \u0026 Environmental Biology\u003c/a\u003e, \u003ca href=\"https://www.columbia.edu/\" title=\"Columbia University\" target=\"_blank\"\u003eColumbia University\u003c/a\u003e.\n\n## 2.0 Model Overview:\n\n- The model is made up of five parts:\n\n\t1. \u003cstrong\u003e\u003ca title=\"Scraper\" href=\"https://github.com/SarthakJShetty/Bias/tree/master/Scraper.py/\"\u003eScraper\u003c/a\u003e:\u003c/strong\u003e This component scrapes a scientific repository for publications containing the specific combination of keywords.\n\n\t2. \u003cstrong\u003e\u003ca title=\"Cleaner\" href=\"https://github.com/SarthakJShetty/Bias/tree/master/Cleaner.py/\"\u003eCleaner\u003c/a\u003e:\u003c/strong\u003e This component cleans the corpus of text retrieved from the repository and rids it of special characters that creep in during formatting and submission of manuscripts.\n\n\t3. \u003cstrong\u003e\u003ca title=\"Analyzer\" href=\"https://github.com/SarthakJShetty/Bias/tree/master/Analyzer.py/\"\u003eAnalyzer\u003c/a\u003e:\u003c/strong\u003e This component collects and measures the frequency of select keywords in the abstracts database.\n\n\t4. \u003cstrong\u003e\u003ca title=\"NLP Engine\" href=\"https://github.com/SarthakJShetty/Bias/tree/master/NLP_Engine.py/\"\u003eNLP Engine\u003c/a\u003e:\u003c/strong\u003e This component extracts insights from the collected abstracts by performing topic modelling.\n\n\t5. 
\u003cstrong\u003e\u003ca title=\"Visualizer\" href=\"https://github.com/SarthakJShetty/Bias/tree/master/Visualizer.py/\"\u003eVisualizer\u003c/a\u003e:\u003c/strong\u003e This component presents the results and data from the Analyzer to the end user.\n\n## 3.0 How it works:\n\n\u003cimg src=\"assets/Bias.png\" alt=\"Bias Pipeline\"\u003e\n\n\u003ci\u003e***Figure 3.1*** Diagrammatic representation of the pipeline for collecting papers and generating visualizations.\u003c/i\u003e\n\n### 3.1 Scraper:\n- The \u003ca title=\"Scraper\" href=\"https://github.com/SarthakJShetty/Bias/blob/master/Scraper.py\"\u003e```Scraper.py```\u003c/a\u003e currently scrapes only the abstracts from \u003ca title=\"Springer\" href=\"https://www.link.Springer.com\" target=\"_blank\"\u003eSpringer\u003c/a\u003e using the \u003ca title=\"BeautifulSoup\" href=\"https://www.crummy.com/software/BeautifulSoup/bs4/doc/\" target=\"_blank\"\u003eBeautifulSoup\u003c/a\u003e and \u003ca title=\"urllib\" href=\"https://docs.python.org/3/library/urllib.request.html#module-urllib.request\" target=\"_blank\"\u003eurllib\u003c/a\u003e packages.\n\n- A default URL is provided in the code. Once the keywords are provided, the URLs are queried, the resultant webpage is parsed with BeautifulSoup, and the ```abstract_id``` is scraped.\n\n- A new \u003ca title=\"Abstract ID\" target=\"_blank\" href=\"https://github.com/SarthakJShetty/Bias/blob/journal/LOGS/LOG_2019-04-24_19_35_East_Melanesian_Islands/Abstract_ID_Database_2019-04-24_19_35_1.txt\"\u003e```abstract_id_database```\u003c/a\u003e is prepared for each result page, and is referenced when a new paper is scraped.\n\n- The \u003ca title=\"Abstract Database\" target=\"_blank\" href=\"https://github.com/SarthakJShetty/Bias/blob/journal/LOGS/LOG_2019-04-24_19_35_East_Melanesian_Islands/Abstract_Database_2019-04-24_19_35.txt\"\u003e```abstract_database```\u003c/a\u003e contains the abstract along with the title, author and a complete URL from where the full text can be downloaded. 
They are saved in a ```.txt``` file.\n\n- A \u003ca title=\"Status Logger\" href=\"https://github.com/SarthakJShetty/Bias/blob/journal/LOGS/LOG_2019-04-24_19_35_East_Melanesian_Islands/Status_Logger_2019-04-24_19_35.txt\" target=\"_blank\"\u003e```status_logger```\u003c/a\u003e is used to log the sequence of commands in the program.\n\n\u003cimg src=\"assets/Scraper.png\" alt=\"Scraper grabbing the papers from Springer\"\u003e\n\n\u003ci\u003e **Figure 3.2** \u003ca title=\"Scraper\" href=\"https://github.com/SarthakJShetty/Bias/blob/master/Scraper.py\"\u003e```Scraper.py```\u003c/a\u003e script grabbing the papers from \u003ca title=\"Springer\" href=\"https://www.link.Springer.com\" target=\"_blank\"\u003eSpringer\u003c/a\u003e.\u003c/i\u003e\n\n### 3.2 Cleaner:\n- The \u003ca title=\"Cleaner\" href=\"https://github.com/SarthakJShetty/Bias/tree/master/Cleaner.py/\"\u003e```Cleaner.py```\u003c/a\u003e cleans the corpus scraped from the repository, before the topic models are generated.\n\n- This script creates a clean variant of the ```.txt``` corpus file that is then stored as \u003ca href=\"https://github.com/SarthakJShetty/Bias/blob/journal/LOGS/LOG_2019-04-24_19_35_East_Melanesian_Islands/Abstract_Database_2019-04-24_19_35_ANALYTICAL.txt\" title=\"Analytical File\"\u003e```_ANALYTICAL.txt```\u003c/a\u003e, for further analysis and modelling.\n\n\u003cimg src='assets/Cleaner.png' alt=\"Cleaner.py cleaned up text\"\u003e\n\n\u003ci\u003e **Figure 3.3** \u003ca title=\"Cleaner\" href=\"https://github.com/SarthakJShetty/Bias/tree/master/Cleaner.py/\"\u003e```Cleaner.py```\u003c/a\u003e script gets rid of formatting and special characters present in the corpus.\u003c/i\u003e\n\n### 3.3 Analyzer:\n- The \u003ca title=\"Analyzer\" href=\"https://github.com/SarthakJShetty/Bias/tree/master/Analyzer.py/\"\u003e```Analyzer.py```\u003c/a\u003e analyzes the frequency of different words used in the abstract, and stores it in the form of a \u003ca title=\"Pandas\" 
href=\"https://pandas.pydata.org/\"\u003epandas\u003c/a\u003e dataframe.\n\n- It serves as an intermediary between the Scraper and the Visualizer, converting the scraped data into a \u003ca title=\"Analyzer CSV file\" href=\"https://github.com/SarthakJShetty/Bias/blob/journal/LOGS/LOG_2019-04-24_19_35_East_Melanesian_Islands/Abstract_Database_2019-04-24_19_35.csv\"\u003e```.csv```\u003c/a\u003e file.\n\n- This ```.csv``` file is then passed on to the \u003ca title=\"Visualizer\" href=\"https://github.com/SarthakJShetty/Bias/blob/master/Visualizer.py\"\u003e```Visualizer.py```\u003c/a\u003e to generate the \"Trends\" \u003ca href=\"https://github.com/SarthakJShetty/Bias/tree/journal#53-trends-result-\" title=\"Trends Charts\"\u003echart\u003c/a\u003e.\n\n\u003cimg src=\"assets/Analyzer.png\" alt=\"Analyzer sorting the frequency of each word occurring in the corpus\"\u003e\n\n\u003ci\u003e**Figure 3.4** \u003ca title=\"Analyzer\" href=\"https://github.com/SarthakJShetty/Bias/tree/master/Analyzer.py/\"\u003e```Analyzer.py```\u003c/a\u003e script generates this ```.csv``` file for analysis by other parts of the pipeline.\u003c/i\u003e\n\n### 3.4 NLP Engine:\n\n- The NLP Engine is used to generate the topic modelling charts for the [Visualizer.py](https://github.com/SarthakJShetty/Bias/tree/master/Visualizer.py) script. 
\n\n- The language models are generated from the corpus for analysis using the \u003ca title=\"Gensim\" href=\"https://pypi.org/project/gensim/\"\u003egensim\u003c/a\u003e and \u003ca title=\"spaCy\" href=\"https://spacy.io\"\u003espaCy\u003c/a\u003e packages that employ the \u003ca href=\"https://dl.acm.org/doi/10.5555/944919.944937\" title=\"LDA Modelling\"\u003eLatent Dirichlet allocation (LDA)\u003c/a\u003e method \u003ca title=\"LDA Modelling\" href=\"\"\u003e[2]\u003c/a\u003e.\n\n- The corpus and model generated are then passed to the [Visualizer.py](https://github.com/SarthakJShetty/Bias/tree/master/Visualizer.py) script.\n\n- The topic modelling chart can be pulled from [here](https://github.com/SarthakJShetty/Bias/blob/journal/LOGS/LOG_2019-04-24_19_35_East_Melanesian_Islands/Data_Visualization_Topic_Modelling.html).\n\n\t**Note:** The \u003ca title=\"Topic Modelling .html\" href=\"https://github.com/SarthakJShetty/Bias/blob/journal/LOGS/LOG_2019-04-24_19_35_East_Melanesian_Islands/Data_Visualization_Topic_Modelling.html\"\u003e```.html```\u003c/a\u003e file linked above has to be downloaded and opened in a JavaScript-enabled browser to be viewed.\n\n### 3.5 Visualizer:\n\n- The \u003ca title=\"Visualizer\" href=\"https://github.com/SarthakJShetty/Bias/blob/master/Visualizer.py\"\u003e```Visualizer.py```\u003c/a\u003e code is responsible for generating the visualization associated with a specific search, using \u003ca title=\"Gensim\" href=\"https://pypi.org/project/gensim/\" target=\"_blank\"\u003egensim\u003c/a\u003e and \u003ca title=\"spaCy\" href=\"https://spacy.io\" target=\"_blank\"\u003espaCy\u003c/a\u003e for research themes and the \u003ca title=\"Matplotlib\" href=\"https://matplotlib.org/\" target=\"_blank\"\u003ematplotlib\u003c/a\u003e library for the trends.\n\n- The research theme visualization is functional and is presented under the \u003ca title=\"Results Section\" 
href=\"https://github.com/SarthakJShetty/Bias/tree/journal#50-results\"\u003e5.0 Results\u003c/a\u003e section.\n\n- The research themes data visualization is stored as a \u003ca title=\"Data Visualization\" href=\"https://github.com/SarthakJShetty/Bias/blob/journal/LOGS/LOG_2019-04-24_19_35_East_Melanesian_Islands/Data_Visualization_Topic_Modelling.html\"\u003e.html file\u003c/a\u003e in the LOGS directory and can be viewed in the browser.\n\n## 4.0 Installation Instructions:\n\n### 4.1 Common instructions:\n\n\u003cstrong\u003eNote:\u003c/strong\u003e These instructions are common to both Ubuntu and Windows systems. \n\n1.  Clone this repository:\n\n\t\tE:\\\u003egit clone https://github.com/SarthakJShetty/Bias.git\n\n2. Change directory to the 'Bias' directory:\n\n\t\tE:\\\u003ecd Bias\t\t\n\n### 4.2 Virtualenv instructions:\t\t\n\n1. Install ```virtualenv``` using ```pip```:\n\n\t\tuser@Ubuntu: pip install virtualenv\n\n2. Create a ```virtualenv``` environment called \"Bias\" in the directory of your project:\n\n\t\tuser@Ubuntu: virtualenv --no-site-packages Bias\n\t\n\t\u003cstrong\u003eNote:\u003c/strong\u003e This step usually takes about 30 seconds to a minute.\n\n3. Activate the virtualenv environment:\n\n\t\tuser@Ubuntu: ~/Bias$ source Bias/bin/activate\n\n\tYou are now inside the ```Bias``` environment.\n\n4. Install the requirements from \t\u003ca title=\"Ubuntu Requirements\" href=\"https://github.com/SarthakJShetty/Bias/blob/master/ubuntu_requirements.txt\"\u003e```ubuntu_requirements.txt```\u003c/a\u003e:\n\t\n\t\t(Bias) user@Ubuntu: pip3 install -r ubuntu_requirements.txt\n\t\t\n\t\u003cstrong\u003eNote:\u003c/strong\u003e This step usually takes a few minutes, depending on your network speed.\n\n### 4.3 Conda instructions:\n\n1. Create a new ```conda``` environment:\n\t\n\t\tE:\\Bias conda create --name Bias python=3.5\t\n\n2. Enter the new ```Bias``` environment created:\n\t\n\t\tE:\\Bias activate Bias\n\n3. 
Install the required packages from \u003ca href=\"https://github.com/SarthakJShetty/Bias/blob/master/conda_requirements.txt\"\u003e```conda_requirements.txt```\u003c/a\u003e:\n\t\t\n\t\t(Bias) E:\\Bias conda install --yes --file conda_requirements.txt\n\n\t\u003cstrong\u003eNote:\u003c/strong\u003e This step usually takes a few minutes, depending on your network speed.\n\n\nTo run the code and generate the topic distribution and research trend graphs:\n\t\t\n\t\t(Bias) E:\\Bias python Bias.py --keywords=\"Western Ghats\" --trends=\"Conservation\"\n\n- This command will scrape the abstracts from \u003ca title=\"Springer\" href=\"https://link.springer.com/\" target=\"_blank\"\u003eSpringer\u003c/a\u003e that are related to \"Western Ghats\", and calculate the frequency with which the term \"Conservation\" appears in those abstracts.\n\n## 5.0 Results:\n\nCurrently, the \u003ca title=\"LOGS\" href=\"https://github.com/SarthakJShetty/Bias/blob/master/LOGS/\" target=\"_blank\"\u003eresults\u003c/a\u003e from the various biodiversity runs are stored as tarballs, in the \u003ca title=\"LOGS\" href=\"https://github.com/SarthakJShetty/Bias/blob/master/LOGS/\" target=\"_blank\"\u003eLOGS\u003c/a\u003e folder, primarily to save space.\n\nTo view the logs, topic-modelling results \u0026 trends chart from the tarballs, run the following command:\n\n\t\ttar zxvf \u003clog_folder_to_be_unarchived\u003e.tar.gz\n\n**Example:**\n\nTo view the logs \u0026 results generated from the run on \u003ca title=\"east Melanesian Islands\" target=\"_blank\" href=\"https://github.com/SarthakJShetty/Bias/blob/master/LOGS/LOG_2019-04-24_19_35_East_Melanesian_Islands.tar.gz\"\u003e\"East Melanesian Islands\"\u003c/a\u003e:\n\n\t\ttar zxvf LOG_2019-04-24_19_35_East_Melanesian_Islands.tar.gz\n\n### 5.1 Topic Modelling Results:\n\nThe ```NLP_Engine.py``` module creates topic modelling charts such as the one shown below.\n\n\u003cimg src='assets/Topics.png' alt='Topic Modelling 
Chart'\u003e\n\n\u003ci\u003e***Figure 5.1*** Distribution of topics discussed in publications pulled from \u003ca title=\"Ecology Journals\" href=\"journals.md\"\u003e8 conservation and ecology themed journals\u003c/a\u003e\u003c/i\u003e.\n\n- Circles indicate topics generated from the ```.txt``` file supplied to the ```NLP_Engine.py```, as part of the ```Bias``` pipeline.\n- Each topic is made up of a number of top keywords that are seen on the right, with an adjustable relevancy metric on top.\n- More details regarding the visualizations and the underlying mechanics can be checked out [here](https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf).\n\n### 5.2 Weights and Frequency Results:\n\n\u003cimg src = 'assets/WeightsAndFrequency.png' alt= \"Weights and Frequency\"\u003e\n\n\u003ci\u003e***Figure 5.2*** Here, we plot the variation in the weights and frequency of keywords falling under topic one from the chart \u003ca title=\"Link to Topic Modelling charts\" href=\"https://github.com/SarthakJShetty/Bias/tree/journal/#51-topic-modelling-results\"\u003eabove\u003c/a\u003e.\u003c/i\u003e\n\n- Here, \"weights\" is a proxy for the importance of a specific keyword to a highlighted topic. 
The weight of a keyword is calculated from: i) its absolute frequency, and ii) its frequency of occurrence with other keywords in the same topic.\n\n- Factors i) and ii) result in variable weights being assigned to different keywords and emphasize their importance in the topic.\n\n### 5.3 Trends Result *:\n\n\u003cimg src = \"assets/XKCD.png\" alt = 'Trends Chart for Eastern '\u003e\n\n\u003ci\u003e***Figure 5.3*** Variation in the frequency of the term \"Conservation\" over time in the corpus of text scraped.\u003c/i\u003e\n\n- Here, abstracts pertaining to [Eastern Himalayas](https://github.com/SarthakJShetty/Bias/blob/master/LOGS/LOG_2019-02-27_15_23_Eastern_Himalayas.tar.gz) were scraped and the temporal trend in the occurrence of \"Conservation\" was checked.\n- The frequency is presented alongside the bubble for each year on the chart.\n- \u0026ast; We are still working on how to effectively present the trends and usage variations temporally. This feature is not part of the main package.\n\n## 6.0 Citations:\n\n- **[1]** - Gabriela C. Nunez‐Mir, Basil V. Iannone III. *Automated content analysis: addressing the big literature challenge in ecology and evolution*. Methods in Ecology and Evolution. *June, 2016*.\n- **[2]** - David Blei, Andrew Y. Ng, Michael I. Jordan. *Latent Dirichlet allocation*. The Journal of Machine Learning Research. *March 2003*.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsarthakjshetty%2Fbias","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsarthakjshetty%2Fbias","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsarthakjshetty%2Fbias/lists"}