https://github.com/sarthakjshetty/red
Developing a database of species threats and stresses from the IUCN Red List. Published in Conservation Letters 2021.
- Host: GitHub
- URL: https://github.com/sarthakjshetty/red
- Owner: SarthakJShetty
- Created: 2020-02-17T06:54:16.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2024-03-05T19:21:13.000Z (11 months ago)
- Last Synced: 2024-11-10T04:15:07.968Z (2 months ago)
- Topics: beautifulsoup, bots, iucn-red-list, python3, scrapper, selenium
- Language: Python
- Homepage: https://github.com/SarthakJShetty/Red
- Size: 3.21 MB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
# Red
:warning: **Code is buggy** :warning:
## 1.0 Introduction:
+ The aim of the project is to analyze correlations between the threat status of species tracked on the [IUCN Red List](https://www.iucnredlist.org/ "IUCN Red List") and the threats and stresses they face.
+ This repository is dedicated to scraping the necessary data fields from the [IUCN Red List](https://www.iucnredlist.org/ "IUCN Red List") to test for such correlations.
+ This project is a collaboration with [Uttara Mendiratta](https://www.researchgate.net/profile/Uttara_Mendiratta "Uttara") and [Anand M Osuri](https://www.ncf-india.org/author/675623/anand-osuri-2 "Anand") from the [Nature Conservation Foundation, India](http://ncf-india.org/ "NCF-India").
## 2.0 Implementation
1. The [```birds.csv```](https://github.com/SarthakJShetty/Red/tree/master/data/birds.csv) and [```mammals.csv```](https://github.com/SarthakJShetty/Red/tree/master/data/mammals.csv) files list the species for which data has to be scraped.
2. The [```start.sh```](https://github.com/SarthakJShetty/Red/blob/master/start.sh) script has to be made executable before the first run of the code.
user@computer:~/Red$ chmod +x start.sh
3. The pipeline is triggered using the [```start.sh```](https://github.com/SarthakJShetty/Red/blob/master/start.sh) script, which in turn runs the [```scraper.py```](https://github.com/SarthakJShetty/Red/tree/master/scraper.py) code.
user@computer:~/Red$ ./start.sh
4. The scraped data is stored on disk in an ```X_WORKING.csv``` file, a copy of the original ```.csv```, so that the originals are never modified.
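
Step 4 can be sketched as follows. This is a minimal illustration, not the repository's code: the helper name `make_working_copy`, the exact placement of the `_WORKING` suffix, and the choice to reuse an existing working file are all assumptions based on the `X_WORKING.csv` naming above.

```python
import shutil
from pathlib import Path


def make_working_copy(original: Path) -> Path:
    """Copy e.g. birds.csv to birds_WORKING.csv and return the new path.

    Hypothetical helper: the name and suffix handling are assumptions.
    """
    working = original.with_name(f"{original.stem}_WORKING{original.suffix}")
    # Assumption: an existing working file holds earlier progress, so keep it.
    if not working.exists():
        shutil.copyfile(original, working)
    return working
```

All later writes would then target the returned path, leaving the original ```.csv``` untouched.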
## 3.0 Model Overview:
+ The model is made of two components: 1. [```interface.py```](https://github.com/SarthakJShetty/Red/tree/master/interface.py) and 2. [```scraper.py```](https://github.com/SarthakJShetty/Red/tree/master/scraper.py).
![alt text](assets/RedPipeline.png "Scraping Pipeline")

Figure 2.1 Model to scrape data from IUCN Red List

### 3.1 Interface
1. Disk write/read operations are handled by the [```interface.py```](https://github.com/SarthakJShetty/Red/tree/master/interface.py) code.
2. The [```pandas```](https://pandas.pydata.org/) dataframe is saved to disk by the [```interface.py```](https://github.com/SarthakJShetty/Red/tree/master/interface.py) code after each run.
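
The read/write role described above might look like the sketch below. The function names `load_species` and `save_species` are hypothetical and are not taken from ```interface.py```; only `pandas.read_csv` and `DataFrame.to_csv` are real library calls.

```python
from pathlib import Path

import pandas as pd


def load_species(path: Path) -> pd.DataFrame:
    """Read the working CSV into a dataframe (hypothetical helper)."""
    return pd.read_csv(path)


def save_species(frame: pd.DataFrame, path: Path) -> None:
    """Write the dataframe back to disk without the pandas index column."""
    frame.to_csv(path, index=False)
```

Passing `index=False` keeps the saved file's columns identical to those that were read, so repeated load/save cycles do not add an extra index column.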
### 3.2 Scraper
1. The [```scraper.py```](https://github.com/SarthakJShetty/Red/tree/master/scraper.py) interacts with the webpage using the [Selenium](https://www.selenium.dev/) browser-automation framework.
2. The ```HTML``` ```tags``` contained in the ```page_source``` gathered by the [```Selenium```](https://www.selenium.dev/) middleware code are made searchable using [```BeautifulSoup```](https://www.crummy.com/software/BeautifulSoup/).
3. The [```scraper.py```](https://github.com/SarthakJShetty/Red/tree/master/scraper.py) pipeline collects the prescribed ```HTML``` tags from the website queried and updates a [```pandas```](https://pandas.pydata.org/) dataframe with the information.
4. The ```speciesCounter()``` function of the [```scraper.py```](https://github.com/SarthakJShetty/Red/tree/master/scraper.py) script returns the ```sno``` of the last species that is missing a ```stable```, ```unknown``` or ```decline``` population trend tag, which every scraped species must have.
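
The check described in item 4 could be sketched as below. This is an illustration under stated assumptions, not the actual ```speciesCounter()```: the function name `species_counter`, the column name `population_trend`, and the `None` return for a fully scraped file are hypothetical; only the ```sno``` column and the three trend values come from the description above.

```python
from typing import Optional

import pandas as pd

# Trend values named in the README.
TREND_TAGS = {"stable", "unknown", "decline"}


def species_counter(
    frame: pd.DataFrame, trend_column: str = "population_trend"
) -> Optional[int]:
    """Return the sno of the last row lacking a recognised trend tag, or None."""
    # Normalise to lowercase text; empty cells become strings outside TREND_TAGS.
    trends = frame[trend_column].astype(str).str.strip().str.lower()
    missing = frame.loc[~trends.isin(TREND_TAGS), "sno"]
    return None if missing.empty else int(missing.iloc[-1])
```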
## 4.0 Known Issues:
1. While writing elements to the [```pandas```](https://pandas.pydata.org/) dataframe, an element may be shifted right by one or more columns. This error may lead to a [```pandas```](https://pandas.pydata.org/) memory warning, since entities of multiple datatypes then occupy the same column.
2. Some species are not indexed by the [IUCN Red List](https://www.iucnredlist.org/ "IUCN Red List"). This may cause the [```start.sh```](https://github.com/SarthakJShetty/Red/blob/master/start.sh) script to loop while trying to collect the species ```URL``` from the search page.
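
One possible guard against the looping in issue 2 is to cap the number of lookup attempts per species. The sketch below is a suggestion, not something the repository currently does: `find_species_url`, `lookup` and `max_attempts` are hypothetical names, and `lookup` stands in for whatever code queries the search page.

```python
from typing import Callable, Optional


def find_species_url(
    species: str,
    lookup: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Try `lookup` up to `max_attempts` times; return None if no URL is found."""
    for _ in range(max_attempts):
        url = lookup(species)
        if url is not None:
            return url
    # Give up so the caller can skip an unindexed species instead of looping.
    return None
```

A caller could then record the species as unindexed whenever `None` comes back and move on to the next row.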
## Citation:
If you decide to use our client, scraper or cleaner for your project, or as a means to interface with the IUCN database, please cite our [2021 Conservation Letters](https://conbio.onlinelibrary.wiley.com/doi/full/10.1111/conl.12815) paper!
```
@article{mendiratta2021mammal,
title={Mammal and bird species ranges overlap with armed conflicts and associated conservation threats},
author={Mendiratta, Uttara and Osuri, Anand M and Shetty, Sarthak J and Harihar, Abishek},
journal={Conservation Letters},
volume={14},
number={5},
pages={e12815},
year={2021},
publisher={Wiley Online Library}
}
```