https://github.com/prak112/data4wildlife
Instagram scraping algorithm for collecting json and images to identify wildlife trade of Slow Loris
- Host: GitHub
- URL: https://github.com/prak112/data4wildlife
- Owner: prak112
- License: mit
- Created: 2022-01-30T10:29:43.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-02-04T19:37:20.000Z (about 3 years ago)
- Last Synced: 2025-01-15T01:42:01.438Z (4 months ago)
- Topics: data-collection, dataset, instagram-scraper, web-scraping, wildlife-conservation
- Language: HTML
- Homepage:
- Size: 4.45 MB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Data 4 Wildlife Hackathon 🛠️
**Hackathon (29-30 Jan 2022) focused on developing a digital solution to prevent illegal wildlife trade (IWT) on online social platforms.**

* **Team** - Sean P. Rogers, Gabriela Youngken 👩🎓 👨🎓
* **Mentor** - Alastair Jamieson 👨🏫 (also API-keys holder 👛)

### **Challenge**
* To build a **benchmark dataset** of possible instances of IWT and related information from online social platforms that can also be searched and analyzed 🔚
* According to the challenge guidelines: [Challenge1_Guidelines](https://github.com/prak112/data4wildlife/files/8005154/Challenge.1.Guidance.Document.pdf)
* _A benchmark dataset is a public dataset which is designed and collected for studying real-world data science/research problems._
* _The benchmark dataset should be social media platform agnostic, as IWT happens across multiple platforms such as Instagram and YouTube._

## Our Task
* **Collect Instagram posts with images related to _Slow Loris_ hashtags (slowloris, slowlorisforsale) to build a benchmark dataset** 🏛️
* **Task Duration** - 26 hours 🏃⏲️
## Our Approach 🏗️
- Manually identify _Slow Loris_ hashtags 🐵 for example data
- Call the Instagram API (RapidAPI, instagram85) for the hashtag-related feed
- Collect JSON (first page only), extract images, and label them by user ID
- Save images in folders labelled by language (_see **Future Prospects**_)
- Iterate API calls and collect more images
- Import the JSON into a webpage, [index.html](updated_(code-webpage)/index.html), for human validation of images
- Manually validate images and export a CSV file with information from the comments (a minimal sketch of this loop follows below)
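
The collection loop above might look roughly like the following Python sketch. It is a minimal illustration, not the hackathon code: the RapidAPI host, endpoint path, and response fields (`posts`, `image_url`, `user_id`) are assumptions, since the actual instagram85 schema is not documented in this README.

```python
import json
import os

import requests

# Hypothetical RapidAPI endpoint and response schema; the real
# instagram85 API may use different paths and field names.
API_HOST = "instagram85.p.rapidapi.com"
HEADERS = {
    "X-RapidAPI-Key": os.environ["RAPIDAPI_KEY"],  # key held by the mentor
    "X-RapidAPI-Host": API_HOST,
}
HASHTAGS = ["slowloris", "slowlorisforsale"]

def fetch_first_page(hashtag):
    """Fetch the first page of posts for a hashtag (first page only)."""
    url = f"https://{API_HOST}/hashtag/{hashtag}/feed"
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    return response.json()

def save_post_images(feed, out_dir="images/en"):
    """Download each post image, naming the file by user ID."""
    os.makedirs(out_dir, exist_ok=True)
    for post in feed.get("posts", []):         # assumed field name
        image_url = post.get("image_url")      # assumed field name
        user_id = post.get("user_id", "unknown")
        if not image_url:
            continue
        image = requests.get(image_url, timeout=30)
        with open(os.path.join(out_dir, f"{user_id}.jpg"), "wb") as f:
            f.write(image.content)

for tag in HASHTAGS:
    feed = fetch_first_page(tag)
    with open(f"{tag}.json", "w") as f:
        json.dump(feed, f, indent=2)           # raw JSON for the webpage step
    save_post_images(feed)
```

Keeping the raw JSON per hashtag means the webpage validation step stays decoupled from the image downloads.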
### Future Prospects 👀
- Call the API recursively with `next_page_id` to collect all pages (see the sketch after this list)
- Depending on image volume, the project can evolve into image recognition for automation
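
Recursive paging could be bolted onto the same sketch by following a cursor, assuming the feed response carries a `next_page_id` field as suggested above (the endpoint and parameter names remain hypothetical):

```python
def fetch_all_pages(hashtag):
    """Follow the next_page_id cursor until the feed is exhausted."""
    # API_HOST and HEADERS as in the earlier sketch
    url = f"https://{API_HOST}/hashtag/{hashtag}/feed"
    posts, page_id = [], None
    while True:
        params = {"next_page_id": page_id} if page_id else {}
        response = requests.get(url, headers=HEADERS, params=params, timeout=30)
        response.raise_for_status()
        feed = response.json()
        posts.extend(feed.get("posts", []))   # assumed field name
        page_id = feed.get("next_page_id")    # assumed cursor field
        if not page_id:                       # no cursor left: last page reached
            return posts
```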
## _Key Takeaways_

* _Focus on the bigger picture_ 🌄
* _Build one-block-at-a-time_ 🧱
* _Have consistent breaks_ 😌