https://github.com/lixx21/coursera-data-scarping
End-to-end project for scraping courses on Coursera
data-science-projects data-scraping scrapy streamlit
- Host: GitHub
- URL: https://github.com/lixx21/coursera-data-scarping
- Owner: lixx21
- Created: 2023-08-30T02:44:45.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2023-09-15T01:57:07.000Z (almost 3 years ago)
- Last Synced: 2025-04-03T23:45:18.560Z (about 1 year ago)
- Topics: data-science-projects, data-scraping, scrapy, streamlit
- Language: Python
- Homepage:
- Size: 20.5 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Scraping Coursera Courses
## APP OVERVIEW

## TECH STACK
1. [Scrapy](https://docs.scrapy.org/en/latest/topics/exporters.html)
2. [Streamlit](https://docs.streamlit.io/)
3. [Pandas](https://pandas.pydata.org/docs/)
## RUN CRAWLING
1. Clone this repository: ```git clone https://github.com/lixx21/coursera-data-scarping.git```
2. Go to the spiders directory with ```cd coursera_data_scraping/coursera_data_scraping/spiders```
3. Start the Streamlit app with the following command: ```streamlit run streamlit_app.py```
## NOTES
1. Install all libraries listed in ```requirements.txt``` before running the app.
2. A Scrapy project is started with the command ```scrapy startproject {folder name}```. In this case, I used the folder name coursera_data_scraping, so the command is ```scrapy startproject coursera_data_scraping```
3. Scraping Coursera returns the error ```DEBUG: Forbidden by robots.txt:```, so in ```settings.py``` we set ```ROBOTSTXT_OBEY = False```
4. We also need to define our ```USER_AGENT``` in ```settings.py```
5. Because we want to store (append) data in a CSV file, we first need to create a file named ```output.csv``` in the ```coursera_data_scraping/coursera_data_scraping/spiders``` directory, fill in the header row, and leave the rest empty. **Fill in only the header**
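
The header-only ```output.csv``` from note 5 can also be prepared from Python instead of by hand. The following is a minimal sketch using only the standard library, not the repository's own code: the column names in `FIELDNAMES` and the helpers `ensure_header` and `append_row` are hypothetical and should be replaced with the fields the spider actually yields.

```python
import csv
from pathlib import Path

# Hypothetical column names; replace them with the fields your spider yields.
FIELDNAMES = ["title", "provider", "rating"]


def ensure_header(csv_path, fieldnames=FIELDNAMES):
    """Create the CSV with only a header row if it is missing or empty."""
    path = Path(csv_path)
    if not path.exists() or path.stat().st_size == 0:
        with path.open("w", newline="", encoding="utf-8") as f:
            csv.DictWriter(f, fieldnames=fieldnames).writeheader()


def append_row(csv_path, row, fieldnames=FIELDNAMES):
    """Append one scraped item below the existing header."""
    with Path(csv_path).open("a", newline="", encoding="utf-8") as f:
        csv.DictWriter(f, fieldnames=fieldnames).writerow(row)
```

Calling `ensure_header("output.csv")` from inside the spiders directory before crawling leaves an existing, non-empty file untouched, so later runs keep appending below the same header.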