https://github.com/lixx21/coursera-data-scarping

End-to-end project for scraping courses from Coursera

data-science-projects data-scraping scrapy streamlit

Host: GitHub
URL: https://github.com/lixx21/coursera-data-scarping
Owner: lixx21
Created: 2023-08-30T02:44:45.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2023-09-15T01:57:07.000Z (almost 3 years ago)
Last Synced: 2025-04-03T23:45:18.560Z (about 1 year ago)
Topics: data-science-projects, data-scraping, scrapy, streamlit
Language: Python
Homepage:
Size: 20.5 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Scraping Coursera Courses

## APP OVERVIEW

![image](https://github.com/lixx21/coursera-data-scarping/assets/91602612/bf83e562-4193-460c-b821-6824825952ad)

## TECH STACK

1. [Scrapy](https://docs.scrapy.org/en/latest/topics/exporters.html)
2. [Streamlit](https://docs.streamlit.io/)
3. [Pandas](https://pandas.pydata.org/docs/)

## RUN CRAWLING

1. Clone this repository: ```git clone https://github.com/lixx21/coursera-data-scarping.git```
2. Go to the spiders directory: ```cd coursera_data_scraping/coursera_data_scraping/spiders```
3. Start the Streamlit app with the following command: ```streamlit run streamlit_app.py```

## NOTES

1. Install all libraries listed in ```requirements.txt``` before running the app.
2. Start a Scrapy project with ```scrapy startproject {folder name}```. Here the folder name is coursera_data_scraping, so the command is ```scrapy startproject coursera_data_scraping```
3. Scraping Coursera returns the error ```DEBUG: Forbidden by robots.txt:```, so in ```settings.py``` we set ```ROBOTSTXT_OBEY = False```
4. We also need to define our ```USER_AGENT``` in ```settings.py```
5. Because we want to store (append) data in a CSV file, first create a file named ```output.csv``` in the ```coursera_data_scraping/coursera_data_scraping/spiders``` directory, fill in the header row, and leave the rest empty. **Fill in only the header.**
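
As a rough illustration of notes 3 to 5, the sketch below shows how the two settings might look in ```settings.py``` and one possible way to create the header-only ```output.csv```. The user-agent string, the column names, and the helper ```create_header_only_csv``` are assumptions made for this example and are not taken from the repository; replace them with the values your spider actually uses.

```python
import csv
from pathlib import Path

# --- settings.py (notes 3 and 4) --------------------------------------------
# ROBOTSTXT_OBEY and USER_AGENT are standard Scrapy settings.
# The user-agent value below is a placeholder, not the one used in this repo.
ROBOTSTXT_OBEY = False
USER_AGENT = "my-coursera-scraper (+https://example.com/contact)"

# --- one-off helper, kept outside settings.py (note 5) -----------------------
# Hypothetical column names; use the fields your spider really yields.
COLUMNS = ["title", "partner", "rating", "url"]


def create_header_only_csv(path, columns=COLUMNS):
    """Write a CSV that contains only the header row.

    Returns False and leaves the file untouched if it already exists,
    so rows appended by earlier crawls are not lost.
    """
    path = Path(path)
    if path.exists():
        return False
    with path.open("w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow(columns)
    return True


if __name__ == "__main__":
    # Run this from the spiders directory so the file lands next to the spider.
    create_header_only_csv("output.csv")
```

Skipping the write when the file already exists keeps the helper safe to run more than once, which matters because the crawl appends to the same file.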
