https://github.com/sadmanca/imdb-scraper
Scrapes IMDb's movie database and outputs the data to CSV files.
- Host: GitHub
- URL: https://github.com/sadmanca/imdb-scraper
- Owner: sadmanca
- License: mit
- Created: 2021-02-02T02:14:50.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2021-02-13T01:35:32.000Z (over 5 years ago)
- Last Synced: 2025-03-27T13:43:37.057Z (about 1 year ago)
- Topics: beautifulsoup, data-scraping, imdb, numpy, pandas, python, requests
- Language: Python
- Homepage:
- Size: 28.3 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# IMDb Top 1000 Movies
This project scrapes the [IMDb "Top 1000" movies (sorted by popularity)](https://www.imdb.com/search/title/?groups=top_1000) webpages and outputs the data to a CSV file.
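As a rough illustration of the parse step in this scrape-then-export flow, the sketch below runs Beautiful Soup over a small hand-written HTML fragment and collects one dict per movie. The markup, the class names (`movie-item`, `title`, `year`, `rating`), and the `parse_movies` helper are all hypothetical stand-ins: they are not IMDb's real page structure and not necessarily how `imdb_scraper.py` works. A real run would first fetch each results page with Requests.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one results page; IMDb's real HTML differs.
SAMPLE_HTML = """
<div class="movie-item">
  <span class="title">Groundhog Day</span>
  <span class="year">(1993)</span>
  <span class="rating">8.0</span>
</div>
<div class="movie-item">
  <span class="title">Soul</span>
  <span class="year">(2020)</span>
  <span class="rating">8.1</span>
</div>
"""


def parse_movies(html):
    """Return one dict per movie container found in the HTML (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for item in soup.find_all("div", class_="movie-item"):
        movies.append({
            "movie": item.find("span", class_="title").get_text(strip=True),
            "year": item.find("span", class_="year").get_text(strip=True),
            "imdb": item.find("span", class_="rating").get_text(strip=True),
        })
    return movies


rows = parse_movies(SAMPLE_HTML)
```

The values come back as raw strings (for example `"(1993)"`), so a separate cleaning pass would be needed before writing numeric columns to CSV.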
## Sample Output
| | movie | year | runtime | imdb | metascore | votes | grossMillions |
| -- | -------------------------- | ---- | ------- | ---- | --------- | ------- | ------------- |
| 0 | Dara of Jasenovac | 2020 | 130 | 8.7 | | 51892 | |
| 1 | Soul | 2020 | 100 | 8.1 | 83.0 | 172275 | |
| 2 | Groundhog Day | 1993 | 101 | 8.0 | 72.0 | 580305 | 70.91 |
| 3 | The Sound of Music | 1965 | 172 | 8.0 | 63.0 | 206581 | 163.21 |
| 4 | Avengers: Endgame | 2019 | 181 | 8.4 | 78.0 | 815967 | 858.37 |
| 5 | Deadpool 2 | 2018 | 119 | 7.7 | 66.0 | 480793 | 324.59 |
| ...| ... | ... | ... | ... | ... | ... | ... |
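
The blank `metascore` and `grossMillions` cells, together with metascores printed as `83.0`, are consistent with missing values stored as NaN in float columns. The sketch below shows one way such typed columns could be produced with Pandas. The raw string formats (`"(1993)"`, `"101 min"`, `"580,305"`, `"$70.91M"`) are assumptions about what the page text might look like, not something taken from `imdb_scraper.py`.

```python
import pandas as pd

# Raw strings as they might come off the page; these formats are assumptions.
raw = pd.DataFrame({
    "movie": ["Dara of Jasenovac", "Soul", "Groundhog Day"],
    "year": ["(2020)", "(2020)", "(1993)"],
    "runtime": ["130 min", "100 min", "101 min"],
    "imdb": ["8.7", "8.1", "8.0"],
    "metascore": ["", "83", "72"],
    "votes": ["51,892", "172,275", "580,305"],
    "grossMillions": ["", "", "$70.91M"],
})

clean = raw.copy()
# Keep only the four-digit year and the leading minutes count.
clean["year"] = clean["year"].str.extract(r"(\d{4})", expand=False).astype(int)
clean["runtime"] = clean["runtime"].str.extract(r"(\d+)", expand=False).astype(int)
clean["imdb"] = clean["imdb"].astype(float)
# Drop thousands separators before converting vote counts.
clean["votes"] = clean["votes"].str.replace(",", "", regex=False).astype(int)
# errors="coerce" turns blank strings into NaN, which makes the column float.
clean["metascore"] = pd.to_numeric(clean["metascore"], errors="coerce")
# Strip the currency symbol and the "M" suffix, then convert the same way.
clean["grossMillions"] = pd.to_numeric(
    clean["grossMillions"].str.replace(r"[$M]", "", regex=True), errors="coerce"
)
```

Because NaN is a floating-point value, any column that contains a missing entry is stored as float, which is why whole-number metascores show a trailing `.0`.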
## Downloads
* [imdb_scraper.py](imdb_scraper.py) - main program that scrapes the IMDb webpages
* [movies.csv](movies.csv) - output CSV file produced by the scraper
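
A minimal sketch of loading the output for analysis, assuming `movies.csv` uses the column layout shown in the Sample Output table and that its unnamed first column is the row index. To stay self-contained, the example reads a few rows copied from that table out of an in-memory string; with the real file, `"movies.csv"` would be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Rows copied from the Sample Output table; the CSV layout itself is an assumption.
csv_text = """,movie,year,runtime,imdb,metascore,votes,grossMillions
0,Dara of Jasenovac,2020,130,8.7,,51892,
1,Soul,2020,100,8.1,83.0,172275,
2,Groundhog Day,1993,101,8.0,72.0,580305,70.91
3,The Sound of Music,1965,172,8.0,63.0,206581,163.21
"""

# index_col=0 treats the unnamed first column as the row index.
movies = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Example query: titles with a known box-office gross, highest first.
with_gross = movies.dropna(subset=["grossMillions"]).sort_values(
    "grossMillions", ascending=False
)
top_titles = with_gross["movie"].tolist()
```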
## Built With
* [Requests](https://requests.readthedocs.io) - library for making HTTP requests
* [Beautiful Soup 4](https://pypi.org/project/beautifulsoup4/) - library for parsing HTML and extracting data from web pages
* [NumPy](https://numpy.org) - high-performance library for multi-dimensional arrays
* [Pandas](https://pandas.pydata.org) - provides tools for manipulating tables