https://github.com/sameeramin/cnn-crawler
A web crawler that crawls CNN based on the categories provided to it
Last synced: 10 months ago
- Host: GitHub
- URL: https://github.com/sameeramin/cnn-crawler
- Owner: sameeramin
- Created: 2023-07-08T11:18:31.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2023-07-09T16:57:12.000Z (almost 3 years ago)
- Last Synced: 2025-01-12T19:45:18.713Z (over 1 year ago)
- Language: Python
- Size: 20.5 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# CNN Crawler
## Introduction
CNN Crawler is a web crawler that crawls the CNN news website based on the categories provided at run time.
## Requirements
- Python 3.9
- BeautifulSoup4
- requests
## How to run
1. Clone the repository
```bash
git clone https://github.com/sameeramin/cnn-crawler.git
cd cnn-crawler
```
2. Install the requirements
```bash
pip install -r requirements.txt
```
3. Run the crawler file
```bash
python cnn_crawler.py --category world --page-limit 5 --output cnn_news.json
```
## Arguments
- `--category` (required): The category of the news to crawl.
- `--page-limit`: The number of pages to crawl. The default value is `5`.
- `--output`: The output file name. The default value is `cnn_news.json`.
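
The arguments listed above could be declared with Python's standard `argparse` module roughly as follows. This is a minimal sketch, not the repository's actual code: the function name `build_parser` and the help strings are assumptions made for illustration, and only the flag names and default values come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Crawl CNN articles by category (illustrative sketch)."
    )
    # --category is documented as required, so argparse exits with an error if it is missing.
    parser.add_argument("--category", required=True,
                        help="The category of the news to crawl.")
    # type=int converts the command-line string into a number; 5 is the documented default.
    parser.add_argument("--page-limit", type=int, default=5,
                        help="The number of pages to crawl.")
    parser.add_argument("--output", default="cnn_news.json",
                        help="The output file name.")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--category", "world"])
# argparse exposes --page-limit as the attribute page_limit.
print(args.category, args.page_limit, args.output)  # prints: world 5 cnn_news.json
```

With this layout, omitting `--page-limit` and `--output` falls back to the documented defaults, while omitting `--category` stops the program with a usage message.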