https://github.com/sameeramin/cnn-crawler

A web crawler that crawls CNN based on the categories provided to it

Last synced: 10 months ago

Host: GitHub
URL: https://github.com/sameeramin/cnn-crawler
Owner: sameeramin
Created: 2023-07-08T11:18:31.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2023-07-09T16:57:12.000Z (almost 3 years ago)
Last Synced: 2025-01-12T19:45:18.713Z (over 1 year ago)
Language: Python
Size: 20.5 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# CNN Crawler
## Introduction
CNN Web Crawler is a web crawler that crawls the CNN news website based on the category provided when running it.

## Requirements
- Python 3.9
- BeautifulSoup4
- requests

## How to run
1. Clone the repository
```bash
git clone https://github.com/sameeramin/cnn-crawler.git
cd cnn-crawler
```
2. Install the requirements
```bash
pip install -r requirements.txt
```
3. Run the crawler file
```bash
python cnn_crawler.py --category world --page-limit 5 --output cnn_news.json
```

## Arguments
- `--category` (required): The category of the news to crawl.
- `--page-limit`: The number of pages to crawl. The default value is `5`.
- `--output`: The output file name. The default value is `cnn_news.json`.
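
The arguments above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the actual contents of `cnn_crawler.py`: the function name `build_parser`, the help strings, and the choice to parse `--page-limit` as an integer are assumptions made for illustration. Only the flag names and default values come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Crawl CNN by category (illustrative sketch)."
    )
    # --category is documented as required.
    parser.add_argument("--category", required=True,
                        help="Category of the news to crawl, e.g. world")
    # Assumption: the page limit is parsed as an integer; documented default is 5.
    parser.add_argument("--page-limit", type=int, default=5,
                        help="Number of pages to crawl")
    # Documented default output file name.
    parser.add_argument("--output", default="cnn_news.json",
                        help="Output file name")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--category", "world"])
print(args.category, args.page_limit, args.output)
```

Note that `argparse` exposes `--page-limit` as the attribute `args.page_limit`, replacing the hyphen with an underscore.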
