Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/hoan02/novel-crawler

Tool cào dữ liệu truyện để phục vụ cho doctruyen.io.vn
https://github.com/hoan02/novel-crawler

crawler python

Last synced: 22 days ago
JSON representation

Tool cào dữ liệu truyện để phục vụ cho doctruyen.io.vn

Host: GitHub
URL: https://github.com/hoan02/novel-crawler
Owner: hoan02
Created: 2024-05-30T18:18:41.000Z (9 months ago)
Default Branch: main
Last Pushed: 2024-05-31T17:10:15.000Z (9 months ago)
Last Synced: 2024-11-19T05:48:36.433Z (3 months ago)
Topics: crawler, python
Language: Python
Homepage: https://doctruyen.io.vn/
Size: 8.79 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Novel Crawler

This project is a web crawler designed to collect novels and their chapters from the website truyenfull.vn. The collected data is then stored in a MongoDB database for use in the development of the online novel reading platform doctruyen.io.vn.

## Prerequisites

- Python 3.7+
- MongoDB
- Required Python packages (see below)

## Installation

1. Clone the repository:

```bash
git clone https://github.com/yourusername/novel-crawler.git
cd novel-crawler
```
2. Create a virtual environment and activate it:

```bash
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```
3. Install the required packages:

```bash
pip install -r requirements.txt
```
4. Create a .env file in the project root directory and add your MongoDB URI:

```bash
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```
5. Create a novel_data.txt file in the project root directory and add the URLs and total chapters of the novels to crawl:

```bash
url_1 total_chapters_1
url_2 total_chapters_2
```
## Usage
Run the crawler:
```bash
python crawl_novel_multi_threaded.py # Or crawl_novel_single_threaded.py
```
## License
This project is licensed under the MIT License. See the LICENSE file for more details.

## Contact
Facebook: [Lê Công Hoan](https://www.facebook.com/hoanit02/)