https://github.com/hoan02/novel-crawler
A tool that scrapes novel data for doctruyen.io.vn.
- Host: GitHub
- URL: https://github.com/hoan02/novel-crawler
- Owner: hoan02
- Created: 2024-05-30T18:18:41.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-05-31T17:10:15.000Z (7 months ago)
- Last Synced: 2024-06-01T18:43:56.951Z (7 months ago)
- Topics: crawler, python
- Language: Python
- Homepage: https://doctruyen.io.vn/
- Size: 8.79 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Novel Crawler
This project is a web crawler designed to collect novels and their chapters from the website truyenfull.vn. The collected data is then stored in a MongoDB database for use in the development of the online novel reading platform doctruyen.io.vn.
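As a rough illustration of the kind of record such a crawler might store, here is a minimal sketch of a chapter document. The `ChapterDoc` type, its field names, and the `_id` scheme are all hypothetical; the actual schema used by this project is not documented here.

```python
from dataclasses import asdict, dataclass


@dataclass
class ChapterDoc:
    """Hypothetical shape of one crawled chapter (not the project's real schema)."""
    novel_url: str
    number: int
    title: str
    content: str


def to_document(chapter: ChapterDoc) -> dict:
    """Convert a chapter into a plain dict suitable for a document store."""
    doc = asdict(chapter)
    # Assumption: a deterministic _id built from the novel URL and chapter
    # number, so that crawling the same chapter twice yields the same key.
    doc["_id"] = f"{chapter.novel_url}#{chapter.number}"
    return doc
```

A plain dict like this could be handed to a MongoDB client, but how the project actually writes to the database is not shown in this README.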
## Prerequisites
- Python 3.7+
- MongoDB
- Required Python packages (see below)
## Installation
1. Clone the repository:
```bash
git clone https://github.com/hoan02/novel-crawler.git
cd novel-crawler
```
2. Create a virtual environment and activate it:
```bash
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```
3. Install the required packages:
```bash
pip install -r requirements.txt
```
4. Create a .env file in the project root directory and add your MongoDB URI:
```bash
# The variable name below is an assumption; use the name the crawler scripts read.
MONGODB_URI=<your-mongodb-connection-string>
```
5. Create a novel_data.txt file in the project root directory and list the novels to crawl, one per line, as the novel URL followed by its total number of chapters:
```text
url_1 total_chapters_1
url_2 total_chapters_2
```
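To make the `novel_data.txt` format from step 5 concrete, the following is a minimal sketch of how such a file could be parsed. The `NovelEntry` type and `parse_novel_data` function are hypothetical names for illustration, and the sketch assumes the two fields are separated by whitespace; the project's own scripts may read the file differently.

```python
from typing import List, NamedTuple


class NovelEntry(NamedTuple):
    """One line of novel_data.txt (hypothetical helper type)."""
    url: str
    total_chapters: int


def parse_novel_data(text: str) -> List[NovelEntry]:
    """Parse 'url total_chapters' lines, skipping blank lines."""
    entries = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        try:
            # Unpacking fails with ValueError unless there are exactly two fields.
            url, count = line.split()
            total = int(count)
        except ValueError:
            raise ValueError(
                f"line {line_no}: expected 'url total_chapters', got {raw!r}"
            ) from None
        entries.append(NovelEntry(url=url, total_chapters=total))
    return entries
```

Rejecting malformed lines up front, with the line number in the message, keeps a typo in the file from surfacing later as a confusing crawl failure.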
## Usage
Run the crawler:
```bash
python crawl_novel_multi_threaded.py # Or crawl_novel_single_threaded.py
```
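The two scripts differ in whether chapters are fetched one at a time or concurrently. As a sketch of the multi-threaded idea only, the example below uses the standard library's `ThreadPoolExecutor`. The `crawl_chapters` and `fake_fetch` names are hypothetical, and `fake_fetch` stands in for the real HTTP request and HTML parsing; this is not the code in `crawl_novel_multi_threaded.py`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def crawl_chapters(
    chapter_numbers: Iterable[int],
    fetch_chapter: Callable[[int], str],
    max_workers: int = 8,
) -> Dict[int, str]:
    """Fetch chapters concurrently and return {chapter_number: content}."""
    numbers = list(chapter_numbers)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, so they line up with `numbers`.
        contents = pool.map(fetch_chapter, numbers)
        return dict(zip(numbers, contents))


def fake_fetch(chapter_number: int) -> str:
    # Stand-in for an HTTP request plus HTML parsing (assumption for the demo).
    return f"content of chapter {chapter_number}"


if __name__ == "__main__":
    result = crawl_chapters(range(1, 6), fake_fetch, max_workers=3)
    print(len(result), "chapters fetched")
```

Threads suit this workload because fetching pages is I/O-bound; setting `max_workers=1` would approximate the single-threaded behaviour.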
## License
This project is licensed under the MIT License. See the LICENSE file for more details.
## Copyright
Copyright © 2024 Hoan Cu Te
## Contact
Facebook: [Lê Công Hoan](https://www.facebook.com/hoanit02/)