Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/dev-chenxing/jjwxc-crawler

A simple tool, built with Python and Scrapy, that scrapes the non-V chapters of any novel on jjwxc.net (Jinjiang) by book ID and saves them as editable Word (.docx) documents.

chinese cli crawler docx download jjwxc open-source python scraping scrapy terminal word

Last synced: 3 months ago


Host: GitHub
URL: https://github.com/dev-chenxing/jjwxc-crawler
Owner: dev-chenxing
Created: 2024-03-11T01:34:06.000Z (11 months ago)
Default Branch: main
Last Pushed: 2024-11-08T10:23:15.000Z (3 months ago)
Last Synced: 2024-11-08T11:25:35.890Z (3 months ago)
Topics: chinese, cli, crawler, docx, download, jjwxc, open-source, python, scraping, scrapy, terminal, word
Language: Python
Homepage:
Size: 7.05 MB
Stars: 12
Watchers: 1
Forks: 4
Open Issues: 1
Metadata Files:
- Readme: README.md


README

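Reborn: Crawling My Way Through Lüjiang (《重生之我在绿江爪爪巴》)

One-click download of the non-V chapters of novels on Jinjiang Literature City (https://www.jjwxc.net).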


### Features

- Command-line interface
- Output in DOCX and TXT formats
- Customizable output path
- ...................

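If you have suggestions or find a bug, please open an issue.

The command-line interface is built with the terminal UI library [Rich](https://github.com/Textualize/rich).

Sample interface: [screenshot]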

# Installation

### Download the files

Click Code - Download ZIP, then unzip the archive to get the project folder. Renaming it to `jjwxc-crawler` is recommended.

### Environment setup

- Python 3.9.15
- Windows

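After installing Python, the first step is to open a command line in the project directory and run the following commands to create and activate a virtual environment: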

```powershell
python -m venv venv # create a Python virtual environment named venv
venv\Scripts\activate # activate the virtual environment venv on Windows
```

On Linux, activate the virtual environment with:

```bash
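# Usually unnecessary: `source` only needs read permission on the script
chmod +x venv/bin/activate
# Activate the virtual environment venv in the current shell
source venv/bin/activate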
```

The command prompt should now be prefixed with `(venv)`, which indicates that the virtual environment `venv` is active.
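Besides looking for the `(venv)` prefix, activation can also be checked from Python itself. The following is a minimal sketch, not part of the project; the helper name `in_virtualenv` is hypothetical. It relies on the fact that inside an environment created with `python -m venv`, `sys.prefix` differs from `sys.base_prefix`.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtual environment active:", in_virtualenv())
```

Running this with the interpreter that the command line resolves `python` to shows whether later `pip install` commands will target the virtual environment.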

The second step is to install Scrapy and the other dependencies inside the virtual environment:

```powershell
pip install -r requirements.txt
```

### Running the program

```powershell
# Change into the program directory
cd jjcrawler

# Run the crawl command, where ID is the book ID
scrapy crawl novel -a id=ID

# For example, to download the test novel with book ID 2, run:
scrapy crawl novel -a id=2
```
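To download several novels in a row, the same command can be driven from a short Python script. This is a hedged sketch rather than a feature of the project: the helper names `build_crawl_command` and `download_books` are hypothetical, and it assumes that `scrapy` is on the PATH (for example, because the virtual environment is active) and that the script is started from the directory containing `jjcrawler`.

```python
import subprocess


def build_crawl_command(book_id: int) -> list[str]:
    """Build the argument list for `scrapy crawl novel -a id=<book_id>`."""
    return ["scrapy", "crawl", "novel", "-a", f"id={book_id}"]


def download_books(book_ids, project_dir="jjcrawler"):
    """Run the crawl command once per book ID, one after another."""
    for book_id in book_ids:
        # cwd mirrors the `cd jjcrawler` step; check=True stops the loop
        # with an exception if a crawl exits with a non-zero status.
        subprocess.run(build_crawl_command(book_id), cwd=project_dir, check=True)
```

Calling `download_books([2])` is then equivalent to the single-book example, and a longer list runs the crawls sequentially.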

Downloaded chapters are saved to the novels folder in the root directory.
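To see what has been downloaded so far, the output folder can be scanned with the standard library. The sketch below is illustrative only: the helper name `list_downloads` is hypothetical, and because the layout inside `novels` is not described here, it searches recursively for `.docx` and `.txt` files.

```python
from pathlib import Path


def list_downloads(novels_dir="novels"):
    """Return the sorted .docx and .txt files found under novels_dir."""
    root = Path(novels_dir)
    if not root.is_dir():
        # Nothing has been downloaded yet, or the path is wrong.
        return []
    wanted = {".docx", ".txt"}
    # rglob("*") walks subfolders too, since the exact layout is an assumption.
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() in wanted
    )
```

For example, `for path in list_downloads(): print(path)` prints one line per saved file.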

The default output format is .docx. To switch to .txt output, edit the following parameter in `\jjcrawler\jjcrawler\spiders\config.py`:

```python
# docx | txt
format = "txt"
```
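The format can also be switched without opening an editor. The following sketch rewrites the `format = "..."` line in the text of the config file; it assumes the file contains a single assignment of that shape, as in the snippet above, and the helper name `set_output_format` is hypothetical.

```python
import re

ALLOWED_FORMATS = ("docx", "txt")

# Matches a line such as: format = "txt"   (single or double quotes)
_FORMAT_LINE = re.compile(
    r'^([ \t]*format[ \t]*=[ \t]*)(["\'])(?:docx|txt)\2', flags=re.MULTILINE
)


def set_output_format(config_text: str, new_format: str) -> str:
    """Return config_text with the format assignment set to new_format."""
    if new_format not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}")
    updated, count = _FORMAT_LINE.subn(
        # Keep the original indentation and quote style, swap only the value.
        lambda m: f"{m.group(1)}{m.group(2)}{new_format}{m.group(2)}",
        config_text,
    )
    if count == 0:
        raise ValueError("no `format = ...` assignment found")
    return updated
```

A caller would read the config file with `Path(...).read_text()`, pass the text through `set_output_format`, and write the result back to the same file.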

To download a whole page of novels:

```bash
scrapy crawl novellist -a xx=3 -a sd=4 -a bq=39,45,124,313,314
```
