https://github.com/tychozzz/article_crawler
✨ Article Crawler is a package used to crawl articles with Markdown format from a specific webpage and store them locally in HTML / Markdown formats.
- Host: GitHub
- URL: https://github.com/tychozzz/article_crawler
- Owner: tychozzz
- License: mit
- Created: 2023-08-05T11:54:52.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2023-08-12T07:36:46.000Z (over 1 year ago)
- Last Synced: 2024-09-14T18:29:50.033Z (4 months ago)
- Topics: article, crawler, html, markdown, pypi, python
- Language: Python
- Homepage:
- Size: 13.7 KB
- Stars: 27
- Watchers: 1
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Article Crawler
[![PyPI Latest Release](https://img.shields.io/pypi/v/article-crawler.svg)](https://pypi.org/project/article-crawler/)
[![PyPI Downloads](https://img.shields.io/pypi/dm/article-crawler?label=PyPI%20downloads)](https://pypi.org/project/article-crawler/)
[![](https://img.shields.io/github/v/release/ltyzzzxxx/article_crawler?display_name=tag)](https://github.com/ltyzzzxxx/article_crawler/releases/tag/v0.0.1)
[![](https://img.shields.io/github/stars/ltyzzzxxx/article_crawler)](https://github.com/ltyzzzxxx/article_crawler)
[![](https://img.shields.io/github/forks/ltyzzzxxx/article_crawler)](https://github.com/ltyzzzxxx/article_crawler)
[![](https://img.shields.io/github/issues/ltyzzzxxx/article_crawler)](https://github.com/ltyzzzxxx/article_crawler/issues)
[![](https://img.shields.io/badge/license-MIT%20-yellow.svg)](https://github.com/ltyzzzxxx/article_crawler/issues)

[English Doc](./README_EN.md) | [中文文档](./README_CN.md)
## ✨ Introduction
Article Crawler is a Python package that crawls an article from a given web page and saves it locally in HTML and Markdown formats.
## 🚀 Quick Start
1. Install with `pip`
```shell
pip install article-crawler
```
2. Usage
```
Usage: python3 -m article_crawler -u [url] -t [type] -o [output_folder] -c [class_] -i [id]

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -u URL, --url=URL     crawled url (required)
  -t TYPE, --type=TYPE  crawled article type [csdn] | [juejin] | [zhihu] | [jianshu]
  -o OUTPUT_FOLDER, --output_folder=OUTPUT_FOLDER
                        output html / markdown / pdf folder (required)
  -w WEBSITE_TAG, --website_tag=WEBSITE_TAG
                        position of the article content in HTML (not required if 'type' is specified)
  -c CLASS_, --class=CLASS_
                        position of the article content in HTML (not required if 'type' is specified)
  -i ID, --id=ID        position of the article content in HTML (not required if 'type' is specified)
```
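
If you want to drive the command line from a script, the options above can be assembled programmatically. The following is a minimal sketch, not part of the package: `build_command` is a hypothetical helper, the URL and output folder are placeholders, and only the flags listed in the help text are used.

```python
import sys


def build_command(url, output_folder, type_=None,
                  website_tag=None, class_=None, id_=None):
    """Assemble the argv list for `python -m article_crawler`.

    Hypothetical helper for illustration; it only maps arguments onto
    the flags shown in the help text (-u, -o, -t, -w, -c, -i).
    """
    # -u and -o are marked as required in the help text.
    cmd = [sys.executable, "-m", "article_crawler",
           "-u", url, "-o", output_folder]
    # Optional flags are appended only when a value is provided.
    optional = [("-t", type_), ("-w", website_tag),
                ("-c", class_), ("-i", id_)]
    for flag, value in optional:
        if value:
            cmd += [flag, value]
    return cmd


# Placeholder URL and folder; a supported site only needs the type.
cmd = build_command("https://example.com/post", "./output", type_="csdn")
print(cmd)

# To actually run the crawler (requires article-crawler to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with values that contain spaces, such as a multi-word class attribute.
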
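- `type`: the website the article comes from. Currently supported: CSDN, Zhihu, Juejin, and Jianshu.
- `website_tag` / `class_` / `id`: the tag name, `class` attribute, and `id` attribute of the HTML element that contains the article.

  e.g. `<div class="article_content clearfix" id="article_content">`

  - In this element, `website_tag`, `class_`, and `id` are `div`, `article_content clearfix`, and `article_content` respectively.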
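
To make the role of the three locators concrete, here is a minimal sketch of how a tag name, class, and id can single out one element in a page. It is an illustration only: it uses the standard library's `html.parser` rather than whatever parser the package uses internally, and `ElementLocator` and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class ElementLocator(HTMLParser):
    """Collect the text of the first element matching tag / class / id.

    Illustrative only; not the package's implementation.
    """

    def __init__(self, tag=None, class_=None, id_=None):
        super().__init__()
        self.tag, self.class_, self.id_ = tag, class_, id_
        self.matched_tag = None  # tag name of the matched element
        self.depth = 0           # > 0 while inside the matched element
        self.done = False        # True once the matched element is closed
        self.text = []

    def _matches(self, tag, attrs):
        attrs = dict(attrs)
        # Any criterion left as None is simply not checked.
        if self.tag and tag != self.tag:
            return False
        if self.class_:
            wanted = set(self.class_.split())
            if not wanted <= set((attrs.get("class") or "").split()):
                return False
        if self.id_ and attrs.get("id") != self.id_:
            return False
        return True

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            # Count nested elements with the same tag name so that the
            # matching end tag is recognised correctly.
            if tag == self.matched_tag:
                self.depth += 1
        elif self._matches(tag, attrs):
            self.matched_tag = tag
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.matched_tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


html = (
    '<html><body><div class="sidebar">Ads</div>'
    '<div class="article_content clearfix" id="article_content">'
    '<p>Hello, </p><div>world</div></div>'
    '<div>footer</div></body></html>'
)

# Two of the three locators are enough to pick out the article here.
locator = ElementLocator(tag="div", id_="article_content")
locator.feed(html)
print("".join(locator.text))  # Hello, world
```

The sketch also shows why specifying fewer than three locators can be enough: each one narrows the set of candidate elements, and an `id` on its own is often unique within a page.
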
> 1. You don't need to specify `type` when you specify `website_tag / class_ / id`.
> 2. Use your browser's developer tools to find the element that contains the article.
> 3. `website_tag / class_ / id` locate the article within the HTML. You can specify just one or two of them instead of all three.

## Open Source License
MIT License, see https://opensource.org/license/mit/