# Article Crawler

[![PyPI Latest Release](https://img.shields.io/pypi/v/article-crawler.svg)](https://pypi.org/project/article-crawler/)
[![PyPI Downloads](https://img.shields.io/pypi/dm/article-crawler?label=PyPI%20downloads)](https://pypi.org/project/article-crawler/)
[![](https://img.shields.io/github/v/release/ltyzzzxxx/article_crawler?display_name=tag)](https://github.com/ltyzzzxxx/article_crawler/releases/tag/v0.0.1)
[![](https://img.shields.io/github/stars/ltyzzzxxx/article_crawler)](https://github.com/ltyzzzxxx/article_crawler)
[![](https://img.shields.io/github/forks/ltyzzzxxx/article_crawler)](https://github.com/ltyzzzxxx/article_crawler)
[![](https://img.shields.io/github/issues/ltyzzzxxx/article_crawler)](https://github.com/ltyzzzxxx/article_crawler/issues)
[![](https://img.shields.io/badge/license-MIT%20-yellow.svg)](https://github.com/ltyzzzxxx/article_crawler/issues)

[English Doc](./README_EN.md) | [中文文档](./README_CN.md)

## ✨ Introduction

Article Crawler is a package that crawls an article from a given web page and stores it locally in HTML and Markdown formats.

## 🚀 Quick Start

1. Install through `pip`

```bash
pip install article-crawler
```
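
After installation, the tool is invoked as a module. Printing the built-in help is a quick way to confirm the package is available (the flags it reports are listed in the next step):

```bash
python3 -m article_crawler --help
```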
2. Usage

Usage: `python3 -m article_crawler -u [url] -t [type] -o [output_folder] -c [class_] -i [id]`

```
Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -u URL, --url=URL     crawled url (required)
  -t TYPE, --type=TYPE  crawled article type: [csdn] | [juejin] | [zhihu] | [jianshu]
  -o OUTPUT_FOLDER, --output_folder=OUTPUT_FOLDER
                        output html / markdown / pdf folder (required)
  -w WEBSITE_TAG, --website_tag=WEBSITE_TAG
                        position of the article content in HTML (not required if 'type' is specified)
  -c CLASS_, --class=CLASS_
                        position of the article content in HTML (not required if 'type' is specified)
  -i ID, --id=ID        position of the article content in HTML (not required if 'type' is specified)
```
- `type`: a preset for specific websites; currently supported: CSDN, Zhihu, Juejin, and Jianshu.
- `website_tag` / `class_` / `id`:

e.g. `<div id="article_content" class="article_content clearfix">...</div>`

- In this element, `website_tag`, `class_`, and `id` are `div`, `article_content clearfix`, and `article_content`, respectively.

> 1. You don't need to specify `type` when you specify `website_tag` / `class_` / `id`.
> 2. Use the browser's developer console to locate the article element on the page.
> 3. `website_tag` / `class_` / `id` are used to locate the article content in the HTML. You may use just one or two of them rather than all three.
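
For reference, here is a sketch of two typical invocations; the URLs, output directory, and selector values are placeholders, and the selector-based form mirrors the `<div>` example above:

```bash
# Crawl an article from a supported site using the built-in preset
python3 -m article_crawler -u "https://blog.csdn.net/xxx/article/details/123456789" -t csdn -o ./output

# Crawl an arbitrary page by describing where the article body lives in the HTML
python3 -m article_crawler -u "https://example.com/posts/hello-world" -o ./output \
    -w div -c "article_content clearfix" -i article_content
```

When `-t` is given, the selector flags can be omitted, since the tool already knows the article container for that site.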

## Open Source License

This project is licensed under the [MIT License](https://opensource.org/license/mit/).