{"id":19784273,"url":"https://github.com/tychozzz/article_crawler","last_synced_at":"2025-04-30T22:32:04.281Z","repository":{"id":186317320,"uuid":"674987282","full_name":"tychozzz/article_crawler","owner":"tychozzz","description":"✨ Article Crawler is a package used to crawl articles with Markdown format from a specific webpage and store them locally in HTML / Markdown formats.","archived":false,"fork":false,"pushed_at":"2023-08-12T07:36:46.000Z","size":14,"stargazers_count":30,"open_issues_count":0,"forks_count":4,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-18T12:18:09.636Z","etag":null,"topics":["article","crawler","html","markdown","pypi","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tychozzz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-08-05T11:54:52.000Z","updated_at":"2025-03-06T08:01:27.000Z","dependencies_parsed_at":"2023-12-11T10:42:30.510Z","dependency_job_id":null,"html_url":"https://github.com/tychozzz/article_crawler","commit_stats":null,"previous_names":["ltyzzzxxx/article_crawler","tychozzz/article_crawler"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tychozzz%2Farticle_crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tychozzz%2Farticle_crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tychozzz%2Farticle_crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tychozzz%2Farticle_crawler/m
anifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tychozzz","download_url":"https://codeload.github.com/tychozzz/article_crawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251791730,"owners_count":21644447,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["article","crawler","html","markdown","pypi","python"],"created_at":"2024-11-12T06:11:00.081Z","updated_at":"2025-04-30T22:32:04.057Z","avatar_url":"https://github.com/tychozzz.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Article Crawler\n\n[![PyPI Latest Release](https://img.shields.io/pypi/v/article-crawler.svg)](https://pypi.org/project/article-crawler/)\n[![PyPI Downloads](https://img.shields.io/pypi/dm/article-crawler?label=PyPI%20downloads)](https://pypi.org/project/article-crawler/)\n[![](https://img.shields.io/github/v/release/ltyzzzxxx/article_crawler?display_name=tag)](https://github.com/ltyzzzxxx/article_crawler/releases/tag/v0.0.1)\n[![](https://img.shields.io/github/stars/ltyzzzxxx/article_crawler)](https://github.com/ltyzzzxxx/article_crawler)\n[![](https://img.shields.io/github/forks/ltyzzzxxx/article_crawler)](https://github.com/ltyzzzxxx/article_crawler)\n[![](https://img.shields.io/github/issues/ltyzzzxxx/article_crawler)](https://github.com/ltyzzzxxx/article_crawler/issues)\n[![](https://img.shields.io/badge/license-MIT%20-yellow.svg)](https://github.com/ltyzzzxxx/article_crawler/issues)\n\n[English Doc](./README_EN.md) | [中文文档](./README_CN.md)\n\n## ✨ 
Introduction\n\nArticle Crawler is a package used to crawl articles with Markdown format from a specific webpage and store them locally in HTML / Markdown formats.\n\n## 🚀 Quick Start\n\n1. Install with `pip`\n\n    ```shell\n    pip install article-crawler\n    ```\n2. Usage\n\n    Usage: `python3 -m article_crawler -u [url] -t [type] -o [output_folder] -w [website_tag] -c [class_] -i [id]`\n\n    ```\n    Options:\n      --version             show program's version number and exit\n      -h, --help            show this help message and exit\n      -u URL, --url=URL     crawled url (required)\n      -t TYPE, --type=TYPE  crawled article type [csdn] | [juejin] | [zhihu] | [jianshu]\n      -o OUTPUT_FOLDER, --output_folder=OUTPUT_FOLDER\n                            output html / markdown / pdf folder (required)\n      -w WEBSITE_TAG, --website_tag=WEBSITE_TAG\n                            position of the article content in HTML (not required if 'type' is specified)\n      -c CLASS_, --class=CLASS_\n                            position of the article content in HTML (not required if 'type' is specified)\n      -i ID, --id=ID        position of the article content in HTML (not required if 'type' is specified)\n    ```\n    - type: The website the article comes from. CSDN, Zhihu, Juejin, and Jianshu are currently supported.\n    - website_tag / class_ / id:\n   \n      e.g. `\u003cdiv id=\"article_content\" class=\"article_content clearfix\"\u003e\u003c/div\u003e`\n   \n      - In this element, `website_tag`, `class_`, and `id` are `div`, `article_content clearfix`, and `article_content` respectively.\n      \n      \u003e 1. You don't need to specify `type` when you specify `website_tag / class_ / id`.\n      \u003e 2. Use your browser's developer tools to find the element that contains the article.\n      \u003e 3. `website_tag / class_ / id` are used to locate the article content in the HTML. 
You can specify just one or two of them rather than all three.\n\n## Open Source License\n\nMIT License; see https://opensource.org/license/mit/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftychozzz%2Farticle_crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftychozzz%2Farticle_crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftychozzz%2Farticle_crawler/lists"}