{"id":37545219,"url":"https://github.com/kfstorm/carnivore","last_synced_at":"2026-01-16T08:50:18.013Z","repository":{"id":260488336,"uuid":"881435689","full_name":"kfstorm/carnivore","owner":"kfstorm","description":"Web page archive tool","archived":false,"fork":false,"pushed_at":"2025-09-18T01:09:58.000Z","size":168,"stargazers_count":27,"open_issues_count":5,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-09-18T02:31:28.319Z","etag":null,"topics":["cli","clipper","markdown","telegram"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kfstorm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-10-31T15:10:14.000Z","updated_at":"2025-09-18T01:10:03.000Z","dependencies_parsed_at":"2024-12-24T11:32:09.411Z","dependency_job_id":"82026c6d-5201-4e4a-bfdf-261e72c9cbad","html_url":"https://github.com/kfstorm/carnivore","commit_stats":null,"previous_names":["kfstorm/markclipper","kfstorm/carnivore"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/kfstorm/carnivore","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kfstorm%2Fcarnivore","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kfstorm%2Fcarnivore/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kfstorm%2Fcarnivore/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/kfstorm%2Fcarnivore/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kfstorm","download_url":"https://codeload.github.com/kfstorm/carnivore/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kfstorm%2Fcarnivore/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28478048,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T06:30:42.265Z","status":"ssl_error","status_checked_at":"2026-01-16T06:30:16.248Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","clipper","markdown","telegram"],"created_at":"2026-01-16T08:50:17.913Z","updated_at":"2026-01-16T08:50:17.982Z","avatar_url":"https://github.com/kfstorm.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Carnivore\n\n**NOTE: This project is still in early development. Contributions to this project are greatly welcome.**\n\nCarnivore is a simple tool that listens to your web page article archiving needs, removes clutter in the web pages, converts to various file formats, and does whatever you like to deal with converted files. You can combine this tool with your favorite document reader to read, comment, and modify articles.\n\n**Owning your data is important. 
Saving your data with open formats is also important.**\n\n## Features\n\nMain process:\n\n1. Trigger web page archiving by various methods.\n    - Paste a URL into the interactive CLI.\n    - Send a URL to a Telegram bot or to a Telegram channel that includes the bot.\n    - (More triggering methods could be added as needed.)\n2. Archive the web page in various formats.\n    - A single HTML file with all CSS/JavaScript/image/... resources included. Looks exactly like the original web page.\n    - A polished version of the above HTML file that removes clutter and only keeps the article content.\n    - A Markdown version of the polished web page.\n    - A PDF document of the original web page.\n    - (More formats, such as a whole-page image, could be added as needed.)\n3. Process the generated files the way you like.\n    - Upload files to a GitHub repo.\n    - Call a custom post-processing script that you write yourself.\n    - (More post-processing methods could be added as needed.)\n\nOther features:\n\n- Bypass bot detection by using services provided by Zenrows or OxyLabs.\n- Bypass paywalls with the help of the Chrome extension Bypass Paywalls Clean.\n\nSupported output formats:\n\n- `markdown`: The article content in Markdown format.\n- `html`: The article content in HTML format.\n- `full_html`: The full web page in HTML format.\n- `pdf`: The full web page in PDF format.\n\nOutput formats can be customized by setting the `CARNIVORE_OUTPUT_FORMATS` environment variable to a comma-separated list, e.g. `markdown,html,full_html`. Default: `markdown`.\n\n## Usage\n\nThere are multiple ways to use Carnivore. Here are some examples:\n\n### Run Carnivore as an interactive CLI tool\n\n1. Start Carnivore.\n\n    ```sh\n    git clone https://github.com/kfstorm/carnivore.git\n    cd carnivore\n    docker run --rm -it -v ./data:/app/data $(docker build . --quiet)\n    ```\n\n2. Paste a URL into the interactive CLI.
Carnivore will process the URL and save the web page in Markdown format in the `data` directory.\n\n### Run Carnivore as a Telegram bot\n\n1. Start Carnivore.\n\n    ```sh\n    git clone https://github.com/kfstorm/carnivore.git\n    cd carnivore\n    args=(\n        -e CARNIVORE_APPLICATION=telegram-bot\n        -e CARNIVORE_TELEGRAM_TOKEN=...\n        -e CARNIVORE_TELEGRAM_CHANNEL_ID=... # optional. If you want to restrict the bot to a specific channel.\n    )\n    docker run --rm -it \"${args[@]}\" -v ./data:/app/data $(docker build . --quiet)\n    ```\n\n2. Send a URL to the Telegram bot or to a channel that includes the bot. The bot will process the URL and save the web page in Markdown format in the `data` directory.\n\n## Post-processing Customization\n\nYou can customize post-processing in one of two ways:\n\n1. Choose a pre-defined post-processing command.\n2. Write your own post-processing command and mount it into the container.\n\nTo configure the post-processing command, set the `CARNIVORE_POST_PROCESS_COMMAND` environment variable. The command should be a shell command.\n\nFor example, to use the pre-defined post-processing command that uploads the generated files to a GitHub repository:\n\n```bash\nargs=(\n    -e CARNIVORE_POST_PROCESS_COMMAND=post-process/upload_to_github.sh\n    -e CARNIVORE_GITHUB_REPO=username/repo_name\n    -e CARNIVORE_GITHUB_BRANCH=master # optional.\n    -e CARNIVORE_GITHUB_REPO_DIR=path/in/repo\n    -e CARNIVORE_GITHUB_TOKEN=...\n    -e CARNIVORE_OUTPUT_FORMATS=\"markdown,html,full_html,pdf\" # optional. upload multiple formats of the web page.\n    -e CARNIVORE_MARKDOWN_FRONTMATTER_KEY_MAPPING=\"url:url,title:title\" # optional. you may want to add frontmatter at the beginning of the Markdown file.\n    -e CARNIVORE_MARKDOWN_FRONTMATTER_ADDITIONAL_ARGS=\"--timestamp-key date-created\" # optional. you may want to add the timestamp to the frontmatter.\n    -e TZ=Asia/Shanghai # optional.
you may want to customize the timezone.\n)\ndocker run --rm -it \"${args[@]}\" $(docker build . --quiet)\n```\n\n## Arguments\n\nCommon arguments:\n\n- `CARNIVORE_APPLICATION`: Optional. The application to run. Default: `interactive-cli`.\n- `CARNIVORE_OUTPUT_DIR`: Optional. The directory in which to save the generated files. Default: `data`.\n- `CARNIVORE_OUTPUT_FORMATS`: Optional. A comma-separated list of the output formats to generate. Default: `markdown`.\n- `CARNIVORE_POST_PROCESS_COMMAND`: Optional. The post-processing command to run. Default: `post-process/update_files.sh`.\n- `CARNIVORE_MARKDOWN_FRONTMATTER_KEY_MAPPING`: Optional. The key mapping for the frontmatter in the Markdown file. The format is `metadata_key1:frontmatter_key1,metadata_key2:frontmatter_key2`, e.g. `url:url,title:title`.\n- `CARNIVORE_MARKDOWN_FRONTMATTER_ADDITIONAL_ARGS`: Optional. Additional arguments for the frontmatter in the Markdown file, e.g. `--timestamp-key date-created --timestamp-format %Y-%m-%d %H:%M:%S`.\n\nTelegram-related arguments (Optional. Only used when the application is `telegram-bot`):\n\n- `CARNIVORE_TELEGRAM_TOKEN`: The Telegram bot token.\n- `CARNIVORE_TELEGRAM_CHANNEL_ID`: Optional. The Telegram channel ID to restrict the bot to.\n\nGitHub-related arguments (Optional. Only used when the post-processing command is `post-process/upload_to_github.sh`):\n\n- `CARNIVORE_GITHUB_REPO`: The GitHub repository to upload the generated files to.\n- `CARNIVORE_GITHUB_BRANCH`: Optional. The branch to upload the generated files to. Default: `master`.\n- `CARNIVORE_GITHUB_REPO_DIR`: The directory in the GitHub repository to upload the generated files to.\n- `CARNIVORE_GITHUB_TOKEN`: The GitHub token used to upload the generated files.\n\nZenrows-related arguments (Optional. For bypassing bot detection such as Cloudflare DDoS protection):\n\n- `CARNIVORE_ZENROWS_API_KEY`: The Zenrows API key.\n- `CARNIVORE_ZENROWS_PREMIUM_PROXIES`: Optional.
Set to `true` to enable premium proxies.\n- `CARNIVORE_ZENROWS_JS_RENDERING`: Optional. Set to `true` to enable JS rendering.\n\nOxyLabs-related arguments (Optional. For bypassing bot detection such as Cloudflare DDoS protection):\n\n- `CARNIVORE_OXYLABS_USER`: The OxyLabs username and password in the format `username:password`.\n- `CARNIVORE_OXYLABS_JS_RENDERING`: Optional. Set to `true` to enable JS rendering.\n\n## Components\n\n### Applications\n\n- **applications/interactive-cli**: An interactive CLI tool that reads URLs pasted in the terminal, archives web pages using **Carnivore Lib**, and invokes a post-processing command for further processing.\n\n- **applications/telegram-bot**: A Telegram bot that listens for URLs in messages sent to the bot or to a channel that includes the bot, archives web pages using **Carnivore Lib**, and invokes a post-processing command for further processing.\n\n### Carnivore Lib\n\n- **carnivore-lib/**: The main web page archiving code. It converts web pages to various formats.\n- Tools used:\n  - [monolith](https://github.com/Y2Z/monolith): Save a web page as a single HTML file with all resources embedded.
The saved HTML page looks exactly like the online version.\n  - [readability](https://github.com/mozilla/readability): Extract the article content from a web page.\n  - [pandoc](https://github.com/jgm/pandoc): Convert between various formats, including HTML and Markdown.\n\n### Post-process Scripts\n\n- [post-process/update_files.sh](post-process/update_files.sh): A script that updates the content of the generated files (mainly used to add frontmatter to the generated Markdown file).\n- [post-process/upload_to_github.sh](post-process/upload_to_github.sh): A script that uploads the generated files to a GitHub repository.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkfstorm%2Fcarnivore","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkfstorm%2Fcarnivore","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkfstorm%2Fcarnivore/lists"}