https://github.com/kepano/defuddle-cli
Command line utility to extract clean html, markdown and metadata from web pages.
- Host: GitHub
- URL: https://github.com/kepano/defuddle-cli
- Owner: kepano
- License: MIT
- Created: 2025-03-24T17:28:52.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-16T20:34:27.000Z (about 1 year ago)
- Last Synced: 2025-04-29T17:19:15.471Z (about 1 year ago)
- Language: JavaScript
- Homepage:
- Size: 66.4 KB
- Stars: 243
- Watchers: 1
- Forks: 8
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
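- StarryDivineSky - kepano/defuddle-cli - defuddle-cli is a command-line tool for extracting clean HTML, Markdown, and metadata from web pages. It helps you strip ads, navigation bars, and other distracting elements from a page, keeping only the core content. The tool uses the Readability algorithm to identify the main content area of a page and converts it into a clean format. You can use it to build your own article archive, generate summaries, or run text analysis. Defuddle-cli supports custom CSS selectors and XPath expressions for finer control over content extraction. It can also extract metadata such as the page title, author, and publication date. Easy to install and convenient to use, it is a handy tool for web content extraction. (Network Information Services / Web Crawlers)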
README
# Defuddle CLI
Command line interface for [Defuddle](https://github.com/kepano/defuddle). Extract clean HTML or Markdown from pages.
## Installation
```bash
npm install -g defuddle-cli
```
## Usage
```bash
defuddle parse <source> [options]
```
### Arguments
- `source`: HTML file path or URL to parse
### Options
- `-o, --output <file>`: Output file path (default: stdout)
- `-m, --markdown, --md`: Convert content to markdown
- `-j, --json`: Output as JSON with both HTML and markdown content
- `-p, --property <name>`: Extract a specific property (e.g., title, description, domain)
- `--debug`: Enable debug mode
- `-h, --help`: Display help for command
### Examples
Parse a local HTML file (outputs HTML):
```bash
defuddle parse article.html
```
Parse a URL and convert to markdown:
```bash
defuddle parse https://example.com/article --md
```
Parse and get the full JSON response from Defuddle:
```bash
defuddle parse article.html --json
```
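To see which top-level fields the JSON output contains without assuming their names, the output can be piped into `jq`. This is a minimal sketch that assumes `jq` is installed; `list_json_keys` is a hypothetical helper name, not part of defuddle-cli:
```bash
# Hypothetical helper: print the top-level keys of the JSON that
# `defuddle parse <source> --json` writes to stdout.
# Assumes jq is available on PATH.
list_json_keys() {
  defuddle parse "$1" --json | jq -r 'keys[]'
}

# Usage: list_json_keys article.html
```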
Save markdown output to a file:
```bash
defuddle parse article.html --md -o output.md
```
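Convert a whole directory of saved pages by looping over the files and reusing the `--md` and `-o` options. This is a sketch only; `md_path_for` and `convert_dir` are hypothetical helper names, not part of defuddle-cli:
```bash
# Map "dir/article.html" to "dir/article.md".
md_path_for() {
  printf '%s\n' "${1%.html}.md"
}

# Run `defuddle parse <file> --md -o <file>.md` for each .html file
# in the given directory (default: current directory).
convert_dir() {
  local dir="${1:-.}" file
  for file in "$dir"/*.html; do
    [ -e "$file" ] || continue   # the glob matched nothing
    defuddle parse "$file" --md -o "$(md_path_for "$file")"
  done
}

# Usage: convert_dir ./saved-pages
```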
Extract specific properties:
```bash
# Get just the title
defuddle parse article.html --property title
# Get the description
defuddle parse article.html -p description
# Get the domain
defuddle parse article.html --property domain
```
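Because `--property` returns one property per invocation in the examples above, printing several properties means calling the command once for each. A minimal sketch, where `print_properties` is a hypothetical helper name and not part of defuddle-cli:
```bash
# Print "name: value" for each requested property of one source.
# Calls `defuddle parse <source> --property <name>` once per property.
print_properties() {
  local src="$1" prop
  shift
  for prop in "$@"; do
    printf '%s: %s\n' "$prop" "$(defuddle parse "$src" --property "$prop")"
  done
}

# Usage: print_properties article.html title description domain
```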
## Development
```bash
# Install dependencies
npm install
# Build
npm run build
# Run in development mode
npm run dev
```