https://github.com/supercoderhawk/html-body-extractor
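A Python 3 implementation of the paper "General Web Page Body Text Extraction Based on the Line-Block Distribution Function" (《基于行块分布函数的通用网页正文抽取》)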
- Host: GitHub
- URL: https://github.com/supercoderhawk/html-body-extractor
- Owner: supercoderhawk
- Created: 2016-11-14T14:43:30.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2023-07-31T09:25:08.000Z (over 1 year ago)
- Last Synced: 2024-11-02T08:51:54.272Z (2 months ago)
- Language: Python
- Homepage:
- Size: 756 KB
- Stars: 2
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: ReadMe.md
README
# Web Page Body Text Extraction
[![PyPI](https://img.shields.io/pypi/v/body-extractor-py3.svg)](https://pypi.python.org/pypi/body-extractor-py3) [![PyPI](https://img.shields.io/pypi/dm/body-extractor-py3.svg)](https://pypi.python.org/pypi/body-extractor-py3)
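A Python implementation of the paper "General Web Page Body Text Extraction Based on the Line-Block Distribution Function" (《基于行块分布函数的通用网页正文抽取》).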
## Installation
```bash
pip install body-extractor-py3
```

## Usage
```python
from body_extractor import BodyExtractor
import requests

url = 'http://md.tech-ex.com/ired/2016/47848.html'
res = requests.get(url)
extractor = BodyExtractor(res.content.decode(res.encoding))
print(extractor.content)  # the extracted body text
print(extractor.title)  # the extracted <title> tag, i.e. the page title
```
## TodoList
- [ ] Support a url parameter
- [ ] Preserve images
- [ ] Generate Word documents that include images