https://github.com/dataelement/bisheng-unstructured
bisheng-unstructured library
- Host: GitHub
- URL: https://github.com/dataelement/bisheng-unstructured
- Owner: dataelement
- License: apache-2.0
- Created: 2023-08-15T01:34:40.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-27T08:42:35.000Z (11 months ago)
- Last Synced: 2025-03-30T11:06:32.395Z (11 months ago)
- Topics: etl4llms
- Language: Python
- Homepage: https://m7a7tqsztt.feishu.cn/wiki/CTXNwpqGKiMs5FkKlPJcylfonuD
- Size: 42.3 MB
- Stars: 42
- Watchers: 6
- Forks: 18
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## What is bisheng-unstructured?
Bisheng-unstructured is an open-source library for parsing unstructured data, built to
power LLM workflows such as pretraining, fine-tuning, and prompt engineering.
Bisheng-unstructured makes unstructured data processing easier and provides a consistent user experience regardless of file type.
The project is a sub-project of [bisheng](https://github.com/dataelement/bisheng).
## Key features
- High-precision PDF layout parsing
- High-precision table structure recovery
- High-precision OCR
- Output that is friendlier to token processing for visual text elements such as tables and lists
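To illustrate what "friendlier to token processing" can mean for a table, here is a minimal sketch that flattens recovered table cells into Markdown text an LLM can read line by line. It is an illustration only, not the library's implementation: the function name `table_to_markdown` and the input shape (a list of rows, first row as header) are assumptions made for this example.

```python
from typing import List


def table_to_markdown(rows: List[List[str]]) -> str:
    """Render rows (first row treated as the header) as a Markdown table.

    Hypothetical helper for illustration; not part of bisheng-unstructured.
    """
    if not rows:
        return ""

    width = max(len(row) for row in rows)
    # Pad short rows so every line has the same number of cells.
    padded = [list(row) + [""] * (width - len(row)) for row in rows]

    def fmt(cells: List[str]) -> str:
        # Escape pipes so cell text cannot break the column layout.
        cleaned = [cell.strip().replace("|", "\\|") for cell in cells]
        return "| " + " | ".join(cleaned) + " |"

    lines = [fmt(padded[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(row) for row in padded[1:])
    return "\n".join(lines)


if __name__ == "__main__":
    print(table_to_markdown([["Name", "Qty"], ["Apple", "3"]]))
```

Keeping one row per line and an explicit header separator preserves the row and column relationships that are lost when table text is concatenated into a single string.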
## Quick start
### Start with Bisheng Platform
Use it as a chain node: [ElemUnstructureLoader](https://m7a7tqsztt.feishu.cn/wiki/VpyNwTt7ZiypbdkoPuJcn5w2nxf)
### Start with DataElem Services
We provide an open cloud service for easy use. See the [free trial](https://m7a7tqsztt.feishu.cn/wiki/CTXNwpqGKiMs5FkKlPJcylfonuD).
### Install bisheng-unstructured
- Install from pip: `pip install bisheng-unstructured`
- [Quick Start Guide](https://m7a7tqsztt.feishu.cn/wiki/CTXNwpqGKiMs5FkKlPJcylfonuD)
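After installing, a quick way to confirm that the package is visible to your interpreter is to query its installed version through the standard library. This sketch uses only `importlib.metadata` and makes no assumptions about the library's own API; the helper name `installed_version` is chosen for this example.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("bisheng-unstructured")
    if version is None:
        print("bisheng-unstructured is not installed in this environment")
    else:
        print(f"bisheng-unstructured {version} is installed")
```

If the script reports that the package is missing, check that `pip` and `python` point to the same environment.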
### Using a prebuilt image
## Documentation
For guidance on installation, development, deployment, and administration,
check out [bisheng-unstructured Docs](https://m7a7tqsztt.feishu.cn/wiki/CTXNwpqGKiMs5FkKlPJcylfonuD).
## Issues
Reporting problems and asking questions
We appreciate any feedback, questions, or bug reports regarding this project.
To report a problem, post an [Issue](https://github.com/dataelement/bisheng/issues)
and follow the process outlined in the [Stack Overflow document](https://stackoverflow.com/help/mcve) on minimal, reproducible examples.
For questions, we recommend posting in our community GitHub [Discussions](https://github.com/dataelement/bisheng/discussions).
## Acknowledgments
bisheng-unstructured adopts dependencies from the following:
- Thanks to [unstructured](https://github.com/Unstructured-IO/unstructured) for the main framework.