https://github.com/agentgill/unstructured-local
- Host: GitHub
- URL: https://github.com/agentgill/unstructured-local
- Owner: agentgill
- Created: 2024-08-26T20:34:10.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-08-26T20:39:33.000Z (9 months ago)
- Last Synced: 2024-10-06T04:05:26.128Z (8 months ago)
- Topics: macos, pdf, python, unstructured
- Language: Python
- Homepage: https://unstructured.io
- Size: 105 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Unstructured.io Local
## Resources
- [Official Documentation](https://docs.unstructured.io)
- [Open Source Documentation](https://docs.unstructured.io/open-source/introduction/overview)
- [Full-installation](https://docs.unstructured.io/open-source/installation/full-installation)
- [Unstructured Repo](https://github.com/Unstructured-IO/unstructured)

## Installation
Install System Dependencies (macOS using brew)
```bash
brew install libmagic
brew install poppler
brew install tesseract
brew install pandoc
brew install libreoffice
```

Create Venv
```bash
python3.12 -m venv .venv && source .venv/bin/activate
```

Install unstructured
```bash
# The [pdf] extra installs the dependencies that partition_pdf needs
pip install "unstructured[pdf]"
```

## Usage
Turn unstructured data (PDF) into structured data (JSON)
```python
#!/usr/bin/env python3
import json

from unstructured.partition.pdf import partition_pdf

# Fake Patient Data
elements = partition_pdf(filename="document.pdf")

# Convert each element to a plain dict so it can be serialized
element_dicts = [element.to_dict() for element in elements]
json_elements = json.dumps(element_dicts, indent=2)
print(json_elements)

with open("output.json", "w") as file:
    file.write(json_elements)
```

## Understanding structured data
[Document Element Types](https://docs.unstructured.io/open-source/concepts/document-elements#element-type)
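
Once `output.json` exists, a quick way to get a feel for the element types is to count them and pull out the text for one type. The sketch below is a minimal illustration, not part of the original repo: the helper names (`load_elements`, `count_element_types`, `text_by_type`) are hypothetical, and the inline sample stands in for real output, assuming each element dict has at least a `type` and a `text` key.

```python
import json
from collections import Counter


def load_elements(path):
    """Load a list of element dicts from a JSON file such as output.json."""
    with open(path) as file:
        return json.load(file)


def count_element_types(element_dicts):
    """Count how many elements of each type appear."""
    return Counter(d.get("type", "Unknown") for d in element_dicts)


def text_by_type(element_dicts, element_type):
    """Return the text of every element with the given type."""
    return [d.get("text", "") for d in element_dicts if d.get("type") == element_type]


# Hypothetical sample standing in for real output; swap in
# load_elements("output.json") to inspect your own file.
sample = [
    {"type": "Title", "text": "Patient Summary"},
    {"type": "NarrativeText", "text": "The patient reported mild symptoms."},
    {"type": "NarrativeText", "text": "No follow-up was required."},
]

counts = count_element_types(sample)
for element_type, count in counts.most_common():
    print(f"{element_type}: {count}")

print(text_by_type(sample, "Title"))
```

Filtering on `type` like this is often enough to separate headings from body text before passing the content to another tool.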