https://github.com/mylxsw/extractor
extractor is an HTTP service that converts PDF, Markdown, HTML, Docx, Xlsx, CSV, and other files into plain text. It is intended for RAG (retrieval-augmented generation) implementations, where external documents must be read as text before vectorization.
- Host: GitHub
- URL: https://github.com/mylxsw/extractor
- Owner: mylxsw
- License: mit
- Created: 2024-03-01T09:59:51.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-03-01T17:17:35.000Z (10 months ago)
- Last Synced: 2024-10-12T00:25:28.676Z (2 months ago)
- Topics: docx, pdf, rag
- Language: Python
- Homepage:
- Size: 22.5 KB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Extractor
Extractor is an HTTP service that converts PDF, Markdown, HTML, Docx, Xlsx, CSV, and other files into plain text. It is intended for RAG (retrieval-augmented generation) implementations, where external documents must be read as text before vectorization.
Demo: https://extractor.gulu.ai
> The demo may be slow to respond. It is for trial use only and should not be used in production.
## Installation
Start the application server using Docker:
```bash
docker run -d --restart=always --name extractor \
-p 8080:80 \
mylxsw/extractor:1.0.0
```

## API
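Both endpoints in this section accept an HTTP POST, so they can be called from a script as well as from curl. The following is a minimal Python sketch using only the standard library. The helper names (`build_file_request`, `build_url_request`, `send`) are hypothetical and not part of this project. The base URL assumes the port mapping from the Docker example. The response format is not documented in this README, so the body is returned as raw text under an assumed UTF-8 encoding.

```python
import urllib.parse
import urllib.request
import uuid

# Assumption: the service is published on port 8080, as in the Docker example.
BASE_URL = "http://127.0.0.1:8080"


def build_file_request(filename: str, content: bytes,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a multipart POST for /v1/extractor/file (mirrors `curl -F file=@...`)."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode("ascii"),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode("utf-8"),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode("ascii"),
    ])
    return urllib.request.Request(
        f"{base_url}/v1/extractor/file",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def build_url_request(document_url: str,
                      base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a form-encoded POST for /v1/extractor/url (mirrors `curl -d 'url=...'`)."""
    body = urllib.parse.urlencode({"url": document_url}).encode("ascii")
    return urllib.request.Request(
        f"{base_url}/v1/extractor/url",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send a prepared request and return the response body as text.

    urlopen raises urllib.error.HTTPError on non-2xx responses.
    The UTF-8 decoding is an assumption about the service's output.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Example usage (requires a running server):
#   with open("test.pdf", "rb") as f:
#       print(send(build_file_request("test.pdf", f.read())))
#   print(send(build_url_request("https://example.com/test.pdf")))
```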
Convert a PDF document to plain text:
```bash
curl -s -X POST http://127.0.0.1:8080/v1/extractor/file -F file=@'test.pdf'
```

Automatically download the document at a URL and convert it to plain text:
```bash
curl -s -X POST http://127.0.0.1:8080/v1/extractor/url -d 'url=https://example.com/test.pdf'
```

## License
MIT
Copyright (c) 2024, mylxsw