https://github.com/mylxsw/extractor

Host: GitHub
URL: https://github.com/mylxsw/extractor
Owner: mylxsw
License: MIT
Created: 2024-03-01T09:59:51.000Z
Default Branch: main
Last Pushed: 2024-03-01T17:17:35.000Z
Last Synced: 2024-10-12T00:25:28.676Z
Topics: docx, pdf, rag
Language: Python
Homepage:
Size: 22.5 KB
Stars: 3
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Extractor

Extractor is an HTTP service that converts PDF, Markdown, HTML, Docx, Xlsx, CSV, and other files into plain text. It is intended for RAG (retrieval-augmented generation) implementations, where external documents must be read as text before they can be vectorized.

Demo: https://extractor.gulu.ai

> The demo may be slow. It is for trial use only and should not be used in production.

## Installation

Start the application server using Docker:

```bash
docker run -d --restart=always --name extractor \
-p 8080:80 \
mylxsw/extractor:1.0.0
```

## API

Convert a PDF document to plain text:

```bash
curl -s -X POST http://127.0.0.1:8080/v1/extractor/file -F file=@'test.pdf'
```
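For callers that prefer Python over curl, the same upload can be sketched with the standard library alone. This is a minimal sketch, not part of the project: the helper names `build_file_request` and `extract_file` are hypothetical, the base URL assumes the Docker port mapping shown above, and because the response format is not documented here, the body is returned as raw text (assumed to be UTF-8).

```python
import os
import urllib.request
import uuid

# Assumption: the service is reachable on host port 8080, as in the Docker command.
BASE_URL = "http://127.0.0.1:8080"


def build_file_request(filename: str, content: bytes,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a multipart/form-data POST with a single form field named "file"."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{base_url}/v1/extractor/file",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def extract_file(path: str, base_url: str = BASE_URL) -> str:
    """Upload a local file and return the raw response body as text (UTF-8 assumed)."""
    with open(path, "rb") as f:
        request = build_file_request(os.path.basename(path), f.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the container running, `extract_file("test.pdf")` would send the same request as the curl command.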

Download the document at a URL and convert it to plain text:

```bash
curl -s -X POST http://127.0.0.1:8080/v1/extractor/url -d 'url=https://example.com/test.pdf'
```
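The URL endpoint takes a form-encoded `url` field, so a Python equivalent is short. Again a minimal sketch under assumptions: `build_url_request` and `extract_url` are hypothetical names, the base URL assumes the Docker port mapping shown above, and the response body is returned as raw text (assumed to be UTF-8) because its format is not documented here.

```python
import urllib.parse
import urllib.request

# Assumption: the service is reachable on host port 8080, as in the Docker command.
BASE_URL = "http://127.0.0.1:8080"


def build_url_request(document_url: str,
                      base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a form-encoded POST carrying the document URL in the "url" field."""
    body = urllib.parse.urlencode({"url": document_url}).encode("ascii")
    return urllib.request.Request(
        f"{base_url}/v1/extractor/url",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def extract_url(document_url: str, base_url: str = BASE_URL) -> str:
    """Ask the service to fetch and convert a remote document; return the raw body."""
    with urllib.request.urlopen(build_url_request(document_url, base_url)) as response:
        return response.read().decode("utf-8")
```

Unlike the curl command, `urlencode` percent-encodes the `url` value, which keeps document URLs containing `&` or `=` intact in the form body.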

## License

MIT