https://github.com/zmh-program/blob-service

📦 Out-Of-The-Box & Powerful File Parsing Service, support Text/Pdf/Docx/Pptx/Xlsx/Image/Audio parsing, support OCR, support Base64/Local/S3/R2/TG/MinIO storage.

blob fileparser ocr storage


Host: GitHub
URL: https://github.com/zmh-program/blob-service
Owner: zmh-program
License: apache-2.0
Created: 2023-10-30T10:19:02.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-12-04T01:16:39.000Z (7 months ago)
Last Synced: 2025-03-04T14:25:16.055Z (4 months ago)
Topics: blob, fileparser, ocr, storage
Language: Python
Homepage: https://blob.chatnio.net
Size: 64.5 KB
Stars: 101
Watchers: 4
Forks: 40
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# 📦 Chat Nio Blob Service

### **🤯 File Service for Chat Nio**

[![Deploy to Vercel](https://vercel.com/button)](https://vercel.com/new/clone?repository-url=https://github.com/Deeptrain-Community/chatnio-blob-service)

[![Deploy on Zeabur](https://zeabur.com/button.svg)](https://zeabur.com/templates/RWGFOH)



## Features

- ⚡ **Out-of-the-Box**: No External Dependencies Required; Supports One-Click Deployment on Vercel/Render

- ⭐ **Multiple File Types**: Supports Text, Pdf, Docx, Excel, Image, Audio, etc.

- 📦 **Multiple Storage Options**: Base64, Local, S3, Cloudflare R2, Min IO, Telegram CDN, etc.

- 🔍 **OCR Support**: Extract Text from Images (Requires Paddle OCR API)

- 🔊 **Audio Support**: Convert Audio to Text (Requires Azure Speech to Text Service)

## Supported File Types

- Text

- Image (_requires vision models_)

- Audio (_requires Azure Speech to Text Service_)

- Docx (_.doc is not supported_)

- Pdf

- Pptx (_.ppt is not supported_)

- Xlsx (_.xls is also supported_)

## Deploy by Docker

> Image: `programzmh/chatnio-blob-service`

```shell
docker run -p 8000:8000 programzmh/chatnio-blob-service

# with environment variables
# docker run -p 8000:8000 -e AZURE_SPEECH_KEY="..." -e AZURE_SPEECH_REGION="..." programzmh/chatnio-blob-service

# if you are using `local` storage type, you need to mount volume (/app/static) to the host
# docker run -p 8000:8000 -v /path/to/static:/app/static programzmh/chatnio-blob-service
```

> Deploy to [Render.com](https://render.com)
>
> [![Deploy to Render](https://render.com/images/deploy-to-render-button.svg)](https://dashboard.render.com/select-image?type=web&image=programzmh%2Fchatnio-blob-service)
>
> Select **Web Service** and **Docker** Image, then input the image `programzmh/chatnio-blob-service` and click **Create Web Service**.
>
> > ⭐ Render.com Includes Free **750 Hours** of Usage per Month

## Deploy by Source Code

The service will be running on `http://localhost:8000`

## Run

```shell
git clone --branch=main https://github.com/Deeptrain-Community/chatnio-blob-service
cd chatnio-blob-service
pip install -r requirements.txt
uvicorn main:app

# enable hot reload
# uvicorn main:app --reload
```

## API

`POST` `/upload` Upload a file

```json
{
    "file": "[file]",
    "enable_ocr": false,
    "enable_vision": true,
    "save_all": false
}
```

| Parameter       | Type    | Description                                                                      |
|-----------------|---------|----------------------------------------------------------------------------------|
| `file`          | *File   | File to Upload                                                                   |
| `enable_ocr`    | Boolean | Enable OCR (Default: `false`); *requires the OCR config to be set*               |
| `enable_vision` | Boolean | Enable Vision (Default: `true`); *skipped if `enable_ocr` is true*               |
| `save_all`      | Boolean | Save All (Default: `false`); *stores files of all types without processing them* |

Response

```json
{
  "status": true,
  "type": "pdf",
  "content": "...",
  "error": ""
}
```

| Parameter       | Type     | Description    |
|-----------------|----------|----------------|
| `status`        | Boolean  | Request Status |
| `type`          | String   | File Type      |
| `content`       | String   | File Data      |
| `error`         | String   | Error Message  |
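
A minimal client sketch for the `/upload` endpoint, using only the Python standard library. It assumes the endpoint accepts `multipart/form-data` with the boolean options sent as `true`/`false` form fields, which the README does not state explicitly. The helper names `encode_multipart` and `upload` are hypothetical and not part of the service.

```python
import json
import urllib.request
import uuid


def encode_multipart(fields: dict[str, str], filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode form fields plus one `file` part; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    lines: list[bytes] = []
    for name, value in fields.items():
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode(),
        ]
    lines += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        data,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


def upload(base_url: str, filename: str, data: bytes, *, enable_ocr: bool = False,
           enable_vision: bool = True, save_all: bool = False) -> dict:
    """POST a file to `<base_url>/upload` and return the parsed JSON response."""
    # Assumption: booleans are passed as lowercase strings in form fields.
    fields = {
        "enable_ocr": str(enable_ocr).lower(),
        "enable_vision": str(enable_vision).lower(),
        "save_all": str(save_all).lower(),
    }
    body, content_type = encode_multipart(fields, filename, data)
    request = urllib.request.Request(
        f"{base_url}/upload",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        result = json.loads(response.read())
    # `status` and `error` are the response fields documented above.
    if not result.get("status"):
        raise RuntimeError(result.get("error") or "upload failed")
    return result
```

With a local instance running, a call might look like `upload("http://localhost:8000", "report.pdf", open("report.pdf", "rb").read())`, after which `result["type"]` and `result["content"]` hold the parsed output.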

## Environment Variables

### `1` 🎨 General Config (Optional)

- `PDF_MAX_IMAGES`: Max Images Extracted from a PDF File (Default: `10`)

    - **0**: Never Extract Images

    - **-1**: Extract All Images

    - **other**: Extract Top N Images

    - *Tips: The extracted images will be **treated as normal image files** and processed directly*.

- `MAX_FILE_SIZE`: Max Uploaded File Size in MiB (Default: `-1`, No Limit)

  - *Tips: The size limit also depends on the server configuration (e.g. Nginx/Apache Config, Vercel Free Plan Limit of **5MB** Body Size)*

- `CORS_ALLOW_ORIGINS`: CORS Allow Origins (Default: `*`)

  - e.g.: *http://localhost:3000,https://example.com*
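
The semantics of the three general settings can be sketched as plain functions. This is an illustration of the documented behavior, not the service's actual code: the function names are hypothetical, "Top N" is assumed to mean the first N images in extraction order, and any negative `MAX_FILE_SIZE` is assumed to mean "no limit" (only `-1` is documented).

```python
def select_pdf_images(images: list, limit: int) -> list:
    """Apply PDF_MAX_IMAGES: 0 = none, -1 = all, N = first N images."""
    if limit == 0:
        return []
    if limit < 0:  # README documents -1; other negatives are treated the same here
        return list(images)
    return list(images[:limit])


def exceeds_size_limit(size_bytes: int, max_mib: float) -> bool:
    """Apply MAX_FILE_SIZE (in MiB); a negative value disables the check."""
    if max_mib < 0:
        return False
    return size_bytes > max_mib * 1024 * 1024


def parse_origins(raw: str) -> list[str]:
    """Split a comma-separated CORS_ALLOW_ORIGINS value into a list of origins."""
    return [origin.strip() for origin in raw.split(",") if origin.strip()]
```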

### `2` 🔊 Audio Config (Optional)

- `AZURE_SPEECH_KEY`: Azure Speech to Text Service Key (Required for Audio Support)

- `AZURE_SPEECH_REGION`: Azure Speech to Text Service Region (Required for Audio Support)

### `3` 🖼 Storage Config (Optional)

> [!NOTE]

> Storage Config Applies to **Image** Files and the `Save All` Option Only.

1. ✨ No Storage (Default)

   - [x] **No Storage Required & No External Dependencies**

   - [x] Base64 Encoding/Decoding

   - [x] Do **Not** Store Anything

   - [x] Support Serverless Deployment **Without Storage** (e.g. Vercel)

   - [ ] No Direct URL Access *(Base64 is not supported by models like `gpt-4-all`)*

2. 📁 Local Storage

   - [ ] **Require Server Environment** (e.g. VPS, Docker)

   - [x] Support Direct URL Access

   - [x] No Extra Storage Cost

   - Config:

     - set env `STORAGE_TYPE` to `local` (e.g. `STORAGE_TYPE=local`)

     - set env `LOCAL_STORAGE_DOMAIN` to your deployment domain (e.g. `LOCAL_STORAGE_DOMAIN=http://blob-service.onrender.com`)

     - if you are using Docker, you need to mount volume `/app/static` to the host (e.g. `-v /path/to/static:/app/static`)

     

3. 🚀 [AWS S3](https://aws.amazon.com/s3)

   - [ ] **Paid Storage**

   - [x] Support Direct URL Access

   - [x] China Mainland User Friendly

   - Config:

     - set env `STORAGE_TYPE` to `s3` (e.g. `STORAGE_TYPE=s3`)

     - set env `S3_ACCESS_KEY` to your AWS Access Key ID

     - set env `S3_SECRET_KEY` to your AWS Secret Access Key

     - set env `S3_BUCKET` to your AWS S3 Bucket Name

     - set env `S3_REGION` to your AWS S3 Region

4. 🔔 [Cloudflare R2](https://www.cloudflare.com/zh-cn/developer-platform/r2)

   - [x] **Free Storage Quota ([10GB Storage & Zero Outbound Cost](https://developers.cloudflare.com/r2/pricing/))**

   - [x] Support Direct URL Access

   - Config *(S3 Compatible)*:

     - set env `STORAGE_TYPE` to `s3` (e.g. `STORAGE_TYPE=s3`)

     - set env `S3_ACCESS_KEY` to your Cloudflare R2 Access Key ID

     - set env `S3_SECRET_KEY` to your Cloudflare R2 Secret Access Key

     - set env `S3_BUCKET` to your Cloudflare R2 Bucket Name

     - set env `S3_DOMAIN` to your Cloudflare R2 Domain Name (e.g. `https://.r2.cloudflarestorage.com`)

     - set env `S3_DIRECT_URL_DOMAIN` to your Cloudflare R2 Public URL Access Domain Name ([Open Public URL Access](https://developers.cloudflare.com/r2/buckets/public-buckets/), e.g. `https://pub-xxx.r2.dev`)

5. 📦 [Min IO](https://min.io)

    - [x] **Self Hosted**

    - [x] Reliable & Flexible Storage

    - Config *(S3 Compatible)*:

      - set env `STORAGE_TYPE` to `s3` (e.g. `STORAGE_TYPE=s3`)

      - set env `S3_SIGN_VERSION` to `s3v4` (e.g. `S3_SIGN_VERSION=s3v4`)

      - set env `S3_ACCESS_KEY` to your Min IO Access Key ID

      - set env `S3_SECRET_KEY` to your Min IO Secret Access Key

      - set env `S3_BUCKET` to your Min IO Bucket Name

      - set env `S3_DOMAIN` to your Min IO Domain Name (e.g. `https://oss.example.com`)

      - *[Optional] If you are using CDN, you can set `S3_DIRECT_URL_DOMAIN` to your Min IO Public URL Access Domain Name (e.g. `https://cdn-hk.example.com`)*

6. ❤ [Telegram CDN](https://github.com/csznet/tgState)

    - [x] **Free Storage (Rate Limit)**

    - [x] Support Direct URL Access *(China Mainland User Unfriendly)*

    - [x] **Limited** File Type & Format

    - Config:

      - set env `STORAGE_TYPE` to `tg` (e.g. `STORAGE_TYPE=tg`)

      - set env `TG_ENDPOINT` to your TG-STATE Endpoint (e.g. `TG_ENDPOINT=https://tgstate.vercel.app`)

      - *[Optional] if you are using password authentication, you can set `TG_PASSWORD` to your TG-STATE Password*
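
As a quick pre-deployment check, the variables listed above can be validated per storage type. The sketch below is a hypothetical helper, not part of the service. It assumes that an unset `STORAGE_TYPE` means the default no-storage mode, and that an `s3` setup needs at least one of `S3_REGION` (AWS) or `S3_DOMAIN` (R2, Min IO), which is inferred from the per-provider lists rather than stated by the README.

```python
from collections.abc import Mapping

# Variables each storage type needs, taken from the lists above.
REQUIRED_ENV = {
    "local": ["LOCAL_STORAGE_DOMAIN"],
    "s3": ["S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_BUCKET"],
    "tg": ["TG_ENDPOINT"],
}


def missing_storage_env(env: Mapping[str, str]) -> list[str]:
    """Return the names of storage variables that are required but not set."""
    storage_type = env.get("STORAGE_TYPE", "")
    if not storage_type:
        return []  # assumption: unset means the default "No Storage" mode
    if storage_type not in REQUIRED_ENV:
        # This helper rejects unknown types; the service itself may behave differently.
        raise ValueError(f"unknown STORAGE_TYPE: {storage_type!r}")
    missing = [name for name in REQUIRED_ENV[storage_type] if not env.get(name)]
    # Assumption: AWS uses S3_REGION while R2 / Min IO use S3_DOMAIN.
    if storage_type == "s3" and not (env.get("S3_REGION") or env.get("S3_DOMAIN")):
        missing.append("S3_REGION or S3_DOMAIN")
    return missing
```

Calling `missing_storage_env(os.environ)` before starting the service returns an empty list when the chosen storage type is fully configured.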

    

### `4` 🔍 OCR Config (Optional)

> [!NOTE]

> OCR Support is based on 👉 [PaddleOCR API](https://github.com/cgcel/PaddleOCRFastAPI) (✔ Self Hosted ✔ Open Source)

- `OCR_ENDPOINT` Paddle OCR Endpoint

    - e.g.: *http://example.com:8000*

## Common Errors

- *Cannot Use `Save All` Options Without Storage Config*:

    - This error occurs when you enable `save_all` option without storage config. You need to set `STORAGE_TYPE` to `local` or other storage type to use this option.

- *Trying to upload image with Vision disabled. Enable Vision or OCR to process image*:

    - This error occurs when you disable `enable_vision` and `enable_ocr` at the same time. You need to enable at least one of them to process image files.

- *.ppt files are not supported, only .pptx files are supported*:

    - This error occurs when you upload an old-format Office PowerPoint file. You need to convert it to `.pptx` format to process it.

- *.doc files are not supported, only .docx files are supported*:

    - This error occurs when you upload an old-format Office Word file. You need to convert it to `.docx` format to process it.

- *File Size Limit Exceeded*:

    - This error occurs when you upload a file that exceeds the `MAX_FILE_SIZE` limit. You need to reduce the file size to upload it.
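
The option rules behind the first two errors can be summarized in a short sketch. It mirrors the documented behavior (`enable_ocr` takes precedence over `enable_vision`, and `save_all` needs a storage backend) but is not the service's actual code; the function names and the `storage_configured` flag are hypothetical.

```python
def choose_image_mode(enable_ocr: bool = False, enable_vision: bool = True) -> str:
    """Decide how an uploaded image is processed, per the API parameter table."""
    if enable_ocr:
        return "ocr"  # OCR wins; enable_vision is skipped
    if enable_vision:
        return "vision"
    raise ValueError(
        "Trying to upload image with Vision disabled. "
        "Enable Vision or OCR to process image"
    )


def check_save_all(save_all: bool, storage_configured: bool) -> None:
    """Reject `save_all` when no storage backend is configured."""
    if save_all and not storage_configured:
        raise ValueError("Cannot Use `Save All` Options Without Storage Config")
```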

## Development

- **~/config.py**: Env Config

- **~/main.py**: Entry Point

- **~/utils.py**: Utilities

- **~/handlers**: File Handlers

- **~/store**: Storage Handlers

- **~/static**: Static Files (if using **local** storage)

## Tech Stack

- Python & FastAPI

## License

Apache License 2.0
