https://github.com/concaption/school-info-parser
Extract structured data about courses, accommodations, and pricing from school prospectuses
gpt-vision ocr pdf-parsing
- Host: GitHub
- URL: https://github.com/concaption/school-info-parser
- Owner: concaption
- Created: 2025-01-26T04:00:47.000Z (4 months ago)
- Default Branch: api
- Last Pushed: 2025-03-21T01:07:31.000Z (3 months ago)
- Last Synced: 2025-03-29T18:11:53.706Z (2 months ago)
- Topics: gpt-vision, ocr, pdf-parsing
- Language: Jupyter Notebook
- Homepage:
- Size: 108 MB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
# School Information Parser
A FastAPI application that processes PDF files containing language school information using OpenAI's GPT-4 Vision API. The application extracts structured data about courses, accommodations, and pricing.
[Open in GitHub Codespaces](https://codespaces.new/concaption/school-info-parser)
Read [Notion.md](notion.md) for more details.
## Features
- Asynchronous PDF processing with background jobs
- Redis-based job queue system
- Colored logging with file and console output
- Docker containerization
- Callback support for job completion notifications
- Structured data extraction using Pydantic models
- Automatic API documentation with Swagger UI
## Prerequisites
- Python 3.9+
- Docker and Docker Compose
- OpenAI API key
- Redis server
## Installation
1. Clone the repository:
```bash
git clone https://github.com/concaption/school-info-parser.git
cd school-info-parser
```
2. Create and populate .env file:
```bash
OPENAI_API_KEY=your_api_key_here
REDIS_HOST=redis
```
3. Build and run with Docker Compose:
```bash
docker-compose up --build
```
## API Endpoints
- `GET /` - Redirects to API documentation
- `POST /submit-job/` - Submit PDFs for processing
- `GET /job/{job_id}` - Check job status and results
## Usage
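The endpoints listed above can also be driven from Python. The following is a minimal sketch using only the standard library, not code from this repository. It assumes the server runs at `http://localhost:8000`, that the upload field is named `files` (as in the curl example), and that the job response is a JSON object with a `status` key whose terminal values are `completed` and `failed`. The README does not specify the response shape, so those names are hypothetical; adjust them to whatever `/docs` reports.

```python
import json
import os
import time
import urllib.request
import uuid

BASE_URL = "http://localhost:8000"  # host and port taken from the Usage section


def build_multipart(field, filename, data, boundary=None):
    """Encode a single file as a multipart/form-data body."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def submit_job(pdf_path):
    """POST one PDF to /submit-job/ and return the parsed JSON response."""
    with open(pdf_path, "rb") as fh:
        body, content_type = build_multipart(
            "files", os.path.basename(pdf_path), fh.read()
        )
    request = urllib.request.Request(
        f"{BASE_URL}/submit-job/",
        data=body,
        headers={"Content-Type": content_type, "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def get_job(job_id):
    """GET /job/{job_id} and return the parsed JSON response."""
    with urllib.request.urlopen(f"{BASE_URL}/job/{job_id}") as response:
        return json.load(response)


def wait_for_job(job_id, fetch=get_job, interval=2.0, max_attempts=30,
                 terminal=("completed", "failed")):
    """Poll until the job reaches a terminal status.

    The "status" key and the terminal values are assumptions, not
    documented behaviour of this API.
    """
    for _ in range(max_attempts):
        job = fetch(job_id)
        if job.get("status") in terminal:
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} polls")
```

`wait_for_job` takes the fetch function as a parameter so the polling logic can be exercised without a running server. If a callback URL is supplied at submission time, polling may be unnecessary.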
1. Access the API documentation:
```
http://localhost:8000/docs
```
2. Submit a PDF file for processing:
```bash
curl -X POST "http://localhost:8000/submit-job/" \
-H "accept: application/json" \
-H "Content-Type: multipart/form-data" \
-F "files=@your_pdf_file.pdf"
```
3. Check job status:
```bash
curl -X GET "http://localhost:8000/job/{job_id}"
```
## Development
1. Create a virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate # Linux/Mac
.venv\Scripts\activate # Windows
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Run tests:
```bash
pytest
```
## Project Structure
```
school-info-parser/
├── src/
│ ├── parser.py # PDF processing logic
│ ├── schema.py # Pydantic models
│ ├── logger.py # Logging configuration
│ ├── prompts.py # OpenAI system prompts
│ └── utils.py # Utility functions
├── logs/ # Application logs
├── main.py # FastAPI application
├── Dockerfile # Container definition
└── docker-compose.yml # Container orchestration
```
## Architecture
### System Architecture
```mermaid
graph TB
Client[Client] --> API[FastAPI Application]
API --> Redis[(Redis Queue)]
API --> Logger[Logger System]
subgraph Worker Processing
Redis --> Worker[Background Worker]
Worker --> PDFProcessor[PDF Processor]
PDFProcessor --> OpenAI[OpenAI GPT-4V API]
PDFProcessor --> Storage[File Storage]
end
Logger --> FileSystem[File System Logs]
Logger --> Console[Console Output]
Worker --> Callback[Callback URL]
Worker --> Results[(Results Storage)]
```
### Workflow Diagram
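The job lifecycle in this workflow (generate a job ID, store an initial status, process each page, merge the results, update the status, optionally notify a callback) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the repository's implementation: a plain dictionary stands in for Redis, `analyze_page` stands in for the OpenAI call, and the status names and the merge rule (concatenating list-valued fields) are hypothetical.

```python
import uuid

JOBS = {}  # in-memory stand-in for the Redis job store (assumption)


def submit(files, callback_url=None):
    """Register a job and return its ID; each file is a list of pages."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {
        "status": "pending",  # hypothetical status name
        "files": list(files),
        "callback_url": callback_url,
        "result": None,
    }
    return job_id


def merge(accumulated, page_data):
    """Hypothetical merge rule: extend list-valued fields key by key."""
    for key, items in page_data.items():
        accumulated.setdefault(key, []).extend(items)
    return accumulated


def run_job(job_id, analyze_page, notify=None):
    """Process every page of every file, then record the outcome."""
    job = JOBS[job_id]
    job["status"] = "processing"
    result = {}
    try:
        for pages in job["files"]:
            for page in pages:
                merge(result, analyze_page(page))
        job.update(status="completed", result=result)
    except Exception as exc:  # record the failure instead of losing the job
        job.update(status="failed", error=str(exc))
    # Notify only when a callback URL was provided, as in the diagram.
    if job["callback_url"] and notify is not None:
        notify(job["callback_url"], job)
    return job
```

Passing `analyze_page` and `notify` as arguments keeps the sketch free of network calls, so the control flow can be followed and tested without an API key.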
```mermaid
sequenceDiagram
participant C as Client
participant A as FastAPI
participant R as Redis
participant W as Worker
participant P as PDF Processor
participant O as OpenAI API
participant CB as Callback URL
C->>A: POST /submit-job/ (PDF files)
A->>A: Generate job_id
A->>R: Store initial job status
A->>C: Return job_id
activate W
W->>R: Poll for new jobs
R-->>W: Job details
W->>P: Process PDF
loop Each Page
P->>O: Send image for analysis
O-->>P: Return structured data
P->>P: Merge results
end
W->>R: Update job status
opt If callback_url provided
W->>CB: Send results
end
deactivate W
C->>A: GET /job/{job_id}
A->>R: Get job status
R-->>A: Return results
A->>C: Return job status/results
```
## Contributing
1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Create a Pull Request
## License
MIT License - see LICENSE file for details
## Acknowledgments
- OpenAI for GPT-4 Vision API
- FastAPI for the web framework
- PyMuPDF for PDF processing