{"id":24998919,"url":"https://github.com/concaption/school-info-parser","last_synced_at":"2026-02-16T04:31:30.224Z","repository":{"id":275629412,"uuid":"922388715","full_name":"concaption/school-info-parser","owner":"concaption","description":"Extract structured data about courses, accommodations, and pricing from school prospectuses","archived":false,"fork":false,"pushed_at":"2025-03-21T01:07:31.000Z","size":113567,"stargazers_count":1,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"api","last_synced_at":"2025-07-25T05:28:11.354Z","etag":null,"topics":["gpt-vision","ocr","pdf-parsing"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/concaption.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-01-26T04:00:47.000Z","updated_at":"2025-03-06T19:41:25.000Z","dependencies_parsed_at":"2025-04-22T20:04:39.949Z","dependency_job_id":null,"html_url":"https://github.com/concaption/school-info-parser","commit_stats":null,"previous_names":["concaption/school-info-parser"],"tags_count":0,"template":false,"template_full_name":"concaption/python-template","purl":"pkg:github/concaption/school-info-parser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/concaption%2Fschool-info-parser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/concaption%2Fschool-info-parser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/concaption%2Fschool-info-parser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/concaption%2Fschool-info-parser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/concaption","download_url":"https://codeload.github.com/concaption/school-info-parser/tar.gz/refs/heads/api","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/concaption%2Fschool-info-parser/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29500320,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-16T03:57:51.541Z","status":"ssl_error","status_checked_at":"2026-02-16T03:55:59.854Z","response_time":115,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gpt-vision","ocr","pdf-parsing"],"created_at":"2025-02-04T18:52:18.937Z","updated_at":"2026-02-16T04:31:30.212Z","avatar_url":"https://github.com/concaption.png","language":"Jupyter Notebook","readme":"# School Information Parser\n\nA FastAPI application that processes PDF files containing language school information using OpenAI's GPT-4 Vision API. The application extracts structured data about courses, accommodations, and pricing.\n\n[![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/concaption/school-info-parser)\n\n\u003cdiv\u003e\n    \u003ca href=\"https://www.loom.com/share/d018d31a1bd34387874f94361a5c8ffa\"\u003e\n      \u003cp\u003eSchool Information Parser - Watch Video\u003c/p\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://www.loom.com/share/d018d31a1bd34387874f94361a5c8ffa\"\u003e\n      \u003cimg style=\"max-width:300px;\" src=\"https://cdn.loom.com/sessions/thumbnails/d018d31a1bd34387874f94361a5c8ffa-f98922728e9badf7-full-play.gif\"\u003e\n    \u003c/a\u003e\n  \u003c/div\u003e\n\nRead [Notion.md](notion.md) for more details. \n\n## Features\n\n- Asynchronous PDF processing with background jobs\n- Redis-based job queue system\n- Colored logging with file and console output\n- Docker containerization\n- Callback support for job completion notifications\n- Structured data extraction using Pydantic models\n- Automatic API documentation with Swagger UI\n\n## Prerequisites\n\n- Python 3.9+\n- Docker and Docker Compose\n- OpenAI API key\n- Redis server\n\n## Installation\n\n1. Clone the repository:\n```bash\ngit clone https://github.com/concaption/school-info-parser.git\ncd school-info-parser\n```\n\n2. Create and populate .env file:\n```bash\nOPENAI_API_KEY=your_api_key_here\nREDIS_HOST=redis\n```\n\n3. Build and run with Docker Compose:\n```bash\ndocker-compose up --build\n```\n\n## API Endpoints\n\n- `GET /` - Redirects to API documentation\n- `POST /submit-job/` - Submit PDFs for processing\n- `GET /job/{job_id}` - Check job status and results\n\n## Usage\n\n1. Access the API documentation:\n```\nhttp://localhost:8000/docs\n```\n\n2. Submit a PDF file for processing:\n```bash\ncurl -X POST \"http://localhost:8000/submit-job/\" \\\n     -H \"accept: application/json\" \\\n     -H \"Content-Type: multipart/form-data\" \\\n     -F \"files=@your_pdf_file.pdf\"\n```\n\n3. Check job status:\n```bash\ncurl -X GET \"http://localhost:8000/job/{job_id}\"\n```\n\n## Development\n\n1. Create a virtual environment:\n```bash\npython -m venv .venv\nsource .venv/bin/activate  # Linux/Mac\n.venv\\Scripts\\activate     # Windows\n```\n\n2. Install dependencies:\n```bash\npip install -r requirements.txt\n```\n\n3. Run tests:\n```bash\npytest\n```\n\n## Project Structure\n\n```\nschool-info-parser/\n├── src/\n│   ├── parser.py      # PDF processing logic\n│   ├── schema.py      # Pydantic models\n│   ├── logger.py      # Logging configuration\n│   ├── prompts.py     # OpenAI system prompts\n│   └── utils.py       # Utility functions\n├── logs/              # Application logs\n├── main.py           # FastAPI application\n├── Dockerfile        # Container definition\n└── docker-compose.yml # Container orchestration\n```\n\n## Architecture\n\n### System Architecture\n```mermaid\ngraph TB\n    Client[Client] --\u003e API[FastAPI Application]\n    API --\u003e Redis[(Redis Queue)]\n    API --\u003e Logger[Logger System]\n    \n    subgraph Worker Processing\n        Redis --\u003e Worker[Background Worker]\n        Worker --\u003e PDFProcessor[PDF Processor]\n        PDFProcessor --\u003e OpenAI[OpenAI GPT-4V API]\n        PDFProcessor --\u003e Storage[File Storage]\n    end\n    \n    Logger --\u003e FileSystem[File System Logs]\n    Logger --\u003e Console[Console Output]\n    \n    Worker --\u003e Callback[Callback URL]\n    Worker --\u003e Results[(Results Storage)]\n```\n\n### Workflow Diagram\n```mermaid\nsequenceDiagram\n    participant C as Client\n    participant A as FastAPI\n    participant R as Redis\n    participant W as Worker\n    participant P as PDF Processor\n    participant O as OpenAI API\n    participant CB as Callback URL\n\n    C-\u003e\u003eA: POST /submit-job/ (PDF files)\n    A-\u003e\u003eA: Generate job_id\n    A-\u003e\u003eR: Store initial job status\n    A-\u003e\u003eC: Return job_id\n    \n    activate W\n    W-\u003e\u003eR: Poll for new jobs\n    R--\u003e\u003eW: Job details\n    W-\u003e\u003eP: Process PDF\n    \n    loop Each Page\n        P-\u003e\u003eO: Send image for analysis\n        O--\u003e\u003eP: Return structured data\n        P-\u003e\u003eP: Merge results\n    end\n    \n    W-\u003e\u003eR: Update job status\n    \n    opt If callback_url provided\n        W-\u003e\u003eCB: Send results\n    end\n    deactivate W\n    \n    C-\u003e\u003eA: GET /job/{job_id}\n    A-\u003e\u003eR: Get job status\n    R--\u003e\u003eA: Return results\n    A-\u003e\u003eC: Return job status/results\n```\n\n## Contributing\n\n1. Fork the repository\n2. Create a feature branch\n3. Commit your changes\n4. Push to the branch\n5. Create a Pull Request\n\n## License\n\nMIT License - see LICENSE file for details\n\n## Acknowledgments\n\n- OpenAI for GPT-4 Vision API\n- FastAPI for the web framework\n- PyMuPDF for PDF processing\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconcaption%2Fschool-info-parser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fconcaption%2Fschool-info-parser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fconcaption%2Fschool-info-parser/lists"}