https://github.com/e-d-i-n-i/ai-data-extraction
AI-driven system for structured data extraction, storage, and vector search, leveraging Crawl4AI, PydanticAI, and Supabase to enable efficient retrieval and RAG-based AI applications.
- Host: GitHub
- URL: https://github.com/e-d-i-n-i/ai-data-extraction
- Owner: e-d-i-n-i
- License: mit
- Created: 2025-04-05T13:53:14.000Z (18 days ago)
- Default Branch: main
- Last Pushed: 2025-04-05T14:17:49.000Z (18 days ago)
- Last Synced: 2025-04-05T15:20:14.066Z (18 days ago)
- Topics: api, crawl4ai, nextjs, pdf-data-extraction, pydantic-ai, python, sitemap-scraping, supabase, vector-embeddings, web-scraping
- Homepage:
- Size: 2.93 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# AI-Powered Data Extraction and Storage System
## Overview
This AI-powered system is designed to extract, validate, structure, and store large volumes of unstructured data efficiently. It uses semantic search with vectorized storage to enable fast and intelligent information retrieval, making it ideal for Retrieval-Augmented Generation (RAG) applications.
## Features
- **Automated Data Extraction:** Extracts data in chunks using intelligent crawling techniques.
- **Structured Storage:** Stores extracted data in a format enriched with metadata for easy retrieval.
- **Semantic Search:** Integrates vector search using Supabase to enable context-aware information lookup.
- **Data Validation:** Ensures consistency and accuracy of extracted data using PydanticAI (see the schema sketch after this list).
- **RAG-Ready:** Supports downstream AI tasks like document-based question answering and summarization.
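
The sketch below illustrates what a validated chunk might look like as a Pydantic model. The model and field names (`ExtractedChunk`, `source_url`, `content`, `embedding`) are hypothetical, not the schema this repository actually defines.

```python
# Hypothetical schema for a validated chunk; field names are assumptions,
# not this repository's actual data model.
from datetime import datetime, timezone
from typing import List, Optional

from pydantic import BaseModel, Field, HttpUrl


class ExtractedChunk(BaseModel):
    source_url: HttpUrl                      # page or PDF the chunk came from
    title: str = Field(min_length=1)         # short, human-readable label
    content: str = Field(min_length=1)       # raw text pulled by the crawler
    embedding: Optional[List[float]] = None  # vector added before storage
    extracted_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))


# Validation raises a ValidationError if a required field is missing or malformed.
chunk = ExtractedChunk(
    source_url="https://example.com/docs",
    title="Example page",
    content="Unstructured text extracted in chunks...",
)
```

Validating records up front keeps malformed data out of the vector store, which is what the Data Validation feature above relies on.
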
## Tech Stack

- **Backend:** Python
- **Data Extraction:** Crawl4AI (see the crawler sketch after this list)
- **Data Validation & Structuring:** PydanticAI
- **Storage:** Supabase (with vector column)
- **AI Integration:** Retrieval-Augmented Generation (RAG), Semantic Search
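
As a rough sketch of the extraction layer, the snippet below uses Crawl4AI's `AsyncWebCrawler` to fetch a single page as markdown. The URL is a placeholder, and the real pipeline presumably layers sitemap iteration, chunking, and PDF handling on top of this.

```python
# Minimal Crawl4AI sketch: crawl one page and preview its markdown.
# The URL is a placeholder; the actual system also handles sitemaps and PDFs.
import asyncio

from crawl4ai import AsyncWebCrawler


async def main() -> None:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/docs")
        print(str(result.markdown)[:500])  # first 500 characters of extracted text


if __name__ == "__main__":
    asyncio.run(main())
```
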
## Installation

### Prerequisites
- Python 3.9+
- Supabase account

### Setup Instructions
1. Clone the repository:
```sh
git clone https://github.com/e-d-i-n-i/ai-data-extraction.git
cd ai-data-extraction
```

2. Create a virtual environment and activate it:
```sh
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
```

3. Install the dependencies:
```sh
pip install -r requirements.txt
```

4. Set up your environment variables in a `.env` file (see the client sketch after these installation steps):
```env
SUPABASE_URL=your_supabase_url
SUPABASE_KEY=your_supabase_api_key
```

5. Run the system:
```sh
python main.py
```
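
The sketch below shows one way the variables from step 4 can be consumed at startup. It assumes `python-dotenv` and the `supabase` client are listed in `requirements.txt`, which this README does not confirm.

```python
# Read SUPABASE_URL and SUPABASE_KEY from .env and build a Supabase client.
# Assumes python-dotenv and supabase-py are installed via requirements.txt.
import os

from dotenv import load_dotenv
from supabase import Client, create_client

load_dotenv()  # pulls the .env values from step 4 into the environment

supabase: Client = create_client(
    os.environ["SUPABASE_URL"],
    os.environ["SUPABASE_KEY"],
)
```
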
## Usage

1. Configure data sources in the system.
2. Run the extractor to crawl and fetch unstructured data.
3. Validate and structure data using PydanticAI.
4. Store the structured data in Supabase along with its vector embeddings.
5. Perform semantic search (see the sketch below) or integrate the results into RAG pipelines for intelligent applications.
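
To make step 5 concrete, here is a hedged sketch of a semantic-search call: embed the query, then ask Postgres (pgvector) for the nearest stored chunks through a Supabase RPC. The `match_documents` function, the embedding model, and the use of OpenAI embeddings are assumptions borrowed from common Supabase pgvector setups, not details confirmed by this repository.

```python
# Hypothetical semantic search: embed a query and call a pgvector
# similarity function via Supabase. "match_documents" is an assumed
# SQL function name, not something this repo is known to define.
import os

from openai import OpenAI
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

query = "How is sitemap scraping configured?"
query_embedding = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=query,
).data[0].embedding

response = supabase.rpc(
    "match_documents",
    {"query_embedding": query_embedding, "match_count": 5},
).execute()

for row in response.data:
    print(row)  # matched chunks, ready to drop into a RAG prompt
```

The returned rows can then be concatenated into an LLM prompt's context, which is the usual RAG pattern this project targets.
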
## Contributing

We welcome your contributions!
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Submit a pull request

## License
This project is licensed under the MIT License.
## Contact
For questions or suggestions, contact **Edini Amare** at [[email protected]] or visit [www.edini.dev].