{"id":22878552,"url":"https://github.com/sandesh300/data-processing-pipeline-api","last_synced_at":"2026-05-10T06:54:10.076Z","repository":{"id":265383500,"uuid":"895571406","full_name":"sandesh300/Data-Processing-Pipeline-API","owner":"sandesh300","description":"Integrate Gemini API and build a custom data processing pipeline","archived":false,"fork":false,"pushed_at":"2024-11-29T08:42:24.000Z","size":981,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-06T20:39:31.404Z","etag":null,"topics":["fastapi","gemini-api","llms","postgresql","python","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sandesh300.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-28T13:08:12.000Z","updated_at":"2024-11-30T03:00:04.000Z","dependencies_parsed_at":"2025-02-06T20:35:38.544Z","dependency_job_id":"46dbbf07-4878-4a21-96f1-28f3913fa72c","html_url":"https://github.com/sandesh300/Data-Processing-Pipeline-API","commit_stats":null,"previous_names":["sandesh300/data-processing-pipeline-api"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sandesh300%2FData-Processing-Pipeline-API","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sandesh300%2FData-Processing-Pipeline-API/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sandesh300%2FData-Processing-Pipeline-API/rele
ases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sandesh300%2FData-Processing-Pipeline-API/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sandesh300","download_url":"https://codeload.github.com/sandesh300/Data-Processing-Pipeline-API/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246482174,"owners_count":20784652,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fastapi","gemini-api","llms","postgresql","python","transformers"],"created_at":"2024-12-13T16:29:45.330Z","updated_at":"2026-05-10T06:54:05.030Z","avatar_url":"https://github.com/sandesh300.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Processing Pipeline with APIs\n\n## Objective\nIntegrate Gemini API and build a custom data processing pipeline.\n\n## Requirements\n\n### Setup the Pipeline:\n- Use Gemini API to process text data.\n- The input is raw, unstructured text, and the output should be structured JSON with specified fields.\n\n### Prompt Engineering:\n- Design a well-defined prompt for the API to ensure accurate outputs.\n- Use Pydantic models to validate the API response structure and ensure data integrity.\n\n### Local Model Integration:\n- Set up a locally hosted LLM (e.g., LLaMA or similar) and process the same data pipeline using it instead of the external API.\n- Compare the outputs of the external API and the locally hosted model.\n\n### Error Handling:\n- Handle API failures, rate limits, and invalid responses gracefully 
without relying on built-in batch-processing commands.\n\n## Deliverable\n- Pipeline code with prompt engineering and validation scripts.\n- Comparison report of outputs from external and local models.\n\n## Project Structure\n\n```\nData-Processing-Pipeline-API/\n│\n├── main.py               # Main application file with FastAPI endpoints\n├── models.py             # Database models and operations using SQLAlchemy\n├── local_llm.py          # Local model processing using the transformers library\n├── download_model.py     # Script to download and save the local model\n├── requirements.txt      # Python dependencies\n└── README.md             # Project documentation\n```\n\n## Setup\n\n### Prerequisites\n- Python 3.8+\n- PostgreSQL\n- FastAPI\n- SQLAlchemy\n- Transformers library\n\n### requirements.txt\n```\nfastapi\nuvicorn\npydantic\nrequests\nsqlalchemy\ntransformers\npsycopg2-binary\n```\n\n### Installation\n\n1. **Clone the repository:**\n   ```sh\n   git clone https://github.com/sandesh300/Data-Processing-Pipeline-API.git\n   cd Data-Processing-Pipeline-API\n   ```\n\n2. **Install dependencies:**\n   ```sh\n   pip install -r requirements.txt\n   ```\n\n3. **Set up the database:**\n   - Update the `DATABASE_URL` in `models.py` with your PostgreSQL credentials.\n   - Run the following command to create the database tables. The script drops and recreates all tables and inserts a test row, so any existing data is lost:\n     ```sh\n     python models.py\n     ```\n\n4. **Download and save the local model:**\n   ```sh\n   python download_model.py\n   ```\n\n### Running the Application\n\n1. **Start the FastAPI server:**\n   ```sh\n   uvicorn main:app --reload\n   ```\n\n2. **Access the API:**\n   - The API will be available at `http://localhost:8000`.\n\n### FastAPI-doc\n![Screenshot (61)](https://github.com/user-attachments/assets/c1ba2483-bd53-4054-9507-f213cfa27576)\n\n## API Endpoints\n\n### 1. 
Process Text\n- **Endpoint:** `POST /process-text/`\n- **URL**: `http://localhost:8000/process-text/`\n- **Request:**\n  ```json\n  {\n      \"text\": \"You are a helpful AI assistant. Here's a user question: How do gaming phones differ from regular smartphones? Please provide a clear, structured response focusing on key features.\"\n  }\n  ```\n- **Response: 200 OK**\n  ```json\n  {\n      \"field1\": \"**Key Differences Between Gaming Phones and Regular Smartphones**\\n\\n**1. Processing Power and Graphic\",\n      \"field2\": \"s:**\\n* Gaming phones have considerably more powerful processors (e.g., Snapdragon 8 Gen 2) and dedic\",\n      \"prompt_used\": \"You are a helpful AI assistant. Question: You are a helpful AI assistant. Here's a user question: How do gaming phones differ from regular smartphones? Please provide a clear, structured response focusing on key features.\\nPlease provide a clear, structured response.\",\n      \"model_name\": \"gemini-pro\"\n  }\n  ```\n\n### 2. Process Text Local\n- **Endpoint:** `POST /process-text-local/`\n- **URL**: `http://localhost:8000/process-text-local/`\n- **Request:**\n  ```json\n  {\n      \"text\": \"Act as an AI expert. Given the question: How do gaming phones differ from regular smartphones? List the main gaming-specific features and hardware differences.\"\n  }\n  ```\n- **Response: 200 OK**\n  ```json\n  {\n      \"field1\": \"{'field1': 'Act as an AI expert. Question: Act as an AI expert. Given the question: How do gaming ph\",\n      \"field2\": \"ones differ ', 'field2': 'from regular smartphones? List the main gaming-specific features and hardw\",\n      \"prompt_used\": \"Act as an AI expert. Question: Act as an AI expert. Given the question: How do gaming phones differ from regular smartphones? List the main gaming-specific features and hardware differences.\\nProvide a detailed analysis.\",\n      \"model_name\": \"local-llm\"\n  }\n  ```\n\n### 3. 
Compare Outputs\n- **Endpoint:** `POST /compare-outputs/`\n- **URL**: `http://localhost:8000/compare-outputs/`\n- **Request:**\n  ```json\n  {\n      \"text\": \"Compare and analyze: How do gaming phones differ from regular smartphones? Provide specific features and capabilities.\"\n  }\n  ```\n- **Response: 200 OK**\n  ```json\n  {\n      \"field1_match\": false,\n      \"field2_match\": false,\n      \"api_response\": {\n          \"field1\": \"**Comparison and Analysis of Gaming Phones vs. Regular Smartphones**\\n\\n**Features and Capabilities**\\n\",\n          \"field2\": \"\\n| **Feature** | **Gaming Phone** | **Regular Smartphone** |\\n|---|---|---|\\n**Display** | Larger, hig\",\n          \"prompt_used\": \"You are a helpful AI assistant. Question: Compare and analyze: How do gaming phones differ from regular smartphones? Provide specific features and capabilities.\\nPlease provide a clear, structured response.\",\n          \"model_name\": \"gemini-pro\"\n      },\n      \"local_response\": {\n          \"field1\": \"{'field1': 'Act as an AI expert. Question: Compare and analyze: How do gaming phones differ from reg\",\n          \"field2\": \"ular smartph', 'field2': 'ones? Provide specific features and capabilities.\\\\nProvide a detailed anal\",\n          \"prompt_used\": \"Act as an AI expert. Question: Compare and analyze: How do gaming phones differ from regular smartphones? Provide specific features and capabilities.\\nProvide a detailed analysis.\",\n          \"model_name\": \"local-llm\"\n      },\n      \"similarity_score\": 0.09090909090909091,\n      \"details\": {\n          \"prompts_used\": {\n              \"api\": \"You are a helpful AI assistant. Question: Compare and analyze: How do gaming phones differ from regular smartphones? Provide specific features and capabilities.\\nPlease provide a clear, structured response.\",\n              \"local\": \"Act as an AI expert. 
Question: Compare and analyze: How do gaming phones differ from regular smartphones? Provide specific features and capabilities.\\nProvide a detailed analysis.\"\n          },\n          \"response_lengths\": {\n              \"api\": 200,\n              \"local\": 200\n          }\n      },\n      \"context_analysis\": null\n  }\n  ```\n\n## Comparison Report\n\nThe `/compare-outputs/` response serves as the comparison report for the outputs of the external API (Gemini) and the local model. It includes:\n- Field matches (`field1_match`, `field2_match`), which are exact string comparisons\n- API and local model responses\n- Similarity score between the responses, computed as the Jaccard overlap of their lowercased word sets\n- Details of the prompts used and response lengths\n\n## Error Handling\n\nEach endpoint wraps its work in a `try`/`except` block, so API failures, rate-limit responses, and invalid responses are all caught and returned as an HTTP 500 error whose `detail` field describes the cause.\n\n## Pipeline Code with Prompt Engineering and Validation Scripts\n\nBelow is the complete pipeline code with prompt engineering and validation scripts. 
This code includes the main application file (`main.py`), database models and operations (`models.py`), local model processing (`local_llm.py`), and a script to download and save the local model (`download_model.py`).\n\n### `main.py`\n\n```python\nfrom fastapi import FastAPI, HTTPException, Depends\nfrom pydantic import BaseModel\nimport requests\nimport json\nfrom typing import Dict, Optional, Any\nfrom sqlalchemy.orm import Session\nfrom models import SessionLocal, ProcessedText\nfrom local_llm import process_text_local\n\napp = FastAPI()\n\n# Placeholder: supply your own Gemini API key and never commit real credentials\nAPI_KEY = \"YOUR_GEMINI_API_KEY\"\nAPI_URL = f\"https://generativelanguage.googleapis.com/v1/models/gemini-pro:generateContent?key={API_KEY}\"\n\n# Dependency\ndef get_db():\n    db = SessionLocal()\n    try:\n        yield db\n    finally:\n        db.close()\n\n# Pydantic models\nclass TextInput(BaseModel):\n    text: str\n    context: Optional[str] = None\n    style: Optional[str] = None\n\nclass TextOutput(BaseModel):\n    field1: str\n    field2: str\n    prompt_used: str\n    model_name: str\n\n    class Config:\n        orm_mode = True\n\nclass ComparisonOutput(BaseModel):\n    field1_match: bool\n    field2_match: bool\n    api_response: TextOutput\n    local_response: TextOutput\n    similarity_score: float\n    details: Dict[str, Any]\n    context_analysis: Optional[Dict[str, Any]] = None\n\n    class Config:\n        orm_mode = True\n\ndef build_gemini_prompt(text: str, context: Optional[str] = None, style: Optional[str] = None) -\u003e str:\n    \"\"\"Build a structured prompt for Gemini API\"\"\"\n    base_prompt = \"You are a helpful AI assistant. \"\n\n    if style:\n        base_prompt += f\"Please respond in a {style} style. 
\"\n\n    if context:\n        base_prompt += f\"Context: {context}\\n\"\n\n    base_prompt += f\"Question: {text}\\n\"\n    base_prompt += \"Please provide a clear, structured response.\"\n\n    return base_prompt\n\ndef build_local_prompt(text: str, context: Optional[str] = None, style: Optional[str] = None) -\u003e str:\n    \"\"\"Build a structured prompt for local LLM\"\"\"\n    base_prompt = \"Act as an AI expert. \"\n\n    if style:\n        base_prompt += f\"Respond in {style} style. \"\n\n    if context:\n        base_prompt += f\"Given this context: {context}\\n\"\n\n    base_prompt += f\"Question: {text}\\n\"\n    base_prompt += \"Provide a detailed analysis.\"\n\n    return base_prompt\n\ndef calculate_similarity_score(text1: str, text2: str) -\u003e float:\n    \"\"\"Calculate similarity between two text strings\"\"\"\n    if not text1 or not text2:\n        return 0.0\n\n    words1 = set(text1.lower().split())\n    words2 = set(text2.lower().split())\n\n    intersection = len(words1.intersection(words2))\n    union = len(words1.union(words2))\n\n    return intersection / union if union \u003e 0 else 0.0\n\n@app.post(\"/process-text/\", response_model=TextOutput)\nasync def process_text(input_data: TextInput, db: Session = Depends(get_db)):\n    try:\n        prompt = build_gemini_prompt(input_data.text, input_data.context, input_data.style)\n\n        headers = {\"Content-Type\": \"application/json\"}\n        payload = {\n            \"contents\": [{\n                \"parts\": [{\n                    \"text\": prompt\n                }]\n            }]\n        }\n\n        response = requests.post(API_URL, headers=headers, json=payload)\n        response.raise_for_status()\n\n        api_response = response.json()\n        generated_text = api_response.get(\"candidates\", [{}])[0].get(\"content\", {}).get(\"parts\", [{}])[0].get(\"text\", \"\")\n\n        output_data = TextOutput(\n            field1=generated_text[:100],\n            
field2=generated_text[100:200] if len(generated_text) \u003e 100 else \"\",\n            prompt_used=prompt,\n            model_name=\"gemini-pro\"\n        )\n\n        db_text = ProcessedText(\n            original_text=input_data.text,\n            field1=output_data.field1,\n            field2=output_data.field2,\n            prompt_used=prompt\n        )\n        db.add(db_text)\n        db.commit()\n\n        return output_data\n\n    except Exception as e:\n        raise HTTPException(\n            status_code=500,\n            detail=f\"API processing failed: {str(e)}\"\n        )\n\n@app.post(\"/process-text-local/\", response_model=TextOutput)\nasync def process_text_local_endpoint(input_data: TextInput, db: Session = Depends(get_db)):\n    try:\n        prompt = build_local_prompt(input_data.text, input_data.context, input_data.style)\n        local_response = process_text_local(prompt)\n\n        if not isinstance(local_response, str):\n            generated_text = str(local_response)\n        else:\n            generated_text = local_response\n\n        output_data = TextOutput(\n            field1=generated_text[:100] if generated_text else \"\",\n            field2=generated_text[100:200] if len(generated_text) \u003e 100 else \"\",\n            prompt_used=prompt,\n            model_name=\"local-llm\"\n        )\n\n        db_text = ProcessedText(\n            original_text=input_data.text,\n            field1=output_data.field1,\n            field2=output_data.field2,\n            prompt_used=prompt\n        )\n        db.add(db_text)\n        db.commit()\n\n        return output_data\n\n    except Exception as e:\n        raise HTTPException(\n            status_code=500,\n            detail=f\"Local processing failed: {str(e)}\"\n        )\n\n@app.post(\"/compare-outputs/\", response_model=ComparisonOutput)\nasync def compare_outputs_endpoint(input_data: TextInput, db: Session = Depends(get_db)):\n    try:\n        api_response = await 
process_text(input_data, db)\n        local_response = await process_text_local_endpoint(input_data, db)\n\n        comparison_report = ComparisonOutput(\n            field1_match=api_response.field1 == local_response.field1,\n            field2_match=api_response.field2 == local_response.field2,\n            api_response=api_response,\n            local_response=local_response,\n            similarity_score=calculate_similarity_score(\n                api_response.field1 + api_response.field2,\n                local_response.field1 + local_response.field2\n            ),\n            details={\n                \"prompts_used\": {\n                    \"api\": api_response.prompt_used,\n                    \"local\": local_response.prompt_used\n                },\n                \"response_lengths\": {\n                    \"api\": len(api_response.field1) + len(api_response.field2),\n                    \"local\": len(local_response.field1) + len(local_response.field2)\n                }\n            }\n        )\n\n        return comparison_report\n\n    except Exception as e:\n        raise HTTPException(\n            status_code=500,\n            detail={\n                \"error\": str(e),\n                \"message\": \"Comparison failed\",\n                \"input_text\": input_data.text\n            }\n        )\n\nif __name__ == \"__main__\":\n    import uvicorn\n    uvicorn.run(app, host=\"0.0.0.0\", port=8000)\n```\n\n### `models.py`\n\n```python\nfrom sqlalchemy import create_engine, Column, Integer, String, Text\nfrom sqlalchemy.ext.declarative import declarative_base\nfrom sqlalchemy.orm import sessionmaker, Session\nfrom contextlib import contextmanager\nimport logging\n\n# Configure logging\nlogging.basicConfig(level=logging.INFO)\nlogger = logging.getLogger(__name__)\n\n# Database configuration\nDATABASE_URL = \"postgresql://pipeline_user:1234@localhost/data_pipeline_db\"\n\n# Create SQLAlchemy engine\nengine = create_engine(DATABASE_URL, 
echo=True)\nSessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)\nBase = declarative_base()\n\nclass ProcessedText(Base):\n    __tablename__ = \"processed_texts\"\n\n    id = Column(Integer, primary_key=True, index=True)\n    original_text = Column(Text, index=True, nullable=False)\n    field1 = Column(Text)\n    field2 = Column(Text)\n    prompt_used = Column(Text, nullable=True)\n\n    def __repr__(self):\n        return f\"\u003cProcessedText(id={self.id}, original_text={self.original_text[:30]}...)\u003e\"\n\n@contextmanager\ndef get_db():\n    \"\"\"Provide a transactional scope around a series of operations.\"\"\"\n    db = SessionLocal()\n    try:\n        yield db\n        db.commit()\n    except Exception as e:\n        db.rollback()\n        logger.error(f\"Database error: {str(e)}\")\n        raise\n    finally:\n        db.close()\n\ndef create_tables():\n    \"\"\"Create all tables in the database.\"\"\"\n    try:\n        Base.metadata.create_all(bind=engine)\n        logger.info(\"Database tables created successfully\")\n    except Exception as e:\n        logger.error(f\"Error creating tables: {str(e)}\")\n        raise\n\ndef drop_tables():\n    \"\"\"Drop all tables in the database.\"\"\"\n    try:\n        Base.metadata.drop_all(bind=engine)\n        logger.info(\"Database tables dropped successfully\")\n    except Exception as e:\n        logger.error(f\"Error dropping tables: {str(e)}\")\n        raise\n\ndef recreate_tables():\n    \"\"\"Drop and recreate all tables.\"\"\"\n    drop_tables()\n    create_tables()\n\n# Database operations\ndef insert_processed_text(db: Session, original_text: str, field1: str, field2: str, prompt_used: str = None):\n    \"\"\"Insert a new processed text entry.\"\"\"\n    db_text = ProcessedText(\n        original_text=original_text,\n        field1=field1,\n        field2=field2,\n        prompt_used=prompt_used\n    )\n    db.add(db_text)\n    return db_text\n\nif __name__ == 
\"__main__\":\n    # Example usage\n    try:\n        logger.info(\"Starting database setup...\")\n\n        # Recreate tables (warning: this will delete existing data)\n        recreate_tables()\n\n        # Example insertion\n        with get_db() as db:\n            test_entry = insert_processed_text(\n                db=db,\n                original_text=\"Test text\",\n                field1=\"Field 1 content\",\n                field2=\"Field 2 content\",\n                prompt_used=\"Test prompt\"\n            )\n            logger.info(f\"Test entry created with ID: {test_entry.id}\")\n\n        logger.info(\"Database setup completed successfully\")\n\n    except Exception as e:\n        logger.error(f\"Setup failed: {str(e)}\")\n        raise\n```\n\n### `local_llm.py`\n\n```python\nfrom transformers import pipeline\n\nlocal_model_path = \"models/local_model\"\nmodel = pipeline(\"text-generation\", model=local_model_path)\n\ndef process_text_local(text: str):\n    response = model(text, max_length=200, num_return_sequences=1)\n    generated_text = response[0]['generated_text']\n\n    return {\n        \"field1\": generated_text[:100],\n        \"field2\": generated_text[100:200] if len(generated_text) \u003e 100 else \"\"\n    }\n```\n\n### `download_model.py`\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\ndef download_and_save_model(model_name: str, local_path: str):\n    # Download the model and tokenizer\n    model = AutoModelForCausalLM.from_pretrained(model_name)\n    tokenizer = AutoTokenizer.from_pretrained(model_name)\n\n    # Save the model and tokenizer locally\n    model.save_pretrained(local_path)\n    tokenizer.save_pretrained(local_path)\n\n    print(f\"Model and tokenizer saved to {local_path}\")\n\nif __name__ == \"__main__\":\n    model_name = \"facebook/opt-125m\"  # Replace with the model you want to use\n    local_path = \"models/local_model\"  # Path to the local directory\n    
download_and_save_model(model_name, local_path)\n```\n\n\n\nThis README provides an overview of the data processing pipeline, including setup instructions, API endpoints, and a comparison report. It ensures that the project meets the objectives and requirements specified.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsandesh300%2Fdata-processing-pipeline-api","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsandesh300%2Fdata-processing-pipeline-api","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsandesh300%2Fdata-processing-pipeline-api/lists"}