{"id":23396030,"url":"https://github.com/paritoshk/bespoke_test","last_synced_at":"2025-04-08T17:27:07.935Z","repository":{"id":268643108,"uuid":"903911137","full_name":"paritoshk/bespoke_test","owner":"paritoshk","description":"Common Crawl and fasttext endpoint ","archived":false,"fork":false,"pushed_at":"2024-12-18T15:59:04.000Z","size":72,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-14T13:49:11.766Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/paritoshk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-15T21:38:57.000Z","updated_at":"2024-12-18T15:59:07.000Z","dependencies_parsed_at":"2024-12-18T03:28:41.712Z","dependency_job_id":"4739c039-992e-44f2-9c0a-47e2eb5d7cd4","html_url":"https://github.com/paritoshk/bespoke_test","commit_stats":null,"previous_names":["paritoshk/bespoke_test"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paritoshk%2Fbespoke_test","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paritoshk%2Fbespoke_test/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paritoshk%2Fbespoke_test/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/paritoshk%2Fbespoke_test/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/paritoshk","download_url":"http
s://codeload.github.com/paritoshk/bespoke_test/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247890926,"owners_count":21013442,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-22T07:29:33.295Z","updated_at":"2025-04-08T17:27:07.907Z","avatar_url":"https://github.com/paritoshk.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# FastText Classification Service\n\nA high-performance, scalable FastText classification service built with FastAPI and Python. This service provides endpoints for training quality classifiers and scoring documents at scale, particularly useful for data pipeline quality filtering.\n\n## Features\n\n- **Efficient Training Pipeline**: Train FastText classifiers using positive examples and automatically sampled negative examples from Common Crawl\n- **High-Performance Scoring**: Score large batches of documents efficiently using trained models\n- **REST API Interface**: Clean API interface with FastAPI, including automatic OpenAPI documentation\n- **Scalable Architecture**: Designed for handling large-scale document processing\n- **Model Persistence**: Trained models are persisted and can be reused across sessions\n\n## Technical Implementation\n\n### Architecture\n\nThe service is structured into three main components:\n\n1. **API Layer** (`app/main.py`):\n   - FastAPI application handling HTTP requests\n   - Input validation using Pydantic models\n   - Error handling and response formatting\n\n2. 
**Service Layer** (`app/services/fasttext_service.py`):\n   - FastText model training and management\n   - Document scoring logic\n   - Model persistence handling\n\n3. **Data Layer** (`app/utils/data_loader.py`):\n   - Common Crawl data sampling\n   - Training data preparation\n\n### API Endpoints\n\n#### POST /train\n- Accepts positive training documents (minimum 20k examples)\n- Automatically samples negative examples from Common Crawl\n- Returns a UUID for the trained model\n\n```python\nResponse:\n{\n    \"model_id\": \"uuid-string\"\n}\n```\n\n#### POST /score\n- Accepts a batch of documents and a model ID\n- Returns classification scores for each document\n\n```python\nRequest:\n{\n    \"model_id\": \"uuid-string\",\n    \"documents\": [\"doc1\", \"doc2\", ...]\n}\n\nResponse:\n{\n    \"scores\": [0.92, 0.45, ...]\n}\n```\n\n## Installation\n\n1. Clone the repository:\n```bash\ngit clone https://github.com/paritoshk/bespoke_test.git\ncd bespoke_test\n```\n\n2. Create and activate a virtual environment:\n```bash\npython -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\n```\n\n3. Install dependencies:\n```bash\npip install -r requirements.txt\n```\n\n## Usage\n\n1. Start the server:\n```bash\nuvicorn app.main:app --reload\n```\n\n2. 
Access the API documentation:\n- Navigate to `http://localhost:8000/docs` for the Swagger UI\n- Navigate to `http://localhost:8000/redoc` for the ReDoc documentation\n\n### Example Usage\n\nTraining a model:\n```python\nimport requests\n\n# Upload the file of positive examples; negatives are sampled server-side.\nwith open('positive_examples.txt', 'rb') as f:\n    response = requests.post(\n        'http://localhost:8000/train',\n        files={'documents': f}\n    )\nresponse.raise_for_status()  # Fail early on a 4xx/5xx response\nmodel_id = response.json()['model_id']\n```\n\nScoring documents:\n```python\nimport requests\n\n# `model_id` is the UUID returned by the /train call above.\nresponse = requests.post(\n    'http://localhost:8000/score',\n    json={\n        'model_id': model_id,\n        'documents': ['document to classify', 'another document']\n    }\n)\nresponse.raise_for_status()\nscores = response.json()['scores']  # One score per input document\n```\n\n## Technical Details\n\n### FastText Configuration\n\nThe FastText model uses the following hyperparameters for document classification:\n- Word n-grams (n=2) for capturing short phrases\n- Learning rate of 0.5\n- 25 training epochs\n- Minimum word count of 1 to handle rare terms\n\n### Scaling Considerations\n\nThe service is designed with scalability in mind:\n- Asynchronous request handling\n- Efficient model loading/unloading\n- Batch processing capabilities\n- Persistent model storage\n\n### Error Handling\n\nError handling covers:\n- Invalid input\n- Model not found scenarios\n- Training data that does not meet the requirements\n- Server-side processing errors\n\n## Testing\n\nRun the test suite:\n```bash\npytest tests/\n```\n\n## Analyzing Results\n\nTo analyze the training results, run:\n```bash\npython scripts/analyze_results.py\n```\n\n## Bugs\n\n### Technical Challenges Addressed:\n\n- NumPy Compatibility Issue: Fixed incompatibility with newer NumPy versions by patching FastText's predict method\n- Data Processing: Built robust data loading from positive/negative examples\n- Model Persistence: Implemented proper model saving and loading\n- Async Support: Built async-compatible API 
endpoints\n\n### Architecture Decisions:\n\n- Service-based design separating concerns\n- Comprehensive logging and monitoring\n- Proper error handling and validation\n- Clean separation of training and inference\n\n### Implementation Details:\n\n- Used FastText for efficient text classification\n- Built binary classifier (positive/negative)\n- Implemented model versioning with UUIDs\n- Added performance monitoring and visualization\n\n### To explain the approach:\n\nData Pipeline:\n\n- Organized training data into positive/negative examples\n- Built efficient data loading mechanism\n- Implemented proper text cleaning and normalization\n\n### Model Training:\n\n- Used FastText for efficient text classification\n- Implemented proper hyperparameter configuration\n- Added comprehensive logging and monitoring\n- Built model persistence with versioning\n\n### Inference:\n\n- Efficient document scoring\n- Proper error handling\n- Async support for scalability\n- Clean API interface\n\n### Quality Assurance:\n\n- Comprehensive test suite\n- Performance monitoring\n- Error handling\n- Data validation\n\n## Future Improvements\n\n- Add model versioning\n- Implement distributed training\n- Add model performance metrics\n- Add data preprocessing pipeline\n- Implement model caching strategy\n- Add support for custom negative examples\n\n## License\n\nMIT License - See LICENSE file for details\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparitoshk%2Fbespoke_test","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fparitoshk%2Fbespoke_test","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparitoshk%2Fbespoke_test/lists"}