{"id":37079513,"url":"https://github.com/luiscarbonel1991/nlp2sql","last_synced_at":"2026-04-03T00:14:52.734Z","repository":{"id":306314430,"uuid":"1020453623","full_name":"luiscarbonel1991/nlp2sql","owner":"luiscarbonel1991","description":"Enterprise-ready Natural Language to SQL converter with multi-provider AI support (OpenAI, Anthropic, Gemini). Built for production scale databases (1000+ tables) with Clean Architecture.","archived":false,"fork":false,"pushed_at":"2025-08-27T04:05:36.000Z","size":762,"stargazers_count":2,"open_issues_count":5,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-10-08T23:24:55.944Z","etag":null,"topics":["ai","anthropic","clean-architecture","database","gemini","llms","mcp-server","natural-language","npl","openai","query-generation","sql"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/luiscarbonel1991.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-15T22:36:02.000Z","updated_at":"2025-08-27T04:05:39.000Z","dependencies_parsed_at":"2025-08-05T05:14:46.984Z","dependency_job_id":null,"html_url":"https://github.com/luiscarbonel1991/nlp2sql","commit_stats":null,"previous_names":["luiscarbonel1991/nlp2sql"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/luiscarbonel1991/nlp2sql","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luiscarbonel1991%2Fnlp2sql","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luiscarbonel1991%2Fnlp2sql/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luiscarbonel1991%2Fnlp2sql/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luiscarbonel1991%2Fnlp2sql/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/luiscarbonel1991","download_url":"https://codeload.github.com/luiscarbonel1991/nlp2sql/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luiscarbonel1991%2Fnlp2sql/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28416120,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T08:38:59.149Z","status":"ssl_error","status_checked_at":"2026-01-14T08:38:43.588Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","anthropic","clean-architecture","database","gemini","llms","mcp-server","natural-language","npl","openai","query-generation","sql"],"created_at":"2026-01-14T09:37:20.947Z","updated_at":"2026-04-03T00:14:52.721Z","avatar_url":"https://github.com/luiscarbonel1991.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/nlp2sql-logo.png\" alt=\"nlp2sql logo\" width=\"400\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://pepy.tech/projects/nlp2sql\"\u003e\u003cimg src=\"https://static.pepy.tech/personalized-badge/nlp2sql?period=total\u0026units=INTERNATIONAL_SYSTEM\u0026left_color=BLACK\u0026right_color=GREEN\u0026left_text=downloads\" alt=\"PyPI Downloads\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://opensource.org/licenses/MIT\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-MIT-yellow.svg\" alt=\"License: MIT\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.python.org/downloads/\"\u003e\u003cimg src=\"https://img.shields.io/badge/python-3.9+-blue.svg\" alt=\"Python 3.9+\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/psf/black\"\u003e\u003cimg src=\"https://img.shields.io/badge/code%20style-black-000000.svg\" alt=\"Code style: black\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n# nlp2sql \n\n**Enterprise-ready Natural Language to SQL converter with multi-provider support**\n\nConvert natural language queries to optimized SQL using multiple AI providers. Built with Clean Architecture principles for enterprise-scale applications handling 1000+ table databases.\n\n## Features\n\n- **Multiple AI Providers**: OpenAI, Anthropic Claude, Google Gemini - no vendor lock-in\n- **Database Support**: PostgreSQL, Amazon Redshift\n- **Large Schema Handling**: Vector embeddings and intelligent filtering for 1000+ tables\n- **Smart Caching**: Query and schema embedding caching for improved performance\n- **Async Support**: Full async/await support\n- **Clean Architecture**: Ports \u0026 Adapters pattern for maintainability\n\n## Documentation\n\n| Document | Description |\n|----------|-------------|\n| [Architecture](docs/ARCHITECTURE.md) | Component diagram and data flow |\n| [API Reference](docs/API.md) | Python API and CLI command reference |\n| [Configuration](docs/CONFIGURATION.md) | Environment variables and schema filters |\n| [Enterprise Guide](docs/ENTERPRISE.md) | Large-scale deployment and migration |\n| [Redshift Support](docs/Redshift.md) | Amazon Redshift setup and examples |\n| [Contributing](CONTRIBUTING.md) | Contribution guidelines |\n\n## Installation\n\n```bash\n# With UV (recommended)\nuv add nlp2sql\n\n# With pip\npip install nlp2sql\n\n# With specific providers\npip install nlp2sql[anthropic,gemini]\npip install nlp2sql[all-providers]\n\n# With embeddings\npip install nlp2sql[embeddings-local]   # Local embeddings (free)\npip install nlp2sql[embeddings-openai]  # OpenAI embeddings\n```\n\n## Quick Start\n\n### 1. Set an API Key\n\n```bash\nexport OPENAI_API_KEY=\"your-openai-key\"\n# or ANTHROPIC_API_KEY, GOOGLE_API_KEY\n```\n\n### 2. Connect and Ask\n\n```python\nimport asyncio\nimport nlp2sql\nfrom nlp2sql import ProviderConfig\n\nasync def main():\n    nlp = await nlp2sql.connect(\n        \"postgresql://user:pass@localhost:5432/mydb\",\n        provider=ProviderConfig(provider=\"openai\", api_key=\"sk-...\"),\n    )\n\n    result = await nlp.ask(\"Show me all active users\")\n    print(result.sql)\n    print(result.confidence)\n    print(result.is_valid)\n\nasyncio.run(main())\n```\n\n`connect()` auto-detects the database type from the URL, loads the schema, and builds the FAISS embedding index. Subsequent `ask()` calls reuse everything from disk cache.\n\n### 3. Few-Shot Examples\n\nPass a list of dicts -- `connect()` handles embedding and indexing automatically:\n\n```python\nnlp = await nlp2sql.connect(\n    \"redshift://user:pass@host:5439/db\",\n    provider=ProviderConfig(provider=\"openai\", api_key=\"sk-...\"),\n    schema=\"dwh_data_share_llm\",\n    examples=[\n        {\n            \"question\": \"Total revenue last month?\",\n            \"sql\": \"SELECT SUM(revenue) FROM sales WHERE date \u003e= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')\",\n            \"database_type\": \"redshift\",\n        },\n    ],\n)\n\nresult = await nlp.ask(\"Show me total sales this quarter\")\n```\n\n### 4. Schema Filtering (Large Databases)\n\n```python\nnlp = await nlp2sql.connect(\n    \"postgresql://localhost/enterprise\",\n    provider=ProviderConfig(provider=\"anthropic\", api_key=\"sk-ant-...\"),\n    schema_filters={\n        \"include_schemas\": [\"sales\", \"finance\"],\n        \"exclude_system_tables\": True,\n    },\n)\n```\n\n### 5. Custom Model and Temperature\n\n```python\nnlp = await nlp2sql.connect(\n    \"postgresql://localhost/mydb\",\n    provider=ProviderConfig(\n        provider=\"openai\",\n        api_key=\"sk-...\",\n        model=\"gpt-4o\",\n        temperature=0.0,\n        max_tokens=4000,\n    ),\n)\n```\n\n### 6. CLI\n\n```bash\nnlp2sql query \\\n  --database-url postgresql://user:pass@localhost:5432/mydb \\\n  --question \"Show all active users\" \\\n  --explain\n\nnlp2sql inspect --database-url postgresql://localhost/mydb\n```\n\n### Advanced: Direct Service Access\n\nFor full control over the lifecycle, the lower-level API is still available:\n\n```python\nfrom nlp2sql import create_and_initialize_service, ProviderConfig, DatabaseType\n\nservice = await create_and_initialize_service(\n    database_url=\"postgresql://localhost/mydb\",\n    provider_config=ProviderConfig(provider=\"openai\", api_key=\"sk-...\"),\n    database_type=DatabaseType.POSTGRES,\n)\nresult = await service.generate_sql(\"Count total users\", database_type=DatabaseType.POSTGRES)\n```\n\n## How It Works\n\n```\nQuestion ──► Cache check ──► Schema retrieval ──► Relevance filtering ──► Context building ──► AI generation ──► Validation\n                                    │                     │                      │\n                              SchemaRepository    FAISS + TF-IDF hybrid   Reuses precomputed\n                              (+ disk cache)      + batch scoring          relevance scores\n```\n\n1. **Schema retrieval** -- Fetches tables from database via `SchemaRepository` (with disk cache for Redshift)\n2. **Relevance filtering** -- FAISS dense search + TF-IDF sparse search (50/50 hybrid) finds candidate tables; batch scoring refines with precomputed embeddings\n3. **Context building** -- Builds optimized schema context within token limits, reusing scores from step 2 (zero additional embedding calls)\n4. **SQL generation** -- AI provider (OpenAI, Anthropic, or Gemini) generates SQL from question + schema context\n5. **Validation** -- SQL syntax and safety checks before returning results\n\nSee [Architecture](docs/ARCHITECTURE.md) for the detailed flow with method references and design decisions.\n\n## Provider Comparison\n\n| Provider | Default Model | Context Size | Best For |\n|----------|--------------|-------------|----------|\n| OpenAI | gpt-4o-mini | 128K | Cost-effective, fast |\n| Anthropic | claude-sonnet-4-20250514 | 200K | Large schemas |\n| Google Gemini | gemini-2.0-flash | 1M | High volume |\n\nAll models are configurable via `ProviderConfig(model=\"...\")`. See [Configuration](docs/CONFIGURATION.md) for details.\n\n## Architecture\n\nClean Architecture (Ports \u0026 Adapters) with three layers: core entities, port interfaces, and adapter implementations. The schema management layer uses FAISS + TF-IDF hybrid search for relevance filtering at scale.\n\n```\nnlp2sql/\n├── client.py       # DSL: connect() + NLP2SQL class (recommended entry point)\n├── core/           # Pure Python: entities, ProviderConfig, QueryResult, sql_safety, sql_keywords\n├── ports/          # Interfaces: AIProviderPort, SchemaRepositoryPort, EmbeddingProviderPort,\n│                   #   ExampleRepositoryPort, QuerySafetyPort, QueryValidatorPort, CachePort\n├── adapters/       # Implementations: OpenAI, Anthropic, Gemini, PostgreSQL, Redshift,\n│                   #   RegexQueryValidator\n├── services/       # Orchestration: QueryGenerationService\n├── schema/         # Schema management: SchemaManager, SchemaAnalyzer, SchemaEmbeddingManager,\n│                   #   ExampleStore\n├── config/         # Pydantic Settings (centralized defaults)\n└── exceptions/     # Exception hierarchy (NLP2SQLException -\u003e 8 subclasses)\n```\n\nSee [Architecture](docs/ARCHITECTURE.md) for the full component diagram, data flow, and design decisions.\n\n## Development\n\n```bash\n# Clone and install\ngit clone https://github.com/luiscarbonel1991/nlp2sql.git\ncd nlp2sql\nuv sync\n\n# Start test databases\ncd docker \u0026\u0026 docker-compose up -d\n\n# Run tests\nuv run pytest\n\n# Code quality\nuv run ruff format .\nuv run ruff check .\nuv run mypy src/\n```\n\n## MCP Server\n\nnlp2sql includes a Model Context Protocol server for AI assistant integration.\n\n```json\n{\n  \"mcpServers\": {\n    \"nlp2sql\": {\n      \"command\": \"python\",\n      \"args\": [\"/path/to/nlp2sql/mcp_server/server.py\"],\n      \"env\": {\n        \"OPENAI_API_KEY\": \"${OPENAI_API_KEY}\",\n        \"NLP2SQL_DEFAULT_DB_URL\": \"postgresql://user:pass@localhost:5432/mydb\"\n      }\n    }\n  }\n}\n```\n\nTools: `ask_database`, `explore_schema`, `run_sql`, `list_databases`, `explain_sql`\n\nSee [mcp_server/README.md](mcp_server/README.md) for complete setup.\n\n## Contributing\n\nWe welcome contributions. See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.\n\n## License\n\nMIT License - see [LICENSE](LICENSE).\n\n## Author\n\n**Luis Carbonel** - [@luiscarbonel1991](https://github.com/luiscarbonel1991)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluiscarbonel1991%2Fnlp2sql","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fluiscarbonel1991%2Fnlp2sql","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluiscarbonel1991%2Fnlp2sql/lists"}