{"id":43906252,"url":"https://github.com/vtsaplin/datatalk-cli","last_synced_at":"2026-03-04T18:02:50.914Z","repository":{"id":312774282,"uuid":"1048716456","full_name":"vtsaplin/datatalk-cli","owner":"vtsaplin","description":"Query CSV, Excel \u0026 Parquet files with natural language. Fast, local, DuckDB-powered.","archived":false,"fork":false,"pushed_at":"2025-11-30T00:07:13.000Z","size":13507,"stargazers_count":10,"open_issues_count":1,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-13T19:28:38.826Z","etag":null,"topics":["ai-tools","cli","csv","data-analysis","data-engineering","developer-tools","duckdb","excel","gpt","llm","local-first","natural-language-query","openai","parquet","privacy","python","text-to-sql"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vtsaplin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-01T23:09:29.000Z","updated_at":"2026-01-11T15:37:52.000Z","dependencies_parsed_at":"2025-09-01T23:48:04.060Z","dependency_job_id":"d027cb8b-bfdd-47e0-ac4e-fe9d702e6c17","html_url":"https://github.com/vtsaplin/datatalk-cli","commit_stats":null,"previous_names":["vtsaplin/datatalk","vtsaplin/datatalk-cli"],"tags_count":28,"template":false,"template_full_name":null,"purl":"pkg:github/vtsaplin/datatalk-cli","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vtsaplin%2Fdatatalk
-cli","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vtsaplin%2Fdatatalk-cli/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vtsaplin%2Fdatatalk-cli/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vtsaplin%2Fdatatalk-cli/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vtsaplin","download_url":"https://codeload.github.com/vtsaplin/datatalk-cli/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vtsaplin%2Fdatatalk-cli/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29580622,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-18T08:38:15.585Z","status":"ssl_error","status_checked_at":"2026-02-18T08:38:14.917Z","response_time":162,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-tools","cli","csv","data-analysis","data-engineering","developer-tools","duckdb","excel","gpt","llm","local-first","natural-language-query","openai","parquet","privacy","python","text-to-sql"],"created_at":"2026-02-06T20:00:20.121Z","updated_at":"2026-02-18T13:00:37.110Z","avatar_url":"https://github.com/vtsaplin.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# DataTalk CLI\n\n[![PyPI 
version](https://badge.fury.io/py/datatalk-cli.svg)](https://badge.fury.io/py/datatalk-cli)\n[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n## Chat with your data in plain English. Right from your terminal.\n\n\u003e A **natural language interface** for your **CSV**, **Excel (.xlsx)**, and **Parquet** files. **Fast**, **local**, and **private**.\n\nSkip SQL and complex syntax. Just ask **“What are the top 5 products?”**\u003cbr\u003e\nGet instant answers from your **local data**.\n\n**Privacy First:** Your data never leaves your machine.  \n**Formats:** CSV, Excel (.xlsx), Parquet  \n**Performance:** Local analytics engine for **instant results**.\n\n![Demo](docs/demo.gif)\n\n**⭐ If you find this useful, please star the repo. It helps a lot!**\n\n## Why DataTalk?\n\n**The Problem:** You have a CSV file and a simple question. What do you do?\n- Open Excel? Slow for large files, and you have to leave the terminal\n- Use command-line tools (awk, csvkit)? Need to remember complex flags and syntax\n- Write SQL? 
Overkill for \"show me the top 5 products\"\n\n**The Solution:** Just ask your question naturally.\n\n```bash\ndtalk sales.csv\n\u003e What are the top 5 products by revenue?\n\u003e Show me sales by region for Q4\n\u003e Which customers made orders over $1000?\n```\n\n## Features\n\n- **Natural Language** - Ask questions in plain English, no SQL required\n- **Interactive Mode** - Ask multiple questions with ↑↓ history\n- **100% Local Processing** - Data never leaves your machine, only schema is sent to LLM\n- **100% Offline Option** - Use local Ollama models for complete offline operation, no internet required\n- **Fast** - DuckDB processes gigabytes locally in seconds\n- **100+ LLM Models** - Powered by [LiteLLM](https://docs.litellm.ai) - OpenAI, Anthropic, Google, Ollama (local), and more\n- **Multiple File Formats** - Supports CSV, Excel (.xlsx, .xls), and Parquet files\n- **Scriptable** - JSON and CSV output formats for automation and pipelines\n- **Simple Configuration** - Just set `LLM_MODEL` and API key environment variables\n- **Transparent** - SQL queries shown by default, use `--no-sql` to hide\n\n## Installation\n\n```bash\npip install datatalk-cli\n```\n\n**Requirements:** Python 3.9+ and either an API key for cloud models (OpenAI, Anthropic, etc.) 
OR local Ollama for offline use\n\n## Quick Start\n\n```bash\n# Option 1: Use cloud models (OpenAI, Anthropic, Google, etc.)\nexport LLM_MODEL=\"gpt-4o\"\nexport OPENAI_API_KEY=\"your-key-here\"\n\n# Option 2: Use local Ollama (100% offline, fully private, no API key needed!)\nexport LLM_MODEL=\"ollama/llama3.1\"\n# No API key needed - works completely offline!\n\n# Start interactive mode - ask multiple questions\ndtalk sales_data.csv\n\n# You'll get a prompt where you can ask questions naturally:\n# \u003e What are the top 5 products by revenue?\n# \u003e Show me monthly sales trends\n# \u003e Which customers made purchases over $1000?\n\n# Or use single query mode for quick answers\ndtalk sales_data.csv -p \"What are the top 5 products by revenue?\"\n```\n\n## Configuration\n\nDataTalk uses [LiteLLM](https://docs.litellm.ai) to support 100+ models from various providers through a unified interface.\n\n### Required Environment Variables\n\nSet two environment variables:\n\n```bash\n# 1. Choose your model\nexport LLM_MODEL=\"gpt-4o\"\n\n# 2. Set the API key for your provider\nexport OPENAI_API_KEY=\"your-key\"\n```\n\n### Supported Models\n\n**OpenAI:**\n```bash\nexport LLM_MODEL=\"gpt-4o\"  # or gpt-4o-mini, gpt-3.5-turbo\nexport OPENAI_API_KEY=\"sk-...\"\n```\n\n**Anthropic Claude:**\n```bash\nexport LLM_MODEL=\"claude-3-5-sonnet-20241022\"\nexport ANTHROPIC_API_KEY=\"sk-ant-...\"\n```\n\n**Google Gemini:**\n```bash\nexport LLM_MODEL=\"gemini-1.5-flash\"  # or gemini-1.5-pro\nexport GEMINI_API_KEY=\"...\"\n```\n\n**Ollama (100% Offline - fully private, no internet required!):**\n```bash\n# Install Ollama from https://ollama.ai\n# Start Ollama: ollama serve\n# Pull a model: ollama pull llama3.1\n\nexport LLM_MODEL=\"ollama/llama3.1\"  # or ollama/mistral, ollama/codellama\n# No API key needed! 
Works completely offline - your data and queries never leave your machine.\n```\n\n**Azure OpenAI:**\n```bash\nexport LLM_MODEL=\"azure/gpt-4o\"  # Use your deployment name\nexport AZURE_API_KEY=\"...\"\nexport AZURE_API_BASE=\"https://your-resource.openai.azure.com\"\nexport AZURE_API_VERSION=\"2024-02-01\"\n```\n*Note: Replace `gpt-4o` with your actual Azure deployment name*\n\n**And 100+ more models!** See [LiteLLM Providers](https://docs.litellm.ai/docs/providers) for the complete list including Cohere, Replicate, Hugging Face, AWS Bedrock, and more.\n\n### Optional Configuration\n\n**MODEL_TEMPERATURE** - Control LLM response randomness (default: 0.1)\n```bash\nexport MODEL_TEMPERATURE=\"0.5\"  # Range: 0.0-2.0. Lower = more deterministic, Higher = more creative\n```\n\n### Using .env file\n\nCreate a `.env` file in your project directory:\n\n```bash\nLLM_MODEL=gpt-4o\nOPENAI_API_KEY=your-key\n```\n\n## Usage\n\n**Interactive mode** - ask multiple questions:\n```bash\ndtalk sales_data.csv\n```\n\n**Direct query** - single question and exit:\n```bash\ndtalk sales_data.csv -p \"What were total sales in Q4?\"\n# or using long form:\ndtalk sales_data.csv --prompt \"What were total sales in Q4?\"\n```\n\n### Examples\n\n```bash\n# Basic queries\ndtalk data.csv \"How many rows?\"\ndtalk data.csv \"Show first 10 rows\"\ndtalk data.csv \"What is the average order value?\"\n\n# Filtering \u0026 sorting\ndtalk data.csv \"Show customers from Canada\"\ndtalk data.csv \"Top 10 products by revenue\"\n\n# Aggregations\ndtalk data.csv \"Total revenue by category\"\ndtalk data.csv \"Monthly revenue trend for 2024\"\n\n# Excel files work the same way\ndtalk report.xlsx \"What is the average salary?\"\ndtalk budget.xls \"Show expenses by department\"\n\n# Parquet files work the same way\ndtalk data.parquet \"Count distinct users\"\n```\n\n## Options\n\n### Query Modes\n\n```bash\n# Interactive mode (default) - ask multiple questions\ndtalk data.csv\n\n# Non-interactive mode - 
single query and exit\ndtalk data.csv -p \"What are the top 5 products?\"\ndtalk data.csv --prompt \"What are the top 5 products?\"\n```\n\n### Output Formats (with `-p` only)\n\nDataTalk supports multiple output formats for different use cases:\n\n```bash\n# Human-readable table (default)\ndtalk data.csv -p \"Top 5 products\"\n\n# JSON format - for scripting and automation\ndtalk data.csv -p \"Top 5 products\" --json\n# Output: {\"sql\": \"SELECT ...\", \"data\": [...], \"error\": null}\n\n# CSV format - for export and further processing\ndtalk data.csv -p \"Top 5 products\" --csv\n# Output: product_name,revenue\n#         Apple,1000\n#         Orange,500\n```\n\n### Debug \u0026 Display Options\n\n```bash\n# SQL queries are shown by default\ndtalk data.csv -p \"query\"\n\n# Hide generated SQL\ndtalk data.csv -p \"query\" --no-sql\n\n# Show only SQL without executing (for debugging/validation)\ndtalk data.csv -p \"query\" --sql-only\n\n# Hide column details table when loading data\ndtalk data.csv --no-schema\n\n# Combine options\ndtalk data.csv -p \"query\" --no-sql --no-schema    # Hide both SQL and schema\n```\n\n### Scripting\n\nDataTalk supports structured output formats for integration with scripts and pipelines:\n\n```bash\n# JSON output for scripting\nREVENUE=$(dtalk sales.csv -p \"total revenue\" --json | jq -r '.data[0].total_revenue')\necho \"Total Revenue: $REVENUE\"\n\n# CSV output for further processing\ndtalk sales.csv -p \"sales by region\" --csv | \\\n  awk -F',' '{sum+=$2} END {print \"Grand Total:\", sum}'\n\n# Process multiple files\nfor file in data_*.csv; do\n  COUNT=$(dtalk \"$file\" -p \"row count\" --json | jq -r '.data[0].count')\n  echo \"$file: $COUNT rows\"\ndone\n\n# Generate SQL for external tools\nSQL=$(dtalk sales.csv -p \"top 10 products\" --sql-only)\necho \"$SQL\" | duckdb production.db\n\n# Export filtered data\ndtalk sales.csv -p \"sales from Q4 2024\" --csv \u003e q4_sales.csv\n\n# Combine with other tools\ndtalk sales.csv -p 
\"top products\" --json | \\\n  jq '.data[] | select(.revenue \u003e 1000)'\n```\n\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, making releases, and contribution guidelines.\n\n## Exit Codes\n\nDataTalk returns standard exit codes for use in scripts and automation:\n\n| Exit Code | Meaning | Example |\n|-----------|---------|---------|\n| `0` | Success | Query completed successfully |\n| `1` | Runtime error | Missing API key, query failed, file not found |\n| `2` | Invalid arguments | `--json` without `-p`, invalid option combination |\n\n**Example usage in scripts:**\n```bash\nif dtalk sales.csv -p \"total revenue\" --json \u003e result.json; then\n    echo \"Success!\"\nelse\n    echo \"Failed with exit code $?\"\nfi\n```\n\n## FAQ\n\n**Q: Can I use this completely offline?**  \nA: Yes! Use local Ollama models and DataTalk works 100% offline with no internet connection required. Your data and queries never leave your machine.\n\n**Q: Is my data sent to the LLM provider?**  \nA: With cloud models, only the schema (column names and types) is sent - your actual data stays local. With local Ollama models, nothing leaves your machine at all.\n\n**Q: What file formats are supported?**  \nA: CSV, Excel (.xlsx, .xls), and Parquet files.\n\n**Q: How large a file can I query?**  \nA: DuckDB handles multi-gigabyte files. Parquet is faster for large datasets.\n\n## License\n\nMIT License - see [LICENSE](LICENSE) file.\n\nBuilt with [DuckDB](https://duckdb.org/), [LiteLLM](https://docs.litellm.ai), and [Rich](https://rich.readthedocs.io/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvtsaplin%2Fdatatalk-cli","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvtsaplin%2Fdatatalk-cli","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvtsaplin%2Fdatatalk-cli/lists"}