{"id":26491621,"url":"https://github.com/edisedis777/duckdb-analyzer","last_synced_at":"2026-04-16T08:32:38.630Z","repository":{"id":282109814,"uuid":"947522583","full_name":"edisedis777/DuckDB-Analyzer","owner":"edisedis777","description":"A powerful tool for analyzing large CSV datasets using DuckDB.","archived":false,"fork":false,"pushed_at":"2025-05-02T06:04:41.000Z","size":43,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-02T07:20:05.805Z","etag":null,"topics":["csv","data-analysis","database","duckdb"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/edisedis777.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-12T20:27:03.000Z","updated_at":"2025-05-02T06:04:44.000Z","dependencies_parsed_at":"2025-06-18T05:33:00.541Z","dependency_job_id":null,"html_url":"https://github.com/edisedis777/DuckDB-Analyzer","commit_stats":null,"previous_names":["edisedis777/duckdb-data-analysis"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/edisedis777/DuckDB-Analyzer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edisedis777%2FDuckDB-Analyzer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edisedis777%2FDuckDB-Analyzer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edisedis777%2FDuckDB-Analyzer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/edisedis777%2FDuckDB-Analyzer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/edisedis777","download_url":"https://codeload.github.com/edisedis777/DuckDB-Analyzer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edisedis777%2FDuckDB-Analyzer/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260495567,"owners_count":23017920,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csv","data-analysis","database","duckdb"],"created_at":"2025-03-20T08:49:55.925Z","updated_at":"2026-04-16T08:32:38.624Z","avatar_url":"https://github.com/edisedis777.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DuckDB Analyzer\n[![Python Version](https://img.shields.io/badge/python-3.7%2B-blue)](https://www.python.org/downloads/)\n\nA powerful tool for analyzing large CSV datasets using DuckDB - a high-performance analytical database system.\n\nExample - Sample 10 random rows from a CSV file:\n\u003cimg width=\"894\" alt=\"Screenshot 2025-03-12 at 21 52 22\" src=\"https://github.com/user-attachments/assets/1dd3fbb0-54a9-4101-b500-4910ce267b43\" /\u003e\n\n## 🚀 Overview\nDuckDB Analyzer simplifies working with large CSV datasets by leveraging the speed and efficiency of DuckDB. 
It provides a user-friendly CLI and Python API for common data analysis tasks without requiring complex database setup.\n\n**Key Features:**\n- Lightning-fast CSV import and querying\n- Memory-efficient processing of large datasets\n- Simple command-line interface for common operations\n- Python API for integration with existing workflows\n- No database server or setup required\n\n## 📋 Requirements\n- Python 3.7+\n- Dependencies:\n  - duckdb\n  - pandas\n\n## 🔧 Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/yourusername/duckdb-analyzer.git\ncd duckdb-analyzer\n\n# Install dependencies\npip install -r requirements.txt\n```\n\nAlternatively, create a virtual environment:\n\n```bash\npython -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\npip install -r requirements.txt\n```\n\n## 📊 Usage\n\n### Command Line Interface\n\n```bash\npython duckdb_analyzer.py [options]\n```\n\n#### Get Sample Data\nSome great sample data CSV files are available [here](https://www.datablist.com/learn/csv/download-sample-csv-files) for free.\n\n#### Examples:\n\nCount rows in a CSV file:\n```bash\npython duckdb_analyzer.py --file data.csv --action count\n```\n\nSample 10 random rows from a CSV file:\n```bash\npython duckdb_analyzer.py --file data.csv --action sample --limit 10 --random\n```\n\nImport a CSV file into a DuckDB table:\n```bash\npython duckdb_analyzer.py --file data.csv --action import --table my_data\n```\n\nGet statistics for a specific column:\n```bash\npython duckdb_analyzer.py --file data.csv --action stats --column age\n```\n\nPerform group-by analysis:\n```bash\npython duckdb_analyzer.py --file data.csv --action group --column category\n```\n\nExecute a custom SQL query:\n```bash\npython duckdb_analyzer.py --action query --sql \"SELECT * FROM 'data.csv' WHERE id \u003e 100 LIMIT 5\"\n```\n\n### Python API\n\n```python\nfrom duckdb_analyzer import DuckDBAnalyzer\n\n# Use as a context manager\nwith 
DuckDBAnalyzer() as analyzer:\n    # Count rows in a CSV file\n    count = analyzer.count_rows(\"data.csv\")\n    print(f\"Found {count:,} rows\")\n    \n    # Sample data\n    df = analyzer.sample_data(\"data.csv\", rows=5, random=True)\n    print(df)\n    \n    # Import into a table\n    analyzer.import_csv_to_table(\"data.csv\", \"my_table\")\n    \n    # Run a custom query\n    result = analyzer.execute_query(\"SELECT * FROM my_table WHERE age \u003e 30\")\n```\n\n## 🔍 Available Actions\n\n| Action | Description | Required Args | Optional Args |\n|--------|-------------|--------------|--------------|\n| `count` | Count rows in a CSV file | `--file` | - |\n| `sample` | Show sample rows from a file | `--file` | `--limit`, `--random` |\n| `import` | Import CSV to a DuckDB table | `--file`, `--table` | `--overwrite` |\n| `stats` | Get statistics for a column | `--file`, `--column` | - |\n| `schema` | Show table schema | `--table` | - |\n| `compression` | Show table compression info | `--table` | - |\n| `group` | Perform group-by analysis | `--file`, `--column` | - |\n| `query` | Run a custom SQL query | `--sql` | - |\n\n## 🧪 Performance\nDuckDB Analyzer significantly outperforms traditional Python data processing methods for large datasets.\n\n## 🤝 Contributing\nContributions are welcome! Please feel free to submit a Pull Request.\n\n1. Fork the repository\n2. Create your feature branch (`git checkout -b feature/amazing-feature`)\n3. Commit your changes (`git commit -m 'Add some amazing feature'`)\n4. Push to the branch (`git push origin feature/amazing-feature`)\n5. 
Open a Pull Request\n\n## 🙏 Acknowledgements\n- [DuckDB](https://duckdb.org/) - The analytical database system that powers this tool\n- [Pandas](https://pandas.pydata.org/) - For data manipulation and analysis\n- [DataBlist](https://www.datablist.com/learn/csv/download-sample-csv-files) - For free large sample CSV files for testing.\n\n## 📜 License\nDistributed under the GNU Affero General Public License v3.0. See `LICENSE` for more information.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedisedis777%2Fduckdb-analyzer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fedisedis777%2Fduckdb-analyzer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedisedis777%2Fduckdb-analyzer/lists"}