https://github.com/edisedis777/duckdb-analyzer
A powerful tool for analyzing large CSV datasets using DuckDB.
csv data-analysis database duckdb
- Host: GitHub
- URL: https://github.com/edisedis777/duckdb-analyzer
- Owner: edisedis777
- License: agpl-3.0
- Created: 2025-03-12T20:27:03.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-05-02T06:04:41.000Z (9 months ago)
- Last Synced: 2025-05-02T07:20:05.805Z (9 months ago)
- Topics: csv, data-analysis, database, duckdb
- Language: Python
- Homepage:
- Size: 42 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# DuckDB Analyzer
A powerful tool for analyzing large CSV datasets using DuckDB - a high-performance analytical database system.
Example - Sample 10 random rows from a CSV file:

## Overview
DuckDB Analyzer simplifies working with large CSV datasets by leveraging the speed and efficiency of DuckDB. It provides a user-friendly CLI and Python API for common data analysis tasks without requiring complex database setup.
**Key Features:**
- Lightning-fast CSV import and querying
- Memory-efficient processing of large datasets
- Simple command-line interface for common operations
- Python API for integration with existing workflows
- No database server or setup required
## Requirements
- Python 3.7+
- Dependencies:
  - duckdb
  - pandas
## Installation
```bash
# Clone the repository
git clone https://github.com/edisedis777/duckdb-analyzer.git
cd duckdb-analyzer
# Install dependencies
pip install -r requirements.txt
```
Alternatively, create a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```
## Usage
### Command Line Interface
```bash
python duckdb_analyzer.py [options]
```
#### Get Sample Data
Free sample CSV files for testing are available from [DataBlist](https://www.datablist.com/learn/csv/download-sample-csv-files).
#### Examples:
Count rows in a CSV file:
```bash
python duckdb_analyzer.py --file data.csv --action count
```
Sample 10 random rows from a CSV file:
```bash
python duckdb_analyzer.py --file data.csv --action sample --limit 10 --random
```
Import a CSV file into a DuckDB table:
```bash
python duckdb_analyzer.py --file data.csv --action import --table my_data
```
Get statistics for a specific column:
```bash
python duckdb_analyzer.py --file data.csv --action stats --column age
```
Perform group-by analysis:
```bash
python duckdb_analyzer.py --file data.csv --action group --column category
```
Execute a custom SQL query:
```bash
python duckdb_analyzer.py --action query --sql "SELECT * FROM 'data.csv' WHERE id > 100 LIMIT 5"
```
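The CLI examples above each correspond to a simple query over the CSV file. The sketch below shows SQL that such actions could plausibly translate to in DuckDB. It is an illustration only, not the tool's actual implementation: `build_sql`, `quote_ident`, and `quote_literal` are hypothetical helpers, and the choice of `read_csv_auto` and `ORDER BY random()` is an assumption.

```python
def quote_ident(name: str) -> str:
    """Quote a column name as a SQL identifier (hypothetical helper)."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Quote a file path as a SQL string literal (hypothetical helper)."""
    return "'" + value.replace("'", "''") + "'"


def build_sql(action, file, column=None, limit=10, random=False):
    """Return SQL text for a few of the documented actions.

    Assumption: the CSV is read with DuckDB's read_csv_auto table function.
    """
    source = f"read_csv_auto({quote_literal(file)})"
    if action == "count":
        return f"SELECT COUNT(*) AS row_count FROM {source}"
    if action == "sample":
        # Random sampling is sketched as a shuffle followed by LIMIT.
        order = " ORDER BY random()" if random else ""
        return f"SELECT * FROM {source}{order} LIMIT {int(limit)}"
    if action in ("stats", "group"):
        if column is None:
            raise ValueError(f"action '{action}' needs a column")
        col = quote_ident(column)
        if action == "stats":
            # AVG only makes sense for numeric columns.
            return (
                f"SELECT MIN({col}) AS min_value, MAX({col}) AS max_value, "
                f"AVG({col}) AS avg_value, COUNT({col}) AS non_null "
                f"FROM {source}"
            )
        return (
            f"SELECT {col}, COUNT(*) AS n FROM {source} "
            f"GROUP BY {col} ORDER BY n DESC"
        )
    raise ValueError(f"unsupported action: {action}")
```

A string produced this way could be passed to the `query` action via `--sql`, or to `execute_query` in the Python API.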
### Python API
```python
from duckdb_analyzer import DuckDBAnalyzer

# Use as a context manager
with DuckDBAnalyzer() as analyzer:
    # Count rows in a CSV file
    count = analyzer.count_rows("data.csv")
    print(f"Found {count:,} rows")

    # Sample data
    df = analyzer.sample_data("data.csv", rows=5, random=True)
    print(df)

    # Import into a table
    analyzer.import_csv_to_table("data.csv", "my_table")

    # Run a custom query
    result = analyzer.execute_query("SELECT * FROM my_table WHERE age > 30")
```
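To illustrate what using the analyzer as a context manager implies, here is a minimal sketch of a class that opens a connection on entry and closes it on exit. It is not the project's code: `AnalyzerSketch` and its `connect` parameter are hypothetical, and the assumption is that the real class wraps a `duckdb.connect()` connection in a similar way.

```python
class AnalyzerSketch:
    """Hypothetical stand-in showing the context-manager pattern."""

    def __init__(self, database=":memory:", connect=None):
        if connect is None:
            # Assumption: the connection comes from the duckdb package.
            # Imported lazily so the class can be exercised with a stub.
            import duckdb
            connect = duckdb.connect
        self._connect = connect
        self._database = database
        self.conn = None

    def __enter__(self):
        # Open the connection when the with-block starts.
        self.conn = self._connect(self._database)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Always close the connection, even if the block raised.
        if self.conn is not None:
            self.conn.close()
            self.conn = None
        return False  # do not swallow exceptions

    def count_rows(self, csv_path):
        # Bind the file path as a parameter instead of formatting it into SQL.
        row = self.conn.execute(
            "SELECT COUNT(*) FROM read_csv_auto(?)", [csv_path]
        ).fetchone()
        return row[0]
```

The practical consequence of this pattern is that the connection is released as soon as the `with` block ends, so results needed later should be fetched inside the block.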
## Available Actions
| Action | Description | Required Args | Optional Args |
|--------|-------------|--------------|--------------|
| `count` | Count rows in a CSV file | `--file` | - |
| `sample` | Show sample rows from a file | `--file` | `--limit`, `--random` |
| `import` | Import CSV to a DuckDB table | `--file`, `--table` | `--overwrite` |
| `stats` | Get statistics for a column | `--file`, `--column` | - |
| `schema` | Show table schema | `--table` | - |
| `compression` | Show table compression info | `--table` | - |
| `group` | Perform group-by analysis | `--file`, `--column` | - |
| `query` | Run a custom SQL query | `--sql` | - |
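The required arguments in the table above can be expressed as a small lookup that a command-line parser checks after parsing. The following is a sketch using `argparse`, not the tool's actual parser: the `REQUIRED` mapping and `parse_args` function are hypothetical, and the default of 10 for `--limit` is an assumption.

```python
import argparse

# Required arguments per action, taken from the table above.
REQUIRED = {
    "count": ["file"],
    "sample": ["file"],
    "import": ["file", "table"],
    "stats": ["file", "column"],
    "schema": ["table"],
    "compression": ["table"],
    "group": ["file", "column"],
    "query": ["sql"],
}


def build_parser():
    parser = argparse.ArgumentParser(prog="duckdb_analyzer.py")
    parser.add_argument("--action", required=True, choices=sorted(REQUIRED))
    parser.add_argument("--file")
    parser.add_argument("--table")
    parser.add_argument("--column")
    parser.add_argument("--sql")
    parser.add_argument("--limit", type=int, default=10)  # assumed default
    parser.add_argument("--random", action="store_true")
    parser.add_argument("--overwrite", action="store_true")
    return parser


def parse_args(argv):
    """Parse argv and reject actions whose required arguments are missing."""
    parser = build_parser()
    args = parser.parse_args(argv)
    missing = [
        f"--{name}"
        for name in REQUIRED[args.action]
        if getattr(args, name) is None
    ]
    if missing:
        # parser.error prints a usage message and exits with status 2.
        parser.error(f"action '{args.action}' requires {', '.join(missing)}")
    return args
```

Keeping the requirements in one mapping means a new action only needs a new entry rather than additional branching in the parser.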
## Performance
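DuckDB Analyzer is intended to outperform traditional Python data processing (for example, loading an entire CSV file into pandas) on large datasets, because DuckDB runs queries with a columnar, vectorized engine and can scan CSV files directly.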
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the AGPL-3.0 license, as listed in the repository metadata; see the LICENSE file for details.
## Acknowledgements
- [DuckDB](https://duckdb.org/) - The analytical database system that powers this tool
- [Pandas](https://pandas.pydata.org/) - For data manipulation and analysis
- [DataBlist](https://www.datablist.com/learn/csv/download-sample-csv-files) - For free large sample CSV files for testing.