https://github.com/edisedis777/duckdb-analyzer

# DuckDB Analyzer
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python Version](https://img.shields.io/badge/python-3.7%2B-blue)](https://www.python.org/downloads/)

A powerful tool for analyzing large CSV datasets using DuckDB, a high-performance analytical database system.

Example screenshot: sampling 10 random rows from a CSV file.

## 🚀 Overview

DuckDB Analyzer simplifies working with large CSV datasets by leveraging the speed and efficiency of DuckDB. It provides a user-friendly CLI and Python API for common data analysis tasks without requiring complex database setup.

**Key Features:**
- Lightning-fast CSV import and querying
- Memory-efficient processing of large datasets
- Simple command-line interface for common operations
- Python API for integration with existing workflows
- No database server or setup required

## 📋 Requirements

- Python 3.7+
- Dependencies:
  - duckdb
  - pandas

## 🔧 Installation

```bash
# Clone the repository
git clone https://github.com/edisedis777/duckdb-analyzer.git
cd duckdb-analyzer

# Install dependencies
pip install -r requirements.txt
```

Alternatively, create a virtual environment:

```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

## 📊 Usage

### Command Line Interface

```bash
python duckdb_analyzer.py [options]
```

#### Get Sample Data
Free sample CSV files for testing are available from [Datablist](https://www.datablist.com/learn/csv/download-sample-csv-files).

#### Examples:

Count rows in a CSV file:
```bash
python duckdb_analyzer.py --file data.csv --action count
```

Sample 10 random rows from a CSV file:
```bash
python duckdb_analyzer.py --file data.csv --action sample --limit 10 --random
```

Import a CSV file into a DuckDB table:
```bash
python duckdb_analyzer.py --file data.csv --action import --table my_data
```

Get statistics for a specific column:
```bash
python duckdb_analyzer.py --file data.csv --action stats --column age
```

Perform group-by analysis:
```bash
python duckdb_analyzer.py --file data.csv --action group --column category
```

Execute a custom SQL query:
```bash
python duckdb_analyzer.py --action query --sql "SELECT * FROM 'data.csv' WHERE id > 100 LIMIT 5"
```

### Python API

```python
from duckdb_analyzer import DuckDBAnalyzer

# Use as a context manager
with DuckDBAnalyzer() as analyzer:
    # Count rows in a CSV file
    count = analyzer.count_rows("data.csv")
    print(f"Found {count:,} rows")

    # Sample data
    df = analyzer.sample_data("data.csv", rows=5, random=True)
    print(df)

    # Import into a table
    analyzer.import_csv_to_table("data.csv", "my_table")

    # Run a custom query
    result = analyzer.execute_query("SELECT * FROM my_table WHERE age > 30")
```

## 🔍 Available Actions

| Action | Description | Required Args | Optional Args |
|--------|-------------|--------------|--------------|
| `count` | Count rows in a CSV file | `--file` | - |
| `sample` | Show sample rows from a file | `--file` | `--limit`, `--random` |
| `import` | Import CSV to a DuckDB table | `--file`, `--table` | `--overwrite` |
| `stats` | Get statistics for a column | `--file`, `--column` | - |
| `schema` | Show table schema | `--table` | - |
| `compression` | Show table compression info | `--table` | - |
| `group` | Perform group-by analysis | `--file`, `--column` | - |
| `query` | Run a custom SQL query | `--sql` | - |
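
The required-argument rules in the table can be enforced with a small `argparse` wrapper. This is a sketch under assumptions, not the script's real parser: the flag names come from the table, while the `--limit` default of 10 and the helper names (`build_parser`, `missing_args`, `parse`) are hypothetical.

```python
import argparse

# Required arguments per action, copied from the table above.
REQUIRED_ARGS = {
    "count": ["file"],
    "sample": ["file"],
    "import": ["file", "table"],
    "stats": ["file", "column"],
    "schema": ["table"],
    "compression": ["table"],
    "group": ["file", "column"],
    "query": ["sql"],
}


def build_parser():
    parser = argparse.ArgumentParser(prog="duckdb_analyzer.py")
    parser.add_argument("--action", required=True, choices=sorted(REQUIRED_ARGS))
    parser.add_argument("--file")
    parser.add_argument("--table")
    parser.add_argument("--column")
    parser.add_argument("--sql")
    parser.add_argument("--limit", type=int, default=10)  # assumed default
    parser.add_argument("--random", action="store_true")
    parser.add_argument("--overwrite", action="store_true")
    return parser


def missing_args(args):
    """List the flags that the chosen action requires but were not given."""
    return [
        f"--{name}"
        for name in REQUIRED_ARGS[args.action]
        if getattr(args, name) is None
    ]


def parse(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    missing = missing_args(args)
    if missing:
        # parser.error prints a usage message and exits with status 2.
        parser.error(f"action '{args.action}' requires {', '.join(missing)}")
    return args


args = parse(["--action", "sample", "--file", "data.csv", "--limit", "5", "--random"])
print(args.action, args.file, args.limit, args.random)
```

Keeping the rules in one dictionary means adding an action only requires a new entry, not new validation code.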

## 🧪 Performance

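For large datasets, DuckDB Analyzer is designed to outperform traditional Python data processing that loads a whole file into memory. DuckDB runs queries with a vectorized, columnar engine and can scan CSV files directly instead of materializing every row as Python objects.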

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🙏 Acknowledgements

- [DuckDB](https://duckdb.org/) - The analytical database system that powers this tool
- [Pandas](https://pandas.pydata.org/) - For data manipulation and analysis
- [DataBlist](https://www.datablist.com/learn/csv/download-sample-csv-files) - For free large sample CSV files for testing.