https://github.com/edisedis777/duckdb-analyzer

A powerful tool for analyzing large CSV datasets using DuckDB.
https://github.com/edisedis777/duckdb-analyzer

csv data-analysis database duckdb

Last synced: 8 months ago
JSON representation

A powerful tool for analyzing large CSV datasets using DuckDB.

Host: GitHub
URL: https://github.com/edisedis777/duckdb-analyzer
Owner: edisedis777
License: agpl-3.0
Created: 2025-03-12T20:27:03.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-05-02T06:04:41.000Z (9 months ago)
Last Synced: 2025-05-02T07:20:05.805Z (9 months ago)
Topics: csv, data-analysis, database, duckdb
Language: Python
Homepage:
Size: 42 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # DuckDB Analyzer

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

[![Python Version](https://img.shields.io/badge/python-3.7%2B-blue)](https://www.python.org/downloads/)

A powerful tool for analyzing large CSV datasets using DuckDB - a high-performance analytical database system.

Example - Sample 10 random rows from a CSV file:



## 🚀 Overview

DuckDB Analyzer simplifies working with large CSV datasets by leveraging the speed and efficiency of DuckDB. It provides a user-friendly CLI and Python API for common data analysis tasks without requiring complex database setup.

**Key Features:**

- Lightning-fast CSV import and querying

- Memory-efficient processing of large datasets

- Simple command-line interface for common operations

- Python API for integration with existing workflows

- No database server or setup required

## 📋 Requirements

- Python 3.7+

- Dependencies:

  - duckdb

  - pandas

## 🔧 Installation

```bash

# Clone the repository

git clone https://github.com/yourusername/duckdb-analyzer.git

cd duckdb-analyzer

# Install dependencies

pip install -r requirements.txt

```

Alternatively, create a virtual environment:

```bash

python -m venv venv

source venv/bin/activate  # On Windows: venv\Scripts\activate

pip install -r requirements.txt

```

## 📊 Usage

### Command Line Interface

```bash

python duckdb_analyzer.py [options]

```

#### Get Sample Data

Some great sample data CSV files are available [here](https://www.datablist.com/learn/csv/download-sample-csv-files) for free.

#### Examples:

Count rows in a CSV file:

```bash

python duckdb_analyzer.py --file data.csv --action count

```

Sample 10 random rows from a CSV file:

```bash

python duckdb_analyzer.py --file data.csv --action sample --limit 10 --random

```

Import a CSV file into a DuckDB table:

```bash

python duckdb_analyzer.py --file data.csv --action import --table my_data

```

Get statistics for a specific column:

```bash

python duckdb_analyzer.py --file data.csv --action stats --column age

```

Perform group-by analysis:

```bash

python duckdb_analyzer.py --file data.csv --action group --column category

```

Execute a custom SQL query:

```bash

python duckdb_analyzer.py --action query --sql "SELECT * FROM 'data.csv' WHERE id > 100 LIMIT 5"

```

### Python API

```python

from duckdb_analyzer import DuckDBAnalyzer

# Use as a context manager

with DuckDBAnalyzer() as analyzer:

    # Count rows in a CSV file

    count = analyzer.count_rows("data.csv")

    print(f"Found {count:,} rows")

    

    # Sample data

    df = analyzer.sample_data("data.csv", rows=5, random=True)

    print(df)

    

    # Import into a table

    analyzer.import_csv_to_table("data.csv", "my_table")

    

    # Run a custom query

    result = analyzer.execute_query("SELECT * FROM my_table WHERE age > 30")

```

## 🔍 Available Actions

| Action | Description | Required Args | Optional Args |

|--------|-------------|--------------|--------------|

| `count` | Count rows in a CSV file | `--file` | - |

| `sample` | Show sample rows from a file | `--file` | `--limit`, `--random` |

| `import` | Import CSV to a DuckDB table | `--file`, `--table` | `--overwrite` |

| `stats` | Get statistics for a column | `--file`, `--column` | - |

| `schema` | Show table schema | `--table` | - |

| `compression` | Show table compression info | `--table` | - |

| `group` | Perform group-by analysis | `--file`, `--column` | - |

| `query` | Run a custom SQL query | `--sql` | - |

## 🧪 Performance

DuckDB Analyzer significantly outperforms traditional Python data processing methods for large datasets.

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository

2. Create your feature branch (`git checkout -b feature/amazing-feature`)

3. Commit your changes (`git commit -m 'Add some amazing feature'`)

4. Push to the branch (`git push origin feature/amazing-feature`)

5. Open a Pull Request

## 📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🙏 Acknowledgements

- [DuckDB](https://duckdb.org/) - The analytical database system that powers this tool

- [Pandas](https://pandas.pydata.org/) - For data manipulation and analysis

- [DataBlist](https://www.datablist.com/learn/csv/download-sample-csv-files) - For free large sample CSV files for testing.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/edisedis777/duckdb-analyzer

Awesome Lists containing this project

README