https://github.com/pmgraham/datagrunt

Datagrunt is a Python library designed to simplify the way you work with CSV files. It provides a streamlined approach to reading, processing, and transforming your data into various formats, making data manipulation efficient and intuitive.

Host: GitHub
URL: https://github.com/pmgraham/datagrunt
Owner: pmgraham
License: mit
Created: 2024-08-26T20:02:20.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-07-13T13:06:39.000Z (7 months ago)
Last Synced: 2025-08-17T07:38:43.015Z (6 months ago)
Topics: csv, csv-parser, data-analysis, data-engineering, data-science, data-wrangling, dataframe, duckdb, open-source, polars, python, python3
Language: Python
Homepage: https://pmgraham.github.io/datagrunt
Size: 6.53 MB
Stars: 9
Watchers: 2
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# Welcome To Datagrunt

## Why Datagrunt?

Born out of real-world frustration, Datagrunt eliminates the need for repetitive coding when handling CSV files. Whether you're a data analyst, data engineer, or data scientist, Datagrunt empowers you to focus on insights, not tedious data wrangling.

## Key Features

- **Intelligent Delimiter Inference:** Datagrunt automatically detects and applies the correct delimiter for your CSV files.
- **Seamless Data Processing:** Leverage the robust capabilities of [DuckDB](https://duckdb.org) and [Polars](https://pola.rs) to perform advanced data processing tasks directly on your CSV data.
- **Flexible Transformation:** Easily convert your processed CSV data into various formats to suit your needs.
- **AI-Powered Schema Analysis:** Use Google's Gemini models to automatically generate detailed schema reports for your CSV files, including data types, column classifications, and data quality checks.
- **Pythonic API:** Enjoy a clean and intuitive API that integrates seamlessly into your existing Python workflows.

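To make the delimiter inference and format conversion features concrete, here is a minimal sketch of the same ideas using only the Python standard library. It is not Datagrunt's implementation: `csv.Sniffer` stands in for the delimiter detection, and the semicolon-delimited sample data is hypothetical.

```python
import csv
import io
import json

# Hypothetical semicolon-delimited sample (not from the Datagrunt docs).
raw = "name;age;city\nAlice;30;Seattle\nBob;25;Tacoma\nCara;41;Olympia\n"

# Sniff the dialect from the sample, limited to a few likely delimiters.
dialect = csv.Sniffer().sniff(raw, delimiters=",;\t|")
print(repr(dialect.delimiter))

# Parse with the detected dialect, then convert the rows to JSON.
records = list(csv.DictReader(io.StringIO(raw), dialect=dialect))
as_json = json.dumps(records)
print(as_json)
```

Restricting the sniffer to a short list of candidate delimiters keeps it from latching onto characters that merely happen to repeat in the data.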
## Installation
We recommend installing with [UV](https://docs.astral.sh/uv/), but pip works just as well. Either way, you can get started with Datagrunt in seconds.

Get started with UV:

```bash
uv pip install datagrunt
```

Get started with pip:

```bash
pip install datagrunt
```
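
After installing, one way to confirm that the package is visible to your interpreter is to ask the standard library for its version. This is a small sketch; the helper name `installed_version` is made up for illustration and is not part of Datagrunt.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if Datagrunt is installed, otherwise None.
print(installed_version("datagrunt"))
```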

## Getting Started

```python
from datagrunt import CSVReader

# Load your CSV file
csv_file = 'electric_vehicle_population_data.csv'
engine = 'duckdb'

# Use DuckDB as the processing engine (the default engine is 'polars')
dg = CSVReader(csv_file, engine=engine)

# Return a sample of the data to get a peek at the schema
dg.get_sample()
┌────────────┬───────────┬──────────────┬───┬──────────────────────┬──────────────────────┬───────────────────┐
│ VIN (1-10) │  County   │     City     │ … │   Vehicle Location   │   Electric Utility   │ 2020 Census Tract │
│  varchar   │  varchar  │   varchar    │   │       varchar        │       varchar        │      varchar      │
├────────────┼───────────┼──────────────┼───┼──────────────────────┼──────────────────────┼───────────────────┤
│ 5YJSA1E28K │ Snohomish │ Mukilteo     │ … │ POINT (-122.29943 …  │ PUGET SOUND ENERGY…  │ 53061042001       │
│ 1C4JJXP68P │ Yakima    │ Yakima       │ … │ POINT (-120.468875…  │ PACIFICORP           │ 53077001601       │
│ WBY8P6C05L │ Kitsap    │ Kingston     │ … │ POINT (-122.517835…  │ PUGET SOUND ENERGY…  │ 53035090102       │
│ JTDKARFP1J │ Kitsap    │ Port Orchard │ … │ POINT (-122.653005…  │ PUGET SOUND ENERGY…  │ 53035092802       │
│ 5UXTA6C09N │ Snohomish │ Everett      │ … │ POINT (-122.203234…  │ PUGET SOUND ENERGY…  │ 53061041605       │
│ 5YJYGDEF8L │ King      │ Seattle      │ … │ POINT (-122.378886…  │ CITY OF SEATTLE - …  │ 53033004703       │
│ JTMAB3FV7P │ Thurston  │ Rainier      │ … │ POINT (-122.677141…  │ PUGET SOUND ENERGY…  │ 53067012530       │
│ JN1AZ0CPXC │ King      │ Kirkland     │ … │ POINT (-122.192596…  │ PUGET SOUND ENERGY…  │ 53033022402       │
│ JN1AZ0CP7B │ King      │ Kirkland     │ … │ POINT (-122.192596…  │ PUGET SOUND ENERGY…  │ 53033022603       │
│ 1N4AZ0CP0F │ Thurston  │ Olympia      │ … │ POINT (-122.86491 …  │ PUGET SOUND ENERGY…  │ 53067010300       │
│     ·      │     ·     │      ·       │ · │          ·           │          ·           │         ·         │
│     ·      │     ·     │      ·       │ · │          ·           │          ·           │         ·         │
│     ·      │     ·     │      ·       │ · │          ·           │          ·           │         ·         │
│ 5YJYGDEE7M │ Clark     │ Vancouver    │ … │ POINT (-122.515805…  │ BONNEVILLE POWER A…  │ 53011041310       │
│ 7SAYGAEE0P │ Snohomish │ Monroe       │ … │ POINT (-121.968385…  │ PUGET SOUND ENERGY…  │ 53061052203       │
│ 2C4RC1N75P │ King      │ Burien       │ … │ POINT (-122.347227…  │ CITY OF SEATTLE - …  │ 53033027600       │
│ 1FTVW1EVXP │ King      │ Kirkland     │ … │ POINT (-122.202653…  │ PUGET SOUND ENERGY…  │ 53033022300       │
│ 4JGGM1CB2P │ King      │ Seattle      │ … │ POINT (-122.2453 4…  │ CITY OF SEATTLE - …  │ 53033011700       │
│ 1N4BZ0CP0G │ King      │ Seattle      │ … │ POINT (-122.334079…  │ CITY OF SEATTLE - …  │ 53033008300       │
│ 7SAYGDEF2N │ King      │ Bellevue     │ … │ POINT (-122.144149…  │ PUGET SOUND ENERGY…  │ 53033024704       │
│ 1N4BZ1DP7L │ King      │ Bellevue     │ … │ POINT (-122.144149…  │ PUGET SOUND ENERGY…  │ 53033024902       │
...
├────────────┴───────────┴──────────────┴───┴──────────────────────┴──────────────────────┴───────────────────┤
│ ? rows (>9999 rows, 20 shown)                                                          17 columns (6 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
```

## DuckDB Integration for Performant SQL Queries
```python
from datagrunt import CSVReader

csv_file = 'electric_vehicle_population_data.csv'
engine = 'duckdb'

dg = CSVReader(csv_file, engine=engine)

# Construct your SQL query
query = f"""
WITH core AS (
SELECT
City AS city,
"VIN (1-10)" AS vin
FROM {dg.db_table}
)
SELECT
city,
COUNT(vin) AS vehicle_count
FROM core
GROUP BY 1
ORDER BY 2 DESC
"""

# Execute the query and get results as a Polars DataFrame
df = dg.query_data(query).pl()
print(df)
┌────────────────┬───────────────┐
│ city ┆ vehicle_count │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════════════╪═══════════════╡
│ Seattle ┆ 32602 │
│ Bellevue ┆ 9960 │
│ Redmond ┆ 7165 │
│ Vancouver ┆ 7081 │
│ Bothell ┆ 6602 │
│ … ┆ … │
│ Glenwood ┆ 1 │
│ Walla Walla Co ┆ 1 │
│ Pittsburg ┆ 1 │
│ Decatur ┆ 1 │
│ Redwood City ┆ 1 │
└────────────────┴───────────────┘
```
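
If you want to try the shape of this query without Datagrunt or the dataset at hand, the sketch below runs an equivalent aggregation with the standard library's `sqlite3` module. It is an illustration only: SQLite stands in for DuckDB, the `vehicles` table and its rows are made-up toy data, and `table_name` plays the role that `dg.db_table` plays in the example above.

```python
import sqlite3

# Hypothetical toy rows (VIN prefix, city); not taken from the real dataset.
rows = [
    ("VIN0000001", "Seattle"),
    ("VIN0000002", "Seattle"),
    ("VIN0000003", "Bellevue"),
    ("VIN0000004", "Seattle"),
    ("VIN0000005", "Yakima"),
    ("VIN0000006", "Bellevue"),
]

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE vehicles ("VIN (1-10)" TEXT, City TEXT)')
conn.executemany("INSERT INTO vehicles VALUES (?, ?)", rows)

table_name = "vehicles"  # stands in for dg.db_table

# Same CTE-plus-aggregate shape as the DuckDB query, with explicit
# column names in GROUP BY / ORDER BY and a tie-breaker on city.
query = f"""
WITH core AS (
    SELECT
        City AS city,
        "VIN (1-10)" AS vin
    FROM {table_name}
)
SELECT
    city,
    COUNT(vin) AS vehicle_count
FROM core
GROUP BY city
ORDER BY vehicle_count DESC, city
"""

result = conn.execute(query).fetchall()
print(result)
conn.close()
```

Naming the columns in `GROUP BY` and `ORDER BY` instead of using positions, and adding `city` as a secondary sort key, makes the ordering of tied counts deterministic.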
## License
This project is licensed under the [MIT License](https://opensource.org/license/mit).

## Acknowledgements
A HUGE thank you to the open source community and the creators of [DuckDB](https://duckdb.org) and [Polars](https://pola.rs) for their fantastic libraries that power Datagrunt.
