https://github.com/sermetpekin/perse

Perse is an experimental Python package that combines some of the most widely used functionality from the Pandas, Polars, and DuckDB libraries into a single, unified DataFrame object. The goal of Perse is to provide a streamlined and efficient interface, leveraging the strengths of these libraries to create a versatile data handling experience.

data data-science data-structures duckdb pandas polars


Host: GitHub
URL: https://github.com/sermetpekin/perse
Owner: SermetPekin
License: mit
Created: 2024-11-07T12:16:35.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-12-12T13:38:40.000Z (over 1 year ago)
Last Synced: 2025-09-01T08:31:23.853Z (11 months ago)
Topics: data, data-science, data-structures, duckdb, pandas, polars
Language: Python
Homepage: https://perse.readthedocs.io/en/latest/home.html
Size: 132 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

[![Python Package](https://github.com/SermetPekin/perse/actions/workflows/python-package.yml/badge.svg?2)](https://github.com/SermetPekin/perse/actions/workflows/python-package.yml)

[![PyPI](https://img.shields.io/pypi/v/perse)](https://img.shields.io/pypi/v/perse) ![PyPI Downloads](https://static.pepy.tech/badge/perse?2)![t](https://img.shields.io/badge/status-maintained-yellow.svg) [![](https://img.shields.io/github/license/SermetPekin/perse.svg)](https://github.com/SermetPekin/perse/blob/master/LICENSE.md) [![](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) 

# Perse

**Perse** is an experimental Python package that combines some of the most widely-used functionalities from the powerhouse libraries **Pandas**, **Polars**, and **DuckDB** into a single, unified `DataFrame` object. The goal of Perse is to provide a streamlined and efficient interface, leveraging the strengths of these libraries to create a versatile data handling experience.

This package is currently experimental, with a focus on essential functions. We plan to expand its capabilities by integrating more features from Pandas, Polars, and DuckDB in future versions.

## Key Features

The `Perse` DataFrame currently supports the following functionalities:

### 1. Data Manipulation

Core data-handling tools inspired by Pandas and Polars.

- **Indexing and Selection**: Access specific rows or columns with `.loc` and `.iloc` properties.

- **Column Operations**: Add, modify, or delete columns efficiently.

- **Row Filtering**: Filter rows based on specific conditions.

- **Aggregation**: Summarize data with aggregations such as `sum`, `mean`, and `count`.

- **Sorting**: Sort data based on column values.

- **Custom Function Application**: Apply custom functions to columns, supporting both element-wise operations and complex transformations.

### 2. SQL Querying

Use DuckDB's SQL engine to run SQL queries directly on the DataFrame, ideal for complex filtering and data manipulation.

- **Direct SQL Queries**: Run SQL queries directly on data using DuckDB’s powerful engine.

- **Seamless Integration**: Convert between Polars and DuckDB seamlessly for efficient querying on large datasets.

- **Advanced Filtering**: Filter, join, and group data using SQL syntax.
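
The usage example later in this README queries a table named `this`. One way to picture that convention is a wrapper that registers its rows under that name before running the statement. The sketch below is an illustration only: it swaps in the standard-library `sqlite3` module for DuckDB so that it runs without extra dependencies, and `TinyFrame` and its `query` method are hypothetical names, not Perse's implementation.

```python
import sqlite3


class TinyFrame:
    """Hypothetical stand-in for a DataFrame that can be queried with SQL."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [tuple(r) for r in rows]

    def query(self, sql):
        # Load the rows into an in-memory table called "this",
        # then run the caller's SQL against it.
        con = sqlite3.connect(":memory:")
        try:
            col_defs = ", ".join(f'"{c}"' for c in self.columns)
            con.execute(f"CREATE TABLE this ({col_defs})")
            marks = ", ".join("?" for _ in self.columns)
            con.executemany(f"INSERT INTO this VALUES ({marks})", self.rows)
            cur = con.execute(sql)
            names = [d[0] for d in cur.description]
            return TinyFrame(names, cur.fetchall())
        finally:
            con.close()


# Made-up sample rows, grouped and averaged with plain SQL.
frame = TinyFrame(["A", "B"], [(1, 0.5), (1, 1.5), (2, 4.0)])
result = frame.query("SELECT A, AVG(B) AS avg_B FROM this GROUP BY A ORDER BY A")
print(result.columns, result.rows)
```

Returning a new frame from `query` keeps the original data untouched, which matches how the usage example assigns the query result to a separate variable.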

### 3. Data Transformation

A collection of versatile data transformation functions.

- **Pivot and Unpivot**: Reshape data for summary reports and visualizations.

- **Melt/Stack**: Transform data between wide and long formats.

- **Mapping and Replacing**: Map values based on conditions or replace them in columns.

- **Grouping and Window Functions**: Group by specific columns and apply aggregations or window functions for advanced data summarization.
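
To make the wide-versus-long distinction concrete, here is a small plain-Python sketch of a melt step. It does not use Perse; the `melt` helper, its parameter names, and the sample values are all hypothetical and only illustrate the reshaping idea.

```python
def melt(rows, id_col, value_cols, var_name="variable", value_name="value"):
    """Turn wide rows (one column per measurement) into long rows."""
    long_rows = []
    for row in rows:
        for col in value_cols:
            long_rows.append(
                {id_col: row[id_col], var_name: col, value_name: row[col]}
            )
    return long_rows


# Wide format: one row per city, one column per year (made-up numbers).
wide = [
    {"city": "Ankara", "2023": 10, "2024": 12},
    {"city": "Izmir", "2023": 7, "2024": 9},
]

# Long format: one row per (city, year) pair.
long = melt(wide, id_col="city", value_cols=["2023", "2024"])
for r in long:
    print(r)
```

A pivot is the reverse step: it groups the long rows by the identifier column and spreads the `variable` values back out into separate columns.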

### 4. Compatibility and Conversion

Interoperability between Pandas, Polars, and DuckDB formats, offering flexibility in data manipulation.

- **Pandas Compatibility**: Conversion utilities to easily move data between Pandas and Polars.

- **Automatic Data Handling**: Automatically convert and handle data depending on the operation, allowing users to work flexibly with either Pandas or Polars.

- **File I/O Support**: Read from and write to common file formats (e.g., CSV, Parquet, JSON).

### 5. Visualization

Basic plotting capabilities that make it easy to visualize data directly from the Perse DataFrame.

- **Line, Bar, and Scatter Plots**: Quick visualizations with common plot types.

- **Customization**: Customize plot titles, labels, and legends with Matplotlib.

- **Direct Plotting**: Plot directly from the Perse DataFrame, which internally uses Pandas’ Matplotlib integration.

### 6. Data Integrity and Locking

Features designed to prevent accidental modifications and ensure data integrity.

- **Locking Mechanism**: Lock the DataFrame to prevent accidental edits.

- **Unlocking**: Explicitly unlock to allow modifications.

- **Validation**: Ensure data type consistency across columns for critical operations.
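
This README does not show the locking API itself, so the following is only a sketch of how a lock flag can guard mutating methods. `LockableTable`, `LockedError`, `lock`, and `unlock` are hypothetical names chosen for illustration and should not be read as Perse's actual interface.

```python
class LockedError(RuntimeError):
    """Raised when a locked table is modified."""


class LockableTable:
    """Hypothetical column store with a lock flag (not Perse's API)."""

    def __init__(self, data):
        self._data = {name: list(values) for name, values in data.items()}
        self._locked = False

    def lock(self):
        self._locked = True

    def unlock(self):
        self._locked = False

    def add_column(self, name, values):
        # Every mutating method checks the flag before touching the data.
        if self._locked:
            raise LockedError("table is locked; call unlock() first")
        self._data[name] = list(values)

    @property
    def columns(self):
        return list(self._data)


table = LockableTable({"A": [1, 2, 3]})
table.lock()
try:
    table.add_column("B", [4, 5, 6])
except LockedError as exc:
    print("blocked:", exc)

table.unlock()
table.add_column("B", [4, 5, 6])
print(table.columns)
```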

## Installation

To install Perse, run:

```bash
pip install perse
```

### Usage 

```python
from perse import DataFrame
import numpy as np

# Sample data
data = {"A": np.random.randint(0, 100, 10), "B": np.random.random(10), "C": np.random.choice(["X", "Y", "Z"], 10)}
df = DataFrame(data)

# 1. Add a New Column
df.add_column("D", np.random.random(10), inplace=True)
print("DataFrame with new column D:\n", df)

# 2. Filter Rows
df2 = df.filter_rows(df.dl["A"] > 50, inplace=False)  # default inplace = False
print("Filtered DataFrame (A > 50):\n", df2)

# 3. SQL Querying with DuckDB
df2 = df.query("SELECT A, AVG(B) AS avg_B FROM this GROUP BY A")
print("SQL Query Result:\n", df2)

# 4. Visualization
df.plot(kind="scatter", x="A", y="B", title="Scatter Plot of A vs B", xlabel="A values", ylabel="B values")

# 5. Convert to Pandas
df2 = df.to_pandas()
print("Converted to Pandas DataFrame:\n", df2)
```

### Exporting data

```python
from perse import DataFrame
import numpy as np

# Generate sample data
np.random.seed(42)
data = {
    "A": np.random.randint(0, 100, 10),
    "B": np.random.random(10),
    "C": np.random.choice(["X", "Y", "Z"], 10),
}
df = DataFrame(data)

# Export as CSV file
df.to_csv('example.csv')

# Export as Excel file
df.to_excel('example.xlsx')

# Export as JSON file
df.to_json('example.json')

# Alternatively this concise expression can also be used
df > 'example.csv'
df > 'example.xlsx'
df > 'example.json'
```
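
The `df > 'example.csv'` shorthand relies on Python's operator overloading. As a rough illustration of how such a shorthand can be built, the sketch below overloads `__gt__` and picks a writer from the file extension. It uses only the standard library, and `ExportableTable` is a hypothetical class, not Perse's own code.

```python
import csv
import json
import tempfile
from pathlib import Path


class ExportableTable:
    """Hypothetical table whose `>` operator writes it to a file."""

    def __init__(self, data):
        self.data = {name: list(values) for name, values in data.items()}

    def to_csv(self, path):
        with open(path, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(self.data.keys())            # header row
            writer.writerows(zip(*self.data.values()))   # one line per record

    def to_json(self, path):
        with open(path, "w") as fh:
            json.dump(self.data, fh)

    def __gt__(self, path):
        # Pick the writer from the file extension.
        writers = {".csv": self.to_csv, ".json": self.to_json}
        suffix = Path(path).suffix.lower()
        if suffix not in writers:
            raise ValueError(f"unsupported file type: {suffix!r}")
        writers[suffix](path)
        return self


table = ExportableTable({"A": [1, 2], "C": ["X", "Y"]})
out_dir = Path(tempfile.mkdtemp())  # write into a scratch directory
table > str(out_dir / "example.csv")
table > str(out_dir / "example.json")
print(sorted(p.name for p in out_dir.iterdir()))
```

Dispatching on the extension keeps the shorthand to a single operator while the explicit `to_csv` and `to_json` methods remain available for calls that need extra arguments.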

### Pipe Operator

In Python, `|` is normally the bitwise OR operator. The `DataFrame` class repurposes it as a pipe: the object on the left is passed to the function on the right. This allows a functional, chainable style similar to other modern data processing libraries and keeps expressions readable and flexible.

```python
from perse import DataFrame
import numpy as np

# Sample data
data = {"A": np.random.randint(0, 100, 10), "B": np.random.random(10), "C": np.random.choice(["X", "Y", "Z"], 10)}
df = DataFrame(data)

# Applying the print function to the DataFrame instance
df | print

# Chaining functions: the instance is returned if no modification is made
df2 = df | print | print

# Using a lambda function to call `to_csv` with arguments, demonstrating flexibility in piping
_ = df | (lambda x: x.to_csv('example.csv'))
```
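
For readers curious how a pipe like this can work, the sketch below overloads `__or__` on a small hypothetical class. It assumes, based on the comment in the example above, that a function returning `None` passes the original object through; `Pipeable` and `keep_even` are illustrative names and not part of Perse.

```python
class Pipeable:
    """Hypothetical wrapper showing how `|` can pipe an object into a function."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __or__(self, func):
        result = func(self)
        # Assumption: functions that return nothing (such as print) pass the
        # original object along so the chain can continue.
        return self if result is None else result

    def __repr__(self):
        return f"Pipeable({self.rows!r})"


def keep_even(p):
    """Return a new Pipeable holding only the even values."""
    return Pipeable(r for r in p.rows if r % 2 == 0)


data = Pipeable([1, 2, 3, 4])
same = data | print | print        # print returns None, so `data` flows through
evens = data | keep_even | print   # keep_even returns a new object
```

Because `|` is left-associative, `data | f | g` evaluates as `(data | f) | g`, so each function receives whatever the previous step produced.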
