Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ohadmata/shmessy
If your data is messy - Use Shmessy!
- Host: GitHub
- URL: https://github.com/ohadmata/shmessy
- Owner: ohadmata
- License: mit
- Created: 2023-12-27T20:15:01.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-04-08T17:24:40.000Z (9 months ago)
- Last Synced: 2024-04-08T20:13:44.853Z (9 months ago)
- Language: Python
- Homepage:
- Size: 915 KB
- Stars: 21
- Watchers: 1
- Forks: 1
- Open Issues: 7
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
- awesome-opensource-israel - Shmessy - If your data is messy - Use Shmessy! ![GitHub last commit](https://img.shields.io/github/last-commit/ohadmata/shmessy?style=flat-square) ![GitHub top language](https://img.shields.io/github/languages/top/ohadmata/shmessy?style=flat-square) (Projects by main language / python)
README
# Shmessy
[![PyPI version](https://img.shields.io/pypi/v/shmessy)](https://img.shields.io/pypi/v/shmessy)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/shmessy)](https://pypi.org/project/shmessy/)
![Coverage report](https://raw.githubusercontent.com/ohadmata/shmessy/main/assets/coverage.svg)
[![CI](https://github.com/ohadmata/shmessy/actions/workflows/main.yml/badge.svg)](https://github.com/ohadmata/shmessy/actions/workflows/main.yml)
[![License](https://img.shields.io/:license-MIT-blue.svg)](https://opensource.org/license/mit/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/shmessy)](https://pypi.org/project/shmessy/)
![OS](https://img.shields.io/badge/ubuntu-blue?logo=ubuntu)
![OS](https://img.shields.io/badge/mac-blue?logo=apple)
![OS](https://img.shields.io/badge/win-blue?logo=windows)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
### If your data is messy - Use Shmessy!

Shmessy is designed to deal with messy pandas DataFrames.
As analysts and data engineers, we all know the frustration of having to handle and analyze a messy DataFrame by hand.

The goal of this tiny tool is to identify the physical / logical data type of each DataFrame column.
It is based on fast validators that check the data (using a sample) against regex / pydantic types, or any additional validation function that you want to implement.

This tool was designed to deal with dirty data. It is best suited to DataFrames generated from CSV / flat files, or from any other source that doesn't enforce a strict schema.

## Installation
```bash
pip install shmessy
```

## Usage
There are several ways to use this tool:
### Identify the Dataframe schema
```python
import pandas as pd
from shmessy import Shmessy

df = pd.read_csv('/tmp/file.csv')
inferred_schema = Shmessy().infer_schema(df)
```

Output (inferred_schema dump):
```json
{
  "infer_duration_ms": 12,
  "columns": [
    {
      "field_name": "id",
      "source_type": "Integer",
      "inferred_type": "Integer"
    },
    {
      "field_name": "email_value",
      "source_type": "String",
      "inferred_type": "Email"
    },
    {
      "field_name": "date_value",
      "source_type": "String",
      "inferred_type": "Date",
      "inferred_pattern": "%d-%m-%Y"
    },
    {
      "field_name": "datetime_value",
      "source_type": "String",
      "inferred_type": "Datetime",
      "inferred_pattern": "%Y/%m/%d %H:%M:%S"
    },
    {
      "field_name": "yes_no_data",
      "source_type": "String",
      "inferred_type": "Boolean",
      "inferred_pattern": [
        "YES",
        "NO"
      ]
    },
    {
      "field_name": "unix_value",
      "source_type": "Integer",
      "inferred_type": "UnixTimestamp",
      "inferred_pattern": "ms"
    },
    {
      "field_name": "ip_value",
      "source_type": "String",
      "inferred_type": "IPv4"
    }
  ]
}
```

### Identify and fix Pandas Dataframe
This piece of code changes the column types of the input DataFrame according to the schema that Shmessy infers.
```python
import pandas as pd
from shmessy import Shmessy

df = pd.read_csv('/tmp/file.csv')
fixed_df = Shmessy().fix_schema(df)
```

#### Original Dataframe
![Original Dataframe](https://raw.githubusercontent.com/ohadmata/shmessy/main/assets/screenshot_7.png)

#### Fixed Dataframe
![After fix](https://raw.githubusercontent.com/ohadmata/shmessy/main/assets/screenshot_8.png)

### Read Messy CSV file
```python
from shmessy import Shmessy
df = Shmessy().read_csv('/tmp/file.csv')
```

#### Original file
![Original Dataframe](https://raw.githubusercontent.com/ohadmata/shmessy/main/assets/screenshot_5.png)

#### Fixed Dataframe
![After fix](https://raw.githubusercontent.com/ohadmata/shmessy/main/assets/screenshot_6.png)

## API
### Constructor
```python
shmessy = Shmessy(
    sample_size: Optional[int] = 1000,
    reader_encoding: Optional[str] = "UTF-8",
    locale_formatter: Optional[str] = "en_US",
    use_random_sample: Optional[bool] = True,
    types_to_ignore: Optional[List[str]] = None,
    max_columns_num: Optional[int] = 500,
    fallback_to_string: Optional[bool] = False,  # Fall back to string in case of a casting exception
    fallback_to_null: Optional[bool] = False,  # Fall back to null in case of a casting exception
    use_csv_sniffer: Optional[bool] = True,  # Use the Python CSV sniffer to identify the dialect (separator / quote char / etc.)
    fix_column_names: Optional[bool] = False,  # Replace non-alphanumeric chars with underscores
)
```

### read_csv
```python
shmessy.read_csv(filepath_or_buffer: Union[str, TextIO, BinaryIO]) -> DataFrame
```

### infer_schema
```python
shmessy.infer_schema(df: DataFrame) -> ShmessySchema
```

### fix_schema
```python
shmessy.fix_schema(df: DataFrame) -> DataFrame
```

### get_inferred_schema
```python
shmessy.get_inferred_schema() -> ShmessySchema
```
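
### Sketch: sample-based type inference

To make the idea behind `infer_schema` more concrete, here is a minimal, standard-library-only sketch of validating a sample of a column against a few candidate types. It is not Shmessy's implementation: the regexes, the candidate date patterns, and the names `infer_column`, `is_ipv4`, and `matching_date_pattern` are assumptions made for illustration. It also takes the first `sample_size` values rather than a random sample.

```python
import re
from datetime import datetime
from typing import Dict, List, Optional

# Hypothetical validators for illustration only; not Shmessy's actual rules.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
IPV4_RE = re.compile(r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$")
DATE_PATTERNS = ["%d-%m-%Y", "%Y-%m-%d", "%Y/%m/%d %H:%M:%S"]


def is_ipv4(value: str) -> bool:
    """True if value looks like a dotted quad with every octet <= 255."""
    match = IPV4_RE.match(value)
    return bool(match) and all(int(part) <= 255 for part in match.groups())


def matching_date_pattern(sample: List[str]) -> Optional[str]:
    """Return the first strptime pattern that parses every sampled value."""
    for pattern in DATE_PATTERNS:
        try:
            for value in sample:
                datetime.strptime(value, pattern)
        except ValueError:
            continue  # At least one value failed; try the next pattern.
        return pattern
    return None


def infer_column(values: List[str], sample_size: int = 1000) -> Dict[str, str]:
    """Infer a logical type from the first sample_size non-empty values."""
    sample = [v for v in values[:sample_size] if v != ""]
    if not sample:
        return {"inferred_type": "String"}
    if all(EMAIL_RE.match(v) for v in sample):
        return {"inferred_type": "Email"}
    if all(is_ipv4(v) for v in sample):
        return {"inferred_type": "IPv4"}
    pattern = matching_date_pattern(sample)
    if pattern is not None:
        kind = "Datetime" if "%H" in pattern else "Date"
        return {"inferred_type": kind, "inferred_pattern": pattern}
    # No validator accepted the whole sample, so keep the source type.
    return {"inferred_type": "String"}
```

A type is accepted here only when every sampled value passes its validator, so a single bad value (for example an octet above 255) leaves the column as `String`. A real tool has to decide how to treat such outliers, which is presumably where options such as `fallback_to_string` and `fallback_to_null` come in.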