https://github.com/itsbigspark/pymetagen
Metadata Generator
Metadata Generator
- Host: GitHub
- URL: https://github.com/itsbigspark/pymetagen
- Owner: itsbigspark
- License: mit
- Created: 2023-09-25T21:03:51.000Z (over 2 years ago)
- Default Branch: dev/main
- Last Pushed: 2026-01-16T08:41:53.000Z (30 days ago)
- Last Synced: 2026-01-16T23:08:21.167Z (29 days ago)
- Topics: cli, csv, metadata, metadata-extraction, parquet, parquet-tools, polars, pyarrow, python, sql-query
- Language: Python
- Homepage:
- Size: 1.06 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS
# PyMetaGen
**PyMetaGen** is a fast data quality tool based on [Polars](https://pola.rs/#), designed for generating metadata and extracting useful information from various data file formats. It provides both a Python API and a command-line interface (CLI) to inspect, filter, and extract data from files such as CSV, JSON, Parquet, and Excel.
## Key Features
- **Metadata Generation**: Automatically generates metadata for your datasets, including statistics such as min, max, standard deviation, and more.
- **Data Extraction**: Easily extract specific rows from your datasets using head, tail, or random sampling.
- **Command Line Interface**: Perform operations like metadata generation, data inspection, and filtering using an intuitive CLI.
- **Multiple File Format Support**: Import and export data in various formats, including CSV, Parquet, Excel, and JSON.
- **SQL Query Support**: Filter data using SQL queries directly on the command line.
## Installation
To install the package, use the following command:
```bash
pip install pymetagen
```
## Local Installation
To install the development version directly from the Git repository over SSH, use the following command:
```bash
python -m pip install -U git+ssh://git@github.com/itsbigspark/dotdda.git@dev/main
```
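After either installation route, one quick way to confirm that the package is visible to the current interpreter is to query its installed version. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution name `pymetagen` is taken from the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("pymetagen")
    print(found if found else "pymetagen is not installed in this environment")
```

Returning `None` instead of raising keeps the check usable in scripts that want to branch on whether the package is present.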
## Usage
### Python API
You can use the Python API to load a data file and generate metadata:
```python
from pymetagen import MetaGen
# Create an instance of the MetaGen class reading a data file
metagen = MetaGen.from_path("tests/data/testdata.csv", loading_mode="eager")
# Display the first few rows of the data
metagen.data.head()
```
```python
# Generate metadata and reset the index
metadata = metagen.compute_metadata().reset_index()
```
```python
# Save the metadata to a file
metagen.write_metadata("tests/data/testdata_metadata.csv")
```
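To give a feel for the kind of per-column statistics listed under Key Features (min, max, standard deviation), here is a plain-Python sketch that uses only the standard library. It is an illustration, not PyMetaGen's implementation, which is built on Polars; the function name `describe_numeric_columns` and the sample rows are made up for this example.

```python
import csv
import io
import statistics
from typing import Dict


def describe_numeric_columns(csv_text: str) -> Dict[str, Dict[str, float]]:
    """Compute min, max, and sample standard deviation for each numeric column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    summary = {}
    for column in rows[0].keys():
        try:
            values = [float(row[column]) for row in rows]
        except ValueError:
            continue  # skip columns that contain non-numeric text
        summary[column] = {
            "min": min(values),
            "max": max(values),
            # stdev needs at least two values; fall back to 0.0 otherwise
            "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


if __name__ == "__main__":
    # Hypothetical sample data, not the contents of tests/data/testdata.csv
    sample = "title,imdb_score,year\nA,9.3,1994\nB,8.1,2001\nC,7.4,2010\n"
    for name, stats in describe_numeric_columns(sample).items():
        print(name, stats)
```

Text columns such as `title` are skipped in this sketch; a real metadata tool would typically report other properties for them, such as null counts or distinct values.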
### Command Line Interface
- **Metadata Generation**: Generate metadata for a tabular data file:
```bash
$ metagen metadata -i tests/data/testdata.csv -o tests/data/testdata_metadata.csv
>>> Generating metadata for tests/data/testdata.csv...
```
- **Data Inspection**: Inspect a data file (e.g., a partitioned Parquet file):
```bash
metagen inspect -i tests/data/input_ab_partition.parquet
```
- **Data Filtering**: Filter a data set using an SQL query:
```bash
metagen filter -i tests/data/testdata.csv -q "SELECT * FROM data WHERE imdb_score > 9"
```
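The query above treats the input file as a table named `data`. To illustrate what such a filter selects, the sketch below runs the same query against an in-memory SQLite table. This is only an analogy: PyMetaGen is based on Polars, its SQL dialect may differ from SQLite's, and the sample rows, the `title` column, and the helper `filter_rows` are hypothetical.

```python
import sqlite3
from contextlib import closing

# Hypothetical sample rows; only the column name imdb_score comes from the query above.
SAMPLE_ROWS = [("Movie A", 9.3), ("Movie B", 8.7), ("Movie C", 9.1)]


def filter_rows(rows, query):
    """Load rows into an in-memory table named `data` and return the query result."""
    with closing(sqlite3.connect(":memory:")) as conn:
        conn.execute("CREATE TABLE data (title TEXT, imdb_score REAL)")
        conn.executemany("INSERT INTO data VALUES (?, ?)", rows)
        return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in filter_rows(SAMPLE_ROWS, "SELECT * FROM data WHERE imdb_score > 9"):
        print(row)
```

Only rows whose `imdb_score` is strictly greater than 9 are returned, so a row with a score of exactly 9 would be excluded.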
- **Data Extraction**: Extract a specific number of rows from a data set:
```bash
$ metagen extracts -i tests/data/testdata.csv -o tests.csv -n 3
>>> Writing extract in: tests-head.csv
>>> Writing extract in: tests-tail.csv
>>> Writing extract in: tests-sample.csv
```
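The log output above suggests that a single output path such as `tests.csv` is expanded into three files, one each for the head, tail, and sample extracts. The sketch below mirrors that behaviour in plain Python. The naming pattern is inferred from the log lines only, and the functions `extract_paths` and `make_extracts` are hypothetical helpers, not part of PyMetaGen; how PyMetaGen draws its random sample is not specified here.

```python
import random
from pathlib import Path


def extract_paths(output: str) -> dict:
    """Derive one file path per extract kind, e.g. tests.csv -> tests-head.csv."""
    path = Path(output)
    return {
        kind: path.with_name(f"{path.stem}-{kind}{path.suffix}")
        for kind in ("head", "tail", "sample")
    }


def make_extracts(rows: list, n: int, seed=None) -> dict:
    """Return the first n rows, the last n rows, and up to n randomly sampled rows."""
    rng = random.Random(seed)
    return {
        "head": rows[:n],
        # rows[-0:] would return every row, so handle n == 0 explicitly
        "tail": rows[-n:] if n > 0 else [],
        "sample": rng.sample(rows, min(n, len(rows))),
    }


if __name__ == "__main__":
    for kind, path in extract_paths("tests.csv").items():
        print(kind, path)
    print(make_extracts(list(range(10)), n=3, seed=0))
```

Passing a `seed` makes the random sample reproducible, which is convenient when extracts are used as test fixtures.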
### Available Output Formats
- **CSV**
- **Parquet**
- **JSON**
- **Excel**