https://github.com/hansmeershoek/pytics
- Host: GitHub
- URL: https://github.com/hansmeershoek/pytics
- Owner: HansMeershoek
- License: mit
- Created: 2025-04-01T15:20:56.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-06T20:49:54.000Z (about 1 year ago)
- Last Synced: 2025-05-09T01:44:53.527Z (11 months ago)
- Topics: data-analysis, data-profiling, data-quality, data-science, data-visualization, eda, jupyter, pandas, plotly, profiling-datasets, python
- Language: HTML
- Homepage:
- Size: 1.48 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# pytics
[PyPI](https://pypi.org/project/pytics/) | [License: MIT](https://opensource.org/licenses/MIT) | [Tests](https://github.com/HansMeershoek/pytics/actions/workflows/python-test.yml)
An interactive data profiling library for Python that generates comprehensive HTML reports with rich visualizations and PDF export capabilities.
## Features
- 📊 **Interactive Visualizations**: Built with Plotly for dynamic, interactive charts
- 📱 **Responsive Design**: Reports adapt to different screen sizes
- 📄 **PDF Export**: Generate publication-ready PDF reports
- 🎯 **Target Analysis**: Special insights for classification/regression tasks
- 🔍 **Comprehensive Profiling**: Detailed statistics and distributions
- ⚡ **Performance Optimized**: Efficient handling of large datasets
- 🛠️ **Customizable**: Configure sections and visualization options
- ↔️ **DataFrame Comparison**: Compare two datasets for differences in schema, stats, and distributions
## Example Reports
### Full Profile Report

### Targeted Analysis Report

## Installation
```bash
pip install pytics
```
## Quick Start
```python
import pandas as pd
from pytics import profile, compare
# --- Basic Profiling ---
# Method 1: Profile a DataFrame object
df = pd.read_csv('your_data.csv')
profile(df, output_file='report.html')
# Method 2: Profile directly from a file path
# Supports CSV and Parquet files
profile('path/to/your_data.csv', output_file='report.html')
profile('path/to/your_data.parquet', output_file='report.html')
# --- Advanced Profiling ---
# Generate a PDF report
profile(df, output_format='pdf', output_file='report.pdf')
# Profile with a target variable for enhanced analysis
profile(
    df,
    target='target_column',  # Enables target-specific analysis
    output_file='targeted_report.html'
)
# Select specific sections to include/exclude
profile(
    df,
    include_sections=['overview', 'correlations'],
    exclude_sections=['target_analysis'],
    output_file='custom_report.html'
)
# --- DataFrame Comparison ---
# Method 1: Compare two DataFrame objects
df_train = pd.read_csv('train_data.csv')
df_test = pd.read_csv('test_data.csv')
compare(
    df_train,
    df_test,
    name1='Train Set',  # Optional: Custom names for the datasets
    name2='Test Set',
    output_file='comparison.html'
)
# Method 2: Compare directly from file paths
compare(
    'path/to/train_data.csv',
    'path/to/test_data.csv',
    name1='Train Set',
    name2='Test Set',
    output_file='comparison.html'
)
```
## Target Variable Analysis
When you specify a target variable using the `target` parameter, pytics enhances the analysis with:
- Target distribution visualization
- Feature importance analysis
- Target-specific correlations
- Conditional distributions of features
- Statistical tests for feature-target relationships
Example:
```python
# Profile with target variable analysis
profile(
    df,
    target='target_column',
    output_file='targeted_report.html'
)
```
## Configuration Options
### Profile Configuration
```python
profile(
    df,
    target='target_column',             # Target variable for supervised learning
    include_sections=['overview'],      # Sections to include
    exclude_sections=['correlations'],  # Sections to exclude
    output_format='html',               # 'html' or 'pdf'; should match the output_file extension
    output_file='report.html',          # Output file path
    theme='light',                      # Report theme ('light' or 'dark')
    title='Custom Report Title'         # Report title
)
```
### Compare Configuration
```python
compare(
    df1,
    df2,
    name1='First Dataset',          # Custom name for first dataset
    name2='Second Dataset',         # Custom name for second dataset
    output_file='comparison.html',  # Output file path
    theme='light',                  # Report theme ('light' or 'dark')
    title='Dataset Comparison'      # Report title
)
```
### Available Sections
- `overview`: Dataset summary and memory usage
- `variables`: Detailed variable analysis
- `correlations`: Correlation analysis
- `target_analysis`: Target-specific insights (requires target parameter)
- `interactions`: Feature interaction analysis
- `missing_values`: Missing value patterns
- `duplicates`: Duplicate record analysis
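Because section names are plain strings, a typo is easy to miss. The sketch below checks a selection against the names listed above before it is passed to `profile`. The helper `check_sections` and the `KNOWN_SECTIONS` set are hypothetical and not part of pytics; how pytics itself reacts to unknown or conflicting names is not documented here.

```python
# Section names copied from the list in this README; pytics may accept others.
KNOWN_SECTIONS = {
    "overview", "variables", "correlations", "target_analysis",
    "interactions", "missing_values", "duplicates",
}


def check_sections(include=None, exclude=None, target=None):
    """Return a list of readable problems; an empty list means the selection looks fine.

    Hypothetical helper for illustration, not part of the pytics API.
    """
    include = list(include or [])
    exclude = list(exclude or [])
    problems = []

    # Flag names that are not in the documented list (most likely typos).
    for name in include + exclude:
        if name not in KNOWN_SECTIONS:
            problems.append(f"unknown section: {name!r}")

    # A section listed on both sides is ambiguous.
    for name in sorted(set(include) & set(exclude)):
        problems.append(f"section both included and excluded: {name!r}")

    # The README states that target_analysis requires the target parameter.
    if "target_analysis" in include and target is None:
        problems.append("'target_analysis' requested without a target")

    return problems


print(check_sections(include=["overview", "corelations"]))
```

If the returned list is empty, the same `include`, `exclude`, and `target` values can be passed to `profile` as `include_sections`, `exclude_sections`, and `target`.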
## Report Sections
1. **Overview**
   - Dataset summary
   - Memory usage
   - Data types distribution
   - Missing values summary
2. **DataFrame Summary**
   - Complete DataFrame info output
   - Numerical and categorical statistics
   - Data preview (head/tail)
   - Memory usage details
3. **Variable Analysis**
   - Detailed statistics
   - Distribution plots
   - Missing value patterns
   - Unique values analysis
4. **Correlations**
   - Correlation matrix
   - Feature relationships
   - Interactive heatmaps
5. **Target Analysis** (when target specified)
   - Target distribution
   - Feature importance
   - Target correlations
6. **Missing Values**
   - Missing value patterns
   - Distribution analysis
   - Correlation with other features
7. **Duplicates**
   - Duplicate record analysis
   - Pattern identification
   - Impact assessment
8. **About**
   - Project information
   - Feature overview
   - GitHub repository links
## Edge Cases and Limitations
### Data Size Limits
- Recommended maximum rows: 1 million
- Recommended maximum columns: 1000
- Large datasets may require increased memory allocation
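One way to stay within the recommended row count is to sample the DataFrame with pandas before profiling. This is a minimal sketch using only pandas; the helper name `sample_for_profiling` is hypothetical, and the default of 1,000,000 rows simply mirrors the recommendation above.

```python
import pandas as pd


def sample_for_profiling(df, max_rows=1_000_000, random_state=0):
    """Return df unchanged if it fits, otherwise a random sample of max_rows rows.

    Hypothetical helper for illustration, not part of the pytics API.
    """
    if len(df) <= max_rows:
        return df
    # Sample without replacement; a fixed seed makes the report reproducible.
    return df.sample(n=max_rows, random_state=random_state)


# Tiny demonstration with a lowered limit.
df = pd.DataFrame({"x": range(10), "y": list("abcdefghij")})
small = sample_for_profiling(df, max_rows=4)
print(len(small), "rows kept out of", len(df))
```

The sampled frame can then be passed to `profile` in place of the full one. A uniform random sample preserves overall distributions reasonably well but can under-represent rare categories, so a final report on the full data may still be worthwhile.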
### PDF Export Limitations
When exporting reports to PDF format:
- Plots are intentionally omitted due to a known issue with Kaleido version >= 0.2.1 that causes PDF export to hang indefinitely
- A message is displayed in place of each plot indicating it has been omitted
- All other report content (statistics, tables, etc.) remains fully functional
- For viewing plots, use the HTML export format which provides fully interactive visualizations
- If PDF plots are required, consider using pytics version 1.1.3, which supports them
### Special Cases
- Missing Values: Automatically handled and reported
- Categorical Variables: Limited to 1000 unique values by default
- Date/Time: Automatically detected and analyzed
- Mixed Data Types: Handled with appropriate warnings
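To see in advance which columns would exceed the default limit of 1000 unique values, the distinct counts can be checked with pandas. The sketch below is an assumption about what counts as "categorical" (any column that is not numeric, boolean, or date/time); pytics may classify columns differently, and `high_cardinality_columns` is a hypothetical helper.

```python
import pandas as pd


def high_cardinality_columns(df, limit=1000):
    """Map each non-numeric, non-date column with more than `limit` distinct values to its count.

    Hypothetical helper for illustration, not part of the pytics API.
    """
    candidates = df.select_dtypes(
        exclude=["number", "bool", "datetime", "datetimetz", "timedelta"]
    )
    counts = candidates.nunique()  # missing values are not counted
    return {col: int(n) for col, n in counts.items() if n > limit}


df = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(1500)],  # 1500 distinct strings
    "group": ["a", "b", "c"] * 500,             # 3 distinct strings
    "value": range(1500),                       # numeric, ignored
})
print(high_cardinality_columns(df))
```

Columns reported here are typically identifiers or free text, and dropping or bucketing them before profiling usually costs little.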
### Error Handling
- Custom exceptions for clear error reporting
- Warning system for non-critical issues
- Graceful degradation for memory constraints
## Best Practices
1. **Memory Management**
   - Sample large datasets if needed
   - Use section selection for focused analysis
   - Monitor memory usage for big datasets
2. **Performance Optimization**
   - Limit categorical variables when possible
   - Use targeted section selection
   - Consider data sampling for initial exploration
3. **Report Generation**
   - Choose appropriate output format
   - Use meaningful report titles
   - Save reports with descriptive filenames
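As one possible reading of "descriptive filenames", the sketch below builds an output path from the dataset name, report kind, and date using only the standard library. The helper `report_path`, the `reports` directory, and the naming scheme are all illustrative assumptions, not pytics conventions.

```python
from datetime import date
from pathlib import Path


def report_path(dataset, kind="profile", output_format="html", out_dir="reports", day=None):
    """Build a path such as reports/train_profile_2025-04-01.html.

    Hypothetical helper for illustration, not part of the pytics API.
    """
    day = day or date.today()
    stem = f"{dataset}_{kind}_{day.isoformat()}"
    return Path(out_dir) / f"{stem}.{output_format}"


path = report_path("train", day=date(2025, 4, 1))
print(path.name)
```

The resulting path can be converted with `str(path)` and passed as `output_file`. Keeping the extension tied to `output_format` avoids writing a PDF report to a file named `.html`.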
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request. See the [CONTRIBUTING.md](CONTRIBUTING.md) file for guidelines.
## License
This project is licensed under the MIT License - see the LICENSE file for details.