{"id":28002934,"url":"https://github.com/hansmeershoek/pytics","last_synced_at":"2026-02-17T01:33:41.538Z","repository":{"id":285631806,"uuid":"958673620","full_name":"HansMeershoek/pytics","owner":"HansMeershoek","description":null,"archived":false,"fork":false,"pushed_at":"2025-04-06T20:49:54.000Z","size":1551,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-09T01:44:53.527Z","etag":null,"topics":["data-analysis","data-profiling","data-quality","data-science","data-visualization","eda","jupyter","pandas","plotly","profiling-datasets","python"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HansMeershoek.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-04-01T15:20:56.000Z","updated_at":"2025-04-06T20:48:39.000Z","dependencies_parsed_at":"2025-04-02T16:59:00.509Z","dependency_job_id":null,"html_url":"https://github.com/HansMeershoek/pytics","commit_stats":null,"previous_names":["hansmeershoek/data-profiler","hansmeershoek/pytics"],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HansMeershoek%2Fpytics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HansMeershoek%2Fpytics/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HansMeershoek%2Fpytics/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HansMeershoek%2Fpytics/ma
nifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HansMeershoek","download_url":"https://codeload.github.com/HansMeershoek/pytics/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253176444,"owners_count":21866142,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-profiling","data-quality","data-science","data-visualization","eda","jupyter","pandas","plotly","profiling-datasets","python"],"created_at":"2025-05-09T01:45:00.397Z","updated_at":"2026-02-17T01:33:41.460Z","avatar_url":"https://github.com/HansMeershoek.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pytics\n\n[![PyPI version](https://img.shields.io/pypi/v/pytics)](https://pypi.org/project/pytics/)\n[![Python Versions](https://img.shields.io/pypi/pyversions/pytics)](https://pypi.org/project/pytics/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Tests](https://github.com/HansMeershoek/pytics/actions/workflows/python-test.yml/badge.svg?branch=main)](https://github.com/HansMeershoek/pytics/actions/workflows/python-test.yml)\n\nAn interactive data profiling library for Python that generates comprehensive HTML reports with rich visualizations and PDF export capabilities.\n\n## Features\n\n- 📊 **Interactive Visualizations**: Built with Plotly for dynamic, interactive charts\n- 📱 **Responsive Design**: Reports adapt to different screen sizes\n- 📄 **PDF Export**: Generate publication-ready PDF reports\n- 🎯 
**Target Analysis**: Special insights for classification/regression tasks\n- 🔍 **Comprehensive Profiling**: Detailed statistics and distributions\n- ⚡ **Performance Optimized**: Efficient handling of large datasets\n- 🛠️ **Customizable**: Configure sections and visualization options\n- ↔️ **DataFrame Comparison**: Compare two datasets for differences in schema, stats, and distributions\n\n## Example Reports\n\n### Full Profile Report\n![Full Profile Report](examples/full_report.png)\n\n### Targeted Analysis Report\n![Targeted Analysis Report](examples/targeted_report.png)\n\n## Installation\n\n```bash\npip install pytics\n```\n\n## Quick Start\n\n```python\nimport pandas as pd\nfrom pytics import profile, compare\n\n# --- Basic Profiling ---\n# Method 1: Profile a DataFrame object\ndf = pd.read_csv('your_data.csv')\nprofile(df, output_file='report.html')\n\n# Method 2: Profile directly from a file path\n# Supports CSV and Parquet files\nprofile('path/to/your_data.csv', output_file='report.html')\nprofile('path/to/your_data.parquet', output_file='report.html')\n\n# --- Advanced Profiling ---\n# Generate a PDF report\nprofile(df, output_format='pdf', output_file='report.pdf')\n\n# Profile with a target variable for enhanced analysis\nprofile(\n    df,\n    target='target_column',  # Enables target-specific analysis\n    output_file='targeted_report.html'\n)\n\n# Select specific sections to include/exclude\nprofile(\n    df,\n    include_sections=['overview', 'correlations'],\n    exclude_sections=['target_analysis'],\n    output_file='custom_report.html'\n)\n\n# --- DataFrame Comparison ---\n# Method 1: Compare two DataFrame objects\ndf_train = pd.read_csv('train_data.csv')\ndf_test = pd.read_csv('test_data.csv')\n\ncompare(\n    df_train, \n    df_test,\n    name1='Train Set',    # Optional: Custom names for the datasets\n    name2='Test Set',\n    output_file='comparison.html'\n)\n\n# Method 2: Compare directly from file paths\ncompare(\n    
'path/to/train_data.csv',\n    'path/to/test_data.csv',\n    name1='Train Set',\n    name2='Test Set',\n    output_file='comparison.html'\n)\n```\n\n## Target Variable Analysis\n\nWhen you specify a target variable using the `target` parameter, pytics enhances the analysis with:\n\n- Target distribution visualization\n- Feature importance analysis\n- Target-specific correlations\n- Conditional distributions of features\n- Statistical tests for feature-target relationships\n\nExample:\n```python\n# Profile with target variable analysis\nprofile(\n    df,\n    target='target_column',\n    output_file='targeted_report.html'\n)\n```\n\n## Configuration Options\n\n### Profile Configuration\n```python\nprofile(\n    df,\n    target='target_column',           # Target variable for supervised learning\n    include_sections=['overview'],    # Sections to include\n    exclude_sections=['correlations'],# Sections to exclude\n    output_format='html',            # 'html' or 'pdf'\n    output_file='report.html',       # Output file path\n    theme='light',                   # Report theme ('light' or 'dark')\n    title='Custom Report Title'      # Report title\n)\n```\n\n### Compare Configuration\n```python\ncompare(\n    df1,\n    df2,\n    name1='First Dataset',           # Custom name for first dataset\n    name2='Second Dataset',          # Custom name for second dataset\n    output_file='comparison.html',   # Output file path\n    theme='light',                   # Report theme ('light' or 'dark')\n    title='Dataset Comparison'       # Report title\n)\n```\n\n### Available Sections\n- `overview`: Dataset summary and memory usage\n- `variables`: Detailed variable analysis\n- `correlations`: Correlation analysis\n- `target_analysis`: Target-specific insights (requires target parameter)\n- `interactions`: Feature interaction analysis\n- `missing_values`: Missing value patterns\n- `duplicates`: Duplicate record analysis\n\n## Report Sections\n\n1. 
**Overview**\n   - Dataset summary\n   - Memory usage\n   - Data types distribution\n   - Missing values summary\n\n2. **DataFrame Summary**\n   - Complete DataFrame info output\n   - Numerical and categorical statistics\n   - Data preview (head/tail)\n   - Memory usage details\n\n3. **Variable Analysis**\n   - Detailed statistics\n   - Distribution plots\n   - Missing value patterns\n   - Unique values analysis\n\n4. **Correlations**\n   - Correlation matrix\n   - Feature relationships\n   - Interactive heatmaps\n\n5. **Target Analysis** (when target specified)\n   - Target distribution\n   - Feature importance\n   - Target correlations\n\n6. **Missing Values**\n   - Missing value patterns\n   - Distribution analysis\n   - Correlation with other features\n\n7. **Duplicates**\n   - Duplicate record analysis\n   - Pattern identification\n   - Impact assessment\n\n8. **About**\n   - Project information\n   - Feature overview\n   - GitHub repository links\n\n## Edge Cases and Limitations\n\n### Data Size Limits\n- Recommended maximum rows: 1 million\n- Recommended maximum columns: 1000\n- Large datasets may require increased memory allocation\n\n### PDF Export Limitations\n\nWhen exporting reports to PDF format:\n- Plots are intentionally omitted due to a known issue with Kaleido version \u003e= 0.2.1 that causes PDF export to hang indefinitely\n- A message is displayed in place of each plot indicating it has been omitted\n- All other report content (statistics, tables, etc.) 
remains fully functional\n- For viewing plots, use the HTML export format which provides fully interactive visualizations\n- If PDF plots are required, consider using pytics version 1.1.3 which supports them\n\n### Special Cases\n- Missing Values: Automatically handled and reported\n- Categorical Variables: Limited to 1000 unique values by default\n- Date/Time: Automatically detected and analyzed\n- Mixed Data Types: Handled with appropriate warnings\n\n### Error Handling\n- Custom exceptions for clear error reporting\n- Warning system for non-critical issues\n- Graceful degradation for memory constraints\n\n## Best Practices\n\n1. **Memory Management**\n   - Sample large datasets if needed\n   - Use section selection for focused analysis\n   - Monitor memory usage for big datasets\n\n2. **Performance Optimization**\n   - Limit categorical variables when possible\n   - Use targeted section selection\n   - Consider data sampling for initial exploration\n\n3. **Report Generation**\n   - Choose appropriate output format\n   - Use meaningful report titles\n   - Save reports with descriptive filenames\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request. See the [CONTRIBUTING.md](CONTRIBUTING.md) file for guidelines.\n\n## License\n\nThis project is licensed under the MIT License - see the LICENSE file for details. \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhansmeershoek%2Fpytics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhansmeershoek%2Fpytics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhansmeershoek%2Fpytics/lists"}