https://github.com/frauddi/dataspot

Find data concentration patterns and hotspots. Built for fraud detection and risk analysis.
https://github.com/frauddi/dataspot
anomalies anomalies-detection data-analysis data-science fraud-detection hotspots pattern-mining python
Last synced: 3 months ago
JSON representation
Find data concentration patterns and hotspots. Built for fraud detection and risk analysis.
Host: GitHub
URL: https://github.com/frauddi/dataspot
Owner: frauddi
License: mit
Created: 2025-06-24T00:51:29.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2026-01-26T03:22:37.000Z (5 months ago)
Last Synced: 2026-01-26T17:54:33.618Z (5 months ago)
Topics: anomalies, anomalies-detection, data-analysis, data-science, fraud-detection, hotspots, pattern-mining, python
Language: Python
Homepage:
Size: 7.8 MB
Stars: 6
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: docs/CONTRIBUTING.md
- License: LICENSE
- Code of conduct: docs/CODE_OF_CONDUCT.md
- Citation: CITATION.cff
Awesome Lists containing this project

README

          # Dataspot 🔥

> **Find data concentration patterns and dataspots in your datasets**

[![PyPI version](https://img.shields.io/pypi/v/dataspot.svg)](https://pypi.org/project/dataspot/)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

[![Maintained by Frauddi](https://img.shields.io/badge/Maintained%20by-Frauddi-blue.svg)](https://frauddi.com)

[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)

Dataspot automatically discovers **where your data concentrates**, helping you identify patterns, anomalies, and insights in datasets. Originally developed for fraud detection at Frauddi, now available as open source.

## ✨ Why Dataspot?

- 🎯 **Purpose-built** for finding data concentrations, not just clustering

- 🔍 **Fraud detection ready** - spot suspicious behavior patterns

- ⚡ **Simple API** - get insights in 3 lines of code

- 📊 **Hierarchical analysis** - understand data at multiple levels

- 🔧 **Flexible filtering** - customize analysis with powerful options

- 📈 **Field-tested** - validated in real fraud detection systems

## 🚀 Quick Start

```bash

pip install dataspot

```

```python

from dataspot import Dataspot

from dataspot.models.finder import FindInput, FindOptions

# Sample transaction data

data = [

    {"country": "US", "device": "mobile", "amount": "high", "user_type": "premium"},

    {"country": "US", "device": "mobile", "amount": "medium", "user_type": "premium"},

    {"country": "EU", "device": "desktop", "amount": "low", "user_type": "free"},

    {"country": "US", "device": "mobile", "amount": "high", "user_type": "premium"},

]

# Find concentration patterns

dataspot = Dataspot()

result = dataspot.find(

    FindInput(data=data, fields=["country", "device", "user_type"]),

    FindOptions(min_percentage=10.0, limit=5)

)

# Results show where data concentrates

for pattern in result.patterns:

    print(f"{pattern.path} → {pattern.percentage}% ({pattern.count} records)")

# Output:

# country=US > device=mobile > user_type=premium → 75.0% (3 records)

# country=US > device=mobile → 75.0% (3 records)

# device=mobile → 75.0% (3 records)

```

## 🎯 Real-World Use Cases

### 🚨 Fraud Detection

```python

from dataspot.models.finder import FindInput, FindOptions

# Find suspicious transaction patterns

result = dataspot.find(

    FindInput(

        data=transactions,

        fields=["country", "payment_method", "time_of_day"]

    ),

    FindOptions(min_percentage=15.0, contains="crypto")

)

# Spot unusual concentrations that might indicate fraud

for pattern in result.patterns:

    if pattern.percentage > 30:

        print(f"⚠️ High concentration: {pattern.path}")

```

### 📊 Business Intelligence

```python

from dataspot.models.analyzer import AnalyzeInput, AnalyzeOptions

# Discover customer behavior patterns

insights = dataspot.analyze(

    AnalyzeInput(

        data=customer_data,

        fields=["region", "device", "product_category", "tier"]

    ),

    AnalyzeOptions(min_percentage=10.0)

)

print(f"📈 Found {len(insights.patterns)} concentration patterns")

print(f"🎯 Top opportunity: {insights.patterns[0].path}")

```

### 🔍 Temporal Analysis

```python

from dataspot.models.compare import CompareInput, CompareOptions

# Compare patterns between time periods

comparison = dataspot.compare(

    CompareInput(

        current_data=this_month_data,

        baseline_data=last_month_data,

        fields=["country", "payment_method"]

    ),

    CompareOptions(

        change_threshold=0.20,

        statistical_significance=True

    )

)

print(f"📊 Changes detected: {len(comparison.changes)}")

print(f"🆕 New patterns: {len(comparison.new_patterns)}")

```

### 🌳 Hierarchical Visualization

```python

from dataspot.models.tree import TreeInput, TreeOptions

# Build hierarchical tree for data exploration

tree = dataspot.tree(

    TreeInput(

        data=sales_data,

        fields=["region", "product_category", "sales_channel"]

    ),

    TreeOptions(min_value=10, max_depth=3, sort_by="value")

)

print(f"🌳 Total records: {tree.value}")

print(f"📊 Main branches: {len(tree.children)}")

# Navigate the hierarchy

for region in tree.children:

    print(f"  📍 {region.name}: {region.value} records")

    for product in region.children:

        print(f"    📦 {product.name}: {product.value} records")

```

### 🤖 Auto Discovery

```python

from dataspot.models.discovery import DiscoverInput, DiscoverOptions

# Automatically discover important patterns

discovery = dataspot.discover(

    DiscoverInput(data=transaction_data),

    DiscoverOptions(max_fields=3, min_percentage=15.0)

)

print(f"🎯 Top patterns discovered: {len(discovery.top_patterns)}")

for field_ranking in discovery.field_ranking[:3]:

    print(f"📈 {field_ranking.field}: {field_ranking.score:.2f}")

```

## 🛠️ Core Methods

| Method | Purpose | Input Model | Options Model | Output Model |

|--------|---------|-------------|---------------|--------------|

| `find()` | Find concentration patterns | `FindInput` | `FindOptions` | `FindOutput` |

| `analyze()` | Statistical analysis | `AnalyzeInput` | `AnalyzeOptions` | `AnalyzeOutput` |

| `compare()` | Temporal comparison | `CompareInput` | `CompareOptions` | `CompareOutput` |

| `discover()` | Auto pattern discovery | `DiscoverInput` | `DiscoverOptions` | `DiscoverOutput` |

| `tree()` | Hierarchical visualization | `TreeInput` | `TreeOptions` | `TreeOutput` |

### Advanced Filtering Options

```python

# Complex analysis with multiple criteria

result = dataspot.find(

    FindInput(

        data=data,

        fields=["country", "device", "payment"],

        query={"country": ["US", "EU"]}  # Pre-filter data

    ),

    FindOptions(

        min_percentage=10.0,      # Only patterns with >10% concentration

        max_depth=3,             # Limit hierarchy depth

        contains="mobile",       # Must contain "mobile" in pattern

        min_count=50,           # At least 50 records

        sort_by="percentage",   # Sort by concentration strength

        limit=20                # Top 20 patterns

    )

)

```

## ⚡ Performance

Dataspot delivers consistent, predictable performance with exceptionally efficient memory usage and linear scaling.

### 🚀 Real-World Performance

| Dataset Size | Processing Time | Memory Usage | Patterns Found |

|--------------|-----------------|---------------|----------------|

| 1,000 records | **~5ms** | **~1.4MB** | 12 patterns |

| 10,000 records | **~43ms** | **~2.8MB** | 12 patterns |

| 100,000 records | **~375ms** | **~2.9MB** | 20 patterns |

| 1,000,000 records | **~3.7s** | **~3.0MB** | 20 patterns |

> **Benchmark Methodology**: Performance measured using validated testing with 5 iterations per dataset size on MacBook Pro (M-series). Test data specifications:

>

> - **JSON Size**: ~164 bytes per JSON record (~0.16 KB each)

> - **JSON Structure**: 8 keys per JSON record (`country`, `device`, `payment_method`, `amount`, `user_type`, `channel`, `status`, `id`)

> - **Analysis Scope**: 4 fields analyzed simultaneously (`country`, `device`, `payment_method`, `user_type`)

> - **Configuration**: `min_percentage=5.0`, `limit=50` patterns

> - **Results**: Consistently finds 12 concentration patterns across all dataset sizes

> - **Variance**: Minimal timing variance (±1-6ms), demonstrating algorithmic stability

> - **Memory Efficiency**: Near-constant memory usage regardless of dataset size

### 💡 Performance Tips

```python

# Optimize for speed

result = dataspot.find(

    FindInput(data=large_dataset, fields=fields),

    FindOptions(

        min_percentage=10.0,    # Skip low-concentration patterns

        max_depth=3,           # Limit hierarchy depth

        limit=100             # Cap results

    )

)

# Memory efficient processing

from dataspot.models.tree import TreeInput, TreeOptions

tree = dataspot.tree(

    TreeInput(data=data, fields=["country", "device"]),

    TreeOptions(min_value=10, top=5)  # Simplified tree

)

```

## 📈 What Makes Dataspot Different?

| **Traditional Clustering** | **Dataspot Analysis** |

|---------------------------|---------------------|

| Groups similar data points | **Finds concentration patterns** |

| Equal-sized clusters | **Identifies where data accumulates** |

| Distance-based | **Percentage and count based** |

| Hard to interpret | **Business-friendly hierarchy** |

| Generic approach | **Built for real-world analysis** |

## 🎬 Dataspot in Action

[View the algorithm](https://frauddi.github.io/dataspot/algorithm-dataspot.html)

![Dataspot in action - Finding data concentration patterns](algorithm-dataspot.gif)

See Dataspot discover concentration patterns and dataspots in real-time with hierarchical analysis and statistical insights.

## 📊 API Structure

### Input Models

- `FindInput` - Data and fields for pattern finding

- `AnalyzeInput` - Statistical analysis configuration

- `CompareInput` - Current vs baseline data comparison

- `DiscoverInput` - Automatic pattern discovery

- `TreeInput` - Hierarchical tree visualization

### Options Models

- `FindOptions` - Filtering and sorting for patterns

- `AnalyzeOptions` - Statistical analysis parameters

- `CompareOptions` - Change detection thresholds

- `DiscoverOptions` - Auto-discovery constraints

- `TreeOptions` - Tree structure customization

### Output Models

- `FindOutput` - Pattern discovery results with statistics

- `AnalyzeOutput` - Enhanced analysis with insights and confidence scores

- `CompareOutput` - Change detection results with significance tests

- `DiscoverOutput` - Auto-discovery findings with field rankings

- `TreeOutput` - Hierarchical tree structure with navigation

## 🔧 Installation & Requirements

```bash

# Install from PyPI

pip install dataspot

# Development installation

git clone https://github.com/frauddi/dataspot.git

cd dataspot

pip install -e ".[dev]"

```

**Requirements:**

- Python 3.9+

- No heavy dependencies (just standard library + optional speedups)

## 🛠️ Development Commands

| Command | Description |

|---------|-------------|

| `make lint` | Check code for style and quality issues |

| `make lint-fix` | Automatically fix linting issues where possible |

| `make tests` | Run all tests with coverage reporting |

| `make check` | Run both linting and tests |

| `make clean` | Remove cache files, build artifacts, and temporary files |

| `make install` | Create virtual environment and install dependencies |

## 📚 Documentation & Examples

- 📖 [User Guide](docs/user-guide.md) - Complete usage documentation

- 💡 [Examples](examples/) - Real-world usage examples:

  - `01_basic_query_filtering.py` - Query and filtering basics

  - `02_pattern_filtering_basic.py` - Pattern-based filtering

  - `06_real_world_scenarios.py` - Business use cases

  - `08_auto_discovery.py` - Automatic pattern discovery

  - `09_temporal_comparison.py` - A/B testing and change detection

  - `10_stats.py` - Statistical analysis

- 🤝 [Contributing](docs/CONTRIBUTING.md) - How to contribute

## 🌟 Why Open Source?

Dataspot was born from real-world fraud detection needs at Frauddi. We believe powerful pattern analysis shouldn't be locked behind closed doors. By open-sourcing Dataspot, we hope to:

- 🎯 **Advance fraud detection** across the industry

- 🤝 **Enable collaboration** on pattern analysis techniques

- 🔍 **Help companies** spot issues in their data

- 📈 **Improve data quality** everywhere

## 🤝 Contributing

We welcome contributions! Whether you're:

- 🐛 Reporting bugs

- 💡 Suggesting features

- 📝 Improving documentation

- 🔧 Adding new analysis methods

See our [Contributing Guide](docs/CONTRIBUTING.md) for details.

## 📄 License

MIT License - see [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- **Created by [@3l1070r](https://github.com/3l1070r)** - Original algorithm and implementation

- **Sponsored by [Frauddi](https://frauddi.com)** - Field testing and open source support

- **Inspired by real fraud detection challenges** - Built to solve actual problems

## 🔗 Links

- 🏠 [Homepage](https://github.com/frauddi/dataspot)

- 📦 [PyPI Package](https://pypi.org/project/dataspot/)

- 🐛 [Issue Tracker](https://github.com/frauddi/dataspot/issues)

---

**Find your data's dataspots. Discover what others miss.**

Built with ❤️ by [Frauddi](https://frauddi.com)
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/frauddi/dataspot

Awesome Lists containing this project

README