# Akade.IndexedSet

![.Net Version](https://img.shields.io/badge/dynamic/xml?color=%23512bd4&label=version&query=%2F%2FTargetFrameworks%5B1%5D&url=https://raw.githubusercontent.com/akade/Akade.IndexedSet/main/Akade.IndexedSet/Akade.IndexedSet.csproj&logo=.net)
[![CI Build](https://github.com/akade/Akade.IndexedSet/actions/workflows/ci-build.yml/badge.svg?branch=master)](https://github.com/akade/Akade.IndexedSet/actions/workflows/ci-build.yml)
[![NuGet version (Akade.IndexedSet)](https://img.shields.io/nuget/v/Akade.IndexedSet.svg)](https://www.nuget.org/packages/Akade.IndexedSet/)
[![MIT](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/akade/Akade.IndexedSet#readme)
[![Static Badge](https://img.shields.io/badge/API%20Docs-DNDocs-43bc00?logo=readme&logoColor=white)](https://www.robiniadocs.com/d/akadeinde/api/Akade.IndexedSet.IndexedSet-1.html)

A convenient data structure supporting efficient in-memory indexing and querying, including range queries and fuzzy string matching.
In a nutshell, it allows you to write LINQ-like queries *without* enumerating through the entire list. If you are currently enumerating
through all of your data for every query, expect huge [speedups](docs/Benchmarks.md) and much better scalability!

- [Overview](#overview)
  - [Design Goals](#design-goals)
  - [Performance and Operation-Support of the different indices:](#performance-and-operation-support-of-the-different-indices)
    - [General queries](#general-queries)
    - [String queries](#string-queries)
- [Features](#features)
  - [Unique index (single entity, single key)](#unique-index-single-entity-single-key)
  - [Non-unique index (multiple entities, single key)](#non-unique-index-multiple-entities-single-key)
  - [Non-unique index (multiple entities, multiple keys)](#non-unique-index-multiple-entities-multiple-keys)
  - [Range index](#range-index)
  - [String indices and fuzzy matching](#string-indices-and-fuzzy-matching)
  - [Computed or compound key](#computed-or-compound-key)
  - [Concurrency and Thread-Safety](#concurrency-and-thread-safety)
  - [No reflection and no expressions - convention-based index naming](#no-reflection-and-no-expressions-convention-based-index-naming)
- [FAQs](#faqs)
  - [How do I use multiple index types for the same property?](#how-do-i-use-multiple-index-types-for-the-same-property)
  - [How do I update key values if the elements are already in the set?](#how-do-i-update-key-values-if-the-elements-are-already-in-the-set)
  - [How do I do case-insensitive (fuzzy) string matching (Prefix, FullTextIndex)?](#how-do-i-do-case-insensitve-fuzzy-string-matching-prefix-fulltextindex)
- [Roadmap](#roadmap)

## Overview

A sample showing different queries as you might want to run them for a report:

```csharp
// typically, you would query this from the db
var data = new Purchase[] {
    new(Id: 1, ProductId: 1, Amount: 1, UnitPrice: 5),
    new(Id: 2, ProductId: 1, Amount: 2, UnitPrice: 5),
    new(Id: 6, ProductId: 4, Amount: 3, UnitPrice: 12),
    new(Id: 7, ProductId: 4, Amount: 8, UnitPrice: 10) // discounted price
};

IndexedSet<int, Purchase> set = data.ToIndexedSet(x => x.Id)
                                    .WithIndex(x => x.ProductId)
                                    .WithRangeIndex(x => x.Amount)
                                    .WithRangeIndex(x => x.UnitPrice)
                                    .WithRangeIndex(x => x.Amount * x.UnitPrice)
                                    .WithIndex(x => (x.ProductId, x.UnitPrice))
                                    .Build();

// efficient queries on configured indices
// in contrast to standard LINQ, they do not enumerate the entire list!
_ = set.Where(x => x.ProductId, 4);
_ = set.Range(x => x.Amount, 1, 3, inclusiveStart: true, inclusiveEnd: true);
_ = set.GreaterThanOrEqual(x => x.UnitPrice, 10);
_ = set.MaxBy(x => x.Amount * x.UnitPrice);
_ = set.Where(x => (x.ProductId, x.UnitPrice), (4, 10));
```
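For comparison, the following sketch expresses the same report queries with plain LINQ, which enumerates the whole array for every query. It uses only `System.Linq`; the shape of the `Purchase` record (all `int` properties) is an assumption, since the README does not show its definition.

```csharp
using System;
using System.Linq;

var data = new Purchase[]
{
    new(Id: 1, ProductId: 1, Amount: 1, UnitPrice: 5),
    new(Id: 2, ProductId: 1, Amount: 2, UnitPrice: 5),
    new(Id: 6, ProductId: 4, Amount: 3, UnitPrice: 12),
    new(Id: 7, ProductId: 4, Amount: 8, UnitPrice: 10),
};

// Each query below walks all n elements, regardless of how many match.
Purchase[] byProduct = data.Where(x => x.ProductId == 4).ToArray();
Purchase[] amountRange = data.Where(x => x.Amount >= 1 && x.Amount <= 3).ToArray();
Purchase[] priceAtLeast10 = data.Where(x => x.UnitPrice >= 10).ToArray();
Purchase? largest = data.MaxBy(x => x.Amount * x.UnitPrice);

Console.WriteLine(byProduct.Length);      // prints 2
Console.WriteLine(amountRange.Length);    // prints 3
Console.WriteLine(priceAtLeast10.Length); // prints 2
Console.WriteLine(largest?.Id);           // prints 7

// Assumed record shape; not taken from the README.
record Purchase(int Id, int ProductId, int Amount, int UnitPrice);
```

With only four elements the difference is irrelevant, but the cost of each LINQ query grows with the size of the collection rather than with the size of the result.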

### Design Goals
- Much faster solution than (naive) LINQ-based full-enumeration
- Syntax close to LINQ-Queries
- Easy to use with a fluent builder API
- Reflection & Expression-free to be AOT & Trimming friendly (for example for Blazor/WebASM)
- It's not a db - in-memory only

### Performance and Operation-Support of the different indices:

The tables below list the runtime complexities. Benchmarks can be found [here](docs/Benchmarks.md).

#### General queries

- n: total number of elements
- m: number of elements in the return set
- ✔: Supported
- ⚠: Supported but throws if not exactly 1 item was found
- ❌: Not-supported

| Query | Unique-Index | NonUnique-Index | Range-Index |
| --------- | ------------ | --------------- | --------------- |
| Single | ⚠ O(1) | ⚠ O(1) | ⚠ O(log n) |
| Where | ✔ O(1) | ✔ O(m) | ✔ O(log n + m) |
| Range | ❌ | ❌ | ✔ O(log n + m) |
| < / <= | ❌ | ❌ | ✔ O(log n + m) |
| > / >= | ❌ | ❌ | ✔ O(log n + m) |
| OrderBy | ❌ | ❌ | ✔ O(m) |
| Max/Min | ❌ | ❌ | ✔ O(1) |

#### String queries

- w: length of query word
- D: maximum distance in fuzzy query
- r: number of items in result set

| Query            | Prefix-Index | FullText-Index |
| ---------------- | ------------ | -------------- |
| StartsWith       | ✔ O(w+r)     | ✔ O(w+r)       |
| Contains         | ❌           | ✔ O(w+r)       |
| Fuzzy StartsWith | ✔ O(w+D+r)   | ✔ O(w+D+r)     |
| Fuzzy Contains   | ❌           | ✔ O(w+D+r)     |

> ℹ FullText indices use a lot more memory than prefix indices and are more expensive to construct. Only
> use FullText indices if you really require them.
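To make the `O(w+r)` bound for prefix queries more concrete, here is a minimal prefix-trie sketch. It is an illustration under assumptions, not the library's implementation: `PrefixTrieSketch` and its members are hypothetical names, and the sketch stores whole words rather than indexed elements.

```csharp
using System;
using System.Collections.Generic;

var trie = new PrefixTrieSketch();
foreach (string word in new[] { "Int16", "Int32", "IntPtr", "String" })
{
    trie.Add(word);
}

Console.WriteLine(string.Join(", ", trie.StartsWith("Int"))); // prints "Int16, Int32, IntPtr"

// Hypothetical sketch of a prefix trie; one node per character.
sealed class PrefixTrieSketch
{
    private sealed class Node
    {
        // Sorted children give a deterministic enumeration order.
        public readonly SortedDictionary<char, Node> Children = new();
        public string? Word; // set when a stored word ends at this node
    }

    private readonly Node _root = new();

    public void Add(string word)
    {
        Node node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out Node? child))
            {
                node.Children[c] = child = new Node();
            }
            node = child;
        }
        node.Word = word;
    }

    public IEnumerable<string> StartsWith(string prefix)
    {
        Node node = _root;
        foreach (char c in prefix) // walk down one node per query character (the "w" part)
        {
            if (!node.Children.TryGetValue(c, out Node? next)) yield break;
            node = next;
        }
        foreach (string word in Collect(node)) yield return word; // enumerate the subtree below
    }

    private static IEnumerable<string> Collect(Node node)
    {
        if (node.Word is not null) yield return node.Word;
        foreach (Node child in node.Children.Values)
        {
            foreach (string word in Collect(child)) yield return word;
        }
    }
}
```

The lookup never touches words that do not share the prefix, which is why the cost depends on the query length and the matches rather than on the total number of elements.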

## Features

### Unique index (single entity, single key)
Dictionary-based O(1) access on keys:

```csharp
IndexedSet<int, Data> set = IndexedSetBuilder<Data>.Create(a => a.PrimaryKey)
                                                   .WithUniqueIndex(x => x.SecondaryKey)
                                                   .Build();

_ = set.Add(new(PrimaryKey: 1, SecondaryKey: 5));

// fast access via primary key
Data data = set[1];

// fast access via secondary key
data = set.Single(x => x.SecondaryKey, 5);
```

> ℹ Entities do not require a primary key. `IndexedSet<TPrimaryKey, TElement>` inherits from `IndexedSet<TElement>`
> but provides convenient access to the automatically added unique index: `set[primaryKey]` instead
> of `set.Single(x => x.PrimaryKey, primaryKey)`.

### Non-unique index (multiple entities, single key)
Dictionary-based O(1) access on keys, where a single key value can map to multiple entities:

```csharp
IndexedSet<int, Data> set = new Data[] { new(PrimaryKey: 1, SecondaryKey: 5), new(PrimaryKey: 2, SecondaryKey: 5) }
    .ToIndexedSet(x => x.PrimaryKey)
    .WithIndex(x => x.SecondaryKey)
    .Build();

// fast access via secondary key
IEnumerable<Data> data = set.Where(x => x.SecondaryKey, 5);
```
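Conceptually, a non-unique index is a map from a key to all elements that share it. The sketch below illustrates that idea with `Enumerable.ToLookup` from the base class library; it is not how the library is implemented (a lookup is immutable, whereas the set supports adding and removing elements), and the tuple stands in for the `Data` type.

```csharp
using System;
using System.Linq;

// Illustration only: ToLookup builds an immutable key -> elements map.
var data = new[]
{
    (PrimaryKey: 1, SecondaryKey: 5),
    (PrimaryKey: 2, SecondaryKey: 5),
    (PrimaryKey: 3, SecondaryKey: 7),
};

ILookup<int, (int PrimaryKey, int SecondaryKey)> bySecondaryKey =
    data.ToLookup(x => x.SecondaryKey);

// A hash lookup followed by enumerating the matches; no scan over all elements.
Console.WriteLine(bySecondaryKey[5].Count()); // prints 2
Console.WriteLine(bySecondaryKey[42].Any());  // prints False (missing keys yield an empty sequence)
```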

### Non-unique index (multiple entities, multiple keys)
Dictionary-based O(1) access on denormalized keys, i.e. each entity can be found under multiple keys:
```csharp

IndexedSet<int, GraphNode> set = IndexedSetBuilder<GraphNode>.Create(a => a.Id)
                                                             .WithIndex(x => x.ConnectsTo) // Where ConnectsTo returns an IEnumerable<int>
                                                             .Build();

// 1 2
// |\ /
// | 3
// \|
// 4

_ = set.Add(new(Id: 1, ConnectsTo: new[] { 3, 4 }));
_ = set.Add(new(Id: 2, ConnectsTo: new[] { 3 }));
_ = set.Add(new(Id: 3, ConnectsTo: new[] { 1, 2, 3 }));
_ = set.Add(new(Id: 4, ConnectsTo: new[] { 1, 3 }));

// For readability, it is recommended to write the name of the parameter "contains"
IEnumerable<GraphNode> nodesThatConnectTo1 = set.Where(x => x.ConnectsTo, contains: 1); // returns nodes 3 & 4
IEnumerable<GraphNode> nodesThatConnectTo3 = set.Where(x => x.ConnectsTo, contains: 3); // returns nodes 1, 2, 3 & 4 (node 3 lists itself in the data above)

// Non-optimized Where(x => x.Contains(...)) query:
nodesThatConnectTo1 = set.FullScan().Where(x => x.ConnectsTo.Contains(1)); // returns nodes 3 & 4, but enumerates through the entire set
```
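The "denormalized" part means that every element is registered once per key it exposes. A minimal sketch of that bookkeeping, using a plain `Dictionary<int, List<int>>` and tuples instead of the library's types (both are assumptions made for the illustration):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: register every node under each of the keys it connects to.
var nodes = new[]
{
    (Id: 1, ConnectsTo: new[] { 3, 4 }),
    (Id: 2, ConnectsTo: new[] { 3 }),
    (Id: 4, ConnectsTo: new[] { 1, 3 }),
};

var index = new Dictionary<int, List<int>>(); // key -> ids of the nodes that contain the key
foreach (var node in nodes)
{
    foreach (int key in node.ConnectsTo)
    {
        if (!index.TryGetValue(key, out List<int>? ids))
        {
            index[key] = ids = new List<int>();
        }
        ids.Add(node.Id);
    }
}

Console.WriteLine(string.Join(", ", index[3])); // prints "1, 2, 4"
Console.WriteLine(string.Join(", ", index[1])); // prints "4"
```

A query such as `contains: 3` then becomes a single dictionary lookup instead of calling `Contains` on every element's collection.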

### Range index
Binary-heap based O(log(n)) access for range queries, less-than (or equal) and greater-than (or equal) comparisons, and `OrderBy` queries. Also useful for paging sorted on exactly one index.

```csharp
IndexedSet<Data> set = IndexedSetBuilder.Create(new Data[] { new(1, SecondaryKey: 3), new(2, SecondaryKey: 4) })
    .WithRangeIndex(x => x.SecondaryKey)
    .Build();

// fast access via range query
IEnumerable<Data> data = set.Range(x => x.SecondaryKey, 1, 5);

// fast max & min key value or elements
int maxKey = set.Max(x => x.SecondaryKey);
data = set.MaxBy(x => x.SecondaryKey);

// fast larger or smaller than
data = set.LessThan(x => x.SecondaryKey, 4);

// fast ordering & paging
data = set.OrderBy(x => x.SecondaryKey, skip: 10).Take(10); // second page of 10 elements
```
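The `O(log n + m)` cost of a range query can be illustrated with a binary search over sorted keys. This is a sketch only: it uses a plain sorted array rather than the library's own structure, the half-open interval `[start, end)` is an assumption made for the example, and `LowerBound` and `Range` are hypothetical helpers.

```csharp
using System;
using System.Collections.Generic;

int[] sortedKeys = { 1, 3, 4, 4, 7, 9 };

// First position whose key is >= value (binary search, O(log n)).
int LowerBound(int[] keys, int value)
{
    int lo = 0, hi = keys.Length;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (keys[mid] < value) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

// Keys in [start, end): locate the start position, then walk the m matches.
IEnumerable<int> Range(int[] keys, int start, int end)
{
    for (int i = LowerBound(keys, start); i < keys.Length && keys[i] < end; i++)
    {
        yield return keys[i];
    }
}

Console.WriteLine(string.Join(", ", Range(sortedKeys, 3, 7))); // prints "3, 4, 4"
Console.WriteLine(sortedKeys[^1]);                             // largest key without a scan: prints 9
```

Because the keys are already ordered, ordering and paging amount to walking the array from a given offset, which matches the `OrderBy(..., skip: 10).Take(10)` pattern shown above.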

### String indices and fuzzy matching
Prefix- & Suffix-Trie based indices for efficient StartWith & String-Contains queries including support
for fuzzy matching.

```csharp
IndexedSet<Type> data = typeof(object).Assembly.GetTypes()
                                      .ToIndexedSet()
                                      .WithPrefixIndex(x => x.Name)
                                      .WithFullTextIndex(x => x.FullName)
                                      .Build();

// fast prefix or contains queries via indices
_ = data.StartsWith(x => x.Name, "Int");
_ = data.Contains(x => x.FullName, "Int");

// fuzzy searching is supported by prefix and full text indices
// the following will also match "String"
_ = data.FuzzyStartsWith(x => x.Name, "Strang", 1);
_ = data.FuzzyContains(x => x.FullName, "Strang", 1);
```
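The last argument of the fuzzy queries is the maximum allowed distance. Assuming that distance is the Levenshtein edit distance (the README does not name the metric), the following self-contained sketch shows why `"Strang"` matches `"String"` at distance 1. The `Levenshtein` function is a textbook implementation written for this illustration, not code from the library.

```csharp
using System;

// Classic dynamic-programming Levenshtein distance
// (counts insertions, deletions and substitutions).
int Levenshtein(string a, string b)
{
    var previous = new int[b.Length + 1];
    var current = new int[b.Length + 1];
    for (int j = 0; j <= b.Length; j++) previous[j] = j;

    for (int i = 1; i <= a.Length; i++)
    {
        current[0] = i;
        for (int j = 1; j <= b.Length; j++)
        {
            int substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            current[j] = Math.Min(substitution, Math.Min(previous[j] + 1, current[j - 1] + 1));
        }
        (previous, current) = (current, previous); // the finished row becomes "previous"
    }
    return previous[b.Length];
}

Console.WriteLine(Levenshtein("Strang", "String")); // prints 1
Console.WriteLine(Levenshtein("Int", "Int32"));     // prints 2
```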

### Computed or compound key

The data structure also supports computed or compound keys:

```csharp
var data = new RangeData[] { new(Start: 2, End: 10) };
IndexedSet<RangeData> set = data.ToIndexedSet()
                                .WithIndex(x => (x.Start, x.End))
                                .WithIndex(x => x.End - x.Start)
                                .WithIndex(ComputedKey.SomeStaticMethod)
                                .Build();
// fast access via indices
IEnumerable<RangeData> result = set.Where(x => (x.Start, x.End), (2, 10));
result = set.Where(x => x.End - x.Start, 8);
result = set.Where(ComputedKey.SomeStaticMethod, 42);
```
> ℹ For more samples, take a look at the unit tests.

### Concurrency and Thread-Safety

The "normal" `IndexedSet` is not thread-safe; however, a ReaderWriterLock-based implementation is available.
Just call `BuildConcurrent()` instead of `Build()`:

```csharp
ConcurrentIndexedSet<RangeData> set = data.ToIndexedSet()
                                          .WithIndex(x => (x.Start, x.End))
                                          .BuildConcurrent();
```

> ⚠ The concurrent implementation needs to materialize all query results.
>
> `OrderBy` and `OrderByDescending` take an additional `count` parameter to avoid unnecessary materialization.
> You can judge the overhead [here](docs/Benchmarks.md#ConcurrentSet).

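Why a lock-based set has to materialize its results can be seen in a small sketch. It assumes a wrapper around a plain `List<T>` guarded by `ReaderWriterLockSlim`; `LockedListSketch` is a hypothetical type written for this illustration and says nothing about how `ConcurrentIndexedSet` is implemented internally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

var set = new LockedListSketch<int>();
set.Add(1);
set.Add(2);
set.Add(3);

int[] even = set.Where(x => x % 2 == 0); // snapshot taken while the read lock is held
set.Add(4);                              // later writes do not affect the snapshot
Console.WriteLine(even.Length);          // prints 1

// Hypothetical sketch of a reader/writer-locked collection.
sealed class LockedListSketch<T>
{
    private readonly List<T> _items = new();
    private readonly ReaderWriterLockSlim _lock = new();

    public void Add(T item)
    {
        _lock.EnterWriteLock();
        try { _items.Add(item); }
        finally { _lock.ExitWriteLock(); }
    }

    // A lazy IEnumerable would be evaluated after the lock is released and could
    // race with writers, so ToArray() forces evaluation inside the lock.
    public T[] Where(Func<T, bool> predicate)
    {
        _lock.EnterReadLock();
        try { return _items.Where(predicate).ToArray(); }
        finally { _lock.ExitReadLock(); }
    }
}
```

The copy is the price of safety: each query allocates an array proportional to its result, which is why limiting the result size up front (as the `count` parameter does for ordering queries) keeps the overhead small.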
### No reflection and no expressions - convention-based index naming

We are using the [CallerArgumentExpression](https://docs.microsoft.com/en-us/dotnet/api/system.runtime.compilerservices.callerargumentexpressionattribute) feature
of .NET 6/C# 10 to provide convention-based naming of the indices:
- `set.Where(x => (x.Prop1, x.Prop2), (1, 2))` tries to use an index named `"x => (x.Prop1, x.Prop2)"`
- `set.Where(ComputedKeys.NumberOfDays, 5)` tries to use an index named `"ComputedKeys.NumberOfDays"`
- **Hence, be careful what you pass in.**

> ℹ The following naming conventions are recommended:
> - Use `x` as the parameter name in any lambda that determines an index name.
> - Do not use parentheses in any lambda that determines an index name.
> - Do not use block bodies in any lambda that determines an index name.
> - For complex indices, use a static method.
>
> [C# Analyzers](./Analyzers/Readme.md) are shipped with the package to spot incorrect index names.

Reasons:
- Simple and yet effective:
  - Allows computed, compound, custom values etc. to be indexed without adding complexity...
- Performance: No reflection at work and no (runtime) code-gen necessary
- AOT-friendly including full trimming support
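The naming convention follows directly from how `CallerArgumentExpression` works: the compiler passes the source text of an argument as a string. The sketch below demonstrates the attribute on its own; `NamingSketch.IndexNameOf` and `StringKeys` are hypothetical helpers, not part of the library's API.

```csharp
using System;
using System.Runtime.CompilerServices;

// The captured name is the literal source text of the argument.
Console.WriteLine(NamingSketch.IndexNameOf<string, int>(x => x.Length)); // prints "x => x.Length"
Console.WriteLine(NamingSketch.IndexNameOf<string, int>(s => s.Length)); // prints "s => s.Length"
Console.WriteLine(NamingSketch.IndexNameOf<string, int>(StringKeys.Length)); // prints "StringKeys.Length"

// Hypothetical helper for illustration only.
static class NamingSketch
{
    public static string IndexNameOf<TElement, TKey>(
        Func<TElement, TKey> keySelector,
        [CallerArgumentExpression("keySelector")] string? name = null)
        => name ?? throw new ArgumentNullException(nameof(name));
}

// A static method yields a stable name that does not depend on how a lambda is written.
static class StringKeys
{
    public static int Length(string s) => s.Length;
}
```

The first two calls select the same key but produce different names, so an index registered with one spelling would not be found with the other. That is the reason for recommending a single parameter name and static methods for anything complex.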

## FAQs

### How do I use multiple index types for the same property?

Use "named" indices by using static methods:

```csharp
record Data(int PrimaryKey, int SecondaryKey);

// Assumed definition (not shown in the original snippet): the static method gives the index its name
static class DataIndices
{
    public static int UniqueIndex(Data x) => x.SecondaryKey;
}

IndexedSet<int, Data> set = IndexedSetBuilder<Data>.Create(x => x.PrimaryKey)
    .WithUniqueIndex(DataIndices.UniqueIndex)
    .WithRangeIndex(x => x.SecondaryKey)
    .Build();
_ = set.Add(new(1, 4));
// querying unique index:
Data data = set.Single(DataIndices.UniqueIndex, 4); // Uses the unique index
Data data2 = set.Single(x => x.SecondaryKey, 4); // Uses the range index
IEnumerable<Data> inRange = set.Range(x => x.SecondaryKey, 1, 10); // Uses the range index
```

> ℹ We recommend using the lambda syntax for "simple" properties and static methods for more complicated ones. It's easy to read, resembles "normal" LINQ-Queries and all the magic strings are compiler generated.

### How do I update key values if the elements are already in the set?
**The implementation requires that key values of any type never change while the instance is within the set**.
You can manually remove, update and re-add an object. However, there are helper methods for this, which are especially
useful for the concurrent variant as they provide thread-safe, serialized access.

```csharp
// updating a mutable property
_ = set.Update(dataElement, e => e.MutableProperty = 7);
// updating an immutable property
_ = set.Update(dataElement, e => e with { SecondaryKey = 12 });
// be careful: dataElement still refers to the "old" record after the update above,
// so this second call no longer targets the element that is now in the set
_ = set.Update(dataElement, e => e with { SecondaryKey = 12 });

// updating in a concurrent set
concurrentSet.Update(set =>
{
    // serialized access to the inner IndexedSet, where you can safely use the above update methods
    // in a multi-threaded environment
});
```
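The requirement that keys stay fixed while an element is in the set is easiest to see with a plain dictionary standing in for an index. The sketch below is an illustration under that assumption; `Item` and the local `Update` helper are hypothetical and only mimic the remove, change, re-add sequence described above.

```csharp
using System;
using System.Collections.Generic;

var a = new Item { Key = 1 };
var b = new Item { Key = 10 };
var index = new Dictionary<int, Item> { [a.Key] = a, [b.Key] = b };

// Hypothetical helper: remove under the old key, apply the change, re-add under the new key.
void Update(Item element, Action<Item> change)
{
    index.Remove(element.Key);
    change(element);
    index.Add(element.Key, element);
}

// Changing a key in place leaves the index stale: the element is still filed under 1.
a.Key = 2;
Console.WriteLine(index.ContainsKey(2)); // prints False
Console.WriteLine(index[1].Key);         // prints 2

// Going through the helper keeps the index and the element consistent.
Update(b, x => x.Key = 20);
Console.WriteLine(index.ContainsKey(20)); // prints True
Console.WriteLine(index.ContainsKey(10)); // prints False

sealed class Item
{
    public int Key { get; set; }
}
```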

### How do I do case-insensitve (fuzzy) string matching (Prefix, FullTextIndex)?
Remember that you can index whatever you want, including computed properties. This also applies to fuzzy matching:

```csharp
IndexedSet<Data> set = IndexedSetBuilder<Data>.Create(x => x.PrimaryKey)
                                              .WithFullTextIndex(x => x.Text.ToLowerInvariant())
                                              .Build();
IEnumerable<Data> matches = set.FuzzyContains(x => x.Text.ToLowerInvariant(), "Search", maxDistance: 2);
```

## Roadmap
Potential features (not ordered):
- [x] Thread-safe version
- [x] Easier updating of keys
- [x] More index types (Trie)
- [x] Range insertion and corresponding `.ToIndexedSet().WithIndex(x => ...).[...].Build()`
- [x] Refactoring to allow a primarykey-less set: this was an artificial restriction that is not necessary
- [x] Benchmarks
- [x] Simplification of string indices, i.e. Span/String based overloads to avoid `AsMemory()`...
- [x] Analyzers to help with best practices
- [ ] Tree-based range index for better insertion performance
- [ ] Aggregates (i.e. sum or average: interface based on state & add/removal state update functions)
- [ ] Custom (equality) comparators for indices

If you have any suggestions or found a bug / unexpected behavior, open an issue! I will also review PRs and integrate them if they fit the project.