https://github.com/dav009/abacus
Counter Data structure for Golang using CountMin Sketch with a fixed amount of memory
- Host: GitHub
- URL: https://github.com/dav009/abacus
- Owner: dav009
- Created: 2017-12-11T09:35:43.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2018-01-04T06:52:59.000Z (about 8 years ago)
- Last Synced: 2025-05-08T16:58:14.202Z (8 months ago)
- Topics: cm, count, counter, frequency, golang, minsketch, probabilistic, word
- Language: Go
- Homepage:
- Size: 41 KB
- Stars: 45
- Watchers: 5
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
README

# Abacus
Abacus lets you count item frequencies in big datasets with a fixed amount of memory.
Unlike a regular counter, it trades accuracy for memory.
This is useful when approximate counts are good enough, for example in NLP/ML tasks where you need to count millions of items.
Example:
```go
counter := abacus.New(10) // maxMemoryMB = 10: Abacus will use at most 10MB to store your counts
counter.Update([]string{"item1", "item2", "item2"})
counter.Counts("item1")   // 1, the count for "item1"
counter.Total()           // 3, the total number of counts (sum of the counts of all items)
counter.Cardinality()     // 2, the number of distinct items
```
Abacus lets you define how much memory you want to use, and you start counting items from there.
There are limitations: if you set the memory threshold too low for your data, you might get inaccurate counts.
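
To give a feel for why a low memory threshold hurts accuracy, here is a rough sketch of how a memory budget could translate into count-min sketch dimensions and a worst-case overcount. The names `sketchWidth` and `overcountBound`, the depth of 8 rows, and the 4-byte counters are assumptions made for illustration; they are not Abacus's actual parameters.

```go
package main

import (
	"fmt"
	"math"
)

// sketchWidth returns how many counters fit in each row of a count-min
// sketch for a given memory budget. depth and bytesPerCounter are
// illustrative parameters, not Abacus settings.
func sketchWidth(maxMemoryMB, depth, bytesPerCounter int) int {
	totalBytes := maxMemoryMB * 1024 * 1024
	return totalBytes / (depth * bytesPerCounter)
}

// overcountBound returns the classic count-min guarantee: with probability
// at least 1 - e^(-depth), an estimate exceeds the true count by at most
// (e / width) * total, where total is the number of updates seen so far.
func overcountBound(width int, total uint64) float64 {
	return math.E / float64(width) * float64(total)
}

func main() {
	const depth, bytesPerCounter = 8, 4 // assumed values
	for _, mb := range []int{10, 100, 1000} {
		w := sketchWidth(mb, depth, bytesPerCounter)
		fmt.Printf("%5d MB -> width %d, overcount bound after 1e9 updates: %.0f\n",
			mb, w, overcountBound(w, 1_000_000_000))
	}
}
```

The bound shrinks in proportion to the width, so ten times more memory allows roughly ten times less overcounting for the same stream.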
## Benchmarks
- Counting bigrams (words) from [Wiki corpus](http://www.cs.upc.edu/~nlp/wikicorpus/).
- Compared memory and accuracy of `Abacus` vs using a `map[string]int`
| Corpus | Data Structure | Used Memory | Accuracy |
|---------|-----------------|-----------------|-----------|
| Half of Wiki corpus (English) | Abacus (1000MB) | 1.75GB | 96% |
| Half of Wiki corpus (English) | Abacus (Log8) (200MB) | 369MB | 70% |
| Half of Wiki corpus (English) | Abacus (Log8) (400MB) | 407MB | 98% |
| Half of Wiki corpus (English) | Map | 3.3GB | 100% |

| Corpus | Data Structure | Used Memory | Accuracy |
|---------|-----------------|-----------------|-----------|
| Complete Wiki corpus (English) | Abacus (2200MB) | 3.63GB | 98% |
| Complete Wiki corpus (English) | Abacus (500MB) | 741MB | 15% |
| Complete Wiki corpus (English) | Abacus (Log8) (500MB) | 760MB | 90% |
| Complete Wiki corpus (English) | Abacus (Log8) (700MB) | 889MB | 97% |
| Complete Wiki corpus (English) | Map | 10.46GB | 100% |
Note: This is me playing with Golang again. Abacus is heavily based on [Bounter](https://github.com/RaRe-Technologies/bounter).
## Under the hood
### Count–min sketch
Used to count item frequencies.
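
As a rough illustration of the idea (not Abacus's actual code), a count-min sketch keeps a few rows of counters, hashes each item to one counter per row, and reports the minimum of those counters as the estimate. The `CountMin` type, the FNV-1a hash, and the double-hashing scheme below are assumptions chosen to keep the sketch short.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// CountMin is a minimal count-min sketch: depth rows of width counters.
type CountMin struct {
	width, depth uint32
	rows         [][]uint32
}

// NewCountMin allocates a sketch; width and depth must both be at least 1.
func NewCountMin(width, depth uint32) *CountMin {
	rows := make([][]uint32, depth)
	for i := range rows {
		rows[i] = make([]uint32, width)
	}
	return &CountMin{width: width, depth: depth, rows: rows}
}

// baseHashes splits one 64-bit FNV-1a hash into two 32-bit halves.
func baseHashes(item string) (uint32, uint32) {
	h := fnv.New64a()
	h.Write([]byte(item))
	sum := h.Sum64()
	return uint32(sum), uint32(sum >> 32)
}

// index picks the column for a row using double hashing: h1 + row*h2.
func (c *CountMin) index(h1, h2, row uint32) uint32 {
	return (h1 + row*h2) % c.width
}

// Add increments one counter per row for the item.
func (c *CountMin) Add(item string) {
	h1, h2 := baseHashes(item)
	for r := uint32(0); r < c.depth; r++ {
		c.rows[r][c.index(h1, h2, r)]++
	}
}

// Estimate returns the smallest counter the item maps to. Collisions can
// only inflate counters, so the result is never below the true count.
func (c *CountMin) Estimate(item string) uint32 {
	h1, h2 := baseHashes(item)
	best := c.rows[0][c.index(h1, h2, 0)]
	for r := uint32(1); r < c.depth; r++ {
		if v := c.rows[r][c.index(h1, h2, r)]; v < best {
			best = v
		}
	}
	return best
}

func main() {
	cm := NewCountMin(1024, 4)
	for _, item := range []string{"item1", "item2", "item2"} {
		cm.Add(item)
	}
	fmt.Println(cm.Estimate("item1"), cm.Estimate("item2"))
}
```

Memory use is fixed at `width * depth` counters no matter how many distinct items are added, which is what makes the fixed memory budget possible.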
### HyperLogLog
Used to estimate the cardinality (the number of distinct items).
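
A simplified sketch of the technique, again not Abacus's own implementation: each item is hashed, the top bits pick a register, and the register remembers the longest run of leading zeros seen in the remaining bits. The register count (4096), the FNV-1a hash with a murmur-style mixing step, and the small-range linear-counting correction are assumptions of this example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"math/bits"
)

const (
	p = 12     // number of hash bits used to pick a register (assumed)
	m = 1 << p // number of registers
)

// HLL is a minimal HyperLogLog with m one-byte registers.
type HLL struct {
	registers [m]uint8
}

// mix64 scrambles the hash so that the high bits are well distributed.
func mix64(x uint64) uint64 {
	x ^= x >> 33
	x *= 0xff51afd7ed558ccd
	x ^= x >> 33
	x *= 0xc4ceb9fe1a85ec53
	x ^= x >> 33
	return x
}

// Add records an item. Adding the same item again changes nothing.
func (h *HLL) Add(item string) {
	f := fnv.New64a()
	f.Write([]byte(item))
	x := mix64(f.Sum64())
	idx := x >> (64 - p) // top p bits select the register
	rest := x << p       // remaining bits, left-aligned
	rank := uint8(bits.LeadingZeros64(rest)) + 1
	if rank > 64-p+1 {
		rank = 64 - p + 1 // all remaining bits were zero
	}
	if rank > h.registers[idx] {
		h.registers[idx] = rank
	}
}

// Estimate returns the approximate number of distinct items added.
func (h *HLL) Estimate() float64 {
	sum := 0.0
	zeros := 0
	for _, r := range h.registers {
		sum += math.Pow(2, -float64(r))
		if r == 0 {
			zeros++
		}
	}
	alpha := 0.7213 / (1 + 1.079/float64(m))
	est := alpha * m * m / sum
	// For small cardinalities, linear counting on empty registers is more accurate.
	if est <= 2.5*m && zeros > 0 {
		est = m * math.Log(float64(m)/float64(zeros))
	}
	return est
}

func main() {
	var h HLL
	for _, item := range []string{"item1", "item2", "item2"} {
		h.Add(item)
	}
	fmt.Printf("estimated distinct items: %.2f\n", h.Estimate())
}
```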
-----------
Icon made by [free-icon](https://www.flaticon.com/free-icon/)