https://github.com/teragrep/dpf_03

Teragrep Tokenizer for Apache Spark

apache-spark bloom-filter bloomfilter spark teragrep tokenization tokenizer unstructured-data

Host: GitHub
URL: https://github.com/teragrep/dpf_03
Owner: teragrep
License: agpl-3.0
Created: 2023-01-17T07:14:52.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2024-11-04T16:20:54.000Z (over 1 year ago)
Last Synced: 2025-03-29T15:02:19.109Z (about 1 year ago)
Topics: apache-spark, bloom-filter, bloomfilter, spark, teragrep, tokenization, tokenizer, unstructured-data
Language: Scala
Homepage: https://teragrep.com/
Size: 78.1 KB
Stars: 0
Watchers: 2
Forks: 4
Open Issues: 5
Metadata Files:
- Readme: README.adoc
- License: LICENSE

README

link:https://scan.coverity.com/projects/teragrep-dpf_03[image:https://img.shields.io/coverity/scan/30737.svg[Coverity Scan Build Status]]

# DPF_03

Provides a customized lexical tokenizer as a Spark UDF, together with a Bloom filter aggregator.
Use it to tokenize Spark string columns and aggregate the resulting tokens into a Bloom filter whose size is selected from configurable presets.

## Features

### TokenizerUDF

Spark UDF that tokenizes an incoming string value and returns the tokens as a list of byte arrays.
Tokenization rules are defined in blf_01.
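
To make the input and output shapes concrete, here is a minimal, Spark-free sketch of the idea: a string goes in, a list of byte arrays comes out. The class name `TokenizerSketch`, the delimiter set, and the UTF-8 encoding are all assumptions made for illustration. The real rules live in blf_01 and may differ substantially.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the tokenizer; the real rules are defined in blf_01.
final class TokenizerSketch {

    // Assumed delimiter set, chosen only for illustration.
    private static final String DELIMITERS = " \t\n=,;:/[]()\"";

    // Splits the input on the assumed delimiters and returns each
    // non-empty token as its UTF-8 bytes.
    static List<byte[]> tokenize(String input) {
        List<byte[]> tokens = new ArrayList<>();
        if (input == null) {
            return tokens; // null column value yields no tokens
        }
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (DELIMITERS.indexOf(c) >= 0) {
                flush(current, tokens);
            } else {
                current.append(c);
            }
        }
        flush(current, tokens);
        return tokens;
    }

    // Emits the pending token, if any, and resets the buffer.
    private static void flush(StringBuilder current, List<byte[]> tokens) {
        if (current.length() > 0) {
            tokens.add(current.toString().getBytes(StandardCharsets.UTF_8));
            current.setLength(0);
        }
    }
}
```

Under these assumed delimiters, `TokenizerSketch.tokenize("user=alice ip=10.0.0.1")` yields four tokens: `user`, `alice`, `ip` and `10.0.0.1`.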

### ByteArrayListAsStringListUDF

Spark UDF that converts results from TokenizerUDF into a list of strings.
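
The conversion itself can be sketched in a few lines of plain Java. The class name `ByteArrayListToStringsSketch` is hypothetical, and decoding as UTF-8 is an assumption; the actual UDF may use a different charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of turning tokenizer output into readable strings.
final class ByteArrayListToStringsSketch {

    // Decodes every byte array as UTF-8 (assumed encoding), preserving order.
    static List<String> asStrings(List<byte[]> tokens) {
        List<String> result = new ArrayList<>(tokens.size());
        for (byte[] token : tokens) {
            result.add(new String(token, StandardCharsets.UTF_8));
        }
        return result;
    }
}
```

A step like this is mainly useful for inspecting or debugging tokenizer output, since byte arrays are not human-readable when a DataFrame is displayed.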

### BloomFilterAggregator

Custom Spark aggregator that aggregates a column of string tokens into a single Bloom filter and
returns the bytes of the resulting filter for further processing.

Filter size is selected by giving the aggregator the name of a Spark column that holds the estimated number of tokens and
by configuring a map of Bloom filter presets (expected number of items, false positive probability).
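
One plausible way to read the size selection is sketched below, without any Spark dependency. The class `FilterSizeSketch` and the rule "pick the smallest preset whose expected item count covers the estimate, else fall back to the largest" are assumptions, not the confirmed behaviour of the aggregator. The bit-count formula is the standard Bloom filter sizing formula m = -n ln(p) / (ln 2)^2.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of preset-based Bloom filter size selection.
final class FilterSizeSketch {

    // Presets keyed by expected number of items; the value is the
    // false positive probability configured for that preset.
    private final TreeMap<Long, Double> presets;

    FilterSizeSketch(Map<Long, Double> presets) {
        if (presets.isEmpty()) {
            throw new IllegalArgumentException("at least one preset is required");
        }
        this.presets = new TreeMap<>(presets);
    }

    // Assumed rule: smallest preset that can hold the estimate,
    // or the largest preset when the estimate exceeds all of them.
    Map.Entry<Long, Double> select(long estimatedTokens) {
        Map.Entry<Long, Double> fit = presets.ceilingEntry(estimatedTokens);
        return fit != null ? fit : presets.lastEntry();
    }

    // Standard sizing formula: m = -n * ln(p) / (ln 2)^2, rounded up.
    static long optimalBits(long expectedItems, double fpp) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedItems * Math.log(fpp) / (ln2 * ln2));
    }
}
```

With presets for 1,000, 100,000 and 1,000,000 expected items, an estimate of 50,000 tokens would select the 100,000-item preset under this rule. A preset of 1,000 items at a 1% false positive probability needs roughly 9,600 bits.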

## Documentation

See the official documentation on https://docs.teragrep.com[docs.teragrep.com].

## Limitations

Compatible with Java 1.8; other versions might not work.

## How to use

See the tests for examples of how to import and apply this in a Spark project.

## Contributing

You can get involved in the project by https://github.com/teragrep/dpf_03/issues/new/choose[opening an issue] or submitting a pull request.

Contribution requirements:

. *All changes must be accompanied by a new or changed test.* If you think testing is not required in your pull request, include a sufficient explanation as to why you think so.
. Security checks must pass.
. Pull requests must align with the principles and http://www.extremeprogramming.org/values.html[values] of extreme programming.
. Pull requests must follow the principles of Object Thinking and Elegant Objects (EO).

Read more in our https://github.com/teragrep/teragrep/blob/main/contributing.adoc[Contributing Guideline].

### Contributor License Agreement

Contributors must sign the https://github.com/teragrep/teragrep/blob/main/cla.adoc[Teragrep Contributor License Agreement] before a pull request is accepted into the organization's repositories.

You need to submit the CLA only once. After submitting it, you can contribute to all of Teragrep's repositories.
