https://github.com/teragrep/dpf_03
Teragrep Tokenizer for Apache Spark
https://github.com/teragrep/dpf_03
apache-spark bloom-filter bloomfilter spark teragrep tokenization tokenizer unstructured-data
Last synced: 5 months ago
JSON representation
Teragrep Tokenizer for Apache Spark
- Host: GitHub
- URL: https://github.com/teragrep/dpf_03
- Owner: teragrep
- License: agpl-3.0
- Created: 2023-01-17T07:14:52.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2024-11-04T16:20:54.000Z (over 1 year ago)
- Last Synced: 2025-03-29T15:02:19.109Z (12 months ago)
- Topics: apache-spark, bloom-filter, bloomfilter, spark, teragrep, tokenization, tokenizer, unstructured-data
- Language: Scala
- Homepage: https://teragrep.com/
- Size: 78.1 KB
- Stars: 0
- Watchers: 2
- Forks: 4
- Open Issues: 5
-
Metadata Files:
- Readme: README.adoc
- License: LICENSE
Awesome Lists containing this project
README
link:https://scan.coverity.com/projects/teragrep-dpf_03[image:https://img.shields.io/coverity/scan/30737.svg[Coverity Scan Build Status]]
# DPF_03
Holds a customized lexical tokenizer as a Spark UDF and a bloom filter aggregator.
Used to tokenize spark string columns and run Spark aggregation into a bloom filter with configurable filter size selection.
## Features
### TokenizerUDF
Spark UDF that will tokenize incoming string value and return it as a list of byte arrays.
Tokenization rules are set in blf_01.
### ByteArrayListAsStringListUDF
Spark UDF that converts results from TokenizerUDF into a list of strings.
### BloomFilterAggregator
Custom spark aggregator that aggregates a column string tokens into a single bloom filter and
returns the bytes of the resulting filter that can be processed.
Filter size is selected by giving the aggregator the name of a spark column that holds an estimated value of tokens and
by configuring a map of bloom filters with preset values (expected number of items, false positive probability).
## Documentation
See the official documentation on https://docs.teragrep.com[docs.teragrep.com].
## Limitations
Compatible with Java version 1.8, other versions might not work.
## How to [compile/use/implement]
See tests for how to apply and import into a Spark project
## Contributing
You can involve yourself with our project by https://github.com/teragrep/dpf_03/issues/new/choose[opening an issue] or submitting a pull request.
Contribution requirements:
. *All changes must be accompanied by a new or changed test.* If you think testing is not required in your pull request, include a sufficient explanation as why you think so.
. Security checks must pass
. Pull requests must align with the principles and http://www.extremeprogramming.org/values.html[values] of extreme programming.
. Pull requests must follow the principles of Object Thinking and Elegant Objects (EO).
Read more in our https://github.com/teragrep/teragrep/blob/main/contributing.adoc[Contributing Guideline].
### Contributor License Agreement
Contributors must sign https://github.com/teragrep/teragrep/blob/main/cla.adoc[Teragrep Contributor License Agreement] before a pull request is accepted to organization's repositories.
You need to submit the CLA only once. After submitting the CLA you can contribute to all Teragrep's repositories.