https://github.com/teragrep/blf_02
Teragrep Bloom filter plugin for MariaDB
- Host: GitHub
- URL: https://github.com/teragrep/blf_02
- Owner: teragrep
- License: apache-2.0
- Created: 2023-02-15T13:42:25.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-02-12T09:24:42.000Z (3 months ago)
- Last Synced: 2025-03-29T15:02:02.672Z (about 2 months ago)
- Topics: bloom-filter, bloomfilter, mariadb, mariadb-plugin, search-optimization, teragrep
- Language: M4
- Homepage: https://teragrep.com
- Size: 30.3 KB
- Stars: 1
- Watchers: 2
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.adoc
- License: LICENSE
= BLF_02: Teragrep Bloom Filter Plugin for MariaDB

This package provides two user-defined functions (UDFs) for MySQL to efficiently work with Bloom filters:

- `bloommatch` tests whether one Bloom filter is contained in another.
- `bloomupdate` combines two Bloom filters.

These UDFs enable efficient querying and manipulation of Bloom filters stored in MySQL.
Bloom filters are represented as arrays of bytes in little-endian order.

License: Apache

== Installation
Install the blf_02 package.

[source,sh]
----
yum install blf_02.rpm
----

=== Enabling

link:https://mariadb.com/kb/en/user-defined-functions-security/[Read more about required permissions]

==== Option 1: Execute the pre-made query

[source,shell]
----
mariadb < /opt/teragrep/blf_02/share/installdb.sql
----

==== Option 2: Execute the queries manually

[source,sql]
----
USE mysql;
DROP FUNCTION IF EXISTS bloommatch;
DROP FUNCTION IF EXISTS bloomupdate;
CREATE FUNCTION bloommatch RETURNS INTEGER SONAME 'lib_mysqludf_bloom.so';
CREATE FUNCTION bloomupdate RETURNS STRING SONAME 'lib_mysqludf_bloom.so';
----

=== Disabling

link:https://mariadb.com/kb/en/user-defined-functions-security/[Read more about required permissions]

==== Option 1: Execute the pre-made query

[source,shell]
----
mariadb < /opt/teragrep/blf_02/share/uninstalldb.sql
----

==== Option 2: Execute the queries manually

[source,sql]
----
USE mysql;
DROP FUNCTION IF EXISTS bloommatch;
DROP FUNCTION IF EXISTS bloomupdate;
----

== Functions

=== Match Function
This function performs a byte-by-byte check of `(a & b) == a`.
If the check is true for every byte, then `a` may be found in `b`.
If it is false for any byte, then `a` is not in `b`.

Function in SQL:

[source,sql]
----
bloommatch(blob a, blob b)
----

A Java example of how the function is used:

[source,java]
----
Connection con = ... // Get the db connection
InputStream is = ... // Input stream containing the bloom filter to locate in the table
PreparedStatement stmt = con.prepareStatement( "SELECT * FROM bloomTable WHERE bloommatch( ?, bloomTable.filter );" );
stmt.setBlob( 1, is );
ResultSet rs = stmt.executeQuery();
// Result set now contains all the matching bloom filters from the table.
----
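
The containment rule can also be sketched outside the database, which helps when reasoning about which rows a query will return. The following minimal Java sketch applies `(a & b) == a` byte by byte. The class name `BloomMatchSketch`, the method `bloomMatch`, and the choice to treat filters of different lengths as non-matching are assumptions made for illustration; they are not taken from the plugin's source.

```java
final class BloomMatchSketch {

    // Returns true when every bit set in a is also set in b,
    // i.e. (a[i] & b[i]) == a[i] holds for every byte.
    // Assumption: filters of different lengths never match.
    static boolean bloomMatch(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return false;
        }
        for (int i = 0; i < a.length; i++) {
            if ((a[i] & b[i]) != a[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] stored = {(byte) 0b1011_0010, (byte) 0b0100_0001};
        byte[] subset = {(byte) 0b0010_0010, (byte) 0b0100_0000};
        byte[] other  = {(byte) 0b0000_0100, (byte) 0b0000_0000};

        System.out.println(bloomMatch(subset, stored)); // prints "true"
        System.out.println(bloomMatch(other, stored));  // prints "false"
    }
}
```

A `true` result only means that every bit set in `a` is also set in `b`, so the tokens behind `a` may be present. Bloom filters can report false positives but not false negatives, which is why a match still needs to be verified against the actual value.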
=== Update Function
This function constructs a new filter byte by byte, where each byte of the result is `a | b`.

Function in SQL:

[source,sql]
----
bloomupdate( blob a, blob b )
----
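
The merge can be sketched in plain Java as a byte-wise OR. This is an illustrative sketch only: the class name `BloomUpdateSketch`, the method `bloomUpdate`, and the decision to reject filters of different lengths are assumptions, not the plugin's actual behavior.

```java
import java.util.Arrays;

final class BloomUpdateSketch {

    // Builds a new filter whose bytes are a[i] | b[i].
    // Assumption: both filters must have the same length.
    static byte[] bloomUpdate(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("filters must have the same length");
        }
        byte[] merged = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            merged[i] = (byte) (a[i] | b[i]);
        }
        return merged;
    }

    public static void main(String[] args) {
        byte[] a = {0x01, 0x10};
        byte[] b = {0x02, 0x10};
        System.out.println(Arrays.toString(bloomUpdate(a, b))); // prints "[3, 16]"
    }
}
```

Because the result contains every bit of both inputs, each input filter is contained in the merged filter under the `(a & b) == a` rule used by `bloommatch`.
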
A Java example of how the function is used:

[source,java]
----
Connection con = ... // Get the db connection
InputStream is = ... // Input stream containing the bloom filter to merge into the stored filter
PreparedStatement stmt = con.prepareStatement( "UPDATE bloomTable SET filter=bloomupdate( ?, bloomTable.filter ) WHERE id=?;" );
stmt.setBlob( 1, is );
stmt.setInt( 2, 5 );
stmt.executeUpdate();
// Bloom filters on rows with an id of 5 have been updated to include the bits set in the blob.
----

== Development

MySQL client and server headers are required to compile this code.
Run the following commands in the root directory of the source tree:

[source,shell]
----
aclocal
autoconf
autoheader
automake --add-missing
./configure
make
sudo make install
sudo make installdb
----

To remove the library from your system:

[source,shell]
----
make uninstalldb
make uninstall
----

== Spark Example

A short demo of how to use blf_02 in practice with Apache Spark and Scala.

=== Creating and Storing a Bloom Filter in a Database

In the following example, we generate a Bloom filter from a Spark DataFrame
and store its serialized form in a database for later use.

The filter is stored in a table alongside a string value.
When searching for a token,
we can first check the filter before checking the value.

[source,scala]
----
// Generate and upload a Spark Bloom filter to a database

import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import java.sql.DriverManager
import org.apache.spark.util.sketch.BloomFilter
import java.io.{ByteArrayOutputStream, ByteArrayInputStream, ObjectOutputStream, InputStream}

// Filter parameters
val expected: Long = 500
val fpp: Double = 0.3

val dburl = "DATABASE_URL"
val updatesql = "INSERT INTO `example_strings` (`value`, `filter`) VALUES (?,?)"
val conn = DriverManager.getConnection(dburl,"DB_USERNAME","DB_PASSWORD")
val value = "one two three"

// Create a Spark DataFrame with values 'one', 'two' and 'three'
// This emulates a tokenized form of the value field
val in1 = spark.sparkContext.parallelize(List("one","two","three"))
val df = in1.toDF("tokens")

val ps = conn.prepareStatement(updatesql)
// Create a Bloom filter from the DataFrame
val filter = df.stat.bloomFilter($"tokens", expected, fpp)
println(filter.mightContain("one"))

// Write a filter bit array to the output stream
val baos = new ByteArrayOutputStream
filter.writeTo(baos)
val is: InputStream = new ByteArrayInputStream(baos.toByteArray())
ps.setString(1, value)
ps.setBlob(2,is)
val update = ps.executeUpdate
println("Updated rows: "+ update)
df.show()
conn.close()
----

=== Finding Matching Filters

A Bloom filter is created from a Spark DataFrame
and compared with the filters stored in the database to retrieve matching string values.
Note that each comparison generates a new Bloom filter for the SQL function.

Imagine we want to check whether a value
contains the tokens `one` and `two` from the previous example.

[source,scala]
----
// Create a bloomfilter and find matches
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import java.sql.DriverManager
import org.apache.spark.util.sketch.BloomFilter
import java.io.{ByteArrayOutputStream, ByteArrayInputStream, ObjectOutputStream, InputStream}

// Generated filter array must have the same length as the one it is compared to
val expected: Long = 500
val fpp: Double = 0.3

val dburl = "DATABASE_URL"
val conn = DriverManager.getConnection(dburl,"DB_USERNAME","DB_PASSWORD")

val selectsql = "SELECT `value` FROM `example_strings` WHERE bloommatch(?, `example_strings`.`filter`);"
val ps = conn.prepareStatement(selectsql)

// Creating a filter with values 'one' and 'two'
val in2 = spark.sparkContext.parallelize(List("one","two"))
val df2 = in2.toDF("tokens")
val filter = df2.stat.bloomFilter($"tokens", expected, fpp)

val baos = new ByteArrayOutputStream
filter.writeTo(baos)
baos.flush()
val is: InputStream = new ByteArrayInputStream(baos.toByteArray())
ps.setBlob(1, is)
val rs = ps.executeQuery

// Will find a match since tokens searched are both in the filter
val resultList = Iterator.from(0).takeWhile(_ => rs.next()).map(_ => rs.getString(1)).toList
println("Found matches: " + resultList.size)
conn.close()
----
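
The comment in the example above notes that the generated filter must have the same length as the stored one. In practice this means using the same expected item count and false positive probability on both sides, because the bit array size is derived from them. The sketch below uses the textbook sizing formula `m = -n * ln(p) / (ln 2)^2` and assumes the bits are kept in 64-bit words. The class and method names are hypothetical, and Spark's serialized form may add header fields on top of the bit array, so treat the numbers as illustrative, not as the exact blob size.

```java
final class FilterSizeSketch {

    // Textbook Bloom filter sizing: number of bits for n expected items
    // at false positive probability p.
    static long optimalNumOfBits(long expectedItems, double fpp) {
        return (long) (-expectedItems * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    }

    // Bytes needed for the bit array, assuming it is stored in 64-bit words.
    static long bitArrayBytes(long numBits) {
        long words = (numBits + 63) / 64;
        return words * 8;
    }

    public static void main(String[] args) {
        long sameA = bitArrayBytes(optimalNumOfBits(500, 0.3));
        long sameB = bitArrayBytes(optimalNumOfBits(500, 0.3));
        long different = bitArrayBytes(optimalNumOfBits(500, 0.01));

        // Identical parameters give identical sizes, so the filters are comparable.
        System.out.println("same parameters match in size: " + (sameA == sameB));
        // A different fpp changes the size, so byte-wise comparison is not meaningful.
        System.out.println("different fpp matches in size: " + (sameA == different));
    }
}
```

Keeping `expected` and `fpp` in one shared configuration for both the writer and the reader is a simple way to avoid size mismatches.
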
== Contributing

// Change the repository name in the issues link to match with your project's name
You can get involved in the project by https://github.com/teragrep/blf_02/issues/new/choose[opening an issue] or submitting a pull request.

Contribution requirements:

. *All changes must be accompanied by a new or changed test.* If you think testing is not required in your pull request, include a sufficient explanation of why you think so.
. Security checks must pass.
. Pull requests must align with the principles and http://www.extremeprogramming.org/values.html[values] of extreme programming.
. Pull requests must follow the principles of Object Thinking and Elegant Objects (EO).

Read more in our https://github.com/teragrep/teragrep/blob/main/contributing.adoc[Contributing Guideline].

=== Contributor License Agreement
Contributors must sign the https://github.com/teragrep/teragrep/blob/main/cla.adoc[Teragrep Contributor License Agreement] before a pull request can be accepted into the organization's repositories.
You need to submit the CLA only once. After submitting it, you can contribute to all Teragrep repositories.