https://github.com/airbnb/airbnb-spark-thrift
A library for loading Thrift data into Spark SQL
- Host: GitHub
- URL: https://github.com/airbnb/airbnb-spark-thrift
- Owner: airbnb
- License: apache-2.0
- Created: 2017-04-20T21:06:10.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2023-03-04T01:16:31.000Z (over 2 years ago)
- Last Synced: 2025-08-29T06:33:19.600Z (about 2 months ago)
- Topics: spark, spark-sql, spark-streaming, thrift
- Language: Scala
- Homepage:
- Size: 50.8 KB
- Stars: 43
- Watchers: 16
- Forks: 16
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# Spark Thrift Loader
[![Build Status](https://travis-ci.org/airbnb/airbnb-spark-thrift.svg?branch=master)](https://travis-ci.org/airbnb/airbnb-spark-thrift)

A library for loading Thrift data into [Spark SQL](http://spark.apache.org/docs/latest/sql-programming-guide.html).
## Features
It supports conversion from Thrift records to Spark SQL, making Thrift a first-class citizen in Spark. It automatically derives a Spark SQL schema from a Thrift struct and converts Thrift objects to Spark Rows at runtime. Nested structs are fully supported, with the one restriction that map key fields must be primitive types. This is especially useful when running a Spark Streaming job that consumes Thrift events from different streaming sources.
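To make the derivation idea concrete, here is a self-contained sketch (no Spark dependency) that recursively maps a toy Thrift-like type description to a toy SQL-type description, including the rule that map keys must stay primitive. All names below (`TI32`, `SStruct`, `derive`, etc.) are hypothetical stand-ins, not the library's API; the real converter produces `org.apache.spark.sql.types` objects.

```scala
// Hypothetical stand-ins for Spark SQL types (the real library uses
// org.apache.spark.sql.types.DataType and friends).
sealed trait SqlType
case object SInt extends SqlType
case object SString extends SqlType
final case class SArray(element: SqlType) extends SqlType
final case class SMap(key: SqlType, value: SqlType) extends SqlType
final case class SStruct(fields: Map[String, SqlType]) extends SqlType

// Hypothetical stand-ins for Thrift field metadata.
sealed trait ThriftType
case object TI32 extends ThriftType
case object TString extends ThriftType
final case class TList(element: ThriftType) extends ThriftType
final case class TMap(key: ThriftType, value: ThriftType) extends ThriftType
final case class TStruct(fields: Map[String, ThriftType]) extends ThriftType

// Recursive schema derivation: nested structs are handled by recursing
// into their fields; map keys are restricted to primitive types.
def derive(t: ThriftType): SqlType = t match {
  case TI32       => SInt
  case TString    => SString
  case TList(e)   => SArray(derive(e))
  case TMap(k, v) =>
    require(k == TI32 || k == TString, "Map key fields must be primitive")
    SMap(derive(k), derive(v))
  case TStruct(f) => SStruct(f.map { case (name, ft) => name -> derive(ft) })
}
```

For example, `derive(TStruct(Map("id" -> TI32, "tags" -> TList(TString))))` yields `SStruct(Map("id" -> SInt, "tags" -> SArray(SString)))`, while a struct-keyed map fails the primitive-key check.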
## Supported types for Thrift -> Spark SQL conversion
This library supports reading the following types. It uses the following mapping to convert Thrift types to Spark SQL types:
| Thrift Type | Spark SQL type |
| --- | --- |
| bool | BooleanType |
| i16 | ShortType |
| i32 | IntegerType |
| i64 | LongType |
| double | DoubleType |
| binary | StringType |
| string | StringType |
| enum | StringType |
| list | ArrayType |
| set | ArrayType |
| map | MapType |
| struct | StructType |

## Examples
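Read as data, the table above is a simple lookup from Thrift type names to Spark SQL type names. Here is a minimal plain-Scala sketch (type names as strings, no Spark dependency; purely illustrative):

```scala
// The Thrift -> Spark SQL mapping from the table above, written as a
// plain lookup for illustration. The actual library returns
// org.apache.spark.sql.types.DataType instances, not strings.
val thriftToSparkSql: Map[String, String] = Map(
  "bool"   -> "BooleanType",
  "i16"    -> "ShortType",
  "i32"    -> "IntegerType",
  "i64"    -> "LongType",
  "double" -> "DoubleType",
  "binary" -> "StringType",
  "string" -> "StringType",
  "enum"   -> "StringType",
  "list"   -> "ArrayType",
  "set"    -> "ArrayType",
  "map"    -> "MapType",
  "struct" -> "StructType"
)
```

Note that both `list` and `set` collapse to `ArrayType`, and `binary`, `string`, and `enum` all land on `StringType`, so the mapping is not reversible.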
### Convert Thrift Schema to StructType in Spark
```scala
import com.airbnb.spark.thrift.ThriftSchemaConverter

// This returns a StructType for the Thrift class
val thriftStructType = ThriftSchemaConverter.convert(ThriftExampleClass.getClass)
```
### Convert Thrift Object to Row in Spark
```scala
import com.airbnb.spark.thrift.ThriftSchemaConverter
import com.airbnb.spark.thrift.ThriftParser

// This returns a StructType for the Thrift class
val thriftStructType = ThriftSchemaConverter.convert(ThriftExampleClass.getClass)
val row = ThriftParser.convertObject(
  thriftObject,
  thriftStructType)
```

### Use case: consuming a Kafka stream where each event is a Thrift object
```scala
import com.airbnb.spark.thrift.ThriftSchemaConverter
import com.airbnb.spark.thrift.ThriftParser

directKafkaStream.foreachRDD(rdd => {
  val schema = ThriftSchemaConverter.convert(ThriftExampleClass.getClass)

  val deserializedEvents = rdd
    .map(_.message)
    .filter(_ != null)
    .flatMap(eventBytes => {
      try Some(MessageSerializer.getInstance().fromBytes(eventBytes))
        .asInstanceOf[Option[Message[_]]]
      catch {
        case e: Exception =>
          LOG.warn(s"Failed to deserialize thrift event ${e.toString}")
          None
      }
    })
    .map(_.getEvent.asInstanceOf[TBaseType])

  val rows: RDD[Row] = ThriftParser(
    ThriftExampleClass.getClass,
    deserializedEvents,
    schema)

  val df = sqlContext.createDataFrame(rows, schema)
  // Process the DataFrame on this micro batch
})
```

## How to get started
Clone the project and run `mvn package` to build the artifact.

## How to contribute
Please send a PR and cc @liyintang or @jingweilu1974 for review.