https://github.com/databricks-industry-solutions/smolder

HL7 Apache Spark Datasource
datasource hl7 hl7v2 spark

Host: GitHub
URL: https://github.com/databricks-industry-solutions/smolder
Owner: databricks-industry-solutions
License: apache-2.0
Created: 2020-11-19T22:37:52.000Z (almost 5 years ago)
Default Branch: master
Last Pushed: 2024-10-23T22:49:50.000Z (about 1 year ago)
Last Synced: 2025-06-09T15:43:55.506Z (5 months ago)
Topics: datasource, hl7, hl7v2, spark
Language: Scala
Homepage:
Size: 74.2 KB
Stars: 66
Watchers: 11
Forks: 25
Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE-OF-CONDUCT.md

README

A library for burning through electronic health record data using Apache Spark™



Smolder provides an Apache Spark™ SQL data source for loading EHR data from
[HL7v2](https://www.hl7.org/implement/standards/product_brief.cfm?product_id=244)
message formats. Additionally, Smolder provides helper functions that can be used
on a Spark SQL DataFrame to parse HL7 message text and to extract segments,
fields, and subfields from a message.

# Project Support

Please note that all projects in the /databrickslabs GitHub account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo.  They will be reviewed as time permits, but there are no formal SLAs for support.

# Building and Testing

This project is built using [sbt](https://www.scala-sbt.org/1.0/docs/Setup.html) and Java 8.

Start an sbt shell using the `sbt` command.

> **FYI**: The following sbt projects are built on Spark 3.2.1/Scala 2.12.8 by default. To change the Spark version and
> Scala version, set the environment variables `SPARK_VERSION` and `SCALA_VERSION`.

To compile the main code:

```
compile
```

To run all Scala tests:

```
test
```

To test a specific suite:

```
testOnly *HL7FileFormatSuite
```

To create a JAR that can be run as part of an [Apache Spark job or
shell](http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management), run:

```
package
```

The JAR can be found under `target/scala-`.

# Getting Started

To load HL7 messages into an [Apache Spark SQL
DataFrame](http://spark.apache.org/docs/latest/sql-programming-guide.html),
simply invoke the `hl7` reader:

```
scala> val df = spark.read.format("hl7").load("path/to/hl7/messages")
df: org.apache.spark.sql.DataFrame = [message: string, segments: array<struct<id:string,fields:array<string>>>]
```

The schema returned contains the message header in the `message` column. The
message segments are nested in the `segments` column, which is an array. Each
element of this array is a struct with two nested fields: the string `id` for the
segment (e.g., `PID` for a
[patient identification segment](http://www.hl7.eu/refactored/segPID.html))
and an array of segment `fields`.
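To make the shape of this schema concrete, here is a minimal plain-Scala sketch that mirrors it with case classes. It is not Smolder's implementation: `Segment`, `Message`, `parseSketch`, and the sample message are hypothetical, and the sketch assumes segments are separated by carriage returns or newlines and fields by `|`.

```scala
// Illustrative only: a plain-Scala mirror of the schema described above.
// Segment, Message, and parseSketch are hypothetical names, not Smolder's API.
final case class Segment(id: String, fields: Seq[String])
final case class Message(message: String, segments: Seq[Segment])

object Hl7SchemaSketch {
  // Made-up sample message; the backslash in the MSH encoding characters is escaped.
  val sample: String =
    "MSH|^~\\&|||||20200101\rPID|1||12345||Heller^Keneth\rPV1|1|I"

  // Assumes the first line is the MSH header and every later line is one segment.
  def parseSketch(text: String): Message = {
    val lines = text.split("[\r\n]+").filter(_.nonEmpty).toSeq
    val segments = lines.drop(1).map { line =>
      // A limit of -1 keeps trailing empty fields instead of dropping them.
      val parts = line.split("\\|", -1)
      Segment(parts.head, parts.tail.toSeq)
    }
    Message(lines.headOption.getOrElse(""), segments)
  }
}
```

In this sketch the segment `id` is kept out of `fields`, so `fields(0)` is the first field after the segment name.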

## Parsing message text from a DataFrame

Smolder can also be used to parse raw message text. This might happen if you had
an HL7 message feed land in an intermediate source first (e.g., a Kafka stream).
To do this, we can use Smolder's `parse_hl7_message` helper function. First, we
start with a DataFrame containing HL7 message text:

```
scala> val textMessageDf = ...
textMessageDf: org.apache.spark.sql.DataFrame = [value: string]

scala> textMessageDf.show()
+--------------------+
|               value|
+--------------------+
|MSH|^~\&|||||2020...|
+--------------------+
```

Then, we can import the `parse_hl7_message` function from the
`com.databricks.labs.smolder.functions` object and apply it to the column we
want to parse:

```
scala> import com.databricks.labs.smolder.functions.parse_hl7_message
import com.databricks.labs.smolder.functions.parse_hl7_message

scala> val parsedDf = textMessageDf.select(parse_hl7_message($"value").as("message"))
parsedDf: org.apache.spark.sql.DataFrame = [message: struct<message:string,segments:array<struct<id:string,fields:array<string>>>>]
```

This yields a single `message` struct column whose fields match the schema of our `hl7` data source.

## Extracting fields from an HL7 message segment

While Smolder provides an easy-to-use schema for HL7 messages, we also provide
helper functions in `com.databricks.labs.smolder.functions` to extract subfields
of a message segment. For instance, let's say we want to get the patient's name,
which is the 5th field in the patient ID (PID) segment. Field indices are
zero-based, so the 5th field is at index 4. We can extract this with
the `segment_field` function:

```
scala> import com.databricks.labs.smolder.functions.segment_field
import com.databricks.labs.smolder.functions.segment_field

scala> val nameDf = df.select(segment_field("PID", 4).alias("name"))
nameDf: org.apache.spark.sql.DataFrame = [name: string]

scala> nameDf.show()
+-------------+
|         name|
+-------------+
|Heller^Keneth|
+-------------+
```

If we then want to get the patient's first name, we can use the `subfield`
function:

```
scala> import com.databricks.labs.smolder.functions.subfield
import com.databricks.labs.smolder.functions.subfield

scala> val firstNameDf = nameDf.select(subfield($"name", 1).alias("firstname"))
firstNameDf: org.apache.spark.sql.DataFrame = [firstname: string]

scala> firstNameDf.show()
+---------+
|firstname|
+---------+
|   Keneth|
+---------+
```
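As a rough mental model of the two lookups above, the following plain-Scala sketch shows zero-based field access followed by zero-based component access. It is an illustration rather than Smolder's code: `fieldAt`, `subfieldAt`, and the sample `pidFields` values are hypothetical, and the sketch assumes `^` separates subfields, as in the `Heller^Keneth` output above.

```scala
// Illustrative only: hypothetical helpers mirroring the lookups shown above.
object FieldAccessSketch {
  // Returns the field at a zero-based index, or None if the index is out of range.
  def fieldAt(fields: Seq[String], index: Int): Option[String] =
    fields.lift(index)

  // Splits a field on '^' (assumed component separator) and returns the
  // component at a zero-based index, or None if the index is out of range.
  def subfieldAt(field: String, index: Int): Option[String] =
    field.split("\\^", -1).toSeq.lift(index)

  // Made-up PID fields; index 4 holds the patient name (the 5th field).
  val pidFields: Seq[String] = Seq("1", "", "12345", "", "Heller^Keneth")

  val name: Option[String] = fieldAt(pidFields, 4)
  val firstName: Option[String] = name.flatMap(subfieldAt(_, 1))
}
```

Returning `Option` here is a design choice of the sketch so that a missing field or component is explicit; it says nothing about how Smolder itself handles out-of-range indices.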

# License and Contributing

Smolder is made available under an [Apache 2.0 license](LICENSE), and we welcome
contributions from the community. Please see our [contributor guidance](CONTRIBUTING.md)
for information about how to contribute to the project. To ensure that contributions
to Smolder are properly licensed, we follow the [Developer Certificate of Origin
(DCO)](http://developercertificate.org/) for all contributions to the project.
