https://github.com/databricks/spark-corenlp

Stanford CoreNLP wrapper for Apache Spark

Host: GitHub
URL: https://github.com/databricks/spark-corenlp
Owner: databricks
License: gpl-3.0
Created: 2015-08-21T20:54:58.000Z (almost 9 years ago)
Default Branch: master
Last Pushed: 2018-11-15T23:06:53.000Z (over 5 years ago)
Last Synced: 2024-01-08T11:20:34.220Z (6 months ago)
Language: Scala
Homepage:
Size: 59.6 KB
Stars: 424
Watchers: 52
Forks: 120
Open Issues: 19
Metadata Files:
- Readme: README.md
- License: LICENSE

Lists

awesome-spark - spark-corenlp - DataFrame wrapper for [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/). (Packages / Natural Language Processing)

README

## Stanford CoreNLP wrapper for Apache Spark

This package wraps [Stanford CoreNLP](http://stanfordnlp.github.io/CoreNLP/) annotators as Spark
DataFrame functions following the [simple APIs](http://stanfordnlp.github.io/CoreNLP/simple.html)
introduced in Stanford CoreNLP 3.7.0.

This package requires Java 8 and CoreNLP to run.
Users must include CoreNLP model jars as dependencies to use language models.

All functions are defined under `com.databricks.spark.corenlp.functions`.

* *`cleanxml`*: Cleans XML tags in a document and returns the cleaned document.
* *`tokenize`*: Tokenizes a sentence into words.
* *`ssplit`*: Splits a document into sentences.
* *`pos`*: Generates the part of speech tags of the sentence.
* *`lemma`*: Generates the word lemmas of the sentence.
* *`ner`*: Generates the named entity tags of the sentence.
* *`depparse`*: Generates the semantic dependencies of the sentence and returns a flattened list of
`(source, sourceIndex, relation, target, targetIndex, weight)` relation tuples.
* *`coref`*: Generates the coref chains in the document and returns a list of
`(rep, mentions)` chain tuples, where `mentions` are in the format of
`(sentNum, startIndex, mention)`.
* *`natlog`*: Generates the Natural Logic notion of polarity for each token in a sentence, returned
as `up`, `down`, or `flat`.
* *`openie`*: Generates a list of Open IE triples as flat `(subject, relation, target, confidence)`
tuples.
* *`sentiment`*: Measures the sentiment of an input sentence on a scale of 0 (strong negative) to 4
(strong positive).
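
As a small illustration of the `sentiment` scale described above, the score can be mapped to a readable label. This is only a sketch: `SentimentLabels` and `sentimentLabel` are hypothetical names that are not part of `spark-corenlp`, and the labels for scores 1 to 3 are assumed from CoreNLP's usual five-class sentiment scheme.

~~~scala
// Hypothetical helper, not part of spark-corenlp: maps the 0-4 score
// returned by `sentiment` to a readable label.
object SentimentLabels {
  private val labels = Vector(
    "strong negative", // 0 (wording from the function list above)
    "negative",        // 1 (assumed intermediate label)
    "neutral",         // 2 (assumed intermediate label)
    "positive",        // 3 (assumed intermediate label)
    "strong positive"  // 4 (wording from the function list above)
  )

  // Returns None for any score outside the documented 0-4 range.
  def sentimentLabel(score: Int): Option[String] = labels.lift(score)
}
~~~

In a Spark job, a helper like this could be wrapped with `udf` from `org.apache.spark.sql.functions` and applied to the `sentiment` column.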

Users can chain the functions to create a pipeline, for example:

~~~scala
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._
// Outside spark-shell, also import spark.implicits._ (needed for toDF and the 'col syntax).

val input = Seq(
  (1, "Stanford University is located in California. It is a great university.")
).toDF("id", "text")

// Strip XML tags, split the document into sentences, then annotate each sentence.
val output = input
  .select(cleanxml('text).as('doc))
  .select(explode(ssplit('doc)).as('sen))
  .select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

output.show(truncate = false)
~~~

~~~
+----------------------------------------------+------------------------------------------------------+--------------------------------------------------+---------+
|sen                                           |words                                                 |nerTags                                           |sentiment|
+----------------------------------------------+------------------------------------------------------+--------------------------------------------------+---------+
|Stanford University is located in California .|[Stanford, University, is, located, in, California, .]|[ORGANIZATION, ORGANIZATION, O, O, O, LOCATION, O]|1        |
|It is a great university .                    |[It, is, a, great, university, .]                     |[O, O, O, O, O, O]                                |4        |
+----------------------------------------------+------------------------------------------------------+--------------------------------------------------+---------+
~~~
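
The `words` and `nerTags` columns are aligned token by token, so they can be combined after collection. The following is a minimal sketch in plain Scala, with no Spark dependency, that merges consecutive tokens sharing the same non-`O` tag into entity spans. `NerSpans` and `entitySpans` are hypothetical names introduced for illustration, and the merging rule is an assumption rather than something the package provides.

~~~scala
// Hypothetical post-processing helper, not part of spark-corenlp.
object NerSpans {
  // Groups consecutive tokens that share a tag, then keeps only the
  // groups whose tag is not "O", returned as (text, tag) pairs.
  def entitySpans(words: Seq[String], tags: Seq[String]): Seq[(String, String)] = {
    require(words.length == tags.length, "words and tags must align")
    // Build groups in reverse order: the head of the list is the current group.
    val grouped = words.zip(tags).foldLeft(List.empty[(List[String], String)]) {
      case ((tokens, tag) :: rest, (word, t)) if t == tag =>
        (word :: tokens, tag) :: rest
      case (acc, (word, t)) =>
        (List(word), t) :: acc
    }
    grouped.reverse.collect {
      case (tokens, tag) if tag != "O" => (tokens.reverse.mkString(" "), tag)
    }
  }
}
~~~

Applied to the first row of the output above, this yields `("Stanford University", "ORGANIZATION")` and `("California", "LOCATION")`. Because the tags carry no begin/inside markers, two distinct adjacent entities with the same tag would be merged into one span.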

### Databricks

If you are a Databricks user, please follow the instructions in this
[example notebook](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/1962483213436895/588180/latest.html).

### Dependencies

Because CoreNLP depends on `protobuf-java` 3.x but Spark 2.4 depends on `protobuf-java` 2.x,
we release `spark-corenlp` as an assembly jar that bundles CoreNLP and its transitive dependencies,
with `protobuf-java` shaded to avoid the version conflict.
This might cause issues if you already have CoreNLP or its dependencies on the classpath.
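
For readers unfamiliar with shading, the fragment below sketches how a dependency such as `protobuf-java` can be relocated with the sbt-assembly plugin. It is not taken from this project's build: the target package name is a placeholder, and the slash syntax assumes sbt 1.1 or later with sbt-assembly enabled.

~~~scala
// build.sbt fragment (assumes the sbt-assembly plugin is enabled).
assembly / assemblyShadeRules := Seq(
  // Relocate protobuf classes into a private namespace inside the assembly jar,
  // so they cannot clash with the protobuf version that Spark puts on the classpath.
  // "shaded.com.google.protobuf" is a placeholder chosen for this sketch.
  ShadeRule.rename("com.google.protobuf.**" -> "shaded.com.google.protobuf.@1").inAll
)
~~~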

To use `spark-corenlp`, you need one of the CoreNLP language models:

~~~bash
# Download one of the language models.
wget https://repo1.maven.org/maven2/edu/stanford/nlp/stanford-corenlp/3.9.1/stanford-corenlp-3.9.1-models.jar
# Run spark-shell
spark-shell --packages databricks/spark-corenlp:0.4.0-spark_2.4-scala_2.11 --jars stanford-corenlp-3.9.1-models.jar
~~~

### Acknowledgements

Many thanks to Jason Bolton from the Stanford NLP Group for API discussions.