{"id":13795338,"url":"https://github.com/databricks/spark-corenlp","last_synced_at":"2025-04-06T01:09:33.319Z","repository":{"id":36870199,"uuid":"41177171","full_name":"databricks/spark-corenlp","owner":"databricks","description":"Stanford CoreNLP wrapper for Apache Spark","archived":false,"fork":false,"pushed_at":"2018-11-15T23:06:53.000Z","size":61,"stargazers_count":422,"open_issues_count":19,"forks_count":120,"subscribers_count":50,"default_branch":"master","last_synced_at":"2025-03-30T00:08:22.000Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databricks.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-08-21T20:54:58.000Z","updated_at":"2024-07-06T13:03:16.000Z","dependencies_parsed_at":"2022-09-11T00:21:38.950Z","dependency_job_id":null,"html_url":"https://github.com/databricks/spark-corenlp","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fspark-corenlp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fspark-corenlp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fspark-corenlp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Fspark-corenlp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/databricks","download_url":"https://codeload.github.com/databricks/spark-corenlp/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://g
ithub.com","kind":"github","repositories_count":247419860,"owners_count":20936012,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T23:00:54.771Z","updated_at":"2025-04-06T01:09:33.305Z","avatar_url":"https://github.com/databricks.png","language":"Scala","readme":"## Stanford CoreNLP wrapper for Apache Spark\n\nThis package wraps [Stanford CoreNLP](http://stanfordnlp.github.io/CoreNLP/) annotators as Spark\nDataFrame functions following the [simple APIs](http://stanfordnlp.github.io/CoreNLP/simple.html)\nintroduced in Stanford CoreNLP 3.7.0.\n\nThis package requires Java 8 and CoreNLP to run.\nUsers must include CoreNLP model jars as dependencies to use language models.\n\nAll functions are defined under `com.databricks.spark.corenlp.functions`.\n\n* *`cleanxml`*: Cleans XML tags in a document and returns the cleaned document.\n* *`tokenize`*: Tokenizes a sentence into words.\n* *`ssplit`*: Splits a document into sentences.\n* *`pos`*: Generates the part of speech tags of the sentence.\n* *`lemma`*: Generates the word lemmas of the sentence.\n* *`ner`*: Generates the named entity tags of the sentence.\n* *`depparse`*: Generates the semantic dependencies of the sentence and returns a flattened list of\n  `(source, sourceIndex, relation, target, targetIndex, weight)` relation tuples.\n* *`coref`*: Generates the coref chains in the document and returns a list of\n  `(rep, mentions)` chain tuples, where `mentions` are in the format of\n  `(sentNum, startIndex, mention)`.\n* *`natlog`*: Generates the Natural Logic notion of polarity for each token in a 
sentence, returned\n  as `up`, `down`, or `flat`.\n* *`openie`*: Generates a list of Open IE triples as flat `(subject, relation, target, confidence)`\n  tuples.\n* *`sentiment`*: Measures the sentiment of an input sentence on a scale of 0 (strong negative) to 4\n  (strong positive).\n\nUsers can chain the functions to create a pipeline, for example:\n\n~~~scala\nimport org.apache.spark.sql.functions._\nimport com.databricks.spark.corenlp.functions._\n\nval input = Seq(\n  (1, \"\u003cxml\u003eStanford University is located in California. It is a great university.\u003c/xml\u003e\")\n).toDF(\"id\", \"text\")\n\nval output = input\n  .select(cleanxml('text).as('doc))\n  .select(explode(ssplit('doc)).as('sen))\n  .select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))\n\noutput.show(truncate = false)\n~~~\n\n~~~\n+----------------------------------------------+------------------------------------------------------+--------------------------------------------------+---------+\n|sen                                           |words                                                 |nerTags                                           |sentiment|\n+----------------------------------------------+------------------------------------------------------+--------------------------------------------------+---------+\n|Stanford University is located in California .|[Stanford, University, is, located, in, California, .]|[ORGANIZATION, ORGANIZATION, O, O, O, LOCATION, O]|1        |\n|It is a great university .                    |[It, is, a, great, university, .]                     
|[O, O, O, O, O, O]                                |4        |\n+----------------------------------------------+------------------------------------------------------+--------------------------------------------------+---------+\n~~~\n\n### Databricks\n\nIf you are a Databricks user, please follow the instructions in this\n[example notebook](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/1962483213436895/588180/latest.html).\n\n### Dependencies\n\nBecause CoreNLP depends on `protobuf-java` 3.x while Spark 2.4 depends on `protobuf-java` 2.x,\nwe release `spark-corenlp` as an assembly jar that includes CoreNLP and its transitive dependencies,\nwith `protobuf-java` shaded to avoid the version conflict.\nThis might cause issues if you have CoreNLP or its dependencies on the classpath.\n\nTo use `spark-corenlp`, you need one of the CoreNLP language models:\n\n~~~bash\n# Download one of the language models.\nwget https://repo1.maven.org/maven2/edu/stanford/nlp/stanford-corenlp/3.9.1/stanford-corenlp-3.9.1-models.jar\n# Run spark-shell\nspark-shell --packages databricks/spark-corenlp:0.4.0-spark_2.4-scala_2.11 --jars stanford-corenlp-3.9.1-models.jar\n~~~\n\n### Acknowledgements\n\nMany thanks to Jason Bolton from the Stanford NLP Group for API discussions.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks%2Fspark-corenlp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabricks%2Fspark-corenlp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks%2Fspark-corenlp/lists"}