Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/stanfordnlp/python-stanford-corenlp
Python interface to CoreNLP using a bidirectional server-client interface.
- Host: GitHub
- URL: https://github.com/stanfordnlp/python-stanford-corenlp
- Owner: stanfordnlp
- License: mit
- Created: 2017-06-08T23:28:16.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2022-01-10T17:24:09.000Z (about 3 years ago)
- Last Synced: 2025-01-12T19:05:26.518Z (10 days ago)
- Topics: corenlp, corenlp-server, nlp
- Language: Python
- Homepage:
- Size: 78.1 KB
- Stars: 518
- Watchers: 46
- Forks: 105
- Open Issues: 8
Metadata Files:
- Readme: README.rst
- Changelog: CHANGELOG
- License: LICENSE
README
Stanford CoreNLP Python Interface
=================================

**NOTE:** This package is now deprecated. Please use the `stanza <https://github.com/stanfordnlp/stanza>`_ package instead.
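For anyone migrating, here is a minimal sketch of the equivalent client in stanza (an illustration, assuming a current stanza install and the same :code:`$CORENLP_HOME` setup described below; the example sentence is illustrative):

.. code-block:: python

    # stanza ships its own client for the same CoreNLP server.
    from stanza.server import CoreNLPClient

    with CoreNLPClient(annotators=["tokenize", "ssplit", "pos"]) as client:
        ann = client.annotate("Stanza talks to the same CoreNLP server.")
        print(ann.sentence[0].token[0].word)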
.. image:: https://travis-ci.org/stanfordnlp/python-stanford-corenlp.svg?branch=master
   :target: https://travis-ci.org/stanfordnlp/python-stanford-corenlp

This package contains a Python interface for `Stanford CoreNLP
<https://stanfordnlp.github.io/CoreNLP/>`_, including a reference
implementation for interfacing with the `Stanford CoreNLP server
<https://stanfordnlp.github.io/CoreNLP/corenlp-server.html>`_.
The package also contains a base class for exposing a Python-based annotation
provider (e.g. your favorite neural NER system) to the CoreNLP
pipeline via a lightweight service.

To use the package, first download the `official Java CoreNLP release
<https://stanfordnlp.github.io/CoreNLP/download.html>`_, unzip it, and define an
environment variable :code:`$CORENLP_HOME` that points to the unzipped directory.

You can also install this package from `PyPI
<https://pypi.org/project/stanford-corenlp/>`_ using :code:`pip install stanford-corenlp`.
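Before launching a client, it can help to confirm that the environment variable
is actually visible to Python (a minimal sanity-check sketch; it assumes only
the :code:`$CORENLP_HOME` setup described above):

.. code-block:: python

    import os

    # $CORENLP_HOME must point at the unzipped CoreNLP distribution.
    corenlp_home = os.environ.get("CORENLP_HOME")
    assert corenlp_home, "Set $CORENLP_HOME to the unzipped CoreNLP directory"
    assert os.path.isdir(corenlp_home), "$CORENLP_HOME is not a directory"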
----
Command Line Usage
------------------
Probably the easiest way to use this package is through the `annotate` command-line utility::

    usage: annotate [-h] [-i INPUT] [-o OUTPUT] [-f {json}]
                    [-a ANNOTATORS [ANNOTATORS ...]] [-s] [-v] [-m MEMORY]
                    [-p PROPS [PROPS ...]]

    Annotate data

    optional arguments:
      -h, --help            show this help message and exit
      -i INPUT, --input INPUT
                            Input file to process; each line contains one document
                            (default: stdin)
      -o OUTPUT, --output OUTPUT
                            File to write annotations to (default: stdout)
      -f {json}, --format {json}
                            Output format
      -a ANNOTATORS [ANNOTATORS ...], --annotators ANNOTATORS [ANNOTATORS ...]
                            A list of annotators
      -s, --sentence-mode   Assume each line of input is a sentence.
      -v, --verbose-server  Server is made verbose
      -m MEMORY, --memory MEMORY
                            Memory to use for the server
      -p PROPS [PROPS ...], --props PROPS [PROPS ...]
                            Properties as a list of key=value pairs

We recommend using `annotate` in conjunction with the wonderful `jq`
command to process the output. As an example, given a file with a
sentence on each line, the following command produces the equivalent
space-separated tokens::

    cat file.txt | annotate -s -a tokenize | jq '[.tokens[].originalText]' > tokenized.txt
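If you also run the :code:`ssplit` and :code:`pos` annotators, tokens are
grouped per sentence in the JSON; assuming the server's usual
:code:`.sentences[].tokens[]` layout (an assumption worth verifying against
your own output), a similar pipeline extracts part-of-speech tags::

    cat file.txt | annotate -s -a tokenize ssplit pos | jq '[.sentences[].tokens[].pos]' > tagged.txt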
Annotation Server Usage
-----------------------

.. code-block:: python

    import corenlp
    text = "Chris wrote a simple sentence that he parsed with Stanford CoreNLP."

    # We assume that you've downloaded Stanford CoreNLP and defined an environment
    # variable $CORENLP_HOME that points to the unzipped directory.
    # The code below will launch StanfordCoreNLPServer in the background
    # and communicate with the server to annotate the sentence.
    with corenlp.CoreNLPClient(annotators="tokenize ssplit pos lemma ner depparse".split()) as client:
        ann = client.annotate(text)

        # You can access annotations using ann.
        sentence = ann.sentence[0]

        # The corenlp.to_text function is a helper function that
        # reconstructs a sentence from tokens.
        assert corenlp.to_text(sentence) == text

        # You can access any property within a sentence.
        print(sentence.text)

        # Likewise for tokens
        token = sentence.token[0]
        print(token.lemma)
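        # Illustrative addition (not in the original README): because the pos
        # and ner annotators were requested above, each token also carries
        # `pos` and `ner` fields in the protobuf.
        print(token.pos, token.ner)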
        # Use tokensregex patterns to find who wrote a sentence.
        pattern = '([ner: PERSON]+) /wrote/ /an?/ []{0,3} /sentence|article/'
        matches = client.tokensregex(text, pattern)
        # sentences contains a list with matches for each sentence.
        assert len(matches["sentences"]) == 1
        # length tells you whether or not there are any matches in this sentence.
        assert matches["sentences"][0]["length"] == 1
        # You can access matches like most regex groups.
        matches["sentences"][0]["0"]["text"] == "Chris wrote a simple sentence"
        matches["sentences"][0]["0"]["1"]["text"] == "Chris"
        # Use semgrex patterns to directly find who wrote what.
        pattern = '{word:wrote} >nsubj {}=subject >dobj {}=object'
        matches = client.semgrex(text, pattern)
        # sentences contains a list with matches for each sentence.
        assert len(matches["sentences"]) == 1
        # length tells you whether or not there are any matches in this sentence.
        assert matches["sentences"][0]["length"] == 1
        # You can access matches like most regex groups.
        matches["sentences"][0]["0"]["text"] == "wrote"
        matches["sentences"][0]["0"]["$subject"]["text"] == "Chris"
        matches["sentences"][0]["0"]["$object"]["text"] == "sentence"

See `test_client.py` and `test_protobuf.py` for more examples. Props to
@dan-zheng for tokensregex/semgrex support.

Annotation Service Usage
------------------------

*NOTE*: The annotation service allows users to provide a custom
annotator to be used by the CoreNLP pipeline. Unfortunately, it relies
on experimental code internal to the Stanford CoreNLP project that is not yet
available for public use.

.. code-block:: python

    import corenlp
    from .happyfuntokenizer import Tokenizer

    class HappyFunTokenizer(Tokenizer, corenlp.Annotator):
        def __init__(self, preserve_case=False):
            Tokenizer.__init__(self, preserve_case)
            corenlp.Annotator.__init__(self)

        @property
        def name(self):
            """
            Name of the annotator (used by CoreNLP)
            """
            return "happyfun"

        @property
        def requires(self):
            """
            Requires has to specify all the annotations required before we
            are called.
            """
            return []

        @property
        def provides(self):
            """
            The set of annotations guaranteed to be provided when we are done.
            NOTE: these annotations are either fully qualified Java class
            names or refer to nested classes of
            edu.stanford.nlp.ling.CoreAnnotations (as is the case below).
            """
            return ["TextAnnotation",
                    "TokensAnnotation",
                    "TokenBeginAnnotation",
                    "TokenEndAnnotation",
                    "CharacterOffsetBeginAnnotation",
                    "CharacterOffsetEndAnnotation",
                    ]

        def annotate(self, ann):
"""
@ann: is a protobuf annotation object.
Actually populate @ann with tokens.
"""
buf, beg_idx, end_idx = ann.text.lower(), 0, 0
for i, word in enumerate(self.tokenize(ann.text)):
token = ann.sentencelessToken.add()
# These are the bare minimum required for the TokenAnnotation
token.word = word
token.tokenBeginIndex = i
token.tokenEndIndex = i+1# Seek into the txt until you can find this word.
try:
# Try to update beginning index
beg_idx = buf.index(word, beg_idx)
except ValueError:
# Give up -- this will be something random
end_idx = beg_idx + len(word)token.beginChar = beg_idx
token.endChar = end_idxbeg_idx, end_idx = end_idx, end_idx
    annotator = HappyFunTokenizer()
    # Calling .start() will launch the annotator as a service running on
    # port 8432 by default.
    annotator.start()

    # annotator.properties contains all the right properties for
    # Stanford CoreNLP to use this annotator.
    with corenlp.CoreNLPClient(properties=annotator.properties, annotators="happyfun ssplit pos".split()) as client:
        ann = client.annotate("RT @ #happyfuncoding: this is a typical Twitter tweet :-)")

        tokens = [t.word for t in ann.sentence[0].token]
        print(tokens)

See `test_annotator.py` for more examples.