{"id":34041951,"url":"https://github.com/hltcoe/concrete-python","last_synced_at":"2026-04-08T19:32:03.937Z","repository":{"id":14508704,"uuid":"17222228","full_name":"hltcoe/concrete-python","owner":"hltcoe","description":"Python modules and scripts for working with Concrete, a data serialization format for NLP","archived":false,"fork":false,"pushed_at":"2023-10-20T21:49:01.000Z","size":2058,"stargazers_count":21,"open_issues_count":4,"forks_count":8,"subscribers_count":6,"default_branch":"main","last_synced_at":"2026-01-06T10:17:58.942Z","etag":null,"topics":["annotation","communication-protocol","data-format","hlt","nlp","python","thrift"],"latest_commit_sha":null,"homepage":"https://concrete-python.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hltcoe.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGELOG","contributing":"CONTRIBUTING.rst","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2014-02-26T19:27:30.000Z","updated_at":"2025-01-18T22:00:19.000Z","dependencies_parsed_at":"2023-02-19T09:46:40.957Z","dependency_job_id":"eb1cfc76-bffb-43e0-8a24-7bc91fe6271e","html_url":"https://github.com/hltcoe/concrete-python","commit_stats":{"total_commits":928,"total_committers":18,"mean_commits":51.55555555555556,"dds":0.5301724137931034,"last_synced_commit":"e64a735a290a6ed44289ea3ec902702c616514fb"},"previous_names":[],"tags_count":83,"template":false,"template_full_name":null,"purl":"pkg:github/hltcoe/concrete-python","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hltcoe%2Fconcrete-python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hltcoe%2Fconcrete-python/tags","releases_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hltcoe%2Fconcrete-python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hltcoe%2Fconcrete-python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hltcoe","download_url":"https://codeload.github.com/hltcoe/concrete-python/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hltcoe%2Fconcrete-python/sbom","scorecard":{"id":466487,"data":{"date":"2025-08-11","repo":{"name":"github.com/hltcoe/concrete-python","commit":"5fb4fbb968970404b15806902072e6010b974e48"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3.9,"checks":[{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: .github/workflows/docker-build-and-push.yml:1","Warn: no topLevel permission defined: .github/workflows/tox.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least 
privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"License","score":9,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Warn: project license file does not contain an FSF or OSI license."],"documentation":{"short":"Determines if the project has defined a 
license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Pinned-Dependencies","score":2,"reason":"dependency not pinned by hash detected -- score normalized to 2","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/docker-build-and-push.yml:36: update your workflow using https://app.stepsecurity.io/secureworkflow/hltcoe/concrete-python/docker-build-and-push.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/tox.yml:18: update your workflow using 
https://app.stepsecurity.io/secureworkflow/hltcoe/concrete-python/tox.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/tox.yml:20: update your workflow using https://app.stepsecurity.io/secureworkflow/hltcoe/concrete-python/tox.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/tox.yml:37: update your workflow using https://app.stepsecurity.io/secureworkflow/hltcoe/concrete-python/tox.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/tox.yml:39: update your workflow using https://app.stepsecurity.io/secureworkflow/hltcoe/concrete-python/tox.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/tox.yml:52: update your workflow using https://app.stepsecurity.io/secureworkflow/hltcoe/concrete-python/tox.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/tox.yml:54: update your workflow using https://app.stepsecurity.io/secureworkflow/hltcoe/concrete-python/tox.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/tox.yml:71: update your workflow using https://app.stepsecurity.io/secureworkflow/hltcoe/concrete-python/tox.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/tox.yml:73: update your workflow using https://app.stepsecurity.io/secureworkflow/hltcoe/concrete-python/tox.yml/main?enable=pin","Warn: containerImage not pinned by hash: Dockerfile:1: pin your Docker image by updating ccmaymay/concrete-python-base:thrift-0.19.0 to ccmaymay/concrete-python-base:thrift-0.19.0@sha256:bc0238343879e751d68fac70fbaa50e5169ff8de4cc08e305adbd98ac0432e31","Warn: pipCommand not pinned by hash: Dockerfile:3","Warn: pipCommand not pinned by hash: Dockerfile:7-8","Warn: pipCommand not pinned by hash: install-mojave-homebrew-accelerated-thrift.sh:21","Warn: pipCommand not pinned by hash: 
.github/workflows/tox.yml:25","Warn: pipCommand not pinned by hash: .github/workflows/tox.yml:44","Info:   0 out of   9 GitHub-owned GitHubAction dependencies pinned","Info:   4 out of   4 third-party GitHubAction dependencies pinned","Info:   0 out of   1 containerImage dependencies pinned","Info:   0 out of   5 pipCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Branch-Protection","score":-1,"reason":"internal error: error during branchesHandler.setup: internal error: githubv4.Query: Resource not accessible by integration","details":null,"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 7 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-19T12:39:12.103Z","repository_id":14508704,"created_at":"2025-08-19T12:39:12.103Z","updated_at":"2025-08-19T12:39:12.103Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31571600,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-08T14:31:17.711Z","status":"ssl_error","status_checked_at":"2026-04-08T14:31:17.202Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["annotation","communication-protocol","data-format","hlt","nlp","python","thrift"],"created_at":"2025-12-13T22:11:04.773Z","updated_at":"2026-04-08T19:32:03.929Z","avatar_url":"https://github.com/hltcoe.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Tutorial\n========\n\n.. image:: https://badge.fury.io/py/concrete.svg\n   :target: https://badge.fury.io/py/concrete\n.. image:: https://github.com/hltcoe/concrete-python/actions/workflows/tox.yml/badge.svg\n   :target: https://github.com/hltcoe/concrete-python/actions/workflows/tox.yml\n.. image:: https://github.com/hltcoe/concrete-python/actions/workflows/docker-build-and-push.yml/badge.svg\n   :target: https://github.com/hltcoe/concrete-python/actions/workflows/docker-build-and-push.yml\n\n\nConcrete-python is the Python interface to Concrete_, a\nnatural language processing data format and set of service protocols\nthat work across different operating systems and programming languages\nvia `Apache Thrift`_.  Concrete-python contains generated Python\nclasses, utility classes and functions, and scripts.  It does not contain the\nThrift schema for Concrete, which can be found in the\n`Concrete GitHub repository`_.\n\nThis document provides a quick tutorial of concrete-python installation and\nusage.  For more information, including an API reference and development\ninformation, please see the `online documentation`_.\n\n\n.. 
contents:: **Table of Contents**\n   :local:\n   :backlinks: none\n\n\nLicense\n-------\n\nCopyright 2012-2019 Johns Hopkins University HLTCOE. All rights\nreserved.  This software is released under the 2-clause BSD license.\nPlease see LICENSE_ for more information.\n\n\nRequirements\n------------\n\nconcrete-python is tested on Python 3.5 and requires the\nThrift Python library, among other Python libraries.  These are\ninstalled automatically by ``setup.py`` or ``pip``.  The Thrift\ncompiler is *not* required.\n\n**Note**: The accelerated protocol offers a (de)serialization speedup\nof 10x or more; if you would like to use it, ensure a C++ compiler is\navailable on your system before installing concrete-python.\n(If a compiler is not available, concrete-python will fall back to the\nunaccelerated protocol automatically.)  If you are on Linux, a suitable\nC++ compiler will be listed as ``g++`` or ``gcc-c++`` in your package\nmanager.\n\nIf you are using macOS Mojave with the Homebrew package manager\n(https://brew.sh), you can install the accelerated protocol using\nthe script ``install-mojave-homebrew-accelerated-thrift.sh``.\n\n\nInstallation\n------------\n\nYou can install Concrete using the ``pip`` package manager::\n\n    pip install concrete\n\nor by cloning the repository and running ``setup.py``::\n\n    git clone https://github.com/hltcoe/concrete-python.git\n    cd concrete-python\n    python setup.py install\n\n\nBasic usage\n-----------\n\nHere and in the following sections we make use of an example Concrete\nCommunication file included in the concrete-python source distribution.\nThe *Communication* type represents an article, book, post, Tweet, or\nany other kind of document that we might want to store and analyze.\nCopy it from ``tests/testdata/serif_dog-bites-man.concrete`` if you\nhave the concrete-python source distribution or download it\nseparately here: serif_dog-bites-man.concrete_.\n\nFirst we use the ``concrete-inspect.py`` tool 
(explained in more detail\nin the following section) to inspect some of the contents of the\nCommunication::\n\n    concrete-inspect.py --text serif_dog-bites-man.concrete\n\nThis command prints the text of the Communication to the console.  In\nour case the text is a short article formatted in SGML::\n\n    \u003cDOC id=\"dog-bites-man\" type=\"other\"\u003e\n    \u003cHEADLINE\u003e\n    Dog Bites Man\n    \u003c/HEADLINE\u003e\n    \u003cTEXT\u003e\n    \u003cP\u003e\n    John Smith, manager of ACMÉ INC, was bit by a dog on March 10th, 2013.\n    \u003c/P\u003e\n    \u003cP\u003e\n    He died!\n    \u003c/P\u003e\n    \u003cP\u003e\n    John's daughter Mary expressed sorrow.\n    \u003c/P\u003e\n    \u003c/TEXT\u003e\n    \u003c/DOC\u003e\n\nNow run the following command to inspect some of the annotations stored\nin that Communication::\n\n    concrete-inspect.py --ner --pos --dependency serif_dog-bites-man.concrete\n\nThis command shows a tokenization, part-of-speech tagging, named entity\ntagging, and dependency parse in a CoNLL_-like columnar format::\n\n    INDEX\tTOKEN\tPOS\tNER\tHEAD\tDEPREL\n    -----\t-----\t---\t---\t----\t------\n    1\tJohn\tNNP\tPER\t2\tcompound\n    2\tSmith\tNNP\tPER\t10\tnsubjpass\n    3\t,\t,\n    4\tmanager\tNN\t\t2\tappos\n    5\tof\tIN\t\t7\tcase\n    6\tACMÉ\tNNP\tORG\t7\tcompound\n    7\tINC\tNNP\tORG\t4\tnmod\n    8\t,\t,\n    9\twas\tVBD\t\t10\tauxpass\n    10\tbit\tNN\t\t0\tROOT\n    11\tby\tIN\t\t13\tcase\n    12\ta\tDT\t\t13\tdet\n    13\tdog\tNN\t\t10\tnmod\n    14\ton\tIN\t\t15\tcase\n    15\tMarch\tDATE-NNP\t\t13\tnmod\n    16\t10th\tJJ\t\t15\tamod\n    17\t,\t,\n    18\t2013\tCD\t\t13\tamod\n    19\t.\t.\n\n    1\tHe\tPRP\t\t2\tnsubj\n    2\tdied\tVBD\t\t0\tROOT\n    3\t!\t.\n\n    1\tJohn\tNNP\tPER\t3\tnmod:poss\n    2\t's\tPOS\t\t1\tcase\n    3\tdaughter\tNN\t\t5\tdep\n    4\tMary\tNNP\tPER\t5\tnsubj\n    5\texpressed\tVBD\t\t0\tROOT\n    6\tsorrow\tNN\t\t5\tdobj\n    7\t.\t.\n\nReading 
Concrete\n~~~~~~~~~~~~~~~~\n\nThere are even more annotations stored in this Communication, but for\nnow we move on to demonstrate handling of the Communication in Python.\nThe example file contains a single Communication, but many (if\nnot most) files contain several.  The same code can be used to read\nCommunications in a regular file, tar archive, or zip\narchive::\n\n    from concrete.util import CommunicationReader\n\n    for (comm, filename) in CommunicationReader('serif_dog-bites-man.concrete'):\n        print(comm.id)\n        print()\n        print(comm.text)\n\nThis loop prints the unique ID and text (the same text we saw\nbefore) of our one Communication::\n\n    tests/testdata/serif_dog-bites-man.xml\n\n    \u003cDOC id=\"dog-bites-man\" type=\"other\"\u003e\n    \u003cHEADLINE\u003e\n    Dog Bites Man\n    \u003c/HEADLINE\u003e\n    \u003cTEXT\u003e\n    \u003cP\u003e\n    John Smith, manager of ACMÉ INC, was bit by a dog on March 10th, 2013.\n    \u003c/P\u003e\n    \u003cP\u003e\n    He died!\n    \u003c/P\u003e\n    \u003cP\u003e\n    John's daughter Mary expressed sorrow.\n    \u003c/P\u003e\n    \u003c/TEXT\u003e\n    \u003c/DOC\u003e\n\nIn addition to the general-purpose ``CommunicationReader`` there is a\nconvenience function for reading a single Communication from a regular\nfile::\n\n    from concrete.util import read_communication_from_file\n\n    comm = read_communication_from_file('serif_dog-bites-man.concrete')\n\nCommunications are broken into *Sections*, which are in turn broken\ninto *Sentences*, which are in turn broken into *Tokens* (and that's\nonly scratching the surface).  
To traverse this decomposition::\n\n    from concrete.util import lun, get_tokens\n\n    for section in lun(comm.sectionList):\n        print('* section')\n        for sentence in lun(section.sentenceList):\n            print('  + sentence')\n            for token in get_tokens(sentence.tokenization):\n                print('    - ' + token.text)\n\nThe output is::\n\n    * section\n    * section\n      + sentence\n        - John\n        - Smith\n        - ,\n        - manager\n        - of\n        - ACMÉ\n        - INC\n        - ,\n        - was\n        - bit\n        - by\n        - a\n        - dog\n        - on\n        - March\n        - 10th\n        - ,\n        - 2013\n        - .\n    * section\n      + sentence\n        - He\n        - died\n        - !\n    * section\n      + sentence\n        - John\n        - 's\n        - daughter\n        - Mary\n        - expressed\n        - sorrow\n        - .\n\nHere we used ``get_tokens``, which abstracts the process of extracting\na sequence of *Tokens* from a *Tokenization*, and ``lun``, which\nreturns its argument or (if its argument is ``None``) an empty list\nand stands for \"list un-none\".  Many fields in Concrete are optional,\nincluding ``Communication.sectionList`` and ``Section.sentenceList``;\nchecking for ``None`` quickly becomes tedious.\n\nIn this Communication the tokens have been annotated with\npart-of-speech tags, as we saw previously using\n``concrete-inspect.py``.  
We can print them with the following code::\n\n    from concrete.util import get_tagged_tokens\n\n    for section in lun(comm.sectionList):\n        print('* section')\n        for sentence in lun(section.sentenceList):\n            print('  + sentence')\n            for token_tag in get_tagged_tokens(sentence.tokenization, 'POS'):\n                print('    - ' + token_tag.tag)\n\nThe output is::\n\n    * section\n    * section\n      + sentence\n        - NNP\n        - NNP\n        - ,\n        - NN\n        - IN\n        - NNP\n        - NNP\n        - ,\n        - VBD\n        - NN\n        - IN\n        - DT\n        - NN\n        - IN\n        - DATE-NNP\n        - JJ\n        - ,\n        - CD\n        - .\n    * section\n      + sentence\n        - PRP\n        - VBD\n        - .\n    * section\n      + sentence\n        - NNP\n        - POS\n        - NN\n        - NNP\n        - VBD\n        - NN\n        - .\n\nWriting Concrete\n~~~~~~~~~~~~~~~~\n\nWe can add a new part-of-speech tagging to the Communication as well.\nLet's add a simplified version of the current tagging::\n\n    from concrete.util import AnalyticUUIDGeneratorFactory, now_timestamp\n    from concrete import TokenTagging, TaggedToken, AnnotationMetadata\n\n    augf = AnalyticUUIDGeneratorFactory(comm)\n    aug = augf.create()\n\n    for section in lun(comm.sectionList):\n        for sentence in lun(section.sentenceList):\n            sentence.tokenization.tokenTaggingList.append(TokenTagging(\n                uuid=aug.next(),\n                metadata=AnnotationMetadata(\n                    tool='Simple POS',\n                    timestamp=now_timestamp(),\n                    kBest=1\n                ),\n                taggingType='POS',\n                taggedTokenList=[\n                    TaggedToken(\n                        tokenIndex=original.tokenIndex,\n                        tag=original.tag.split('-')[-1][:2],\n                    )\n                    for original\n     
               in get_tagged_tokens(sentence.tokenization, 'POS')\n                ]\n            ))\n\nHere we used ``AnalyticUUIDGeneratorFactory``, which creates generators of\nConcrete *UUID* objects (see `Working with UUIDs`_ for more information).\nWe also used ``now_timestamp``, which returns a Concrete timestamp representing\nthe current time.  But now how do we know which tagging is ours?  Each\nannotation's metadata contains a *tool* name, and we can use it to\ndistinguish between competing annotations::\n\n    from concrete.util import get_tagged_tokens\n\n    for section in lun(comm.sectionList):\n        print('* section')\n        for sentence in lun(section.sentenceList):\n            print('  + sentence')\n            token_tag_pairs = zip(\n                get_tagged_tokens(sentence.tokenization, 'POS', tool='Serif: part-of-speech'),\n                get_tagged_tokens(sentence.tokenization, 'POS', tool='Simple POS')\n            )\n            for (old_tag, new_tag) in token_tag_pairs:\n                print('    - ' + old_tag.tag + ' -\u003e ' + new_tag.tag)\n\nThe output shows our new part-of-speech tagging has a smaller, simpler\nset of possible values::\n\n    * section\n    * section\n      + sentence\n        - NNP -\u003e NN\n        - NNP -\u003e NN\n        - , -\u003e ,\n        - NN -\u003e NN\n        - IN -\u003e IN\n        - NNP -\u003e NN\n        - NNP -\u003e NN\n        - , -\u003e ,\n        - VBD -\u003e VB\n        - NN -\u003e NN\n        - IN -\u003e IN\n        - DT -\u003e DT\n        - NN -\u003e NN\n        - IN -\u003e IN\n        - DATE-NNP -\u003e NN\n        - JJ -\u003e JJ\n        - , -\u003e ,\n        - CD -\u003e CD\n        - . -\u003e .\n    * section\n      + sentence\n        - PRP -\u003e PR\n        - VBD -\u003e VB\n        - . 
-\u003e .\n    * section\n      + sentence\n        - NNP -\u003e NN\n        - POS -\u003e PO\n        - NN -\u003e NN\n        - NNP -\u003e NN\n        - VBD -\u003e VB\n        - NN -\u003e NN\n        - . -\u003e .\n\nFinally, let's write our newly annotated Communication back to disk::\n\n    from concrete.util import CommunicationWriter\n\n    with CommunicationWriter('serif_dog-bites-man.concrete') as writer:\n        writer.write(comm)\n\nNote there are many other useful classes and functions in the\n``concrete.util`` library.  See the API reference in the\n`online documentation`_ for details.\n\n\nconcrete-inspect.py\n-------------------\n\nUse ``concrete-inspect.py`` to quickly explore the contents of a\nCommunication from the command line.  ``concrete-inspect.py`` and other\nscripts are installed to the path along with the concrete-python\nlibrary.\n\n--id\n~~~~\n\nRun the following command to print the unique ID of our modified\nexample Communication::\n\n    concrete-inspect.py --id serif_dog-bites-man.concrete\n\nOutput::\n\n    tests/testdata/serif_dog-bites-man.xml\n\n--metadata\n~~~~~~~~~~\n\nUse ``--metadata`` to print the stored annotations along with their\ntool names::\n\n    concrete-inspect.py --metadata serif_dog-bites-man.concrete\n\nOutput::\n\n    Communication:  concrete_serif v3.10.1pre\n\n      Tokenization:  Serif: tokens\n\n        Dependency Parse:  Stanford\n\n        Parse:  Serif: parse\n\n        TokenTagging:  Serif: names\n        TokenTagging:  Serif: part-of-speech\n        TokenTagging:  Simple POS\n\n      EntityMentionSet #0:  Serif: names\n      EntityMentionSet #1:  Serif: values\n      EntityMentionSet #2:  Serif: mentions\n\n      EntitySet #0:  Serif: doc-entities\n      EntitySet #1:  Serif: doc-values\n\n      SituationMentionSet #0:  Serif: relations\n      SituationMentionSet #1:  Serif: events\n\n      SituationSet #0:  Serif: relations\n      SituationSet #1:  Serif: events\n\n      CommunicationTagging:  
lda\n      CommunicationTagging:  urgency\n\n--sections\n~~~~~~~~~~\n\nUse ``--sections`` to print the text of the Communication, broken out\nby section::\n\n    concrete-inspect.py --sections serif_dog-bites-man.concrete\n\nOutput::\n\n    Section 0 (0ab68635-c83d-4b02-b8c3-288626968e05)[kind: SectionKind.PASSAGE], from 81 to 82:\n\n\n\n    Section 1 (54902d75-1841-4d8d-b4c5-390d4ef1a47a)[kind: SectionKind.PASSAGE], from 85 to 162:\n\n    John Smith, manager of ACMÉ INC, was bit by a dog on March 10th, 2013.\n    \u003c/P\u003e\n\n\n    Section 2 (7ec8b7d9-6be0-4c62-af57-3c6c48bad711)[kind: SectionKind.PASSAGE], from 165 to 180:\n\n    He died!\n    \u003c/P\u003e\n\n\n    Section 3 (68da91a1-5beb-4129-943d-170c40c7d0f7)[kind: SectionKind.PASSAGE], from 183 to 228:\n\n    John's daughter Mary expressed sorrow.\n    \u003c/P\u003e\n\n--entities\n~~~~~~~~~~\n\nUse ``--entities`` to print the named entities detected in the\nCommunication::\n\n    concrete-inspect.py --entities serif_dog-bites-man.concrete\n\nOutput::\n\n    Entity Set 0 (Serif: doc-entities):\n      Entity 0-0:\n          EntityMention 0-0-0:\n              tokens:     John Smith\n              text:       John Smith\n              entityType: PER\n              phraseType: PhraseType.NAME\n          EntityMention 0-0-1:\n              tokens:     John Smith , manager of ACMÉ INC ,\n              text:       John Smith, manager of ACMÉ INC,\n              entityType: PER\n              phraseType: PhraseType.APPOSITIVE\n              child EntityMention #0:\n                  tokens:     John Smith\n                  text:       John Smith\n                  entityType: PER\n                  phraseType: PhraseType.NAME\n              child EntityMention #1:\n                  tokens:     manager of ACMÉ INC\n                  text:       manager of ACMÉ INC\n                  entityType: PER\n                  phraseType: PhraseType.COMMON_NOUN\n          EntityMention 0-0-2:\n              tokens:  
   manager of ACMÉ INC\n              text:       manager of ACMÉ INC\n              entityType: PER\n              phraseType: PhraseType.COMMON_NOUN\n          EntityMention 0-0-3:\n              tokens:     He\n              text:       He\n              entityType: PER\n              phraseType: PhraseType.PRONOUN\n          EntityMention 0-0-4:\n              tokens:     John\n              text:       John\n              entityType: PER.Individual\n              phraseType: PhraseType.NAME\n\n      Entity 0-1:\n          EntityMention 0-1-0:\n              tokens:     ACMÉ INC\n              text:       ACMÉ INC\n              entityType: ORG\n              phraseType: PhraseType.NAME\n\n      Entity 0-2:\n          EntityMention 0-2-0:\n              tokens:     John 's daughter Mary\n              text:       John's daughter Mary\n              entityType: PER.Individual\n              phraseType: PhraseType.NAME\n              child EntityMention #0:\n                  tokens:     Mary\n                  text:       Mary\n                  entityType: PER\n                  phraseType: PhraseType.OTHER\n          EntityMention 0-2-1:\n              tokens:     daughter\n              text:       daughter\n              entityType: PER\n              phraseType: PhraseType.COMMON_NOUN\n\n\n    Entity Set 1 (Serif: doc-values):\n      Entity 1-0:\n          EntityMention 1-0-0:\n              tokens:     March 10th , 2013\n              text:       March 10th, 2013\n              entityType: TIMEX2.TIME\n              phraseType: PhraseType.OTHER\n\n--mentions\n~~~~~~~~~~\n\nUse ``--mentions`` to show the named entity *mentions* in the\nCommunication, annotated on the text::\n\n    concrete-inspect.py --mentions serif_dog-bites-man.concrete\n\nOutput::\n\n    \u003cENTITY ID=0\u003e\u003cENTITY ID=0\u003eJohn Smith\u003c/ENTITY\u003e , \u003cENTITY ID=0\u003emanager of \u003cENTITY ID=1\u003eACMÉ INC\u003c/ENTITY\u003e\u003c/ENTITY\u003e ,\u003c/ENTITY\u003e 
was bit by a dog on \u003cENTITY ID=3\u003eMarch 10th , 2013\u003c/ENTITY\u003e .\n\n    \u003cENTITY ID=0\u003eHe\u003c/ENTITY\u003e died !\n\n    \u003cENTITY ID=2\u003e\u003cENTITY ID=0\u003eJohn\u003c/ENTITY\u003e 's \u003cENTITY ID=2\u003edaughter\u003c/ENTITY\u003e Mary\u003c/ENTITY\u003e expressed sorrow .\n\n--situations\n~~~~~~~~~~~~\n\nUse ``--situations`` to show the situations detected in the\nCommunication::\n\n    concrete-inspect.py --situations serif_dog-bites-man.concrete\n\nOutput::\n\n    Situation Set 0 (Serif: relations):\n\n    Situation Set 1 (Serif: events):\n      Situation 1-0:\n          situationType:    Life.Die\n\n--treebank\n~~~~~~~~~~\n\nUse ``--treebank`` to show constituency parse trees of the sentences in\nthe Communication::\n\n    concrete-inspect.py --treebank serif_dog-bites-man.concrete\n\nOutput::\n\n    (S (NP (NPP (NNP john)\n                (NNP smith))\n           (, ,)\n           (NP (NPA (NN manager))\n               (PP (IN of)\n                   (NPP (NNP acme)\n                        (NNP inc))))\n           (, ,))\n       (VP (VBD was)\n           (NP (NPA (NN bit))\n               (PP (IN by)\n                   (NP (NPA (DT a)\n                            (NN dog))\n                       (PP (IN on)\n                           (NP (DATE (DATE-NNP march)\n                                     (JJ 10th))\n                               (, ,)\n                               (NPA (CD 2013))))))))\n       (. .))\n\n\n    (S (NPA (PRP he))\n       (VP (VBD died))\n       (. !))\n\n\n    (S (NPA (NPPOS (NPP (NNP john))\n                   (POS 's))\n            (NN daughter)\n            (NPP (NNP mary)))\n       (VP (VBD expressed)\n           (NPA (NN sorrow)))\n       (. 
.))\n\nOther options\n~~~~~~~~~~~~~\n\nUse ``--ner``, ``--pos``, ``--lemmas``, and ``--dependency`` (together\nor independently) to show respective token-level information in a\nCoNLL-like format, and use ``--text`` to print the text of the\nCommunication, as described in a previous section.\n\nRun ``concrete-inspect.py --help`` to show a detailed help message\nexplaining the options discussed above and others.  All\nconcrete-python scripts have such help messages.\n\n\ncreate-comm.py\n--------------\n\nUse ``create-comm.py`` to generate a simple Communication from a text\nfile.  For example, create a file called ``history-of-the-world.txt``\ncontaining the following text::\n\n    The dog ran .\n    The cat jumped .\n\n    The dolphin teleported .\n\nThen run the following command to convert it to a Concrete\nCommunication, creating Sections, Sentences, and Tokens based on\nwhitespace::\n\n    create-comm.py --annotation-level token history-of-the-world.txt history-of-the-world.concrete\n\nUse ``concrete-inspect.py`` as shown previously to verify the\nstructure of the Communication::\n\n    concrete-inspect.py --sections history-of-the-world.concrete\n\nOutput::\n\n    Section 0 (a188dcdd-1ade-be5d-41c4-fd4d81f71685)[kind: passage], from 0 to 30:\n    The dog ran .\n    The cat jumped .\n\n    Section 1 (a188dcdd-1ade-be5d-41c4-fd4d81f7168a)[kind: passage], from 32 to 57:\n    The dolphin teleported .\n\nOther scripts\n-------------\n\nconcrete-python provides a number of other scripts, including but not\nlimited to:\n\n``concrete2json.py``\n    reads in a Concrete Communication and prints a\n    JSON version of the Communication to stdout.  
The JSON is \"pretty\n    printed\" with indentation and whitespace, which makes the JSON\n    easier to read and to use for diffs.\n\n``create-comm-tarball.py``\n    like ``create-comm.py`` but for multiple files: reads in a tar.gz\n    archive of text files, parses them into sections and sentences based\n    on whitespace, and writes them back out as Concrete Communications\n    in another tar.gz archive.\n\n``fetch-client.py``\n    connects to a FetchCommunicationService, retrieves one or more\n    Communications (as specified on the command line), and writes them\n    to disk.\n\n``fetch-server.py``\n    implements FetchCommunicationService, serving Communications to\n    clients from a file or directory of Communications on disk.\n\n``search-client.py``\n    connects to a SearchService, reading queries from the console and\n    printing out results as Communication ids in a loop.\n\n``validate-communication.py``\n    reads in a Concrete Communication file and prints out information\n    about any invalid fields.  This script is a command-line wrapper\n    around the functionality in the ``concrete.validate`` library.\n\nUse the ``--help`` flag for details about the scripts' command line\narguments.\n\n\nWorking with UUIDs\n------------------\n\nEach *UUID* object contains a single string,\n``uuidString``, which can be used as a universally unique identifier for the\nobject the *UUID* is attached to.  
The ``AnalyticUUIDGeneratorFactory`` produces\n*UUID* generators for a *Communication,* one for each analytic (tool) used to\nprocess the *Communication.*  In contrast to the Python ``uuid`` library, the\n``AnalyticUUIDGeneratorFactory`` yields UUIDs that have common prefixes within a\n*Communication* and within annotations produced by the same analytic, enabling\ncommon compression algorithms to store the UUIDs in each *Communication*\nmuch more efficiently.  See the ``AnalyticUUIDGeneratorFactory`` class in the API\nreference in the `online documentation`_ for more information.\n\nNote that ``uuidString`` is generated by\na random process, so running the same code twice will result in two\ncompletely different sets of identifiers.  Concretely, if you run a parser to\nproduce a part-of-speech *TokenTagging* for each *Tokenization* in a\n*Communication,* save the modified *Communication,* then run the parser again on\nthe same original *Communication,* you will get two different identifiers for\neach *TokenTagging,* even though the contents of each pair of\n*TokenTaggings*---the part-of-speech tags---may be identical.\n\n\nValidating Concrete Communications\n----------------------------------\n\nThe Python version of the Thrift libraries does not perform any\nvalidation of Thrift objects.  
You should use the\n``validate_communication()`` function after reading and before writing\na Concrete Communication::\n\n    from concrete.util import read_communication_from_file\n    from concrete.validate import validate_communication\n\n    comm = read_communication_from_file('tests/testdata/serif_dog-bites-man.concrete')\n\n    # Returns True|False, logs details using Python stdlib 'logging' module\n    validate_communication(comm)\n\nThrift fields have three levels of requiredness:\n\n* explicitly labeled as **required**\n* explicitly labeled as **optional**\n* no requiredness label given (\"default required\")\n\nOther Concrete tools will raise an exception if a **required** field is\nmissing on deserialization or serialization, and will raise an\nexception if a \"default required\" field is missing on serialization.\nBy default, concrete-python does not perform any validation of Thrift\nobjects on serialization or deserialization.  The Python Thrift classes\ndo provide shallow ``validate()`` methods, but they only check for\nexplicitly **required** fields (not \"default required\" fields) and do\nnot validate nested objects.\n\nThe ``validate_communication()`` function recursively checks a\nCommunication object for required fields, plus additional checks for\nUUID mismatches.\n\n\n\n\n\n.. _Concrete: http://hltcoe.github.io/concrete/\n.. _`online documentation`: http://hltcoe.github.io/concrete-python/\n.. _`Apache Thrift`: http://thrift.apache.org\n.. _`Concrete GitHub repository`: https://github.com/hltcoe/concrete\n.. _serif_dog-bites-man.concrete: https://github.com/hltcoe/concrete-python/raw/main/tests/testdata/serif_dog-bites-man.concrete\n.. _CoNLL: http://ufal.mff.cuni.cz/conll2009-st/task-description.html\n.. 
_LICENSE: https://github.com/hltcoe/concrete-python/blob/main/LICENSE\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhltcoe%2Fconcrete-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhltcoe%2Fconcrete-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhltcoe%2Fconcrete-python/lists"}