{"id":23326422,"url":"https://github.com/dkpro/dkpro-cassis","last_synced_at":"2025-04-09T10:06:13.789Z","repository":{"id":41329855,"uuid":"150018465","full_name":"dkpro/dkpro-cassis","owner":"dkpro","description":"UIMA CAS processing library written in Python","archived":false,"fork":false,"pushed_at":"2024-04-23T06:08:59.000Z","size":580,"stargazers_count":84,"open_issues_count":25,"forks_count":23,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-04-23T10:45:18.672Z","etag":null,"topics":["annotation","cas","nlp","python","uima"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/dkpro-cassis/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dkpro.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-09-23T19:32:46.000Z","updated_at":"2024-05-01T12:24:06.582Z","dependencies_parsed_at":"2022-07-13T11:20:28.490Z","dependency_job_id":"4be8c101-6b58-441f-9076-3b1d61a8d6fd","html_url":"https://github.com/dkpro/dkpro-cassis","commit_stats":{"total_commits":327,"total_committers":7,"mean_commits":"46.714285714285715","dds":0.3883792048929664,"last_synced_commit":"97f4fe0737032d4d9a4e00a59cb84cadc34a79cc"},"previous_names":[],"tags_count":30,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dkpro%2Fdkpro-cassis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dkpro%2Fdkpro-cassis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/dkpro%2Fdkpro-cassis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dkpro%2Fdkpro-cassis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dkpro","download_url":"https://codeload.github.com/dkpro/dkpro-cassis/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248018060,"owners_count":21034048,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["annotation","cas","nlp","python","uima"],"created_at":"2024-12-20T19:17:55.518Z","updated_at":"2025-04-09T10:06:13.748Z","avatar_url":"https://github.com/dkpro.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"dkpro-cassis\n============\n\n.. image:: https://github.com/dkpro/dkpro-cassis/actions/workflows/run_tests.yml/badge.svg\n  :target: https://github.com/dkpro/dkpro-cassis/actions/workflows/run_tests.yml\n\n.. image:: https://readthedocs.org/projects/cassis/badge/?version=latest\n  :target: https://cassis.readthedocs.io/en/latest/?badge=latest\n  :alt: Documentation Status\n\n.. image:: https://codecov.io/gh/dkpro/dkpro-cassis/branch/master/graph/badge.svg\n  :target: https://codecov.io/gh/dkpro/dkpro-cassis\n\n.. image:: https://img.shields.io/pypi/l/dkpro-cassis.svg\n  :alt: PyPI - License\n  :target: https://pypi.org/project/dkpro-cassis/\n\n.. image:: https://img.shields.io/pypi/pyversions/dkpro-cassis.svg\n  :alt: PyPI - Python Version\n  :target: https://pypi.org/project/dkpro-cassis/\n\n.. 
image:: https://img.shields.io/pypi/v/dkpro-cassis.svg\n  :alt: PyPI\n  :target: https://pypi.org/project/dkpro-cassis/\n\n.. image:: https://img.shields.io/badge/code%20style-black-000000.svg\n  :target: https://github.com/ambv/black\n  \nDKPro **cassis** (pronunciation: [ka.sis]) provides a pure-Python implementation of the *Common Analysis System* (CAS)\nas defined by the `UIMA \u003chttps://uima.apache.org\u003e`_ framework. The CAS is a data structure representing an object to\nbe enriched with annotations (the so-called *Subject of Analysis*, *SofA* for short).\n\nThis library enables the creation and manipulation of annotated documents (CAS objects) and their associated type systems as well as loading\nand saving them in the `CAS XMI XML representation \u003chttps://uima.apache.org/d/uimaj-current/ref.html#ugr.ref.xmi\u003e`_\nor the `CAS JSON representation \u003chttps://github.com/apache/uima-uimaj-io-jsoncas#readme\u003e`_ in Python programs. In particular, this can ease the integration of Python-based Natural Language Processing (e.g.\n`spacy \u003chttps://spacy.io\u003e`_ or `NLTK \u003chttps://www.nltk.org\u003e`_) and Machine Learning libraries (e.g.\n`scikit-learn \u003chttps://scikit-learn.org/stable/\u003e`_ or `Keras \u003chttps://keras.io\u003e`_) in UIMA-based text analysis workflows.\n\nAn example of cassis in action is the `spacy recommender for INCEpTION \u003chttps://github.com/inception-project/external-recommender-spacy\u003e`_,\nwhich wraps the spacy NLP library as a web service which can be used in conjunction with the `INCEpTION \u003chttps://inception-project.github.io\u003e`_\ntext annotation platform to automatically generate annotation suggestions.\n\nFeatures\n--------\n\nCurrently supported features are:\n\n- Text SofAs\n- Deserializing/serializing UIMA CAS from/to XMI\n- Deserializing/serializing UIMA CAS from/to JSON\n- Deserializing/serializing type systems from/to XML\n- Selecting annotations, selecting covered annotations, adding 
annotations\n- Type inheritance\n- Multiple SofA support\n- Type system can be changed after loading\n- Primitive and reference features and arrays of primitives and references\n\nSome features are still under development, e.g.\n\n- Proper type checking\n- XML/XMI schema validation\n\nInstallation\n------------\n\nTo install the package with :code:`pip`, just run\n\n    pip install dkpro-cassis\n\nUsage\n-----\n\nExample CAS XMI and type system files can be found under :code:`tests/test_files`.\n\nReading a CAS file\n~~~~~~~~~~~~~~~~~~\n\n**From XMI:** A CAS can be deserialized from the UIMA CAS XMI (XML 1.0) format, read from either\na file or a string, using :code:`load_cas_from_xmi`.\n\n.. code:: python\n\n    from cassis import *\n\n    with open('typesystem.xml', 'rb') as f:\n        typesystem = load_typesystem(f)\n        \n    with open('cas.xmi', 'rb') as f:\n        cas = load_cas_from_xmi(f, typesystem=typesystem)\n\n**From JSON:** The UIMA JSON CAS format is also supported and can be loaded using :code:`load_cas_from_json`.\nMost UIMA JSON CAS files come with an embedded typesystem, so it is not necessary to specify one.\n\n.. code:: python\n\n    from cassis import *\n\n    with open('cas.json', 'rb') as f:\n        cas = load_cas_from_json(f)\n\nWriting a CAS file\n~~~~~~~~~~~~~~~~~~\n\n**To XMI:** A CAS can be serialized to XMI using :code:`cas.to_xmi()`, which either writes to a file or\nreturns the result as a string.\n\n.. code:: python\n\n    from cassis import *\n\n    # Returned as a string\n    xmi = cas.to_xmi()\n\n    # Written to file\n    cas.to_xmi(\"my_cas.xmi\")\n\n**To JSON:** A CAS can also be written to JSON using :code:`cas.to_json()`.\n\n.. code:: python\n\n    from cassis import *\n\n    # Returned as a string\n    json_str = cas.to_json()\n\n    # Written to file\n    cas.to_json(\"my_cas.json\")\n\nCreating a CAS\n~~~~~~~~~~~~~~\n\nA CAS (Common Analysis System) object typically represents a (text) document. 
When using cassis,\nyou will most often read existing CAS files, modify them and then\nwrite them out again. But you can also create CAS objects from scratch,\ne.g. if you want to convert some data into a CAS object in order to create a pre-annotated text.\nIf you do not have a pre-defined typesystem to work with, you will have to define one.\n\n.. code:: python\n\n    from cassis import *\n\n    typesystem = TypeSystem()\n\n    cas = Cas(\n        sofa_string = \"Joe waited for the train . The train was late .\",\n        document_language = \"en\",\n        typesystem = typesystem)\n\n    print(cas.sofa_string)\n    print(cas.sofa_mime)\n    print(cas.document_language)\n\nAdding annotations\n~~~~~~~~~~~~~~~~~~\n\n**Note:** type names used below are examples only. The actual CAS files you will be\ndealing with will use other names! You can get a list of the types using\n:code:`cas.typesystem.get_types()`.\n\nGiven a type system with a type :code:`cassis.Token` that has an :code:`id` and a\n:code:`pos` feature, annotations can be added as follows:\n\n.. code:: python\n\n    from cassis import *\n\n    with open('typesystem.xml', 'rb') as f:\n        typesystem = load_typesystem(f)\n        \n    with open('cas.xmi', 'rb') as f:\n        cas = load_cas_from_xmi(f, typesystem=typesystem)\n       \n    Token = typesystem.get_type('cassis.Token')\n\n    tokens = [\n        Token(begin=0, end=3, id='0', pos='NNP'),\n        Token(begin=4, end=10, id='1', pos='VBD'),\n        Token(begin=11, end=14, id='2', pos='IN'),\n        Token(begin=15, end=18, id='3', pos='DT'),\n        Token(begin=19, end=24, id='4', pos='NN'),\n        Token(begin=25, end=26, id='5', pos='.'),\n    ]\n\n    for token in tokens:\n        cas.add(token)\n\nSelecting annotations\n~~~~~~~~~~~~~~~~~~~~~\n\n.. 
code:: python\n\n    from cassis import *\n\n    with open('typesystem.xml', 'rb') as f:\n        typesystem = load_typesystem(f)\n        \n    with open('cas.xmi', 'rb') as f:\n        cas = load_cas_from_xmi(f, typesystem=typesystem)\n\n    for sentence in cas.select('cassis.Sentence'):\n        for token in cas.select_covered('cassis.Token', sentence):\n            print(token.get_covered_text())\n            \n            # Annotation values can be accessed as properties\n            print('Token: begin={0}, end={1}, id={2}, pos={3}'.format(token.begin, token.end, token.id, token.pos))\n\nGetting and setting (nested) features\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nIf you want to access a feature but only have its name as a string, or if you have nested feature structures,\ne.g. a feature structure with feature :code:`a` that has a\nfeature :code:`b` that has a feature :code:`c`, some of which can be :code:`None`, then you can use the\nfollowing:\n\n.. code:: python\n\n    fs.get(\"var_name\") # Or\n    fs[\"var_name\"]\n\nOr in the nested case,\n\n.. code:: python\n\n    fs.get(\"a.b.c\")\n    fs[\"a.b.c\"]\n\n\nIf :code:`a`, :code:`b` or :code:`c` is :code:`None`, then this returns :code:`None` instead of\nthrowing an error.\n\nAnother example would be a StringList containing :code:`[\"foo\", \"bar\", \"baz\"]`:\n\n.. code:: python\n\n    assert lst.get(\"head\") == \"foo\"\n    assert lst.get(\"tail.head\") == \"bar\"\n    assert lst.get(\"tail.tail.head\") == \"baz\"\n    assert lst.get(\"tail.tail.tail.head\") is None\n    assert lst.get(\"tail.tail.tail.tail.head\") is None\n\nThe same goes for setting:\n\n.. 
code:: python\n\n    # Functional\n    lst.set(\"head\", \"new_foo\")\n    lst.set(\"tail.head\", \"new_bar\")\n    lst.set(\"tail.tail.head\", \"new_baz\")\n\n    assert lst.get(\"head\") == \"new_foo\"\n    assert lst.get(\"tail.head\") == \"new_bar\"\n    assert lst.get(\"tail.tail.head\") == \"new_baz\"\n\n    # Bracket access\n    lst[\"head\"] = \"newer_foo\"\n    lst[\"tail.head\"] = \"newer_bar\"\n    lst[\"tail.tail.head\"] = \"newer_baz\"\n\n    assert lst[\"head\"] == \"newer_foo\"\n    assert lst[\"tail.head\"] == \"newer_bar\"\n    assert lst[\"tail.tail.head\"] == \"newer_baz\"\n\n\nCreating types and adding features\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n    from cassis import *\n\n    typesystem = TypeSystem()\n\n    parent_type = typesystem.create_type(name='example.ParentType')\n    typesystem.create_feature(domainType=parent_type, name='parentFeature', rangeType=TYPE_NAME_STRING)\n\n    child_type = typesystem.create_type(name='example.ChildType', supertypeName=parent_type.name)\n    typesystem.create_feature(domainType=child_type, name='childFeature', rangeType=TYPE_NAME_INTEGER)\n\n    # childFeature has an integer range, so it is given an integer value\n    annotation = child_type(parentFeature='parent', childFeature=1)\n\nWhen adding new features, these changes are propagated. For example,\nadding a feature to a parent type makes it available to a child type.\nTherefore, the type system does not need to be frozen for consistency.\nIt can be changed even after loading; unlike in UIMAj, it is not\nfrozen.\n\nSofa support\n~~~~~~~~~~~~\n\nA Sofa represents some form of an unstructured artifact that is processed in a UIMA pipeline. It contains, for instance,\nthe document text. Currently, new Sofas can be created. This is automatically done when creating a new view. Basic\nproperties of the Sofa can be read and written:\n\n.. code:: python\n\n    cas = Cas(\n        sofa_string = \"Joe waited for the train . 
The train was late .\",\n        document_language = \"en\")\n\n    print(cas.sofa_string)\n    print(cas.sofa_mime)\n    print(cas.document_language)\n\nArray support\n~~~~~~~~~~~~~\n\nArray feature values are not plain Python lists; they are wrapped in a feature structure of\na UIMA array type such as :code:`uima.cas.FSArray`.\n\n.. code:: python\n\n    # Setting up an annotation type with an array feature containing\n    # references to other annotations\n    typesystem = TypeSystem()\n    ArrayHolder = typesystem.create_type(name='example.ArrayHolder')\n    typesystem.create_feature(domainType=ArrayHolder, name='values', rangeType=TYPE_NAME_FS_ARRAY)\n\n    cas = Cas(typesystem=typesystem)\n\n    # Populating the document with an annotation that contains references to another annotation in its array feature\n    Annotation = cas.typesystem.get_type(TYPE_NAME_ANNOTATION)\n    FSArray = cas.typesystem.get_type(TYPE_NAME_FS_ARRAY)\n    ann = Annotation(begin=0, end=1)\n    cas.add(ann)\n    holder = ArrayHolder(values=FSArray(elements=[ann, ann, ann]))\n    cas.add(holder)\n\n    # Reading the elements from the array feature\n    for e in holder.values.elements:\n        print(e)\n\nManaging views\n~~~~~~~~~~~~~~\n\nA view into a CAS contains a subset of feature structures and annotations. One view corresponds to exactly one Sofa. It\ncan also be used to query and alter information about the Sofa, e.g. the document text. Annotations added to one view\nare not visible in another view. Views can be created and changed. A view has the same methods and attributes\nas a :code:`Cas`.\n\n.. 
code:: python\n\n    from cassis import *\n\n    with open('typesystem.xml', 'rb') as f:\n        typesystem = load_typesystem(f)\n    Token = typesystem.get_type('cassis.Token')\n\n    # This automatically creates the view `_InitialView`\n    cas = Cas()\n    cas.sofa_string = \"I like cheese .\"\n\n    cas.add_all([\n        Token(begin=0, end=1),\n        Token(begin=2, end=6),\n        Token(begin=7, end=13),\n        Token(begin=14, end=15)\n    ])\n\n    print([x.get_covered_text() for x in cas.select_all()])\n\n    # Create a new view and work on it.\n    view = cas.create_view('testView')\n    view.sofa_string = \"I like blackcurrant .\"\n\n    view.add_all([\n        Token(begin=0, end=1),\n        Token(begin=2, end=6),\n        Token(begin=7, end=19),\n        Token(begin=20, end=21)\n    ])\n\n    print([x.get_covered_text() for x in view.select_all()])\n\nMerging type systems\n~~~~~~~~~~~~~~~~~~~~\n\nSometimes, it is desirable to merge two type systems. With **cassis**, this can be\nachieved via the :code:`merge_typesystems` function. The detailed rules of merging can be found\n`here \u003chttps://uima.apache.org/d/uimaj-current/ref.html#ugr.ref.cas.typemerging\u003e`_.\n\n.. code:: python\n\n    from cassis import *\n\n    with open('typesystem.xml', 'rb') as f:\n        typesystem = load_typesystem(f)\n\n    ts = merge_typesystems([typesystem, load_dkpro_core_typesystem()])\n\nType checking\n~~~~~~~~~~~~~\n\nWhen adding annotations, no type checking is performed, for the sake of simplicity.\nIn order to check types, call the :code:`cas.typecheck()` method. Currently, it only\nchecks whether elements in :code:`uima.cas.FSArray`\nadhere to the specified :code:`elementType`.\n\nDKPro Core Integration\n----------------------\n\nA CAS using the DKPro Core Type System can be created via\n\n.. 
code:: python\n\n    from cassis import *\n\n    cas = Cas(typesystem=load_dkpro_core_typesystem())\n\n    for t in cas.typesystem.get_types():\n        print(t)\n\nMiscellaneous\n-------------\n\nIf feature names clash with Python magic variables\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nIf your type system defines a feature called :code:`self` or :code:`type`, then it will be made\navailable as a member variable :code:`self_` or :code:`type_` on the respective type:\n\n.. code:: python\n\n    from cassis import *\n    from cassis.typesystem import *\n\n    typesystem = TypeSystem()\n\n    ExampleType = typesystem.create_type(name='example.Type')\n    typesystem.create_feature(domainType=ExampleType, name='self', rangeType=TYPE_NAME_STRING)\n    typesystem.create_feature(domainType=ExampleType, name='type', rangeType=TYPE_NAME_STRING)\n\n    annotation = ExampleType(self_=\"Test string1\", type_=\"Test string2\")\n\n    print(annotation.self_)\n    print(annotation.type_)\n\nLeniency\n~~~~~~~~\n\nIf the type for a feature structure is not found in the typesystem, an exception is raised by default.\nIf you want to ignore this kind of error, you can pass :code:`lenient=True` to the :code:`Cas` constructor or\nto :code:`load_cas_from_xmi`.\n\nLarge XMI files\n~~~~~~~~~~~~~~~\n\nIf you try to parse large XMI files and get an error message like :code:`XMLSyntaxError: internal error: Huge input lookup`,\nthen you can disable this security check by passing :code:`trusted=True` to your calls to :code:`load_cas_from_xmi`.\n\nCiting \u0026 Authors\n----------------\n\nIf you find this repository helpful, feel free to cite\n\n.. 
code:: bibtex\n\n    @software{klie2020_cassis,\n      author       = {Jan-Christoph Klie and\n                      Richard Eckart de Castilho},\n      title        = {DKPro Cassis - Reading and Writing UIMA CAS Files in Python},\n      publisher    = {Zenodo},\n      doi          = {10.5281/zenodo.3994108},\n      url          = {https://github.com/dkpro/dkpro-cassis}\n    }\n\nDevelopment\n-----------\n\nThe required dependencies are managed by **pip**. A virtual environment\ncontaining all needed packages for development and production can be\ncreated and activated by\n\n::\n\n    virtualenv venv --python=python3 --no-site-packages\n    source venv/bin/activate\n    pip install -e \".[test, dev, doc]\"\n\nThe tests can be run in the current environment by invoking\n\n::\n\n    make test\n\nor in a clean environment via\n\n::\n\n    tox\n\nRelease\n-------\n\n- Make sure all issues for the milestone are completed; otherwise, move them to the next one\n- Check out the ``main`` branch\n- Bump the version in ``pyproject.toml`` to a stable one, e.g. ``__version__ = \"0.6.0\"``, commit and push, then wait until the build has completed. An example commit message would be ``No issue. Release 0.6.0``\n- Create a tag for that version via e.g. ``git tag v0.6.0`` and push the tags via ``git push --tags``. Pushing a tag triggers the release to PyPI\n- Bump the version in ``pyproject.toml`` to the next development version, e.g. ``0.7.0-dev``, commit and push that. An example commit message would be ``No issue. 
Bump version after release``\n- Once the build has completed and PyPI has accepted the new version, go to the GitHub release and write the changelog based on the issues in the respective milestone\n- Create a new milestone for the next version\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdkpro%2Fdkpro-cassis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdkpro%2Fdkpro-cassis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdkpro%2Fdkpro-cassis/lists"}