{"id":14987905,"url":"https://github.com/apache/uima-uimaj","last_synced_at":"2025-04-08T08:15:39.025Z","repository":{"id":12191858,"uuid":"14795578","full_name":"apache/uima-uimaj","owner":"apache","description":"Apache UIMA Java SDK","archived":false,"fork":false,"pushed_at":"2025-02-17T16:31:31.000Z","size":68427,"stargazers_count":64,"open_issues_count":32,"forks_count":37,"subscribers_count":16,"default_branch":"main","last_synced_at":"2025-04-01T05:37:34.875Z","etag":null,"topics":["apache","java","text-analysis","uima"],"latest_commit_sha":null,"homepage":"https://uima.apache.org","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"vemmaverve/gratipay-or-bountysource.guide","license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apache.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2013-11-29T08:00:16.000Z","updated_at":"2025-01-23T08:02:10.000Z","dependencies_parsed_at":"2023-02-17T16:45:49.171Z","dependency_job_id":"8b14639d-3ad1-4834-b017-9839bada8763","html_url":"https://github.com/apache/uima-uimaj","commit_stats":{"total_commits":7412,"total_committers":25,"mean_commits":296.48,"dds":0.3869400971397733,"last_synced_commit":"c6b747aa8c682c026f87443fb296655120e14c73"},"previous_names":[],"tags_count":75,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fuima-uimaj","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fuima-uimaj/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fuima-uimaj/releases","
manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fuima-uimaj/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/apache","download_url":"https://codeload.github.com/apache/uima-uimaj/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247714981,"owners_count":20983968,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache","java","text-analysis","uima"],"created_at":"2024-09-24T14:15:40.894Z","updated_at":"2025-04-08T08:15:38.995Z","avatar_url":"https://github.com/apache.png","language":"Java","funding_links":[],"categories":["人工智能"],"sub_categories":[],"readme":"Welcome to the Apache UIMA Java SDK\n-----------------------------------\n\n[Apache UIMA][UIMA] helps you manage unstructured data (such as texts) that is enriched with useful\ninformation. 
For example, if you want to identify a mention of an entity in a text or possibly\nlink that entity to a reference dataset, then Apache UIMA provides:\n\n* a convenient data structure --the Common Analysis Structure (CAS)-- to represent that data\n* a type system concept serving as a schema for the enriched data that is stored in the CAS\n* a component model consisting of readers, analysis engines (processors) and consumers (writers) to\n  process that data\n* a model for aggregating multiple analysis engines into pipelines and executing them (optionally \n  parallelized)\n* various options for (de)serializing the CAS from/to different formats\n* and many additional features!\n\nNote that the Apache UIMA Java SDK only provides a framework for building analytics; it does not \nprovide any analytics itself. However, there are various [third-parties](#uima-component-providers) that\nbuild on Apache UIMA and that provide collections of analysis components or ready-made solutions.\n\n#### System requirements\n\nApache UIMA v3.6.0 and later requires Java version 17 or later.\n\nRunning the Eclipse plugin tooling for UIMA requires that you start Eclipse 4.25 (2022-09) or later using Java 17 or later.\n\nRunning the migration tool on class files requires running with a Java JDK, not a Java JRE.\n\nThe supported platforms are: Windows, Linux, and macOS. Other Java platform implementations should\nwork but have not been significantly tested.\n\nMany of the scripts in the `/bin` directory invoke Java. They use the value of the environment \nvariable, `JAVA_HOME`, to locate the Java to use; if it is not set, they invoke `java` expecting to find\nan appropriate Java in your `PATH` variable. \n\n\n#### Using Apache UIMA Java SDK\n\nYou can add the Apache UIMA Java SDK to your project easily in most build tools by importing it from \n[Maven Central][MAVEN-CENTRAL]. 
For example, if you use Maven, you can add the following dependency\nto your project:\n\n```xml\n\u003cdependency\u003e\n  \u003cgroupId\u003eorg.apache.uima\u003c/groupId\u003e\n  \u003cartifactId\u003euimaj-core\u003c/artifactId\u003e\n  \u003cversion\u003e3.6.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nNext, we give a few brief examples of how to use the Apache UIMA Java SDK and the Apache uimaFIT library.\nApache uimaFIT is a separate dependency that you can add:\n\n```xml\n\u003cdependency\u003e\n  \u003cgroupId\u003eorg.apache.uima\u003c/groupId\u003e\n  \u003cartifactId\u003euimafit-core\u003c/artifactId\u003e\n  \u003cversion\u003e3.6.0\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n##### Creating a type system\n\nThe type system defines the type of information that we want to attach to the unstructured information (here a text document). In our example, we want to identify mentions of entities, so we define a type `my.Entity` with a feature `category` which can be used to store the category the entity belongs to.\n\nTo illustrate the information UIMA internally maintains about the annotation schema, we write the generated schema as XML to the screen.\n\n```java\nString TYPE_NAME_ENTITY = \"my.Entity\";\nString TYPE_NAME_TOKEN = \"my.Token\";\nString FEAT_NAME_CATEGORY = \"category\";\n\nvar tsd = UIMAFramework.getResourceSpecifierFactory().createTypeSystemDescription();\ntsd.addType(TYPE_NAME_TOKEN, \"\", CAS.TYPE_NAME_ANNOTATION);\nvar entityTypeDesc = tsd.addType(TYPE_NAME_ENTITY, \"\", CAS.TYPE_NAME_ANNOTATION);\nentityTypeDesc.addFeature(FEAT_NAME_CATEGORY, \"\", CAS.TYPE_NAME_STRING);\n\ntsd.toXML(System.out);\n```\n##### Creating a Common Analysis Structure object\n\nNow we create a Common Analysis Structure (CAS) object into which we store the text that we want to analyse.\n\nAgain, to illustrate the information that UIMA internally stores in the CAS object, we write an XML representation of the object to the screen.\n\n```java\nvar cas = 
CasFactory.createCas(tsd);\ncas.setDocumentText(\"Welcome to Apache UIMA.\");\ncas.setDocumentLanguage(\"en\");\n\nCasIOUtils.save(cas, System.out, SerialFormat.XMI_PRETTY);\n```\n\n##### Adding and retrieving annotations\n\nNow, we create an annotation of the type `my.Entity` to identify the mention of `Apache UIMA` in the example text.\n\nFinally, we iterate over all annotations in the CAS and print them to screen. This includes the default `DocumentAnnotation` that is always created by UIMA as\nwell as the `my.Entity` annotation that we created ourselves.\n\n```java\nvar entityType = cas.getTypeSystem().getType(TYPE_NAME_ENTITY);\nvar entity = cas.createAnnotation(entityType, 11, 22);\ncas.addFsToIndexes(entity);\n\nfor (var anno : cas.\u003cAnnotation\u003eselect(entityType)) {\n   System.out.printf(\"%s: [%s]%n\", anno.getType().getName(), anno.getCoveredText());\n}\n```\n\n##### Working with analysis components\n\nIn order to organize different types of analysis into steps, we usually package them into individual analysis engines. 
We now illustrate how such components can be built and how they can be executed as an analysis pipeline.\n\n```java\nclass TokenAnnotator extends CasAnnotator_ImplBase {\n  public void process(CAS cas) throws AnalysisEngineProcessException {\n    var tokenType = cas.getTypeSystem().getType(TYPE_NAME_TOKEN);\n    var bi = BreakIterator.getWordInstance();\n    bi.setText(cas.getDocumentText());\n    int begin = bi.first();\n    int end;\n    for (end = bi.next(); end != BreakIterator.DONE; end = bi.next()) {\n      var token = cas.createAnnotation(tokenType, begin, end);\n      cas.addFsToIndexes(token);\n      begin = end;\n    }\n  }\n}\n\nclass EntityAnnotator extends CasAnnotator_ImplBase {\n  public void process(CAS cas) throws AnalysisEngineProcessException {\n    var tokenType = cas.getTypeSystem().getType(TYPE_NAME_TOKEN);\n    var entityType = cas.getTypeSystem().getType(TYPE_NAME_ENTITY);\n    for (var token : cas.\u003cAnnotation\u003eselect(tokenType)) {\n      if (Character.isUpperCase(token.getCoveredText().charAt(0))) {\n        var entity = cas.createAnnotation(entityType, token.getBegin(), token.getEnd());\n        cas.addFsToIndexes(entity);\n      }\n    }\n  }\n}\n\ncas = CasFactory.createCas(tsd);\ncas.setDocumentText(\"John likes Apache UIMA.\");\ncas.setDocumentLanguage(\"en\");\n\nvar pipeline = AnalysisEngineFactory.createEngineDescription(\n  AnalysisEngineFactory.createEngineDescription(TokenAnnotator.class),\n  AnalysisEngineFactory.createEngineDescription(EntityAnnotator.class));\n\nSimplePipeline.runPipeline(cas, pipeline);\n\nfor (var anno : cas.\u003cAnnotation\u003eselect(entityType)) {\n   System.out.printf(\"%s: [%s]%n\", anno.getType().getName(), anno.getCoveredText());\n}\n```\n\n\n#### Using uimaFIT\n\nConfiguring UIMA components is generally achieved by creating XML descriptor\nfiles which tell the framework at runtime how components should be\ninstantiated and deployed. 
These XML descriptor files are very tightly\ncoupled with the Java implementation of the components they describe.\nWe have found that it is very difficult to keep the two consistent\nwith each other especially when code refactoring is very frequent.\nuimaFIT provides Java annotations for describing UIMA components which\ncan be used to directly describe the UIMA components in the code. This\ngreatly simplifies refactoring a component definition (e.g. changing a\nconfiguration parameter name). It also makes it possible to generate\nXML descriptor files as part of the build cycle rather than being\nperformed manually in parallel with code creation. uimaFIT also makes\nit easy to instantiate UIMA components without using XML descriptor\nfiles at all by providing a number of convenience factory methods\nwhich allow programmatic/dynamic instantiation of UIMA components.\nThis makes uimaFIT an ideal library for testing UIMA components\nbecause the component can be easily instantiated and invoked without\nrequiring a descriptor file to be created first. uimaFIT is also\nhelpful in research environments in which programmatic/dynamic\ninstantiation of a pipeline can simplify experimentation. For example,\nwhen performing 10-fold cross-validation across a number of\nexperimental conditions it can be quite laborious to create a\ndifferent set of descriptor files for each run or even a script that\ngenerates such descriptor files. uimaFIT is type system agnostic and\ndoes not depend on (or provide) a specific type system.\n\nuimaFIT is a library that provides factories, injection, and testing \nutilities for UIMA. The following list highlights some of the features \nuimaFIT provides:\n\n* **Factories:** simplify instantiating UIMA components programmatically \n  without descriptor files. 
For example, to instantiate an AnalysisEngine a\n  call like this could be made:\n\n      AnalysisEngineFactory.createEngine(MyAEImpl.class, myTypeSystem,\n        paramName1, paramValue1, \n        paramName2, paramValue2, \n        ...)\n\n* **Injection:** handles the binding of configuration parameter values to the \n  corresponding member variables in the analysis engines and handles the binding of \n  external resources. For example, to bind a configuration parameter just annotate \n  a member variable with `@ConfigurationParameter`. External resources can likewise \n  be injected via the `@ExternalResource` annotation.\n  Then add one line of code to your initialize method:\n\n      ConfigurationParameterInitializer.initialize(this, uimaContext).\n\n   This is handled automatically if you extend the uimaFIT `JCasAnnotator_ImplBase` class. \n\n* **Testing:** uimaFIT simplifies testing in a number of ways described in the \n   documentation. By making it easy to instantiate your components without \n   descriptor files a large amount of difficult-to-maintain and unnecessary XML can \n   be eliminated from your test code. This makes tests easier to write and \n   maintain. Also, running components as a pipeline can be accomplished with a\n   method call like this:\n\n      SimplePipeline.runPipeline(reader, ae1, ..., aeN, consumer1, ... consumerN)\n\nuimaFIT is a part of the Apache UIMA(TM) project. uimaFIT can only be used in \nconjunction with a compatible version of the Java version of the Apache UIMA SDK. \nFor your convenience, the binary distribution package of uimaFIT includes all \nlibraries necessary to use uimaFIT. In particular for novice users, it is strongly \nadvised to obtain a copy of the full UIMA SDK separately.\n\nuimaFIT is available via Maven Central. 
If you use Maven for your build \nenvironment, then you can add uimaFIT as a dependency to your pom.xml file with the \nfollowing:\n\n    \u003cdependencies\u003e\n      \u003cdependency\u003e\n        \u003cgroupId\u003eorg.apache.uima\u003c/groupId\u003e\n        \u003cartifactId\u003euimafit-core\u003c/artifactId\u003e\n        \u003cversion\u003e3.6.0\u003c/version\u003e\n      \u003c/dependency\u003e\n    \u003c/dependencies\u003e\n    \n\n**Modules**\n- **uimafit-core** - the main uimaFIT module\n- **uimafit-cpe** - support for the Collection Processing Engine \n  (multi-threaded pipelines)\n- **uimafit-maven** - a Maven plugin to automatically enhance UIMA components with \n  uimaFIT metadata and to generate XML descriptors for uimaFIT-enabled components.\n- **uimafit-junit** - convenience code facilitating the implementation of UIMA/\n  uimaFIT tests in JUnit tests\n- **uimafit-assertj** - adds assertions for UIMA/uimaFIT types via the AssertJ \n  framework\n- **uimafit-spring** - an experimental module serving as a proof-of-concept for the \n  integration of UIMA with the Spring Framework. It is currently not considered \n  finished and uses invasive reflection in order to patch the UIMA framework such \n  that it passes all components created by UIMA through Spring to provide for the\n  wiring of Spring context dependencies. This module is made available for\n  the adventurous but currently not considered stable, finished, or even a\n  proper part of the package. E.g. 
it is not included in the binary\n  distribution package.\n\n\n#### Building\n\nTo build Apache UIMA, you need at least a Java 17 JDK and a recent Maven 3 version.\n\nAfter extracting the source distribution ZIP or cloning the repository, change into the created\ndirectory and run the following command:\n\n```\nmvn clean install\n```\n\nFor more details, please see http://uima.apache.org/building-uima.html\n\n\n#### Running examples from the source/binary distribution\n\nYou can download the source and binary distributions from the\n[Apache UIMA website](https://uima.apache.org/downloads.cgi).\n\n##### Environment Variables\n\nAfter you have unpacked the Apache UIMA distribution from the package of your choice (e.g. `.zip` or \n`.gz`), perform the steps below to set up UIMA so that it will function properly.\n\n* Set `JAVA_HOME` to the directory of your JRE installation you would like to use for UIMA.  \n* Set `UIMA_HOME` to the `apache-uima` directory of your unpacked Apache UIMA distribution\n* Append `UIMA_HOME/bin` to your `PATH`\n* Please run the script `UIMA_HOME/bin/adjustExamplePaths.bat` (or `.sh`), to update \n  paths in the examples based on the actual `UIMA_HOME` directory path. \n  This script runs a Java program; you must either have `java` in your `PATH` or set the environment \n  variable `JAVA_HOME` to a suitable JRE.\n\n    Note: The Mac OS X operating system procedures for setting up global environment\n    variables are described here: see http://developer.apple.com/qa/qa2001/qa1067.html.\n      \n##### Verifying Your Installation\n\nTo test the installation, run the `documentAnalyzer.bat` (or `.sh`) file located in the `bin` subdirectory. \nThis should pop up a *Document Analyzer* window. 
Set the values displayed in this GUI as follows:\n\n* Input Directory: `UIMA_HOME/examples/data`\n* Output Directory: `UIMA_HOME/examples/data/processed`\n* Location of Analysis Engine XML Descriptor: `UIMA_HOME/examples/descriptors/analysis_engine/PersonTitleAnnotator.xml`\n\nReplace `UIMA_HOME` above with the path of your Apache UIMA installation.\n\nNext, click the *Run* button, which should, after a brief pause, pop up an *Analyzed Results* window. \nDouble-click on one of the documents to display the analysis results for that document.\n\n\n#### UIMA component providers\n\nHere is a list of several well-known projects that provide their analysis tools as UIMA components\nor that wrap third-party analysis tools as UIMA components:\n\n* [Apache cTAKES](https://ctakes.apache.org) - Natural language processing system for extraction of information from electronic medical record clinical free-text.\n* [Apache OpenNLP](https://opennlp.apache.org/docs/) - Wraps OpenNLP for UIMA. Adaptable to different type systems.\n* [Apache Ruta](https://uima.apache.org/ruta.html) - Generic rule-based text analytics. Works with any type system.\n* [ClearTK](https://cleartk.github.io/cleartk/) - Wraps several third-party tools (OpenNLP, CoreNLP, etc.) and offers a flexible framework for training your own machine learning models. Uses the ClearTK type system.\n* [DKPro Core](https://dkpro.github.io/dkpro-core/) - Wraps many third-party tools (OpenNLP, CoreNLP, etc.) and supports a wide range of data formats. Uses the DKPro Core type system.\n* [JULIE Lab Component Repository (JCoRe)](https://github.com/JULIELab/jcore-base) - Wraps several third-party tools (OpenNLP, CoreNLP, etc.) and supports a wide range of data formats, in particular from the biomedical domain. Uses the JCoRe type system.\n\nThis is not an exhaustive list. If you feel any particular project should be listed here, please let us know.\nYou can find additional ones e.g. 
by:\n\n* following the [GitHub dependency graph](https://github.com/apache/uima-uimaj/network/dependents?package_id=UGFja2FnZS0xNzk4MzkxNTI%3D)\n* searching [Google Scholar for UIMA](https://scholar.google.com/scholar?hl=en\u0026q=uima)\n\n#### Interoperability\n\nThe Apache UIMA Java SDK can be used with any programming language based on the Java Virtual Machine\nincluding Java, Groovy, Scala, and many other languages.\n\nInteroperability with Python can for example be achieved via the third-party \n[DKPro Cassis][DKPRO-CASSIS] library which can be used to read, manipulate and write CAS data in the\nXMI format.\n\n#### Further reading\n\nThe Apache UIMA Java SDK is a Java-based implementation of the [UIMA specification][OASIS-UIMA].\n\n#### Support\n\nPlease direct questions to user@uima.apache.org.\n\n#### Reference\n\nIf you use uimaFIT to support academic research, then please consider citing the \nfollowing paper as appropriate:\n\n    @InProceedings{ogren-bethard:2009:SETQA-NLP,\n      author    = {Ogren, Philip  and  Bethard, Steven},\n      title     = {Building Test Suites for {UIMA} Components},\n      booktitle = {Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing (SETQA-NLP 2009)},\n      month     = {June},\n      year      = {2009},\n      address   = {Boulder, Colorado},\n      publisher = {Association for Computational Linguistics},\n      pages     = {1--4},\n      url       = {http://www.aclweb.org/anthology/W/W09/W09-1501}\n    }\n\n#### History\n\n* **Early 2000s:** UIMA was originally developed by IBM as part of research into analyzing unstructured information (like text, audio, and video). 
It was designed to process large volumes of unstructured data in a scalable way, targeting natural language processing (NLP) applications.\n\n* **2004:** UIMA was open-sourced, allowing for broader use and contributions from outside IBM.\n\n* **2006:** The UIMA project was accepted into the Apache Incubator, starting the formal process of becoming an Apache project.\n\n* **2008:** UIMA graduated from the Apache Incubator and became a top-level Apache project, signifying its maturity and active development.\n\n* **2009:** Apache UIMA-AS (Asynchronous Scaleout) was introduced, enabling distributed and asynchronous processing of UIMA pipelines.\n\n* **2012:** uimaFIT was contributed to the Apache UIMA project. Apache uimaFIT was formerly known as uimaFIT, which in turn was formerly known as UUTUC. Prior to its contribution, it was a collaborative\neffort between the Center for Computational Pharmacology at the University of Colorado Denver, the\nCenter for Computational Language and Education Research at the University of Colorado at Boulder,\nand the Ubiquitous Knowledge Processing (UKP) Lab at the Technische Universität Darmstadt.\n\n* **2013:** UIMA DUCC (Distributed UIMA Cluster Computing) was introduced as a sub-project of Apache UIMA.\n\n* **2016:** Apache UIMA Ruta (Rule-based Text Annotation) was introduced as an extension, providing a scripting language for rule-based text processing.\n\n* **2023:** UIMA DUCC and UIMA-AS were retired.\n\n* **2024:** uimaFIT was merged into the UIMA Java SDK.\n\n[UIMA]: https://uima.apache.org\n[OASIS-UIMA]: https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=uima\n[MAVEN-CENTRAL]: https://search.maven.org/search?q=org.apache.uima\n[DKPRO-CASSIS]: 
https://github.com/dkpro/dkpro-cassis\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapache%2Fuima-uimaj","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapache%2Fuima-uimaj","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapache%2Fuima-uimaj/lists"}