{"id":44437157,"url":"https://github.com/levelfourab/lect","last_synced_at":"2026-02-12T14:01:15.438Z","repository":{"id":56606795,"uuid":"96197872","full_name":"LevelFourAB/lect","owner":"LevelFourAB","description":"Pipeline for natural language analysis","archived":false,"fork":false,"pushed_at":"2020-10-29T05:47:28.000Z","size":185,"stargazers_count":0,"open_issues_count":10,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-07-16T06:02:40.459Z","etag":null,"topics":["java","natural-language-analysis","natural-language-processing"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LevelFourAB.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-07-04T09:01:09.000Z","updated_at":"2018-07-23T18:45:49.000Z","dependencies_parsed_at":"2022-08-15T21:50:25.294Z","dependency_job_id":null,"html_url":"https://github.com/LevelFourAB/lect","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/LevelFourAB/lect","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LevelFourAB%2Flect","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LevelFourAB%2Flect/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LevelFourAB%2Flect/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LevelFourAB%2Flect/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LevelFourAB","download_url":"https://codeload.github.com/LevelFourAB/lect/tar.gz/refs/heads/master","sbom_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LevelFourAB%2Flect/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29367814,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-12T08:51:36.827Z","status":"ssl_error","status_checked_at":"2026-02-12T08:51:26.849Z","response_time":55,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["java","natural-language-analysis","natural-language-processing"],"created_at":"2026-02-12T14:01:13.994Z","updated_at":"2026-02-12T14:01:15.431Z","avatar_url":"https://github.com/LevelFourAB.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Lect\n\nLect is a pipeline for natural language analysis that can be created from and\nexecuted on different formats such as plain text, HTML and Markdown. 
Lect\nparses the original format into paragraphs, sentences and words while keeping\ntrack of the location in the source.\n\nLect can be used to build things such as spell and grammar checking,\nentity tagging, keyword extraction, summarization algorithms and many other\napplications that require robust text handling.\n\n```java\nSource source = PlainTextSource.forString(\"Simple plain text\");\n\nAtomicInteger wordCount = Pipeline.over(source)\n  .language(ICULanguage.forLocale(Locale.ENGLISH))\n  .collector(new AtomicInteger())\n  .with(encounter -\u003e new DefaultHandler() {\n    private int count = 0;\n    \n    public void word(Token token) {\n      count++;\n    }\n    \n    public void done() {\n      encounter.collector().set(count);\n    }\n  })\n  .run();\n\nSystem.out.println(wordCount + \" words\");\n```\n\n## Paragraphs, sentences and tokens\n\nThree things are currently tracked in a source: paragraphs, sentences and\ntokens. Paragraphs in Lect are used to group text content that is logically\nconnected instead of visually connected. For a format such as HTML or Markdown this\nmeans that explicit paragraphs, headings and list items are all turned into\nparagraphs. Handlers receive paragraph boundaries via the `startParagraph`\nand `endParagraph` methods.\n\nWhen a paragraph has been found, the text in the paragraph is run through a\n`LanguageParser` to turn it into sentences and tokens. Sentence boundaries are\npassed to handlers via `startSentence` and `endSentence`.\n\nTokens are the individual parts that make up the actual content. 
Most of the\ntokens are emitted for sentences, but white-space tokens can be found between\nsentences and paragraphs.\n\nFour types of tokens exist: white-space, words, symbols and special.\n\n* White-space is anything that matches space in the source, within or outside\nsentences.\n* Words are anything that could be a word in the language specified.\n* Symbols are individual symbols, such as punctuation.\n* Special tokens are things such as URLs, e-mails and phone numbers.\n\n## Languages\n\nLanguages are supported via the interface `LanguageParser` which is responsible\nfor turning text into sentences and tokens (words, symbols and whitespace).\nA parser implemented using ICU4J is available that uses `BreakIterator` to split\nthings into tokens. This parser is suitable for some uses, such as spell\nchecking, but is not recommended for more advanced NLP tasks.\n\n```java\nLanguageFactory lang = ICULanguage.forLocale(Locale.ENGLISH);\n```\n\n`TokenizingLanguage` is available for use with two types of tokenizers, one that\nsplits a paragraph into sentences and one that splits a sentence into tokens:\n\n```java\nLanguageFactory lang = TokenizingLanguage.create(Locale.ENGLISH,\n  SentenceTestTokenizer::new,\n  WhitespaceTokenizer::new\n);\n```\n\n## Tokenizers\n\nTokenizers are objects responsible for tokenizing input, such as strings,\ninto tokens. In Lect they are interesting mostly when implementing a\n`LanguageParser`. The `TokenizingLanguage` class makes implementing the parsing a\ntwo-step process: first implement a tokenizer that splits text into sentences\nand then a tokenizer that splits sentences into tokens.\n\nA good starting point for custom tokenizers is `OffsetTokenizer` which helps with\ncreating tokenizers that use `OffsetLocation` for location tracking.\n\n## Token matching\n\nLect includes utilities for matching patterns of tokens. `TokenPattern` can be\nused to compile and match a sequence of tokens. 
Matching is usually done in a\nstreaming fashion, so it can be used with handlers:\n\n```java\nTokenPattern pattern = TokenPattern.compile(\"symbol='$' word\");\nTokenMatcher matcher = pattern.matcher();\n\nif(matcher.add(token)) {\n  // The token matched\n}\n```\n\nMany variants of patterns are supported:\n\n```java\n// Match any token\nTokenPattern.compile(\"any\");\n// Match a word\nTokenPattern.compile(\"word\");\n// Match against token.getText()\nTokenPattern.compile(\"word='Test'\");\n// Shortcut to match the text of any type of token\nTokenPattern.compile(\"'Test'\");\n// Match against TokenProperty.NORMALIZED\nTokenPattern.compile(\"word,normalized='test'\");\n// Match word followed by symbol\nTokenPattern.compile(\"word symbol\");\n// Match against regular expression\nTokenPattern.compile(\"word=/test/i\");\n// Shortcut to match via regex for any type of token\nTokenPattern.compile(\"/test/i\");\n// Use parentheses to create an optional group of Mrs + period\nTokenPattern.compile(\"(word,normalized='mrs' symbol,text='.',continuation)? word\");\n// Use brackets to create an OR between tokens or groups\nTokenPattern.compile(\"[word,normalized='mrs' word,normalized='mr'] symbol,text='.',continuation?\");\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flevelfourab%2Flect","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flevelfourab%2Flect","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flevelfourab%2Flect/lists"}