{"id":19619795,"url":"https://github.com/ramtinms/tokenquery","last_synced_at":"2025-04-28T03:31:31.819Z","repository":{"id":57476127,"uuid":"72230999","full_name":"ramtinms/tokenquery","owner":"ramtinms","description":"TokenQuery (regular expressions over tokens)","archived":false,"fork":false,"pushed_at":"2017-03-01T18:03:31.000Z","size":201,"stargazers_count":28,"open_issues_count":7,"forks_count":13,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-11-06T18:00:21.828Z","etag":null,"topics":["machine-learning","natural-language-processing","nlp","regex","regular-expressions"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ramtinms.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-10-28T18:11:21.000Z","updated_at":"2022-08-29T05:17:36.000Z","dependencies_parsed_at":"2022-09-26T17:41:01.271Z","dependency_job_id":null,"html_url":"https://github.com/ramtinms/tokenquery","commit_stats":null,"previous_names":["ramtinms/tokenregex"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ramtinms%2Ftokenquery","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ramtinms%2Ftokenquery/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ramtinms%2Ftokenquery/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ramtinms%2Ftokenquery/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ramtinms","download_url":"https://codeload.github.com/ramtinms/tokenquery/tar.gz
/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224091901,"owners_count":17254152,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","natural-language-processing","nlp","regex","regular-expressions"],"created_at":"2024-11-11T11:15:01.692Z","updated_at":"2024-11-11T11:15:03.025Z","avatar_url":"https://github.com/ramtinms.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"left\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/ramtinms/tokenquery/master/resources/Token_query_logo.png\" width=\"350\"/\u003e\n\u003c/p\u003e\n\n**TokenQuery** is a query language over any labeled text (sequence of tokens); very similar to regular expressions but on top of tokens. TokenQuery can be viewed as an interface to query for specific patterns in a sequence of tokens using information provided by diverse NLP engines.\n\n\n## What is a `Token`?\nIn order to process text (natural language text), the common approach for natural language processing (NLP) is to break the text down into smaller processing units (tokens). Options include phonemes, morphemes, lexical information, phrases, sentences, or paragraphs. For example, this sentence :`President Obama delivered his Farewell Address in Chicago on January 10, 2017.` can be divided into tokens shown in blue highlights. 
\n\n\u003cp align=\"left\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/ramtinms/tokenquery/master/resources/TokenQuery_example_1.png\" /\u003e\n\u003c/p\u003e\nInside TokenQuery, each token contains its text (the textual content of the token), the start and end indices of its span in the original text, and a set of labels (i.e. key/value pairs) provided by NLP engines. In our example, the red labels (POS tags) come from the Stanford POS tagger, the orange labels come from the Google NLP API, and the purple ones come from an internal topic extractor. One of the challenges in natural language processing is that each processing unit provides isolated information about each token in its own format, so it is currently hard to write a query that considers labels coming from different processing units. \n\nTokenQuery enables us to \n- Combine labels from different NLP engines \n- Query and reason over tokenized text\n- Define extensions for desired query functions\n\nThe initial idea came from *Angel Chang* and *Christopher Manning*, who presented it in [this paper](http://nlp.stanford.edu/pubs/tokensregex-tr-2014.pdf). They implemented it (TOKENSREGEX) in Java inside the *Stanford CoreNLP* software package. Our version uses a different query language that is extensible, more structured, and supports more features. \n\n\n## TokenQuery language\nThe language is defined as follows. Each query consists of a sequence of token expressions, each enclosed in `[` `]`. If you want to use `]` inside a token expression, you can escape it with `\\`.\n\n\n```\n[expr_for_token1][expr_for_token2][expr_for_token3]\n```\nThe query above searches for a sequence of three tokens in which the first token satisfies the condition provided by `expr_for_token1`, the second token satisfies the condition provided by `expr_for_token2`, and so on. \n\n## Quantifiers\nAs with regular expressions, you can use quantifiers to write more compact queries. 
For example, the following query matches zero or more tokens satisfying the condition provided by `expr_for_token1`, followed by one token that satisfies the condition provided by `expr_for_token2`.\n```\n[expr_for_token1]*[expr_for_token2]\n```\n| type | occurrence | example |\n| ----  | ---- | ---- | \n| `?` | once or not at all | `[expr_for_token]?` |\n| `*` | zero or more times | `[expr_for_token]*` |\n| `+` | one or more times | `[expr_for_token]+` |\n| `{x}` | exactly x times | `[expr_for_token]{3}` |\n| `{x,y}` | between x and y times | `[expr_for_token]{3,5}` |\n\n## Capturing Groups\nLike regular expressions, you can define capturing groups with parentheses. \nFor example, `([expr_for_token1]+) [expr_for_token2] [expr_for_token3]` returns a group containing the sequence of tokens that satisfy the condition provided by `expr_for_token1`. Likewise, `([expr_for_token1]+) [expr_for_token2] ([expr_for_token3])` returns two groups (`chunk1` and `chunk2`), each holding the list of tokens matched inside the corresponding parentheses. You can also name a capturing group by using `(name \u003cdesired_pattern\u003e)`. For example, `(name [expr_for_token1])` captures the results under the name `name`.\nIf you don't define any capturing group, the whole match is captured as a single group; in other words, `[expr_for_token1]+ [expr_for_token2] [expr_for_token3]` is equivalent to `([expr_for_token1]+ [expr_for_token2] [expr_for_token3])`.\n\n## Token Expression\nExpressions (like `expr_for_token1` in the above examples) can be viewed as a list of acceptors for each token.\n\n### Basic expressions\n`[label:operation(operation_input)]` is the base unit of a token expression: running `operation` on the value of `label` for a token decides whether that token is accepted. `operation` is a function that accepts a token and an optional extra setting string (`operation_input`) and returns `True` or `False`. 
\nFor example, `[pos:str_eq(VBZ)]` matches any token that has a `pos` label whose string value is equal to `VBZ`. `str_eq` is a standard string operation that checks whether the string is equal to the extra setting string. \nSimilarly, `[pos:str_reg(V.*)]` matches any token that has a `pos` label whose value matches the regex `V.*` (i.e. any verb).\nNote: If you want to check whether a label exists and you don't care about its value, you can simply use `[pos:str_reg(.*)]`.\nIf no label is provided, the text of the token is used by default. For example, `[str_reg(.*)]` matches any token, and `[str_reg('painter')]` matches any token that has 'painter' as its text.\n\n### Core operations (acceptors)\nHere is the list of predefined operations. You can extend this framework with your own operations.  \n\n#### String \n   This package provides the string operations described below.\n   \n| operation | description | examples | \n| ----  | ---- | ---- |\n| `str_eq` | string is equal to the extra setting string | `[str_eq(Obama)]` , `[pos:str_eq(VBZ)]` |\n| `str_reg` | string matches the regex provided by the extra setting string | `[str_reg(an?)]`, `[pos:str_reg(V.*)]`  |\n| `str_len` | length of the string compared to the value of the extra setting string (`==`, `\u003e`, `\u003c`, `!=` ,`\u003e=`, `\u003c=`)  | `[str_len(=12)]`, `[ner:str_len(\u003e6)]`, `[str_len(!=2)]` |\n\n**Shortened versions**   \n  For convenience, an exact string match can be written by putting the text you want to match inside `\"`s.\nFor example, `[\"painter\"]` matches any token whose text is `painter` but not `Painter`.\nIf you want to find tokens that match a regex, you can put the regex inside `/`s. For example, `[/an?/]` matches tokens whose text is `a` or `an`. 
\n`[/Al.*/]` matches any token starting with `Al`.\n`[/km|kilometers?/]` matches `km`, `kilometer` and `kilometers`.\n\n#### Int\n This package provides operations that cast the value of a label to an integer and apply integer comparisons to it. \n   \n| operation | description | examples | \n| ----  | ---- | ---- |\n| `int_value` | casts the value of the label to an integer and compares it to the integer provided by the extra setting string  (`==`, `\u003e`, `\u003c`, `!=` ,`\u003e=`, `\u003c=`)  | `[int_value(==5)]` , `[month:int_value(\u003e1)]`, `[year:int_value(\u003e1990)]` |\n| `int_e` | casts the value of the label to an integer and checks whether it is equal to the integer provided by the extra setting string  | `[int_e(5)]` , `[month:int_e(1)]` |\n| `int_g` | `int_g(X)` is equivalent to `int_value(\u003eX)`  | `[month:int_g(1)]` |\n| `int_l` | `int_l(X)` is equivalent to `int_value(\u003cX)`  | `[month:int_l(6)]` |\n| `int_ne` | `int_ne(X)` is equivalent to `int_value(!=X)`  |  `[int_ne(13)]` |    \n| `int_le` | `int_le(X)` is equivalent to `int_value(\u003c=X)` | `[int_le(12)]` |    \n| `int_ge` | `int_ge(X)` is equivalent to `int_value(\u003e=X)` | `[int_ge(0)]` |\n\n#### Web\nThis package provides operations for capturing meaningful web patterns.\n\n| operation | description | examples | \n| ----  | ---- | ---- |\n| `web_is_url` | the string is a web URL  | `[text:web_is_url()]` , `[freebase_id:web_is_url()]`|\n| `web_is_email` | the string is an email address  | `[text:web_is_email()]` , `[contact:web_is_email()]`|\n| `web_is_emoji` | the string is an emoji or emojicon  | `[text:web_is_emoji()]` |\n| `web_is_hex_code` | the string is a hex code | `[color:web_is_hex_code()]` |\n| `web_is_hashtag` | the string is a hashtag  | `[tag:web_is_hashtag()]` |\n  \n#### Date\nThis package provides operations for working with date and time information in ISO format.\n\n| operation | description | examples | \n| ----  | ---- | ---- |\n| `date_is` | the date in iso 
format is the same as the extra setting string  | `[date:date_is(2008-09-15T15:53:00)]`|\n| `date_is_after` | the date in iso format is after the date in the extra setting string | `[date:date_is_after(2008-09-15)]` |\n| `date_is_before` | the date in iso format is before the date in the extra setting string | `[date:date_is_before(2008-09-15)]` |\n| `date_y_is` | the year of the date in iso format is equal to the year in the extra setting string | `[date:date_y_is(2008)]` |\n| `date_m_is` | the month of the date in iso format is equal to the month in the extra setting string | `[date:date_m_is(9)]`,  `[date:date_m_is(09)]`|\n| `date_d_is` | the day of the date in iso format is equal to the day in the extra setting string | `[date:date_d_is(15)]` |\n\n#### Vector \n\n| operation | description | examples | \n| ----  | ---- | ---- |\n| `vec_cos_sim` | cosine similarity between two vectors  | `[word2vec:vec_cos_sim([1, 0, -2, 1.5]\u003e0.5)]`|\n| `vec_cos_dist` | cosine distance between two vectors | `[word2vec:vec_cos_dist([1, 0, -2, 1.5]==0)]` |\n| `vec_man_dist` | manhattan distance between two vectors | `[word2vec:vec_man_dist([1, 0, -2, 1.5]\u003e=10)]` |\n\n\n## Compound expressions\nFor each token, it is possible to combine several basic expressions to support more complex patterns. Compounding is done using the `!` (not), `\u0026` (and) and `|` (or) symbols. For example, `[!pos:str_reg(V.*)]` matches any token that is not a verb. \n`[pos:str_reg(V.*)\u0026!str_eq(is)]` matches any verb except `is`. \n `!` has the highest priority; `\u0026` and `|` have the same priority and are right-associative. You can change the priority by using parentheses. 
\n```\n!X and Y        \u003c=\u003e   ( (!(X)) and Y )\n!(X and Y)      \u003c=\u003e   ( !(X and Y) )\n!(X and Y) or Z \u003c=\u003e   ( ( !(X and Y) ) or Z )\n(X and Y) or Z  \u003c=\u003e   ( ( X and Y) or Z )\nX and Y or Z    \u003c=\u003e   ( X and (Y or Z) )\n```\n\n# How to install \n```\npip install tokenquery\n```\nIt has been tested on Python 2.7+. \n\n## How to use\nYou can use your own tokenizer and create tokens, or use our NLTK wrapper to do the tokenization (see examples).\nWe highly recommend using a tokenizer that provides the start and end of each token in the original text as well as the normalized value. This is surprisingly helpful for visualization and debugging. For instance, the NLTK PTB tokenizer does not provide this information, so we wrote a script to estimate it from the tokenizer output.\nThis tool can be seen as an attempt to combine the different types of information provided by NLP technologies on top of a shared tokenization. Currently we have integrations with the NLTK tokenizer and POS tagger, and we are working on connecting it to spaCy and the Google NLP API.\n\n## NLP Examples\nWe believe a big portion of NLP information can be expressed in terms of labels on top of tokens. Here is a list of the ones we currently use and how we represent them. \n- Part Of Speech tags (e.g. `[pos:/V.*/]`)\n\n- Lemma  (e.g. `[lemma:'be']`)\n\n- Named-Entity tags (e.g. `[ner:\"PERSON\"]`)\n\n- Brown clusters\n\n    | label | We | need | a | lawyer | . |\n    |----|----|----|----|----|----|\n    | POS | `PRP` | `VBP` | `DT` | `NN` | `.` |\n    | bcluster | | | | `1000001101000` | |\n    \n    We can query the members of a cluster with a token query like this:\n  `[bcluster:/100000110[0-1]+/]`\n   which matches all of the following and more. 
(For more information, see Miller et al., NAACL 2004.)\n   \n    | word | code |\n    |--------|-----|\n    | lawyer | 1000001101000 |\n    | newspaperman | 100000110100100 |\n    | stewardess | 100000110100101 |\n    | toxicologist | 10000011010011 |\n    | slang | 1000001101010 |\n    | babysitter | 100000110101100 |\n    | conspirator | 1000001101011010 |\n    | womanizer | 1000001101011011 |\n    | mailman | 10000011010111 |\n    | salesman | 100000110110000 |\n    | bookkeeper | 1000001101100010 |\n    | troubleshooter | 10000011011000110 |\n    | bouncer | 10000011011000111 |\n    | technician | 1000001101100100 |\n    | janitor | 1000001101100101 |\n    | saleswoman | 1000001101100110 |\n\n\n- Word embeddings\n  For word embeddings you can use an exact match. You can also define your own comparison metrics, such as cosine similarity, as operations. More metrics can be implemented in the same way. \n  e.g. `[w2v:cos_sim(A0F892\u003c0.5)]`\n\n#### Chunks and Phrases\n  For chunks we recommend using IOB formatting.\n  \n- Noun phrases \n  We use the label `N-PH` for noun phrases, with `B-NP` as the value for the start of a noun phrase and `I-NP` for the continuation of a noun phrase. 
Alternatively, you can use `B-NP` directly as the label and use its value to store the id of that phrase in your knowledge base, if any.\n\n### Examples \n\n#### Detecting names of painters\n```\nfrom tokenquery.nlp.tokenizer import Tokenizer\nfrom tokenquery.nlp.pos_tagger import POSTagger\nfrom tokenquery.tokenquery import TokenQuery\n\n# Penn Tree Bank Tokenizer\ntokenizer = Tokenizer('PTBTokenizer')\n# NLTK POS tagger\npos_tagger = POSTagger()\n\n# Test sentence\ninput_text = 'David is a painter and I work as a writer.'\n# Tokenize the sentence\ninput_tokens = tokenizer.tokenize(input_text)\n# Add POS tags\ninput_tokens = pos_tagger.tag(input_tokens)\n\n# Token query to extract the names of painters\ntoken_query_1 = TokenQuery('([pos:\"NNP\"]) [pos:\"VBZ\"] [/an?/] [\"painter\"]')\ntoken_query_1.match_tokens(input_tokens)\n\n# Let's change the sentence\ninput_text = 'David is a famous painter and I work as a writer.'\ninput_tokens = tokenizer.tokenize(input_text)\ninput_tokens = pos_tagger.tag(input_tokens)\n\n# Because of `famous`, token_query_1 no longer matches\ntoken_query_1.match_tokens(input_tokens)\n\n# Allow optional adjectives\ntoken_query_2 = TokenQuery('([pos:\"NNP\"]) [pos:\"VBZ\"] [/an?/] [pos:\"JJ\"]* [\"painter\"]')\ntoken_query_2.match_tokens(input_tokens)\n\n# You can add labels directly\ninput_tokens[0].add_a_label('ner', 'PERSON')\n\n# A mixture of labels will give you the same result\ntoken_query_3 = TokenQuery('([ner:\"PERSON\"]) [pos:\"VBZ\"] [/an?/] [pos:\"JJ\"]* [\"painter\"]')\ntoken_query_3.match_tokens(input_tokens)\n\n# To cover names with more than one token\ntoken_query_4 = TokenQuery('([ner:\"PERSON\"]+) [pos:\"VBZ\"] [/an?/] [pos:\"JJ\"]* [\"painter\"]')\ntoken_query_4.match_tokens(input_tokens)\n    
\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Framtinms%2Ftokenquery","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Framtinms%2Ftokenquery","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Framtinms%2Ftokenquery/lists"}