{"id":17063958,"url":"https://github.com/rk2900/ngramtools","last_synced_at":"2025-03-23T09:17:04.473Z","repository":{"id":70335696,"uuid":"39780149","full_name":"rk2900/ngramtools","owner":"rk2900","description":"Automatically exported from code.google.com/p/ngramtools","archived":false,"fork":false,"pushed_at":"2015-07-27T14:54:24.000Z","size":528,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-01-28T15:49:19.438Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rk2900.png","metadata":{"files":{"readme":"README.txt","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-07-27T14:51:51.000Z","updated_at":"2016-01-21T10:12:54.000Z","dependencies_parsed_at":"2023-02-27T23:01:14.684Z","dependency_job_id":null,"html_url":"https://github.com/rk2900/ngramtools","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rk2900%2Fngramtools","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rk2900%2Fngramtools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rk2900%2Fngramtools/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rk2900%2Fngramtools/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rk2900","download_url":"https://codeload.github.com/rk2900/ngramtools/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245078201,"owners_c
ount":20557284,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-14T10:53:27.066Z","updated_at":"2025-03-23T09:17:04.452Z","avatar_url":"https://github.com/rk2900.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"This directory contains programs for n-gram search and pattern\nmatching.\n\nThere are three executables:\n\n./ngram_server INDEX_FILE [PORT]\n  Start a server for looking up the ngrams corresponding to\n  INDEX_FILE. The default port number is 6700.\n  For example:\n  ./ngram_server /export/ws09/dlin/GoogleV2/rotated.index 33333\n\n./search_prefix [INDEX_FILE|HOSTNAME:PORT] PREFIX [EXTRACTOR]\n   Search for the PREFIX in the ngram data collection. If an EXTRACTOR is\n   given, it will be used to process the results. The ngrams can be\n   looked up either with the INDEX_FILE or from an ngram server at\n   the specified host:port.\n   For example:\n   ./search_prefix localhost:33333 \"time flies\"\n\n./batch_counting EXTRACTOR or ./batch_counting -file FILE\n   The batch_counting program can be run with Hadoop. The file vbn_vbd.sh\n   contains an example.\n\nThe EXTRACTORs for the above executables can be the following:\n  (count PATTERN :format FORMAT [:max-match N] [:prefix-query PHRASE])\n    Whenever an n-gram is found to match the pattern, we aggregate\n    the count of the string with the given format. 
For example:\n    head -1000000 /export/ws09/dlin/GoogleV2/rotated-878.txt | ./batch_counting '\n        (count (seq (+ (tag ~ [NJ].*) :name noun)\n                    (or (word = who :name animate)\n                        (word = which :name inanimate)))\n           :format $[noun]:[animate|inanimate])'\n\n    The special characters in a FORMAT string include '[', ']', '|',\n    and '$'. The square brackets should contain one or more (separated\n    by '|') names of subpatterns. If a '$' precedes the '[', the\n    matching subsequence in the ngram is used to construct the counted\n    string. Otherwise the subpattern name is used.\n\n  (print-ngram PATTERN [:max-match N] [:prefix-query PHRASE])\n     Print the ngrams that match the pattern.\n\n  (count-key-val PATTERN :key NAME [:val-inst] [:log-count] [:max-match N] [:prefix-query PHRASE])\n     Output the counts of key-value pairs defined in the pattern. The\n     key is whatever matched the component of the PATTERN with the\n     given NAME. The corresponding value is whatever matches another\n     named component (with a different name). Therefore there must be\n     other named components in the PATTERN in addition to the key. For\n     example, the following extractor counts the animate and inanimate\n     features for each noun/adjective sequence.\n\n     (count-key-val (seq (+ (tag ~ [NJ].*) :name noun)\n                         (or (word = who :name animate)\n                             (word = which :name inanimate)))\n                    :key noun)\n\n     For another example, the following extractor counts the determiners\n     for noun phrases. 
\n     (count-key-val (seq (or (tag = IN :name none)\n                             (and (tag = DT)\n                                  (or (word in (the The this This) :name definite)\n                                      (word in (a A an An) :name indefinite))))\n                         (+ (tag ~ [NJ].*) :name noun)\n                         (tag ~ \"(:|,|IN|V.*)\" :max-count-only))\n                    :key noun)\n\n    By default, the names of component patterns are used as values\n    (e.g., definite and indefinite in the above pattern). If the\n    option :val-inst is given, the token sequences that matched the\n    component pattern are treated as the values being counted.\n\n  (count-named PATTERN [:max-match N] [:prefix-query PHRASE])\n    Outputs the total counts of all ngrams that matched the named\n    components in the pattern.\n\n  (extractor-set EXTRACTOR EXTRACTOR ...EXTRACTOR)\n    \n\nN-gram Patterns\n\nAtomic Patterns:\n\n(word = WORD) or (word ~ REGEXP) or (word in LIST)\n  matches a single token that is equal to the word or matches the\n  regular expression. Here, LIST can either be a LISP-like list, e.g.,\n  (one two three), or the name of a text file where each line is an\n  element in the list.\n\n(tag = TAG [:max-count-only]) or (tag ~ REGEXP [:max-count-only]) or\n(tag in LIST [:max-count-only])\n  matches a single tag that is equal to the given tag or matches the\n  regular expression. When the flag :max-count-only is present, only\n  the most frequent tag sequence is considered during the match.\n\n\n(tag-seq in FILE)\n(tag-seq in (\"SPACE SEPARATED TAGS\" ... \"SPACE SEPARATED TAGS\"))\n(tag-seq (REGEXP ... REGEXP))\n(tag-seq = \"SPACE SEPARATED TAGS\")\n(tag-seq ~ \"SPACE SEPARATED REGEXPS\")\n  matches a single POS tag sequence where each component satisfies\n  the corresponding regular expression or is equal to (one of) the\n  given tag sequences. 
If a file name is given, each line in the file\n  is assumed to be a space-separated sequence.\n\n(word-seq in FILE)\n(word-seq in (\"SPACE SEPARATED WORDS\" ... \"SPACE SEPARATED WORDS\"))\n(word-seq (REGEXP ... REGEXP))\n(word-seq = \"SPACE SEPARATED WORDS\")\n(word-seq ~ \"SPACE SEPARATED REGEXPS\")\n  matches a sequence of words where each component satisfies\n  the corresponding regular expression or is equal to (one of) the\n  given word sequences. If a file name is given, each line in the file\n  is assumed to be a space-separated sequence.\n\n(t) matches any token\n\n(\u003e\u003c)\n  matches the token immediately after the divider. In the rotated\n  n-gram files, this is always the first token in a line.\n\nSingle-component Patterns\n(+ PATTERN [:min N1] [:max N2])\n  matches N1 to N2 (both inclusive) subsequences that match the\n  PATTERN (N1 \u003e= 1)\n\n(* PATTERN [:max N])\n  matches 0 to N (inclusive) subsequences that match the PATTERN\n\nMulti-component Patterns\n(seq PATTERN1 PATTERN2 ...... PATTERNn)\n  matches a sequence of tokens\n\n(and PATTERN1 PATTERN2 ...... PATTERNn)\n  matches a sequence if the sequence matches all the patterns PATTERN1\n  PATTERN2 ...... PATTERNn.\n\n(or PATTERN1 PATTERN2 ...... PATTERNn)\n  matches a sequence if the sequence matches any of the patterns PATTERN1\n  PATTERN2 ...... PATTERNn.\n\nWhole N-gram Patterns\n(grep REGEXP)\n  matches a whole n-gram if the concatenated string of the n-gram (joined\n  with spaces) matches the regular expression.\n\n(grep-tag REGEXP)\n  matches a whole n-gram if the concatenated tags of the n-gram (joined\n  with '|') match the regular expression.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frk2900%2Fngramtools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frk2900%2Fngramtools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frk2900%2Fngramtools/lists"}