{"id":16958488,"url":"https://github.com/dohliam/aligned-corpus-search","last_synced_at":"2025-08-16T23:06:30.838Z","repository":{"id":88991679,"uuid":"138774184","full_name":"dohliam/aligned-corpus-search","owner":"dohliam","description":"Simple aligned corpus search tool","archived":false,"fork":false,"pushed_at":"2018-06-26T18:00:41.000Z","size":4,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-21T14:46:48.387Z","etag":null,"topics":["corpora","corpus"],"latest_commit_sha":null,"homepage":null,"language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dohliam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-26T17:56:52.000Z","updated_at":"2018-06-26T18:00:43.000Z","dependencies_parsed_at":"2023-06-13T11:15:38.874Z","dependency_job_id":null,"html_url":"https://github.com/dohliam/aligned-corpus-search","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/dohliam/aligned-corpus-search","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dohliam%2Faligned-corpus-search","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dohliam%2Faligned-corpus-search/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dohliam%2Faligned-corpus-search/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dohliam%2Faligned-corpus-search/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/owners/dohliam","download_url":"https://codeload.github.com/dohliam/aligned-corpus-search/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dohliam%2Faligned-corpus-search/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270781393,"owners_count":24643820,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-16T02:00:11.002Z","response_time":91,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["corpora","corpus"],"created_at":"2024-10-13T22:42:43.131Z","updated_at":"2025-08-16T23:06:30.799Z","avatar_url":"https://github.com/dohliam.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Aligned Corpus Search - Simple aligned corpus search tool\n\nAligned Corpus Search is an extremely simple no-frills aligned corpus search tool that comes with support for configurable context, colour-highlighted results, regular expressions, and plain text output. It can be used to search through either a single file or an entire directory.\n\nThis tool has been designed specifically for the purpose of searching through CJK databases for patterns, but also works fine on other text. 
For non-CJK corpora, you will probably want to use the `--half-width` spacing option for alignment, and perhaps widen the context (`-c`) beyond 10 characters (e.g., to roughly the length of 10 words in the language you are searching).\n\n## Features\n\nIf you're wondering what benefits this might have over just grepping a bunch of files, the answer is really just \"alignment\" (hence the name), plus searching the files in the `data` directory by default.\n\nSample output:\n\n![](https://user-images.githubusercontent.com/9295750/41930469-fc7d50de-792f-11e8-9292-1ac41b06932d.png)\n\n![](https://user-images.githubusercontent.com/9295750/41930472-002c095a-7930-11e8-9714-abde3bab2fff.png)\n\n## Usage\n\nThe easiest way to use Aligned Corpus Search is to put some data into the `data` folder (one or more text files of any sort) and issue the following command, replacing `[KEYWORD]` with the search term(s) of your choice:\n\n    ./aligned.rb -k \"[KEYWORD]\"\n\nThis should immediately give you a list of results from your data, aligned so that the keyword is in a separate highlighted column.\n\n### Single file\n\nIf your data is somewhere other than the default `data` directory, you can search a specific file by passing its path to the `-i` option:\n\n    ./aligned.rb [options] -i [INPUT_FILE]\n\n### All files in directory\n\nYou can also point `aligned.rb` at an entire directory with the `-d` option and it will output results from all files in the directory:\n\n    ./aligned.rb [options] -d [DIRECTORY]\n\n### Backreferences\n\nTo use a backreference, wrap the initial pattern in parentheses and refer to it with two backslashes followed by the number of the reference. 
It is important to note, however, that backreferences begin from `\\\\2` (_not_ `\\\\1`).\n\nFor example:\n\n* `-k \"一(.)\\\\2\"` will return results (however `\"一(.)\\\\1\"` will **not** work)\n\nNote also that when using parentheses, the match and submatch will be displayed in the results on separate lines.\n\n## Options\n\nThe following command-line options are available:\n\n* `-c` (`--context CONTEXT`) - _Specify amount of surrounding context (in characters)_\n* `-C` (`--count-collocations`) - _Print a count of all collocated characters (together with -N or -P, and optionally -c)_\n* `-d` (`--directory DIRECTORY`) - _Specify source directory_\n* `-h` (`--half-width`) - _Use half-width spacing for alignment_\n* `-H` (`--highlight-color OPTIONS`) - _Specify highlight, foreground, and background text colors_\n* `-i` (`--input-file FILE`) - _Specify input file_\n* `-k` (`--keyword KEYWORD`) - _Specify keyword to search for_\n* `-K` (`--keyword-frequency`) - _Show only matching keywords arranged in order of frequency_\n* `-N` (`--collocated-next`) - _Print sorted list of collocations (following)_\n* `-p` (`--plain-text`) - _Output plain text without highlighting_\n* `-P` (`--collocated-previous`) - _Print sorted list of collocations (preceding)_\n\nIn general, lowercase short options (e.g., `-c`, `-k`, `-i`) adjust parameters of the input or output, while uppercase short options (e.g., `-C`, `-N`, `-P`) change the basic type of search performed.\n\n## License\n\nMIT.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdohliam%2Faligned-corpus-search","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdohliam%2Faligned-corpus-search","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdohliam%2Faligned-corpus-search/lists"}