{"id":21905044,"url":"https://github.com/textcorpuslabs/building-blocks","last_synced_at":"2025-03-22T07:14:53.122Z","repository":{"id":60840941,"uuid":"285774653","full_name":"TextCorpusLabs/building-blocks","owner":"TextCorpusLabs","description":"Building blocks for text pre-processing","archived":false,"fork":false,"pushed_at":"2022-10-05T11:51:11.000Z","size":147,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-27T07:27:29.754Z","etag":null,"topics":["python3","text-processing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TextCorpusLabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-08-07T08:09:04.000Z","updated_at":"2022-09-28T22:16:26.000Z","dependencies_parsed_at":"2022-10-05T14:45:49.166Z","dependency_job_id":null,"html_url":"https://github.com/TextCorpusLabs/building-blocks","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TextCorpusLabs%2Fbuilding-blocks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TextCorpusLabs%2Fbuilding-blocks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TextCorpusLabs%2Fbuilding-blocks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TextCorpusLabs%2Fbuilding-blocks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TextCorpusLabs","download_url":"https://codeload.github.com/TextCorpusLabs/buil
ding-blocks/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244918710,"owners_count":20531686,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["python3","text-processing"],"created_at":"2024-11-28T16:20:34.195Z","updated_at":"2025-03-22T07:14:53.100Z","avatar_url":"https://github.com/TextCorpusLabs.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Building Blocks\n\n![Python](https://img.shields.io/badge/python-3.x-blue.svg)\n![MIT license](https://img.shields.io/badge/License-MIT-green.svg)\n![Last Updated](https://img.shields.io/badge/Last%20Updated-2022.09.28-success.svg)\n\nBelow is a list of the corpus tools we use at Text Corpus Labs.\nThey are intended to be general purpose building blocks allowing for conversion between our different processes.\n\n**NOTE**: This project is currently undergoing a retrofit.\nThe checklist below now shows conversion status.\nWhile the retrofit is in progress, the old code will still work; it is just nested in a subfolder.\n\n# Operation\n\n## Install\n\nYou can install the package using the following steps:\n\n1. `pip` install using an _admin_ prompt.\n   ```{ps1}\n   pip uninstall buildingblocks\n   python -OO -m pip install -v git+https://github.com/TextCorpusLabs/building-blocks.git\n   ```\n\n## Run\n\nYou can run the package in the following ways:\n\n### Extract\n\n1. 
Pull fields from every JSON object in a JSONL file into a CSV file\n   ```{ps1}\n   buildingblocks extract jsonl_to_csv `\n      -source d:/data/corpus `\n      -dest d:/data/corpus.csv\n   ```\n   The following are optional parameters\n   * `fields` are the names of the fields to extract.\n     It defaults to \"id\"\n\n### Transform\n\n1. Count the n-grams in a JSONL file.\n   ```{ps1}\n   buildingblocks transform ngram `\n      -source d:/data/corpus `\n      -dest d:/data/corpus.ngrams.csv\n   ```\n   The following are optional parameters\n   * `fields` are the names of the fields to process.\n     It defaults to \"text\"\n   * `size` is the length of the n-gram.\n     It defaults to 1\n   * `top` is the number of n-grams to save.\n     It defaults to 10K\n   * `chunk` controls the number of n-grams to chunk to disk to prevent OOM.\n     Higher values use more RAM, but compute the overall value faster.\n     It defaults to 10M.\n   * `keep_case` (flag) keeps the casing of `fields` as-is before converting to tokens for counting.\n   * `keep_punct` (flag) keeps all punctuation of `fields` as-is before converting to tokens for counting.\n\n# TODO\n\nAll script commands are presented in PowerShell syntax.\nIf you use a different shell, your syntax will be different.\n\nAdding `-O` to the front of any script runs it in \"optimized\" mode.\nThis can give as much as a 50% boost in some cases, but prevents errors from making sense.\nIf there is an error in a run, remove the `-O`, capture the error, and submit an [issue](https://github.com/TextCorpusLabs/building-blocks/issues).\n\n## Combine\n01. - [x] [Combine](./docs/combine_json_to_jsonl.md) a folder of `JSON` files into a single `JSONL` file.\n02. - [x] [Combine](./docs/combine_txt_to_jsonl.md) a folder of `TXT` files into a single `JSONL` file.\n\n## Convert\n01. - [x] [Convert](./docs/convert_jsonl.md) a `JSONL` file into a _smaller_ `JSONL` file by keeping only some elements.\n02. 
- [x] [Convert](./docs/convert_txt.md) a folder of `TXT` files into a folder of _bigger_ `TXT` files.\n03. - [x] [Convert](./docs/convert_jsonl_to_jsont.md) a `JSONL` file into a `JSONT` file.\n04. - [x] [Convert](./docs/convert_jsont_to_jsonl.md) a `JSONT` file into a `JSONL` file.\n\n## Extract\n02. - [x] [Extract](./docs/extract_itxt_from_jsonl.md) a folder of _interleaved_ `TXT` files from a `JSONL` file.\n03. - [x] [Extract](./docs/extract_json_from_jsonl.md) a folder of `JSON` files from a `JSONL` file.\n04. - [x] [Extract](./docs/extract_txt_from_jsonl.md) a folder of `TXT` files from a `JSONL` file.\n\n## Merge\n01. - [x] [Merge](./docs/merge_json_folders.md) _several_ folders of `JSON` files into a _single_ folder of `JSON` files based on their file name.\n02. - [x] [Merge](./docs/merge_txt_folders.md) _several_ folders of `TXT` files into a _single_ folder of `TXT` files based on their file name.\n\n## Transform\n01. - [x] [Tokenize](./docs/tokenize_jsonl.md) a `JSONL` file using the NLTK defaults (Punkt + Penn Treebank).\n\n# Development\n\nUse the instructions below to set up the module for local development.\n\n1. Clone this repository, then open an _Admin_ shell to the `~/` directory.\n2. Install the required modules.\n   ```{shell}\n   pip uninstall buildingblocks\n   pip install -e c:/repos/TextCorpusLabs/building-blocks\n   ```\n3. Set up the `~/.vscode/launch.json` file (VS Code only)\n   1. Click the \"Run and Debug Charm\"\n   2. Click the \"create a launch.json file\" link\n   3. Select \"Python\"\n   4. Select \"module\" and enter _buildingblocks_\n   5. 
Select one of the following modes and add the below `args` to the launch.json file.\n      The `args` node should be a sibling of the `module` node.\n      You will need to change your pathing and arguments.\n      The first two arguments determine the command, the other arguments are the command's parameters.\n      ```{json}\n      \"args\" : [\n         \"extract\", \"jsonl_to_csv\",\n         \"-source\", \"d:/data/corpus\",\n         \"-dest\", \"d:/data/corpus.csv\",\n         \"-fields\", \"id,text\"]\n      ```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftextcorpuslabs%2Fbuilding-blocks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftextcorpuslabs%2Fbuilding-blocks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftextcorpuslabs%2Fbuilding-blocks/lists"}