https://github.com/textcorpuslabs/vlngramcounter
NGram counter for large datasets
ngrams python
- Host: GitHub
- URL: https://github.com/textcorpuslabs/vlngramcounter
- Owner: TextCorpusLabs
- License: mit
- Created: 2022-12-27T10:22:25.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2023-01-20T13:38:30.000Z (over 2 years ago)
- Last Synced: 2025-01-27T07:27:30.171Z (8 months ago)
- Topics: ngrams, python
- Language: Python
- Homepage:
- Size: 56.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Very Large NGram Counter


NGram counter for large corpora
# Operation
## Install
You can install the package with `pip` from an _admin_ prompt:
```{ps1}
pip uninstall VLNGramCounter -y
python -OO -m pip install -v git+https://github.com/TextCorpusLabs/VLNGramCounter.git
```
or, if you have the code locally:
```{ps1}
pip uninstall VLNGramCounter -y
python -OO -m pip install -v c:/repos/TextCorpusLabs/VLNGramCounter
```
## Run
Counts the n-grams contained in a folder of TXT files.
```{ps1}
VLNGramCounter -source d:/data/corpus -dest d:/data/corpus.ngrams.csv
```
The following are required parameters:
* `source` is the folder containing the TXT files.
* `dest` is the CSV file used to store the ngram results.

The following are optional parameters:
* `length` is the length of the n-gram.
The default is 1.
* `chunk_size` is the number of items held in the control structure before chunking.
Higher values use more RAM, but compute the overall result faster.
The default is 1M.
* `include` counts only the values in this CSV list.
The default is to count everything.
* `exclude` ignores the values in this CSV list.
The default is to exclude nothing.
**Note**: due to the order of operations, it only makes sense to `exclude` single tokens.
* `cutoff` is the minimum count an n-gram needs in order to be kept.
The default is 2.
* `top` is the number of n-grams to save.
The default is to keep 10K.
* `keep_case` (flag) keeps the casing as-is before converting to tokens for counting.
The default is to upper case everything.
* `keep_punct` (flag) keeps all punctuation as-is before converting to tokens for counting.
The default is to remove all tokens that are only punctuation.

**NOTE**: The order of operations for complex counting is as follows:
1. Transformation (`keep_case`)
2. Exclusion (`keep_punct` > `exclude`)
3. Inclusion (`include`)
4. Filter (`cutoff` > `top`)

## Debug/Test
The code in this repo is set up as a module.
[Debugging](https://code.visualstudio.com/docs/python/debugging#_module) and [testing](https://code.visualstudio.com/docs/python/testing) are based on the assumption that the module is already installed.
In order to debug (F5) or run the tests (Ctrl + ; Ctrl + A), make sure to install the module as editable (see below).
```{ps1}
pip uninstall VLNGramCounter -y
python -m pip install -e c:/repos/TextCorpusLabs/VLNGramCounter
```
When debugging in VSCode for the first time, consider adding the below config to the _launch.json_ file.
```{json}
"args" : [
"-source", "d:/data/corpus",
"-dest", "d:/data/corpus.ngrams.csv",
"-length", "1"]
```
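When testing against a small corpus, it can help to compare the CSV output with a tiny reference model.
The sketch below is a hypothetical in-memory version of the order of operations described under Run; it is not the package's implementation.
The function name `count_ngrams`, whitespace tokenization, the use of `string.punctuation`, the tie-breaking order, and building n-grams within a single line are all assumptions made for illustration.

```python
from collections import Counter
import string


def count_ngrams(lines, length=1, include=None, exclude=None,
                 cutoff=2, top=10_000, keep_case=False, keep_punct=False):
    """Hypothetical model of the documented order of operations (not the real code)."""
    exclude = set(exclude or [])
    include = set(include) if include else None
    counts = Counter()
    for line in lines:
        # 1. Transformation: upper-case everything unless keep_case is set.
        text = line if keep_case else line.upper()
        # Assumption: tokens are separated by whitespace.
        tokens = text.split()
        # 2. Exclusion: punctuation-only tokens first, then the exclude list.
        if not keep_punct:
            tokens = [t for t in tokens
                      if not all(c in string.punctuation for c in t)]
        # Exclude entries must match the tokens as they look after step 1.
        tokens = [t for t in tokens if t not in exclude]
        # Assumption: n-grams are built per line and joined with a single space.
        ngrams = (" ".join(tokens[i:i + length])
                  for i in range(len(tokens) - length + 1))
        # 3. Inclusion: keep only listed n-grams when an include list is given.
        counts.update(g for g in ngrams if include is None or g in include)
    # 4. Filter: drop counts below cutoff, then keep the `top` most frequent.
    kept = [(g, c) for g, c in counts.items() if c >= cutoff]
    kept.sort(key=lambda gc: (-gc[1], gc[0]))  # ties broken alphabetically
    return kept[:top]


if __name__ == "__main__":
    sample = ["the cat sat .", "The cat ran !"]
    for ngram, count in count_ngrams(sample, length=2, cutoff=1):
        print(ngram, count)
```

Because exclusion runs on individual tokens before any n-gram is built, this model also shows why `exclude` only makes sense for single tokens.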