Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/axsaucedo/wcpy
https://github.com/axsaucedo/wcpy
Last synced: about 7 hours ago
JSON representation
- Host: GitHub
- URL: https://github.com/axsaucedo/wcpy
- Owner: axsaucedo
- License: mit
- Created: 2017-06-24T13:28:04.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-06-25T18:52:05.000Z (over 7 years ago)
- Last Synced: 2024-04-24T13:16:47.496Z (7 months ago)
- Language: JavaScript
- Size: 55.6 MB
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# WordCount Python - [ wc.py ]
Utility tool to count the occurrences of words in text using an NLP tokenizer.
## Overview
This repository contains the CLI and SDK for the WordCount Python [ wc.py ].
`wc.py` provides a set of tools to analyse the number of occurences of words across a single or multiple documents. It can be accessed through the CLI, or directly through the SDK provided by the `WCExtractor` and the `WCCore` classes in the `wcpy` module.
For the **CLI interface quickstart** please refer to the **User Guide below**.
For the **SDK interface quickstart** please refer to the **SDK Interface below**.
For more advanced documnetation please refer to the official [WCPY documentation](https://axsauze.github.io/wcpy/).
### WCPY Interface
The word count python interface also has an external interface that can be used to interact with.
![Image of wcpy API](https://github.com/axsauze/wcpy-api/blob/master/assets/wcpy-api.jpg)
This interface can be found at [https://github.com/axsauze/wcpy-api](https://github.com/axsauze/wcpy-api)
# Installation
You can install it from pip by running:
```
pip install wc.py
```This will install the script in your computer so you'll be able to call it directly with `wc.py`.
# CLI User Guide
## Usage
After installing it you can view the usage and options with `wc.py -h`:
```
usage: wc.py [-h] [-v] [--limit LIMIT] [--reverse]
[--filter-words FILTER_WORDS [FILTER_WORDS ...]]
[--file-ext FILE_EXT] [--truncate TRUNCATE]
[--columns COLUMNS [COLUMNS ...]] [--output-file OUTPUT_FILE]
paths [paths ...]Count the number of words in the files on a folder
positional arguments:
paths (REQUIRED) Path(s) to folders and/or files to count words fromoptional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
--limit LIMIT (Optional) Limit the number of results that you would like to display.
--reverse (Optional) List is sorted in ascending order by default, use this flag to reverse sorting to descending order.
--filter-words FILTER_WORDS [FILTER_WORDS ...]
(Optional) You can get results filtered to only the list of words provided.
--file-ext FILE_EXT (Optional) This is the default file extention for the files being used
--truncate TRUNCATE (Optional) Output is often quite large, you can truncate the output by passing a number greater than 5
--columns COLUMNS [COLUMNS ...]
(Optional) This argument allows you to choose the columns to be displayed in the output. Options are: word, count, files and sentences.
--output-file OUTPUT_FILE
(Optional) Define an output file to save the outputEXAMPLE USAGE:
wc.py ./
wc.py ./ --limit 10
wc.py doc1.txt doc2.txt --filter-words tool awesome an
wc.py docs/ tests/ --truncate 100 --columns word count
wc.py ./ --filter-words tool awesome an --truncate 50 --output output.txt
```## Examples
#### Counts of word occurences in documents in this folder recusively
```
wc.py ./
```#### Word occurrences in this folder docs with limit of the top 10
```
wc.py ./ --limit 10
```#### Word occurences in multiple files showing only specific words
```
wc.py doc2.txt doc1.txt --filter-words tool awesome an
```#### Word occurences in folder with output truncated and only 2 columns
```
wc.py tests/test_data/ --truncate 20 --columns word count
```#### Saving output to file
```
wc.py ./ --filter-words tool awesome an --truncate 50 --output output.txt
```#### Get the current version
```
wc.py -v
```# SDK Interface
It is possible to interact with the SDK in multiple levels, the two most common usecases will be:
* WCCore class - Interact with filepaths
* WCExtractor class - Interact with files and text## WCCore class
### generate_wc_dict(self, paths)
This function finds all the files in a given set of paths, and builds a dictionary with the following structure:
### generate_wc_list(self, paths)
This function finds all the files in a given set of paths, and builds a sorted list (by word count) of the following structure
## WCExtractor class
### extract_wc_from_file
This function extracts all the text from a file and builds a wc_dict object
### extract_wc_from_line
This function extracts all the words from a line and adds it to a wc_dict object
## WCExtractorProcessor class
This class does all the processing to convert a WC Dict into a sorted WCList object.
### process_dict_wc_to_list
As function name suggest, this function converts a WCDict object into a sorted WCList object.
## Core WC Types
### WCDict
```
{
: {
word_count: ,
files: {
: [
,
,
…
]
},
{
…
}
},
: ...
}
```### WCList
```
[
{
"word": ,word_count: ,
files: {
: [
,
,
…
]
}
},
{
"word": ,
...}
]
```# Contributing
If you'd like to contribute, feel free to submit a pull request, open bugs/issues and join the discussion.
## Install VirtualEnv and Requirements
Python 3.X is used, and it's strongly recommended to set up the project in a virtual environment:
```
virtualenv --no-site-packages -p python3 venv
```Then install it using the setup.py command
```
python setup.py install_data
```You can also install the requirements directly by running
```
python -r requirements.txt
```## NLTK
This package uses the NLTK `english.pickle` dataset. The dataset includes in both, the repository and the PyPi package, however if you want to donwload more of the languages you can do so with the following command:
```
python -c "import nltk; nltk.download('punkt')"
```## Testing
`py.test` is used to run the tests, in order to run it simply run:
```
python setup.py test
```## Cleaning
To clean all the files generated during runtime simply run:
```
python setup.py clean
```# Roadmap
* Support multiple types of documents