https://github.com/axsaucedo/wcpy


# WordCount Python - [ wc.py ]

Utility tool to count the occurrences of words in text using an NLP tokenizer.

## Overview

This repository contains the CLI and SDK for the WordCount Python [ wc.py ].

`wc.py` provides a set of tools to analyse the number of occurrences of words across one or more documents. It can be accessed through the CLI, or directly through the SDK provided by the `WCExtractor` and `WCCore` classes in the `wcpy` module.

For the **CLI interface quickstart** please refer to the **User Guide below**.

For the **SDK interface quickstart** please refer to the **SDK Interface below**.

For more advanced documentation please refer to the official [WCPY documentation](https://axsauze.github.io/wcpy/).

### WCPY Interface

WordCount Python also has an external interface that can be used to interact with it.

![Image of wcpy API](https://github.com/axsauze/wcpy-api/blob/master/assets/wcpy-api.jpg)

This interface can be found at [https://github.com/axsauze/wcpy-api](https://github.com/axsauze/wcpy-api)

# Installation

You can install it from pip by running:

```
pip install wc.py
```

This will install the script on your computer so you'll be able to call it directly with `wc.py`.

# CLI User Guide

## Usage

After installing it you can view the usage and options with `wc.py -h`:

```
usage: wc.py [-h] [-v] [--limit LIMIT] [--reverse]
             [--filter-words FILTER_WORDS [FILTER_WORDS ...]]
             [--file-ext FILE_EXT] [--truncate TRUNCATE]
             [--columns COLUMNS [COLUMNS ...]] [--output-file OUTPUT_FILE]
             paths [paths ...]

Count the number of words in the files on a folder

positional arguments:
  paths                 (REQUIRED) Path(s) to folders and/or files to count words from

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --limit LIMIT         (Optional) Limit the number of results that you would like to display.
  --reverse             (Optional) List is sorted in ascending order by default, use this flag to reverse sorting to descending order.
  --filter-words FILTER_WORDS [FILTER_WORDS ...]
                        (Optional) You can get results filtered to only the list of words provided.
  --file-ext FILE_EXT   (Optional) This is the default file extention for the files being used
  --truncate TRUNCATE   (Optional) Output is often quite large, you can truncate the output by passing a number greater than 5
  --columns COLUMNS [COLUMNS ...]
                        (Optional) This argument allows you to choose the columns to be displayed in the output. Options are: word, count, files and sentences.
  --output-file OUTPUT_FILE
                        (Optional) Define an output file to save the output

EXAMPLE USAGE:
  wc.py ./
  wc.py ./ --limit 10
  wc.py doc1.txt doc2.txt --filter-words tool awesome an
  wc.py docs/ tests/ --truncate 100 --columns word count
  wc.py ./ --filter-words tool awesome an --truncate 50 --output output.txt
```
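
The help text above has the shape that Python's `argparse` module produces. As a rough sketch only, assuming the CLI is built on `argparse` (not confirmed here), the same options could be declared as below. The `build_parser` name, the argument types and the missing defaults are illustrative assumptions, and the `-v/--version` flag is left out for brevity.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="wc.py",
        description="Count the number of words in the files on a folder",
    )
    parser.add_argument("paths", nargs="+")           # one or more files/folders
    parser.add_argument("--limit", type=int)          # assumed to be an integer
    parser.add_argument("--reverse", action="store_true")
    parser.add_argument("--filter-words", nargs="+")
    parser.add_argument("--file-ext")
    parser.add_argument("--truncate", type=int)       # assumed to be an integer
    parser.add_argument("--columns", nargs="+")
    parser.add_argument("--output-file")
    return parser


# Parse the last example from the help text.
args = build_parser().parse_args(
    ["./", "--filter-words", "tool", "awesome", "an",
     "--truncate", "50", "--output", "output.txt"]
)
print(args.paths, args.filter_words, args.truncate, args.output_file)
```

If the tool does use `argparse`, this would also explain why the examples can pass `--output` although the option is named `--output-file`: by default `argparse` accepts any unambiguous prefix of a long option.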

## Examples

#### Counts of word occurrences in documents in this folder recursively

```
wc.py ./
```

#### Word occurrences in documents in this folder, limited to the top 10

```
wc.py ./ --limit 10
```

#### Word occurrences in multiple files showing only specific words

```
wc.py doc2.txt doc1.txt --filter-words tool awesome an
```

#### Word occurrences in folder with output truncated and only 2 columns

```
wc.py tests/test_data/ --truncate 20 --columns word count
```

#### Saving output to file

```
wc.py ./ --filter-words tool awesome an --truncate 50 --output output.txt
```

#### Get the current version

```
wc.py -v
```

# SDK Interface

The SDK can be used at multiple levels. The two most common use cases are:

* WCCore class - Interact with filepaths
* WCExtractor class - Interact with files and text

## WCCore class

### generate_wc_dict(self, paths)

This function finds all the files in a given set of paths, and builds a dictionary with the `WCDict` structure described under Core WC Types below.

### generate_wc_list(self, paths)

This function finds all the files in a given set of paths, and builds a list, sorted by word count, with the `WCList` structure described under Core WC Types below.
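
Both functions start by resolving the given paths into files. The sketch below shows one way that step could look, using only the standard library. It is not the `WCCore` implementation: the `collect_files` helper and its `file_ext` parameter are hypothetical names chosen for illustration.

```python
import os


def collect_files(paths, file_ext=None):
    """Return a sorted list of files found under the given paths.

    Hypothetical helper, not part of the wcpy API.
    """
    found = []
    for path in paths:
        if os.path.isfile(path):
            # Files passed explicitly are kept regardless of extension.
            found.append(path)
        elif os.path.isdir(path):
            # Folders are searched recursively.
            for root, _dirs, names in os.walk(path):
                for name in names:
                    if file_ext is None or name.endswith(file_ext):
                        found.append(os.path.join(root, name))
    return sorted(found)
```

For example, `collect_files(["docs/", "notes.txt"], file_ext=".txt")` would return `notes.txt` plus every `.txt` file below `docs/`.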

## WCExtractor class

### extract_wc_from_file

This function extracts all the text from a file and builds a `wc_dict` object.

### extract_wc_from_line

This function extracts all the words from a line and adds them to a `wc_dict` object.
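
To make the idea concrete, here is a minimal sketch of adding the words of one line to a word-count dictionary. It is not the `WCExtractor` code: `add_line_to_wc_dict` is a hypothetical name, a simple regular expression stands in for the NLTK tokenizer, and the `word_count`/`files` layout is assumed from the Core WC Types section.

```python
import re


def add_line_to_wc_dict(wc_dict, line, file_path):
    """Count each word in `line` and record the line under `file_path`."""
    # Regex tokenizer used only as a stand-in for the NLTK tokenizer.
    words = re.findall(r"[a-z0-9']+", line.lower())
    for word in words:
        entry = wc_dict.setdefault(word, {"word_count": 0, "files": {}})
        entry["word_count"] += 1
        sentences = entry["files"].setdefault(file_path, [])
        # Avoid storing the same line text twice for one word and file.
        if line not in sentences:
            sentences.append(line)
    return wc_dict


wc_dict = {}
add_line_to_wc_dict(wc_dict, "An awesome tool, an awesome day", "doc1.txt")
print(wc_dict["awesome"]["word_count"])
```

Processing a whole file then amounts to calling this function once per line with the same `wc_dict`.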

## WCExtractorProcessor class

This class does all the processing to convert a WCDict into a sorted WCList object.

### process_dict_wc_to_list

As the function name suggests, this function converts a WCDict object into a sorted WCList object.
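
The conversion can be sketched in a few lines. This is an illustration under assumptions, not the library's code: `wc_dict_to_list` and its `reverse` and `limit` parameters are hypothetical, the ascending default follows the CLI help text, and ties are broken alphabetically only to keep the output deterministic.

```python
def wc_dict_to_list(wc_dict, reverse=False, limit=None):
    """Turn {word: {...}} into a list of {"word": word, ...} sorted by count."""
    # Move each dictionary key into a "word" field of its own entry.
    wc_list = [{"word": word, **data} for word, data in wc_dict.items()]
    # Sort by count, then by word so equal counts have a stable order.
    wc_list.sort(key=lambda e: (e["word_count"], e["word"]), reverse=reverse)
    return wc_list if limit is None else wc_list[:limit]


counts = {
    "an": {"word_count": 3, "files": {}},
    "tool": {"word_count": 1, "files": {}},
    "awesome": {"word_count": 2, "files": {}},
}
print([e["word"] for e in wc_dict_to_list(counts)])
```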

## Core WC Types

### WCDict

```
{
    <word>: {
        word_count: <count>,
        files: {
            <file path>: [
                <sentence>,
                <sentence>,
                ...
            ]
        },
        {
            ...
        }
    },
    <word>: ...
}
```

### WCList

```
[
    {
        "word": <word>,
        word_count: <count>,
        files: {
            <file path>: [
                <sentence>,
                <sentence>,
                ...
            ]
        }
    },
    {
        "word": <word>,
        ...
    }
]
```
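
For readers who prefer type annotations, the two structures can be approximated with `typing.TypedDict`. This is a sketch under assumptions: the class names are hypothetical, and the value types (an integer count, and lists of sentence strings keyed by file path) are inferred from the `word`, `count`, `files` and `sentences` columns of the CLI rather than taken from the library.

```python
from typing import Dict, List, TypedDict


class WCDictEntry(TypedDict):
    """Value stored under each word in a WCDict (assumed layout)."""
    word_count: int
    files: Dict[str, List[str]]  # file path -> sentences containing the word


class WCListEntry(TypedDict):
    """One element of a WCList: a WCDict entry plus its word."""
    word: str
    word_count: int
    files: Dict[str, List[str]]


WCDict = Dict[str, WCDictEntry]
WCList = List[WCListEntry]

example_dict: WCDict = {
    "tool": {
        "word_count": 2,
        "files": {"doc1.txt": ["A tool.", "Another tool."]},
    },
}
# The list form carries the same data with the key moved into "word".
example_list: WCList = [
    {"word": word, "word_count": e["word_count"], "files": e["files"]}
    for word, e in example_dict.items()
]
```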

# Contributing

If you'd like to contribute, feel free to submit a pull request, open bugs/issues and join the discussion.

## Install VirtualEnv and Requirements

Python 3.X is used, and it's strongly recommended to set up the project in a virtual environment:

```
virtualenv -p python3 venv
```

Then install it using the setup.py command:

```
python setup.py install_data
```

You can also install the requirements directly by running:

```
pip install -r requirements.txt
```

## NLTK

This package uses the NLTK `english.pickle` dataset. The dataset is included in both the repository and the PyPI package. If you want to download additional languages, you can do so with the following command:

```
python -c "import nltk; nltk.download('punkt')"
```

## Testing

`py.test` is used to run the tests. To run them, use:

```
python setup.py test
```

## Cleaning

To clean all the files generated during runtime simply run:

```
python setup.py clean
```

# Roadmap

* Support multiple types of documents