# freq

A command-line tool that counts the number of word occurrences in an input.

[![James Munns on Twitter](assets/tweet.png)](https://twitter.com/bitshiftmask/status/1367451210987544580)

This is just a placeholder repository for now.
Please create issues for feature requests and collaboration.

## Usage

### Commandline

```sh
echo "b a n a n a" | freq

0.16666667 - 1 - b
0.33333334 - 2 - n
0.5 - 3 - a
```
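
Each output line shows the word's relative frequency (its count divided by the total number of words, here 6), the absolute count, and the word itself. As a rough illustration of that computation (not freq's actual implementation), the same numbers could be produced like this:

```rust
use std::collections::BTreeMap;

fn main() {
    let input = "b a n a n a";
    let words: Vec<&str> = input.split_whitespace().collect();

    // Count occurrences per word.
    let mut counts: BTreeMap<&str, u32> = BTreeMap::new();
    for &word in &words {
        *counts.entry(word).or_insert(0) += 1;
    }

    // Print "relative frequency - count - word", ordered by count.
    let total = words.len() as f32;
    let mut entries: Vec<(&str, u32)> = counts.into_iter().collect();
    entries.sort_by_key(|&(_, count)| count);
    for (word, count) in entries {
        println!("{} - {} - {}", count as f32 / total, count, word);
    }
}
```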

### Library

```rust
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let frequencies = freq::count("fixtures/sample.txt")?;
    println!("{:?}", frequencies);
    Ok(())
}
```

## Features

- [x] Ignore words ([regex pattern](https://docs.rs/regex/latest/regex/struct.RegexSet.html)) [[issue 5](https://github.com/mre/freq/issues/5)] (see the sketch below this list)
- [x] Different output formats (plaintext, JSON)
- [x] freq.toml configuration file
- [x] Filter stopwords (similar to NLTK's stopwords)
- [ ] Performance (SIMD support, async execution)
- [ ] Recursion support
- [ ] Allow skipping files
- [ ] Allow specifying ignored words in a separate file
- [ ] Generate "heat bars" for words like shell-hist does
- [ ] Split report by file/folder (sort of like `sloc` does for code)
- [ ] Choose language for stopwords (`--lang fr`)
- [ ] Format output (e.g. justify counts a la `uniq -c`)
- [ ] Interactive mode (shows stats while running) (`--interactive`)
- [ ] Calculate TF-IDF score in a multi-file scenario
- [ ] Limit the output to the top N words (e.g. `--top 3`)
- [ ] Ignore hidden files (begins with `.`)
- [ ] Minimize number of allocations
- [ ] No-std support?
- [ ] Ignore "words" only consisting of special characters, e.g. `///`
- [ ] Multiple files as inputs
- [ ] Glob input patterns
- [ ] If directory is given, walk contents of folder recursively (walker)
- [ ] Verbose output (show currently analyzed file etc)
- [ ] Library usage
- [ ] https://github.com/jonhoo/evmap
- [ ] Automated abstract generation with Luhn's algorithm [Issue #1](https://github.com/mre/freq/issues/1)
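
For the "ignore words" item, the linked `RegexSet` from the regex crate allows dropping every word that matches one of several patterns in a single pass. A minimal sketch of that idea (the patterns and input are made up for illustration and this is not freq's actual implementation):

```rust
use regex::RegexSet;
use std::collections::HashMap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Ignore pure numbers and "words" consisting only of slashes, e.g. `///`.
    let ignore = RegexSet::new([r"^\d+$", r"^/+$"])?;

    let input = "fn main /// 42 fn";
    let mut counts: HashMap<&str, u32> = HashMap::new();
    for word in input.split_whitespace().filter(|w| !ignore.is_match(w)) {
        *counts.entry(word).or_insert(0) += 1;
    }

    println!("{counts:?}"); // e.g. {"fn": 2, "main": 1}
    Ok(())
}
```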

Idea contributors:

- [@jamesmunns](https://github.com/jamesmunns)
- [@M3t0r](https://github.com/M3t0r)
- [@themihel](https://github.com/themihel)
- [@AlexanderThaller](https://github.com/AlexanderThaller)
- [@pizzamig](https://github.com/pizzamig)
- Want to see your name here? Create an issue!

## Similar tools

**tot-up**

A similar tool written in Rust with nice graphical output.
https://github.com/payload/tot-up

**uniq**

A basic version would be:

```sh,ignore
curl -L 'https://github.com/mre/freq/raw/main/README.md' | tr -cs '[:alnum:]' "\n" | grep -vEx 'and|or|for|a|of|to|an|in' | sort | uniq -c | sort
```
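
Here `tr -cs '[:alnum:]' "\n"` turns every run of non-alphanumeric characters into a newline, so each word ends up on its own line; `grep -vEx` drops a hand-picked set of stopwords; and `sort | uniq -c | sort` groups identical words, counts them, and orders the output by count.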

This works, but it's not very extensible for normal users.
It would also lack most of the features listed above.

**Lucene**

Has all the bells and whistles, but there is no official CLI interface, and it requires a full Java installation.

**wordcount**

`freqword freq`

Nice and simple. It doesn't exclude stopwords and has no regex support, though.
https://github.com/juditacs/wordcount

**word-frequency**

A Haskell-based approach. It includes features like a minimum word length or a minimum number of occurrences in the text.
https://github.com/cbzehner/word-frequency

**What else?**

There must be more tools out there. Can you help me find them?