Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/adulau/napkin-text-analysis

Napkin is a simple tool to produce statistical analysis of a text
https://github.com/adulau/napkin-text-analysis

nlp text-analysis text-mining

Last synced: 1 day ago
JSON representation

Napkin is a simple tool to produce statistical analysis of a text

Host: GitHub
URL: https://github.com/adulau/napkin-text-analysis
Owner: adulau
License: agpl-3.0
Created: 2020-08-18T14:49:23.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2024-02-25T14:55:25.000Z (9 months ago)
Last Synced: 2024-04-16T18:21:51.111Z (7 months ago)
Topics: nlp, text-analysis, text-mining
Language: Python
Homepage: https://adulau.github.io/napkin-text-analysis/
Size: 556 KB
Stars: 11
Watchers: 6
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# napkin-text-analysis

![napkin text analysis - logo](./logo/logo.png)

Napkin is a Python tool to produce statistical analysis of a text.

Analysis features are :

- Verbs frequency
- Nouns frequency
- Digit frequency
- Labels frequency such as (Person, organisation, product, location) as defined in spacy.io [named entities](https://spacy.io/api/annotation#named-entities)
- URL frequency
- Email frequency
- Mention frequency (everything prefixed with an @ symbol)
- Out-Of-Vocabulary (OOV) word frequency meaning any words outside English dictionary

Verbs and nouns are in their lemmatized form by default but the option `--verbatim` allows to keep the original inflection.

Intermediate results are stored in a Redis database to allow the analysis of multiple text files.

# requirements

- Python >= 3.6
- spacy.io
- redis (a redis server running on port 6380 is required)
- pycld3
- tabulate

# how to use napkin

~~~~
usage: napkin.py [-h] [-v V] [-f F] [-t T] [-s] [-o O] [-l L] [-i]
[--verbatim] [--no-flushdb] [--binary] [--analysis ANALYSIS]
[--disable-parser] [--disable-tagger]
[--token-span TOKEN_SPAN] [--table-format TABLE_FORMAT]
[--full-labels]

Extract statistical analysis of text

optional arguments:
-h, --help show this help message and exit
-v V verbose output
-f F file to analyse
-t T maximum value for the top list (default is 100) -1 is
no limit
-s display the overall statistics (default is False)
-o O output format (default is csv), json, readable
-l L language used for the analysis (default is en)
-i Use stdin instead of a filename
--verbatim Don't use the lemmatized form, use verbatim. (default
is the lematized form)
--no-flushdb Don't flush the redisdb, useful when you want to
process multiple files and aggregate the results. (by
default the redis database is flushed at each run)
--binary set output in binary instead of UTF-8 (default)
--analysis ANALYSIS Limit output to a specific analysis (verb, noun,
hashtag, mention, digit, url, oov, labels, punct).
(Default is all analysis are displayed)
--disable-parser disable parser component in Spacy
--disable-tagger disable tagger component in Spacy
--token-span TOKEN_SPAN
Find the sentences where a specific token is located
--table-format TABLE_FORMAT
set tabulate format (default is fancy_grid)
--full-labels store each label value in a ranked set (default is
False)
~~~~

# example usage of napkin

## Generate all analysis for a given text

A sample file "The Prince, by Nicoló Machiavelli" is included to test napkin.

`python3 ./bin/napkin.py -o readable -f samples/the-prince.txt -t 4`

Example output:

~~~~
╒══════════════════╕
│ Top 4 of verb │
╞══════════════════╡
│ 207 occurences │
├──────────────────┤
│ will │
├──────────────────┤
│ 137 occurences │
├──────────────────┤
│ can │
├──────────────────┤
│ 116 occurences │
├──────────────────┤
│ make │
├──────────────────┤
│ 106 occurences │
├──────────────────┤
│ may │
├──────────────────┤
│ 102 occurences │
├──────────────────┤
│ would │
╘══════════════════╛
╒══════════════════╕
│ Top 4 of noun │
╞══════════════════╡
│ 206 occurences │
├──────────────────┤
│ prince │
├──────────────────┤
│ 120 occurences │
├──────────────────┤
│ man │
├──────────────────┤
│ 108 occurences │
├──────────────────┤
│ state │
├──────────────────┤
│ 90 occurences │
├──────────────────┤
│ people │
├──────────────────┤
│ one │
╘══════════════════╛
╒═════════════════════╕
│ Top 4 of hashtag │
╞═════════════════════╡
╘═════════════════════╛
╒═════════════════════╕
│ Top 4 of mention │
╞═════════════════════╡
╘═════════════════════╛
╒═══════════════════╕
│ Top 4 of digit │
╞═══════════════════╡
│ 1 occurences │
├───────────────────┤
│ 99775 │
├───────────────────┤
│ 84116 │
├───────────────────┤
│ 750175 │
├───────────────────┤
│ 6221541 │
├───────────────────┤
│ 57037 │
╘═══════════════════╛
╒═════════════════════════════════════════╕
│ Top 4 of url │
╞═════════════════════════════════════════╡
│ 5 occurences │
├─────────────────────────────────────────┤
│ www.gutenberg.org │
├─────────────────────────────────────────┤
│ 2 occurences │
├─────────────────────────────────────────┤
│ www.gutenberg.org/donate │
├─────────────────────────────────────────┤
│ 1 occurences │
├─────────────────────────────────────────┤
│ www.gutenberg.org/license │
├─────────────────────────────────────────┤
│ www.gutenberg.org/contact │
├─────────────────────────────────────────┤
│ http://www.gutenberg.org/5/7/0/3/57037/ │
╘═════════════════════════════════════════╛
╒═════════════════╕
│ Top 4 of oov │
╞═════════════════╡
│ 9 occurences │
├─────────────────┤
│ Sforza │
├─────────────────┤
│ 7 occurences │
├─────────────────┤
│ Fermo │
├─────────────────┤
│ 6 occurences │
├─────────────────┤
│ Vitelli │
├─────────────────┤
│ Pertinax │
├─────────────────┤
│ Orsinis │
╘═════════════════╛
╒════════════════════╕
│ Top 4 of labels │
╞════════════════════╡
│ 339 occurences │
├────────────────────┤
│ PERSON │
├────────────────────┤
│ 305 occurences │
├────────────────────┤
│ GPE │
├────────────────────┤
│ 197 occurences │
├────────────────────┤
│ CARDINAL │
├────────────────────┤
│ 189 occurences │
├────────────────────┤
│ ORG │
├────────────────────┤
│ 131 occurences │
├────────────────────┤
│ NORP │
╘════════════════════╛
╒═══════════════════╕
│ Top 4 of punct │
╞═══════════════════╡
│ 3440 occurences │
├───────────────────┤
├───────────────────┤
│ 144 occurences │
├───────────────────┤
├───────────────────┤
│ 32 occurences │
├───────────────────┤
├───────────────────┤
│ 26 occurences │
├───────────────────┤
├───────────────────┤
│ 11 occurences │
├───────────────────┤
│ 1.F.3 │
╘═══════════════════╛
╒═══════════════════╕
│ Top 4 of email │
╞═══════════════════╡
│ 1 occurences │
├───────────────────┤
│ [email protected] │
╘═══════════════════╛
~~~~

## Extract the sentences associated to a specific token

`python3 ./bin/napkin.py -o readable -f samples/the-prince.txt -t 4 --token-span "Vitelli"`

~~~~
╒═════════════════════════════════════════════════════════════════════════╕
│ Top 4 of span for Vitelli │
╞═════════════════════════════════════════════════════════════════════════╡
│ 1 occurences │
├─────────────────────────────────────────────────────────────────────────┤
│ This duke entered │
│ Romagna with auxiliary troops, leading forces composed entirely of │
│ French soldiers, and with these he took Imola and Forli; but as they │
│ seemed unsafe, he had recourse to mercenaries, and hired the Orsini and │
│ Vitelli; afterwards finding these uncertain to handle, unfaithful and │
│ dangerous, he suppressed them, and relied upon his own men. │
├─────────────────────────────────────────────────────────────────────────┤
│ The Florentines appointed Paolo Vitelli their captain, │
│ a man of great prudence, who had risen from a private station to the │
│ highest reputation. │
├─────────────────────────────────────────────────────────────────────────┤
│ Nevertheless, Messer Niccolo Vitelli has been seen in │
│ our own time to destroy two fortresses in Città di Castello in order │
│ to keep that state. │
├─────────────────────────────────────────────────────────────────────────┤
│ And the │
│ difference between these forces can be easily seen if one considers │
│ the difference between the reputation of the duke when he had only the │
│ French, when he had the Orsini and Vitelli, and when he had to rely │
│ on himself and his own soldiers. │
├─────────────────────────────────────────────────────────────────────────┤
│ And that his foundations were │
│ good is seen from the fact that the Romagna waited for him more than a │
│ month; in Rome, although half dead, he remained secure, and although │
│ the Baglioni, Vitelli, and Orsini entered Rome they found no followers │
│ against him. │
╘═════════════════════════════════════════════════════════════════════════╛
~~~~

## Specify table format

In `readable` output, the format can be set to any of the tabulate format supported. If you want the top 10 of out-of-vocabulary
words from the text in GitHub markdown format.

`python3 ./bin/napkin.py -o readable -f samples/the-prince.txt -t 10 --analysis oov --table-format github`

| Top 10 of oov |
|------------------|
| 9 occurences |
| Sforza |
| 7 occurences |
| Fermo |
| 6 occurences |
| Vitelli |
| Pertinax |
| Orsinis |
| Colonnas |
| Bentivogli |
| Agathocles |
| 5 occurences |
| Oliverotto |
| Cæsar |
| Commodus |

# overview of processing in napkin

![overview of processing in napkin](https://raw.githubusercontent.com/adulau/napkin-text-analysis/master/doc/napkin.png)

# what about the name?

The name 'napkin' came after a first sketch of the idea on a napkin. The goal was also to provide a simple text analysis tool which can be run on the corner of table in a kitchen.

# LICENSE

napkin is free software under the AGPLv3 license.

~~~~
Copyright (C) 2020 Alexandre Dulaunoy
Copyright (C) 2020 Pauline Bourmeau
~~~~