https://github.com/nikitaeverywhere/edu-text-analysis-experiments
Statistical text analysis and semantic networks with Python
https://github.com/nikitaeverywhere/edu-text-analysis-experiments
analysis gephi semantic-networks sigma sigma-analysis text-analysis text-analyzer text-corpus tf-idf
Last synced: 13 days ago
JSON representation
Statistical text analysis and semantic networks with Python
- Host: GitHub
- URL: https://github.com/nikitaeverywhere/edu-text-analysis-experiments
- Owner: nikitaeverywhere
- License: mit
- Created: 2017-10-03T12:34:38.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-11-30T19:01:26.000Z (over 7 years ago)
- Last Synced: 2025-03-31T03:12:41.203Z (about 2 months ago)
- Topics: analysis, gephi, semantic-networks, sigma, sigma-analysis, text-analysis, text-analyzer, text-corpus, tf-idf
- Language: Python
- Homepage:
- Size: 30.3 MB
- Stars: 14
- Watchers: 1
- Forks: 4
- Open Issues: 0
-
Metadata Files:
- Readme: readme.md
- License: license
Awesome Lists containing this project
README
# TF-IDF, Sigma and Other (Experimental) Texts Analysing Tools
[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) and Sigma analysis written in Python, which
outputs results to the convenient *.xlsx spreadsheets for detailed analysis.TF-IDF analysis allows to detect the most "important" words in the given text of some text corpus
(set of articles, etc). These "important" words are those which occur in the particular document
more than in any other document of the same text corpus.While TF-IDF analysis is useful for a set of articles, Sigma analysis is useful to analyze the most
"important" words in a single, usually large text (books, documents, etc).There are a couple of more advanced scripts:
+ Matrix output for Gephi in `gephi.py`. Sample output file is `gephi.csv` in this repository.
+ Horizontal visibility graph building with `hor-vis-graph.py`. A couple of sample files are included in `hor-vis-graph/` directory.
+ Other experiments (see below)Preview / Examples
------------------Experimental semantic network builder (main concepts from [this article](http://news.bbc.co.uk/2/hi/technology/4276125.stm)):

TF-IDF applied to some news articles text corpus:

Sigma method applied to the book "The Hunger Games":

Analysis of [article about Putin](http://news.bbc.co.uk/2/hi/business/4120339.stm) with horizontal visibility graph and other articles text corpus:

Usage
-----1. Install Python 3, clone the repository, enter repository directory with `cd edu-tf-idf`.
2. Install required dependencies: `pip3 install -r requirements.txt`.
3. Place texts to analyze in `/texts` directory (there are a couple already).
4. Run the analyzer with `py tf-idf.py` command (there are many!).Example
-------##### TF-IDF: Run the program (by default, picks texts from `texts/news`):
```bash
py tf-idf.py
```Result:
```text
Reading texts...
Done! Computing TF-IDF ranks...
Progressing text 2225/2225
Done! Writing results...
Writing worksheet 2225/2225
Done!
```Output goes to `tf-idf.xlsx` file ready for analysis.
##### Sigma method (by default, picks texts from `texts/books`):
```bash
py sigma.py
```Result goes to `sigma.xlsx` file.
##### Horizontal Visibility Graph (exports to `hor-vis-graph/` directory, picks from `texts/news`):
```bash
py hor-vis-graph.py
```Check the result in `hor-vis-graph/` directory, visualize it using [Gephi](https://gephi.org/).
##### Individual Text Analysis
Run experimental semantic network builder with
```bash
py analyze_text.py texts/news/tech/001.txt
```Check the result in `analyzed/` directory, visualize it using [Gephi](https://gephi.org/).
License
-------[MIT](license) © [Nikita Savchenko](https://nikita.tk)