https://github.com/mc-cat-tty/language-classification

Suite of Python modules to recognise the language of a file

Host: GitHub
URL: https://github.com/mc-cat-tty/language-classification
Owner: mc-cat-tty
License: mit
Created: 2019-12-25T09:24:53.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2022-07-27T11:00:18.000Z (almost 3 years ago)
Last Synced: 2025-02-08T05:43:50.637Z (5 months ago)
Topics: csv, files, flask, frequency-table, itis-fermi-modena, language, language-analyzer, language-classification, language-classifier, language-recognition, python, python3, twitter
Language: Python
Size: 12.9 MB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE

# File Language Analyzer
> File Language Analyzer is a suite of Python modules that provides objects, constants and functions to recognise the language of a file, analyze its information and process (elaborate and create) .csv letter frequency tables.

Keep in mind that this project is poorly programmed; however, the logic behind the adopted method is interesting.

## Table of Contents
* [Project Status](#project-status)
* [Features](#features)
* [Math behind it](#math-behind-it)
* [Technologies](#technologies)
* [Requirements](#requirements)
* [Launch](#launch)
* [Usage](#usage)

## Project Status

![License](https://img.shields.io/badge/license-MIT-brightgreen) ![build](https://img.shields.io/badge/build-passed-brightgreen) ![Version](https://img.shields.io/badge/version-1.0.0-blue)

## Features

- Recognise the language of a file
- Convert .csv frequency table to Python dictionary
- Convert Python dictionary to .csv frequency table
- Generate frequency table starting from a set of Twitter messages
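
The two conversion features can be sketched roughly as follows. This is not the project's own code: the function names `csv_to_table` and `table_to_csv`, the `letter` header and the one-column-per-language layout are assumptions made for illustration, and the numbers in `sample` are placeholder values rather than entries from the real table.

```python
import csv
import io


def csv_to_table(csv_text):
    """Parse a frequency table into {language: {letter: frequency}}.

    Assumed layout: a header row "letter,<language>,<language>,..."
    followed by one row per letter.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    languages = [name for name in reader.fieldnames if name != "letter"]
    table = {language: {} for language in languages}
    for row in reader:
        for language in languages:
            table[language][row["letter"]] = float(row[language])
    return table


def table_to_csv(table):
    """Serialise {language: {letter: frequency}} back to CSV text."""
    languages = list(table)
    letters = sorted({letter for freqs in table.values() for letter in freqs})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["letter", *languages])
    for letter in letters:
        # Letters missing for a language are written as 0.0.
        writer.writerow([letter, *(table[lang].get(letter, 0.0) for lang in languages)])
    return buffer.getvalue()


# Placeholder values, for illustration only.
sample = "letter,english,italian\na,0.082,0.117\nb,0.015,0.009\n"
table = csv_to_table(sample)
round_trip = table_to_csv(table)
```

Keeping one inner dictionary per language means a whole column of the table can be handed to a distance function in a single lookup.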

## Math behind it

By analyzing the frequency of every single letter, it is possible to detect the language of a given text.

Once the characters' frequencies have been extracted, this information can be used as a representation of the text.

To find out the text's language, we have to determine which column of the frequency table has the values nearest to the text's own frequencies.

To accomplish that, the Pythagorean theorem can be extended to 26 dimensions, one for each letter of the Latin alphabet; the result is the Euclidean distance between two frequency vectors.
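
Written out, the distance described above takes the following form. The symbols are introduced here for illustration: $t_i$ is the frequency of the $i$-th letter in the given text and $l_i$ is the frequency of the same letter in one language column of the table.

```math
d(t, l) = \sqrt{\sum_{i=1}^{26} (t_i - l_i)^2}
```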

By computing the distance between the given text and each language in the table, it is possible to determine which language is the nearest.
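
A minimal sketch of the method described in this section, assuming frequencies are stored as fractions of the total letter count. The names `letter_frequencies`, `distance` and `nearest_language` are hypothetical and do not come from the project's modules, and the reference table here is a toy one built from two short sentences rather than the project's .csv table.

```python
import math
import string
from collections import Counter


def letter_frequencies(text):
    """Return the relative frequency of each of the 26 Latin letters."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters) or 1  # avoid dividing by zero on letterless input
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}


def distance(freqs_a, freqs_b):
    """Euclidean distance between two frequency dictionaries."""
    return math.sqrt(
        sum(
            (freqs_a.get(letter, 0.0) - freqs_b.get(letter, 0.0)) ** 2
            for letter in string.ascii_lowercase
        )
    )


def nearest_language(text, table):
    """Return the language whose frequencies are closest to those of text."""
    text_freqs = letter_frequencies(text)
    return min(table, key=lambda language: distance(text_freqs, table[language]))


# Toy table for illustration; a real one needs far larger samples.
table = {
    "english": letter_frequencies("the quick brown fox jumps over the lazy dog"),
    "italian": letter_frequencies("ma la volpe col suo balzo ha raggiunto il quieto fido"),
}
guess = nearest_language("the quick brown fox jumps over the lazy dog", table)
```

Because the square root is monotonic, comparing squared distances would pick the same language; the root is kept here only to match the description above.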

## Technologies

- **_Python_** 3.x
- Python built-in libraries
- Twitter API wrapped by **_tweepy_** library
- **_wikipedia-api_** module
- **_Flask_**

## Requirements

Use one of the following commands (according to the configuration of your environment):

```sh
$ pip install -r requirements.txt
```
or

```sh
$ py -m pip install -r requirements.txt
```

## Launch

If you are in a Bash-like environment with Python installed, you can run the program directly by typing:

```sh
$ ./Main.py
```

Otherwise, depending on your Python interpreter installation and your OS:

```sh
$ python Main.py
```
or
```sh
$ py Main.py
```
After that, go to http://127.0.0.1:5000 or http://localhost:5000 and try out the web interface.

The default frequency table is `letters_frequency_twitter.csv`.

## Usage

If you want to use the functions in `tweetrain.py`, you have to insert your personal Twitter tokens.
Look at the first four uppercase variables and fill in the double quotes with the proper values.
