Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/simeonhristov99/ati
Ati is a web-based application for predicting which famous classic Bulgarian novelist wrote a piece of text (short or long).
- Host: GitHub
- URL: https://github.com/simeonhristov99/ati
- Owner: SimeonHristov99
- License: mit
- Created: 2023-01-25T15:53:11.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-02-07T12:09:53.000Z (almost 2 years ago)
- Last Synced: 2023-03-06T03:55:26.644Z (almost 2 years ago)
- Topics: authorship-attribution, embeddings, jupyter-notebook, multiclass-classification, nlp, optuna, pycaret, python3, scraping-websites, spacy, transformer
- Language: Jupyter Notebook
- Homepage:
- Size: 6.8 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# ATI: multiclass single-label Authorship aTtrIbution
Ati is a web-based application for predicting which famous classic Bulgarian novelist wrote a piece of text (short or long).
## Formal Aim
Given a corpus of documents `D`, each written by exactly one author `y` from a set of authors `Y`, identify the author of an anonymous text `x`.
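Under stated assumptions (scikit-learn is available; the corpus, texts, and author labels below are illustrative toys, not the project's real data), the formal setup might be sketched as:

```python
# Minimal authorship-attribution sketch: fit a classifier on a labelled
# corpus D and predict the author of an anonymous text x.
# The toy corpus and author names are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

D = [
    "the mountain stood silent over the village",
    "silent mountain village in the morning mist",
    "the merchant counted his coins with a sly grin",
    "a sly grin crossed the merchant face at the market",
]
y = ["author_a", "author_a", "author_b", "author_b"]

# tf-idf turns each document into a weighted bag-of-words vector;
# logistic regression learns one weight vector per author.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(D, y)

x = "mist settled over the silent mountain"
print(model.predict([x])[0])
```

Unseen words in `x` are simply ignored by the fitted vectorizer; only vocabulary observed in `D` contributes to the prediction.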
![big_picture](./assets/big_picture.png)
## Plan of attack
### Sprint 01
- [X] Create proper GitHub repository.
- [X] Upload structure for documentation.
- [X] Scrape [chitanka.info](https://chitanka.info/).
- [X] Use dataclasses to create dictionaries.
- [X] Combine in a big dataset.
- [X] Filter only texts used for modelling.

### Sprint 02
- [X] Download texts.
- [X] Combine them in a dataframe.
- [X] Simple preprocessing:
- [X] Remove suffix from each text (holds metainformation).
- [X] Remove prefix from each text (holds metainformation).
- [X] Organize notebooks: split the first one into two.
- [X] Preprocessing: Lemmatize, Stem, Bigrams, Trigrams, Fourgrams.
- [X] Word embedding: tf-idf.
- [X] Perform EDA.
- [X] Create samples (parts of random texts) instead of using the whole texts as such.
- [X] Modelling:
- [X] Train/val/test split.
- [X] PCA so as to not overfit.
- [X] Evaluating on val split:
- [X] f1.
- [X] mcc.
- [X] Log loss.
- [X] Evaluating on test split. This should be done only once!
- [X] f1.
- [X] mcc.
- [X] Log loss.
- [X] Pickle model.
- [X] Create the user interface using [streamlit](https://streamlit.io/).

### Sprint 03
- [X] Explore `Catboost`.
- [X] Try to create a pipeline object.
- [X] PyCaret with PCA.
- [X] PyCaret without PCA. Results seem to be (almost) the same.
- [X] Implement as many metrics as possible:
- [X] Install the bulgarian-nlp POS tagger. Could not make it work; used the [classla](https://pypi.org/project/classla/) Python package instead.
- [X] Character-based lexical features;
- [X] Sentence- and word-based features;
- [X] Function / Stop words;
- [X] Flesch Reading Ease Score.
- [X] Modelling.
- [X] Word cloud.

### Sprint 04
- [X] EDA on the text features.
- [X] Pipeline for the text features models.
- [X] Show text features in streamlit.
- [X] Try transformer embeddings via `sbert`.

### Future improvements
- [ ] Compare `LogisticRegression`, `KNeighborsClassifier`, `GaussianNB`, `MultinomialNB`, `DecisionTree`, `RandomForest`, `XGBoost`, and `combined/aggregated`.
- [ ] Remove multicollinearity.
- [ ] Better modelling.
- [ ] More EDA.

## Motivation / Use cases
1. **Authorship check**: Is the given text really written by a certain author?
2. **Plagiarism detection**: Finding similarities between two texts.
3. **Author profiling or introduction**: Extracting information about the age, education, gender, etc. of the author of a given text.
4. **Detecting stylistic inconsistencies** (as can happen in collaborative writing): Is there only one author?

## Set of authors Y and set of documents D
- **Ivan Vazov**: "Българският език", "Отечество любезно", "При Рилския манастир", "Елате ни вижте", "Линее нашто поколение", "Левски", "Паисий", "Кочо", "Опълченците на Шипка", "Дядо Йоцо гледа", "Чичовци", "Под игото";
- **Aleko Konstantinov**: "Разни хора, разни идеали", "Бай Ганьо";
- **Elin Pelin**: "Ветрената мелница", "Косачи", "Задушница", "Мечтатели", "На оня свят", "Андрешко", "Чорба от греховете на отец Никодим", "Занемелите камбани", "Гераците";
- **Jordan Jovkov**: "Песента на колелетата", "Последна радост", "Шибил", "През чумавото", "Индже", "Албена", "Другоселец", "Серафим";
- **Dimitar Dimov**: "Тютюн";
- **Dimitar Talev**: "Железният светилник".

## Metrics used
The goal is to include as many features as possible (the more the better, right?). They were taken from [here](https://ceur-ws.org/Vol-2936/paper-191.pdf).
- **Character-based lexical features**: The number of distinct special characters, spaces, punctuation marks, parentheses, and quotation marks as separate features.
- **Sentence- and word-based features**: Distribution of POS-tags, token length, number of sentences, sentence length, average word length, words in all-caps, and counts of words above and below 2-3 and 6 characters as separate features. For these statistics, a possible package to use is [spacy](https://spacy.io/).
- **Function / Stop words**: The frequency of each function word.
- Various types of **Reading Ease Scores**: indicate with a single number how difficult a passage is to understand.

## Approach with modelling
The goal is to try four types of models:
1. Using *tf-idf*, create a large matrix of word embeddings and use it to predict the author.
2. Use encodings from a transformer ([sbert](https://www.sbert.net/)) as the word embeddings.
3. Using only the metrics.
4. Combine the above approaches: for example, concatenate the encodings from 2 with the metrics from 3.

In parallel with the above, or after it, experiments should be run to determine the best type of classifier: **a single model** or **an ensemble**.
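Approach 4 can be sketched as follows. This is a minimal illustration, not the project's pipeline: the corpus is a toy, and the hand-crafted metrics are a stand-in for the fuller feature set listed above (character counts, POS distributions, reading ease, etc.). The sparse tf-idf matrix and the dense metric matrix are concatenated with `scipy.sparse.hstack`, and the evaluation uses the planned metrics (f1, MCC, log loss):

```python
# Sketch of approach 4: concatenate tf-idf features with simple
# hand-crafted metric features, then train one classifier.
# Texts, labels, and metrics are illustrative placeholders.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, log_loss, matthews_corrcoef

texts = [
    "the old mill turned slowly in the evening wind",
    "slow evening wind moved the old mill",
    "the mill creaked in the wind at dusk",
    "tobacco smoke filled the loud city tavern",
    "the tavern was loud and full of tobacco smoke",
    "smoke and noise filled the city tavern",
]
labels = ["author_a"] * 3 + ["author_b"] * 3

def metric_features(text):
    # Toy character/word-based metrics standing in for the real feature set:
    # space count, punctuation count, and average word length.
    words = text.split()
    return [
        text.count(" "),
        sum(text.count(c) for c in ".,;:!?"),
        sum(len(w) for w in words) / len(words),
    ]

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(texts)                       # sparse word features
X_metrics = csr_matrix(np.array([metric_features(t) for t in texts]))
X = hstack([X_tfidf, X_metrics])                           # concatenated features

clf = LogisticRegression(max_iter=1000).fit(X, labels)
pred = clf.predict(X)
proba = clf.predict_proba(X)

macro_f1 = f1_score(labels, pred, average="macro")
mcc = matthews_corrcoef(labels, pred)
ll = log_loss(labels, proba)
print(macro_f1, mcc, ll)
```

Here the metrics are computed on the training set purely to show the calls; in practice they would be computed on the val split during development and on the test split exactly once, as the plan above prescribes.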
## Resources
### Where can the texts be found?
- [chitanka.info](https://chitanka.info/)
### Scrapy YouTube tutorials
- [Best Web Scraping Combo?? Use These In Your Projects - John Watson Rooney](https://www.youtube.com/watch?v=HpRsfpPuUzE)
- [Intro To Web Crawlers & Scraping With Scrapy - Traversy Media](https://www.youtube.com/watch?v=ALizgnSFTwQ)

### POS Tagger for Bulgarian
- [classla](https://pypi.org/project/classla/)
- [bulgarian-nlp](https://github.com/AMontgomerie/bulgarian-nlp) <- Did not work for me.

### Papers on the topic
- [Multi-label Style Change Detection by Solving a Binary Classification Problem](https://ceur-ws.org/Vol-2936/paper-191.pdf)
- [Authorship attribution](https://link.springer.com/article/10.1007/BF01830689)
- [Automatic Authorship Attribution](http://portal.acm.org/citation.cfm?doid=977035.977057)
- [Quantitative Authorship Attribution: An Evaluation of Techniques](https://lirias.kuleuven.be/bitstream/123456789/331335/1/Grieve%20-%20authorship%20attribution.pdf)
- [A Survey of Modern Authorship Attribution Methods](https://onlinelibrary.wiley.com/doi/10.1002/asi.21001)
- [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
- [Distributed Representations of Sentences and Documents](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
- [CAG: Stylometric Authorship Attribution of Multi-Author Documents Using a Co-Authorship Graph](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8962080)
- [Style Change Detection on Real-World Data using an LSTM-powered Attribution Algorithm](https://ceur-ws.org/Vol-2936/paper-163.pdf)