https://github.com/mullerpeter/authorstyle

Python package to deal with PAN corpora and extract stylometric features from text documents.

author-attribution intrinsic-plagiarism-detection nlp pan python stylometric-features stylometry


Host: GitHub
URL: https://github.com/mullerpeter/authorstyle
Owner: mullerpeter
License: mit
Created: 2020-03-25T19:32:47.000Z (about 6 years ago)
Default Branch: master
Last Pushed: 2022-11-11T18:50:33.000Z (over 3 years ago)
Last Synced: 2025-03-26T07:42:58.354Z (about 1 year ago)
Topics: author-attribution, intrinsic-plagiarism-detection, nlp, pan, python, stylometric-features, stylometry
Language: Python
Size: 1.21 MB
Stars: 16
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# authorstyle

[![PyPI version](https://badge.fury.io/py/authorstyle.svg)](https://badge.fury.io/py/authorstyle)

Python package to deal with PAN corpora and extract stylometric features from text documents.

### Installation

Clone the repo, then install the _authorstyle_ framework and its required libraries:

```
git clone git@github.com:mullerpeter/authorstyle.git
cd authorstyle
pip install .
```

Alternatively, install the package directly with pip:

```
pip install authorstyle
```
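
To confirm that either installation route worked, a minimal check along the following lines may help. It uses only the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of _authorstyle_.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Hypothetical helper, used here to check the distribution installed above
version = installed_version("authorstyle")
print("authorstyle version:", version if version else "not installed")
```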

### Example

```python
from authorstyle import Corpus, average_word_length
from sklearn import metrics

# Load the validation set and keep only multi-author problems
# (i.e. drop the single-author class)
validation_data = Corpus(path='data/pan19-style-change-detection/validation')
validation_data.problems = [
    problem for problem in validation_data.problems
    if problem.truth['authors'] > 1
]
print('Validation set loaded')

# Extract a feature and make a prediction for each problem in the validation set
y_true = []
y_pred = []
for problem in validation_data.problems:
    feature = average_word_length(problem.text)

    # Demo prediction only: maps the feature to a value in 0..4,
    # not a meaningful estimate of the author count
    num_predicted = int(feature) % 5

    y_true.append(problem.truth['authors'])
    y_pred.append(num_predicted)

# Print validation scores
confusion_matrix = metrics.confusion_matrix(y_true, y_pred)
val_accuracy = metrics.accuracy_score(y_true, y_pred)
print('Confusion matrix:')
print(confusion_matrix)
print('Validation Accuracy:', val_accuracy)
```
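
For readers without the PAN data at hand, the kind of feature used in the example above can be illustrated with a standalone sketch. This is not the package's implementation of `average_word_length`: the function name `average_word_length_sketch` and the regex-based tokenization are assumptions made for illustration, and the library may tokenize differently.

```python
import re


def average_word_length_sketch(text: str) -> float:
    """Mean number of characters per word; 0.0 for text without words."""
    # Assumption: a word is a run of letters, digits, or apostrophes
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


sample = "The quick brown fox jumps over the lazy dog."
feature = average_word_length_sketch(sample)

print(round(feature, 2))  # 35 characters over 9 words -> 3.89
print(int(feature) % 5)   # same demo mapping as the example above -> 3
```

The final line applies the same toy mapping as the example, which shows why it is only a placeholder: the result depends on the integer part of the feature, not on any learned relationship to the number of authors.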
