https://github.com/mullerpeter/authorstyle
Python package to deal with PAN corpora and extract stylometric features from text documents.
- Host: GitHub
- URL: https://github.com/mullerpeter/authorstyle
- Owner: mullerpeter
- License: mit
- Created: 2020-03-25T19:32:47.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2022-11-11T18:50:33.000Z (over 3 years ago)
- Last Synced: 2025-03-26T07:42:58.354Z (about 1 year ago)
- Topics: author-attribution, intrinsic-plagiarism-detection, nlp, pan, python, stylometric-features, stylometry
- Language: Python
- Size: 1.21 MB
- Stars: 16
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# authorstyle
[PyPI version](https://badge.fury.io/py/authorstyle)
Python package to deal with PAN corpora and extract stylometric features from text documents.
### Installation
Clone the repo, then install the _authorstyle_ framework and its required libraries:
```
git clone git@github.com:mullerpeter/authorstyle.git
cd authorstyle
pip install .
```
Alternatively, install the package directly with pip:
```
pip install authorstyle
```
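To confirm that the installation worked, a quick check along these lines may help. It is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical, and the import name `authorstyle` is taken from the example below.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be located."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Import name assumed to match the one used in the README example.
    print("authorstyle installed:", is_installed("authorstyle"))
```

If this prints `False`, the package was probably installed into a different Python environment than the one running the script.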
### Example
```python
from authorstyle import Corpus, average_word_length
from sklearn import metrics
# Load the validation set and drop single-author problems (class 1)
validation_data = Corpus(path='data/pan19-style-change-detection/validation')
validation_data.problems = [problem for problem in validation_data.problems if problem.truth['authors'] > 1]
print('Validation set loaded')
# Perform feature extraction for each sample in the validation set
true = []
pred = []
for problem in validation_data.problems:
    feature = average_word_length(problem.text)
    # Demo prediction method (not really smart)
    num_predicted = int(feature) % 5
    true.append(problem.truth['authors'])
    pred.append(num_predicted)
# Print validation scores
confusion_matrix = metrics.confusion_matrix(true, pred)
val_accuracy = metrics.accuracy_score(true, pred)
print('Confusion matrix:\n', confusion_matrix)
print('Validation Accuracy:', val_accuracy)
```
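The example above can be tried without a PAN corpus. The following is a rough, self-contained sketch of the same pipeline on made-up toy data. It is not the package's implementation: `average_word_length_sketch`, `accuracy`, and the `\w+` tokenization are assumptions for illustration, and the packaged `average_word_length` may split words differently.

```python
import re


def average_word_length_sketch(text: str) -> float:
    """Mean number of characters per word; 0.0 if the text has no words.

    Assumption: a "word" is a maximal run of letters, digits, or underscores.
    """
    words = re.findall(r"\w+", text)
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


def accuracy(true_labels, predicted_labels) -> float:
    """Fraction of positions where the prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        return 0.0
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


# Toy (text, number of authors) pairs, invented for this sketch.
toy_problems = [
    ("It is so.", 2),
    ("Stylometry quantifies writing habits.", 3),
    ("A cat sat.", 4),
]

true = []
pred = []
for text, num_authors in toy_problems:
    feature = average_word_length_sketch(text)
    # Same deliberately naive demo rule as in the example above.
    pred.append(int(feature) % 5)
    true.append(num_authors)

print("Toy accuracy:", accuracy(true, pred))
```

The `% 5` rule only maps the feature onto a small range of integers so that the evaluation code has something to score; a real detector would train a classifier on several stylometric features.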