https://github.com/tikquuss/eulascript
Machine learning (ML) solution that review end-user license agreements (EULA) for terms and conditions that are unacceptable to the government
https://github.com/tikquuss/eulascript
albert bert distilbert eula huggingface ktrain pandas roberta xlnet
Last synced: 3 months ago
JSON representation
Machine learning (ML) solution that review end-user license agreements (EULA) for terms and conditions that are unacceptable to the government
- Host: GitHub
- URL: https://github.com/tikquuss/eulascript
- Owner: Tikquuss
- Created: 2020-08-19T10:36:07.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2021-11-14T01:59:24.000Z (over 3 years ago)
- Last Synced: 2025-01-18T13:41:15.796Z (4 months ago)
- Topics: albert, bert, distilbert, eula, huggingface, ktrain, pandas, roberta, xlnet
- Language: Python
- Homepage: https://eulapp.herokuapp.com/
- Size: 10.4 MB
- Stars: 2
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# 1 - Cloning the repository
```
git clone https://github.com/Tikquuss/eulascript
```# 2 - Installing the dependencies
* [PyPDF2](https://pypi.org/project/PyPDF2/) and [PyMuPDF](https://pypi.org/project/PyMuPDF/): for reading pdf files
* [python-docx](https://pypi.org/project/python-docx/) : for reading docx files
* [wget](https://pypi.org/project/wget/) : for model downloading
* [pandas](https://pandas.pydata.org/) : to write the result in csv files
* [validators](https://pypi.org/project/validators/) : to check the validity of the urls
* [ktrain](https://github.com/Tikquuss/ktrain) : for loading models. It is a duplication of [amaiya/ktrain](https://github.com/amaiya/ktrain) modified to install `tensorflow-cpu` (instead of `tensorflow-2.1.0-cp36-cp36m-manylinux2010_x86_64.whl`) and `tqdm>=4.29.1`.
```
pip install -r eulascript/requirements.txt
```# 3 - Try
* **model_folder** : directory (or url of the directory) where the model is located (must contain the following three files: `tf_model.preproc`, `config.json` and `tf_model.h5`). In the case of a url the three previous files are downloaded automatically. You can use the pre-trained models directly from [huggingface](https://huggingface.co/transformers/), but [this notebook](samples/public_transformers_in_ktrain.ipynb) illustrates how to fine-tune these models (**bert, distilbert, albert, roberta, xlnet**) on our [dataset](https://drive.google.com/file/d/1eyGBYLpOPsvif0iomTBxjHtXoiY8gnLE/view?usp=sharing) with the [ktrain](https://pypi.org/project/ktrain/) library.
* **output_dir** : folder in which the csv file(s) containing the results (in the format: `clause, label, probability`) will be stored (the name of the created file starts with the name, without extension, of the original file containing the license, followed optionally by a number to avoid file collisions)
* **path_to_eula** : comma-separated list of documents (`txt, md, pdf and docx`) containing the licenses to be analyzed
* **logistic_regression** : this parameter can be provided at the expense of **model_folder** in order to use one of the [pre-trained logistic regression models](production.pth) (must be obligatorily made from these three models: **bag_of_word, tf_idf, bert or distilbert**). This parameter is ignored if it is passed at the same time as **model_folder**. This [notebook](samples/logistic_regression.ipynb) illustrates the process of obtaining the [production.pth](production.pth) file.```
model_folder=my/model_dir_or_url
output_dir=my/output_folder
path_to_eula=my/eula.txt,my/eula.md,my/eula.pdf,my/eula.docxpython eulascript/eula.py --model_folder $model_folder --path_to_eula $path_to_eula --output_dir $output_dir
``````
logistic_regression=bag_of_word
output_dir=my/output_folder
path_to_eula=my/eula.txt,my/eula.md,my/eula.pdf,my/eula.docxpython eulascript/eula.py --logistic_regression $logistic_regression --path_to_eula $path_to_eula --output_dir $output_dir
```**Note**:
* the [samples](samples) folder contains some user licenses and a [notebook](samples/notebook.ipynb) illustrating all.
* The associated web application is available [here](https://eulapp.herokuapp.com/).