Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
The official code to reproduce results from the NAACL 2019 paper: White-to-Black: Efficient Distillation of Black-Box Adversarial Attacks
https://github.com/orgoro/white-2-black
adversarial-attacks adversarial-networks nlp toxic-comment-classification toxicity
Last synced: 28 days ago
JSON representation
The official code to reproduce results from the NAACL 2019 paper: White-to-Black: Efficient Distillation of Black-Box Adversarial Attacks
- Host: GitHub
- URL: https://github.com/orgoro/white-2-black
- Owner: orgoro
- Created: 2018-06-18T18:16:34.000Z (over 6 years ago)
- Default Branch: orphan
- Last Pushed: 2019-06-04T13:08:36.000Z (over 5 years ago)
- Last Synced: 2024-10-03T16:27:08.472Z (about 2 months ago)
- Topics: adversarial-attacks, adversarial-networks, nlp, toxic-comment-classification, toxicity
- Language: Python
- Homepage: https://naacl2019.org/program/accepted/
- Size: 17.6 MB
- Stars: 12
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
[![Build Status](https://travis-ci.com/orgoro/white-2-black.svg?branch=orphan)](https://travis-ci.com/orgoro/white-2-black)
# white2black
## INTRODUCTION
The official code to reproduce results in the NAACL 2019 paper:
*White-to-Black: Efficient Distillation of Black-Box Adversarial Attacks*

The code is divided into sub-packages:
##### 1. [./Agents](./toxic_fool/agents) - _learned adversarial attack generators_
##### 2. [./Attacks](./toxic_fool/attacks) - _optimization-based attacks such as HotFlip_
##### 3. [./Toxicity Classifier](./toxic_fool/toxicity_classifier) - _a classifier labeling sentences as toxic or non-toxic_
##### 4. [./Data](./toxic_fool/data) - _data handling_
##### 5. [./Resources](./toxic_fool/resources) - _resources for the other categories_

## ALGORITHM
As shown in the figure below, we train a classifier to label sentences as toxic or non-toxic.
We attack this model using a white-box algorithm called HotFlip and distill its knowledge into a second model, `DistFlip`.
`DistFlip` is able to generate attacks in a black-box manner.
These attacks generalize well to the [Google Perspective](https://www.perspectiveapi.com/) algorithm (tested Jan 2019).
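A minimal PyTorch sketch of the two steps is below, assuming a character-level classifier over one-hot inputs. All names here (`hot_flip_step`, `distill_step`, `char_embeds`) are illustrative placeholders, not the repository's actual API:

```python
import torch
import torch.nn.functional as F

def hot_flip_step(model, char_embeds, onehot, label):
    """One white-box HotFlip step: pick the single character substitution
    whose first-order estimate of the loss increase is largest."""
    onehot = onehot.clone().requires_grad_(True)        # (seq_len, vocab)
    logits = model(onehot @ char_embeds)                # embed chars, classify
    loss = F.cross_entropy(logits.unsqueeze(0), label)
    (grad,) = torch.autograd.grad(loss, onehot)         # d(loss) / d(one-hot)
    # Estimated loss change of placing char j at position i, relative to the
    # character currently there:
    delta = grad - (grad * onehot).sum(dim=1, keepdim=True)
    return divmod(delta.argmax().item(), delta.size(1))  # (position, new char)

def distill_step(student, optimizer, onehot, flip_target):
    """Distillation: train the student ("DistFlip") to predict, from the raw
    sentence alone, the flip the white-box attacker would choose, so that at
    test time no gradients of the attacked model are needed."""
    logits = student(onehot)                            # scores over all flips
    loss = F.cross_entropy(logits.unsqueeze(0), flip_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Harvesting `(sentence, flip)` pairs with `hot_flip_step` over a corpus and fitting the student on them is what makes the distilled attacker usable in a black-box setting.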
![algorithm](/doc/algorithm.png)
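For the transfer test against Perspective, a minimal sketch of scoring a sentence through the public Perspective API is shown below. The endpoint and response shape follow the public documentation and may have changed since the Jan 2019 evaluation; `PERSPECTIVE_API_KEY` is a placeholder for your own key:

```python
import os
import requests

API_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity_score(text: str) -> float:
    """Return Perspective's TOXICITY summary score (0.0 to 1.0) for text."""
    resp = requests.post(
        API_URL,
        params={"key": os.environ["PERSPECTIVE_API_KEY"]},
        json={
            "comment": {"text": text},
            "requestedAttributes": {"TOXICITY": {}},
        },
    )
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# A successful attack pushes a toxic sentence's score below the decision
# threshold while keeping the text readable.
```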
## DATA

We used the data from the [Kaggle challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)
by Jigsaw. For data flipped using HotFlip you can download
the [data from Google Drive](https://drive.google.com/file/d/15zSclVYjFYtM1YXUxZbFUpmWS1MgHTx3/view?usp=sharing)
and unzip it into: `./toxic_fool/resources/data`
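A minimal sketch of loading the Jigsaw data once it has been unzipped into the resources directory; the column names follow the Kaggle challenge's `train.csv`, but the exact file layout inside the archive may differ:

```python
import pandas as pd

DATA_DIR = "./toxic_fool/resources/data"

df = pd.read_csv(f"{DATA_DIR}/train.csv")
# The challenge labels six toxicity subtypes; for the binary task a comment
# counts as toxic if any subtype flag is set.
label_cols = ["toxic", "severe_toxic", "obscene",
              "threat", "insult", "identity_hate"]
df["is_toxic"] = df[label_cols].max(axis=1)
print(df[["comment_text", "is_toxic"]].head())
```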
## RESULTS

The number of flips needed to change the label of a sentence using the original white-box algorithm and ours (green):
![survival rate](doc/survival_rate.png)

Some example sentences:
![examples](doc/examples.png)