Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
The official code to reproduce results from the NAACL 2019 paper: White-to-Black: Efficient Distillation of Black-Box Adversarial Attacks
https://github.com/orgoro/white-2-black
adversarial-attacks adversarial-networks nlp toxic-comment-classification toxicity
Last synced: 28 days ago
JSON representation
The official code to reproduce results from the NAACL 2019 paper: White-to-Black: Efficient Distillation of Black-Box Adversarial Attacks
- Host: GitHub
- URL: https://github.com/orgoro/white-2-black
- Owner: orgoro
- Created: 2018-06-18T18:16:34.000Z (over 6 years ago)
- Default Branch: orphan
- Last Pushed: 2019-06-04T13:08:36.000Z (over 5 years ago)
- Last Synced: 2024-10-03T16:27:08.472Z (about 2 months ago)
- Topics: adversarial-attacks, adversarial-networks, nlp, toxic-comment-classification, toxicity
- Language: Python
- Homepage: https://naacl2019.org/program/accepted/
- Size: 17.6 MB
- Stars: 12
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
[![Build Status](https://travis-ci.com/orgoro/white-2-black.svg?branch=orphan)](https://travis-ci.com/orgoro/white-2-black)
# white2black
## INTRODUCTION
The official code to reproduce results in the NAACL 2019 paper:
*White-to-Black: Efficient Distillation of Black-Box Adversarial Attacks*

The code is divided into sub-packages:
##### 1. [./Agents](./toxic_fool/agents) - _learned adversarial attack generators_
##### 2. [./Attacks](./toxic_fool/attacks) - _optimization-based attacks such as HotFlip_
##### 3. [./Toxicity Classifier](./toxic_fool/toxicity_classifier) - _a classifier labeling sentences as toxic or non-toxic_
##### 4. [./Data](./toxic_fool/data) - _data handling_
##### 5. [./Resources](./toxic_fool/resources) - _resources for the other categories_

## ALGORITHM
As shown in the figure below, we train a classifier to label sentences as toxic or non-toxic.
We attack this model using a white-box algorithm called HotFlip and distill its knowledge into a second model, `DistFlip`.
`DistFlip` is able to generate attacks in a black-box manner.
These attacks generalize well to the [Google Perspective](https://www.perspectiveapi.com/) algorithm (tested Jan 2019).
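A minimal PyTorch sketch of the two steps is below, assuming a character-level classifier over one-hot inputs. All names here (`hot_flip_step`, `distill_step`, `char_embeds`) are illustrative placeholders, not the repository's actual API:

```python
import torch
import torch.nn.functional as F

def hot_flip_step(model, char_embeds, onehot, label):
    """One white-box HotFlip step: pick the single character substitution
    whose first-order estimate of the loss increase is largest."""
    onehot = onehot.clone().requires_grad_(True)        # (seq_len, vocab)
    logits = model(onehot @ char_embeds)                # embed chars, classify
    loss = F.cross_entropy(logits.unsqueeze(0), label)
    (grad,) = torch.autograd.grad(loss, onehot)         # d(loss) / d(one-hot)
    # Estimated loss change of placing char j at position i, relative to the
    # character currently there:
    delta = grad - (grad * onehot).sum(dim=1, keepdim=True)
    return divmod(delta.argmax().item(), delta.size(1))  # (position, new char)

def distill_step(student, optimizer, onehot, flip_target):
    """Distillation: train the student ("DistFlip") to predict, from the raw
    sentence alone, the flip the white-box attacker would choose, so that at
    test time no gradients of the attacked model are needed."""
    logits = student(onehot)                            # scores over all flips
    loss = F.cross_entropy(logits.unsqueeze(0), flip_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Harvesting `(sentence, flip)` pairs with `hot_flip_step` over a corpus and fitting the student on them is what makes the distilled attacker usable in a black-box setting.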
![algorithm](/doc/algorithm.png)
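For the transfer test against Perspective, a minimal sketch of scoring a sentence through the public Perspective API is shown below. The endpoint and response shape follow the public documentation and may have changed since the Jan 2019 evaluation; `PERSPECTIVE_API_KEY` is a placeholder for your own key:

```python
import os
import requests

API_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity_score(text: str) -> float:
    """Return Perspective's TOXICITY summary score (0.0 to 1.0) for text."""
    resp = requests.post(
        API_URL,
        params={"key": os.environ["PERSPECTIVE_API_KEY"]},
        json={
            "comment": {"text": text},
            "requestedAttributes": {"TOXICITY": {}},
        },
    )
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# A successful attack pushes a toxic sentence's score below the decision
# threshold while keeping the text readable.
```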
## DATA

We used the data from the [Kaggle challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)
by Jigsaw. For data flipped using HotFlip you can download
the [data from Google Drive](https://drive.google.com/file/d/15zSclVYjFYtM1YXUxZbFUpmWS1MgHTx3/view?usp=sharing)
and unzip it into: `./toxic_fool/resources/data`
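A minimal sketch of loading the Jigsaw data once it has been unzipped into the resources directory; the column names follow the Kaggle challenge's `train.csv`, but the exact file layout inside the archive may differ:

```python
import pandas as pd

DATA_DIR = "./toxic_fool/resources/data"

df = pd.read_csv(f"{DATA_DIR}/train.csv")
# The challenge labels six toxicity subtypes; for the binary task a comment
# counts as toxic if any subtype flag is set.
label_cols = ["toxic", "severe_toxic", "obscene",
              "threat", "insult", "identity_hate"]
df["is_toxic"] = df[label_cols].max(axis=1)
print(df[["comment_text", "is_toxic"]].head())
```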
## RESULTS

The number of flips needed to change the label of a sentence using the original white-box algorithm and ours (green):
![survival rate](doc/survival_rate.png)

Some example sentences:
![examples](doc/examples.png)