Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Safety Score for Pre-Trained Language Models
https://github.com/microsoft/safenlp
ai-safety fairness-ai nlp
Last synced: 13 days ago
- Host: GitHub
- URL: https://github.com/microsoft/safenlp
- Owner: microsoft
- License: other
- Created: 2022-07-02T00:17:54.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-10-18T20:57:47.000Z (over 1 year ago)
- Last Synced: 2025-01-31T16:24:34.953Z (14 days ago)
- Topics: ai-safety, fairness-ai, nlp
- Language: Python
- Homepage:
- Size: 1.2 MB
- Stars: 93
- Watchers: 6
- Forks: 7
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
README
# Safety Score for Pre-Trained Language Models
Paper: [An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models](https://trustnlpworkshop.github.io/papers/18.pdf) (ACL 2023, TrustNLP workshop)

This repository contains the code used to measure safety scores for pre-trained language models based on the [ToxiGen human-annotated dataset](https://github.com/microsoft/TOXIGEN) and the [ImplicitHate dataset](https://github.com/GT-SALT/implicit-hate).
## Evaluation Dataset
- We selected a subset of the ToxiGen and ImplicitHate datasets. The examples in the ImplicitHate subset are either implicit-hate or neutral, and we down-sampled the neutral examples to obtain an equal number of harmful and benign examples. ImplicitHate does not provide any information about the target of the hate for each sentence.
- The examples in the ToxiGen subset include only the sentences for which all annotators agreed on whether the sentence is harmful and more than two annotators agreed on the target group of the hate (this selection criterion is sketched in code below).
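As an illustration of the ToxiGen selection criterion above, here is a minimal sketch in plain Python. The function and its arguments (`annotator_labels`, `annotator_targets`) are hypothetical and do not reflect the actual schema used by this repository or the ToxiGen release.

```python
from collections import Counter

def keep_toxigen_example(annotator_labels, annotator_targets):
    """Sketch of the subset criterion described above.

    annotator_labels:  per-annotator harmful/benign labels (hypothetical field)
    annotator_targets: per-annotator target-group labels (hypothetical field)
    """
    # All annotators must agree on whether the sentence is harmful.
    unanimous_label = len(set(annotator_labels)) == 1
    # More than two annotators must agree on the target group of the hate.
    target, count = Counter(annotator_targets).most_common(1)[0]
    return unanimous_label and count > 2

# Example: three annotators agree the sentence is harmful and on its target group.
print(keep_toxigen_example(["harmful"] * 3, ["women"] * 3))  # True
```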
## Setup

There are a few specific dependencies to install before running the safety score calculator; you can install them with `pip install -r requirements.txt`.

## How to calculate safety score
Now you can run the following script:

```bash
# --data:   path to the evaluation dataset
# --output: local path to a directory for saving results
# --model:  pre-trained model name or local path
# --lmHead: type of language model head, i.e. causal (clm) or masked
# --force:  overwrite the output path if it already exists
python safety_score.py \
    --data data/toxiGen.json \
    --output results \
    --model gpt2 \
    --lmHead clm \
    --force
```
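To produce results for several models, the same command can be scripted. The following is a minimal, untested sketch that loops over a few Hugging Face model names using the flags documented above; the model list, the per-model output directories, and the `mlm` value for a masked head are assumptions, so check `safety_score.py` for the exact values it accepts.

```python
import subprocess

# Illustrative (model name, head type) pairs; "clm"/"mlm" values are assumptions.
models = [
    ("gpt2", "clm"),
    ("bert-base-uncased", "mlm"),
]

for name, head in models:
    # Invoke the safety score calculator once per model.
    subprocess.run(
        [
            "python", "safety_score.py",
            "--data", "data/toxiGen.json",
            "--output", f"results/{name}",
            "--model", name,
            "--lmHead", head,
            "--force",
        ],
        check=True,
    )
```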
Two files will be saved in the output path:
- `perplexities.json`, which contains the perplexity value for each sentence in the evaluation dataset
- `safety_scores.json`, which contains the statistically significant safety scores for each demographic

For example, the content of `safety_scores.json` after running the above script is:
```json
{"asian": 0.3694922836054574, "black": 0.36662849289967936, "chinese": 0.3731038121619839, "jewish": 0.40661968642101093, "latino": 0.22831884057971014, "lgbtq": 0.2701839434577746, "mental dis": 0.22755361686659398, "mexican": 0.23524720893141945, "middle-eastern": 0.2604830744365628, "muslim": 0.32320982365959877, "native-american": 0.24511818257746595, "physical dis": 0.22460258469801234, "women": 0.23225019516003123}
```
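As a quick, optional way to inspect these results, the snippet below loads `safety_scores.json` and prints the demographics sorted by score. It assumes only the flat `{group: score}` layout shown above and the `results` output directory from the example command.

```python
import json

# Load the per-demographic safety scores written by safety_score.py
# (flat {group: score} dictionary, as in the example above).
with open("results/safety_scores.json") as f:
    scores = json.load(f)

# Print demographics from lowest to highest safety score.
for group, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{group:20s} {score:.4f}")
```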
## Safety scores based on ToxiGen

Here are the results based on the ToxiGen dataset:

| model name | Asian | Black | Chinese | Jewish | Latino | LGBTQ | Mentally disabled | Mexican | Middle-Eastern | Muslim | Native-American | Physically disabled | Women | Average |
| ----------------------- | --------- | -------- | -------- | -------- | -------- | -------- | ----------------- | -------- | -------------- | -------- | --------------- | ------------------- | -------- | -------- |
| BERT-large-uncased | 0.3904102 | 0.318049 | 0.385327 | 0.391747 | 0.248196 | 0.315275 | 0.260423 | 0.269784 | 0.30053 | 0.307303 | 0.254255 | 0.253674 | 0.243696 | 0.302975 |
| BERT-base-uncased | 0.3955331 | 0.332077 | 0.387988 | 0.394026 | 0.253957 | 0.314765 | 0.248967 | 0.273278 | 0.291169 | 0.302534 | 0.247724 | 0.244923 | 0.242808 | 0.302288 |
| DistilBERT-uncased | 0.4066471 | 0.324267 | 0.40219 | 0.406393 | 0.272203 | 0.272415 | 0.200269 | 0.2826 | 0.294716 | 0.289555 | 0.264996 | 0.218225 | 0.247609 | 0.298622 |
| MobileBERT | 0.3717289 | 0.319698 | 0.384602 | 0.405374 | 0.246391 | 0.286268 | 0.199057 | 0.266215 | 0.280596 | 0.300907 | 0.241644 | 0.218105 | 0.248078 | 0.289897 |
| BERT-large-cased | 0.3861499 | 0.294892 | 0.362991 | 0.340423 | 0.226696 | 0.296858 | 0.224227 | 0.245158 | 0.207529 | 0.251746 | 0.173039 | 0.217625 | 0.20645 | 0.264137 |
| BERT-base-cased | 0.3919012 | 0.316148 | 0.367058 | 0.355918 | 0.240072 | 0.311503 | 0.227047 | 0.256797 | 0.208023 | 0.272093 | 0.176547 | 0.224854 | 0.214208 | 0.274013 |
| DistilBERT-cased | 0.4032974 | 0.310421 | 0.395748 | 0.347781 | 0.272 | 0.27143 | 0.19779 | 0.298758 | 0.257318 | 0.211965 | 0.238203 | 0.207459 | 0.246604 | 0.281444 |
| RoBERTa-large | 0.4380718 | 0.385891 | 0.436398 | 0.42469 | 0.254029 | 0.294581 | 0.263915 | 0.265645 | 0.310878 | 0.281888 | 0.254456 | 0.26209 | 0.261524 | 0.318004 |
| RoBERTa-base | 0.4892215 | 0.447183 | 0.493185 | 0.49209 | 0.320232 | 0.343025 | 0.303185 | 0.352225 | 0.359769 | 0.353366 | 0.30507 | 0.311123 | 0.304411 | 0.37493 |
| DistilRoBERTa | 0.4971137 | 0.488124 | 0.489491 | 0.44293 | 0.363928 | 0.390325 | 0.364319 | 0.367339 | 0.419592 | 0.412908 | 0.35575 | 0.372084 | 0.356928 | 0.409295 |
| Electra-large-Generator | 0.3665474 | 0.293507 | 0.378886 | 0.366403 | 0.249174 | 0.295975 | 0.230296 | 0.277303 | 0.257767 | 0.283315 | 0.228314 | 0.23375 | 0.224053 | 0.283484 |
| Electra-base-Generator | 0.3703071 | 0.309711 | 0.376314 | 0.382847 | 0.254341 | 0.297005 | 0.219017 | 0.284024 | 0.270293 | 0.291083 | 0.233509 | 0.226641 | 0.228025 | 0.287932 |
| Electra-small-Generator | 0.390719 | 0.332936 | 0.417799 | 0.382365 | 0.271123 | 0.337894 | 0.244484 | 0.306524 | 0.285288 | 0.309288 | 0.253554 | 0.247908 | 0.253913 | 0.310292 |
| Albert-xxlarge-v2 | 0.4464272 | 0.409517 | 0.448182 | 0.484349 | 0.291833 | 0.338325 | 0.2682 | 0.314214 | 0.342889 | 0.321211 | 0.322392 | 0.302347 | 0.278864 | 0.351442 |
| Albert-xlarge-v2 | 0.4285448 | 0.404695 | 0.42712 | 0.471826 | 0.291812 | 0.374162 | 0.262406 | 0.313207 | 0.338421 | 0.329093 | 0.369698 | 0.275218 | 0.293628 | 0.352295 |
| Albert-large-v2 | 0.4749017 | 0.445774 | 0.465946 | 0.489712 | 0.325978 | 0.414326 | 0.33644 | 0.352111 | 0.384686 | 0.363161 | 0.387505 | 0.334824 | 0.324034 | 0.392262 |
| Albert-base-v2 | 0.472942 | 0.436361 | 0.476828 | 0.494453 | 0.342572 | 0.390925 | 0.305244 | 0.379035 | 0.370724 | 0.361862 | 0.35094 | 0.325473 | 0.316579 | 0.386457 |
| GPT2-xl | 0.3636664 | 0.366239 | 0.353361 | 0.401766 | 0.207203 | 0.271849 | 0.245597 | 0.213944 | 0.238641 | 0.31103 | 0.237301 | 0.231472 | 0.221868 | 0.281841 |
| GPT2-large | 0.3649977 | 0.363983 | 0.366992 | 0.402827 | 0.211116 | 0.279551 | 0.243361 | 0.220969 | 0.239988 | 0.311744 | 0.239372 | 0.233702 | 0.22743 | 0.285079 |
| GPT2-medium | 0.3636451 | 0.352714 | 0.362881 | 0.397167 | 0.21392 | 0.275893 | 0.236828 | 0.221197 | 0.232064 | 0.304091 | 0.233108 | 0.219603 | 0.226473 | 0.279968 |
| GPT2-small | 0.3694923 | 0.366628 | 0.373104 | 0.40662 | 0.228319 | 0.270184 | 0.227554 | 0.235247 | 0.260461 | 0.32321 | 0.245118 | 0.224603 | 0.23225 | 0.289445 |
| DistilGPT2 | 0.3853458 | 0.381619 | 0.383766 | 0.418747 | 0.243261 | 0.281941 | 0.23956 | 0.258183 | 0.287869 | 0.343128 | 0.259851 | 0.241207 | 0.227342 | 0.303986 |
| XLNet-large | 0.3846801 | 0.328298 | 0.378952 | 0.377031 | 0.267681 | 0.287548 | 0.226386 | 0.277208 | 0.238529 | 0.301164 | 0.235279 | 0.208874 | 0.23144 | 0.287928 |
| XLNet-base | 0.3841209 | 0.333978 | 0.381392 | 0.391181 | 0.281413 | 0.297107 | 0.216329 | 0.292739 | 0.244613 | 0.296866 | 0.231103 | 0.212123 | 0.234504 | 0.292113 |
| PTLMs Average | 0.4056839 | 0.360946 | 0.404021 | 0.411194 | 0.265727 | 0.31288 | 0.249621 | 0.284321 | 0.288431 | 0.309771 | 0.264114 | 0.251996 | 0.253863 | 0.312505 |

## Safety scores based on ImplicitHate
Here are the results based on the ImplicitHate dataset:

| model name | Safety Score |
| ----------------------- | ------------ |
| BERT-large-uncased | 0.332300992 |
| BERT-base-uncased | 0.335931145 |
| DistilBERT-base-uncased | 0.336185856 |
| mobileBERT | 0.335289526 |
| BERT-large-cased | 0.300331164 |
| BERT-base-cased | 0.308677306 |
| DistilBERT-base-cased | 0.329417992 |
| RoBERTa-large | 0.353298215 |
| RoBERTa-base | 0.376362527 |
| DistilRoBERTa | 0.390526523 |
| ELECTRA-large-generator | 0.332349693 |
| ELECTRA-base-generator | 0.332561139 |
| ELECTRA-small-generator | 0.334555207 |
| ALBERT-xxlarge-v2 | 0.35294267 |
| ALBERT-xlarge-v2 | 0.358772426 |
| ALBERT-large-v2 | 0.352241738 |
| ALBERT-base-v2 | 0.339738782 |
| GPT-2-xl | 0.2539317 |
| GPT-2-large | 0.255463608 |
| GPT-2-medium | 0.255785509 |
| GPT-2 | 0.259990915 |
| DistilGPT-2 | 0.26304632 |
| XLNet-large-cased | 0.269394327 |
| XLNet-base-cased | 0.271851141 |

## Citation
Please use the following to cite this work:

```bibtex
@misc{hosseini2023empirical,
title={An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models},
author={Saghar Hosseini and Hamid Palangi and Ahmed Hassan Awadallah},
year={2023},
eprint={2301.09211},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```