https://github.com/zouharvi/stolen-subwords

Zero-data blackbox machine translation model distillation / stealing

machine-translation model-distillation

# Machine Translation Vocabulary Stealing

[![Paper](https://img.shields.io/badge/📜%20paper-481.svg)](https://arxiv.org/abs/2401.16055)

Code accompanying the report [Stolen Subwords: Importance of Vocabularies for Machine Translation Model Stealing](https://arxiv.org/abs/2401.16055).

> **Abstract**: In learning-based functionality stealing, the attacker is trying to build a local model based on the victim's outputs.
> The attacker has to make choices regarding the local model's architecture, optimization method and, specifically for NLP models, subword vocabulary, such as BPE.
> On the machine translation task, we explore (1) whether the choice of the vocabulary plays a role in model stealing scenarios and (2) if it is possible to extract the victim's vocabulary.
> We find that the vocabulary itself does not have a large effect on the local model's performance.
> Given gray-box model access, it is possible to collect the victim's vocabulary by collecting the outputs (detokenized subwords on the output).
> The results of the minimum effect of vocabulary choice are important more broadly for black-box knowledge distillation.
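
The vocabulary-collection idea from the abstract can be illustrated with a minimal sketch. This is not the code from this repository: it assumes the gray-box interface returns each translation as a list of subword units, and `victim_translate`, `fake_victim`, and the `@@` continuation marker are hypothetical stand-ins chosen for illustration.

```python
from collections import Counter
from typing import Callable, Iterable, List


def collect_vocabulary(
    sources: Iterable[str],
    victim_translate: Callable[[str], List[str]],
) -> Counter:
    """Count every subword unit the victim emits across all queries."""
    vocab: Counter = Counter()
    for sentence in sources:
        # Assumption: the victim exposes its output already split into subwords.
        vocab.update(victim_translate(sentence))
    return vocab


# Hypothetical stand-in for the victim model: a lookup table of
# pre-segmented outputs using "@@" as a BPE-style continuation marker.
_FAKE_OUTPUTS = {
    "guten morgen": ["good", "mor@@", "ning"],
    "gute nacht": ["good", "night"],
}


def fake_victim(sentence: str) -> List[str]:
    return _FAKE_OUTPUTS[sentence]


vocab = collect_vocabulary(_FAKE_OUTPUTS.keys(), fake_victim)
```

In this toy setting, `vocab` holds the union of all observed subwords with their frequencies. With a real victim, the recovered vocabulary can only contain subwords that the queries actually elicit, so coverage depends on the query set.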