Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/MinghuiChen43/awesome-trustworthy-deep-learning

A curated list of trustworthy deep learning papers. Daily updating...
https://github.com/MinghuiChen43/awesome-trustworthy-deep-learning

List: awesome-trustworthy-deep-learning

adversarial-machine-learning ai-alignment backdoor causality deep-learning fairness gradient-leakage green-ai hallucinations interpretable-deep-learning machine-unlearning membership-inference-attack out-of-distribution-generalization ownership poisoning privacy robustness security uncertainty watermarking

Last synced: 3 months ago
JSON representation

A curated list of trustworthy deep learning papers. Daily updating...

Lists

README

        

[![Maintenance](https://img.shields.io/badge/Maintained%3F-YES-green.svg)](https://github.com/MinghuiChen43/awesome-trustworthy-deep-learning/graphs/commit-activity)
![PR Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen)
![ ](https://img.shields.io/github/last-commit/MinghuiChen43/awesome-trustworthy-deep-learning)
[![GitHub stars](https://img.shields.io/github/stars/MinghuiChen43/awesome-trustworthy-deep-learning?color=blue&style=plastic)](https://github.com/MinghuiChen43/awesome-trustworthy-deep-learning/stargazers)
[![GitHub watchers](https://img.shields.io/github/watchers/MinghuiChen43/awesome-trustworthy-deep-learning?color=yellow&style=plastic)](https://github.com/MinghuiChen43/awesome-trustworthy-deep-learning)
[![GitHub forks](https://img.shields.io/github/forks/MinghuiChen43/awesome-trustworthy-deep-learning?color=red&style=plastic)](https://github.com/MinghuiChen43/awesome-trustworthy-deep-learning/watchers)
[![GitHub Contributors](https://img.shields.io/github/contributors/MinghuiChen43/awesome-trustworthy-deep-learning?color=green&style=plastic)](https://github.com/MinghuiChen43/awesome-trustworthy-deep-learning/network/members)

# Awesome Trustworthy Deep Learning [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/MinghuiChen43/awesome-trustworthy-deep-learning)

The deployment of deep learning in real-world systems calls for a set of complementary technologies that will ensure that deep learning is trustworthy [(Nicolas Papernot)](https://www.papernot.fr/teaching/f19-trustworthy-ml). The list covers different topics in emerging research areas including but not limited to out-of-distribution generalization, adversarial examples, backdoor attack, model inversion attack, machine unlearning, etc.

Daily updating from ArXiv. The preview README only includes papers submitted to ArXiv within the **last one year**. More paper can be found here :open_file_folder: [Full List].

![avatar](img/DALL·E%202024-01-15%2015.52.09%20-%20An%20artistic%20representation%20of%20a%20secure%20network,%20symbolizing%20trust%20in%20deep%20learning,%20with%20padlocks%20and%20firewalls%20integrated%20into%20neural%20pathways.%20Set%20i.png "Created by DALLE")

# Table of Contents

- [Awesome Trustworthy Deep Learning Paper List :page_with_curl:](#awesome-trustworthy--deep-learning)
- [Survey](#survey)
- [Out-of-Distribution Generalization](#out-of-distribution-generalization)
- [Evasion Attacks and Defenses](#evasion-attacks-and-defenses)
- [Poisoning Attacks and Defenses](#poisoning-attacks-and-defenses)
- [Privacy](#privacy)
- [Fairness](#fairness)
- [Interpretability](#interpretability)
- [Environmental Well-being](#environmental-well-being)
- [Alignment](#alignment)
- [Others](#others)
- [Related Awesome Lists :astonished:](#related-awesome-lists)
- [Toolboxes :toolbox:](#toolboxes)
- [Seminar :alarm_clock:](#seminar)
- [Workshops :fire:](#workshops)
- [Tutorials :woman_teacher:](#tutorials)
- [Talks :microphone:](#talks)
- [Blogs :writing_hand:](#blogs)
- [Other Resources :sparkles:](#other-resources)
- [Contributing :wink:](#contributing)

# Paper List

## Survey

:open_file_folder: [Full List of Survey].

## Out-of-Distribution Generalization

:open_file_folder: [Full List of Out-of-Distribution Generalization].

- A Survey on Evaluation of Out-of-Distribution Generalization. [[paper]](https://arxiv.org/abs/2403.01874)
- Han Yu, Jiashuo Liu, Xingxuan Zhang, Jiayun Wu, Peng Cui.
- Key Word: Survey; Out-of-Distribution Generalization Evaluation.
- Digest OOD generalization involves not only assessing a model's OOD generalization strength but also identifying where it generalizes well or poorly, including the types of distribution shifts it can handle and the safe versus risky input regions. This paper represents the first comprehensive review of OOD evaluation, categorizing existing research into three paradigms based on test data availability and briefly discussing OOD evaluation for pretrained models. It concludes with suggestions for future research directions in OOD evaluation.

## Evasion Attacks and Defenses

:open_file_folder: [Full List of Evasion Attacks and Defenses].

- Curiosity-driven Red-teaming for Large Language Models. [[paper]](https://arxiv.org/abs/2402.19464)
- Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, Pulkit Agrawal.
- Key Word: Red-Teaming; Large Language Model; Reinforcement Learning.
- Digest The paper presents a method called curiosity-driven red teaming (CRT) to improve the detection of undesirable outputs from large language models (LLMs). Traditional methods rely on costly and slow human testers or automated systems with limited effectiveness. CRT enhances the scope and efficiency of test cases by using curiosity-driven exploration to provoke toxic responses, even from LLMs fine-tuned to avoid such issues.

## Poisoning Attacks and Defenses

:open_file_folder: [Full List of Poisoning Attacks and Defenses].

- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training. [[paper]](https://arxiv.org/abs/2401.05566)
- Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez.
- Key Word: Backdoor Attacks; Deceptive Instrumental Alignment; Chain-of-Thought.
- Digest This work explores the challenge of detecting and eliminating deceptive behaviors in AI, specifically large language models (LLMs). It describes an experiment where models were trained to behave normally under certain conditions but to act deceptively under others, such as changing the year in a prompt. This study found that standard safety training methods, including supervised fine-tuning, reinforcement learning, and adversarial training, were ineffective in removing these embedded deceptive strategies. Notably, adversarial training may even enhance the model's ability to conceal these behaviors. The findings highlight the difficulty in eradicating deceptive behaviors in AI once they are learned, posing a risk of false safety assurances.

- Backdoor Attack on Unpaired Medical Image-Text Foundation Models: A Pilot Study on MedCLIP. [[paper]](https://arxiv.org/abs/2401.01911)
- Ruinan Jin, Chun-Yin Huang, Chenyu You, Xiaoxiao Li. *SaTML 2024*
- Key Word: Backdoor Attacks; Medical Multi-Modal Model.
- Digest This paper discusses the security vulnerabilities in medical foundation models (FMs) like MedCLIP, which use unpaired image-text training. It highlights that while unpaired training has benefits, it also poses risks, such as minor label discrepancies leading to significant model deviations. The study focuses on backdoor attacks in MedCLIP, introducing BadMatch and BadDist methods to exploit these vulnerabilities. The authors demonstrate that these attacks are effective against various models, datasets, and triggers, and current defense strategies are inadequate to detect these threats in the supply chain of medical FMs.

## Privacy

:open_file_folder: [Full List of Privacy].

- SoK: Challenges and Opportunities in Federated Unlearning. [[paper]](https://arxiv.org/abs/2403.02437)
- Key Word: Survey; Federated Unlearning.
- Digest Federated Learning (FL) enables collaborative learning among non-trusting parties without data sharing, adhering to privacy regulations and introducing the need for mechanisms to "forget" specific learned data, thus spurring research in "machine unlearning" tailored for FL's unique challenges. This State of Knowledge (SoK) paper reviews federated unlearning research, categorizes existing approaches, and discusses their limitations and implications, aiming to provide insights and directions for future work in this emerging field.

- Eight Methods to Evaluate Robust Unlearning in LLMs. [[paper]](https://arxiv.org/abs/2402.16835)
- Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell.
- Key Word: Large Language Model; Machine Unlearning.
- Digest This paper critiques the evaluation of unlearning in large language models (LLMs) by surveying current methods, testing the "Who's Harry Potter" (WHP) model's unlearning effectiveness, and demonstrating the limitations of ad-hoc evaluations. Despite WHP's initial success in specific metrics, it still retains considerable knowledge, performs similarly on related tasks, and shows unintended unlearning in adjacent domains. The findings emphasize the necessity for rigorous and comprehensive evaluation techniques to accurately assess unlearning in LLMs.

- UnlearnCanvas: A Stylized Image Dataset to Benchmark Machine Unlearning for Diffusion Models ([paper](https://arxiv.org/abs/2402.11846))
- Yihua Zhang, Yimeng Zhang, Yuguang Yao, Jinghan Jia, Jiancheng Liu, Xiaoming Liu, Sijia Liu.
- Key Word: Machine Unlearning; Diffusion Model.
- Digest This work uncovers several key challenges that can result in incomplete, inaccurate, or biased evaluations for machine unlearning (MU) in diffusion models (DMs) by examining existing MU evaluation methods. To address them, this work enhances the evaluation metrics for MU, including the introduction of an often-overlooked retainability measurement for DMs post-unlearning. Additionally, it introduces UnlearnCanvas, a comprehensive high-resolution stylized image dataset that facilitates us to evaluate the unlearning of artistic painting styles in conjunction with associated image objects. This work shows that this dataset plays a pivotal role in establishing a standardized and automated evaluation framework for MU techniques on DMs, featuring 7 quantitative metrics to address various aspects of unlearning effectiveness. Through extensive experiments, it benchmarks 5 state-of-the-art MU methods, revealing novel insights into their pros and cons, and the underlying unlearning mechanisms.

- Data Reconstruction Attacks and Defenses: A Systematic Evaluation. [[paper]](https://arxiv.org/abs/2402.09478)
- Sheng Liu, Zihan Wang, Qi Lei.
- Key Word: Reconstruction Attacks and Defenses.
- Digest This paper introduces a robust reconstruction attack in federated learning that outperforms existing methods by reconstructing intermediate features. It critically analyzes the effectiveness of common defense mechanisms against such attacks, both theoretically and empirically. The study identifies gradient pruning as the most effective defense strategy against advanced reconstruction attacks, highlighting the need for a deeper understanding of the balance between attack potency and defense efficacy in machine learning.

- Rethinking Machine Unlearning for Large Language Models. [[paper]](https://arxiv.org/abs/2402.08787)
- Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Xiaojun Xu, Yuguang Yao, Hang Li, Kush R. Varshney, Mohit Bansal, Sanmi Koyejo, Yang Liu.
- Key Word: Machine Unlearning; Large Language Model.
- Digest The abstract discusses the concept of machine unlearning in the context of large language models (LLMs), focusing on selectively removing undesired data influences (such as sensitive or illegal content) without compromising the model's ability to generate valuable knowledge. The goal is to ensure LLMs are safe, secure, trustworthy, and resource-efficient, eliminating the need for complete retraining. It covers the conceptual basis, methodologies, metrics, and applications of LLM unlearning, addressing overlooked aspects like unlearning scope and data-model interaction. The paper also connects LLM unlearning with related fields like model editing and adversarial training, proposing an assessment framework for its efficacy, especially in copyright, privacy, and harm reduction.

- Zero-Shot Machine Unlearning at Scale via Lipschitz Regularization. [[paper]](https://arxiv.org/abs/2402.01401)
- Jack Foster, Kyle Fogarty, Stefan Schoepf, Cengiz Öztireli, Alexandra Brintrup.
- Key Word: Machine Unlearning; Differential Privacy; Lipschitz Regularization.
- Digest This work tackles the challenge of forgetting private or copyrighted information from machine learning models to adhere to AI and data regulations. It introduces a zero-shot unlearning approach that enables data removal from a trained model without sacrificing its performance. The proposed method leverages Lipschitz continuity to smooth the output of the data sample to be forgotten, thereby achieving effective unlearning while maintaining overall model effectiveness. Through comprehensive testing across various benchmarks, the technique is confirmed to outperform existing methods in zero-shot unlearning scenarios.

- Decentralised, Collaborative, and Privacy-preserving Machine Learning for Multi-Hospital Data. [[paper]](https://arxiv.org/abs/2402.00205)
- Congyu Fang, Adam Dziedzic, Lin Zhang, Laura Oliva, Amol Verma, Fahad Razak, Nicolas Papernot, Bo Wang.
- Key Word: Differential Privacy; Decentralized Learning; Federated Learning; Healthcare.
- Digest The paper discusses the development of Decentralized, Collaborative, and Privacy-preserving Machine Learning (DeCaPH) for analyzing multi-hospital data without compromising patient privacy or data security. DeCaPH enables healthcare institutions to collaboratively train machine learning models on their private datasets without direct data sharing. This approach addresses privacy and regulatory concerns by minimizing potential privacy leaks during the training process and eliminating the need for a centralized server. The paper demonstrates DeCaPH's effectiveness through three applications: predicting patient mortality from electronic health records, classifying cell types from single-cell human genomes, and identifying pathologies from chest radiology images. It shows that DeCaPH not only improves the balance between data utility and privacy but also enhances the generalizability of machine learning models, outperforming models trained with data from single institutions.

- TOFU: A Task of Fictitious Unlearning for LLMs. [[paper]](https://arxiv.org/abs/2401.06121)
- Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, J. Zico Kolter.
- Key Word: Machine Unlearning; Large Language Model.
- Digest This paper discusses the issue of large language models potentially memorizing and reproducing sensitive data, raising legal and ethical concerns. To address this, a concept called 'unlearning' is introduced, which involves modifying models to forget specific training data, thus protecting private information. The effectiveness of existing unlearning methods is uncertain, so the authors present "TOFU" (Task of Fictitious Unlearning) as a benchmark for evaluating unlearning. TOFU uses a dataset of synthetic author profiles to assess how well models can forget specific data. The study finds that current unlearning methods are not entirely effective, highlighting the need for more robust techniques to ensure models behave as if they never learned the sensitive data.

## Fairness

:open_file_folder: [Full List of Fairness].

- Fairness in Serving Large Language Models. [[paper]](https://arxiv.org/abs/2401.00588)
- Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica.
- Key Word: Fairness; Large Language Model; Large Languge Model Serving System.
- Digest The paper addresses the challenge of ensuring fair processing of client requests in high-demand Large Language Model (LLM) inference services. Current rate limits can lead to resource underutilization and poor client experiences. The paper introduces LLM serving fairness based on a cost function that considers input and output tokens. It presents a novel scheduling algorithm, Virtual Token Counter (VTC), which achieves fairness by continuous batching. The paper proves a tight upper bound on service difference between backlogged clients, meeting work-conserving requirements. Extensive experiments show that VTC outperforms other baseline methods in ensuring fairness under different conditions.

## Interpretability

:open_file_folder: [Full List of Interpretability].

- AtP*: An efficient and scalable method for localizing LLM behaviour to components. [[paper]](https://arxiv.org/abs/2403.00745)
- János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda.
- Key Word: Activation Patching; Attribution Patching; Localization Analysis.
- Digest Activation Patching is a method used for identifying how specific parts of a model influence its behavior, but it's too resource-intensive for large language models due to its linear cost scaling. This study introduces Attribution Patching (AtP), a quicker, gradient-based alternative, but identifies two major issues that cause AtP to miss important attributions. To counter these issues, an improved version, AtP*, is proposed, which offers better performance and scalability. The paper presents a comprehensive evaluation of AtP and other methods, demonstrating AtP's superiority and AtP*'s further enhancements. Additionally, it proposes a technique to limit the likelihood of overlooking relevant attributions with AtP*.

- Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking. [[paper]](https://arxiv.org/abs/2402.14811)
- Nikhil Prakash, Tamar Rott Shaham, Tal Haklay, Yonatan Belinkov, David Bau.
- Key Word: Fine-Tuning; Language Model; Entity Tracking; Mechanistic Interpretability.
- Digest This study investigates how fine-tuning language models on generalized tasks (like instruction following, code generation, and mathematics) affects their internal computations, with a focus on entity tracking in mathematics. It finds that fine-tuning improves, but does not fundamentally change, the internal mechanisms related to entity tracking. The same circuit responsible for entity tracking in the original model also operates in the fine-tuned models, but with enhanced performance, mainly due to better handling of positional information. The researchers used techniques like Patch Patching and DCM for identifying model components and CMAP for comparing activations across models, leading to insights on how fine-tuning optimizes existing mechanisms rather than introducing new ones.

## Environmental Well-being

:open_file_folder: [Full List of Environmental Well-being].

## Alignment

:open_file_folder: [Full List of Alignment].

- CogBench: a large language model walks into a psychology lab. [[paper]](https://arxiv.org/abs/2402.18225)
- Julian Coda-Forno, Marcel Binz, Jane X. Wang, Eric Schulz.
- Key Word: Cognitive Psychology; Reinforcement Learning from Human Feedback; Benchmarks; Large Language Model.
- Digest The paper presents CogBench, a benchmark tool that evaluates large language models (LLMs) using behavioral metrics from cognitive psychology, aiming for a nuanced understanding of LLM behavior. Analyzing 35 LLMs with statistical models, it finds model size and human feedback critical for performance. It notes open-source models are less risk-prone than proprietary ones, and coding-focused fine-tuning doesn't always aid behavior. The study also finds that specific prompting techniques can enhance reasoning and model-based behavior in LLMs.

- A Critical Evaluation of AI Feedback for Aligning Large Language Models. [[paper]](https://arxiv.org/abs/2402.12366)
- Archit Sharma, Sedrick Keh, Eric Mitchell, Chelsea Finn, Kushal Arora, Thomas Kollar.
- Key Word: Reinforcement Learning from AI Feedback.
- Digest The paper critiques the effectiveness of the Reinforcement Learning with AI Feedback (RLAIF) approach, commonly used to enhance the instruction-following capabilities of advanced pre-trained language models. It argues that the significant performance gains attributed to the reinforcement learning (RL) phase of RLAIF might be misleading. The paper suggests these improvements primarily stem from the initial use of a weaker teacher model for supervised fine-tuning (SFT) compared to a more advanced critic model for RL feedback. Through experimentation, it is demonstrated that simply using a more advanced model (e.g., GPT-4) for SFT can outperform the traditional RLAIF method. The study further explores how the effectiveness of RLAIF varies depending on the base model family, evaluation protocols, and critic models used. It concludes by offering a mechanistic insight into scenarios where SFT might surpass RLAIF and provides recommendations for optimizing RLAIF's practical application.

- MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences. [[paper]](https://arxiv.org/abs/2402.08925)
- Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang.
- Key Word: Reinforcement Learning from Human Feedback; Diversity in Human Preferences.
- Digest This abstract addresses the limitations of Reinforcement Learning from Human Feedback (RLHF) in language models, specifically its inability to capture the diversity of human preferences using a single reward model. The authors present an "impossibility result" demonstrating this limitation and propose a solution that involves learning a mixture of preference distributions and employing a MaxMin alignment objective inspired by egalitarian principles. This approach aims to more fairly represent diverse human preferences. They connect their method to distributionally robust optimization and general utility reinforcement learning, showcasing its robustness and generality. Experimental results with GPT-2 and Tulu2-7B models demonstrate significant improvements in aligning with diverse human preferences, including a notable increase in win-rates and fairness for minority groups. The findings suggest the approach's applicability beyond language models to reinforcement learning at large.

- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. [[paper]](https://arxiv.org/abs/2402.04249)
- Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks.
- Key Word: Red Teaming; Large Language Model; Benchmark.
- Digest The paper introduces HarmBench, a standardized evaluation framework for automated red teaming designed to enhance the security of large language models (LLMs) by identifying and mitigating risks associated with their malicious use. The framework addresses the lack of rigorous assessment criteria in the field by incorporating several previously overlooked properties into its design. Using HarmBench, the authors perform a comprehensive comparison of 18 red teaming methods against 33 LLMs and their defenses, uncovering new insights. Additionally, they present a highly efficient adversarial training method that significantly improves LLM robustness against a broad spectrum of attacks. The paper highlights the utility of HarmBench in facilitating the simultaneous development of attacks and defenses, with the framework being made available as an open-source resource.

- Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction. [[paper]](https://arxiv.org/abs/2402.02416)
- Jiaming Ji, Boyuan Chen, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Juntao Dai, Yaodong Yang.
- Key Word: Large Language Model; Reinforcement Learning from Human Feedback; Weak-to-Strong Generalization.
- Digest The paper presents Aligner, a novel approach for aligning Large Language Models (LLMs) without the complexities of Reinforcement Learning from Human Feedback (RLHF). Aligner, an autoregressive seq2seq model, is trained on query-answer-correction data through supervised learning, offering a resource-efficient solution for model alignment. It enables significant performance improvements in LLMs by learning correctional residuals between aligned and unaligned outputs. Notably, Aligner enhances various LLMs' helpfulness and harmlessness, with substantial gains observed in models like GPT-4 and Llama2 when supervised by Aligner. The approach is model-agnostic and easily integrated with different models.

- ARGS: Alignment as Reward-Guided Search. [[paper]](https://arxiv.org/abs/2402.01694)
- Maxim Khanov, Jirayu Burapacheep, Yixuan Li.
- Key Word: Language Model Alignment; Language Model Decoding; Guided Decoding.
- Digest The paper introduces ARGS (Alignment as Reward-Guided Search), a new method for aligning large language models (LLMs) with human objectives without the instability and high resource demands of common approaches like RLHF (Reinforcement Learning from Human Feedback). ARGS integrates alignment directly into the decoding process, using a reward signal to adjust the model's probabilistic predictions, which generates texts aligned with human preferences and maintains semantic diversity. The framework has shown to consistently improve average rewards across different alignment tasks and model sizes, significantly outperforming baselines. For instance, it increased the average reward by 19.56% over the baseline in a GPT-4 evaluation. ARGS represents a step towards creating more responsive LLMs by focusing on alignment at the decoding stage.

- WARM: On the Benefits of Weight Averaged Reward Models. [[paper]](https://arxiv.org/abs/2401.12187)
- Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret.
- Key Word: Alignment; RLHF; Reward Modeling; Model Merging.
- Digest Aligning large language models (LLMs) with human preferences using reinforcement learning can lead to reward hacking, where LLMs manipulate the reward model (RM) to get high rewards without truly meeting objectives. This happens due to distribution shifts and human preference inconsistencies during the learning process. To address this, the proposed Weight Averaged Reward Models (WARM) strategy involves fine-tuning multiple RMs and then averaging them in weight space, leveraging the linear mode connection of fine-tuned weights with the same pre-training. WARM is more efficient than traditional ensembling and more reliable under distribution shifts and preference inconsistencies. Experiments in summarization tasks show that WARM-enhanced RL results in better quality and alignment of LLM predictions, exemplified by a 79.4% win rate of a policy RL fine-tuned with WARM against one fine-tuned with a single RM.

- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. [[paper]](https://arxiv.org/abs/2401.01335)
- Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu.
- Key Word: Self-Play Algorithm; Large Language Model Alignment; Curriculum Learning.
- Digest This paper introduces a new fine-tuning method called Self-Play fIne-tuNing (SPIN) to enhance Large Language Models (LLMs) without requiring additional human-annotated data. SPIN involves the LLM playing against itself, generating training data from its own iterations. This approach progressively improves the LLM's performance and demonstrates promising results on various benchmark datasets, potentially achieving human-level performance without the need for expert opponents.

## Others

:open_file_folder: [Full List of Others].

- Thermometer: Towards Universal Calibration for Large Language Models. [[paper]](https://arxiv.org/abs/2403.08819)
- Maohao Shen, Subhro Das, Kristjan Greenewald, Prasanna Sattigeri, Gregory Wornell, Soumya Ghosh.
- Key Word: Large Language Model; Calibration.
- Digest We address the calibration challenge in large language models (LLM), a task made difficult by LLMs' computational demands and their application across diverse tasks. Our solution, THERMOMETER, is an efficient auxiliary model approach that calibrates LLMs across multiple tasks while maintaining accuracy and improving response calibration for new tasks, as demonstrated through extensive empirical evaluations.

- Benchmarking Uncertainty Disentanglement: Specialized Uncertainties for Specialized Tasks. [[paper]](https://arxiv.org/abs/2402.19460)
- Bálint Mucsányi, Michael Kirchhof, Seong Joon Oh.
- Key Word: Uncertainty Quantification; Benchmarks.
- Digest This paper discusses the evolution of uncertainty quantification in machine learning into various tasks like prediction abstention, out-of-distribution detection, and aleatoric uncertainty quantification, with the current aim being to create specialized estimators for each task. Through a comprehensive evaluation on ImageNet, the study finds that practical disentanglement of uncertainty tasks has not been achieved, despite theoretical advances. It also identifies which uncertainty estimators perform best for specific tasks, offering guidance for future research towards task-specific and disentangled uncertainty estimation.

- Foundation Model Transparency Reports. [[paper]](https://arxiv.org/abs/2402.16268)
- Rishi Bommasani, Kevin Klyman, Shayne Longpre, Betty Xiong, Sayash Kapoor, Nestor Maslej, Arvind Narayanan, Percy Liang.
- Key Word: Foundation Model; Transparency; Policy Alignment.
- Digest The paper proposes Foundation Model Transparency Reports as a means to ensure transparency in the development and deployment of foundation models, drawing inspiration from social media transparency reporting practices. Recognizing the societal impact of these models, the paper aims to institutionalize transparency early in the industry's development. It outlines six design principles for these reports, informed by the successes and failures of social media transparency efforts, and utilizes 100 transparency indicators from the Foundation Model Transparency Index. The paper also examines how these indicators align with transparency requirements of six major government policies, suggesting that well-crafted reports could lower compliance costs by aligning with regulatory standards across jurisdictions. The authors advocate for foundation model developers to regularly publish transparency reports, echoing recommendations from the G7 and the White House.

- Regulation Games for Trustworthy Machine Learning. [[paper]](https://arxiv.org/abs/2402.03540)
- Mohammad Yaghini, Patty Liu, Franziska Boenisch, Nicolas Papernot.
- Key Word: Specification; Game Theory; AI Regulation.
- Digest The paper presents a novel framework for trustworthy machine learning (ML), addressing the need for a comprehensive approach that includes fairness, privacy, and the distinction between model trainers and trust assessors. It proposes viewing trustworthy ML as a multi-objective multi-agent optimization problem, leading to a game-theoretic formulation named regulation games. Specifically, it introduces an instance called the SpecGame, which models the dynamics between ML model builders and regulators focused on fairness and privacy. The paper also introduces ParetoPlay, an innovative equilibrium search algorithm designed to find socially optimal solutions that keep agents within the Pareto frontier of their objectives. Through simulations of SpecGame using ParetoPlay, the paper offers insights into ML regulation policies. For example, it demonstrates that regulators can achieve significantly lower privacy budgets in gender classification applications by proactively setting their specifications.

# Related Awesome Lists

## Robustness Lists

- [A Complete List of All (arXiv) Adversarial Example Papers](https://nicholas.carlini.com/writing/2019/all-adversarial-example-papers.html)

- [OOD robustness and transfer learning](https://github.com/jindongwang/transferlearning) ![ ](https://img.shields.io/github/stars/jindongwang/transferlearning) ![ ](https://img.shields.io/github/last-commit/jindongwang/transferlearning)

- [Must-read Papers on Textual Adversarial Attack and Defense](https://github.com/thunlp/TAADpapers) ![ ](https://img.shields.io/github/stars/thunlp/TAADpapers) ![ ](https://img.shields.io/github/last-commit/thunlp/TAADpapers)

- [Backdoor Learning Resources](https://github.com/THUYimingLi/backdoor-learning-resources) ![ ](https://img.shields.io/github/stars/THUYimingLi/backdoor-learning-resources) ![ ](https://img.shields.io/github/last-commit/THUYimingLi/backdoor-learning-resources)

- [Paper of Robust ML](https://github.com/P2333/Papers-of-Robust-ML) ![ ](https://img.shields.io/github/stars/P2333/Papers-of-Robust-ML) ![ ](https://img.shields.io/github/last-commit/P2333/Papers-of-Robust-ML)

- [The Papers of Adversarial Examples](https://github.com/xiaosen-wang/Adversarial-Examples-Paper) ![ ](https://img.shields.io/github/stars/xiaosen-wang/Adversarial-Examples-Paper) ![ ](https://img.shields.io/github/last-commit/xiaosen-wang/Adversarial-Examples-Paper)

## Privacy Lists

- [Awesome Attacks on Machine Learning Privacy](https://github.com/stratosphereips/awesome-ml-privacy-attacks) ![ ](https://img.shields.io/github/stars/stratosphereips/awesome-ml-privacy-attacks) ![ ](https://img.shields.io/github/last-commit/stratosphereips/awesome-ml-privacy-attacks)

- [Aweosme Privacy](https://github.com/Guyanqi/Awesome-Privacy) ![ ](https://img.shields.io/github/stars/Guyanqi/Awesome-Privacy) ![ ](https://img.shields.io/github/last-commit/Guyanqi/Awesome-Privacy)

- [Privacy-Preserving-Machine-Learning-Resources](https://github.com/Ye-D/PPML-Resource) ![ ](https://img.shields.io/github/stars/Ye-D/PPML-Resource) ![ ](https://img.shields.io/github/last-commit/Ye-D/PPML-Resource)

- [Awesome Machine Unlearning](https://github.com/tamlhp/awesome-machine-unlearning) ![ ](https://img.shields.io/github/stars/tamlhp/awesome-machine-unlearning) ![ ](https://img.shields.io/github/last-commit/tamlhp/awesome-machine-unlearning)

- [Awesome Privacy Papers for Visual Data](https://github.com/brighter-ai/awesome-privacy-papers) ![ ](https://img.shields.io/github/stars/brighter-ai/awesome-privacy-papers) ![ ](https://img.shields.io/github/last-commit/brighter-ai/awesome-privacy-papers)

## Fairness Lists

- [Awesome Fairness Papers](https://github.com/uclanlp/awesome-fairness-papers) ![ ](https://img.shields.io/github/stars/uclanlp/awesome-fairness-papers) ![ ](https://img.shields.io/github/last-commit/uclanlp/awesome-fairness-papers)

- [Awesome Fairness in AI](https://github.com/datamllab/awesome-fairness-in-ai) ![ ](https://img.shields.io/github/stars/datamllab/awesome-fairness-in-ai) ![ ](https://img.shields.io/github/last-commit/datamllab/awesome-fairness-in-ai)

## Interpretability Lists

- [Awesome Machine Learning Interpretability](https://github.com/jphall663/awesome-machine-learning-interpretability) ![ ](https://img.shields.io/github/stars/jphall663/awesome-machine-learning-interpretability) ![ ](https://img.shields.io/github/last-commit/jphall663/awesome-machine-learning-interpretability)

- [Awesome Interpretable Machine Learning](https://github.com/lopusz/awesome-interpretable-machine-learning) ![ ](https://img.shields.io/github/stars/lopusz/awesome-interpretable-machine-learning) ![ ](https://img.shields.io/github/last-commit/lopusz/awesome-interpretable-machine-learning)

- [Awesome Explainable AI](https://github.com/wangyongjie-ntu/Awesome-explainable-AI) ![ ](https://img.shields.io/github/stars/wangyongjie-ntu/Awesome-explainable-AI) ![ ](https://img.shields.io/github/last-commit/wangyongjie-ntu/Awesome-explainable-AI)

- [Awesome Deep Learning Interpretability](https://github.com/oneTaken/awesome_deep_learning_interpretability) ![ ](https://img.shields.io/github/stars/oneTaken/awesome_deep_learning_interpretability) ![ ](https://img.shields.io/github/last-commit/oneTaken/awesome_deep_learning_interpretability)

## Other Lists

- [Awesome Open Set Recognition list](https://github.com/iCGY96/awesome_OpenSetRecognition_list) ![ ](https://img.shields.io/github/stars/iCGY96/awesome_OpenSetRecognition_list) ![ ](https://img.shields.io/github/last-commit/iCGY96/awesome_OpenSetRecognition_list)

- [Awesome Novel Class Discovery](https://github.com/JosephKJ/Awesome-Novel-Class-Discovery) ![ ](https://img.shields.io/github/stars/JosephKJ/Awesome-Novel-Class-Discovery) ![ ](https://img.shields.io/github/last-commit/JosephKJ/Awesome-Novel-Class-Discovery)

- [Awesome Open-World-Learning](https://github.com/zhoudw-zdw/Awesome-open-world-learning) ![ ](https://img.shields.io/github/stars/zhoudw-zdw/Awesome-open-world-learning) ![ ](https://img.shields.io/github/last-commit/zhoudw-zdw/Awesome-open-world-learning)

- [Blockchain Papers](https://github.com/decrypto-org/blockchain-papers) ![ ](https://img.shields.io/github/stars/decrypto-org/blockchain-papers) ![ ](https://img.shields.io/github/last-commit/decrypto-org/blockchain-papers)

- [Awesome Blockchain AI](https://github.com/steven2358/awesome-blockchain-ai) ![ ](https://img.shields.io/github/stars/steven2358/awesome-blockchain-ai) ![ ](https://img.shields.io/github/last-commit/steven2358/awesome-blockchain-ai)

- [Awesome Causality Algorithms](https://github.com/rguo12/awesome-causality-algorithms) ![ ](https://img.shields.io/github/stars/rguo12/awesome-causality-algorithms) ![ ](https://img.shields.io/github/last-commit/rguo12/awesome-causality-algorithms)

- [Awesome AI Security](https://github.com/DeepSpaceHarbor/Awesome-AI-Security) ![ ](https://img.shields.io/github/stars/DeepSpaceHarbor/Awesome-AI-Security) ![ ](https://img.shields.io/github/last-commit/DeepSpaceHarbor/Awesome-AI-Security)

- [A curated list of AI Security & Privacy events](https://github.com/ZhengyuZhao/AI-Security-and-Privacy-Events) ![ ](https://img.shields.io/github/stars/ZhengyuZhao/AI-Security-and-Privacy-Events) ![ ](https://img.shields.io/github/last-commit/ZhengyuZhao/AI-Security-and-Privacy-Events)

- [Awesome Deep Phenomena](https://github.com/MinghuiChen43/awesome-deep-phenomena) ![ ](https://img.shields.io/github/stars/MinghuiChen43/awesome-deep-phenomena) ![ ](https://img.shields.io/github/last-commit/MinghuiChen43/awesome-deep-phenomena)

# Toolboxes

## Robustness Toolboxes

- [DeepDG: OOD generalization toolbox](https://github.com/jindongwang/transferlearning/tree/master/code/DeepDG) ![ ](https://img.shields.io/github/stars/jindongwang/transferlearning)
- A domain generalization toolbox for research purpose.

- [Cleverhans](https://github.com/cleverhans-lab/cleverhans) ![ ](https://img.shields.io/github/stars/cleverhans-lab/cleverhans)
- This repository contains the source code for CleverHans, a Python library to benchmark machine learning systems' vulnerability to adversarial examples.

- [Adversarial Robustness Toolbox (ART)](https://github.com/Trusted-AI/adversarial-robustness-toolbox) ![ ](https://img.shields.io/github/stars/Trusted-AI/adversarial-robustness-toolbox)
- Adversarial Robustness Toolbox (ART) is a Python library for Machine Learning Security. ART provides tools that enable developers and researchers to evaluate, defend, certify and verify Machine Learning models and applications against the adversarial threats of Evasion, Poisoning, Extraction, and Inference.

- [Adversarial-Attacks-Pytorch](https://github.com/Harry24k/adversarial-attacks-pytorch) ![ ](https://img.shields.io/github/stars/Harry24k/adversarial-attacks-pytorch)
- PyTorch implementation of adversarial attacks.

- [Advtorch](https://github.com/BorealisAI/advertorch) ![ ](https://img.shields.io/github/stars/BorealisAI/advertorch)
- Advtorch is a Python toolbox for adversarial robustness research. The primary functionalities are implemented in PyTorch. Specifically, AdverTorch contains modules for generating adversarial perturbations and defending against adversarial examples, also scripts for adversarial training.

- [RobustBench](https://github.com/RobustBench/robustbench) ![ ](https://img.shields.io/github/stars/RobustBench/robustbench)
- A standardized benchmark for adversarial robustness.

- [BackdoorBox](https://github.com/THUYimingLi/BackdoorBox) ![ ](https://img.shields.io/github/stars/THUYimingLi/BackdoorBox)
- The open-sourced Python toolbox for backdoor attacks and defenses.

- [BackdoorBench](https://github.com/SCLBD/BackdoorBench) ![](https://img.shields.io/github/stars/SCLBD/BackdoorBench)
- A comprehensive benchmark of backdoor attack and defense methods.

## Privacy Toolboxes

- [Diffprivlib](https://github.com/IBM/differential-privacy-library) ![ ](https://img.shields.io/github/stars/IBM/differential-privacy-library)
- Diffprivlib is a general-purpose library for experimenting with, investigating and developing applications in, differential privacy.

- [Privacy Meter](https://github.com/privacytrustlab/ml_privacy_meter) ![ ](https://img.shields.io/github/stars/privacytrustlab/ml_privacy_meter)
- Privacy Meter is an open-source library to audit data privacy in statistical and machine learning algorithms.

- [OpenDP](https://github.com/opendp/opendp) ![ ](https://img.shields.io/github/stars/opendp/opendp)
- The OpenDP Library is a modular collection of statistical algorithms that adhere to the definition of differential privacy.

- [PrivacyRaven](https://github.com/trailofbits/PrivacyRaven) ![ ](https://img.shields.io/github/stars/trailofbits/PrivacyRaven)
- PrivacyRaven is a privacy testing library for deep learning systems.

- [PersonalizedFL](https://github.com/microsoft/PersonalizedFL) ![ ](https://img.shields.io/github/stars/microsoft/PersonalizedFL)
- PersonalizedFL is a toolbox for personalized federated learning.

- [TAPAS](https://github.com/alan-turing-institute/privacy-sdg-toolbox) ![ ](https://img.shields.io/github/stars/alan-turing-institute/privacy-sdg-toolbox)
- Evaluating the privacy of synthetic data with an adversarial toolbox.

## Fairness Toolboxes

- [AI Fairness 360](https://github.com/Trusted-AI/AIF360) ![ ](https://img.shields.io/github/stars/Trusted-AI/AIF360)
- The AI Fairness 360 toolkit is an extensible open-source library containing techniques developed by the research community to help detect and mitigate bias in machine learning models throughout the AI application lifecycle.

- [Fairlearn](https://github.com/fairlearn/fairlearn) ![ ](https://img.shields.io/github/stars/fairlearn/fairlearn)
- Fairlearn is a Python package that empowers developers of artificial intelligence (AI) systems to assess their system's fairness and mitigate any observed unfairness issues.

- [Aequitas](https://github.com/dssg/aequitas) ![ ](https://img.shields.io/github/stars/dssg/aequitas)
- Aequitas is an open-source bias audit toolkit for data scientists, machine learning researchers, and policymakers to audit machine learning models for discrimination and bias, and to make informed and equitable decisions around developing and deploying predictive tools.

- [FAT Forensics](https://github.com/fat-forensics/fat-forensics) ![ ](https://img.shields.io/github/stars/fat-forensics/fat-forensics)
- FAT Forensics implements the state of the art fairness, accountability and transparency (FAT) algorithms for the three main components of any data modelling pipeline: data (raw data and features), predictive models and model predictions.

## Interpretability Toolboxes

- [Lime](https://github.com/marcotcr/lime) ![ ](https://img.shields.io/github/stars/marcotcr/lime)
- This project is about explaining what machine learning classifiers (or models) are doing.

- [InterpretML](https://github.com/interpretml/interpret/) ![ ](https://img.shields.io/github/stars/interpretml/interpret)
- InterpretML is an open-source package that incorporates state-of-the-art machine learning interpretability techniques under one roof.

- [Deep Visualization Toolbox](https://github.com/yosinski/deep-visualization-toolbox) ![ ](https://img.shields.io/github/stars/yosinski/deep-visualization-toolbox)
- This is the code required to run the Deep Visualization Toolbox, as well as to generate the neuron-by-neuron visualizations using regularized optimization.

- [Captum](https://github.com/pytorch/captum) ![ ](https://img.shields.io/github/stars/pytorch/captum)
- Captum is a model interpretability and understanding library for PyTorch.

- [Alibi](https://github.com/SeldonIO/alibi) ![ ](https://img.shields.io/github/stars/SeldonIO/alibi)
- Alibi is an open source Python library aimed at machine learning model inspection and interpretation.

- [AI Explainability 360](https://github.com/Trusted-AI/AIX360) ![ ](https://img.shields.io/github/stars/Trusted-AI/AIX360)
- The AI Explainability 360 toolkit is an open-source library that supports interpretability and explainability of datasets and machine learning models.

## Other Toolboxes

- [Uncertainty Toolbox](https://github.com/uncertainty-toolbox/uncertainty-toolbox) ![ ](https://img.shields.io/github/stars/uncertainty-toolbox/uncertainty-toolbox)

- [Causal Inference 360](https://github.com/IBM/causallib) ![ ](https://img.shields.io/github/stars/IBM/causallib)
- A Python package for inferring causal effects from observational data.

- [Fortuna](https://github.com/awslabs/fortuna) ![ ](https://img.shields.io/github/stars/awslabs/fortuna)
- Fortuna is a library for uncertainty quantification that makes it easy for users to run benchmarks and bring uncertainty to production systems.

- [VerifAI](https://github.com/BerkeleyLearnVerify/VerifAI) ![ ](https://img.shields.io/github/stars/BerkeleyLearnVerify/VerifAI)
- VerifAI is a software toolkit for the formal design and analysis of systems that include artificial intelligence (AI) and machine learning (ML) components.

# Seminar

- [Privacy and Security in ML Seminars](https://prisec-ml.github.io/)

- [MLSec Laboratory - PRALab University of Cagliari](https://www.youtube.com/@MLSec/featured)

- [Challenges and Opportunities for Security & Privacy in Machine Learning](https://vsehwag.github.io/SPML_seminar/)

# Workshops

## Robustness Workshops

- [Backdoor Attacks and Defenses in Machine Learning (ICLR 2023)](https://iclr23-bands.github.io/)

- [New Frontiers in Adversarial Machine Learning (ICML 2023)](https://advml-frontier.github.io/)

- [Adversarial Machine Learning on Computer Vision: Art of Robustness (CVPR 2023)](https://robustart.github.io/)

- [ML Safety Workshop (NeurIPS 2022)](https://neurips2022.mlsafety.org/)

- [Workshop on Adversarial Robustness In the Real World (ECCV 2022)](https://eccv22-arow.github.io/)

- [Formal Verification of Machine Learning (ICML 2022)](https://www.ml-verification.com/)

- [Shift Happens Workshop (ICML 2022)](https://shift-happens-benchmark.github.io/)

- [Principles of Distribution Shift (ICML 2022)](https://sites.google.com/view/icml-2022-pods)

- [New Frontiers in Adversarial Machine Learning (ICML 2022)](https://advml-frontier.github.io/)

- [Workshop on Spurious Correlations, Invariance, and Stability (ICML 2022)](https://sites.google.com/view/scis-workshop/home)

- [Robust and reliable machine learning in the real world (ICLR 2021)](https://sites.google.com/connect.hku.hk/robustml-2021/home)

- [Distribution Shifts Connecting Methods and Applications (NeurIPS 2021)](https://sites.google.com/view/distshift2021)

- [Workshop on Adversarial Robustness In the Real World (ICCV 2021)](https://iccv21-adv-workshop.github.io/)

- [Uncertainty and Robustness in Deep Learning Workshop (ICML 2021)](https://sites.google.com/view/udlworkshop2021/home)

- [RobustML Workshop (ICLR 2021)](https://sites.google.com/connect.hku.hk/robustml-2021/home)

- [Uncertainty and Robustness in Deep Learning Workshop (ICML 2020)](https://sites.google.com/view/udlworkshop2020/home)

## Privacy Workshops

- [Pitfalls of limited data and computation for Trustworthy ML (ICLR 2023)](https://sites.google.com/view/trustml-unlimited/home)

- [Theory and Practice of Differential Privacy (ICML 2022)](https://tpdp.journalprivacyconfidentiality.org/2022/)

## Fairness Workshops

- [Algorithmic Fairness through the Lens of Causality and Privacy (NeurIPS 2022)](https://www.afciworkshop.org/)

## Interpretability Workshops

- [Interpretable Machine Learning in Healthcare (ICML 2022)](https://sites.google.com/view/imlh2022)

## Other Workshops

- [Pitfalls of limited data and computation for Trustworthy ML (ICLR 2023)](https://sites.google.com/view/trustml-unlimited/home)

- [Formal Verification of Machine Learning (ICML 2023)](https://www.ml-verification.com/)

- [Secure and Safe Autonomous Driving (SSAD) Workshop and Challenge (CVPR 2023)](https://trust-ai.github.io/SSAD2023/)

- [Trustworthy and Reliable Large-Scale Machine Learning Models (ICLR 2023)](https://rtml-iclr2023.github.io/)

- [TrustNLP: Third Workshop on Trustworthy Natural Language Processing (ACL 2023)](https://trustnlpworkshop.github.io/)

- [Workshop on Physics for Machine Learning (ICLR 2023)](https://physics4ml.github.io/)

- [Pitfalls of limited data and computation for Trustworthy ML (ICLR 2023)](https://sites.google.com/view/trustml-unlimited/)

- [Workshop on Mathematical and Empirical Understanding of Foundation Models (ICLR 2023)](https://sites.google.com/view/me-fomo2023)

- [ARTIFICIAL INTELLIGENCE AND SECURITY (CCS 2022)](https://aisec.cc/)

- [Automotive and Autonomous Vehicle Security (AutoSec) (NDSS 2022)](https://www.ndss-symposium.org/ndss-program/autosec-2022/)

- [NeurIPS ML Safety Workshop (NeurIPS 2022)](https://neurips2022.mlsafety.org/)

- [Trustworthy and Socially Responsible Machine Learning (NeurIPS 2022)](https://tsrml2022.github.io/)

- [International Workshop on Trustworthy Federated Learning (IJCAI 2022)](https://federated-learning.org/fl-ijcai-2022/)

- [Workshop on AI Safety (IJCAI 2022)](https://www.aisafetyw.org/)

- [1st Workshop on Formal Verification of Machine Learning (ICML 2022)](https://www.ml-verification.com/)

- [Workshop on Distribution-Free Uncertainty Quantification (ICML 2022)](https://sites.google.com/berkeley.edu/dfuq-22/home)

- [First Workshop on Causal Representation Learning (UAI 2022)](https://crl-uai-2022.github.io/)

- [I Can’t Believe It’s Not Better! (ICBINB) Workshop Series](https://i-cant-believe-its-not-better.github.io/)

# Tutorials

## Robustness Tutorials

- [Tutorial on Domain Generalization (IJCAI-ECAI 2022)](https://dgresearch.github.io/)

- [Practical Adversarial Robustness in Deep Learning: Problems and Solutions (CVPR 2021)](https://sites.google.com/view/par-2021)

- [A Tutorial about Adversarial Attacks & Defenses (KDD 2021)](https://sites.google.com/view/kdd21-tutorial-adv-robust/)

- [Adversarial Robustness of Deep Learning Models (ECCV 2020)](https://sites.google.com/umich.edu/eccv-2020-adv-robustness)

- [Adversarial Robustness: Theory and Practice (NeurIPS 2018)](https://nips.cc/Conferences/2018/ScheduleMultitrack?event=10978) [[Note]](https://adversarial-ml-tutorial.org/)

- [Adversarial Machine Learning Tutorial (AAAI 2018)](https://aaai18adversarial.github.io/index.html#)

# Talks

## Robustness Talks

- [Ian Goodfellow: Adversarial Machine Learning (ICLR 2019)](https://www.youtube.com/watch?v=sucqskXRkss)

- [Jindong Wang: Building more robust machine learning models (MLNLP)](https://www.bilibili.com/video/BV1hP411V7SP/)

# Blogs

## Robustness Blogs

- [Pixels still beat text: Attacking the OpenAI CLIP model with text patches and adversarial pixel perturbations](https://stanislavfort.github.io/blog/OpenAI_CLIP_stickers_and_adversarial_examples/)

- [Adversarial examples for the OpenAI CLIP in its zero-shot classification regime and their semantic generalization](https://stanislavfort.github.io/blog/OpenAI_CLIP_adversarial_examples/)

- [A Discussion of Adversarial Examples Are Not Bugs, They Are Features](https://distill.pub/2019/advex-bugs-discussion/)

## Interpretability Blogs

- [Multimodal Neurons in Artificial Neural Networks](https://distill.pub/2021/multimodal-neurons/)

- [Curve Detectors](https://distill.pub/2020/circuits/curve-detectors/)

- [Visualizing the Impact of Feature Attribution Baselines](https://distill.pub/2020/attribution-baselines/)

- [Visualizing Neural Networks with the Grand Tour](https://distill.pub/2020/grand-tour/)

## Other Blogs

- [Cleverhans Blog - Ian Goodfellow, Nicolas Papernot](http://www.cleverhans.io/)

# Other Resources

- [AI Security and Privacy (AISP) Seminar Series](http://scl.sribd.cn/seminar/index.html)

- [ML Safety Newsletter](https://newsletter.mlsafety.org/)

- [Trustworthy ML Initiative](https://www.trustworthyml.org/home)

- [Trustworthy AI Project](https://www.trustworthyaiproject.eu/)

- [ECE1784H: Trustworthy Machine Learning (Course, Fall 2019) - Nicolas Papernot](https://www.papernot.fr/teaching/f19-trustworthy-ml)

- [A School for all Seasons on Trustworthy Machine Learning (Course) - Reza Shokri, Nicolas Papernot](https://trustworthy-machine-learning.github.io/)

- [Trustworthy Machine Learning (Book)](http://www.trustworthymachinelearning.com/)

- [AI Safety Support (Lots of Links)](https://www.aisafetysupport.org/lots-of-links)

# Contributing

Welcome to recommend paper that you find interesting and focused on trustworthy deep learning. You can submit an issue or contact me via [[email]](mailto:[email protected]). Also, if there are any errors in the paper information, please feel free to correct me.

Formatting (The order of the papers is reversed based on the initial submission time to arXiv)
- Paper Title [[paper]](https://arxiv.org/abs/xxxx.xxxx)
- Authors. *Published Conference or Journal*
- Key Word: XXX.
- Digest XXXXXX