{"id":28176470,"url":"https://github.com/chrisliu298/awesome-representation-engineering","last_synced_at":"2025-05-16T00:18:53.454Z","repository":{"id":236569111,"uuid":"792848396","full_name":"chrisliu298/awesome-representation-engineering","owner":"chrisliu298","description":"A resource repository for representation engineering in large language models","archived":false,"fork":false,"pushed_at":"2024-11-14T21:03:30.000Z","size":23,"stargazers_count":52,"open_issues_count":2,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-14T22:18:19.970Z","etag":null,"topics":["alignment","awesome","large-language-model","llm","representation-engineering"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chrisliu298.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-27T18:10:46.000Z","updated_at":"2024-11-14T21:03:33.000Z","dependencies_parsed_at":"2024-09-11T18:54:38.789Z","dependency_job_id":"b2fa1e8c-25bd-4752-bf3e-6a1ec0a2d747","html_url":"https://github.com/chrisliu298/awesome-representation-engineering","commit_stats":null,"previous_names":["chrisliu298/awesome-representation-engineering"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrisliu298%2Fawesome-representation-engineering","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrisliu298%2Fawesome-representation-engineering/tags","releases_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/chrisliu298%2Fawesome-representation-engineering/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chrisliu298%2Fawesome-representation-engineering/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chrisliu298","download_url":"https://codeload.github.com/chrisliu298/awesome-representation-engineering/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254442810,"owners_count":22071888,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alignment","awesome","large-language-model","llm","representation-engineering"],"created_at":"2025-05-16T00:18:47.732Z","updated_at":"2025-05-16T00:18:53.425Z","avatar_url":"https://github.com/chrisliu298.png","language":null,"funding_links":[],"categories":["Other Lists","Categories"],"sub_categories":["TeX Lists","Relevant Repo and Blog"],"readme":"# Awesome Representation Engineering\n\n\u003cp align=\"center\"\u003e\n\u003ca href=\"\"\u003e \u003cimg src=\"https://awesome.re/badge-flat.svg\" alt=\"Awesome\"\u003e\u003c/a\u003e\n\u003ca href=\"\"\u003e \u003cimg src=\"https://img.shields.io/github/stars/chrisliu298/awesome-representation-engineering?style=flat-square\u0026logo=github\" alt=\"GitHub stars\"\u003e\u003c/a\u003e\n\u003ca href=\"\"\u003e \u003cimg src=\"https://img.shields.io/github/forks/chrisliu298/awesome-representation-engineering?style=flat-square\u0026logo=github\" alt=\"GitHub forks\"\u003e\u003c/a\u003e\n\u003ca href=\"\"\u003e \u003cimg 
src=\"https://img.shields.io/github/issues/chrisliu298/awesome-representation-engineering?style=flat-square\u0026logo=github\" alt=\"GitHub issues\"\u003e\u003c/a\u003e\n\u003ca href=\"\"\u003e \u003cimg src=\"https://img.shields.io/github/last-commit/chrisliu298/awesome-representation-engineering?style=flat-square\u0026logo=github\" alt=\"GitHub Last commit\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\nThis repository tracks the latest research on representation engineering (RepE), which was originally introduced by [Zou et al. (2023)](https://arxiv.org/abs/2310.01405). The goal is to offer a comprehensive list of papers and resources relevant to the topic. Work that falls under the umbrella of representation engineering is also included.\n\n\u003e [!NOTE]\n\u003e If you believe your paper on representation engineering (or related topics) is not included, or if you find a mistake, typo, or information that is not up to date, please open an issue, and I will address it as soon as possible.\n\u003e\n\u003e If you want to add a new paper, feel free to either open an issue or create a pull request.\n\nAlso:\n\n\u003e [!IMPORTANT]\n\u003e *Note that representation engineering is a relatively new framework, so the categorization below reflects my subjective understanding of the techniques. 
The first list includes work that explicitly uses the term \"representation engineering.\" Other closely related work is grouped in the later lists.*\n\u003e\n\u003e If you disagree with the categorization or have suggestions for improvement, please let me know by opening an issue.\n\n## Table of Contents\n\n- [Table of Contents](#table-of-contents)\n- [Papers](#papers)\n  - [Representation engineering](#representation-engineering)\n  - [Steering vectors](#steering-vectors)\n  - [Concept activation vectors](#concept-activation-vectors)\n  - [Other relevant papers](#other-relevant-papers)\n- [Blog Posts](#blog-posts)\n\n## Papers\n\n### Representation engineering\n\n- [Enhancing Multiple Dimensions of Trustworthiness in LLMs via Sparse Activation Control](https://arxiv.org/abs/2411.02461)\n  - Author(s): Yuxin Xiao, Chaoqun Wan, Yonggang Zhang, Wenxiao Wang, Binbin Lin, Xiaofei He, Xu Shen, Jieping Ye\n  - Date: 2024-11\n  - Venue: NeurIPS 2024\n  - Code: -\n- [Towards Reliable Evaluation of Behavior Steering Interventions in LLMs](https://arxiv.org/abs/2410.17245)\n  - Author(s): Itamar Pres, Laura Ruis, Ekdeep Singh Lubana, David Krueger\n  - Date: 2024-10\n  - Venue: NeurIPS 2024 Workshop on Foundation Model Interventions\n  - Code: -\n- [Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering](https://arxiv.org/abs/2410.15999)\n  - Author(s): Yu Zhao, Alessio Devoto, Giwon Hong, Xiaotang Du, Aryo Pradipta Gema, Hongru Wang, Xuanli He, Kam-Fai Wong, Pasquale Minervini\n  - Date: 2024-10\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/yuzhaouoe/SAE-based-representation-engineering)\n- [A Timeline and Analysis for Representation Plasticity in Large Language Models](https://arxiv.org/abs/2410.06225)\n  - Author(s): Akshat Kannan\n  - Date: 2024-10\n  - Venue: -\n  - Code: 
[![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/UltraTsar/NonTrivialRepE_Timeline)\n- [Gradient-based Jailbreak Images for Multimodal Fusion Models](https://arxiv.org/abs/2410.03489)\n  - Author(s): Javier Rando, Hannah Korevaar, Erik Brinkman, Ivan Evtimov, Florian Tramèr\n  - Date: 2024-10\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/facebookresearch/multimodal-fusion-jailbreaks)\n- [Towards Inference-time Category-wise Safety Steering for Large Language Models](https://arxiv.org/abs/2410.01174)\n  - Author(s): Amrita Bhattacharjee, Shaona Ghosh, Traian Rebedea, Christopher Parisien\n  - Date: 2024-10\n  - Venue: -\n  - Code: -\n- [Words in Motion: Representation Engineering for Motion Forecasting](https://arxiv.org/abs/2406.11624)\n  - Author(s): Omer Sahin Tas, Royden Wagner\n  - Date: 2024-06\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/kit-mrt/future-motion)\n- [Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets](https://arxiv.org/abs/2406.08124)\n  - Author(s): Duanyu Feng, Bowen Qin, Chen Huang, Youcheng Huang, Zheng Zhang, Wenqiang Lei\n  - Date: 2024-06\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/colfeng/Legend)\n- [PaCE: Parsimonious Concept Engineering for Large Language Models](https://arxiv.org/abs/2406.04331)\n  - Author(s): Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Darshan Thaker, Aditya Chattopadhyay, Chris Callison-Burch, René Vidal\n  - Date: 2024-06\n  - Venue: -\n  - Code: 
[![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/peterljq/Parsimonious-Concept-Engineering)\n- [Improving Alignment and Robustness with Circuit Breakers](https://arxiv.org/abs/2406.04313)\n  - Author(s): Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks\n  - Date: 2024-06\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/blackswan-ai/circuit-breakers)\n- [ConTrans: Weak-to-Strong Alignment Engineering via Concept Transplantation](https://arxiv.org/abs/2405.13578)\n  - Author(s): Weilong Dong, Xinwei Wu, Renren Jin, Shaoyang Xu, Deyi Xiong\n  - Date: 2024-05\n  - Venue: -\n  - Code: -\n- [Towards General Conceptual Model Editing via Adversarial Representation Engineering](https://arxiv.org/abs/2404.13752)\n  - Author(s): Yihao Zhang, Zeming Wei, Jun Sun, Meng Sun\n  - Date: 2024-04\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/Zhang-Yihao/Adversarial-Representation-Engineering)\n- [Towards Uncovering How Large Language Model Works: An Explainability Perspective](https://arxiv.org/abs/2402.10688)\n  - Author(s): Haiyan Zhao, Fan Yang, Bo Shen, Himabindu Lakkaraju, Mengnan Du\n  - Date: 2024-02\n  - Venue: -\n  - Code: -\n- [Tradeoffs Between Alignment and Helpfulness in Language Models](https://arxiv.org/abs/2401.16332)\n  - Author(s): Yotam Wolf, Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, Amnon Shashua\n  - Date: 2024-01\n  - Venue: -\n  - Code: -\n- [Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering](https://arxiv.org/abs/2401.06824)\n  - Author(s): Tianlong Li, Shihan Dou, Wenhao Liu, Muling Wu, Changze Lv, Xiaoqing Zheng, Xuanjing Huang\n  - Date: 
2024-01\n  - Venue: -\n  - Code: -\n- [Aligning Large Language Models with Human Preferences through Representation Engineering](https://arxiv.org/abs/2312.15997)\n  - Author(s): Wenhao Liu, Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang\n  - Date: 2023-12\n  - Venue: -\n  - Code: -\n- [Representation Engineering: A Top-Down Approach to AI Transparency](https://arxiv.org/abs/2310.01405)\n  - Author(s): Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks\n  - Date: 2023-10\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/andyzoujm/representation-engineering)\n\n### Steering vectors\n\n- [Can sparse autoencoders be used to decompose and interpret steering vectors?](https://arxiv.org/abs/2411.08790)\n  - Author(s): Harry Mayne, Yushi Yang, Adam Mahdi\n  - Date: 2024-11\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/HarryMayne/SV_interpretability)\n- [Extracting Unlearned Information from LLMs with Activation Steering](https://arxiv.org/abs/2411.02631)\n  - Author(s): Atakan Seyitoğlu, Aleksei Kuvshinov, Leo Schwinn, Stephan Günnemann\n  - Date: 2024-11\n  - Venue: NeurIPS 2024 Workshop on Safe Generative AI\n  - Code: -\n- [Improving Steering Vectors by Targeting Sparse Autoencoder Features](https://arxiv.org/abs/2411.02193)\n  - Author(s): Sviatoslav Chalnev, Matthew Siu, Arthur Conmy\n  - Date: 2024-11\n  - Venue: -\n  - Code: 
[![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/slavachalnev/SAE-TS)\n- [Improving Instruction-Following in Language Models through Activation Steering](https://arxiv.org/abs/2410.12877)\n  - Author(s): Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi\n  - Date: 2024-10\n  - Venue: -\n  - Code: -\n- [Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors](https://arxiv.org/abs/2410.12299)\n  - Author(s): Weixuan Wang, Jingyuan Yang, Wei Peng\n  - Date: 2024-10\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/weixuan-wang123/SADI)\n- [Activation Scaling for Steering and Interpreting Language Models](https://arxiv.org/abs/2410.04962)\n  - Author(s): Niklas Stoehr, Kevin Du, Vésteinn Snæbjarnarson, Robert West, Ryan Cotterell, Aaron Schein\n  - Date: 2024-10\n  - Venue: EMNLP 2024 Findings\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/niklasstoehr/activationScaling)\n- [Uncovering Latent Chain of Thought Vectors in Language Models](https://arxiv.org/abs/2409.14026)\n  - Author(s): Jason Zhang, Scott Viteri\n  - Date: 2024-09\n  - Venue: ICLR 2024 Tiny Paper\n  - Code: -\n- [Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective](https://arxiv.org/abs/2409.10053)\n  - Author(s): Van-Cuong Pham, Thien Huu Nguyen\n  - Date: 2024-09\n  - Venue: -\n  - Code: -\n- [Analyzing the Generalization and Reliability of Steering Vectors](https://arxiv.org/abs/2407.12404)\n  - Author(s): Daniel Tan, David Chanin, Aengus Lynch, Dimitrios Kanoulas, Brooks Paige, Adria Garriga-Alonso, Robert Kirk\n  - Date: 2024-07\n  - Venue: -\n  - Code: -\n- [Future Events as Backdoor Triggers: Investigating Temporal 
Vulnerabilities in LLMs](https://arxiv.org/abs/2407.04108)\n  - Author(s): Sara Price, Arjun Panickssery, Sam Bowman, Asa Cooper Stickland\n  - Date: 2024-07\n  - Venue: -\n  - Code: -\n- [Steering Without Side Effects: Improving Post-Deployment Control of Language Models](https://arxiv.org/abs/2406.15518)\n  - Author(s): Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman\n  - Date: 2024-06\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/AsaCooperStickland/kl-then-steer)\n- [Who's asking? User personas and the mechanics of latent misalignment](https://arxiv.org/abs/2406.12094)\n  - Author(s): Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael A. Lepori, Lucas Dixon\n  - Date: 2024-06\n  - Venue: -\n  - Code: -\n- [Controlling Large Language Model Agents with Entropic Activation Steering](https://arxiv.org/abs/2406.00244)\n  - Author(s): Nate Rahn, Pierluca D'Oro, Marc G. 
Bellemare\n  - Date: 2024-06\n  - Venue: -\n  - Code: -\n- [Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories](https://arxiv.org/abs/2406.00034)\n  - Author(s): Tianlong Wang, Xianfeng Jiao, Yifan He, Zhongzhi Chen, Yinghao Zhu, Xu Chu, Junyi Gao, Yasha Wang, Liantao Ma\n  - Date: 2024-06\n  - Venue: -\n  - Code: -\n- [Activation Steering for Robust Type Prediction in CodeLLMs](https://arxiv.org/abs/2404.01903)\n  - Author(s): Francesca Lucchetti, Arjun Guha\n  - Date: 2024-04\n  - Venue: -\n  - Code: -\n- [Extending Activation Steering to Broad Skills and Multiple Behaviours](https://arxiv.org/abs/2403.05767)\n  - Author(s): Teun van der Weij, Massimo Poesio, Nandi Schoots\n  - Date: 2024-03\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/TeunvdWeij/extending-activation-addition)\n- [Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models](https://arxiv.org/abs/2402.19465)\n  - Author(s): Chen Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhenfei Yin, Yu Qiao, Yong Liu, Jing Shao\n  - Date: 2024-02\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/ChnQ/TracingLLM)\n- [MiMiC: Minimally Modified Counterfactuals in the Representation Space](https://arxiv.org/abs/2402.09631)\n  - Author(s): Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru\n  - Date: 2024-02\n  - Venue: -\n  - Code: -\n- [Investigating Bias Representations in Llama 2 Chat via Activation Steering](https://arxiv.org/abs/2402.00402)\n  - Author(s): Dawn Lu, Nina Rimsky\n  - Date: 2024-02\n  - Venue: -\n  - Code: -\n- [InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance](https://arxiv.org/abs/2401.11206)\n  - Author(s): 
Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, Xipeng Qiu\n  - Date: 2024-01\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/Jihuai-wpy/InferAligner)\n- [Steering Llama 2 via Contrastive Activation Addition](https://arxiv.org/abs/2312.06681)\n  - Author(s): Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner\n  - Date: 2023-12\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/nrimsky/CAA)\n- [Improving Activation Steering in Language Models with Mean-Centring](https://arxiv.org/abs/2312.03813)\n  - Author(s): Ole Jorgensen, Dylan Cope, Nandi Schoots, Murray Shanahan\n  - Date: 2023-12\n  - Venue: -\n  - Code: -\n- [Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment](https://arxiv.org/abs/2311.09433)\n  - Author(s): Haoran Wang, Kai Shu\n  - Date: 2023-11\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/wang2226/Backdoor-Activation-Attack)\n- [The Linear Representation Hypothesis and the Geometry of Large Language Models](https://arxiv.org/abs/2311.03658)\n  - Author(s): Kiho Park, Yo Joong Choe, Victor Veitch\n  - Date: 2023-11\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/KihoPark/linear_rep_geometry)\n- [Activation Addition: Steering Language Models Without Optimization](https://arxiv.org/abs/2308.10248)\n  - Author(s): Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, Monte MacDiarmid\n  - Date: 2023-08\n  - Venue: -\n  - Code: 
[![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/montemac/activation_additions)\n- [Extracting Latent Steering Vectors from Pretrained Language Models](https://arxiv.org/abs/2205.05124)\n  - Author(s): Nishant Subramani, Nivedita Suresh, Matthew E. Peters\n  - Date: 2022-05\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/nishantsubramani/steering_vectors)\n\n### Concept activation vectors\n\n- [Visual-TCAV: Concept-based Attribution and Saliency Maps for Post-hoc Explainability in Image Classification](https://arxiv.org/abs/2411.05698)\n  - Author(s): Antonio De Santis, Riccardo Campi, Matteo Bianchi, Marco Brambilla\n  - Date: 2024-11\n  - Venue: -\n  - Code: -\n- [Decision Trees for Interpretable Clusters in Mixture Models and Deep Representations](https://arxiv.org/abs/2411.01576)\n  - Author(s): Maximilian Fleissner, Maedeh Zarvandi, Debarghya Ghoshdastidar\n  - Date: 2024-11\n  - Venue: -\n  - Code: -\n- [Exploiting Text-Image Latent Spaces for the Description of Visual Concepts](https://arxiv.org/abs/2410.17832)\n  - Author(s): Laines Schmalwasser, Jakob Gawlikowski, Joachim Denzler, Julia Niebling\n  - Date: 2024-10\n  - Venue: -\n  - Code: -\n- [KTCR: Improving Implicit Hate Detection with Knowledge Transfer driven Concept Refinement](https://arxiv.org/abs/2410.15314)\n  - Author(s): Samarth Garg, Vivek Hruday Kavuri, Gargi Shroff, Rahul Mishra\n  - Date: 2024-10\n  - Venue: -\n  - Code: -\n- [LG-CAV: Train Any Concept Activation Vector with Language Guidance](https://arxiv.org/abs/2410.10308)\n  - Author(s): Qihan Huang, Jie Song, Mengqi Xue, Haofei Zhang, Bingde Hu, Huiqiong Wang, Hao Jiang, Xingen Wang, Mingli Song\n  - Date: 2024-10\n  - Venue: NeurIPS 2024\n  - Code: 
[![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/hqhQAQ/LG-CAV)\n- [Looking into Concept Explanation Methods for Diabetic Retinopathy Classification](https://arxiv.org/abs/2410.03188)\n  - Author(s): Andrea M. Storås, Josefine V. Sundgaard\n  - Date: 2024-10\n  - Venue: -\n  - Code: -\n- [EQ-CBM: A Probabilistic Concept Bottleneck with Energy-based Models and Quantized Vectors](https://arxiv.org/abs/2409.14630)\n  - Author(s): Sangwon Kim, Dasom Ahn, Byoung Chul Ko, In-su Jang, Kwang-Ju Kim\n  - Date: 2024-09\n  - Venue: ACCV 2024\n  - Code: -\n- [TextCAVs: Debugging vision models using text](https://arxiv.org/abs/2408.08652)\n  - Author(s): Angus Nicolson, Yarin Gal, J. Alison Noble\n  - Date: 2024-08\n  - Venue: -\n  - Code: -\n- [Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector](https://arxiv.org/abs/2404.12038)\n  - Author(s): Zhihao Xu, Ruixuan Huang, Xiting Wang, Fangzhao Wu, Jing Yao, Xing Xie\n  - Date: 2024-04\n  - Venue: -\n  - Code: -\n- [Explaining Explainability: Understanding Concept Activation Vectors](https://arxiv.org/abs/2404.03713)\n  - Author(s): Angus Nicolson, Lisa Schut, J. 
Alison Noble, Yarin Gal\n  - Date: 2024-04\n  - Venue: -\n  - Code: -\n- [Demystifying Embedding Spaces using Large Language Models](https://arxiv.org/abs/2310.04475)\n  - Author(s): Guy Tennenholtz, Yinlam Chow, Chih-Wei Hsu, Jihwan Jeong, Lior Shani, Azamat Tulepbergenov, Deepak Ramachandran, Martin Mladenov, Craig Boutilier\n  - Date: 2023-10\n  - Venue: ICLR 2024\n  - Code: -\n\n### Other relevant papers\n\n- [The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets](https://arxiv.org/abs/2310.06824)\n  - Author(s): Samuel Marks, Max Tegmark\n  - Date: 2023-10\n  - Venue: -\n  - Code: [![GitHub](https://img.shields.io/badge/GitHub-181717?style=flat\u0026logo=github\u0026logoColor=white)](https://github.com/saprmarks/geometry-of-truth)\n\n## Blog Posts\n\n- [Representation Engineering Mistral-7B an Acid Trip](https://vgel.me/posts/representation-engineering/)\n- [Representation Engineering for Control Vector](https://mlops.substack.com/p/representation-engineering-for-control)\n\nOther relevant posts:\n\n- [Refusal in LLMs is mediated by a single direction](https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction)\n- [Simple probes can catch sleeper agents](https://www.anthropic.com/research/probes-catch-sleeper-agents)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchrisliu298%2Fawesome-representation-engineering","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchrisliu298%2Fawesome-representation-engineering","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchrisliu298%2Fawesome-representation-engineering/lists"}