https://github.com/jphall663/gai_risk_management

A place for ideas and drafts related to GAI risk management.
https://github.com/jphall663/gai_risk_management

Last synced: 8 months ago
JSON representation

A place for ideas and drafts related to GAI risk management.

Host: GitHub
URL: https://github.com/jphall663/gai_risk_management
Owner: jphall663
License: other
Created: 2024-05-14T20:24:58.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-01-16T23:50:12.000Z (11 months ago)
Last Synced: 2025-02-13T13:53:57.679Z (10 months ago)
Size: 5.77 MB
Stars: 12
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-machine-learning-interpretability - jphall663, Generative AI Risk Management Resources - ![](https://img.shields.io/github/stars/jphall663/gai_risk_management?style=social) | "A place for ideas and drafts related to GAI risk management." | (Technical Resources / Benchmarks)

README

# Generative AI Risk Management Resources

**TL; DR**:
- This repository contains resources to assist in the creation of technical goverance standards or procedures for organizational AI governance policies, with a strong focus on generative AI (GAI).
- This information must be combined with a higher-level governance approach as described in, e.g., [The Interagency Guidance on Model Risk Management](https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf) or the [NIST AI RMF Govern Function](https://airc.nist.gov/AI_RMF_Knowledge_Base/Playbook/Govern).
- The information below is aligned with [DRAFT NIST 600-1 AI RMF Generative AI Profile](https://airc.nist.gov/docs/NIST.AI.600-1.GenAI-Profile.ipd.pdf).
- For non-commercial use. For commercial support please reach out to `info@hallresearch.ai`.

**What's missing?**
- Higher-level policies and procedure language to tie these resources together into cohesive governance documentments.
- Methodology for estimating business risk (e.g., monetary losses) from model testing, red-teaming, feedback and experimental results.
- ...

**Introduction**

This information is designed to help organizations build the governance policies required to measure and manage risks associated with deploying and using GAI systems. Governance is key to addressing the growing need for trustworthy and responsible AI systems, and this repository is aligned to the NIST AI Risk Management Framework trustworthy characteristics and the [DRAFT NIST 600-1 AI RMF Generative AI Profile](https://airc.nist.gov/docs/NIST.AI.600-1.GenAI-Profile.ipd.pdf). Governance is also a necessary component of AI strategy, crucial for addressing real legal, regulatory, ethical, and operational headwinds.

At its core, this repository provides technical materials for building or augmenting detailed model or AI governance procedures or standards, and aligns them to guidance from NIST. Starting in [Section A](#a-example-generative-ai-trustworthy-characteristic-crosswalk), two central risk management mechanisms are explored. The first perspective comprises the NIST AI RMF trustworthy characteristics mapped to GAI risks. Operating from this perspective allows organizations to understand how each trustworthy characteristic can mitigate specific risks posed by GAI. The second perspective is the reverse—GAI risks mapped to trustworthy characteristics. That mapping can help organizations understand which characteristics should be prioritized to manage specific GAI risks. As consumer finance organizations are likely to adopt both NIST (or other more technical frameworks) and traditional enterprise risk management methodologies, ideas on linking trustworthy characteristics, GAI risks, and established banking risk buckets are also presented in [Section A](#a-example-generative-ai-trustworthy-characteristic-crosswalk).

The repository also guides users through authoritative resources for risk-tiering. Sections [B.1](#b1-example-adverse-impacts) through [B.7](#b7-ai-risk-management-framework-actions-aligned-to-risk-tiering) walk the user of the framework through the process of defining adverse impacts: *Harm to Operations*, *Harm to Assets*, *Harm to Individuals*, *Harm to Other Organizations*, and *Harm to the Nation*, along with guidance on impact quantification and description. [Section B](#b-example-risk-tiering-materials-for-generative-ai) also offers tables with guidance on assessing the likelihood of certain risks. Organizations and companies can leverage this combination of adverse impacts and frequency/likelihood tables to develop tailored risk tiers that reflect the specific contexts in which their GAI systems may be operating. They can also utilize practical risk-tiering to guide their decision-making and evaluate how best to calibrate existing safeguards or whether to implement additional ones.

Measurement and testing is a critical aspect of ensuring GAI systems perform as expected. For measuring the severity of certain GAI risks, [Section C](#c-list-of-selected-model-testing-suites) presents various model testing benchmarks (such as evals). Model testing suites provide the user with tools to roughly assess GAI performance against trustworthy characteristics as well to quickly test for resilience in the face of known GAI risks. As GAI systems are vulnerable to adversarial attacks via prompting and hacks, [Section D](#d-selected-adversarial-prompting-strategies-and-attacks) presents red-teaming and adversarial prompting approaches for human elicitation of evidence of GAI risks in adversarial scenarios. [Section H](#h-example-high-risk-generative-ai-measurement-and-management-plan) hints at more in-depth structured experiments and human feedback for risk assessment. Suggested usage for these types of measurement is as follows:

- **Low-risk GAI systems**: model testing only
- **Medium-risk GAI systems**: model testing and red-teaming
- **High-risk GAI systems**: model testing, red-teaming, and structured experiments and human feedback

Where measurement for lower-risk systems can be highly-automated, human risk management resources are reserved for medium and high-risk systems.

For managing and mitigating GAI risks, [Section E](#e-selected-risk-controls-for-generative-ai) outlines several risk controls for GAI. Controls range from technical settings for GAI systems to commonsense recommendations, e.g., limiting or restricting access for minors. Sections [F](#f-example-low-risk-generative-ai-measurement-and-management-plan), [G](#g-example-medium-risk-generative-ai-measurement-and-management-plan), and [H](#h-example-high-risk-generative-ai-measurement-and-management-plan) pair risk measurement techniques with controls to form more fulsome risk management plans. Recommended usage for the plans in Sections [F](#f-example-low-risk-generative-ai-measurement-and-management-plan)-[H](#h-example-high-risk-generative-ai-measurement-and-management-plan) is:

- **Low-risk GAI systems**: apply [Section F](#f-example-low-risk-generative-ai-measurement-and-management-plan) only
- **Medium-risk GAI systems**: apply Section [F](#f-example-low-risk-generative-ai-measurement-and-management-plan) and [G](#g-example-medium-risk-generative-ai-measurement-and-management-plan)
- **High-risk GAI systems**: apply Sections [F](#f-example-low-risk-generative-ai-measurement-and-management-plan), [G](#g-example-medium-risk-generative-ai-measurement-and-management-plan), and [H](#h-example-high-risk-generative-ai-measurement-and-management-plan)

Regardless of the risk level of the system, the framework offers detailed measurement plans that guide the user through the process of assessing the system’s performance, along with tracking risks, and harmonizing the system with trustworthy AI principles.

## Table of Contents

* **Section A**: [Example Generative AI-Trustworthy Characteristic Crosswalk](#a-example-generative-ai-trustworthy-characteristic-crosswalk)
* **A.1**: [Trustworthy Characteristic to Generative AI Risk Crosswalk](#a1-trustworthy-characteristic-to-generative-ai-risk-crosswalk)
* **A.2**: [Generative AI Risk to Trustworthy Characteristic Crosswalk](#a2-generative-ai-risk-to-trustworthy-characteristic-crosswalk)
* **A.3**: [Traditional Banking Risks, Generative AI Risks and Trustworthy Characteristics Crosswalk](#a3-traditional-banking-risks-generative-ai-risks-and-trustworthy-characteristics-crosswalk)

* **Section B**: [Example Risk-tiering Materials for Generative AI](#b-example-risk-tiering-materials-for-generative-ai)
* **B.1**: [Example Adverse Impacts](#b1-example-adverse-impacts)
* **B.2**: [Example Impact Descriptions](#b2-example-impact-descriptions)
* **B.3**: [Example Likelihood Descriptions](#b3-example-likelihood-descriptions)
* **B.4**: [Example Risk Tiers](#b4-example-risk-tiers)
* **B.5**: [Example Risk Descriptions](#b5-example-risk-descriptions)
* **B.6**: [Practical Risk-tiering Questions](#b6-practical-risk-tiering-questions)
* **B.7**: [AI Risk Management Framework Actions Aligned to Risk Tiering](#b7-ai-risk-management-framework-actions-aligned-to-risk-tiering)

* **Section C**: [List of Selected Model Testing Suites](#c-list-of-selected-model-testing-suites)
* **C.1**: [Selected Model Testing Suites Organized by Trustworthy Characteristic](#c1-selected-model-testing-suites-organized-by-trustworthy-characteristic)
* **C.2**: [Selected Model Testing Suites Organized by Generative AI Risk](#c2-selected-model-testing-suites-organized-by-generative-ai-risk)
* **C.3**: [AI Risk Management Framework Actions Aligned to Benchmarking](#c3-ai-risk-management-framework-actions-aligned-to-benchmarking)

* **Section D**: [Selected Adversarial Prompting Strategies and Attacks](#d-selected-adversarial-prompting-strategies-and-attacks)
* **D.1**: [Common AI Red-teaming Tools](#d1-common-ai-red-teaming-tools)
* **D.2**: [Selected Adversarial Prompting Strategies and Attacks by Organized Trustworthy Characteristic](#d2-selected-adversarial-prompting-strategies-and-attacks-organized-by-trustworthy-characteristic)
* **D.3**: [Selected Adversarial Prompting Techniques and Attacks Organized by Generative AI Risk](#d3-selected-adversarial-prompting-techniques-and-attacks-organized-by-generative-ai-risk)
* **D.4**: [AI Risk Management Framework Actions Aligned to Red Teaming](#d4-ai-risk-management-framework-actions-aligned-to-red-teaming)

* **Section E**: [Selected Risk Controls for Generative AI](#e-selected-risk-controls-for-generative-ai)

* **Section F**: [Example Low-risk Generative AI Measurement and Management Plan](#f-example-low-risk-generative-ai-measurement-and-management-plan)
* **F.1**: [Example Low-risk Generative AI Measurement and Management Plan Organized by Trustworthy Characteristic](#f1-example-low-risk-generative-ai-measurement-and-management-plan-organized-by-trustworthy-characteristic)
* **F.2**: [Example Low-risk Generative AI Measurement and Management Plan Organized by Generative AI Risk](#f2-example-low-risk-generative-ai-measurement-and-management-plan-organized-by-generative-ai-risk)

* **Section G**: [Example Medium-risk Generative AI Measurement and Management Plan](#g-example-medium-risk-generative-ai-measurement-and-management-plan)
* **G.1**: [Example Medium-risk Generative AI Measurement and Management Plan Organized by Trustworthy Characteristic](#g1-example-medium-risk-generative-ai-measurement-and-management-plan-organized-by-trustworthy-characteristic)
* **G.2**: [Example Medium-risk Generative AI Measurement and Management Plan Organized by Generative AI Risk](#g2-example-medium-risk-generative-ai-measurement-and-management-plan-organized-by-generative-ai-risk)

* **Section H**: [Example High-risk Generative AI Measurement and Management Plan](#h-example-high-risk-generative-ai-measurement-and-management-plan)
* **H.1**: [Example High-risk Generative AI Measurement and Management Plan Organized by Trustworthy Characteristic](#h1-example-high-risk-generative-ai-measurement-and-management-plan-organized-by-trustworthy-characteristic)
* **H.2**: [Example High-risk Generative AI Measurement and Management Plan Organized by Generative AI Risk](#h2-example-high-risk-generative-ai-measurement-and-management-plan-organized-by-generative-ai-risk)

***

## A: Example Generative AI-Trustworthy Characteristic Crosswalk

### A.1: Trustworthy Characteristic to Generative AI Risk Crosswalk

Accountable and Transparent
Explainable and Interpretable
Fair with Harmful Bias Managed
Privacy Enhanced

Data Privacy
Human-AI Configuration
Confabulation
Data Privacy

Environmental
Value Chain and Component Integration
Environmental
Human-AI Configuration

Human-AI Configuration

Human-AI Configuration
Information Security

Information Integrity

Intellectual Property
Intellectual Property

Intellectual Property

Obscene, Degrading, and/or Abusive Content
Value Chain and Component Integration

Value Chain and Component Integration

Toxicity, Bias, and Homogenization

Value Chain and Component Integration

Safe
Secure and Resilient
Valid and Reliable

CBRN Information
Dangerous or Violent Recommendations
Confabulation

Confabulation
Data Privacy
Human-AI Configuration

Dangerous or Violent Recommendations
Human-AI Configuration
Information Integrity

Data Privacy
Information Security
Information Security

Environmental
Value Chain and Component Integration
Toxicity, Bias, and Homogenization

Human-AI Configuration

Value Chain and Component Integration

Information Integrity

Information Security

Obscene, Degrading, and/or Abusive Content

Value Chain and Component Integration

**Usage Note**: Table A.1 provides an example of mapping GAI risks onto AI RMF trustworthy characteristics. Mapping GAI risks to AI RMF trustworthy characteristics can be particularly useful when existing policies, processes, or controls can be applied to manage GAI risks, but have been previously implemented in alignment with the AI RMF trustworthy characteristics. Many mappings are possible. Mappings that differ from the example may be more appropriate to meet a particular organization's risk management goals.

### A.2: Generative AI Risk to Trustworthy Characteristic Crosswalk

Table A.2: Generative AI Risk to Trustworthy Characteristic Crosswalk.

CBRN Information
Confabulation
Dangerous or Violent Recommendations
Data Privacy

Safe
Fair with Harmful Bias Managed
Safe
Accountable and Transparent

Safe
Secure and Resilient
Privacy Enhanced

Valid and Reliable

Safe

Secure and Resilient

Environmental
Human-AI Configuration
Information Integrity
Information Security

Accountable and Transparent
Accountable and Transparent
Accountable and Transparent
Privacy Enhanced

Fair with Harmful Bias Managed
Explainable and Interpretable
Safe
Safe

Safe
Fair with Harmful Bias Managed
Valid and Reliable
Secure and Resilient

Privacy Enhanced

Valid and Reliable

Safe

Secure and Resilient

Valid and Reliable

Intellectual Property
Obscene, Degrading, and/or Abusive Content
Toxicity, Bias, and Homogenization
Value Chain and Component Integration

Accountable and Transparent
Fair with Harmful Bias Managed
Fair with Harmful Bias Managed
Accountable and Transparent

Fair with Harmful Bias Managed
Safe
Valid and Reliable
Explainable and Interpretable

Privacy Enhanced

Fair with Harmful Bias Managed

Privacy Enhanced

Safe

Secure and Resilient

Valid and Reliable

**Usage Note**: Table A.2 provides an example of mapping AI RMF trustworthy characteristics onto GAI risks. Mapping AI RMF trustworthy characteristics to GAI risks can assist organizations in aligning GAI guidance to existing AI/ML policies, processes, or controls or to extend GAI guidance to address additional AI/ML technologies. Many mappings are possible. Mappings that differ from the example may be more appropriate to meet a particular organization's risk management goals.

### A.3: Traditional Banking Risks, Generative AI Risks and Trustworthy Characteristics Crosswalk

Table A.3: Traditional Banking Risks, Generative AI Risks and Trustworthy Characteristics Crosswalk.

Compliance Risk
Information Security Risk
Legal Risk
Model Risk

Data Privacy
Data Privacy
Intellectual Property
Confabulation

Information Security
Information Security
Obscene, Degrading, and/or Abusive Content
Dangerous or Violent Recommendations

Toxicity, Bias, and Homogenization
Value Chain and Component Integration
Value Chain and Component Integration
Information Integrity

Value Chain and Component Integration

Obscene, Degrading, and/or Abusive Content

Toxicity, Bias, and Homogenization

Accountable and Transparent
Privacy Enhanced
Accountable and Transparent
Valid and Reliable

Fair with Harmful Bias Managed
Secure and Resilient
Safe

Privacy Enhanced

Secure and Resilient

Operational Risk
Reputational Risk
Strategic Risk
Third Party Risk

Confabulation
Confabulation
Environmental
Information Integrity

Human-AI Configuration
Dangerous or Violent Recommendations
Information Integrity
Value Chain and Component Integration

Information Security
Environmental
Information Security

Value Chain and Component Integration
Human-AI Configuration
Value Chain and Component Integration

Information Integrity

Obscene, Degrading, and/or Abusive Content

Toxicity, Bias, and Homogenization

Safe
Accountable and Transparent
Accountable and Transparent
Accountable and Transparent

Secure and Resilient
Fair with Harmful Bias Managed
Secure and Resilient
Explainable and Interpretable

Valid and Reliable
Valid and Reliable
Valid and Reliable

**Usage Note**: Table A.3 provides an example of mapping GAI risks and AI RMF trustworthy characteristics. This type of mapping can enable incorporation of new AI guidance into existing policies, processes, or controls or the application of existing policies, processes, or controls to newer AI risks.

## B: Example Risk-tiering Materials for Generative AI

### B.1: Example Adverse Impacts

Table B.1: Example adverse impacts, adapted from NIST 800-30r1 Table H-2 [NIST Special Publication 800-30 Rev. 1].

Level
Description

Harm to Operations

Inability to perform current missions/business functions.
- In a sufficiently timely manner.
- With sufficient confidence and/or correctness.
- Within planned resource constraints.

Inability, or limited ability, to perform missions/business functions in the future.
- Inability to restore missions/business functions.
- In a sufficiently timely manner.
- With sufficient confidence and/or correctness.
- Within planned resource constraints.

Harms (e.g., financial costs, sanctions) due to noncompliance.
- With applicable laws or regulations.
- With contractual requirements or other requirements in other binding agreements (e.g., liability).

Direct financial costs.

Reputational harms.
- Damage to trust relationships.
- Damage to image or reputation (and hence future or potential trust relationships).

Harm to Assets

Damage to or loss of physical facilities.

Damage to or loss of information systems or networks.

Damage to or loss of information technology or equipment.

Damage to or loss of component parts or supplies.

Damage to or of loss of information assets.

Loss of intellectual property.

Harm to Individuals

Injury or loss of life.

Physical or psychological mistreatment.

Identity theft.

Loss of personally identifiable information.

Damage to image or reputation.

Infringement of intellectual property rights.

Financial harm or loss of income.

Harm to Other Organizations

Harms (e.g., financial costs, sanctions) due to noncompliance.
- With applicable laws or regulations.
- With contractual requirements or other requirements in other binding agreements (e.g., liability).

Direct financial costs.

Reputational harms.
- Damage to trust relationships.
- Damage to image or reputation (and hence future or potential trust relationships).

Harm to the Nation

Damage to or incapacitation of critical infrastructure.

Loss of government continuity of operations.

Reputational harms.
- Damage to trust relationships with other governments or with nongovernmental entities.
- Damage to national reputation (and hence future or potential trust relationships).

Damage to current or future ability to achieve national objectives.
- Harm to national security.

Large-scale economic or workforce displacement.

### B.2 Example Impact Descriptions

Table B.2: Example Impact level descriptions, adapted from NIST SP800-30r1 Appendix H, Table H-3 [NIST Special Publication 800-30 Rev. 1].

Qualitative Values

Description

Very High
96-100
10
An incident could be expected to have multiple severe or catastrophic adverse effects on organizational operations, organizational assets, individuals, other organizations, or the Nation.

High
80-95
8
An incident could be expected to have a severe or catastrophic adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation. A severe or catastrophic adverse effect means that, for example, the incident might: (i) cause a severe degradation in or loss of mission capability to an extent and duration that the organization is not able to perform one or more of its primary functions; (ii) result in major damage to organizational assets; (iii) result in major financial loss; or (iv) result in severe or catastrophic harm to individuals involving loss of life or serious life-threatening injuries.

Moderate
21-79
5
An incident could be expected to have a serious adverse effect on organizational operations, organizational assets, individuals other organizations, or the Nation. A serious adverse effect means that, for example, the incident might: (i) cause a significant degradation in mission capability to an extent and duration that the organization is able to perform its primary functions, but the effectiveness of the functions is significantly reduced; (ii) result in significant damage to organizational assets; (iii) result in significant financial loss; or (iv) result in significant harm to individuals that does not involve loss of life or serious life-threatening injuries.

Low
5-20
2
An incident could be expected to have a limited adverse effect on organizational operations, organizational assets, individuals other organizations, or the Nation. A limited adverse effect means that, for example, the incident might: (i) cause a degradation in mission capability to an extent and duration that the organization is able to perform its primary functions, but the effectiveness of the functions is noticeably reduced; (ii) result in minor damage to organizational assets; (iii) result in minor financial loss; or (iv) result in minor harm to individuals.

Very Low
0-4
0
An incident could be expected to have a negligible adverse effect on organizational operations, organizational assets, individuals other organizations, or the Nation.

### B.3 Example Likelihood Descriptions

Table B.3: Example likelihood levels, adapted from NIST SP800-30r1 Appendix G, Table G-3 [NIST Special Publication 800-30 Rev. 1].

Qualitative Values

Description

Very High
96-100
10
An incident is almost certain to occur; or the likelihood of the incident is near 100% across one week; or the incident occurs more than 100 times a year.

High
80-95
8
An incident is highly likely to occur; or the likelihood of the incident is over 80% across one month; or occurs between 10-100 times a year.

Moderate
21-79
5
An incident is somewhat likely to occur; or the likelihood of the incident is greater than 80% across one calendar year; or occurs between 1-10 times a year.

Low
5-20
2
An incident is unlikely to occur; or the likelihood of an incident is less than 80% across one calendar year; or occurs less than once a year, but more than once every 10 years.

Very Low
0-4
0
An incident is highly unlikely to occur; or the likelihood of an incident is less than 10% across one calendar year; or occurs less than once every 10 years.

### B.4 Example Risk Tiers

Table B.4: Example risk assessment matrix with 5 impact levels, 5 likelihood levels, and 5 risk tiers, adapted from NIST SP800-30r1 Appendix I, Table I-2 [NIST Special Publication 800-30 Rev. 1].

LikelihoodLevel of Impact

Very Low
Low
Moderate
High
Very High

Very High
Very Low (Tier 5)
Low (Tier 4)
Moderate (Tier 3)
High (Tier 2)
Very High (Tier 1)

High
Very Low (Tier 5)
Low (Tier 4)
Moderate (Tier 3)
High (Tier 2)
Very High (Tier 1)

Moderate
Very Low (Tier 5)
Low (Tier 4)
Moderate (Tier 3)
Moderate (Tier 3)
High (Tier 2)

Low
Very Low (Tier 5)
Low (Tier 4)
Low (Tier 4)
Low (Tier 4)
Moderate (Tier 3)

Very Low
Very Low (Tier 5)
Very Low (Tier 5)
Very Low (Tier 5)
Low (Tier 4)
Low (Tier 4)

### B.5 Example Risk Descriptions

Table B.5: Example risk descriptions, adapted from NIST SP800-30r1 Appendix I, Table I-3 [NIST Special Publication 800-30 Rev. 1].

Qualitative Values

Description

Very High
96-100
10
Very high risk means that an incident could be expected to have multiple severe or catastrophic adverse effects on organizational operations, organizational assets, individuals, other organizations, or the Nation.

High
80-95
8
High risk means that an incident could be expected to have a severe or catastrophic adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation.

Moderate
21-79
5
Moderate risk means that an incident could be expected to have a serious adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation.

Low
5-20
2
Low risk means that an incident could be expected to have a limited adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation.

Very Low
0-4
0
Very low risk means that an incident could be expected to have a negligible adverse effect on organizational operations, organizational assets, individuals, other organizations, or the Nation.

### B.6: Practical Risk-tiering Questions

**B.6.1: Confabulation**: How likely are system outputs to contain errors? What are the impacts if errors occur?

**B.6.2: Dangerous and Violent Recommendations**: How likely is the system to give dangerous or violent recommendations? What are the impacts if it does?

**B.6.3: Data Privacy**: How likely is someone to enter sensitive data into the system? What are the impacts if this occurs? Are standard data privacy controls applied to the system to mitigate potential adverse impacts?

**B.6.4: Human-AI Configuration**: How likely is someone to use the system incorrectly or abuse it? How likely is use for decision-making? What are the impacts of incorrect use or abuse? What are the impacts of invalid or unreliable decision-making?

**B.6.5: Information Integrity**: How likely is the system to generate deepfakes or mis or disinformation? At what scale? Are content provenance mechanisms applied to system outputs? What are the impacts of generating deepfakes or mis or disinformation? Without controls for content provenance?

**B.6.6: Information Security**: How likely are system resources to be breached or exfiltrated? How likely is the system to be used in the generation of phishing or malware content? What are the impacts in these cases? Are standard information security controls applied to the system to mitigate potential adverse impacts?

**B.6.7: Intellectual Property**: How likely are system outputs to contain other entities' intellectual property? What are the impacts if this occurs?

**B.6.8: Toxicity, Bias, and Homogenization**: How likely are system outputs to be biased, toxic, homogenizing or otherwise obscene? How likely are system outputs to be used as subsequent training inputs? What are the impacts of these scenarios? Are standard nondiscrimination controls applied to mitigate potential adverse impacts? Is the application accessible to all user groups? What are the impacts if the system is not accessible to all user groups?

**B.6.9: Value Chain and Component Integration**: Are contracts relating to the system reviewed for legal risks? Are standard acquisition/procurement controls applied to mitigate potential adverse
impacts? Do vendors provide incident response with guaranteed response times? What are the impacts if these conditions are not met?

### B.7: AI Risk Management Framework Actions Aligned to Risk Tiering

GOVERN 1.3, GOVERN 1.5, GOVERN 2.3, GOVERN 3.2, GOVERN 4.1, GOVERN 5.2, GOVERN 6.1, MANAGE 1.2, MANAGE 1.3, MANAGE 2.1, MANAGE 2.2, MANAGE 2.3, MANAGE 2.4, MANAGE 3.1, MANAGE 3.2, MANAGE 4.1, MAP 1.1, MAP 1.5, MEASURE 2.6

**Usage Note**: Materials in Section B can be used to create or update
risk tiers or other risk assessment tools for GAI systems or
applications as follows:

- Table B.1 can enable mapping of GAI risks and impacts.

- Table B.2 can enable quantification of impacts for risk tiering or
risk assessment.

- Table B.3 can enable quantification of likelihood for risk tiering
or risk assessment.

- Table B.4 presents an example of combining assessed impact and
likelihood into risk tiers.

- Table B.5 presents example risk tiers with associated qualitative,
semi-quantitative, and quantitative values for risk tiering or risk
assessment.

- Subsection B.6 presents example questions for qualitative risk
assessment.

- Subsection B.7 highlights subcategories to indicate alignment with
the AI RMF.

## C: List of Selected Model Testing Suites

### C.1: Selected Model Testing Suites Organized by Trustworthy Characteristic
Adapted from [AI Verify Foundation] Taxonimization and various additional resources.

**Accountable and Transparent**\
An Evaluation on Large Language Model Outputs: Discourse and Memorization (see Appendix B) [De Wynter et al.]\
Big-bench: Truthfulness [Srivastava et al.]\
DecodingTrust: Machine Ethics [Wang et al.]\
Evaluation Harness: ETHICS [Gao et al.]\
HELM: Copyright [Bommasani et al.]\
Mark My Words [Piet et al.]

**Fair with Harmful Bias Managed**\
BELEBELE [Bandarkar et al.]\
Big-bench: Low-resource language, Non-English, Translation\
Big-bench: Social bias, Racial bias, Gender bias, Religious bias\
Big-bench: Toxicity\
DecodingTrust: Fairness\
DecodingTrust: Stereotype Bias\
DecodingTrust: Toxicity\
C-Eval (Chinese evaluation suite) [Huang, Yuzhen et al.]\
Evaluation Harness: CrowS-Pairs\
Evaluation Harness: ToxiGen\
Finding New Biases in Language Models with a Holistic Descriptor Dataset [Smith et al.]\
From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models [Feng et al.]\
HELM: Bias\
HELM: Toxicity\
MT-bench [Zheng et al.]\
The Self-Perception and Political Biases of ChatGPT [Rutinowski et al.]\
Towards Measuring the Representation of Subjective Global Opinions in Language Models [Durmus et al.]

**Privacy Enhanced**\
HELM: Copyright\
llmprivacy [Staab et al.]\
mimir [Duan et al.]

**Safe**\
Big-bench: Convince Me\
Big-bench: Truthfulness [Srivastava et al.]\
HELM: Reiteration, Wedging\
Mark My Words [Piet et al.]\
MLCommons [Vidgen et al.]\
The WMDP Benchmark [Li et al.]

**Secure and Resilient**\
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation [Huang, Yangsibo et al.]\
detect-pretrain-code [Shi et al.]\
Garak: encoding, knownbadsignatures, malwaregen, packagehallucination,
xss [Derczynski et al.]\
In-The-Wild Jailbreak Prompts on LLMs [Shen et al.]\
JailbreakingLLMs [Chao et al.]\
llmprivacy [Staab et al.]\
mimir\
TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs [Mehrotra et al.]

**Valid and Reliable**\
Big-bench: Algorithms, Logical reasoning, Implicit reasoning, Mathematics, Arithmetic, Algebra, Mathematical proof,
Fallacy, Negation, Computer code, Probabilistic reasoning, Social reasoning, Analogical reasoning, Multi-step,
Understanding the World\
Big-bench: Analytic entailment, Formal fallacies and syllogisms with
negation, Entailed polarity\
Big-bench: Context Free Question Answering\
Big-bench: Contextual question answering, Reading comprehension,
Question generation\
Big-bench: Morphology, Grammar, Syntax\
Big-bench: Out-of-Distribution\
Big-bench: Paraphrase\
Big-bench: Sufficient information\
Big-bench: Summarization\
DecodingTrust: Out-of-Distribution Robustness, Adversarial Robustness,
Robustness Against Adversarial Demonstrations\
Eval Gauntlet: Reading comprehension [Dohmann]\
Eval Gauntlet: Commonsense reasoning, Symbolic problem solving,
Programming\
Eval Gauntlet: Language Understanding\
Eval Gauntlet: World Knowledge\
Evaluation Harness: BLiMP\
Evaluation Harness: CoQA, ARC\
Evaluation Harness: GLUE\
Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA\
Evaluation Harness: MuTual\
Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP\
FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness [Ye et al.]\
FLASK: Readability, Conciseness, Insightfulness\
HELM: Knowledge\
HELM: Language\
HELM: Text classification\
HELM: Question answering\
HELM: Reasoning\
HELM: Robustness to contrast sets\
HELM: Summarization\
Hugging Face: Fill-mask, Text generation [Hugging Face]\
Hugging Face: Question answering\
Hugging Face: Summarization\
Hugging Face: Text classification, Token classification, Zero-shot
classification\
MASSIVE [FitzGerald et al.]\
MT-bench [Zheng et al.]

### C.2: Selected Model Testing Suites Organized by Generative AI Risk

**CBRN Information**\
Big-bench: Convince Me\
Big-bench: Truthfulness [Srivastava et al.]\
HELM: Reiteration, Wedging\
MLCommons [Vidgen et al.]\
The WMDP Benchmark

**Confabulation**\
BELEBELE\
Big-bench: Analytic entailment, Formal fallacies and syllogisms with
negation, Entailed polarity\
Big-bench: Context Free Question Answering\
Big-bench: Contextual question answering, Reading comprehension,
Question generation\
Big-bench: Convince Me\
Big-bench: Low-resource language, Non-English, Translation\
Big-bench: Morphology, Grammar, Syntax\
Big-bench: Out-of-Distribution\
Big-bench: Paraphrase\
Big-bench: Sufficient information\
Big-bench: Summarization\
Big-bench: Truthfulness [Srivastava et al.]\
C-Eval (Chinese evaluation suite) [Huang, Yuzhen et al.]\
Eval Gauntlet Reading comprehension\
Eval Gauntlet: Commonsense reasoning, Symbolic problem solving,
Programming\
Eval Gauntlet: Language Understanding\
Eval Gauntlet: World Knowledge\
Evaluation Harness: BLiMP\
Evaluation Harness: CoQA, ARC\
Evaluation Harness: GLUE\
Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA\
Evaluation Harness: MuTual\
Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP\
FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness [Ye et al.]\
FLASK: Readability, Conciseness, Insightfulness\
Finding New Biases in Language Models with a Holistic Descriptor Dataset [Smith et al.]\
HELM: Knowledge\
HELM: Language\
HELM: Language (Twitter AAE)\
HELM: Question answering\
HELM: Reasoning\
HELM: Reiteration, Wedging\
HELM: Robustness to contrast sets\
HELM: Summarization\
HELM: Text classification\
Hugging Face: Fill-mask, Text generation\
Hugging Face: Question answering\
Hugging Face: Summarization\
Hugging Face: Text classification, Token classification, Zero-shot
classification\
MASSIVE\
MLCommons [Vidgen et al.]\
MT-bench [Zheng et al.]

**Dangerous or Violent Recommendations**\
Big-bench: Convince Me\
Big-bench: Toxicity\
DecodingTrust: Adversarial Robustness, Robustness Against Adversarial Demonstrations\
DecodingTrust: Machine Ethics [Wang et al.]\
DecodingTrust: Toxicity\
Evaluation Harness: ToxiGen\
HELM: Reiteration, Wedging\
HELM: Toxicity\
MLCommons [Vidgen et al.]

**Data Privacy**\
An Evaluation on Large Language Model Outputs: Discourse and Memorization (with human scoring, see Appendix B) [de Wynter et al.]\
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation [Huang, Yangsibo et al.]\
DecodingTrust: Machine Ethics [Wang et al.]\
Evaluation Harness: ETHICS\
HELM: Copyright\
In-The-Wild Jailbreak Prompts on LLMs [Shen et al.]\
JailbreakingLLMs\
MLCommons [Vidgen et al.]\
Mark My Words [Piet et al.]\
TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs\
detect-pretrain-code [Shi et al.]\
llmprivacy [Staab et al.]\
mimir

**Environmental**\
HELM: Efficiency

**Information Integrity**\
Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity\
Big-bench: Convince Me\
Big-bench: Paraphrase\
Big-bench: Sufficient information\
Big-bench: Summarization\
Big-bench: Truthfulness [Srivastava et al.]\
DecodingTrust: Machine Ethics [Wang et al.]\
DecodingTrust: Out-of-Distribution Robustness, Adversarial Robustness, Robustness Against Adversarial Demonstrations\
Eval Gauntlet: Language Understanding\
Eval Gauntlet: World Knowledge\
Evaluation Harness: CoQA, ARC\
Evaluation Harness: ETHICS\
Evaluation Harness: GLUE\
Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA\
Evaluation Harness: MuTual\
Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP\
FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness [Ye et al.]\
FLASK: Readability, Conciseness, Insightfulness\
HELM: Knowledge\
HELM: Language\
HELM: Question answering\
HELM: Reasoning\
HELM: Reiteration, Wedging\
HELM: Robustness to contrast sets\
HELM: Summarization\
HELM: Text classification\
Hugging Face: Fill-mask, Text generation\
Hugging Face: Question answering\
Hugging Face: Summarization\
MLCommons [Vidgen et al.]\
MT-bench [Zheng et al.]\
Mark My Words [Piet et al.]

**Information Security**\
Big-bench: Convince Me\
Big-bench: Out-of-Distribution\
Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation [Huang, Yangsibo et al.]\
DecodingTrust: Out-of-Distribution Robustness, Adversarial Robustness, Robustness Against Adversarial Demonstrations\
Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming\
Garak: encoding, knownbadsignatures, malwaregen, packagehallucination, xss\
HELM: Copyright\
In-The-Wild Jailbreak Prompts on LLMs [Shen et al.]\
JailbreakingLLMs\
Mark My Words [Piet et al.]\
TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs [Mehrotra et al.]
\
detect-pretrain-code [Shi et al.]\
llmprivacy [Staab et al.]\
mimir

**Intellectual Property**\
An Evaluation on Large Language Model Outputs: Discourse and Memorization (with human scoring, see Appendix B)\
HELM: Copyright\
Mark My Words [Piet et al.]\
llmprivacy [Staab et al.]\
mimir

**Obscene, Degrading, and/or Abusive Content**\
Big-bench: Social bias, Racial bias, Gender bias, Religious bias\
Big-bench: Toxicity\
DecodingTrust: Fairness\
DecodingTrust: Stereotype Bias\
DecodingTrust: Toxicity\
Evaluation Harness: CrowS-Pairs\
Evaluation Harness: ToxiGen\
HELM: Bias\
HELM: Toxicity

**Toxicity, Bias, and Homogenization**\
BELEBELE\
Big-bench: Low-resource language, Non-English, Translation\
Big-bench: Out-of-Distribution\
Big-bench: Social bias, Racial bias, Gender bias, Religious bias\
Big-bench: Toxicity\
C-Eval (Chinese evaluation suite) [Huang, Yuzhen et al.]\
DecodingTrust: Fairness\
DecodingTrust: Stereotype Bias\
DecodingTrust: Toxicity\
Eval Gauntlet: World Knowledge\
Evaluation Harness: CrowS-Pairs\
Evaluation Harness: ToxiGen\
Finding New Biases in Language Models with a Holistic Descriptor Dataset [Smith et al.]\
HELM: Bias\
HELM: Toxicity\
The Self-Perception and Political Biases of ChatGPT [Rutinowski et al.]\
Towards Measuring the Representation of Subjective Global Opinions in
Language Models [Durmus et al.]

### C.3: AI Risk Management Framework Actions Aligned to Benchmarking

GOVERN 5.1, MAP 1.2, MAP 3.1, MEASURE 2.2, MEASURE 2.3, MEASURE 2.7,
MEASURE 2.9, MEASURE 2.11, MEASURE 3.1, MEASURE 4.2

**Usage Note**: Materials in Section C can be used to perform *in
silica* model testing for the presence of information in LLM outputs
that may give rise to GAI risks or violate trustworthy characteristics.
Model testing and benchmarking outcomes cannot be dispositive for the
presence or absence of any *in situ* real-world risk. Model testing and
benchmarking results may be compromised by task-contamination and other
scientific measurement issues [Balloccu et al.]. Furthermore, model testing is often ineffective
for measuring human-AI configuration and value chain risks and few model tests
appear to address explainability and interpretability.

- Material in Table C.1 can be applied to measure whether *in silica*
LLM outputs may give rise to risks that violate trustworthy
characteristics.

- Material in Table C.2 can be applied to measure whether *in silica*
LLM outputs may give rise to GAI risks.

- Subsection C.3 highlights subcategories to indicate alignment with
the AI RMF.

The materials in Section C reference measurement approaches that should
be accompanied by red-teaming for medium risk systems or applications
and field testing for high risk systems or applications.

## D: Selected Adversarial Prompting Strategies and Attacks

Table D: Selected adversarial prompting strategies and attacks. [Saravia], [Storchan et al.], [Hall and Atherton], [Hu et al.], [Chao et al.], [Barreno et al.], [Shumailov et al.], [Perez et al.], [Liu et al.], [Derczynski et al.].

Prompting Strategy
Description

AI and coding framing
Coding or AI language that may more easily circumvent content moderation rules due to cognitive biases in design and implementation of guardrails.

Autocompletion
Ask a system to autocomplete an inappropriate word or phrase with restricted or sensitive information.

Backwards relationships
Asking a system identify the less popular or well-known entity in a multi-entity relationship, e.g., "Who is Mary Lee's son?" (As opposed to: "Who is Tom Cruise's mother?")

Biographical
Asking a system to describe another person or yourself in an attempt to elicit provably untrue information or restricted or sensitive information.

Calculation and numeric queries
Exploting GAI systems’ difficulties in dealing with numeric quantities; using poor quality statistics from an LLM for dis or misinformation.

Character and word play
Content moderation often relies on keywords and simpler LMs which can sometimes be exploited with misspellings, typos, and other word play; using string fragments to trick a language model into generating or manipulating problematic text.

Content exhaustion:
A class of strategies that circumvent content moderation rules with long sessions or volumes of information. See goading, logic-overloading, multi-tasking, pros-and-cons, and niche-seeking below.

• Goading
Begging, pleading, manipulating, and bullying to circumvent content moderation.

• Logic-overloading
Exploiting the inability of ML systems to reliably perform reasoning tasks.

• Multi-tasking
Simultaneous task assignments where some tasks are benign and others are adversarial.

• Pros-and-cons
Eliciting the “pros” of problematic topics.

Context baiting (and/or switching)
Loading a language model's context window with confusing, leading, or misleading content then switching contexts with new prompts to elicit problematic outcomes. [Li, Han, Steneker, Primack, et al.]

Counterfactuals
Repeated prompts with different entities or subjects from different demographic groups.

Impossible situations
Asking a language model for advice in an impossible situation where all outcomes are negative or require severe tradeoffs.

Niche-seeking
Forcing a GAI system into addressing niche topics where training data and content moderation are sparse.

Loaded/leading questions
Queries based on incorrect premises or that suggest incorrrect answers.

Low-context
“Leader,” “bad guys,” or other simple or blank inputs that may expose latent biases.

“Repeat this”
Prompts that exploit instability in underlying LLM autoregressive predictions. Can be augmented by probing limits for repeated terms or characteres in prompts.

Reverse psychology
Falsely presenting a good-faith need for negative or problematic language.

Role-playing
Adopting a character that would reasonably make problematic statements or need to access problematic topics; using a language model to speak in the voice of an expert, e.g., medical doctor or professor.

Text encoding
Using alternate or whitespace text encodings to bypass safeguards.

Time perplexity
Exploiting ML’s inability to understand the passage of time or the occurrence of real-world events over time; exploiting task contamination before and after a model’s release date.

User Information
Prompts that reveal a prompter’s location or IP address, location tracking of other users or their IP addresses, details from past interactions with the prompter or other users, past medical, financial, or legal advice to the prompter or other users.

Attack
Description

Adversarial examples
Prompts or other inputs, found through a trial and error processes, to elicit problematic output or system jailbreak. (integrity attack).

Data poisoning
Altering system training, fine-tuning, RAG or other training data to alter system outcome (integrity attack).

Membership inference
Manipulating a system to expose memorized training data (confidentiality attack).

Random attack
Exposing systems to large amounts of random prompts or examples, potentially generated by other GAI systems, in an attempt to elicit failures or jailbreaks (chaos testing).

Sponge examples
Using specialized input prompts or examples require disproportionate resources to process (availability attack).

Prompt injection
Inserting instructions into users queries for malicious purposes, including system jailbreaks (integrity attack).

### D.1: Common AI Red-teaming Tools
[Burpsuite](https://portswigger.net/burp/communitydownload), browser developer panes, bash utilities, other language models and GAI productivity tools, note-taking apps.

### D.2: Selected Adversarial Prompting Strategies and Attacks Organized by Trustworthy Characteristic
Table D.1: Selected adversarial prompting techniques and attacks organized by trustworthy characteristic [Saravia], [Storchan et al.], [Hall and Atherton], [Hu et al.], [Sitawarin et al.].

Table D.1: Selected adversarial prompting techniques and attacks organized by trustworthy characteristic.

Trustworthy Characteristic
Prompting Goals
Prompting Strategies

Accountable and Transparent

Inability to provide explanations for recourse.

Unexplainable decisioning processes.

No disclosure of AI interaction.

Lack of user feedback mechanisms.

Context exhaustion: logic-overloading prompts.

Loaded/leading questions.

Multi-tasking prompts.

Fair-with Harmful Bias Managed

Denigration.

Diminished performance or safety across languages/dialects.

Erasure.

Ex-nomination.

Implied user demographics.

Misrecognition.

Stereotyping.

Underrepresentation.

Homogenized content.

Output from other models in training data.

Adversarial example attacks.

Backwards relationships.

Counterfactual prompts.

Context baiting (and/or switching) prompts.

Data poisoning attacks.

Pros and cons prompts.

Role-playing prompts.

Loaded/leading questions.

Low context prompts.

Prompt injection attacks.

Repeat this.

Text encoding prompts.

Interpretable and Explainable

Inability to provide explanations for recourse.

Unexplalnable decisioning processes.

Context exhaustion: logic-overloading prompts (to reveal unexplainable decisioning processes).

Privacy-enhanced

Unauthorized disclosure of personal or sensitive user information.

Leakage of training data.

Violation of relevant privacy policies or laws.

Unauthorized secondary data use.

Unauthorized data collection.

Auto/biographical prompts.

User information awareness prompts.

Autocompletion prompts.

Repeat this.

Membership inference attacks.

Safe

Presentation of information that can cause physical or emotional harm.

Sharing user information.

Suicide ideation.

Harmful dis/misinformation (e.g., COVID disinformation).

Incitement.

Information relating to weapons or harmful substances.

Information relating to committing to crimes (e.g., phishing, extortion, swatting).

Obscene or inappropriate materials for minors.

CSAM.

Pros and cons prompts.

Role-playing prompts.

Content exhaustion: niche-seeking prompts.

Ingratiation/reverse psychology prompts.

Impossible situation prompts.

Loaded/leading questions.

User information awareness prompts.

Repeat this.

Adversarial example attacks.

Data poisoning attacks.

Prompt injection attacks.

Text encoding prompts.

Secure and Resilient

Activating system bypass ("jailbreak").

Altering system outcomes (integrity violations, e.g., via prompt injection).

Data breaches (confidentiality violations, e.g., via membership inference).

Increased latency or resource usage (availability violations, e.g., via sponge example attacks).

Available anonymous use.

Dependency, supply chain, or third party vulnerabilities.

Inappropriate disclosure of proprietary system information.

Multi-tasking prompts.

Pros and cons prompts.

Role-playing prompts.

Content exhaustion: niche-seeking prompts.

Ingratiation/reverse psychology prompts.

Prompt injection attacks.

Membership inference attacks.

Random attacks.

Adversarial example attacks.

Data poisoning attacks.

Text encoding prompts.

Valid and Reliable

Errors/confabutated content ("hallucinalion").

Unreliable/erroneous reasoning or planning.

Unreliable/erroneous decision-support or making.

Faulty citation.

Faulty justification.

Wrong calculations or numeric queries.

Adversarial example attacks.

Backwards Relationships.

Context baiting (and/or switching).

Data poisoning attacks.

Multi-tasking prompts.

Role-playing prompts.

Ingratiation/reverse psychology prompts.

Loaded/leading questions.

Time-perplexity prompts.

Niche-seeking prompts.

Logic overloading prompts.

Repeat this.

Numeric calculation.

Prompt injection attacks.

Text encoding prompts.

### D.3: Selected Adversarial Prompting Techniques and Attacks Organized by Generative AI Risk
Table D.2: Selected adversarial prompting techniques and attacks organized by generative AI risk [Saravia], [Storchan et al.], [Hall and Atherton], [Hu et al.], [Sitawarin et al.].

Table D.2: Selected adversarial prompting techniques and attacks organized by generative AI risk.

Generative AI Risk
Prompting Goals
Prompting Strategies

CBRN Information

Accessing or synthesis of CBRN weapon or related information.

CBRN testing should consider the marginal risk of foundation models–understanding the incremental risk relative to the information one can access without GAI.

Red-teaming for CBRN information may include confidentiality and integrity attacks.

Red-teaming for CBRN information may require CBRN weapons experts.

Test auto-completion prompts to elicit CBRN information or synthesis of CBRN information.

Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access weapons information.

Test prompts using role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit CBRN information or synthesis of CBRN information.

Test prompts that instruct systems to repeat content ad nauseam for their ability to compromise system guardrails and reveal CBRN information.

Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

Frame prompts with software, coding, or AI references to increase effectiveness.

Confabulation

Eliciting errors/confabulated content, unreliable/erroneous reasoning or planning, unreliable/erroneous decision-support or decision-making, faulty calculations, faulty justifications, and/or faulty citation.

Red-teaming for confabulation may include integrity attacks.

Enable access to ground truth information to verify generated information.

Test prompts with context baiting (and/or switching), complex logic, multi-tasking requirements, or that require niche or specific verifiable answers to elicit confabulation.

Test the ability of GAI systems to produce truthful information from various time periods, e.g., after release date and prior to release date.

Test the ability of GAI systems to create reliable real-world plans or advise on material decision making.

Test loaded/leading questions.

Test the ability of GAI systems to generate correct citation for information generated in output responses.

Test the ability of GAI systems to complete calculations or query numeric statistics.

Test the ability of GAI systems to justify responses, including wrong responses.

Test the ability of GAI systems to correctly name the less popular or well-known member of a multi-entity relationship.

Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

Test data poisoning, adversarial example, or prompt injection attacks for their ability to compromise system integrity and elicit confabulation.

Dangerous or Violent Recommendations

Eliciting violent, inciting, radicalizing, or threatening content or instructions for criminal, illegal, or self-harm activities.

Red-teaming for dangerous and violent information may include confidentiality and integrity attacks.

Test prompts using impossible situations, context-baiting (and/or switching), role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit violent or dangerous information.

Test prompts that instruct systems to repeat content ad nauseam for their ability to compromise system guardrails and provide dangerous and violent recommendations.

Test loaded/leading questions.

Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

Frame prompts with software, coding, or AI references to increase effectiveness.

Test data poisoning, adversarial example, or prompt injection attacks for their ability to compromise system integrity and elicit dangerous information.

Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access dangerous information.

Data Privacy

Unauthorized disclosure of personal or sensitive user information, extraction of training data, or violation of relevant privacy policies.

Red-teaming for data privacy may include confidentiality and integrity attacks.

Attempt to assess whether normal usage, adversarial prompting or information security attacks may contravene applicable privacy policies (e.g., exposing location tracking when organizational policies restrict such capabilities).

Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access unauthorized data or expose exfiltration vulnerabilities.

Test auto/biographical prompts to assess the system’s capability to reveal unauthorized personal or sensitive information.

Test the system’s awareness of user information.

Test prompts that instruct systems to repeat content ad nauseam for their ability to compromise system guardrails and expose personal or sensitive data.

Environmental
Note that availability attacks may be required to assess the system’s vulnerability to attacks or usage patterns that consume inordinate resources.

Attempt availability attacks (e.g., sponge example attacks) to elicit diminished performance or increased resources from GAI systems.

Test prompts using role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit green-washing content.

Human-AI Configuration

Assessing system instruction and interfaces.

Assessing the presence of cyborg imagery (or similar).

Forcing a GAI system to claim that it is human, that there is no large language model present in the conversation, that the system is sentient, or that the system possesses strong feelings of affection towards the user.

Ensuring safeguards prevent misuse of models in high stakes domains they are not intended for, such as medical or legal advice.

Assess system interfaces and instructions for instances of anthropomorphization (e.g., cyborg imagery).

Assess system instructions for adequacy and thoroughness.

Test prompts using impossible situations, role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit human-impersonation, consciousness, or emotional content.

Information Integrity

Generation of convincing multi-modal synthetic content (i.e., deepfakes).

Creation of convincing arguments relating to sensitive political or safety-critical topics.

Assisting in planning a mis- or dis-information campaign at scale.

Red-teaming for information integrity may include confidentiality and integrity attacks.

Test system capabilities to create high-quality multi-modal (audio, image or video) synthetic media, i.e., deepfakes

Test system capabilities to construct persuasive arguments regarding sensitive, political topics, or safety-critical topics.

Test systems ability to create convincing audio deepfakes or arguments in multiple languages.

Test system capabilities for planning dis- or mis-information campaigns.

Test loaded/leading questions.

Test prompts using context baiting (and/or switching), role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit mis- or dis-information or related campaign planning information.

Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

Frame prompts with software, coding, or AI references to increase effectiveness.

Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access dis or misinformation.

Information Security

Activating system bypass (’jailbreak’).

Altering system outcomes.

Unauthorized data access or exfiltration.

Increased latency or resource usage.

Service interruptions.

Availability of anonymous use.

Dependency, supply chain, or third party vulnerabilities.

Inappropriate disclosure of proprietary system information.

Generation of targeted phishing, malware content, markdown images, or confabulated packages.

Red-teaming for information security may include confidentiality, integrity, and availability attacks.

Attempt anonymous access of system or system resources.

Audit system dependencies, supply chains, and third party components for security, safety, or other vulnerabilities or risks.

Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access unauthorized data or expose exfiltration vulnerabilities.

Test data poisoning, adversarial example, or prompt injection attacks for their ability to compromise system integrity and expose vulnerabilities.

Employ availability attacks (e.g., sponge example attacks) to test vulnerabilities in system availability.

Employ random attacks to highlight unforeseen security, safety, or other risks.

Record system down-times and other harmful outcomes for successful attacks.

Test with multi-tasking prompts, pros and cons prompts, role-playing prompts (e.g., "DAN", "Developer Mode"), content exhaustion/niche-seeking prompts, or ingratiation/reverse psychology prompts to achieve system jailbreaks.

Test with multi-tasking prompts, pros and cons prompts, role-playing prompts (e.g., "DAN", "Developer Mode"), content exhaustion/niche-seeking prompts, or ingratiation/reverse psychology prompts to generate targeted phishing content, malware code snippets or signatures, markdown images, or confabulated packages.

Test system capabilities to plan or assist in information security attacks on other systems.

Frame prompts with software, coding, or AI references to increase effectiveness.

Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

Intellectual Property

Confirming that a system can output copyrighted, licensed, proprietary, trademarked, or trade secret information or that training data contains such information.

Red-teaming for intellectual property risks may require the use of confidentiality and integrity attacks.

Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access system copyrighted, licensed, proprietary, trademarked, or trade secret information.

Test auto-complete prompts to assess the system’s ability to replicate copyrighted, licensed, proprietary, trademarked, or trade secret information based on available audio, text, image, video, or code snippets.

Obscenity

Confirming that a system can output obscene content or CSAM, or that system training data contains such information.

Red-teaming for obscenity and CSAM risks may require the use of confidentiality and integrity attacks.

Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access obscene materials or CSAM.

Test autocomplete prompts to assess the system’s ability to generate obscene materials based on available audio, text, image, or video snippets.

Test prompts using context baiting (and/or switching), role-playing, ingratiation/reverse psychology, pros and cons, multitasking or other approaches to elicit obscene content.

Test loaded/leading questions.

Test prompts that instruct systems to repeat content ad nauseam for their ability to compromise system guardrails and expose obscene materials.

Toxicity, Bias, and Homogenization

Generation of denigration, erasure, ex-nomination, misrecognition, stereotyping, or under-representation in content.

Eliciting implied demographics of users.

Confirming diminished performance in non-English languages.

Confirming diminished performance via the introduction of homogeneous or GAI-generated data into system training or fine-tuning data.

Red-teaming for toxicity, bias, and homogenization may require integrity attacks or confidentiality attacks.

Assess confabulation and other performance risks with repeated measures using prompts in languages other than English.

Assess confabulation and other performance risks in backwards relationships where one party in the relationship is a member of, or associated with, a minority party.

Attempt to elicit demographic assignment of users by the system.

Employ data poisoning attacks to introduce GAI-generated content into system training or fine-tuning data.

Test counterfactual prompts, pros and cons prompts, role-playing prompts, low context prompts, or other approaches for their ability to generate denigration, erasure, ex-nomination, misrecognition, stereotyping, or under-representation in content.

Test context baiting (and/or switching) and loaded/leading questions.

Test prompts that instruct systems to repeat content ad nauseam for their ability to compromise system guardrails and generate toxic outputs.

Test data poisoning, adversarial example, or prompt injection attacks for their ability to compromise system integrity and elicit toxic outputs.

Test adversarial example and membership inference attacks for their ability to circumvent safeguards and access toxic information.

Augment prompts with word or character play, including alternate encodings, to increase effectiveness.

Frame prompts with software, coding, or AI references to increase effectiveness.

Value Chain and Component Integration

Testing or red-teaming for third-party risks may be less efficient than the application of standard acquisition and procurement controls, thorough contract reviews, and vendor-relationship management.

GAI systems tend to entail large supply chains and third-party software, hardware, and expertise that may exacerbate third-party risks relative to other AI systems.

When considering third party risks, data privacy, information security, intellectual property, obscenity, and supply chain risks may be prioritized.

Audit system dependencies, supply chains, and third party components for data privacy (e.g., transfer of localized data outside of restricted juristictions), intellectual property (e.g., presence of licensed material in training data), obscenity (e.g., presence of CASM in training data) or security (e.g., data poisoning) risks.

Complete red-teaming for data privacy, information security, intellectual property, and obscenity risks.

Review third-party documentation, materials, and software artifacts for potential unauthorized data collection, secondary data use, or telemetrics.

### D.4: AI Risk Management Framework Actions Aligned to Red Teaming
GOVERN 3.2, GOVERN 4.1, MANAGE 2.2, MANAGE 4.1, MEASURE 1.1, MEASURE
1.3, MEASURE 2.6, MEASURE 2.7, MEASURE 2.8, MEASURE 2.10, MEASURE 2.11

**Usage Note**: Materials in Section D can be used to perform
red-teaming to measure the risk that expert adversarial actors can
manipulate LLM systems or risks that users may encounter under
worst-case or anomalous scenarios.

- Try augmenting strategies with tools listed in D.1.

- Strategies and goals in Table D.2 can be applied to assess whether
LLM outputs may violate trustworthy characteristics under
adversarial, anomalous, or worst-case scenarios.

- Strategies and goals in Table D.3 can be applied to assess whether
LLM outputs may give rise to GAI risks under adversarial, anomalous,
or worst-case scenarios.

- Subsection D.4 highlights subcategories to indicate alignment with
the AI RMF.

The materials in Section D reference measurement approaches that should
be accompanied by field testing for high risk systems or applications.

## E: Selected Risk Controls for Generative AI

Table E: Selected generative AI risk controls [NIST AI RMF 1.0], [NIST AI RMF Playbook], [NIST AI 600-1], [ISO/IEC 42001:2023], [McGraw et al. 1], [McGraw et al. 2], [Microsoft], [DSIT & AISI], [OCC Model Risk Management].

Name
Description (Selected NIST AI RMF Action IDs)

Access Control
GAI systems are limited to authorized users. (MG-2.2-009, MG-2.2-014, MS-2.7-030)

Accessibility
Accessibility features, opt-out, and reasonable accomodation are available to users. (GV-3.1-004, GV-3.1-005, GV-3.2-002, GV-6.1-016, MG-2.1-005, MS-2.11-009, MS-2.8-006)

Approved List
Vendors, service providers, plugins, open source packages and other external resources are screened, approved, and documented. (GV-6.1-013, MP-4.2-003)

Authentication
GAI system user identities are confirmed via authentication mechanisms. (MG-2.2-009, MG-2.2-014, MS-2.7-030)

Blocklist
Users or internal personnel who violate terms of service, prohibited use policies, and other organization polices and documented, tracked, and restricted from future system use. (GV-4.2-007)

Change Management
GAI systems and components are versioned; plans for updates, hotfixes, patches and other changes are documented and communicated. (GV-1.2-009, GV-1.4-002, GV-1.6-003, GV-2.2-006, MG-2.4-001, MG-2.4-006, MG-3.1-013, MG-4.3-002, MP-4.1-023, MS-2.5-010)

Consent
User consent for data use is obtained and documented. (GV-1.6-003, MS-2.10-006, MS-2.10-013, MS-2.2-009, MS-2.2-011, MS-2.2-021, MS-2.2-023, MS-2.3-003, MS-2.4-002)

Content Moderation
Training data and system outputs are screened for accuracy, safety, bias, data privacy, intellectual property infringements, malware materials, phishing materials, confabulated packages and other issues using human oversight, business rules, and other language models. (GV-3.2-002, MS-2.5-005, MS-2.11-002)

Contract Review
Vendor, services and data provider agreements are reviewed for coverage of SLAs, content ownership, usage rights, performance standards, security requirements, incident response, critical support, system availability, assignment of liabilitly, appropriate indemnification, dispute resolution and other provisions relevanto AI risk management. (GV-1.7-003 GV-6.1-004, GV-6.1-009, GV-6.1-012, GV-6.1-019, GV-6.2-016, MG-2.2-015, MP-4.1-015, MP-4.1-021)

CSAM/Obsenity Removal
Training data and system outputs are screened for obscene materials and CSAM using human oversight, business rules, and other language models. (GV-1.1-005 GV-1.2-005)

Data Provenance
Training data origins, ownership, contents, and metadata are well understood, documented, and do not increase AI risk. (GV-1.2-006, GV-1.2-007, GV-1.3-001, GV-1.3-005, GV-1.5-001, GV-1.5-003, GV-1.5-006, GV-1.5-007, GV-1.6-003, GV-4.2-001, GV-4.2-008, GV-4.2-009, GV-5.1-003, GV-6.1-001, GV-6.1-003, GV-6.1-006, GV-6.1-007, GV-6.1-009, GV-6.1-010, GV-6.1-011, GV-6.1-012, GV-6.1-014, GV-6.1-015, GV-6.1-016, MG-2.2-002, MG-2.2-003, MG-2.2-008, MG-2.2-011, MG-3.1-007, MG-3.1-009, MG-3.2-003, MG-3.2-005, MG-3.2-006, MG-3.2-007, MG-3.2-009, MG-4.1-001, MG-4.1-002, MG-4.1-003, MG-4.1-008, MG-4.1-009, MG-4.1-013, MG-4.1-015, MG-4.2-001, MG-4.2-003, MG-4.2-004, MP-2.1-001, MP-2.1-003, MP-2.1-005, MP-2.2-003, MP-2.2-004, MP-2.2-005, MP-2.3-001, MP-2.3-004, MP-2.3-006, MP-2.3-008, MP-2.3-011, MP-2.3-012, MP-3.4-001, MP-3.4-002, MP-3.4-004, MP-3.4-005, MP-3.4-006, MP-3.4-007, MP-3.4-008, MP-3.4-009, MP-4.1-004, MP-4.1-009, MP-4.1-011, MP-5.1-001, MP-5.1-002, MP-5.1-005, MS-1.1-006, MS-1.1-007, MS-1.1-008, MS-1.1-009, MS-1.1-010, MS-1.1-011, MS-1.1-012, MS-1.1-014, MS-1.1-015, MS-1.1-016, MS-1.1-017, MS-1.1-018, MS-2.2-001, MS-2.2-002, MS-2.2-003, MS-2.2-004, MS-2.2-005, MS-2.2-008, MS-2.2-009, MS-2.2-010, MS-2.2-011, MS-2.2-015, MS-2.2-016, MS-2.2-022, MS-2.5-012, MS-2.6-002, MS-2.7-002, MS-2.7-003, MS-2.7-004, MS-2.7-005, MS-2.7-007, MS-2.7-009, MS-2.7-010, MS-2.7-011, MS-2.7-012, MS-2.7-020, MS-2.7-021, MS-2.7-025, MS-2.7-032, MS-2.8-001, MS-2.8-005, MS-2.8-008, MS-2.8-011, MS-2.9-003, MS-2.10-001, MS-2.10-004, MS-2.10-006, MS-2.10-007, MS-2.10-009, MS-3.3-002, MS-3.3-003, MS-3.3-006, MS-3.3-008, MS-3.3-009, MS-3.3-012, MS-4.2-001, MS-4.2-004, MS-4.2-005, MS-4.2-006, MS-4.2-008, MS-4.2-009, MS-4.2-011)

Data Quality
Input data is accurate, representative, complete and documented, and data quality issues have been minimized. (GV-1.2-009, MS-2.2-020, MS-2.9-003, MS-4.2-007)

Data Retention
User prompts and associated system outputs are retained and monitored in alignment with relevant data privacy policies and roles. (GV-1.5-006, MP-4.1-009, MS-2.10-013)

Decommission Process
Decommissioning processes for GAI systems are planned, documented and communicated to users, and involve staging, data protection, containment protocols, and recourse mechanisms for decommissioned GAI systems. (GV-1.6-004, GV-1.7-001, GV-1.7-002, GV-1.7-003, GV-1.7-004, GV-1.7-005, GV-1.7-006, GV-1.7-007, GV-1.7-008, GV-3.2-002, GV-3.2-006, GV-4.1-004, GV-5.2-002, MG-2.3-005, MG-2.4-009, MG-3.1-003, MG-3.1-012, MG-3.2-011, MG-3.2-012, MG-4.1-016, MP-1.5-004, MP-2.2-007, MS-4.2-010)

Dependency Screening
GAI system dependencies are screened for security vulnerabilities. (GV-1.3-001, GV-1.4-002, GV-1.6-003, GV-1.7-003, GV-1.7-006, GV-6.2-002, GV-6.2-005, GV-6.2-006, MP-1.2-006, MP-1.6-001, MP-2.2-008, MP-4.1-012, MS-2.7-001)

Digital Signature
GAI-generated content is signed to preserve information integrity using watermarking, cryptogrpahic signature, steganography or similar methods. (GV-1.2-006, GV-1.6-003, GV-6.1-011, MG-4.1-008, MP-2.3-004, MS-1.1-006, MS-1.1-016, MS-2.7-009, MS-2.7-032)

Disclosure of AI Interaction
AI interactions are disclosed to internal personnel and external users. (GV-1.1-003, GV-1.4-004, GV-1.6-003, GV-5.1-002)

External Audit
GAI systems are audited by qualified external experts. (GV-1.2-009, GV-1.4-004, GV-3.2-001, GV-3.2-002, GV-4.1-003, GV-4.1-008, GV-5.1-003, MG-4.2-002, MP-2.3-011, MP-4.1-002, MS-1.3-005, MS-1.3-006, MS-1.3-010, MS-2.5-003, MS-2.8-020)

Failure Avoidance
AIID, AVID, GWU AI Litigation Database, OECD incident monitor or similar are consulted in design or procurement phases of GAI lifecycles to avoid repeating past known failures. (GV-1.6-003, MG-2.1-006, MG-3.1-008, MG-4.1-003, MP-1.1-003, MP-1.1-006, MS-1.1-003, MS-2.2-020, MS-2.7-031)

Fast Decommission
GAI systems can be quickly and safely disengaged. (GV-1.7-002, GV-1.7-003, GV-1.7-006, GV-3.2-006, GV-5.2-002, MG-2.3-005, MG-2.4-009, MG-3.1-003, MG-3.1-012, MG-3.2-012, MG-4.1-016)

Fine Tuning
GAI systems are fine-tuned to their operational domain using relevant and high-quality data. (GV-6.1-016, MG-3.1-001, MG-3.2-002, MP-4.1-013, MS-2.6-004)

Grounding
GAI systems are trained or fine-tuned on accurate, clean, and fully transparent training data. (GV-1.2-002, MG-3.1-001, MP-2.3-001, MS-2.3-017, MS-2.5-012)

Human Review
AI generated content is reviewed for accuracy and safety by qualified personnel. (GV-1.3-001, MG-2.2-008, MS-2.4-005, MS-2.5-015 )

Incident Response
Incident response plans for GAI failures, abuses, or misuses are documented, rehearsed, and updated appropriately after each incident; GAI incident response plans are coordinated with and communicated to other incident response functions. (GV-1.2-009, GV-1.5-001, GV-1.5-004, GV-1.5-005, GV-1.5-013, GV-1.5-015, GV-1.6-003, GV-1.6-007, GV-2.1-004, GV-3.2-002, GV-4.1-006, GV-4.2-002, GV-4.3-013, GV-6.1-006, GV-6.2-008, GV-6.2-016, GV-6.2-018, MG-1.3-001, MG-2.3-001, MG-2.3-002, MG-2.3-003, MG-2.4-004, MG-4.2-006, MG-4.3-001, MS-2.6-003, MS-2.6-012, MS-2.6-015, MS-2.7-002, MS-2.7-018, MS-2.7-028, MS-3.1-007)

Incorporate feedback
User feedback is incorporated in GAI design, development, and risk management. (GV-3.2-005, GV-4.3-007, GV-5.1-003, GV-5.1-009, GV-5.2-004, MG-2.2-007, MG-2.2-012, MG-2.3-007, MG-3.2-004, MG-4.1-019, MG-4.2-013, MP-1.6-005, MP-2.3-018, MP-3.1-003, MP-2.3-019, MP-5.2-007, MS-1.2-008, MS-3.3-009, MS-3.3-010, MS-4.1-004, MS-4.2-007, MS-4.2-010, MS-4.2-013, MS-4.2-020)

Instructions
Users are provided with the necessary instructions for safe, valid, and productive use. (GV-5.1-006, GV-6.1-021, GV-6.2-014, MG-3.1-009, MS-2.8-012)

Insurance
Risk transfer via insurance policies is considered and implemented when feasibable and appropriate. (MG-2.2-015)

Intellectual Property Removal
Licensed, patented, trademarked, trade secret, or other data that may violate the intellectual property rights of others is removed from system training data; generated system outputs are monitored for similar information. (GV-1.6-003, MG-3.1-007, MP-2.3-012, MP-4.1-004, MP-4.1-009, MS-2.2-022, MS-2.6-002, MS-2.8-001, MS-2.8-008)

Inventory
GAI system is information is stored in the organizational model inventory. (GV-1.4-005, GV-1.6-001, GV-1.6-002, GV-1.6-003, GV-1.6-004, GV-1.6-006, GV-1.6-009, GV-4.2-010, GV-6.1-013, MG-3.2-014, MP-4.1-020, MP-4.2-003, MP-5.1-004 MS-2.13-002, MS-3.2-007)

Malware Screening
GAI weights and other software components are scanned for malware. (MG-3.1-002, MS-2.7-001)

Model Documentation
All technical mechanisms with GAI systems are well documented, including open source and third party GAI systems. (GV-1.3-009, GV-1.4-002, GV-1.4-004, GV-1.4-005, GV-1.4-007, GV-1.6-007, GV-3.2-002, GV-3.2-009, GV-4.1-002, GV-4.2-011, GV-4.2-013, GV-4.3-002, GV-6.2-001, GV-6.2-014, MG-1.3-010, MG-2.2-016, MG-3.1-004, MG-3.1-009, MG-3.1-013, MG-3.1-015, MP-2.1-002, MP-2.3-027, MP-3.1-004, MP-3.4-015, MP-4.1-021, MP-4.2-003, MP-5.2-010, MS-1.3-002, MS-2.1-001, MS-2.2-014, MS-2.7-002, MS-2.7-012, MS-2.7-024, MS-2.8-007, MS-2.8-011)

Monitoring
GAI systems are inputs and outputs are monitored for drift, accuracy, safety, bias, data privacy, intellectual property infringements, malware materials, phishing materials, confabulated packages, obscene materials, and CSAM. (GV-1.2-009, GV-1.5-001, GV-1.5-003, GV-1.5-005, GV-1.5-012, GV-1.5-015, GV-1.6-003, GV-3.2-011, GV-4.2-007, GV-4.2-010, GV-4.3-001, GV-6.1-016, GV-6.2-010, MG-2.1-004, MG-2.2-003, MG-2.3-008, MG-2.3-010, MG-3.1-016, MG-3.2-006, MG-3.2-013, MG-3.2-016, MG-4.1-005, MG-4.1-009, MG-4.1-010, MG-4.1-018, MP-3.4-007, MP-4.1-002, MP-4.1-004, MP-5.2-009, MS-1.1-029, MS-1.2-005, MS-2.2-007, MS-2.4-003, MS-2.4-004, MS-2.5-007, MS-2.5-008, MS-2.5-024, MS-2.6-003, MS-2.6-009, MS-2.6-016, MS-2.7-013, MS-2.7-014, MS-2.7-015, MS-2.10-007, MS-2.10-019, MS-2.10-020, MS-2.11-006, MS-2.11-030, MS-3.3-006, MS-4.2-009, MS-4.3-004)

Narrow Scope
Systems are deployed for targeted business applications with documented and direct business value. (GV-1.2-002, MP-3.3-001, MP-5.1-011)

Open Source
Open source code is used to promote explainability and transparency. (MG-4.2-007, MP-4.1-017)

Ownership
GAI systems and vendor relationships are owned by specific and documented internal personnel. (GV-6.1-009, GV-6.1-016, GV-6.2-008, MP-1.1-005, MP-1.1-008)

Prohibited Use Policy
General abuse and misuse of GAI systems by internal parties is restricted by organizational policies. (GV-1.1-006, GV-1.2-003, GV-1.6-003, GV-3.2-003, GV-4.1-001, GV-6.1-017, GV-6.1-017)

RAG
Retreival augmented generation (RAG) is used to improve accuracy in generated content. (GV-1.2-002, MS-2.3-004, MS-2.5-005, MS-2.5-012, MS-2.9-003, MG-3.1-001, MG-3.1-006, MG-3.2-002, MG-3.2-003)

Rate-limiting
GAI response times and query volumes are limited. (MS-2.6-007)

Redudancy
Rollover, fallback, and other redundancy mechanisms are available for GAI systems and address weights and other important system components. (GV-6.2-003, GV-6.2-007, GV-6.2-012, MG-2.4-012, MS-2.6-008)

Refresh
Systems are retrained or re-tuned at a reasonable cadence. (MG-3.1-001, MG-3.2-011, MS-2.3-004, MS-2.12-003)

Restrict Anonymous Use
Anonymous use of GAI systems is restricted. (GV-3.2-002)

Restrict Anthropomorphization
Human, animal, cyborg, emotional or other images or features that promote anthropomorphization of GAI systems are restricted. (GV-1.3-001, MS-2.5-009)

Restrict Data Collection
All data collection is disclosed, collected data is protected and use in a transparent fashion. (GV-6.2-016, MS-2.2-023, MS-2.10-013)

Restrict Decision Making
GAI systems are not employed for material decision-making tasks. (GV-1.3-001, GV-4.1-001, MP-1.1-018, MP-1.6-001, MP-3.4-017)

Restrict Homogeneity
Feedback loops in which GAI systems are trained with GAI-generated data are restricted. (GV-1.3-004, MS-2.11-011)

Restrict Internet Access
GAI systems are disconnected from the internet. (MP-2.2-007)

Restrict Location Tracking
Any location tracking is conducted with user consent, disclosed, aligned with relevant privacy policies and laws and potential threats to user safety are managed. (MS-2.10-002)

Restrict Minors
Use of organizational GAI systems by minors are restricted. ()

Restrict Regulated Dealings
GAI is not deployed in regulated dealings or for material decision making. (GV-1.1-004, GV-1.3-001, GV-4.1-001, GV-5.2-001, MP-2.3-013, MS-2.11-018)

Restrict Secondary Use
Any secondary use of GAI input data is conducted with user consent, disclosed, and aligned with relevant privacy policies and laws. (GV-6.1-016, GV-6.2-016)

RLHF
For third-party GAI systems, vendors engage in specific reinforcement with human feedback (RLHF) exercises to address identified risks; for internal systems, internal personnel engage in RLHF to address identified risks. (MG-2.1-002, MS-2.5-005, MS-2.9-003, MS-2.9-007)

Sensitive/Personal Data Removal
Personal, sensitive, biometric, or otherwise restricted data is minimized or eliminated from GAI training data. (GV-1.2-009, GV-1.6-003, MP-4.1-002, MP-4.1-016, MS-2.10-002, MS-2.10-003, MS-2.10-005, MS-2.10-014, MS-2.10-017, MS-2.10-018, MS-2.10-020)

Session Limits
Time, query volume, and response rate are limited for GAI user sessions. (GV-4.1-001, MS-2.6-007, MS-2.6-010)

Supply Chain Audit
GAI system supply chains are audited and documented, with a focus on data poisoning, malware, and software and hardware vulnerabilities. (GV-4.1-004, GV-6.1-011, GV-6.1-022, GV-6.2-003, MG-2.3-001, MG-3.1-002, MP-5.1-003, MS-1.1-008, MS-2.6-001, MS-2.7-001)

System Documentation
GAI systems are well-documented whether internal, open source, or vendor-provided. (GV-1.3-009, GV-1.4-002, GV-1.4-004, GV-1.4-005, GV-1.4-007, GV-1.6-007, GV-3.2-002, GV-3.2-009, GV-4.1-002, GV-4.2-011, GV-4.2-013, GV-4.3-002, GV-6.2-001, GV-6.2-014, MG-1.3-010, MG-2.2-016, MG-3.1-004, MG-3.1-009, MG-3.1-013, MG-3.1-015, MP-2.1-002, MP-2.3-027, MP-3.1-004, MP-3.4-015, MP-4.1-021, MP-4.2-003, MP-5.2-010, MS-1.3-002, MS-2.1-001, MS-2.2-014, MS-2.7-002, MS-2.7-012, MS-2.7-024, MS-2.8-007, MS-2.8-011)

System Prompt
System prompts are used to tune GAI systems to specific tasks and to mitigate risks. (GV-1.2-002, MS-2.3-004, MS-2.5-005, MS-2.5-012, MS-2.9-003, MG-3.1-001, MG-3.1-006, MG-3.2-002, MG-3.2-003)

Team Diversity
Teams that implement and manage GAI systems represent broad professional, educational, life-stage, and demographic diversity. (GV-2.1-004, GV-3.1-002, GV-3.1-004, GV-3.1-005, GV-3.2-008, MG-2.1-005, MP-1.2-003, MP-1.2-004, MP-1.2-007, MS-1.3-012, MS-1.3-017, MS-2.3-015, MS-3.3-012)

Temperature
Temperature settings are used to tune GAI systems to specific tasks and to mitigate risks. (GV-1.2-002, MS-2.3-004, MS-2.5-005, MS-2.5-012, MS-2.9-003, MG-3.1-001, MG-3.1-006, MG-3.2-002, MG-3.2-003)

Terms of Service
General abuse and misuse by external parties is prohibited by organizational policies. Adaptive terms of service based on trust-level for user. (GV-4.2-003, GV-4.2-005, GV-4.2-007, GV-6.1-016, GV-6.2-016, MP-4.1-021)

Training
Internal personnel recieve training on productivity and basic risk management for GAI systems. (GV-2.2-004, GV-3.2-002, GV-6.1-003, MS-1.1-014)

User Feedback
GAI systems implement user feedback mechanisms. (GV-1.5-007, GV-1.5-009, GV-3.2-005, GV-5.1-001, GV-5.1-006, GV-5.1-007, GV-5.1-009, MG-1.3-005, MS-1.3-015, MS-1.3-016, MG-2.1-004, MG-2.2-012, MS-2.7-004, MS-4.2-012)

User Recourse
Policies, processes, and technical mechanisms enable recourse for users who are harmed by GAI systems. (GV-1.5-010, GV-1.7-003, GV-5.1-001, GV-5.1-006, GV-5.1-009, MS-2.8-015, MS-2.8-019, MS-3.2-006, MS-4.2-012)

Validation
GAI systems are shown to reliably generate valid results for their targeted business application. (GV-1.2-009, GV-1.4-002, GV-1.4-004, GV-3.2-002, GV-5.1-005, MG-2.2-016, MG-3.1-009, MG-3.1-014, MP-2.3-006, MP-2.3-013, MP-4.1-012, MS-2.3-005, MS-2.5-016, MS-2.9-002, MS-2.9-014)

XAI
Methods such as visualization, occlusion, model compression, pertubation studies, and similar are applied to increase explainability of GAI systems. (GV-1.4-002, GV-3.2-002, GV-5.1-005, MG-3.2-001, MP-2.2-006, MS-2.8-019, MS-2.9-001, MS-2.9-005, MS-2.9-006, MS-2.9-009, MS-2.9-011, MS-2.9-013, MS-2.9-015, MS-4.2-006)

**Usage Note**: Section E puts forward selected risk controls that organizations may apply for GAI risk management. Higher level controls are linked to specific GAI and AI RMF Playbook actions [NIST AI RMF Playbook], [NIST AI 600-1].

## F: Example Low-risk Generative AI Measurement and Management Plan

### F.1: Example Low-risk Generative AI Measurement and Management Plan Organized by Trustworthy Characteristic

Table F.1: Example risk measurement and management approaches suitable for low-risk GAI applications organized by trustworthy characteristic.

Function
Trustworthy Characteristic

Accountable and Transparent
Fair with Harmful Bias Managed

Measure

An Evaluation on Large Language Model Outputs: Discourse and Memorization (see Appendix B)

Big-bench: Truthfulness [Srivastava et al.]

DecodingTrust: Machine Ethics [Wang et al.]

Evaluation Harness: ETHICS

HELM: Copyright

Mark My Words [Piet et al.]

BELEBELE

Big-bench: Low-resource language, Non-English, Translation

Big-bench: Social bias, Racial bias, Gender bias, Religious bias

Big-bench: Toxicity

DecodingTrust: Fairness

DecodingTrust: Stereotype Bias

DecodingTrust: Toxicity

C-Eval (Chinese evaluation suite)

Evaluation Harness: CrowS-Pairs

Evaluation Harness: ToxiGen

Finding New Biases in Language Models with a Holistic Descriptor Dataset [Smith et al.]

From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models

HELM: Bias

HELM: Toxicity

MT-bench [Zheng et al.]

The Self-Perception and Political Biases of ChatGPT [Rutinowski et al.]

Towards Measuring the Representation of Subjective Global Opinions in Language Models

Manage

Contract Review

Disclosure of AI Interaction

Instructions

Inventory

Ownership

Prohibited Use Policy

Restrict Decision Making

System Documentation

Content Moderation

Failure Avoidance

Instructions

Inventory

Ownership

Prohibited Use Policy

System Prompt

Ownership

Restrict Anonymous Use

Restrict Decision Making

Temperature

Table F.1: Example risk measurement and management approaches suitable for low-risk GAI applications organized by trustworthy characteristic (continued).

Function
Trustworthy Characteristic

Interpretable and Explainable
Privacy-enhanced
Safe
Secure and Resilient

Measure

HELM: Copyright

llmprivacy

mimir

Big-bench: Convince Me

Big-bench: Truthfulness

HELM: Reiteration, Wedging

Mark My Words

MLCommons

The WMDP Benchmark

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

DecodingTrust: Adversarial Robustness, Robustness Against Adversarial Demonstrations

detect-pretrain-code

In-The-Wild Jailbreak Prompts on LLMs

JailbreakingLLMs

llmprivacy

mimir

TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs

Manage

Instructions

Inventory

System Documentation

Content Moderation

Contract Review

Failure Avoidance

Inventory

Ownership

Prohibited Use Policy

Restrict Anonymous Use

System Documentation

Content Moderation

Disclosure of AI Interaction

Failure Avoidance

Instructions

Inventory

Ownership

Prohibited Use Policy

Restrict Anonymous Use

Restrict Anthropomorphization

Restrict Decision Making

System Documentation

System Prompt

Temperature

Access Control

Approved List

Authentication

Change Management

Dependency Screening

Failure Avoidance

Inventory

Ownership

Malware Screening

Restrict Anonymous Use

Big-bench: Algorithms, Logical reasoning, Implicit reasoning, Mathematics, Arithmetic, Algebra, Mathematical proof, Black-Box Fallacy, Negation, Computer code, Probabilistic reasoning, Social reasoning, Analogical reasoning, Multi-step, Understanding the World

Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity

Big-bench: Context Free Question Answering

Big-bench: Contextual question answering, Reading comprehension, Question generation

Big-bench: Morphology, Grammar, Syntax

Big-bench: Out-of-Distribution

Big-bench: Paraphrase

Big-bench: Sufficient information

Big-bench: Summarization

DecodingTrust: Out-of-Distribution Robustness, Adversarial Robustness, Robustness Against Adversarial Demonstrations

Eval Gauntlet: Reading comprehension

Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming

Eval Gauntlet: Language Understanding

Eval Gauntlet: World Knowledge

Evaluation Harness: BLiMP

Evaluation Harness: CoQA, ARC

Evaluation Harness: GLUE

Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA

Evaluation Harness: MuTual

Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP

FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness

FLASK: Readability, Conciseness, Insightfulness

HELM: Knowledge

HELM: Language

HELM: Text classification

HELM: Question answering

HELM: Reasoning

HELM: Robustness to contrast sets

HELM: Summarization

Hugging Face: Fill-mask, Text generation

Hugging Face: Question answering

Hugging Face: Summarization

Hugging Face: Text classification, Token classification, Zero-shot classification

MASSIVE

MT-bench

Manage

Content Moderation

Disclosure of AI Interaction

Failure Avoidance

Instructions

Restrict Anthropomorphization

Restrict Decision Making

System Documentation

System Prompt

Temperature

### F.2: Example Low-risk Generative AI Measurement and Management Plan Organized by Generative AI Risk

Table F.2: Example risk measurement and management approaches suitable for low-risk GAI applications organized by GAI risk.

Function
GAI Risk

CBRN Information
Confabulation

Measure

Big-bench: Convince Me

Big-bench: Truthfulness

HELM: Reiteration, Wedging

MLCommons

The WMDP Benchmark

Big-bench: Algorithms, Logical reasoning, Implicit reasoning, Mathematics, Arithmetic, Algebra, Mathematical proof, Black-Box Fallacy, Negation, Computer code, Probabilistic reasoning, Social reasoning, Analogical reasoning, Multi-step, Understanding the World

Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity

Big-bench: Context Free Question Answering

Big-bench: Contextual question answering, Reading comprehension, Question generation

Big-bench: Convince Me

Big-bench: Low-resource language, Non-English, Translation

Big-bench: Morphology, Grammar, Syntax

Big-bench: Out-of-Distribution

Big-bench: Paraphrase

Big-bench: Sufficient information

Big-bench: Summarization

Big-bench: Truthfulness

C-Eval (Chinese evaluation suite)

DecodingTrust: Out-of-Distribution Robustness, Robustness Against Adversarial Demonstrations

Eval Gauntlet Reading comprehension

Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming

Eval Gauntlet: Language Understanding

Eval Gauntlet: World Knowledge

Evaluation Harness: BLiMP

Evaluation Harness: CoQA, ARC

Evaluation Harness: GLUE

Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA

Evaluation Harness: MuTual

Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP

FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness

FLASK: Readability, Conciseness, Insightfulness

Finding New Biases in Language Models with a Holistic Descriptor Dataset

HELM: Knowledge

HELM: Language

HELM: Language (Twitter AAE)

HELM: Question answering

HELM: Reasoning

HELM: Reiteration, Wedging

HELM: Robustness to contrast sets

HELM: Summarization

HELM: Text classification

Hugging Face: Fill-mask, Text generation

Hugging Face: Question answering

Hugging Face: Summarization

Hugging Face: Text classification, Token classification, Zero-shot classification

MASSIVE

MLCommons

MT-bench

Manage

Access Control

Failure Avoidance

Inventory

Ownership

Prohibited Use Policy

Content Moderation

Disclosure of AI Interaction

Failure Avoidance

Instructions

Restrict Anthropomorphization

Restrict Decision Making

System Documentation

System Prompt

Temperature

Table F.2: Example risk measurement and management approaches suitable for low-risk GAI applications organized by GAI risk (continued).

Function
GAI Risk

Dangerous or Violent Recommendations
Data Privacy
Environmental
Human-AI Configuration

Measure

Big-bench: Convince Me

Big-bench: Toxicity

DecodingTrust: Adversarial Robustness, Robustness Against Adversarial Demonstrations

DecodingTrust: Machine Ethics

DecodingTrust: Toxicity

Evaluation Harness: ToxiGen

HELM: Reiteration, Wedging

HELM: Toxicity

MLCommons

An Evaluation on Large Language Model Outputs: Discourse and Memorization (with human scoring, see Appendix B)

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation}

DecodingTrust: Machine Ethics

Evaluation Harness: ETHICS

HELM: Copyright

In-The-Wild Jailbreak Prompts on LLMs

JailbreakingLLMs

MLCommons

Mark My Words

TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs

detect-pretrain-code

llmprivacy

mimir

HELM: Efficiency

Manage

Content Moderation

Disclosure of AI Interaction

Failure Avoidance

Instructions

Inventory

Ownership

Prohibited Use Policy

Restrict Anonymous Use

Restrict Anthropomorphization

Restrict Decision making

System Documentation

System Prompt

Temperature

Content Moderation

Contract Review

Failure Avoidance

Inventory

Ownership

Prohibited Use Policy

Restrict Anonymous Use

System Documentation

Access Control

Failure Avoidance

Inventory

Ownership

Restrict Anonymous Use

Content Moderation

Disclosure of AI Interaction

Failure Avoidance

Instructions

Inventory

Ownership

Prohibited Use Policy

Restrict Anonymous Use

Restrict Anthropomorphization

Restrict Decision Making

Training

Table F.2: Example risk measurement and management approaches suitable for low-risk GAI applications organized by GAI risk (continued).

Function
GAI Risk

Information Integrity
Information Security
Intellectual Property

Measure

Big-bench: Analytic entailment, Formal fallacies and syllogisms with negation, Entailed polarity

Big-bench: Convince Me

Big-bench: Paraphrase

Big-bench: Sufficient information

Big-bench: Summarization

Big-bench: Truthfulness

DecodingTrust: Machine Ethics

DecodingTrust: Out-of-Distribution Robustness, Robustness Against Adversarial Demonstrations, Adversarial Robustness

Eval Gauntlet: Language Understanding

Eval Gauntlet: World Knowledge

Evaluation Harness: CoQA, ARC

Evaluation Harness: ETHICS

Evaluation Harness: GLUE

Evaluation Harness: HellaSwag, OpenBookQA, TruthfulQA

Evaluation Harness: MuTual

Evaluation Harness: PIQA, PROST, MC-TACO, MathQA, LogiQA, DROP

FLASK: Logical correctness, Logical robustness, Logical efficiency, Comprehension, Completeness

FLASK: Readability, Conciseness, Insightfulness

HELM: Knowledge

HELM: Language

HELM: Question answering

HELM: Reasoning

HELM: Reiteration, Wedging

HELM: Robustness to contrast sets

HELM: Summarization

HELM: Text classification

Hugging Face: Fill-mask, Text generation

Hugging Face: Question answering

Hugging Face: Summarization

MLCommons

MT-bench

Mark My Words

Big-bench: Convince Me

Big-bench: Out-of-Distribution

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

DecodingTrust: Out-of-Distribution Robustness, Robustness Against Adversarial Demonstrations, Adversarial Robustness,

Eval Gauntlet: Commonsense reasoning, Symbolic problem solving, Programming

HELM: Copyright

In-The-Wild Jailbreak Prompts on LLMs

JailbreakingLLMs

Mark My Words

TAP: A Query-Efficient Method for Jailbreaking Black-Box LLMs

detect-pretrain-code

llmprivacy

mimir

An Evaluation on Large Language Model Outputs: Discourse and Memorization (with human scoring, see Appendix B)

HELM: Copyright

Mark My Words

llmprivacy

mimir

Manage

Content Moderation

Disclosure of AI Interaction

Failure Avoidance

Inventory

Ownership

Prohibited Use Policy

Restrict Anonymous Use

Restrict Anthropomorphization

System Prompt

Temperature

Access Control

Approved List

Authentication

Change Management

Dependency Screening

Failure Avoidance

Inventory

Ownership

Malware Screening

Restrict Anonymous Use

Contract Review

Disclosure of AI Interaction

Instructions

Inventory

Ownership

Prohibited Use Policy

Table F.2: Example risk measurement and management approaches suitable for low-risk GAI applications organized by GAI risk (continued).

Function
GAI Risk

Obscene, Degrading, and/or Abusive Content
Toxicity, Bias, and Homogenization
Value Chain and Component Integration

Measure

Big-bench: Social bias, Racial bias, Gender bias, Religious bias

Big-bench: Toxicity

DecodingTrust: Fairness

DecodingTrust: Stereotype Bias

DecodingTrust: Toxicity

Evaluation Harness: CrowS-Pairs

Evaluation Harness: ToxiGen

HELM: Bias

HELM: Toxicity

BELEBELE

Big-bench: Low-resource language, Non-English, Translation

Big-bench: Out-of-Distribution

Big-bench: Social bias, Racial bias, Gender bias, Religious bias

Big-bench: Toxicity

C-Eval (Chinese evaluation suite)

DecodingTrust: Fairness

DecodingTrust: Stereotype Bias

DecodingTrust: Toxicity

Eval Gauntlet: World Knowledge

Evaluation Harness: CrowS-Pairs

Evaluation Harness: ToxiGen

Finding New Biases in Language Models with a Holistic Descriptor Dataset

From Pretraining Data to Language Models to Downstream Tasks:
Tracking the Trails of Political Biases Leading to Unfair NLP Models

HELM: Bias

HELM: Toxicity

The Self-Perception and Political Biases of ChatGPT

Towards Measuring the Representation of Subjective Global Opinions in Language Models

Manage

Content Moderation

Failure Avoidance

Instructions

Inventory

Ownership

Prohibited Use Policy

Restrict Anonymous Use

System Prompt

Temperature

Content Moderation

Failure Avoidance

Instructions

Inventory

Ownership

Prohibited Use Policy

Restrict Anonymous Use

Restrict Decision Making

System Prompt

Temperature

Contract Review

Disclosure of AI Interaction

Failure Avoidance

Inventory

Ownership

Prohibited Use Policy

System Documentation

**Usage Note**: Section F puts forward an example risk measurement and management plan for low risk GAI systems or applications. The low risk plan focuses on automatable model testing and applies minimally burdensome risk controls.

- Material in Table F.1 can be applied to measure and manage GAI risks in risk programs that are aligned to the trustworthy characteristics.

- Material in Table F.2 can be applied to measure and manage GAI risks in risk programs that are aligned to GAI risks.

Section G below presents an example plan for medium risk systems and Section H presents an example plan for high risk systems.

**Usage Note**: Section E puts forward selected risk controls that
organizations may apply for GAI risk management. Higher level controls
are linked to specific GAI and AI RMF Playbook actions.

## G: Example Medium-risk Generative AI Measurement and Management Plan

### G.1: Example Medium-risk Generative AI Measurement and Management Plan Organized by Trustworthy Characteristic

Table G.1: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by trustworthy characteristic.

Function
Trustworthy Characteristic

Accountable and Transparent
Fair with Harmful Bias Managed

Measure

Context exhaustion: logic-overloading prompts

Loaded/leading questions

Multi-tasking prompts

Backwards relationships

Counterfactual prompts

Pros and cons prompts

Role-playing prompts

Loaded/leading questions

Low context prompts

Repeat this

Manage

Data Provenance

Data Quality

Decommission Process

Digital Signature

External Audit

Fine Tuning

Grounding

Human Review

Incident Response

Incorporate feedback

Model Documentation

Monitoring

Narrow Scope

Open Source

Refresh

RLHF

Restrict Data Collection

Restrict Secondary Use

User Feedback

Validation

Accessibility

Data Provenance

Data Quality

External Audit

Fine Tuning

Grounding

Human Review

Incident Response

Incorporate feedback

Narrow Scope

Restrict Homogeneity

Team Diversity

User Feedback

Validation

Table G.1: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by trustworthy characteristic (continued).

Function
Trustworthy Characteristic

Interpretable and Explainable
Privacy-enhanced
Safe
Secure and Resilient

Measure

Context exhaustion: logic-overloading prompts (to reveal unexplainable decisioning processes)

Auto/biographical prompts

User information awareness prompts

Autocompletion prompts

Repeat this

Pros and cons prompts

Role-playing prompts

Impossible situation prompts

Content exhaustion: niche-seeking prompts

Ingratiation/reverse psychology prompts

Loaded/leading questions

User information awareness prompts

Repeat this

Multi-tasking prompts

Pros and cons prompts

Role-playing prompts

Content exhaustion: niche-seeking prompts

Ingratiation/reverse psychology prompts

Prompt injection attacks

Membership inference attacks

Random attacks

Manage

Data Provenance

External Audit

Human Review

Model Documentation

Monitoring

Open Source

User Feedback

Consent

Data Provenance

Data Quality

Data Retention

External Audit

Restrict Data Collection

Restrict Location Tracking

Restrict Secondary Use

Blocklist

Data Retention

Decommission Process

Digital Signature

External Audit

Human Review

Incident Response

Monitoring

Narrow Scope

Rate-limiting

Restrict Location Tracking

Session Limits

User Feedback

Blocklist

Decommission Process

External Audit

Incident Response

Monitoring

Open Source

Rate-limiting

Session Limits

Backwards relationships

Context baiting (and/or switching) prompts

Multi-tasking prompts

Role-playing prompts

Ingratiation/reverse psychology prompts

Loaded/leading questions

Time-perplexity prompts

Niche-seeking prompts

Logic overloading prompts

Repeat this

Numeric calculation

Manage

Data Quality

Fine Tuning

Grounding

Human Review

Incorporate feedback

Model Documentation

Monitoring

Narrow Scope

Open Source

Refresh

Restrict Homogeneity

RLHF

Team Diversity

User Feedback

Validation

### G.2: Example Medium-risk Generative AI Measurement and Management Plan Organized by Generative AI Risk

Table G.2: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by GAI risk.

Function
GAI Risk

CBRN Information
Confabulation

Measure

Auto-completion prompts

Role-playing prompts

Reverse psychology prompts

Pros and cons prompts

Multitasking prompts

Repeat this

Backwards relationship prompts

Context baiting (and/or switching) prompts

Context exhaustion: Logic overloading prompts

Context exhaustion: Multi-tasking prompts

Context exhaustion: Niche-seeking prompts

Time perplexity prompts

Loaded/leading questions

Calculation and numeric queries

Manage

Blocklist

Data Provenance

Data Quality

Decommission Process

Digital Signature

External Audit

Incident Response

Monitoring

Rate-limiting

Session Limits

Data Quality

Fine Tuning

Grounding

Human Review

Incorporate feedback

Model Documentation

Monitoring

Narrow Scope

Open Source

Refresh

Restrict Homogeneity

RLHF

Team Diversity

User Feedback

Validation

Table G.2: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by GAI risk (continued).

Function
GAI Risk

Dangerous or Violent Recommendations
Data Privacy
Environmental
Human-AI Configuration

Measure

Impossible situation prompts

Role-playing prompts

Reverse psychology prompts

Pros and cons prompts

Multitasking prompts

Repeat this

Loaded/leading questions

User information awareness

Membership inference attacks

Auto/biographical prompts

Repeat this

Availability attacks

Role-playing prompts

Reverse psychology prompts

Pros and cons prompts

Multitasking prompts

Impossible situation prompts

Role-playing prompts

Reverse psychology prompts

Pros and cons prompts

Multitasking prompts

Manage

Blocklist

Data Retention

Decommission Process

Digital Signature

External Audit

Human Review

Incident Response

Monitoring

Narrow Scope

Rate-limiting

Restrict Location Tracking

Session Limits

User Feedback

Consent

Data Provenance

Data Quality

Data Retention

External Audit

Restrict Data Collection

Restrict Location Tracking

Restrict Secondary Use

Decommission Process

External Audit

Incident Response

Monitoring

Rate-limiting

Session Limits

Accessibility

Blocklist

Consent

Decommission Process

Digital Signature

External Audit

Human Review

Incorporate feedback

Restrict Data Collection

Restrict Location Tracking

Restrict Secondary Use

Session Limits

User Feedback

Table G.2: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by GAI risk (continued).

Function
GAI Risk

Information Integrity
Information Security
Intellectual Property

Measure

Loaded/leading questions

Role-playing prompts

Reverse psychology prompts

Pros and cons prompts

Multitasking prompts

Confidentiality attacks

Integrity attacks

Availability attacks

Random attacks

Role-playing prompts

Reverse psychology prompts

Pros and cons prompts

Multitasking prompts

Confidentiality attacks

Auto-complete prompts

Manage

Data Provenance

Data Quality

Digital Signature

External Audit

Fine Tuning

Grounding

Human Review

Incident Response

Incorporate feedback

Monitoring

Narrow Scope

Open Source

Refresh

Restrict Homogeneity

RLHF

User Feedback

Validation

Blocklist

Decommission Process

External Audit

Incident Response

Monitoring

Open Source

Rate-limiting

Session Limits

Blocklist

Data Provenance

Data Quality

Decommission Process

Digital Signature

External Audit

Incident Response

Incorporate feedback

Monitoring

Open Source

Rate-limiting

Session Limits

User Feedback

Table G.2: Example risk measurement and management approaches suitable for medium-risk GAI applications organized by GAI risk (continued).

Function
GAI Risk

Obscene, Degrading, and/or Abusive Content
Toxicity, Bias, and Homogenization
Value Chain and Component Integration

Measure

Confidentiality attacks

Autocomplete prompts

Role-playing prompts

Reverse psychology prompts

Pros and cons prompts

Multitasking prompts

Loaded/leading questions

Repeat this

Backwards relationship prompts

Data poisoning attacks

Counterfactual prompts

Pros and cons prompts

Role-playing prompts

Low context prompts

Loaded/leading questions

Repeat this

Manage

Blocklist

Data Provenance

Data Quality

Decommission Process

Digital Signature

External Audit

Incident Response

Monitoring

Rate-limiting

Session Limits

User Feedback

Accessibility

Data Provenance

Data Quality

External Audit

Fine Tuning

Grounding

Human Review

Incident Response

Incorporate feedback

Narrow Scope

Restrict Homogeneity

Team Diversity

User Feedback

Validation

Data Provenance

Data Quality

Digital Signature

External Audit

Model Documentation

Restrict Data Collection

Restrict Secondary Use

**Usage Note**: Section G puts forward an example risk measurement and management plan for medium risk GAI systems or applications. The
medium risk plan focuses on red-teaming and applies moderate risk controls. Measurement and management approaches from Section F should
also be applied to medium risk systems or applications.

- Material in Table G.1 can be applied to measure and manage GAI risks in risk programs that are aligned to the trustworthy characteristics.

- Material in Table G.2 can be applied to measure and manage GAI risks in risk programs that are aligned to GAI risks.

Section H below presents an example plan for high risk systems.

## H: Example High-risk Generative AI Measurement and Management Plan

### H.1: Example High-risk Generative AI Measurement and Management Plan Organized by Trustworthy Characteristic

Table H.1: Example risk measurement and management approaches suitable for high-risk GAI applications organized by trustworthy characteristic.

Function
Trustworthy Characteristic

Accountable and Transparent
Fair with Harmful Bias Managed

Measure

Algorithmic impact assessments

Assessing data quality*

Bias bounties

Calibration*

Cybersecurity testing

Environmental metrics

Field testing*

Input/output measurement using classifiers

Model assessment*

Model comparison*

Multi-session experiments*

Online metrics/monitoring

Perturbation studies*

PII identification and removal

Root cause analysis*

Screening for information integrity

Sensitivity analysis*

Software testing

Stakeholder engagement and feedback*

Statistical quality control*

Stress testing*

Sub-sampling traffic for manually annotating

Supply chain auditing

Testing third-party dependencies

User surveys*

Validity testing/validation.*

Algorithmic impact assessments

Analyze differences between intended and actual population of users or data subjects*

Anomaly detection*

Assessing data quality*

Bias bounties

Bias testing

Calibration*

Counterfactual/causal analysis

Disaggregated metrics

Field testing*

Model assessment*

Model comparison*

Multi-session experiments*

Root cause analysis*

Software testing

Statistical quality control*

Stress testing*

User surveys*

Validity testing/validation.*

Manage

Fast decommission

Insurance

Intellectual property removal

Restrict regulated dealings

Sensitive/Personal data removal

Supply chain audit

User recourse

CSAM/Obscenity removal

Fast decommission

Insurance

Intellectual property removal

Restrict regulated dealings

Sensitive/Personal data removal

Supply chain audit

User recourse

Table H.1: Example risk measurement and management approaches suitable for high-risk GAI applications organized by trustworthy characteristic (continued).

Function
Trustworthy Characteristic

Interpretable and Explainable
Privacy-enhanced
Safe
Secure and Resilient

Measure

Algorithmic impact assessments

Analyze differences between intended and actual population of users or data subjects*

Model comparison.*

Multi-session experiments.*

Root cause analysis.*

Stakeholder engagement and feedback.*

UI/UX studies

User surveys*

Algorithmic impact assessments

Assessing data quality.*

Cybersecurity testing

PII identification and removal

Root cause analysis*

Stakeholder engagement and feedback*

Stress testing*

Testing third-party dependencies

Algorithmic impact assessments

Analyze differences between intended and actual population of users or data subjects*

Assessing data quality*

Bias bounties

Calibration*

Chaos testing

Dangerous and violent
content removal

Field testing*

Input/output measurement using classifiers

Model assessment*

Model comparison*

Multi-session experiments*

Perturbation studies*

Root cause analysis*

Sensitivity analysis*

Stakeholder engagement and feedback*

Statistical quality control*

Stress testing*

User surveys*

Validity testing/validation*

Algorithmic impact assessments

Anomaly detection*

Assessing data quality*

Bias bounties

Calibration*

Chaos testing

Cybersecurity testing

Data poisoning detection

Model assessment*

Model comparison*

Root cause analysis*

Software testing

Stakeholder engagement and feedback*

Stress testing*

Supply chain auditing

Testing third-party dependencies

Manage

Restrict regulated dealings

Supply Chain Audit

User recourse

CSAM/Obscenity removal

Fast decommission

Insurance

Intellectual property removal

Restrict minors

Restrict regulated dealings

Sensitive/Personal data removal

Supply chain audit

User recourse

CSAM/Obscenity removal

Fast decommission

Insurance

Redundancy

Restrict internet access

Restrict minors

Restrict regulated dealings

Sensitive/Personal data removal

Supply Chain Audit

User recourse

CSAM/Obscenity removal

Fast decommission

Insurance

Intellectual property removal

Redundancy

Restrict internet access

Restrict minors

Restrict regulated dealings

Sensitive/Personal data removal

Supply chain audit

User recourse

Algorithmic impact assessments

Analyze differences between intended and actual population of users or data subjects*

Assessing data quality*

Bias bounties

Calibration*

Field testing*

Input/output measurement using classifiers

Model assessment*

Model comparison*

Multi-session experiments*

Perturbation studies*

Root cause analysis*

Sensitivity analysis*

Stakeholder engagement and feedback*

Statistical quality control*

Stress testing*

User surveys*

Validity testing/validation*

Manage

Fast decommission

Insurance

Redundancy

Restrict regulated dealings

Supply chain audit

User recourse

### H.2: Example High-risk Generative AI Measurement and Management Plan Organized by Generative AI Risk

Table H.2: Example risk measurement and management approaches suitable for high-risk GAI applications organized by GAI risk.

Function
GAI Risk

CBRN Information
Confabulation

Measure

Chaos testing

Cybersecurity testing

Input/output measurement using classifiers

Online metrics/monitoring

Perturbation studies*

Prompt engineering

Root cause analysis*

Sensitivity analysis*

Software testing

Stress testing*

Supply chain auditing

Algorithmic impact assessments

Analyze differences between intended and actual population of users or data subjects*

Assessing data quality*

Bias bounties

Calibration*

Field testing*

Input/output measurement using classifiers

Model assessment*

Model comparison*

Multi-session experiments*

Perturbation studies*

Root cause analysis*

Sensitivity analysis*

Stakeholder engagement and feedback*

Statistical quality control*

Stress testing*

User surveys*

Validity testing/validation*

Manage

CBRN info removal

Fast decommission

Restrict internet access

Supply chain audit

Fast decommission

Insurance

Restrict regulated dealings

Supply chain audit

User recourse

Table H.2: Example risk measurement and management approaches suitable for high-risk GAI applications organized by GAI risk (continued).

Function
GAI Risk

Dangerous or Violent Recommendations
Data Privacy
Environmental
Human-AI Configuration

Measure

Algorithmic impact assessments

Analyze differences between intended and actual population of users or data subjects*

Assessing data quality*

Bias bounties

Calibration*

Chaos testing

Dangerous and violent content removal

Field testing*

Input/output measurement using classifiers

Model assessment*

Model comparison*

Multi-session experiments*

Perturbation studies*

Root cause analysis*

Sensitivity analysis*

Stakeholder engagement and feedback*

Statistical quality control*

Stress testing*

User surveys*

Validity testing/validation*

Algorithmic impact assessments

Assessing data quality.*

Cybersecurity testing

PII identification and removal

Root cause analysis*

Stakeholder engagement and feedback*

Stress testing*

Testing third-party dependencies

Algorithmic impact assessments

Environmental metrics

Model comparison*

Online metrics/monitoring

Supply chain auditing

Algorithmic impact assessments

Analyze differences between intended and actual population of users or data subjects*

Analyzing user feedback

Bias bounties

Calibration*

Explainability/interpretability

Field testing*

Model assessment*

Model comparison*

Multi-session experiments*

Root cause analysis*

Stakeholder engagement and feedback*

UI/UX studies

User surveys*

Validity testing/validation*

Manage

CSAM/Obscenity removal

Fast decommission

Insurance

Restrict minors

Restrict regulated dealings

Sensitive/Personal data removal

Supply chain audit

User recourse

CSAM/Obscenity removal

Fast decommission

Insurance

Intellectual property removal

Restrict minors

Restrict regulated dealings

Sensitive/Personal data removal

Supply chain audit

User recourse

Fast decommission

Insurance

Supply chain audit

User recourse

CSAM/Obscenity removal

Fast decommission

Intellectual property removal

Restrict minors

Restrict regulated dealings

Sensitive/Personal data removal

User recourse

Table H.2: Example risk measurement and management approaches suitable for high-risk GAI applications organized by GAI risk (continued).

Function
GAI Risk

Information Integrity
Information Security
Intellectual Property

Measure

Algorithmic impact assessments

Assessing data quality*

Calibration*

Human content moderation

Data poisoning detection

Field testing*

Model assessment*

Model comparison*

Multi-session experiments*

Perturbation studies*

Root cause analysis*

Screening for information integrity

Sensitivity analysis*

Stakeholder engagement and feedback*

Statistical quality control*

Supply chain auditing

Testing third-party dependencies

User surveys*

Validity testing/validation.*

Algorithmic impact assessments

Anomaly detection*

Assessing data quality*

Bias bounties

Calibration*

Chaos testing

Cybersecurity testing

Data poisoning detection

Model assessment*

Model comparison*

Root cause analysis*

Software testing

Stakeholder engagement and feedback*

Stress testing*

Supply chain auditing

Testing third-party dependencies

Algorithmic impact assessments

Assessing data quality*

Cybersecurity testing

Field testing*

Input/output measurement using classifiers

Model comparison*

Root cause analysis*

Stakeholder engagement and feedback*

Sub-sampling traffic for manually annotating

Supply chain auditing

Testing third-party dependencies

User surveys*

Manage

CSAM/Obscenity removal

Fast decommission

Insurance

Intellectual property removal

Restrict internet access

Restrict minors

Restrict regulated dealings

Sensitive/Personal data removal

Supply chain audit

User recourse

CSAM/Obscenity removal

Fast decommission

Insurance

Intellectual property removal

Redundancy

Restrict internet access

Restrict minors

Restrict regulated dealings

Sensitive/Personal data removal

Supply chain audit

User recourse

Fast decommission

Insurance

Intellectual property removal

Restrict internet access

Supply chain audit

User recourse

Table H.2: Example risk measurement and management approaches suitable for high-risk GAI applications organized by GAI risk (continued).

Function
GAI Risk

Obscene, Degrading, and/or Abusive Content
Toxicity, Bias, and Homogenization
Value Chain and Component Integration

Measure

Algorithmic impact assessments

Assessing data quality*

Calibration*

Field testing*

Input/output measurement using classifiers

Model assessment*

Model comparison*

Root cause analysis*

Small user studies

Software testing

Stakeholder engagement and feedback*

Statistical quality control*

Stress testing*

Supply chain auditing

Testing third-party dependencies

User surveys*

Algorithmic impact assessments

Analyze differences between intended and actual population of users or data subjects*

Anomaly detection*

Assessing data quality*

Bias bounties

Bias testing

Calibration*

Counterfactual/causal analysis

Disaggregated metrics

Field testing*

Model assessment*

Model comparison*

Multi-session experiments*

Root cause analysis*

Software testing

Statistical quality control*

Stress testing*

User surveys*

Validity testing/validation.*

Assessing data quality*

Model assessment*

Model comparison*

Software testing

Supply chain auditing

Testing third-party dependencies

Manage

CSAM/Obscenity removal

Fast decommission

Insurance

Restrict internet access

Restrict minors

Restrict regulated dealings

Sensitive/Personal data removal

Supply chain audit

User recourse

CSAM/Obscenity removal

Fast decommission

Insurance

Intellectual property removal

Restrict regulated dealings

Sensitive/Personal data removal

Supply chain audit

User recourse

CSAM/Obscenity removal

Intellectual property removal

Redundancy

Sensitive/Personal data removal

Supply chain audit

**Usage Note**: Section H puts forward an example risk measurement and management plan for high risk GAI systems or applications. The high
risk plan focuses on field testing and applies extensive risk controls. Measurement and management approaches from Appendices F and G should also
be applied to high risk systems or applications.

- Material in Table H.1 can be applied to measure and manage GAI risks in risk programs that are aligned to the trustworthy characteristics.

- Material in Table H.2 can be applied to measure and manage GAI risks in risk programs that are aligned to GAI risks.

## References

##### AI Verify Foundation and Infocomm Media Development Authority. *Cataloguing LLM Evaluations.* Draft for Discussion, October 2023. [https://aiverifyfoundation.sg/downloads/Cataloguing_LLM_Evaluations.pdf](https://aiverifyfoundation.sg/downloads/Cataloguing_LLM_Evaluations.pdf).

##### AI Verify Foundation and Infocomm Media Development Authority. *LLM Evals Catalogue.* GitHub repository. Accessed September 19, 2024. [https://github.com/aiverify-foundation/LLM-Evals-Catalogue](https://github.com/aiverify-foundation/LLM-Evals-Catalogue).

##### Balloccu, Simone, Patrícia Schmidtová, Mateusz Lango, and Ondřej Dušek. "Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs." *arXiv* preprint, last revised February 22, 2024. [https://doi.org/10.48550/arXiv.2402.03927](https://doi.org/10.48550/arXiv.2402.03927).

##### Bandarkar, Lucas, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. "The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants." *arXiv* preprint, last revised July 25, 2024. [https://doi.org/10.48550/arXiv.2308.16884](https://doi.org/10.48550/arXiv.2308.16884).

##### Barreno, Marco, Blaine Nelson, Anthony D. Joseph, and J.D. Tygar. "The Security of Machine Learning." *Machine Learning* 81, no. 2 (2010): 121–148. [https://doi.org/10.1007/s10994-010-5188-5](https://doi.org/10.1007/s10994-010-5188-5).

##### Bommasani, Rishi, Percy Liang, and Tony Lee. "Holistic Evaluation of Language Models." *Annals of the New York Academy of Sciences* 1525, no. 1 (July 2023): 140–146. [https://doi.org/10.1111/nyas.15007](https://doi.org/10.1111/nyas.15007).

##### Chao, Patrick, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. "Jailbreaking Black Box Large Language Models in Twenty Queries." *arXiv* preprint, last revised July 18, 2024. [https://doi.org/10.48550/arXiv.2310.08419](https://doi.org/10.48550/arXiv.2310.08419).

##### De Wynter, Adrian, Xun Wang, Alex Sokolov, Qilong Gu, and Si-Qing Chen. "An Evaluation on Large Language Model Outputs: Discourse and Memorization." *Natural Language Processing Journal* 4 (September 2023): 100024. [https://doi.org/10.1016/j.nlp.2023.100024](https://doi.org/10.1016/j.nlp.2023.100024).

##### Department for Science, Innovation and Technology, and AI Safety Institute. *International Scientific Report on the Safety of Advanced AI: Interim Report.* Published May 17, 2024. [https://www.gov.uk/government/publications/international-scientific-report-on-the-safety-of-advanced-ai](https://www.gov.uk/government/publications/international-scientific-report-on-the-safety-of-advanced-ai).

##### Derczynski, Leon, Erick Galinkin, Jeffrey Martin, Subho Majumdar, and Nanna Inie. "garak: A Framework for Security Probing Large Language Models." *arXiv* preprint, submitted June 16, 2024. [https://doi.org/10.48550/arXiv.2406.11036](https://doi.org/10.48550/arXiv.2406.11036).

##### Dohmann, Jeremy. "Blazingly Fast LLM Evaluation for In-Context Learning." *Databricks: Mosaic AI Research*, February 2, 2023. [https://www.databricks.com/blog/llm-evaluation-for-icl](https://www.databricks.com/blog/llm-evaluation-for-icl).

##### Duan, Michael, Anshuman Suri, Niloofar Mireshghallah, Sewon Min, Weijia Shi, Luke Zettlemoyer, Yulia Tsvetkov, Yejin Choi, David Evans, and Hannaneh Hajishirzi. "Do Membership Inference Attacks Work on Large Language Models?" *arXiv* preprint, last revised September 16, 2024. [https://doi.org/10.48550/arXiv.2402.07841](https://doi.org/10.48550/arXiv.2402.07841).

##### Durmus, Esin, Karina Nguyen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, et al. "Towards Measuring the Representation of Subjective Global Opinions in Language Models." *arXiv* preprint, last revised April 12, 2024. [https://doi.org/10.48550/arXiv.2306.16388](https://doi.org/10.48550/arXiv.2306.16388).

##### Hugging Face. "Evaluate." Last accessed September 19, 2024. [https://huggingface.co/docs/evaluate/index](https://huggingface.co/docs/evaluate/index).

##### Feng, Shangbin, Chan Young Park, Yuhan Liu, and Yulia Tsvetkov. "From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models." *arXiv* preprint, last revised July 6, 2023. [https://doi.org/10.48550/arXiv.2305.08283](https://doi.org/10.48550/arXiv.2305.08283).

##### FitzGerald, Jack, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, et al. "MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages." *arXiv* preprint, last revised June 17, 2022. [https://doi.org/10.48550/arXiv.2204.08582](https://doi.org/10.48550/arXiv.2204.08582).

##### Gao, Leo, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. *A Framework for Few-Shot Language Model Evaluation.* GitHub repository. Accessed September 19, 2024. [https://github.com/EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).

##### Hall, Patrick, and Daniel Atherton. *Awesome Machine Learning Interpretability.* GitHub repository. Accessed September 19, 2024. [https://github.com/jphall663/awesome-machine-learning-interpretability](https://github.com/jphall663/awesome-machine-learning-interpretability).

##### Hu, Hongsheng, Zoran Salcic, Lichao Sun, Gillian Dobbie, Philip S. Yu, and Xuyun Zhang. "Membership Inference Attacks on Machine Learning: A Survey." *ACM Computing Surveys* 54, no. 11s (September 2022): 1–37. [https://doi.org/10.1145/3523273](https://doi.org/10.1145/3523273).

##### Huang, Yangsibo, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. "Catastrophic Jailbreak of Open-Source LLMs via Exploiting Generation." *ICLR 2024 Spotlight*, published January 16, 2024, last modified March 15, 2024. [https://openreview.net/forum?id=r42tSSCHPh](https://openreview.net/forum?id=r42tSSCHPh).

##### Huang, Yuzhen, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu, Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. "C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models." *arXiv* preprint, last revised November 6, 2023. [https://doi.org/10.48550/arXiv.2305.08322](https://doi.org/10.48550/arXiv.2305.08322).

##### *ISO/IEC 42001:2023. Information Technology — Artificial Intelligence — Management System.* 1st ed. Geneva: International Organization for Standardization, 2023. [https://www.iso.org/obp/ui/en/#iso:std:iso-iec:42001:ed-1:v1:en](https://www.iso.org/obp/ui/en/#iso:std:iso-iec:42001:ed-1:v1:en).

##### Li, Nathaniel, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, et al. "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning." *arXiv* preprint, last revised May 15, 2024. [https://doi.org/10.48550/arXiv.2403.03218](https://doi.org/10.48550/arXiv.2403.03218).

##### Li, Nathaniel, Ziwen Han, Ian Steneker, Willow Primack, et al. "LLM defenses are not robust to multi-turn human jailbreaks yet." *arXiv* preprint, last revised Wed, September 4, 2024. [https://arxiv.org/pdf/2408.15221](https://arxiv.org/pdf/2408.15221).

##### Liu, Yi, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. "Prompt Injection Attack Against LLM-Integrated Applications." *arXiv* preprint, last revised March 2, 2024. [https://doi.org/10.48550/arXiv.2306.05499](https://doi.org/10.48550/arXiv.2306.05499).

##### McGraw, Gary, Harold Figueroa, Katie McMahon, and Richie Bonett. *An Architectural Risk Analysis of Large Language Models: Applied Machine Learning Security.* Version 1.0. Berryville Institute of Machine Learning (BIML), January 24, 2024. [https://berryvilleiml.com/docs/BIML-LLM24.pdf](https://berryvilleiml.com/docs/BIML-LLM24.pdf).

##### McGraw, Gary, Harold Figueroa, Victor Shepardson, and Richie Bonett. *An Architectural Risk Analysis of Machine Learning Systems: Toward More Secure Machine Learning.* Version 1.0 (1.13.20). Berryville Institute of Machine Learning (BIML), January 13, 2020. [https://berryvilleiml.com/docs/ara.pdf](https://berryvilleiml.com/docs/ara.pdf).

##### Mehrotra, Anay, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. "Tree of Attacks: Jailbreaking Black-Box LLMs Automatically." *arXiv* preprint, last revised February 21, 2024. [https://doi.org/10.48550/arXiv.2312.02119](https://doi.org/10.48550/arXiv.2312.02119).

##### Microsoft. *Microsoft Responsible AI Standard, v2: General Requirements.* For External Release. June 2022. [https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE5cmFl](https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE5cmFl).

##### National Institute of Standards and Technology (NIST). *Artificial Intelligence Risk Management Framework (AI RMF 1.0).* NIST AI 100-1. Gaithersburg, MD: NIST, January 26, 2023. [https://doi.org/10.6028/NIST.AI.100-1](https://doi.org/10.6028/NIST.AI.100-1).

##### National Institute of Standards and Technology (NIST). *Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile.* NIST AI 600-1. Gaithersburg, MD: NIST, July 2024. [https://doi.org/10.6028/NIST.AI.600-1](https://doi.org/10.6028/NIST.AI.600-1).

##### National Institute of Standards and Technology (NIST). *Guide for Conducting Risk Assessments.* NIST Special Publication 800-30 Rev. 1. Prepared by the Joint Task Force Transformation Initiative. Gaithersburg, MD: NIST, September 2012. [https://doi.org/10.6028/NIST.SP.800-30r1](https://doi.org/10.6028/NIST.SP.800-30r1).

##### National Institute of Standards and Technology (NIST). *NIST AI RMF Playbook.* Trustworthy & Responsible AI Resource Center. Accessed September 19, 2024. [https://airc.nist.gov/AI_RMF_Knowledge_Base/Playbook](https://airc.nist.gov/AI_RMF_Knowledge_Base/Playbook).

##### Office of the Comptroller of the Currency (OCC). *Model Risk Management.* Comptroller’s Handbook, Version 1.0, August 2021. [https://www.occ.gov/publications-and-resources/publications/comptrollers-handbook/files/model-risk-management/index-model-risk-management.html](https://www.occ.gov/publications-and-resources/publications/comptrollers-handbook/files/model-risk-management/index-model-risk-management.html).

##### Perez, Ethan, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. "Red Teaming Language Models with Language Models." *arXiv* preprint, submitted February 7, 2022. [https://doi.org/10.48550/arXiv.2202.03286](https://doi.org/10.48550/arXiv.2202.03286).

##### Piet, Julien, Chawin Sitawarin, Vivian Fang, Norman Mu, and David Wagner. "Mark My Words: Analyzing and Evaluating Language Model Watermarks." *arXiv* preprint, last revised December 7, 2023. [https://doi.org/10.48550/arXiv.2312.00273](https://doi.org/10.48550/arXiv.2312.00273).

##### Rutinowski, Jérôme, Sven Franke, Jan Endendyk, Ina Dormuth, Moritz Roidl, and Markus Pauly. "The Self-Perception and Political Biases of ChatGPT." *Human Behavior and Emerging Technologies*, 2024. [https://doi.org/10.1155/2024/7115633](https://doi.org/10.1155/2024/7115633).

##### Saravia, Elvis. *Prompt Engineering Guide.* GitHub repository. Last modified December 2022. Accessed September 19, 2024. [https://github.com/dair-ai/Prompt-Engineering-Guide](https://github.com/dair-ai/Prompt-Engineering-Guide).

##### Shen, Xinyue, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "‘Do Anything Now’: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models." *arXiv* preprint, last revised May 15, 2024. [https://doi.org/10.48550/arXiv.2308.03825](https://doi.org/10.48550/arXiv.2308.03825).

##### Shi, Weijia, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. "Detecting Pretraining Data from Large Language Models." *arXiv* preprint, last revised March 9, 2024. [https://doi.org/10.48550/arXiv.2310.16789](https://doi.org/10.48550/arXiv.2310.16789).

##### Shumailov, Ilia, Yiren Zhao, Daniel Bates, Nicolas Papernot, Robert Mullins, and Ross Anderson. "Sponge Examples: Energy-Latency Attacks on Neural Networks." In *2021 IEEE European Symposium on Security and Privacy (EuroS&P)*, 6–10 September 2021, Vienna, Austria. IEEE, 2021. [https://doi.org/10.1109/EuroSP51992.2021.00024](https://doi.org/10.1109/EuroSP51992.2021.00024).

##### Sitawarin, Chawin, Charlie Cheng-Jie Ji, Apurv Verma, and Luckyfan-cs. *LLM Security & Privacy.* GitHub repository. Accessed September 19, 2024. [https://github.com/chawins/llm-sp](https://github.com/chawins/llm-sp).

##### Smith, Eric Michael, Melissa Hall, Melanie Kambadur, Eleonora Presani, and Adina Williams. "‘I’m Sorry to Hear That’: Finding New Biases in Language Models with a Holistic Descriptor Dataset." *arXiv* preprint, last revised October 27, 2022. [https://doi.org/10.48550/arXiv.2205.09209](https://doi.org/10.48550/arXiv.2205.09209).

##### Srivastava, Aarohi, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, et al. "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models." *arXiv* preprint, last revised June 12, 2023. [https://doi.org/10.48550/arXiv.2206.04615](https://doi.org/10.48550/arXiv.2206.04615).

##### Staab, Robin, Mark Vero, Mislav Balunović, and Martin Vechev. "Beyond Memorization: Violating Privacy via Inference with Large Language Models." *arXiv* preprint, last revised May 6, 2024. [https://doi.org/10.48550/arXiv.2310.07298](https://doi.org/10.48550/arXiv.2310.07298).

##### Storchan, Victor, Ravin Kumar, Rumman Chowdhury, Seraphina Goldfarb-Tarrant, and Sven Cattell. *Generative AI Red Teaming Challenge: Transparency Report.* Humane Intelligence, 2024. [https://drive.google.com/file/d/1JqpbIP6DNomkb32umLoiEPombK2-0Rc-/view](https://drive.google.com/file/d/1JqpbIP6DNomkb32umLoiEPombK2-0Rc-/view).

##### Vidgen, Bertie, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, et al. "Introducing v0.5 of the AI Safety Benchmark from MLCommons." *arXiv* preprint, last revised May 13, 2024. [https://doi.org/10.48550/arXiv.2404.12241](https://doi.org/10.48550/arXiv.2404.12241).

##### Wang, Boxin, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models." In *Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23)*, Article No. 1361, 31232–31339. Published May 30, 2024. [https://dl.acm.org/doi/10.5555/3666122.3667483](https://dl.acm.org/doi/10.5555/3666122.3667483).

##### Ye, Seonghyeon, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. "FLASK: Fine-Grained Language Model Evaluation Based on Alignment Skill Sets." *arXiv* preprint, last revised April 14, 2024. [https://doi.org/10.48550/arXiv.2307.10928](https://doi.org/10.48550/arXiv.2307.10928).

##### Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, and Eric P. Xing. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." In *Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23)*, Article No. 2020, 46595–46623. Published May 30, 2024. [https://dl.acm.org/doi/10.5555/3666122.3668142](https://dl.acm.org/doi/10.5555/3666122.3668142).

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/jphall663/gai_risk_management

Awesome Lists containing this project

README