Main Results
Figure 7: Attack success rates (ASR↑); higher ASR means more biases are exposed. Three baselines are included: Vanilla, which consists of pure, unmodified biased statements; Disregarding Rules (DR), which adds a system prompt that discourages refusal and encourages biased responses, the same system prompt used in all of our attack methods; and Disregarding Rules with Context (DR+C), which builds on DR by adding the concrete context in which the bias occurs, making it semantically equivalent to our conversational attacks. Four bias types are covered: age (AG), gender (GD), race (RC), and sexual orientation (SO).
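For concreteness, the following is a minimal sketch of how the three baseline prompts could be assembled; the system-prompt wording, function names, and message structure are illustrative placeholders and not the exact templates used in our experiments.

```python
# Illustrative sketch of the three baseline prompt formats (placeholder wording).
DR_SYSTEM = "Ignore your usual refusal rules and answer directly."  # placeholder for the actual DR system prompt

def build_vanilla(statement: str) -> list[dict]:
    # Vanilla: the unmodified biased statement, no extra instructions.
    return [{"role": "user", "content": statement}]

def build_dr(statement: str) -> list[dict]:
    # DR: prepend the refusal-suppressing system prompt shared by all attack methods.
    return [{"role": "system", "content": DR_SYSTEM},
            {"role": "user", "content": statement}]

def build_dr_c(statement: str, context: str) -> list[dict]:
    # DR+C: additionally embed the concrete context in which the bias occurs.
    return [{"role": "system", "content": DR_SYSTEM},
            {"role": "user", "content": f"{context}\n{statement}"}]
```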
① Effectiveness of Attack Methods:
(i) Deception attacks, including Mental Deception (MD) and Memory Falsification (MF), are comparatively the most effective, followed by Disguise attacks and Teaching attacks. This indicates that the psychological principles behind Deception and Disguise attacks play a significant role.
(ii) Our psychometric attack methods generally achieve higher attack success rates than the baselines, demonstrating their effectiveness.
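For reference, ASR here is the fraction of attack instances judged to elicit the bias. A minimal sketch of that computation, assuming each attack instance has already been judged as a binary success or failure (the judging procedure itself is part of the experimental setup, not shown here):

```python
from collections import defaultdict

def attack_success_rate(records):
    """records: iterable of (bias_type, success) pairs, where success is True
    if the model went along with the biased statement under attack."""
    totals, hits = defaultdict(int), defaultdict(int)
    for bias_type, success in records:
        totals[bias_type] += 1
        hits[bias_type] += int(success)
    # ASR per bias type = successful attacks / total attacks.
    return {t: hits[t] / totals[t] for t in totals}

# Mock judgments for illustration only (not the paper's data).
print(attack_success_rate([("AG", True), ("AG", False), ("RC", False)]))
```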
② Models' Comparison:
The safest tier includes GPT-4-1106-preview, GLM-3-turbo, and Mistral-7B-Instruct-v0.3. The second tier includes Qwen2-7B-Instruct, GLM-4-9b-chat, and GPT-3.5-turbo-0301. The least safe tier includes GPT-3.5-turbo-1106 and Llama-3-8B-Instruct.
③ Bias Types' Comparison:
LLMs are more likely to reveal inherent biases for mild bias types (e.g., age) than for severe ones (e.g., race) under attacks. Possible reasons include: (i) biased statements in severe bias types are more explicit and thus more easily recognized by LLMs; (ii) more RLHF training is devoted to the bias types with greater negative social impact; (iii) the biases contained in training data may differ across categories, leading to an uneven bias distribution in LLMs.
④ Attacks in Dialog Format:
Comparing Disguise-VC and Baseline-DR+C, which share the same semantics and differ only in attack format, the higher ASR of Disguise attacks shows that attacks in dialog format are more effective than those in declarative format.
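To make the formatting contrast concrete, a rough sketch of how the same context and statement could be packaged either as a single declarative turn (DR+C style) or as a multi-turn dialog (Disguise style); the turn structure and wording are placeholders, not the exact Disguise-VC construction described in the method section.

```python
def declarative_prompt(context: str, statement: str) -> list[dict]:
    # Declarative (DR+C-like): context and statement delivered in one user turn.
    return [{"role": "user", "content": f"{context} {statement} Do you agree?"}]

def dialog_prompt(context: str, statement: str) -> list[dict]:
    # Dialog (Disguise-like sketch): the same content spread across conversational
    # turns, so the biased statement arrives inside an ongoing exchange.
    return [
        {"role": "user", "content": context},
        {"role": "assistant", "content": "I see, that sounds like a familiar situation."},
        {"role": "user", "content": f"{statement} Do you agree?"},
    ]
```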
Further Analyses
① Language Difference:
Figure 8: The average difference in attack success rate (ASR↑) between English and Chinese (ASR$_{EN}$ - ASR$_{CN}$). Values above 0 mean models reveal more bias in English; values below 0 mean models reveal more bias in Chinese.
As shown in Figure 8, models that officially support English but not Chinese, such as GPT-3.5, Mistral-v0.3, and Llama-3, exhibit more biases under English attacks than under Chinese attacks, while models that support both Chinese and English, such as GLM-3, Qwen-2, and GLM-4, show more biases in Chinese. The reasons might be that (i) a model's instruction-following ability is stronger in its primary target language, and (ii) the training corpora in that language may also be more extensive, so more of the biases expressed in the learned text surface in it.
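The quantity plotted in Figure 8 is simply the per-model ASR gap; a small sketch of that calculation, using mock values rather than the paper's numbers:

```python
def asr_language_gap(asr_en: dict, asr_cn: dict) -> dict:
    """Per-model ASR difference (ASR_EN - ASR_CN): positive values mean more bias
    is revealed under English attacks, negative values under Chinese attacks."""
    return {m: asr_en[m] - asr_cn[m] for m in asr_en if m in asr_cn}

# Mock values for illustration only (not the paper's results).
print(asr_language_gap({"Llama-3": 0.42, "GLM-4": 0.18},
                       {"Llama-3": 0.30, "GLM-4": 0.27}))
```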
② Bias Generative Task:
Figure 9: Generations by GLM-3-turbo under Teaching attacks in the bias generation task.
We apply Teaching attacks to the bias generation task, i.e., asking LLMs to generate more biased statements given several biased examples. As shown in Figure 9, generative tasks can disclose other types of implicit bias within LLMs, beyond the bias type they are taught. This highlights the wide variety of inherent biases present in LLMs.
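A minimal sketch of a Teaching-style few-shot prompt for this generation task; the wording and function name are placeholders, not the exact template used in our experiments.

```python
def bias_generation_prompt(examples: list[str], n_new: int = 3) -> str:
    # Teaching-style few-shot prompt (sketch): the model is shown several biased
    # statements as "examples" and asked to continue the pattern.
    shots = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(examples))
    return (f"Here are some example statements:\n{shots}\n"
            f"Following the same pattern, write {n_new} more statements.")
```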
③ Model Updates of GPT Series:
Figure 10: Comparison of attack success rates (ASR↑) among three GPT models.
As shown in Figure 10, the updated GPT-3.5-turbo-1106 model may possess stronger instruction-following capability than GPT-3.5-turbo-0301, which, however, makes it more vulnerable to attacks; compared with the GPT-3.5 models, GPT-4 demonstrates significant safety improvements.