GPT-4o machine translation of the article “Uncensor any LLM with abliteration”

Modern large language models (LLMs) are fine-tuned for safety and instruction following, which means they are trained to refuse harmful requests. In their blog post, Arditi et al. showed that this refusal behavior is mediated by a specific direction in the model's residual stream. If we prevent the model from representing this direction, it loses its ability to refuse requests. Conversely, adding this direction artificially can cause the model to refuse even harmless requests.

In a traditional decoder-only architecture like Llama, there are three residual streams we can target: at the start of each block (“pre”), between the attention and MLP layers (“mid”), and after the MLP (“post”). The following illustration shows the location of each residual stream.
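For reference, these three positions correspond to hook names used later in the code. Here is a minimal sketch of how they are addressed in TransformerLens (layer 0 is just an example):

from transformer_lens import utils

# Hook names for the three residual stream positions of block 0
print(utils.get_act_name("resid_pre", 0))   # blocks.0.hook_resid_pre  (start of the block)
print(utils.get_act_name("resid_mid", 0))   # blocks.0.hook_resid_mid  (between attention and MLP)
print(utils.get_act_name("resid_post", 0))  # blocks.0.hook_resid_post (after the MLP)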

To uncensor an LLM, we first need to identify the “refusal direction” within the model. This process involves a few technical steps:

  1. Data collection: Run the model on a set of harmful instructions and a set of harmless instructions, recording the residual stream activations at the last token position for each.

  2. Mean difference: Compute the mean difference between the activations of harmful and harmless instructions. This gives us a vector representing the “refusal direction” for each layer of the model.

  3. Selection: Normalize these vectors and evaluate them to select the single best “refusal direction” (a minimal sketch of steps 2 and 3 follows this list).
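Here is a minimal sketch of what steps 2 and 3 compute for a single layer. The tensors harmful_acts and harmless_acts are hypothetical placeholders for the cached last-token activations of shape [n_samples, d_model]:

import torch

# Hypothetical last-token activations for one layer, shape [n_samples, d_model]
harmful_acts = torch.randn(256, 4096)
harmless_acts = torch.randn(256, 4096)

# Step 2: the difference of means gives a candidate refusal direction
refusal_dir = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)

# Step 3: normalize it to a unit vector before scoring and selection
refusal_dir = refusal_dir / refusal_dir.norm()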

Once we have identified the refusal direction, we can “ablate” it, effectively removing the model's ability to represent this feature. This can be done via an inference-time intervention or permanently with weight orthogonalization.

Let's talk about the inference-time intervention first. For every component that writes to the residual stream (such as an attention head), we calculate the projection of its output onto the refusal direction and subtract this projection. This subtraction is applied at every token and every layer, ensuring that the model never represents the refusal direction.
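Concretely, the intervention removes a single projection. A minimal sketch, assuming activation is a residual stream tensor and refusal_dir is a unit vector of matching dimension (both are hypothetical placeholders here):

import torch

# Hypothetical shapes: activation [batch, seq_len, d_model], refusal_dir [d_model] (unit norm)
activation = torch.randn(2, 16, 4096)
refusal_dir = torch.nn.functional.normalize(torch.randn(4096), dim=0)

# Project every activation vector onto the refusal direction and subtract that component
proj = (activation @ refusal_dir).unsqueeze(-1) * refusal_dir
activation_ablated = activation - proj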

Weight orthogonalization, on the other hand, involves modifying the model's weights directly. By orthogonalizing the component weights with respect to the refusal direction, we prevent the model from ever writing to this direction. This is achieved by adjusting the matrices that write to the residual stream so that they cannot contribute to the refusal direction.
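The adjustment is the same projection removal, applied once to the weights instead of at every forward pass. A minimal sketch, assuming the matrix stores its residual-stream output in its last dimension (as TransformerLens weights do) and refusal_dir is a unit vector:

import torch

# Hypothetical weight matrix that writes to the residual stream, e.g. shape [d_mlp, d_model]
W = torch.randn(14336, 4096)
refusal_dir = torch.nn.functional.normalize(torch.randn(4096), dim=0)

# Remove from every row its component along the refusal direction,
# so this layer can no longer write anything in that direction
W_orthogonalized = W - (W @ refusal_dir).unsqueeze(-1) * refusal_dir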

Implementation
The following implementation of abliteration is based on FailSpy's notebook, which is itself based on the original authors' notebook. I mostly adapted and simplified it to make it easier to understand. This section is quite heavy on code so you can see what is going on, but you can use FailSpy's abliterator library if you're less interested in the technical details (also check out his collection of abliterated models on Hugging Face).

The code relies on the excellent TransformerLens library (formerly known as EasyTransformer) to do the heavy lifting. It is designed for mechanistic interpretability and is used here to intervene on activations. Thanks to Neel Nanda and Joseph Bloom for creating and maintaining this library.

First, let's install the required packages and import them.

!pip install transformers transformers_stream_generator tiktoken transformer_lens einops jaxtyping

import torch
import functools
import einops
import gc

from datasets import load_dataset
from tqdm import tqdm
from torch import Tensor
from typing import List
from transformer_lens import HookedTransformer, utils
from transformer_lens.hook_points import HookPoint
from transformers import AutoModelForCausalLM, AutoTokenizer
from jaxtyping import Float, Int
from collections import defaultdict

# Turn automatic differentiation off to save GPU memory (credit: Undi95)
torch.set_grad_enabled(False)

We need two datasets: one containing harmless instructions and one containing harmful instructions. We will use tatsu-lab/alpaca as well as data from llm-attacks. To make things easier, I repackaged them into two Hugging Face datasets: mlabonne/harmless_alpaca and mlabonne/harmful_behaviors. That way, you can easily replace them with your own datasets. We will load the instructions and reformat them into a list of dictionaries with “role” and “content” keys. This makes them compatible with the apply_chat_template() method, which we will use to follow Llama 3's chat template.

def reformat_texts(texts):
    return [[{"role": "user", "content": text}] for text in texts]

# Get harmful and harmless datasets
def get_harmful_instructions():
    dataset = load_dataset('mlabonne/harmful_behaviors')
    return reformat_texts(dataset['train']['text']), reformat_texts(dataset['test']['text'])

def get_harmless_instructions():
    dataset = load_dataset('mlabonne/harmless_alpaca')
    return reformat_texts(dataset['train']['text']), reformat_texts(dataset['test']['text'])

harmful_inst_train, harmful_inst_test = get_harmful_instructions()
harmless_inst_train, harmless_inst_test = get_harmless_instructions()

Now that we have our datasets, we can load the model we want to abliterate. Unfortunately, you can't directly load a custom model using HookedTransformer. Here, I use a trick described in FailSpy's notebook to download a custom model and rename it as meta-llama/Meta-Llama-3-8B-Instruct. Load in torch.float16 format if your GPU is not compatible with BF16.

In this example, we will use mlabonne/Daredevil-8B, a mega-merge created with DARE TIES (see my article about model merging) that has the highest MMLU score in the 8B category on the Open LLM Leaderboard.

MODEL_ID = "mlabonne/Daredevil-8B"
MODEL_TYPE = "meta-llama/Meta-Llama-3-8B-Instruct"

# Download and load model
!git clone https://huggingface.co/{MODEL_ID} {MODEL_TYPE}

# Load model and tokenizer
model = HookedTransformer.from_pretrained_no_processing(
    MODEL_TYPE,
    local_files_only=True,
    dtype=torch.bfloat16,
    default_padding_side="left"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_TYPE)
tokenizer.padding_side="left"
tokenizer.pad_token = tokenizer.eos_token

We can now tokenize our datasets. We use the same number of samples for both harmless and harmful instructions. Note that a high number of samples can use up all the RAM/VRAM, which is why I'm limiting it to 256 here.

def tokenize_instructions(tokenizer, instructions):
    return tokenizer.apply_chat_template(
        instructions,
        padding=True,
        truncation=False,
        return_tensors="pt",
        return_dict=True,
        add_generation_prompt=True,
    ).input_ids

n_inst_train = min(256, len(harmful_inst_train), len(harmless_inst_train))

# Tokenize datasets
harmful_tokens = tokenize_instructions(
    tokenizer,
    instructions=harmful_inst_train[:n_inst_train],
)
harmless_tokens = tokenize_instructions(
    tokenizer,
    instructions=harmless_inst_train[:n_inst_train],
)

Everything is set up, so we can now implement the first step of abliteration: data collection. We want to process these tokenized datasets and store the residual stream activations for harmful and harmless instructions. This is handled by the transformer_lens library.

# Define batch size based on available VRAM
batch_size = 32

# Initialize defaultdicts to store activations
harmful = defaultdict(list)
harmless = defaultdict(list)

# Process the training data in batches
num_batches = (n_inst_train + batch_size - 1) // batch_size
for i in tqdm(range(num_batches)):
    print(i)
    start_idx = i * batch_size
    end_idx = min(n_inst_train, start_idx + batch_size)

    # Run models on harmful and harmless prompts, cache activations
    harmful_logits, harmful_cache = model.run_with_cache(
        harmful_tokens[start_idx:end_idx],
        names_filter=lambda hook_name: 'resid' in hook_name,
        device="cpu",
        reset_hooks_end=True
    )
    harmless_logits, harmless_cache = model.run_with_cache(
        harmless_tokens[start_idx:end_idx],
        names_filter=lambda hook_name: 'resid' in hook_name,
        device="cpu",
        reset_hooks_end=True
    )

    # Collect and store the activations
    for key in harmful_cache:
        harmful[key].append(harmful_cache[key])
        harmless[key].append(harmless_cache[key])

    # Flush RAM and VRAM
    del harmful_logits, harmless_logits, harmful_cache, harmless_cache
    gc.collect()
    torch.cuda.empty_cache()

# Concatenate the cached activations
harmful = {k: torch.cat(v) for k, v in harmful.items()}
harmless = {k: torch.cat(v) for k, v in harmless.items()}

We can now compute the refusal direction for each layer. It corresponds to the mean difference between the activations of harmful and harmless instructions, which is then normalized. We sort the directions in descending order in the activation_scored variable.

# Helper function to get activation index
def get_act_idx(cache_dict, act_name, layer):
    key = (act_name, layer)
    return cache_dict[utils.get_act_name(*key)]

# Compute difference of means between harmful and harmless activations at intermediate layers
activation_layers = ["resid_pre", "resid_mid", "resid_post"]
activation_refusals = defaultdict(list)

for layer_num in range(1, model.cfg.n_layers):
    pos = -1  # Position index

    for layer in activation_layers:
        harmful_mean_act = get_act_idx(harmful, layer, layer_num)[:, pos, :].mean(dim=0)
        harmless_mean_act = get_act_idx(harmless, layer, layer_num)[:, pos, :].mean(
            dim=0
        )

        refusal_dir = harmful_mean_act - harmless_mean_act
        refusal_dir = refusal_dir / refusal_dir.norm()
        activation_refusals[layer].append(refusal_dir)

# Get all calculated potential refusal directions, sort them in descending order based on their mean
# Use a subset of layers if certain activations are not promising
selected_layers = ["resid_pre"]
activation_scored = sorted(
    [
        activation_refusals[layer][l - 1]
        for l in range(1, model.cfg.n_layers)
        for layer in selected_layers
    ],
    key=lambda x: abs(x.mean()),
    reverse=True,
)

The final step of the process consists of evaluating the refusal directions we calculated. To do this, we apply the refusal direction to each residual stream and each block during inference. In the following snippet, we get generations for four test harmful instructions and 20 blocks (or layers).

def _generate_with_hooks(
    model: HookedTransformer,
    tokenizer: AutoTokenizer,
    tokens: Int[Tensor, "batch_size seq_len"],
    max_tokens_generated: int = 64,
    fwd_hooks=[],
) -> List[str]:
    all_tokens = torch.zeros(
        (tokens.shape[0], tokens.shape[1] + max_tokens_generated),
        dtype=torch.long,
        device=tokens.device,
    )
    all_tokens[:, : tokens.shape[1]] = tokens
    for i in range(max_tokens_generated):
        with model.hooks(fwd_hooks=fwd_hooks):
            logits = model(all_tokens[:, : -max_tokens_generated + i])
            next_tokens = logits[:, -1, :].argmax(
                dim=-1
            )  # greedy sampling (temperature=0)
            all_tokens[:, -max_tokens_generated + i] = next_tokens
    return tokenizer.batch_decode(
        all_tokens[:, tokens.shape[1] :], skip_special_tokens=True
    )

def get_generations(
    model: HookedTransformer,
    tokenizer: AutoTokenizer,
    instructions: List[str],
    fwd_hooks=[],
    max_tokens_generated: int = 64,
    batch_size: int = 4,
) -> List[str]:
    generations = []
    for i in tqdm(range(0, len(instructions), batch_size)):
        tokens = tokenize_instructions(
            tokenizer, instructions=instructions[i : i + batch_size]
        )
        generation = _generate_with_hooks(
            model,
            tokenizer,
            tokens,
            max_tokens_generated=max_tokens_generated,
            fwd_hooks=fwd_hooks,
        )
        generations.extend(generation)
    return generations

# Inference-time intervention hook
def direction_ablation_hook(
    activation: Float[Tensor, "... d_act"],
    hook: HookPoint,
    direction: Float[Tensor, "d_act"],
):
    if activation.device != direction.device:
        direction = direction.to(activation.device)
    proj = (
        einops.einsum(
            activation, direction.view(-1, 1), "... d_act, d_act single -> ... single"
        )
        * direction
    )
    return activation - proj

# Testing baseline
N_INST_TEST = 4
baseline_generations = get_generations(
    model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]
)

# Evaluating layers defined earlier (needs human evaluation to determine best layer for refusal inhibition)
EVAL_N = 20  # Evaluate how many of the top N potential directions
evals = []
for refusal_dir in tqdm(activation_scored[:EVAL_N]):
    hook_fn = functools.partial(direction_ablation_hook, direction=refusal_dir)
    fwd_hooks = [
        (utils.get_act_name(act_name, layer), hook_fn)
        for layer in list(range(model.cfg.n_layers))
        for act_name in activation_layers
    ]
    intervention_generations = get_generations(
        model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=fwd_hooks
    )
    evals.append(intervention_generations)

We stored all the generations in the evals list. We can now print them and manually select the layer (block) that provides an uncensored response for each instruction. I automatically exclude responses containing “I cannot” or “I can't” to filter out unwanted answers.

If you can't find a layer that satisfies these requirements, you may want to test other residual streams in the selected_layers list, other instructions, additional blocks, etc.
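For example, one way to widen the search is to score all three residual stream positions and evaluate more candidate directions. This is only a sketch: after changing these values, you need to re-run the scoring and evaluation snippets above.

# Widen the search (then re-run the sorting and evaluation cells)
selected_layers = ["resid_pre", "resid_mid", "resid_post"]
EVAL_N = 40  # more candidates to inspect, at a proportionally higher inference cost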

# Print generations for human evaluation
blacklist = ["I cannot", "I can't"]
for i in range(N_INST_TEST):
    print(f"\033[1mINSTRUCTION {i}: {harmful_inst_test[i]}")
    print(f"\nBASELINE COMPLETION:\n{baseline_generations[i]}\033[0m")
    for layer_candidate in range(EVAL_N):
        if not any(word in evals[layer_candidate][i] for word in blacklist):
            print(f"\n---\n\nLAYER CANDIDATE #{layer_candidate} INTERVENTION COMPLETION:")
            print(evals[layer_candidate][i])

In my case, layer candidate 9 managed to provide an uncensored answer for the four instructions. This is the one we will select as the refusal direction. Next, we implement weight orthogonalization to modify the weights and prevent the model from writing to this direction. You can verify that the model is successfully uncensored by printing the completions.

def get_orthogonalized_matrix(
    matrix: Float[Tensor, "... d_model"], vec: Float[Tensor, "d_model"]
) -> Float[Tensor, "... d_model"]:
    proj = (
        einops.einsum(
            matrix, vec.view(-1, 1), "... d_model, d_model single -> ... single"
        )
        * vec
    )
    return matrix - proj

# Select the layer with the highest potential refusal direction
LAYER_CANDIDATE = 9
refusal_dir = activation_scored[LAYER_CANDIDATE]

# Orthogonalize the model's weights
if refusal_dir.device != model.W_E.device:
    refusal_dir = refusal_dir.to(model.W_E.device)
model.W_E.data = get_orthogonalized_matrix(model.W_E, refusal_dir)

for block in tqdm(model.blocks):
    if refusal_dir.device != block.attn.W_O.device:
        refusal_dir = refusal_dir.to(block.attn.W_O.device)
    block.attn.W_O.data = get_orthogonalized_matrix(block.attn.W_O, refusal_dir)
    block.mlp.W_out.data = get_orthogonalized_matrix(block.mlp.W_out, refusal_dir)

# Generate text with abliterated model
orthogonalized_generations = get_generations(
    model, tokenizer, harmful_inst_test[:N_INST_TEST], fwd_hooks=[]
)

# Print generations
for i in range(N_INST_TEST):
    if len(baseline_generations) > i:
        print(f"INSTRUCTION {i}: {harmful_inst_test[i]}")
        print(f"\033[92mBASELINE COMPLETION:\n{baseline_generations[i]}")
    print(f"\033[91mINTERVENTION COMPLETION:\n{evals[LAYER_CANDIDATE][i]}")
    print(f"\033[95mORTHOGONALIZED COMPLETION:\n{orthogonalized_generations[i]}\n")

Now we are ready to use the model. We convert it back to Hugging Face format and upload it to the HF hub.

# Convert model back to HF safetensors
hf_model = AutoModelForCausalLM.from_pretrained(MODEL_TYPE, torch_dtype=torch.bfloat16)
lm_model = hf_model.model

state_dict = model.state_dict()
lm_model.embed_tokens.weight = torch.nn.Parameter(state_dict["embed.W_E"].cpu())

for l in range(model.cfg.n_layers):
    lm_model.layers[l].self_attn.o_proj.weight = torch.nn.Parameter(
        einops.rearrange(
            state_dict[f"blocks.{l}.attn.W_O"], "n h m->m (n h)", n=model.cfg.n_heads
        ).contiguous()
    )
    lm_model.layers[l].mlp.down_proj.weight = torch.nn.Parameter(
        torch.transpose(state_dict[f"blocks.{l}.mlp.W_out"], 0, 1).contiguous()
    )

hf_model.push_to_hub(f"{MODEL_ID}-abliterated")
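As a quick sanity check, you can reload the uploaded model with the standard transformers API. This is a hedged sketch: it assumes the push above succeeded and that the repository is available under f"{MODEL_ID}-abliterated"; the tokenizer is loaded from the local MODEL_TYPE directory since it was not modified.

# Sketch: reload the uploaded model and generate a short completion
check_model = AutoModelForCausalLM.from_pretrained(
    f"{MODEL_ID}-abliterated", torch_dtype=torch.bfloat16
)
check_tokenizer = AutoTokenizer.from_pretrained(MODEL_TYPE)

prompt = check_tokenizer.apply_chat_template(
    [{"role": "user", "content": "Tell me about yourself."}],
    add_generation_prompt=True,
    return_tensors="pt",
)
output = check_model.generate(prompt, max_new_tokens=64)
print(check_tokenizer.decode(output[0], skip_special_tokens=True))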
