Jailbreaking Experimentation with Inspect

This code was written to play around with various jailbreaking techniques ahead of the Gray Swan AI Ultimate Jailbreaking Competition. I've written that up here on my blog.

This is disorganized experimental mix-and-match code that helped me do a few things:

Test 10 off-the-shelf jailbreaking prompt templates on 15 sample HarmBench target behaviors against 18 LLMs from a range of providers, automatically scoring their outputs with 4 rating methods
Quickly pipe several combinations of Behavior x Attack x Paraphrase into a Google Sheet
Play around with running open-weight GCG attacks
Convert competition chat log files into spreadsheet format
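The Behavior x Attack x Paraphrase grid described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in jailbreak.py: the names `BEHAVIORS`, `ATTACKS`, `PARAPHRASES`, `build_rows`, and `rows_to_csv` are hypothetical, the values are placeholders, and the CSV output stands in for the Google Sheet step (a CSV file can be imported into a sheet by hand).

```python
import csv
import io
from itertools import product

# Placeholder values only (assumptions for illustration).
# The real behaviors, templates, and paraphrases are not reproduced here.
BEHAVIORS = ["<behavior 1>", "<behavior 2>"]
ATTACKS = {
    "baseline": "{behavior}",                 # pass the behavior through unchanged
    "wrapped": "<template text> {behavior}",  # stand-in for a prompt template
}
PARAPHRASES = {
    "none": lambda text: text,  # no rewrite
    "upper": str.upper,         # trivial stand-in for a paraphrase transform
}


def build_rows(behaviors, attacks, paraphrases):
    """Return one row per (behavior, attack, paraphrase) combination."""
    rows = []
    for behavior, (attack_name, template), (para_name, paraphrase) in product(
        behaviors, attacks.items(), paraphrases.items()
    ):
        # Paraphrase the behavior first, then drop it into the attack template.
        prompt = template.format(behavior=paraphrase(behavior))
        rows.append(
            {
                "behavior": behavior,
                "attack": attack_name,
                "paraphrase": para_name,
                "prompt": prompt,
            }
        )
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text that a spreadsheet can import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["behavior", "attack", "paraphrase", "prompt"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_rows(BEHAVIORS, ATTACKS, PARAPHRASES)
csv_text = rows_to_csv(rows)
```

With two placeholder values on each axis, this yields 2 x 2 x 2 = 8 rows. Keeping the attack and paraphrase names as separate columns makes it easy to filter or pivot by either one in the spreadsheet.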

It's most useful as a quick demo of how one might use the UK AISI's Inspect framework, which is a fantastic tool for doing all sorts of evals, whether on AI capabilities elicitation or red-teaming techniques. You may just want to read their docs, which will have more canonical ways of using their tooling, rather than starting from this baseline.

I don't imagine you will want to run this, but if you do, use uv to manage your Python environment, then run the scripts with uv run jailbreak.py, uv run gcg.py, etc. You'll also want to put API keys for some subset of OpenAI, Anthropic, Together.ai, Hugging Face, and Google into .env (following the .env.example format).
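Before running anything, it can help to check which provider keys are actually set. The sketch below is a hypothetical helper, not part of this repo. The variable names in `EXPECTED_KEYS` are assumptions based on common provider conventions, so check .env.example for the names the scripts really read. Since only a subset of providers is needed, a missing key is informational rather than an error.

```python
import os

# Assumed variable names (not taken from this repo's .env.example).
EXPECTED_KEYS = [
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "TOGETHER_API_KEY",
    "HF_TOKEN",
    "GOOGLE_API_KEY",
]


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env, expected=EXPECTED_KEYS):
    """Return the expected names that are absent or empty in env."""
    return [name for name in expected if not env.get(name)]


def load_env_file(path=".env"):
    """Read a .env file if it exists; return an empty dict otherwise."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        return parse_env(handle.read())


if __name__ == "__main__":
    # Values already exported in the shell take precedence over the file.
    env = {**load_env_file(), **os.environ}
    print("Missing keys:", missing_keys(env) or "none")
```

The parser only handles plain KEY=VALUE lines. For anything more elaborate (multi-line values, variable interpolation), a dedicated dotenv library would be the safer choice.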


nwinter/ultimate-jailbreaking-championship