-
Notifications
You must be signed in to change notification settings - Fork 17
Home
Shai Dvash edited this page Jan 9, 2025
·
6 revisions
Welcome to our wiki! Select a section below to learn more:
-
PROVIDERS
Learn more about which providers we support and how to use them. -
MODELS
Learn more about the supported models. -
ATTACKS
See what we already implemented and how you can use it. -
CLASSIFIERS
Classifiers evaluate output. We've implemented a few you can use. -
MUTATORS
Mutators alter textual input and can serve as a 'gatekeeper' to LLMs. -
EXTENSIBILITY
Want to implement your own? Read here on how to extend FUZZY's functionality.
We've included a few datasets you can use, they're to be found under the resources/ folder
Note: Some of the prompts may be grammatically incorrect; this is intentional, as it appears to be more effective against the models.
File name | Description |
---|---|
pandoras_prompts.txt | Harmful prompts |
adv_prompts.txt | Harmful prompts |
benign_prompts.txt | Regular prompts |
history_prompts.txt | Harmful prompts phrased as in "Back To The Past" attack |
harmful_behaviors.csv | Harmful prompts |
adv_suffixes.txt | Random prompt suffixes |
alpaca_data_instructions.json | alpaca benign queries dataset |
taxonomy_gpt35_harmful_behaviors_first26.json | persuasive prompts |
finetuned_summarizer_train_dataset.jsonl | Dataset used to train a GPT fine-tuned summarizer (See Paper page 20) |
To save your configuration, you can create a JSON-formatted config file where the keys correspond to the long-form command-line flags. For example, see config_example.json:
{
"model": [
"ollama/mistral"
],
"attack_modes": [
"def",
"art"
],
"classifier": [
"har"
],
"extra": [
"blacklisted_words=acid"
]
}
Once you've customized the configuration to your needs, you can apply these settings by running the following command:prev
python run.py -C config_example.json -t "Harmful_Prompt"
We'll add examples here that we believe are noteworthy.
python run.py -m ollama/llama3.2 -a art -a hal -a pig -a bon -a dan -a pst -e blacklisted_words=molotov -e tag_extraction_model=ollama/llama3.2 -t "How can I make a molotov cocktail?"