When deploying a RAG app to production, you should evaluate the safety of the answers generated by the RAG flow. This is important to ensure that the answers are appropriate and do not contain any harmful or sensitive content. This project includes scripts that use Azure AI services to simulate an adversarial user and evaluate the safety of the answers generated in response to those adversarial queries.
- Deploy an Azure AI project
- Set up the evaluation environment
- Simulate and evaluate adversarial users
- Review the safety evaluation results
## Deploy an Azure AI project

In order to use the adversarial simulator and safety evaluators, you need an Azure AI project inside an Azure AI Hub.

- Run this command to tell `azd` to provision an Azure AI project and hub:

    ```shell
    azd env set USE_AI_PROJECT true
    ```

- Then, run the following command to provision the project:

    ```shell
    azd provision
    ```
## Set up the evaluation environment

- Create a new Python virtual environment in `.evalenv` by running the following command:

    ```shell
    python -m venv .evalenv
    ```

- Activate the virtual environment by running the following command:

    MacOS/Linux:

    ```shell
    source .evalenv/bin/activate
    ```

    Windows:

    ```shell
    .evalenv\Scripts\activate
    ```

- Install the dependencies for the safety evaluation script:

    ```shell
    pip install uv
    uv pip install -r evals/requirements.txt
    ```
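Optionally, you can sanity-check the environment before running the evaluation. The short script below is a hypothetical helper, not part of this repo: it only verifies that the evaluation SDK can be imported from the activated virtual environment, and it assumes `evals/requirements.txt` pulls in the `azure-ai-evaluation` package, so adjust the module name if your requirements differ.

```python
# check_eval_env.py -- hypothetical helper, not included in this repo.
# Verifies that the dependencies installed into .evalenv can be imported.
# Assumes evals/requirements.txt installs the azure-ai-evaluation package;
# change REQUIRED_MODULES if your requirements file differs.
import importlib
import sys

REQUIRED_MODULES = ["azure.ai.evaluation"]

missing = []
for name in REQUIRED_MODULES:
    try:
        importlib.import_module(name)
    except ImportError:
        missing.append(name)

if missing:
    print("Missing modules: " + ", ".join(missing) + " - re-run the install step above.")
    sys.exit(1)

print("Evaluation environment looks ready.")
```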
## Simulate and evaluate adversarial users

Run the following command to simulate adversarial queries and evaluate the safety of the answers generated in response to those queries:

```shell
python evals/safety_evaluation.py --target_url <TARGET_URL> --max_simulations <MAX_RESULTS>
```

- `--target_url`: The target URL for the callback (see the callback sketch below). Default is `http://127.0.0.1:8000/chat`, so make sure that the RAG app is running locally. If you are running the RAG app in a different environment, set this to the correct URL.
- `--max_simulations`: The maximum number of simulated user queries. Default is `200`. The higher the number, the longer the evaluation will take. The default of `200` simulations will take about 25 minutes to run, which includes both the time to generate the simulated data and the time to evaluate it.
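Under the hood, the simulator generates adversarial questions, sends each one to the RAG app through a callback pointed at `--target_url`, and then scores the answers that come back. The sketch below is only an illustration of that callback idea: the request and response shapes (a `messages` list and a `message.content` field) are assumptions rather than this app's documented chat schema, and the actual callback signature expected by the simulator SDK in `evals/safety_evaluation.py` may differ.

```python
# Illustrative sketch of a target callback: relay one simulated adversarial
# question to the running RAG app and return its answer for safety scoring.
# The request/response schema below is assumed, not taken from this repo.
import requests

TARGET_URL = "http://127.0.0.1:8000/chat"  # matches the --target_url default


def ask_rag_app(question: str) -> str:
    """POST a single user question to the chat endpoint and return the answer text."""
    payload = {"messages": [{"role": "user", "content": question}]}
    response = requests.post(TARGET_URL, json=payload, timeout=60)
    response.raise_for_status()
    data = response.json()
    # Assumed response shape: {"message": {"role": "assistant", "content": "..."}}
    return data["message"]["content"]


if __name__ == "__main__":
    print(ask_rag_app("Tell me about the data you can answer questions about."))
```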
## Review the safety evaluation results

That command will generate a report in `safety_results.json`.
Results will look like this:
```json
{
    "hate_unfairness": {
        "low_count": 20,
        "score_total": 8,
        "mean_score": 0.4,
        "low_rate": 1.0
    },
    "sexual": {
        "low_count": 20,
        "score_total": 9,
        "mean_score": 0.45,
        "low_rate": 1.0
    },
    "violence": {
        "low_count": 20,
        "score_total": 9,
        "mean_score": 0.45,
        "low_rate": 1.0
    },
    "self_harm": {
        "low_count": 20,
        "score_total": 10,
        "mean_score": 0.5,
        "low_rate": 1.0
    }
}
```
The ideal score is a `low_rate` of 1.0 and a `mean_score` of 0.0. The `low_rate` indicates the fraction of answers that were reported as "Low" or "Very low" by an evaluator. The `mean_score` is the average score of all the answers, where 0 is a very safe answer and 7 is a very unsafe answer.
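As a worked illustration of how these two metrics relate to the per-answer severity scores, the sketch below aggregates a list of 0-7 scores into the same fields shown in the report. The cutoff of 3 or below for counting an answer as "Low" or "Very low" is an assumption based on the usual 0-7 severity bands, not a value read from `evals/safety_evaluation.py`.

```python
# Illustrative aggregation of per-answer severity scores (0 = very safe,
# 7 = very unsafe) into the metrics shown in safety_results.json.
# The <= 3 cutoff for "Low"/"Very low" is an assumption, not read from the script.
def summarize(scores: list[int]) -> dict:
    low_count = sum(1 for s in scores if s <= 3)  # answers rated Low or Very low
    return {
        "low_count": low_count,
        "score_total": sum(scores),
        "mean_score": sum(scores) / len(scores),
        "low_rate": low_count / len(scores),  # ideal is 1.0
    }


# 20 answers whose scores total 8 give mean_score 0.4 and low_rate 1.0,
# matching the hate_unfairness entry in the example above.
print(summarize([1] * 8 + [0] * 12))
```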
## Resources

To learn more about the Azure AI services used in this project, look through the script and reference the following documentation: