# Evaluating RAG answer safety

When deploying a RAG app to production, you should evaluate the safety of the answers generated by the RAG flow, to ensure that they are appropriate and do not contain harmful or sensitive content. This project includes scripts that use Azure AI services to simulate an adversarial user and evaluate the safety of the answers generated in response to those adversarial queries.
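For orientation, here is a minimal, hedged sketch of that simulate-then-evaluate pattern, assuming the `azure-ai-evaluation` SDK. It is not the project's script; `evals/safety_evaluation.py` is the authoritative implementation, and exact parameter names and message shapes may differ between SDK versions.

```python
# Minimal sketch of the simulate-then-evaluate pattern, NOT the project's script.
# Assumes the azure-ai-evaluation SDK and a provisioned Azure AI project; exact
# parameter names and message shapes may differ between SDK versions.
import asyncio

from azure.ai.evaluation import ViolenceEvaluator
from azure.ai.evaluation.simulator import AdversarialScenario, AdversarialSimulator
from azure.identity import DefaultAzureCredential

azure_ai_project = {
    "subscription_id": "<SUBSCRIPTION_ID>",       # placeholders for your project
    "resource_group_name": "<RESOURCE_GROUP>",
    "project_name": "<AI_PROJECT_NAME>",
}
credential = DefaultAzureCredential()


async def callback(messages, stream=False, session_state=None, context=None):
    """Hypothetical target: answer the latest simulated user message via the RAG app."""
    conversation = messages["messages"]
    query = conversation[-1]["content"]
    answer = "..."  # call the RAG app here (e.g. POST the query to its /chat endpoint)
    conversation.append({"role": "assistant", "content": answer})
    return {"messages": conversation, "stream": stream, "session_state": session_state}


async def main():
    # 1) Simulate adversarial user queries against the target callback.
    simulator = AdversarialSimulator(azure_ai_project=azure_ai_project, credential=credential)
    outputs = await simulator(
        scenario=AdversarialScenario.ADVERSARIAL_QA,
        target=callback,
        max_simulation_results=5,
    )
    # 2) Evaluate each simulated Q/A pair with a safety evaluator (violence shown here).
    violence_eval = ViolenceEvaluator(credential=credential, azure_ai_project=azure_ai_project)
    for output in outputs:
        query = output["messages"][0]["content"]
        response = output["messages"][-1]["content"]
        print(violence_eval(query=query, response=response))


asyncio.run(main())
```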

## Deploy an Azure AI project

To use the adversarial simulator and safety evaluators, you need an Azure AI project inside an Azure AI Hub.

1. Run this command to tell `azd` to provision an Azure AI project and hub:

    ```shell
    azd env set USE_AI_PROJECT true
    ```

2. Then, run the following command to provision the project:

    ```shell
    azd provision
    ```

## Set up the evaluation environment

1. Create a new Python virtual environment in `.evalenv` by running the following command:

    ```shell
    python -m venv .evalenv
    ```

2. Activate the virtual environment by running the following command:

    MacOS/Linux:

    ```shell
    source .evalenv/bin/activate
    ```

    Windows:

    ```shell
    .evalenv\Scripts\activate
    ```

3. Install the dependencies for the safety evaluation script:

    ```shell
    pip install uv
    uv pip install -r evals/requirements.txt
    ```

## Simulate and evaluate adversarial users

Run the following command to simulate adversarial queries and evaluate the safety of the answers generated in response to those queries:

```shell
python evals/safety_evaluation.py --target_url <TARGET_URL> --max_simulations <MAX_RESULTS>
```

* `--target_url`: The target URL for the callback. The default is `http://127.0.0.1:8000/chat`, so make sure the RAG app is running locally. If the RAG app is running in a different environment, set this to the correct URL. A rough sketch of the kind of callback involved follows this list.
* `--max_simulations`: The maximum number of simulated user queries. The default is 200. The higher the number, the longer the evaluation takes: the default of 200 simulations takes about 25 minutes to run, which includes both the time to generate the simulated data and the time to evaluate it.
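For context, forwarding a simulated query to that target URL amounts to something like the sketch below. The request and response JSON shapes shown here are assumptions for illustration; the real callback lives in `evals/safety_evaluation.py`.

```python
# Rough illustration only: forwarding one simulated user query to the running
# RAG app. The request/response JSON shapes are assumptions, not the project's
# exact chat protocol.
import requests

TARGET_URL = "http://127.0.0.1:8000/chat"  # matches the --target_url default


def ask_rag_app(query: str) -> str:
    """Send one user message to the RAG app and return the generated answer."""
    response = requests.post(
        TARGET_URL,
        json={"messages": [{"role": "user", "content": query}]},
        timeout=60,
    )
    response.raise_for_status()
    # Assumed response shape: {"message": {"content": "..."}}
    return response.json()["message"]["content"]


if __name__ == "__main__":
    print(ask_rag_app("What products do you sell?"))
```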

## Review the safety evaluation results

That command will generate a report in `safety_results.json`. Results will look like this:

```json
{
  "hate_unfairness": {
    "low_count": 20,
    "score_total": 8,
    "mean_score": 0.4,
    "low_rate": 1.0
  },
  "sexual": {
    "low_count": 20,
    "score_total": 9,
    "mean_score": 0.45,
    "low_rate": 1.0
  },
  "violence": {
    "low_count": 20,
    "score_total": 9,
    "mean_score": 0.45,
    "low_rate": 1.0
  },
  "self_harm": {
    "low_count": 20,
    "score_total": 10,
    "mean_score": 0.5,
    "low_rate": 1.0
  }
}
```

The ideal result is a `low_rate` of 1.0 and a `mean_score` of 0.0. The `low_rate` is the fraction of answers that an evaluator rated as "Low" or "Very low" severity. The `mean_score` is the average score across all of the answers, where 0 is a very safe answer and 7 is a very unsafe answer.
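To make the aggregation concrete, the sketch below shows how `low_count`, `score_total`, `mean_score`, and `low_rate` can be derived from hypothetical per-answer severity scores, assuming the standard 0-7 scale where scores of 3 or below fall into the "Low" or "Very low" buckets:

```python
# Sketch of the aggregation behind safety_results.json, using hypothetical
# per-answer severity scores (0 = very safe, 7 = very unsafe). Assumes scores
# of 3 or below count as "Low"/"Very low".
scores = [0, 1, 0, 2, 1]  # hypothetical scores for 5 answers in one category

low_count = sum(1 for s in scores if s <= 3)   # answers rated Low/Very low
score_total = sum(scores)
mean_score = score_total / len(scores)
low_rate = low_count / len(scores)

print({"low_count": low_count, "score_total": score_total,
       "mean_score": mean_score, "low_rate": low_rate})
# -> {'low_count': 5, 'score_total': 4, 'mean_score': 0.8, 'low_rate': 1.0}
```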

## Resources

To learn more about the Azure AI services used in this project, look through the script and reference the following documentation: