Frontend updates (#92)
* add frontend into docker-compose
* RAG_VERSION now defaults to git commit hash

* add CHANGELOG ci

* add more spaces to avoid merge conflict

* make changelog manual_dispatch, fix docker CI

---------

Signed-off-by: Jack Luar <[email protected]>
luarss authored Nov 11, 2024
1 parent 260bba5 commit 7090704
Showing 9 changed files with 141 additions and 11 deletions.
36 changes: 36 additions & 0 deletions .github/workflows/changelog_report.py
@@ -0,0 +1,36 @@
if __name__ == "__main__":
    with open("../../CHANGELOG.md") as f:
        temp = f.readlines()

    # split each line into its four tab-separated fields
    commit, name, date, msg = [], [], [], []

    # each line is formatted as <commit>\t<name>\t<date>\t<msg>
    # (see the `changelog` target in the Makefile below)
    for line in temp:
        fields = line.split("\t", 3)  # cap the split so tabs in the subject survive
        commit.append(fields[0])
        name.append(fields[1])
        date.append(fields[2])
        msg.append(fields[3])

    # first detect the unique year-month combinations
    date_year_month = [x[:7] for x in date]
    unique = sorted(set(date_year_month), reverse=True)

    # then write the sections in reverse chronological order
    final_lines = []

    for year_month in unique:
        # first write the section header
        final_lines.append(f"# {year_month}\n\n")

        # collect every commit that falls in this month
        for idx in range(len(date)):
            if date[idx][:7] == year_month:
                entry = f"- {commit[idx]} {name[idx]} {date[idx]} {msg[idx]}"
                final_lines.append(entry)
        final_lines.append("\n")

    with open("../../CHANGELOG.md", "w") as f:
        for entry in final_lines:
            f.write(entry)
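
For reference, here is a minimal sketch of the grouping logic above run on hypothetical input (the hashes, authors, and subjects below are made up; the tab-separated format matches the `changelog` Makefile target further down, where `%x09` is the tab character):

```python
# Hypothetical CHANGELOG.md contents, as produced by
# `git log --pretty=format:"%h%x09%an%x09%ad%x09%s" --date=short`
sample = [
    "abc1234\tAlice\t2024-11-05\tfix docker CI\n",
    "def5678\tBob\t2024-10-21\tadd frontend\n",
]

commit, name, date, msg = [], [], [], []
for line in sample:
    fields = line.split("\t", 3)
    commit.append(fields[0])
    name.append(fields[1])
    date.append(fields[2])
    msg.append(fields[3])

# group by year-month, newest first: ["2024-11", "2024-10"]
for ym in sorted({d[:7] for d in date}, reverse=True):
    print(f"# {ym}\n")
    for i, d in enumerate(date):
        if d[:7] == ym:
            # msg already ends with "\n", so suppress print's own newline
            print(f"- {commit[i]} {name[i]} {d} {msg[i]}", end="")
    print()
```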
41 changes: 41 additions & 0 deletions .github/workflows/ci-changelog.yaml
@@ -0,0 +1,41 @@
name: Update Changelog

on:
  workflow_dispatch:

jobs:
  updateChangeLog:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Create Changelog file
        run: |
          make changelog
      - name: Create Pull Request
        id: cpr
        uses: peter-evans/create-pull-request@v6
        with:
          token: ${{ secrets.GH_PAT }}
          commit-message: Update report
          committer: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
          author: ${{ github.actor }} <${{ github.actor_id }}+${{ github.actor }}@users.noreply.github.com>
          signoff: true
          base: master
          branch: update-chglog
          delete-branch: true
          title: Changelog Update
          body: |
            - Auto-generated by [create-pull-request][1]
            [1]: https://github.com/peter-evans/create-pull-request
          labels: |
            docs
          assignees: luarss
          reviewers: luarss
4 changes: 2 additions & 2 deletions .github/workflows/ci.yaml
@@ -34,7 +34,7 @@ jobs:
          cp ${{ secrets.PATH_TO_GOOGLE_APPLICATION_CREDENTIALS }} evaluation/auto_evaluation/src
      - name: Build Docker image
        run: |
-          make docker
+          make docker-up
          sleep 900 # TODO: Remove this once the docker-compose healthcheck timeout is fixed.
      - name: Run LLM CI
        working-directory: evaluation
@@ -48,4 +48,4 @@ jobs:
      - name: Teardown
        if: always()
        run: |
-          docker compose down --remove-orphans
+          make docker-down
9 changes: 8 additions & 1 deletion Makefile
@@ -16,5 +16,12 @@ check:
	@. ./backend/.venv/bin/activate && \
	pre-commit run --all-files

-docker:
+docker-up:
	@docker compose up --build --wait

+docker-down:
+	@docker compose down --remove-orphans
+
+changelog:
+	@git log --pretty=format:"%h%x09%an%x09%ad%x09%s" --date=short --since="2024-06-01" > CHANGELOG.md
+	@cd .github/workflows && python changelog_report.py
9 changes: 9 additions & 0 deletions docker-compose.yml
@@ -13,6 +13,15 @@ services:
#      timeout: 10s
#      retries: 5
#      start_period: 30s # TODO: make sure this healthcheck starts only after the backend API is ready.

+  frontend:
+    build:
+      context: ./frontend
+    container_name: "orassistant-frontend"
+    ports:
+      - "8501:8501"
+    networks:
+      - orassistant-network

#  health-checker:
#    build: ./common
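
Since the compose healthcheck is still commented out (hence the `sleep 900` in the CI workflow above), a simple readiness probe is one way to tell when the new frontend service is actually up. A minimal sketch, assuming Streamlit's built-in health endpoint (`/_stcore/health`) and the `8501:8501` port mapping above:

```python
import urllib.request


def frontend_is_healthy(url: str = "http://localhost:8501/_stcore/health") -> bool:
    """Return True once the Streamlit frontend container answers its health endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # connection refused, DNS failure, timeout, etc.
        return False


if __name__ == "__main__":
    print("frontend healthy:", frontend_is_healthy())
```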
18 changes: 18 additions & 0 deletions frontend/Dockerfile
@@ -0,0 +1,18 @@
FROM python:3.12.3-slim

WORKDIR /ORAssistant-frontend

COPY ./requirements.txt /ORAssistant-frontend/requirements.txt
COPY ./requirements-test.txt /ORAssistant-frontend/requirements-test.txt
COPY ./pyproject.toml /ORAssistant-frontend/pyproject.toml

RUN pip install --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt && \
    pip install --no-cache-dir -r requirements-test.txt && \
    pip install --no-cache-dir -e .

COPY streamlit_app.py .
COPY ./utils ./utils
COPY ./assets ./assets

CMD ["streamlit", "run", "streamlit_app.py"]
7 changes: 4 additions & 3 deletions frontend/README.md
@@ -1,6 +1,6 @@
-# FrontEnd For Streamlit
+# Frontend For Streamlit

-This Folder contains the frontend code for the OR Assistant using Streamlit. Follow the instructions below to set up the environment, run the application, and perform testing using a mock API.
+This folder contains the frontend code for the OR Assistant using Streamlit. Follow the instructions below to set up the environment, run the application, and perform testing using a mock API.

## Preparing the Environment Variables

@@ -39,6 +39,7 @@ To collect feedback, you need to set up a Google Sheet and configure the necessary
```
4. **Set the Current Version for Feedback Evaluation:**
   - Add the current version of the feedback evaluation to the environment variables.
+   - If unset, this defaults to the commit hash of the checked-out `HEAD` (typically `master` in deployment).
```plaintext
RAG_VERSION=<current-version>
@@ -73,4 +74,4 @@ This will start a mock API server that simulates responses for testing purposes.

## License

-This project is licensed under the GNU GENERAL PUBLIC LICENSE - see the [LICENSE](../../LICENSE) file for details.
+This project is licensed under the GNU GENERAL PUBLIC LICENSE - see the [LICENSE](../LICENSE) file for details.
2 changes: 1 addition & 1 deletion frontend/streamlit_app.py
@@ -92,7 +92,7 @@ def main() -> None:
    base_url, endpoints = fetch_endpoints()

    selected_endpoint = st.selectbox(
-        "Select preferred architecture",
+        "Select preferred endpoint",
        options=endpoints,
        index=0,
        format_func=lambda x: x.split("/")[-1].capitalize(),
26 changes: 22 additions & 4 deletions frontend/utils/feedback.py
@@ -89,9 +89,6 @@ def submit_feedback_to_google_sheet(
            "The FEEDBACK_SHEET_ID environment variable is not set or is empty."
        )

-    if not os.getenv("RAG_VERSION"):
-        raise ValueError("The RAG_VERSION environment variable is not set or is empty.")
-
    service_account_file = os.getenv("GOOGLE_CREDENTIALS_JSON")
    scope = [
        "https://spreadsheets.google.com/feeds",
@@ -186,10 +183,31 @@ def show_feedback_form(
            sources=sources,
            context=context,
            issue=feedback,
-            version=os.getenv("RAG_VERSION", "N/A"),
+            version=os.getenv("RAG_VERSION", get_git_commit_hash()),
        )

        st.session_state.submitted = True

    if st.session_state.submitted:
        st.sidebar.success("Thank you for your feedback!")


+def get_git_commit_hash() -> str:
+    """
+    Get the latest commit hash from the Git repository.
+
+    Returns:
+        str: The latest commit hash.
+    """
+    import subprocess
+
+    try:
+        commit_hash = (
+            subprocess.check_output(["git", "rev-parse", "HEAD"])
+            .strip()
+            .decode("utf-8")
+        )
+    except subprocess.CalledProcessError:
+        commit_hash = "N/A"
+
+    return commit_hash
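
One caveat worth flagging in `get_git_commit_hash()`: `subprocess.check_output` raises `FileNotFoundError` when `git` is missing entirely (plausible inside the slim frontend image above, which installs no git and copies no `.git` directory), and that case is not covered by the `CalledProcessError` handler. Also, `os.getenv` evaluates its default argument eagerly, so the subprocess runs even when `RAG_VERSION` is set. A more defensive sketch (an editorial suggestion, not part of this commit):

```python
import os
import subprocess


def get_git_commit_hash() -> str:
    """Return the current HEAD commit hash, or "N/A" if it cannot be determined."""
    try:
        return (
            subprocess.check_output(
                ["git", "rev-parse", "HEAD"],
                stderr=subprocess.DEVNULL,  # silence "not a git repository" noise
            )
            .strip()
            .decode("utf-8")
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # not a git checkout, or git is not installed in the container
        return "N/A"


# evaluate the fallback lazily, only when RAG_VERSION is unset or empty
version = os.getenv("RAG_VERSION") or get_git_commit_hash()
```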

1 comment on commit 7090704

@luarss (Collaborator, Author) commented on 7090704 on Nov 11, 2024

===================================
==> Dataset: EDA Corpus
==> Running tests for agent-retriever
/home/luarss/actions-runner/_work/ORAssistant/ORAssistant/evaluation/.venv/lib/python3.12/site-packages/deepeval/__init__.py:49: UserWarning: You are using deepeval version 1.4.9, however version 1.5.0 is available. You should consider upgrading via the "pip install --upgrade deepeval" command.
warnings.warn(

Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 8.56it/s]

Evaluating: 100%|██████████| 100/100 [18:49<00:00, 11.29s/it]
✨ You're running DeepEval's latest Contextual Precision Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Contextual Recall Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Hallucination Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...

Evaluating 100 test case(s) in parallel: |██████████|100% (100/100) [Time Taken: 00:35, 2.82test case/s]
✓ Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results
on Confident AI.
‼️ Friendly reminder 😇: You can also run evaluations with ALL of deepeval's
metrics directly on Confident AI instead.
Average Metric Scores:
Contextual Precision 0.7364126984126984
Contextual Recall 0.8805555555555555
Hallucination 0.5281958874458874
Metric Passrates:
Contextual Precision 0.71
Contextual Recall 0.82
Hallucination 0.57
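
For anyone scanning these numbers: each average is the mean per-test-case score for that metric, while each passrate is presumably the fraction of the 100 cases whose score clears the metric's threshold (for Hallucination, where lower is better, the comparison should be flipped). A rough sketch of that arithmetic with hypothetical scores and deepeval's default 0.5 threshold:

```python
# hypothetical per-test-case scores for one higher-is-better metric
scores = [0.9, 0.4, 0.8, 1.0, 0.6]
threshold = 0.5  # deepeval metrics default to 0.5 unless configured otherwise

average = sum(scores) / len(scores)                           # 0.74
passrate = sum(s >= threshold for s in scores) / len(scores)  # 0.80

print(f"average={average:.2f} passrate={passrate:.2f}")
```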
