make changelog manual_dispatch, fix docker CI
Signed-off-by: Jack Luar <[email protected]>
luarss committed Nov 10, 2024
1 parent 66e2892 commit 9999a82
Showing 3 changed files with 2 additions and 62 deletions.
2 changes: 0 additions & 2 deletions .github/workflows/ci-changelog.yaml
@@ -2,8 +2,6 @@ name: Update Changelog

 on:
   workflow_dispatch:
-  pull_request:
-    branches: [ "master" ] # Temporarily to test

 jobs:
   updateChangeLog:
4 changes: 2 additions & 2 deletions .github/workflows/ci.yaml
@@ -34,7 +34,7 @@ jobs:
           cp ${{ secrets.PATH_TO_GOOGLE_APPLICATION_CREDENTIALS }} evaluation/auto_evaluation/src
       - name: Build Docker image
         run: |
-          make docker
+          make docker-up
           sleep 900 # TODO: Remove this after docker-compose healthcheck timeout restored fixed.
       - name: Run LLM CI
         working-directory: evaluation
@@ -48,4 +48,4 @@ jobs:
       - name: Teardown
         if: always()
         run: |
-          docker compose down --remove-orphans
+          make docker-down
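
The `sleep 900` in the Build Docker image step is flagged as a TODO until the docker-compose healthcheck works again. As a stopgap, a bounded polling loop could return as soon as the service answers instead of always waiting the full 15 minutes. The following is only a sketch under assumptions: the function names, the health URL, and the polling interval are hypothetical and do not come from this repository.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if `url` answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # URLError and HTTPError are both subclasses of OSError
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 900.0,   # same upper bound as the current `sleep 900`
    interval_s: float = 5.0,    # hypothetical polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call `probe` until it succeeds or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example call (the endpoint is a hypothetical placeholder):
#   ok = wait_until_ready(lambda: http_probe("http://localhost:8000/healthcheck"))
```

Passing the probe in as a callable keeps the loop independent of how readiness is checked, so the same function could wrap an HTTP request or a container status check.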
58 changes: 0 additions & 58 deletions CHANGELOG.md

This file was deleted.

1 comment on commit 9999a82

luarss (Collaborator, Author) commented on 9999a82 Nov 10, 2024


===================================
==> Dataset: EDA Corpus
==> Running tests for agent-retriever
/home/luarss/actions-runner/_work/ORAssistant/ORAssistant/evaluation/.venv/lib/python3.12/site-packages/deepeval/__init__.py:49: UserWarning: You are using deepeval version 1.4.9, however version 1.5.0 is available. You should consider upgrading via the "pip install --upgrade deepeval" command.
warnings.warn(

Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 7.72it/s]

Evaluating: 100%|██████████| 100/100 [26:25<00:00, 15.86s/it]
✨ You're running DeepEval's latest Contextual Precision Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Contextual Recall Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Hallucination Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...

Evaluating 100 test case(s) in parallel: |██████████|100% (100/100) [Time Taken: 00:37, 2.65test case/s]
✓ Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results
on Confident AI.
‼️ Friendly reminder 😇: You can also run evaluations with ALL of deepeval's
metrics directly on Confident AI instead.
Average Metric Scores:
Contextual Precision 0.7635952380952381
Contextual Recall 0.8677380952380952
Hallucination 0.5268152958152958
Metric Passrates:
Contextual Precision 0.73
Contextual Recall 0.82
Hallucination 0.64
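
The log does not show how these two summaries are derived. One plausible reading is that the average score is the mean of the per-test-case scores and the passrate is the fraction of test cases that clear a threshold. The sketch below illustrates that reading only; the function name, the 0.5 threshold, and the `lower_is_better` flag (used for a metric such as Hallucination, where a lower score would count as a pass) are assumptions, not taken from the CI script.

```python
from statistics import mean
from typing import Sequence


def summarize(
    scores: Sequence[float],
    threshold: float = 0.5,          # assumed pass threshold
    lower_is_better: bool = False,   # assumed direction for hallucination-style metrics
) -> tuple[float, float]:
    """Return (average score, pass rate) for one metric across all test cases."""
    if lower_is_better:
        passed = [s <= threshold for s in scores]
    else:
        passed = [s >= threshold for s in scores]
    # sum() over booleans counts the True entries
    return mean(scores), sum(passed) / len(scores)


# Illustrative, made-up scores for three test cases:
avg, passrate = summarize([0.2, 0.6, 1.0])
print(f"average={avg:.2f} passrate={passrate:.2f}")
```

Under this reading, an average of about 0.76 with a passrate of 0.73 would mean that 73 of the 100 test cases cleared the threshold while the mean score across all 100 was about 0.76.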
