make changelog manual_dispatch, fix docker CI
Signed-off-by: Jack Luar <[email protected]>
Showing 3 changed files with 2 additions and 62 deletions.
This file was deleted.
9999a82
===================================
==> Dataset: EDA Corpus
==> Running tests for agent-retriever
/home/luarss/actions-runner/_work/ORAssistant/ORAssistant/evaluation/.venv/lib/python3.12/site-packages/deepeval/__init__.py:49: UserWarning: You are using deepeval version 1.4.9, however version 1.5.0 is available. You should consider upgrading via the "pip install --upgrade deepeval" command.
warnings.warn(
Fetching 2 files: 100%|██████████| 2/2 [00:00<00:00, 7.72it/s]
Evaluating: 100%|██████████| 100/100 [26:25<00:00, 15.86s/it]
✨ You're running DeepEval's latest Contextual Precision Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Contextual Recall Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Hallucination Metric! (using
gemini-1.5-pro-002, strict=False, async_mode=True)...
Evaluating 100 test case(s) in parallel: |██████████|100% (100/100) [Time Taken: 00:37, 2.65test case/s]
✓ Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results
on Confident AI.
‼️ Friendly reminder 😇: You can also run evaluations with ALL of deepeval's
metrics directly on Confident AI instead.
Average Metric Scores:
Contextual Precision 0.7635952380952381
Contextual Recall 0.8677380952380952
Hallucination 0.5268152958152958
Metric Passrates:
Contextual Precision 0.73
Contextual Recall 0.82
Hallucination 0.64
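
For readers comparing the two blocks above: the average score and the pass rate of a metric can differ because the pass rate counts test cases that clear a threshold, while the average is taken over the raw scores. The following is a minimal sketch of that arithmetic, not the evaluation script used in this run. The function name `summarize`, the 0.5 threshold, the `lower_is_better` flag, and the sample scores are all assumptions made for illustration; the flag reflects the assumption that a hallucination-style score passes when it is at or below the threshold.

```python
from statistics import mean


def summarize(scores, threshold=0.5, lower_is_better=False):
    """Return (average score, pass rate) for one metric.

    Hypothetical helper: the threshold and the direction of the
    comparison are assumptions, not values taken from this CI run.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if lower_is_better:
        # Assumed convention for hallucination-style metrics.
        passed = [s <= threshold for s in scores]
    else:
        passed = [s >= threshold for s in scores]
    return mean(scores), sum(passed) / len(scores)


# Made-up per-test-case scores, for illustration only.
precision_avg, precision_rate = summarize([1.0, 0.5, 0.25, 1.0])
halluc_avg, halluc_rate = summarize([0.0, 1.0, 0.5, 0.0], lower_is_better=True)
print(precision_avg, precision_rate)
print(halluc_avg, halluc_rate)
```

With a helper like this, a metric whose scores cluster near 0 and 1 can show a pass rate well above or below its average, which is one possible reading of the Hallucination figures reported here.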