Panel of BIRD Annotation Issues #39
Should the db_id
There is a typo in this SQL (`AS REL` is presumably meant to be `AS REAL`):
SELECT CAST(SUM(CASE WHEN T2.DESCRIPTION = 'Allergy to grass pollen' THEN 1 ELSE 0 END) AS REL) * 100 / COUNT(T1.patient) FROM patients AS T1 INNER JOIN allergies AS T2 ON T1.patient = T2.PATIENT WHERE T1.birthplace = 'Pembroke MA US'
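Presumably `AS REL` was meant to be `AS REAL`: the cast is what forces floating-point division so the percentage is not truncated. A minimal sketch on a toy two-table schema (table contents invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE patients (patient TEXT, birthplace TEXT)")
cur.execute("CREATE TABLE allergies (PATIENT TEXT, DESCRIPTION TEXT)")
cur.executemany("INSERT INTO patients VALUES (?, ?)",
                [("p1", "Pembroke MA US"), ("p2", "Pembroke MA US")])
cur.executemany("INSERT INTO allergies VALUES (?, ?)",
                [("p1", "Allergy to grass pollen"), ("p2", "Other")])

# With AS REAL (instead of the typo AS REL) the numerator becomes a float,
# so 1 * 100 / 2 evaluates to 50.0 rather than being integer-truncated.
pct = cur.execute("""
    SELECT CAST(SUM(CASE WHEN T2.DESCRIPTION = 'Allergy to grass pollen'
                         THEN 1 ELSE 0 END) AS REAL) * 100 / COUNT(T1.patient)
    FROM patients AS T1
    INNER JOIN allergies AS T2 ON T1.patient = T2.PATIENT
    WHERE T1.birthplace = 'Pembroke MA US'
""").fetchone()[0]
print(pct)  # 50.0
```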
These two queries also cannot be parsed by
SELECT T1.first_name, T1.last_name FROM current AS T1 INNER JOIN `current-terms` AS T2 ON T1.bioguide_id = T2.bioguide WHERE T2.party = 'Republican' AND T1.gender_bio = 'F' AND T2.END > DATE() GROUP BY T1.bioguide_id
SELECT max_temperature_f, date FROM weather WHERE max_temperature_f = (SELECT MAX(max_temperature_f) FROM weather WHERE max_temperature_f IS NOT NULL AND max_temperature_f IS NOT '')
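The parser name is cut off above, but the likely culprits are the zero-argument `DATE()` call in the first query and the non-standard `IS NOT ''` comparison in the second; `DATE('now')` and `<> ''` are more portable spellings. A sketch of the second query rewritten that way, on a toy weather table (data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE weather (max_temperature_f INTEGER, date TEXT)")
cur.executemany("INSERT INTO weather VALUES (?, ?)",
                [(74, "2013-08-29"), (None, "2013-08-30"), (80, "2013-08-31")])

# "IS NOT ''" replaced with the standard "<> ''"; NULLs are still excluded
row = cur.execute("""
    SELECT max_temperature_f, date FROM weather
    WHERE max_temperature_f = (SELECT MAX(max_temperature_f) FROM weather
                               WHERE max_temperature_f IS NOT NULL
                                 AND max_temperature_f <> '')
""").fetchone()
print(row)  # (80, '2013-08-31')
```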
Incorrect SQLite database for
I have also found that a corrupted database exists, for example coinmarketcap.sqlite
Incorrect evidence in dev set. In california_schools/frpm.csv, eligible free rate = Free Meal Count / Enrollment, and the column "Percent (%) Eligible Free (K-12)" is indeed "Free Meal Count (K-12)" / "Enrollment (K-12)". However, the very first test example in the dev set contains errors. Original data: db_id: california_schools. Two errors: SELECT MAX(Percent (%) Eligible Free (K-12)) or SELECT. There are probably quite a few more problems like this.
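One of the errors in `SELECT MAX(Percent (%) Eligible Free (K-12))` is that a column name containing spaces and parentheses is used unquoted, which SQLite rejects. A small sketch (toy one-column table) showing that quoting the identifier fixes it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute('CREATE TABLE frpm ("Percent (%) Eligible Free (K-12)" REAL)')
cur.executemany("INSERT INTO frpm VALUES (?)", [(0.5,), (0.9,)])

# Unquoted, the column name is a syntax error:
try:
    cur.execute("SELECT MAX(Percent (%) Eligible Free (K-12)) FROM frpm")
except sqlite3.OperationalError as e:
    print("unquoted version fails:", e)

# Quoted with double quotes (backticks also work in SQLite), it runs:
top = cur.execute(
    'SELECT MAX("Percent (%) Eligible Free (K-12)") FROM frpm'
).fetchone()[0]
print(top)  # 0.9
```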
@tshu-w @wbbeyourself @ronch99 @22842219 The latest update includes the following improvements:
As a result of these enhancements, the ChatGPT (gpt-3.5-turbo) and GPT-4 (gpt-4-32k) EX scores have improved to 42.24 (from 37.22) and 49.15 (from 46.35), respectively. We will update the leaderboard with all new results shortly. Please note that the test set remains unchanged at this time, since we already devoted considerably more effort to error examination and human evaluation during its construction, which ensures its quality. BTW, our paper has been accepted by NeurIPS 2023 as a Spotlight. Thanks for your support and suggestions; we will keep optimizing our work!
In the train dataset, I think this example is wrong.
Should be "SQL": "SELECT movie_title, movie_release_year FROM movies ORDER BY LENGTH(movie_title) DESC LIMIT 1"
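Assuming the question asks for the movie with the longest title, the suggested SQL can be sanity-checked on a toy movies table (data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE movies (movie_title TEXT, movie_release_year INTEGER)")
cur.executemany("INSERT INTO movies VALUES (?, ?)",
                [("Up", 2009), ("The Shawshank Redemption", 1994), ("Heat", 1995)])

# ORDER BY LENGTH(movie_title) picks the longest title, not the one that
# sorts last alphabetically.
row = cur.execute("""
    SELECT movie_title, movie_release_year FROM movies
    ORDER BY LENGTH(movie_title) DESC LIMIT 1
""").fetchone()
print(row)  # ('The Shawshank Redemption', 1994)
```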
Question 24 in the dev dataset might be missing its evidence.
https://arxiv.org/abs/2402.12243
@tshu-w @wbbeyourself @josem7 @ronch99 Of course, we will thoroughly consider your findings and make the necessary corrections during this review. After that, we will also review the training set when our expert students and professors are available. We sincerely appreciate your support and suggestions throughout this process. It is a challenging and time-consuming task to meticulously go through all examples and ensure their quality, especially since our data is more complex. We aim to release this version around late March or early April. Thanks for your understanding.
@tshu-w, thank you for sharing this insightful paper. We highly value research that assesses the quality of benchmarks. A similar paper from EMNLP (https://arxiv.org/pdf/2310.18538.pdf) also provides a comprehensive examination of Spider, BIRD, and Spider-DK. These papers are crucial for advancing our field. The paper (https://arxiv.org/abs/2402.12243) raises an important research problem: how to develop robust models that can deal with noise in a benchmark. However, some details are unclear, making it difficult for us to fully understand the findings. For instance, the authors mention random sampling of examples and a 49% error rate in the california_schools set; upon careful review, we found errors in only 11 out of 89 examples, which we plan to refine. We believe that multiple random samplings, as suggested in this paper (https://arxiv.org/pdf/2305.14210.pdf), would provide more reliable insights, given the large standard deviation of a single random sample; otherwise the sampling may be biased.
Even so, we sincerely appreciate their highlighting of the join-conditions issue in Example 3, an area (FK -> PK) we paid less attention to during checking. We will pay more attention to it in this review round. In conclusion, we have learned a lot from these benchmark-judgement papers and understand what we need to focus on next. We welcome more such papers and feedback to advance this field together. We also appreciate your support and will continue to refine our work until real AGIs for databases are born. Thanks.
Updated (one more inconsistency was found): From dev.json, "question_id": 99. Q: "Among the accounts who have loan validity more than 12 months, list out the accounts that have the highest approved amount and have account opening date in 1993." A (SQL): I think the join should be on the "account" table rather than "disp", and the condition should be T1.duration > 12. Suggested correction: Thanks! nina
"question_id": 95 has an incorrect ground truth SQL query. Q: "List out the account numbers of clients who are youngest and have highest average salary?", A (SQL): Multiple accounts are linked with the district that has the highest A11. Suggested correction to output all accounts with the highest average salary:
From dev.json. Thanks! nina
Incorrect description of database "financial". File: llm/data/dev_databases/financial/database_description/district.csv Suggested corrections: A4,number of inhabitants,,integer,
"question_id": 118 has an incorrect ground truth SQL query. Q: "For loan amount less than USD100,000, what is the percentage of accounts that is still running with no issue.", E (evidence): "Status = 'C' stands for running contract, ok so far; Percentage of accounts by condition = [(total(amount) & condition) / (total amount)] * 100%.", A (SQL): There are a few issues with this SQL. According to the evidence, we need to divide the total amount of running-contract accounts by the total amount ("[(total(amount) & condition) / (total amount)] * 100%."). However, the SQL incorrectly
Suggested correction (discussion): From dev.json
"question_id": 141 has an incorrect ground truth SQL query. Q: "Which districts have transactions greater than USS$10,000 in 1997?" An issue with the SQL: the question asks about all transactions that satisfy the conditions, but the SQL computes the total amount of transactions per district (GROUP BY T1.district_id HAVING SUM(T3.amount) > 10000). Suggested correction: From dev.json
"question_id": 143 has an incorrect ground truth SQL query. Q: "What are the accounts that have both gold and junior credit cards?" Issue with the SQL: the question asks about accounts that have both gold and junior credit cards. Suggested correction (one of many):
From dev.json
"question_id": 145 has an incorrect ground truth SQL query. Q: "Who are the account holder identification numbers whose spent per month on the credit card is less than the average, in 1998?" Issue with the SQL: the question asks about accounts whose spend per month on the credit card is less than the average in 1998. Suggested correction (discussion): From dev.json
"question_id": 162 has an incorrect ground truth SQL query. Q: "What is the region of the client with the id 3541 from?" Issue with the SQL: the question asks about regions only. Suggested correction: From dev.json
"question_id": 163 has a possibly incorrect ground truth SQL query. Q: "Which district has the most accounts with loan contracts finished with no problems?" Issue with the SQL: the question asks about a single district that satisfies a condition ("Which district"). Suggested correction: From dev.json
"question_id": 180 has an incorrect ground truth SQL query. Q: "Please provide a list of clients who were born between 1983 and 1987 and whose account branch is in East Bohemia, along with their IDs." Issue with the SQL: if one runs the ground truth SQL against the database, you get the following (client_id, account_id) pairs.
However, client 379 has only one account, "311", so only one pair is correct and the others are incorrect. Suggested correction: we need to ensure that only valid (client_id, account_id) pairs are joined.
From dev.json
"question_id": 181 has an incorrect ground truth SQL query. Q: "Please provide the IDs of the 3 female clients with the largest loans." Issue with the SQL: conceptually, the same issue as in 180. First, the SQL query finds clients that have accounts using Suggested correction: the same as in 180, using the table disp to find valid client-account associations. From dev.json
"question_id": 186 has an incorrect ground truth SQL query. Q: "What percentage of male clients request for weekly statements to be issued?" Issues with SQL: From dev.json
"question_id": 189 has an incorrect ground truth SQL query. Issues with SQL: From dev.json
"question_id": 192 has an incorrect ground truth SQL query. Q: "What is the average amount of loan which are still on running contract with statement issuance after each transaction?" Issues with SQL: Suggested correction:
From dev.json
"question_id": 129 has an incorrect ground truth SQL query. "question": "Which are the top ten withdrawals (non-credit card) by district names for the month of January 1996?" Issues with SQL: Issue 1 Issue 2 Suggested correction: From dev.json
Hi @huybery and @accpatrick, Thank you for developing and maintaining BIRD! I went over many of the question/SQL pairs for the
There is an M:M relationship between
In many questions, we need to find information about clients and their accounts. To answer these questions correctly, we have to join
Please let me know what you think. Thanks! nina
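The point about the M:M relationship can be illustrated with a toy sketch of a financial-style schema (simplified, invented data): joining clients to accounts only through a shared district fabricates pairs, while joining through disp keeps only the real associations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE client  (client_id INTEGER, district_id INTEGER);
    CREATE TABLE account (account_id INTEGER, district_id INTEGER);
    -- disp holds the valid client <-> account associations
    CREATE TABLE disp    (client_id INTEGER, account_id INTEGER);
    INSERT INTO client  VALUES (1, 10), (2, 10);
    INSERT INTO account VALUES (100, 10), (200, 10);
    INSERT INTO disp    VALUES (1, 100), (2, 200);
""")

# Joining client and account on district_id alone fabricates pairs:
wrong = cur.execute("""
    SELECT c.client_id, a.account_id
    FROM client c JOIN account a ON c.district_id = a.district_id
""").fetchall()
print(len(wrong))  # 4 pairs, 2 of them invalid

# Joining through disp keeps only the real client-account pairs:
right = cur.execute("""
    SELECT c.client_id, a.account_id
    FROM client c
    JOIN disp d ON c.client_id = d.client_id
    JOIN account a ON d.account_id = a.account_id
""").fetchall()
print(sorted(right))  # [(1, 100), (2, 200)]
```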
Hi BIRD team, I just wanted to reach out and say a huge thank you for all the amazing work you've put into creating and maintaining this standard dataset. I recently built a solution and sampled 30 questions from the dev set to evaluate it. As I went through the failing questions, I noticed 6 incorrect SQL queries that I'm pretty confident about. That's a bit higher than I expected, to be honest (around 20% or more, though the significance is limited by the small sample size).
Just to be sure, I double-checked that I'm using the latest data and compared the SQL in
@nnarodytska @freiz Thank you sincerely for your insightful feedback! We are processing it, and your feedback is very valuable; we hope a higher-quality version will be available soon. Thanks, and stay tuned!
In the training data example: {
"db_id": "address",
"question": "How many states are in the central time zone? Write their full names.",
"evidence": "central time zone refers to time_zone = 'Central'",
"SQL": "SELECT SUM(CASE WHEN T1.time_zone = 'Central' THEN 1 ELSE 0 END) AS count FROM zip_data AS T1 INNER JOIN state AS T2 ON T2.abbreviation = T1.state WHERE T1.time_zone = 'Central'",
}
Using
I think the correct SQL should be:
SELECT COUNT(DISTINCT T2.name)
FROM zip_data AS T1
INNER JOIN state AS T2 ON T1.state = T2.abbreviation
WHERE T1.time_zone = 'Central'
There is a lot of use of
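A quick sanity check of the two queries on a toy version of this schema (invented data) shows why the original gold SQL counts zip codes rather than states:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE zip_data (state TEXT, time_zone TEXT);
    CREATE TABLE state    (abbreviation TEXT, name TEXT);
    -- two Central-zone zip codes in the same state, one Eastern elsewhere
    INSERT INTO zip_data VALUES ('IL', 'Central'), ('IL', 'Central'), ('NY', 'Eastern');
    INSERT INTO state VALUES ('IL', 'Illinois'), ('NY', 'New York');
""")

# Original gold SQL: sums over zip_data rows, not states
orig = cur.execute("""
    SELECT SUM(CASE WHEN T1.time_zone = 'Central' THEN 1 ELSE 0 END)
    FROM zip_data AS T1 INNER JOIN state AS T2 ON T2.abbreviation = T1.state
    WHERE T1.time_zone = 'Central'
""").fetchone()[0]
print(orig)  # 2: the single Central state is counted once per zip code

# Corrected: count distinct state names
fixed = cur.execute("""
    SELECT COUNT(DISTINCT T2.name)
    FROM zip_data AS T1 INNER JOIN state AS T2 ON T1.state = T2.abbreviation
    WHERE T1.time_zone = 'Central'
""").fetchone()[0]
print(fixed)  # 1
```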
Hi, BIRD Team. Thank you for developing and maintaining BIRD! While thoroughly reviewing the dev set of the BIRD benchmark, we have identified a Foreign Key Integrity problem within the database: records that do not satisfy the foreign key constraints.
SELECT sname FROM satscores WHERE cname = 'Contra Costa' AND sname IS NOT NULL ORDER BY NumTstTakr DESC LIMIT 1 -- Gold
SELECT T2.sname FROM schools AS T1 INNER JOIN satscores AS T2 ON T1.CDSCode = T2.cds WHERE T1.County = 'Contra Costa' ORDER BY T2.NumTstTakr DESC LIMIT 1 -- Pred
Given that
SELECT cds FROM satscores WHERE cname = 'Contra Costa' AND sname IS NOT NULL ORDER BY NumTstTakr DESC LIMIT 1 -- Modified from Gold
SELECT T1.CDSCode FROM schools AS T1 INNER JOIN satscores AS T2 ON T1.`CDSCode` = T2.`cds` -- Modified from Pred
-- Gold's cds not in Pred's T1.CDSCode!
Consequently, the Pred query cannot yield the same result as the Gold query. Under normal circumstances, where foreign key integrity holds, the semantics of these two SQL queries should be the same. The Foreign Key Integrity problem can be easily identified by running
Although SQLite3 disables foreign key constraints by default, MySQL and other database systems enable them automatically. We would also like to know whether there are errors related to this when migrating to those database systems (like your recent work 'mini-dev'), and how this problem can be solved. Thank you.
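One way to surface such violations in SQLite is `PRAGMA foreign_key_check`, which reports rows that violate declared foreign keys even while enforcement is off. A sketch on a toy version of the schools/satscores schema (invented data) with one orphan row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# The foreign key is declared but NOT enforced (enforcement is off by default)
cur.executescript("""
    CREATE TABLE schools (CDSCode TEXT PRIMARY KEY);
    CREATE TABLE satscores (cds TEXT REFERENCES schools(CDSCode),
                            cname TEXT, sname TEXT, NumTstTakr INTEGER);
    INSERT INTO schools VALUES ('001');
    -- '999' has no matching school: an orphan satscores row
    INSERT INTO satscores VALUES ('001', 'Contra Costa', 'A', 10),
                                 ('999', 'Contra Costa', 'B', 99);
""")

# Each violation row: (table, rowid, referenced table, fk index)
violations = cur.execute("PRAGMA foreign_key_check").fetchall()
print(violations)  # exactly one violation, for the orphan satscores row
```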
Hi BIRD team - big thanks for working on this benchmark! Our team recently did a detailed audit of the minidev dataset (500 questions) and found 91 errors (~18.27%) in the gold SQL, as well as a further 70 questions (~14%) where we felt that additional corrections were necessary to avoid ambiguities or inconsistencies between the question and the SQL. (Also: question IDs 137 and 138 both appear twice, so there are only 498 unique questions in the minidev dataset.)
We are happy to share our corrected version of the dataset with the community in this spreadsheet: https://docs.google.com/spreadsheets/d/1IGm9Otruey60ujUnl8AOkepY3qgWHdFJHnX7hQGUeCw/edit
As well as a https://drive.google.com/file/d/1iWlYVknwK5wGli5lnwg4stvNzMogjhwj/view
We have also annotated each correction with both a detailed reason for the correction and a more fine-grained classification of the correction type and SQL error type. The overall summary stats are available in the spreadsheet.
In total, we made corrections to 161/498 (32.33%) of the questions. In general, we tried to make the smallest modification to the question and SQL that makes the result unambiguous and correct, although some questions had to be redesigned slightly to have meaningful results with the data provided (for example, some questions refer to dates that do not exist in the DB). The most common classes of errors were the following:
As a reference, our 'Arcwise AI' agent submission (still awaiting final results on the test set) achieves about 87% accuracy on the corrected minidev set, vs. only ~65% on the original (uncorrected) set. We hope this is helpful for others trying to evaluate results on the dev data; we found that the high rate of errors/ambiguities made it very difficult to accurately assess the performance of AI models. Please feel free to contact us directly or comment in the spreadsheet with any questions.
Hi BIRD team,
Thanks sincerely for your careful check and feedback. Your suggestions are well received and have been analyzed by us. We will consider them and incorporate the important components into our next generation of the benchmark. Your contributions will no doubt be acknowledged. Thanks again!
Hi, BIRD Team. Thank you for developing and maintaining BIRD! While thoroughly reviewing the dev set of the BIRD benchmark, we have identified several annotation issues. We recorded each question_id and why it is incorrect in this spreadsheet:
Hope this is helpful for maintaining and improving the quality of BIRD!
Thanks a million, it is our pleasure to receive such invaluable judgement! We will analyze it deeply and enhance our evaluation. If we choose to include these as part of our new benchmarks, we will acknowledge you. Thanks!
Hi all,
Although BIRD has incurred significant annotation costs, we still cannot guarantee that all the data is accurately labeled. We hope that the community can assist us in building BIRD together! You can continuously report any errors you find under this issue, and we will perform a dataset update at a designated time. Thanks a lot!
Best,
Binyuan