Successful reproduction of the experiments on APPS by pure GPT3.5 #9

Open
wants to merge 1 commit into
base: main
@wyt2000 wyt2000 commented Sep 17, 2023

Since OpenAI deprecated Codex, I tried to reproduce the APPS experiments from the Parsel paper using only GPT-3.5. Thanks to the code in the saycan branch, I was able to fully understand your evaluation method. After a lot of work modifying the prompts and Parsel itself, I reproduced part of the experiments described in Section 3.1 of the paper and even got better results: the GPT-3.5-only version of Parsel (8x16) solved 27 of 100 randomly sampled competition-level problems in APPS. I am offering the modified code for anyone who wants to use it in the future.
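For anyone re-running this, a 100-problem subset can be drawn reproducibly along the following lines. This is only a minimal sketch: it assumes problems are identified by integer indices and that the seed is fixed before shuffling. The names `sample_problem_ids` and `solve_rate`, the seed value, and the pool size of 1000 are hypothetical and are not taken from the modified code.

```python
import random


def sample_problem_ids(problem_ids, k=100, seed=0):
    """Return k problem IDs drawn reproducibly from problem_ids (hypothetical helper)."""
    ids = sorted(problem_ids)   # fix the starting order so only the seed decides the result
    rng = random.Random(seed)   # local RNG, seeded before the shuffle
    rng.shuffle(ids)
    return ids[:k]


def solve_rate(solved, attempted):
    """Fraction of attempted problems that were solved."""
    return solved / attempted


if __name__ == "__main__":
    # Assumed pool of 1000 candidate indices; the real pool size may differ.
    subset = sample_problem_ids(range(1000), k=100, seed=0)
    print(len(subset), solve_rate(27, 100))
```

Using a local `random.Random` instance rather than the global generator keeps the sample stable even if other code consumes random numbers first.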

fix: set num_completions and remove header.

fix: details of prompts.

fix: prompts.

config: modify .gitignor

fix: adjust prompts.

fix: max_tokens.

fix: no_tests.

fix: no_tests -> add asserts.

fix: finish auto gen tests.

fix: no_tests.

fix: allow test failed, modify codeT score part.

fix: prompts.

fix: prompts.

fix: adjust prompts.

feat: Add logit_bias to force gpt use implemented functions.

fix: delete useless code.

fix: logit_bias -> prompts.

fix: prompts.

feat: Add num_completions and save_path args.

feat: add __init__.py, make parsel like a package.

fix: generate_tests = True.

fix: found_successful_generation.

fix: grammar mistake in prompts.

fix: timeout exception, use single process.

fix: code transform to remove implemented functions.

fix: prompts.

fix: restore logit_bias.

fix: Allow nested functions overwrite implemented func

fix: sleep.

fix: remove generate_tests.

fix: adjust prompts.

fix: adjust prompts.

fix: timeout.

fix: num_completions.

fix: product sample.

fix: seed before shuffle.

fix: multiprocess.

fix: timeout.

feat: handle MLE.

fix: MemoryError.
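
Several of the commits above deal with timeouts and memory errors when running generated programs. One common way to guard against both is to run each candidate in a child process with a wall-clock timeout and an address-space cap. The sketch below assumes Linux; `run_candidate`, its status strings, and its default limits are hypothetical and are not the code in this PR.

```python
import resource
import subprocess
import sys


def run_candidate(code, stdin_text="", timeout_s=5.0, mem_bytes=512 * 1024 * 1024):
    """Run code in a child interpreter and return (status, stdout)."""

    def limit_memory():
        # Runs in the child just before exec: cap its address space.
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
            preexec_fn=limit_memory,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return "timeout", ""
    if proc.returncode != 0:
        # An uncaught MemoryError shows up in the child's traceback on stderr.
        status = "memory" if "MemoryError" in proc.stderr else "error"
        return status, proc.stdout
    return "ok", proc.stdout
```

Isolating each run in its own process means a runaway candidate cannot take down the evaluation loop, at the cost of interpreter start-up time per run.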
@PatrickHua

Hey Yutong, could you share the modifications related to evaluation as well? I'm trying to reproduce the results on APPS (27/100) according to your post.

@wyt2000
Author

wyt2000 commented May 30, 2024

> Hey Yutong, could you share the modifications related to evaluations as well? I'm trying to reproduce the results on apps (27/100) according to your post.

Sorry, it has been a long time and I have forgotten many details of the evaluation. See https://github.com/wyt2000/Automatic-ANPL/tree/apps for help.
