Successful reproduction of the experiments on APPS by pure GPT3.5 #9

wyt2000 · 2023-09-17T12:51:01Z

Since OpenAI deprecated Codex, I tried to reproduce the APPS experiments from the Parsel paper using only GPT-3.5. Thanks to the code in the saycan branch, I was able to fully understand your evaluation method. After a lot of work modifying the prompts and Parsel itself, I reproduced part of the experiments described in Section 3.1 of the paper and even got better results: the GPT-3.5-only version of Parsel (8x16) solved 27 of 100 randomly sampled competition-level problems in APPS. I am offering the modified code in case it is useful to others in the future.

fix: set num_completions and remove header.
fix: details of prompts.
fix: prompts.
config: modify .gitignor
fix: adjust prompts.
fix: max_tokens.
fix: no_tests.
fix: no_tests -> add asserts.
fix: finish auto gen tests.
fix: no_tests.
fix: allow test failed, modify codeT score part.
fix: prompts.
fix: prompts.
fix: adjust prompts.
feat: Add logit_bias to force gpt use implemented functions.
fix: delete useless code.
fix: logit_bias -> prompts.
fix: prompts.
feat: Add num_completions and save_path args.
feat: add __init__.py, make parsel like a package.
fix: generate_tests = True.
fix: found_successful_generation.
fix: grammar mistake in prompts.
fix: timeout exception, use single process.
fix: code transform to remove implemented functions.
fix: prompts.
fix: restore logit_bias.
fix: Allow nested functions overwrite implemented func
fix: sleep.
fix: remove generate_tests.
fix: adjust prompts.
fix: adjust prompts.
fix: timeout.
fix: num_completions.
fix: product sample.
fix: seed before shuffle.
fix: multiprocess.
fix: timeout.
feat: handle MLE.
fix: MemoryError.
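
For anyone setting up a comparable evaluation, the "27 of 100 randomly sampled problems" figure and the "seed before shuffle" commit suggest a seeded sample followed by a simple solve-rate count. The snippet below is only a minimal sketch of that idea, not the code from this PR: the function names, the seed value, and the pool of problem IDs are all assumptions made for illustration.

```python
import random


def sample_problems(problem_ids, k=100, seed=42):
    """Pick k problem IDs reproducibly: seed the RNG first, then shuffle.

    The seed value 42 is an arbitrary placeholder, not the one used in the PR.
    """
    rng = random.Random(seed)      # local RNG, leaves global random state alone
    shuffled = list(problem_ids)   # copy so the caller's sequence is not mutated
    rng.shuffle(shuffled)
    return shuffled[:k]


def solve_rate(results):
    """results maps problem ID -> True if some generated program passed all tests."""
    return sum(results.values()) / len(results)


# Hypothetical pool of competition-level problem IDs (placeholder values).
pool = range(1000)
sampled = sample_problems(pool)

# Illustrative outcome only: mark the first 27 sampled problems as solved.
results = {pid: (i < 27) for i, pid in enumerate(sampled)}
print(f"solved {sum(results.values())} of {len(results)}")
```

Seeding a local `random.Random` instance before shuffling means the same 100 problems are selected on every run, so results from different prompt variants stay comparable.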

PatrickHua · 2024-04-24T23:40:39Z

Hey Yutong, could you share the evaluation-related modifications as well? I'm trying to reproduce the APPS result (27/100) from your post.

wyt2000 · 2024-05-30T02:24:44Z

> Hey Yutong, could you share the evaluation-related modifications as well? I'm trying to reproduce the APPS result (27/100) from your post.

Sorry, it has been a long time and I have forgotten many details of the evaluation. See https://github.com/wyt2000/Automatic-ANPL/tree/apps for help.
