This project (1) evaluates two advanced reasoning methods, Graph of Thought (GoT) and ReWOO, against naive baselines and (2) conducts an error analysis to improve those methods on a real-world task. Both parts use the TravelPlanner benchmark, which requires generating travel itineraries that satisfy complex user constraints such as budgets and logical sequencing. By comparing these methods with traditional techniques such as Chain of Thought (CoT) and Tree of Thought (ToT), we analyze their effectiveness on real-world reasoning tasks and identify where LLMs still have room to improve at sophisticated reasoning.
Run the following command to install all required libraries (conda virtual environment recommended):
pip3 install openai tqdm geopy langchain pandas torch datasets requests graph_of_thoughts
- Download the database from this link.
- Extract the contents into the TravelPlanner directory, i.e. YourPathToTravelPlanner.
Export your OpenAI API key as an environment variable, replacing "Your API Key" with your actual API key:
export OPENAI_API_KEY="Your API Key"
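Before starting a long run, it can help to confirm that the key is actually visible to Python. The following is a minimal sketch, not code from this repository; the helper name `get_openai_api_key` is hypothetical.

```python
import os


def get_openai_api_key(env=None):
    """Return the API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    # Reject both a missing key and the unmodified placeholder from the README.
    if not key or key == "Your API Key":
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running main.py"
        )
    return key
```

Calling `get_openai_api_key()` in the same shell session where the `export` was run should return the key; a `RuntimeError` means the variable did not reach the process.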
Execute the main.py file with the desired options:
python3 main.py --llm <LLM_MODEL_NAME> --strategy <STRATEGY> --batch_size <BATCH_SIZE> --output_dir <OUTPUT_DIRECTORY> --is_debug <DEBUG_MODE>
- --llm: The language model to use (default: "gpt-4o-mini").
- --strategy: Reasoning strategy to apply. Options include:
  - vanilla: Basic LLM reasoning.
  - few_shot_llm: Few-shot prompting with training examples.
  - got: Graph of Thought reasoning.
  - rewoo: ReWOO reasoning.
  - got_advanced: Advanced GoT reasoning.
  - rewoo_advanced: Advanced ReWOO reasoning.
- --batch_size: Number of queries processed in one batch (default: 2).
- --output_dir: Directory to save results (default: ./res).
- --is_debug: Debug mode. Set to True for a quick test run or False for full evaluation (default: True).
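The options above could be declared with `argparse` roughly as follows. This is a sketch under stated assumptions, not the parser from main.py: the default for `--strategy` is not given above, so `vanilla` is assumed, and `str2bool` is a hypothetical helper. The helper is there because `--is_debug` takes the text `True` or `False`, and `argparse` with `type=bool` would treat the string `"False"` as truthy.

```python
import argparse

STRATEGIES = [
    "vanilla", "few_shot_llm", "got", "rewoo", "got_advanced", "rewoo_advanced",
]


def str2bool(value):
    """Convert 'True'/'False' text to a bool (hypothetical helper)."""
    text = str(value).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of the main.py options")
    parser.add_argument("--llm", default="gpt-4o-mini")
    # The default strategy is an assumption; the README does not state it.
    parser.add_argument("--strategy", choices=STRATEGIES, default="vanilla")
    parser.add_argument("--batch_size", type=int, default=2)
    parser.add_argument("--output_dir", default="./res")
    parser.add_argument("--is_debug", type=str2bool, default=True)
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--strategy", "rewoo", "--is_debug", "False"])
print(args.strategy, args.is_debug)  # prints: rewoo False
```

Restricting `--strategy` with `choices` makes a misspelled strategy fail immediately instead of partway through a batch.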
- Run with Default Settings:
python3 main.py
- Run with Few-Shot Reasoning:
python3 main.py --strategy few_shot_llm --batch_size 4 --output_dir ./output
- Run with ReWOO and Debug Mode Off (Full Evaluation):
python3 main.py --strategy rewoo --is_debug False
- Predictions: Saved to <OUTPUT_DIRECTORY>/generated_predictions-<STRATEGY>.json.
- Postprocessed Plans: Saved to <OUTPUT_DIRECTORY>/generated_predictions-<STRATEGY>-postprocess.json.
- Evaluation Results: Saved to <OUTPUT_DIRECTORY>/generated_predictions-<STRATEGY>-result.json.
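The naming pattern above can be reproduced in a few lines to locate the files for a given run. This is a sketch only: the function name `output_paths` is hypothetical, and nothing is assumed about the contents of the JSON files.

```python
from pathlib import Path


def output_paths(output_dir, strategy):
    """Return the three expected output files for one strategy."""
    base = Path(output_dir)
    stem = f"generated_predictions-{strategy}"
    return {
        "predictions": base / f"{stem}.json",
        "postprocessed": base / f"{stem}-postprocess.json",
        "results": base / f"{stem}-result.json",
    }


# Report which files exist for a ReWOO run saved to the default directory.
for name, path in output_paths("./res", "rewoo").items():
    status = "found" if path.exists() else "missing"
    print(f"{name}: {path} ({status})")
```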
Method | Delivery Rate | Commonsense Constraint (Micro) | Commonsense Constraint (Macro) | Hard Constraint (Micro) | Hard Constraint (Macro) | Final Pass |
---|---|---|---|---|---|---|
Vanilla LLMs | 100.00% | 60.83% | 0.56% | 0.23% | 0.00% | 0.00% |
Few Shot LLMs | 100.00% | 65.83% | 2.78% | 3.33% | 1.67% | 0.00% |
ReWOO | 100.00% | 70.28% | 6.67% | 5.00% | 1.67% | 1.11% |
GoT | 100.00% | 73.33% | 6.67% | 5.48% | 1.67% | 1.11% |
Advanced ReWOO | 100.00% | 72.50% | 6.67% | 5.95% | 3.89% | 1.67% |
Advanced GoT | 100.00% | 74.58% | 6.67% | 7.38% | 4.44% | 1.67% |
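The gap between the micro and macro columns follows from how the two rates are aggregated. The sketch below assumes the definitions commonly used for the TravelPlanner benchmark: micro is the share of individual constraint checks that pass, and macro is the share of plans that pass every one of their checks. It is not this repository's evaluation code, and the input data is made up for illustration.

```python
def pass_rates(plans):
    """Compute (micro, macro) pass rates.

    `plans` is a list with one entry per plan; each entry is a list of
    booleans, one per constraint check applied to that plan.
    """
    total = sum(len(checks) for checks in plans)
    passed = sum(sum(checks) for checks in plans)
    micro = passed / total if total else 0.0
    # Note: all([]) is True, so a plan with no checks counts as passing.
    macro = sum(all(checks) for checks in plans) / len(plans) if plans else 0.0
    return micro, macro


# Toy data: three plans, ten checks in total, seven of which pass.
toy = [
    [True, True, False],
    [True, True, True],
    [False, True, False, True],
]
micro, macro = pass_rates(toy)
print(f"micro={micro:.2%} macro={macro:.2%}")
```

In the toy data most individual checks pass, yet only one of the three plans passes all of its checks, which is the same pattern as the high micro and low macro rates in the table.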
python3 ablation.py --strategy <STRATEGY> --dir_path <PATH_TO_RESULT_DIR> --dir_path2 <PATH_TO_RESULT_DIR2> --is_comp <COMPARISON_MODE>
- --strategy: Reasoning strategy to apply. Options include:
  - vanilla: Basic LLM reasoning.
  - few_shot_llm: Few-shot prompting with training examples.
  - got: Graph of Thought reasoning.
  - rewoo: ReWOO reasoning.
  - got_advanced: Advanced GoT reasoning.
  - rewoo_advanced: Advanced ReWOO reasoning.
- --dir_path: Directory containing the result files to analyze.
- --dir_path2: Second directory of result files to analyze (required only in comparison mode).
- --is_comp: Boolean flag indicating whether to compare the results in dir_path and dir_path2.
GoT Results
Figure 1: Comparison between GoT and Advanced GoT on commonsense constraints
Figure 2: Comparison between GoT and Advanced GoT on hard constraints
ReWOO Results
Figure 3: Comparison between ReWOO and Advanced ReWOO on commonsense constraints
Figure 4: Comparison between ReWOO and Advanced ReWOO on hard constraints
If you have any problems, please contact Wonjoon Choi and Wookje Han.