From 5320b72dd72457c6174b4eb55ff87d2a49e9284c Mon Sep 17 00:00:00 2001
From: MBronars
Date: Sat, 25 Nov 2023 19:04:39 -0500
Subject: [PATCH 1/2] README with initial proposal for multi-task eval

---
 MultitaskREADME.md | 78 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)
 create mode 100644 MultitaskREADME.md

diff --git a/MultitaskREADME.md b/MultitaskREADME.md
new file mode 100644
index 00000000..c2ef4072
--- /dev/null
+++ b/MultitaskREADME.md
@@ -0,0 +1,78 @@
# API for Multi-Task Evaluation
---
We need methods for summarizing the performance of multi-task agents. The hope is to achieve accurate checkpoint selection with a limited computation budget, since evaluating on all tasks might take too long. To build this system out, we need to think about how it will be used.

## Current API for Evaluation:
``train_utils.rollout_with_stats(policy, envs, use_goals, horizon, num_episodes)``
* **policy** - trained multi-task policy
* **envs** - specified in config.experiment.additional_envs
* **use_goals** - if true, the goal is provided by env.get_goal()
* **horizon** and **num_episodes** are the same for all environments
* Evaluation takes a total of len(envs) * num_episodes * horizon timesteps

Other existing options: \
``run_trained_agent.py --agent --env --horizon ``\
validation loss (is there any other useful metric to report?)


## Proposed API:
Rollout with stats is largely unchanged, but we need a different goal, horizon, and num_episodes specified for each task. That is to say, len(envs) = len(goals) = len(horizons)...\

```train_utils.rollout_with_stats(policy, envs, goals, horizons, num_episodes)```
* **policy** - trained multi-task policy
* **envs** - the environmet used for task evaluation
* **goals** - the goal for each task
* **horizons** - rollout horizon for each task
* **num_episodes** - number of rollouts for each task

New function: \
``get_evaluation_tasks(dataset, eval_description, total_episodes)`` - returns (envs, goals, horizons, num_episodes), which should be fed into train_utils.rollout_with_stats (see the sketch after this list).
* **dataset** - the dataset used to train the policy; it defines the full set of tasks the model was trained on
* **eval_description** - this is the most ambiguous part of the proposal. Open to suggestions, but I think this is how users should specify their desired evaluation method.
    * **default** - run the main evaluation method that we propose in the paper. We still need a larger discussion about what this will actually be. If we are not in the business of task design, then this should select a representative subset of tasks from the training dataset. See Vaibhav's post in Slack that goes over a couple of different ways we could break these tasks into categories.
    * **random** - randomly select a subset of tasks from the training dataset
    * **all** - use all tasks from the training dataset
    * **custom** - the user explicitly specifies the tasks and their frequencies
    * **language description** - the user provides a natural-language description of what they want the agent to do, and we return appropriate evaluation tasks
* **total_episodes** - total number of evaluation episodes across all tasks
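Putting the two calls together, the intended flow looks roughly like this. This is a minimal sketch: ``get_evaluation_tasks`` does not exist yet, and the dataset path, policy loading, and episode budget shown are placeholders.

```python
# Sketch of the proposed evaluation flow; get_evaluation_tasks is proposed, not implemented,
# and all argument values below are placeholders.
from robomimic.utils import train_utils

policy = ...  # trained multi-task policy, e.g. loaded from a checkpoint elsewhere

# Select a budgeted set of evaluation tasks from the tasks seen during training.
envs, goals, horizons, num_episodes = train_utils.get_evaluation_tasks(
    dataset="datasets/multitask.hdf5",  # dataset the policy was trained on
    eval_description="default",         # or "random", "all", "custom", or a language description
    total_episodes=100,                 # total rollouts, split across the selected tasks
)

# The per-task lists all have the same length and are consumed directly by rollout_with_stats.
rollout_stats = train_utils.rollout_with_stats(
    policy, envs, goals, horizons, num_episodes
)
```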
## Current HDF5 Structure:
HDF5 API:
- File
  - data (.attrs includes env_args, same for all of the demos)
    - demo_0
    - demo_1

## Proposed HDF5 Structure:
I think the HDF5 structure need to be reformulated to emphasize tasks. Here is one proposed method. I believe a task can be uniquely described given an environment and a goal. If we are going to break tasks into categories, those decisions should be made based on the information held in the .attrs fields.
HDF5 API:
- File
  - env_0 (.attrs includes env_args)
    - goal_0 (.attrs includes image, text, and low dim goal specification)
      - demo_0
      - demo_1
    - goal_1
      - demo_0
      - demo_1
  - env_1
  - env_2


From f0c964154fc8edceabdbbc66c192a4f8dee03fc4 Mon Sep 17 00:00:00 2001
From: MBronars
Date: Sat, 25 Nov 2023 19:09:42 -0500
Subject: [PATCH 2/2] small changes to MultitaskREADME

---
 MultitaskREADME.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/MultitaskREADME.md b/MultitaskREADME.md
index c2ef4072..e959d0c0 100644
--- a/MultitaskREADME.md
+++ b/MultitaskREADME.md
@@ -16,11 +16,11 @@
 validation loss (is there any other useful metric to report?)
 
 
 ## Proposed API:
-Rollout with stats is largely unchanged, but we need a different goal, horizon, and num_episodes specified for each task. That is to say, len(envs) = len(goals) = len(horizons)...\
+Rollout with stats is largely unchanged, but we need a different goal, horizon, and num_episodes specified for each task. That is to say, len(envs) = len(goals) = len(horizons)...
 
 ```train_utils.rollout_with_stats(policy, envs, goals, horizons, num_episodes)```
 * **policy** - trained multi-task policy
-* **envs** - the environmet used for task evaluation
+* **envs** - the environment used for task evaluation
 * **goals** - the goal for each task
 * **horizons** - rollout horizon for each task
 * **num_episodes** - number of rollouts for each task
@@ -43,7 +43,7 @@
     - demo_1
 
 ## Proposed HDF5 Structure:
-I think the HDF5 structure need to be reformulated to emphasize tasks. Here is one proposed method. I believe a task can be uniquely described given an environment and a goal. If we are going to break tasks into categories, those decisions should be made based on the information held in the .attrs fields.
+I think the HDF5 structure needs to be reformulated to emphasize tasks. Here is one proposed method. I believe a task can be uniquely described given an environment and a goal. If we are going to break tasks into categories, those decisions could be made based on the information held in the .attrs fields.
 HDF5 API:
 - File
   - env_0 (.attrs includes env_args)
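To make the proposed layout above concrete, here is a minimal h5py sketch of the env/goal/demo hierarchy. The group names, attribute names, and dataset contents are illustrative placeholders from this proposal, not an existing format.

```python
# Illustrative sketch of the proposed env/goal/demo layout (placeholder names and values).
import json

import h5py
import numpy as np

with h5py.File("multitask_dataset.hdf5", "w") as f:
    env_grp = f.create_group("env_0")
    env_grp.attrs["env_args"] = json.dumps({"env_name": "PickPlaceCan"})  # hypothetical env args

    goal_grp = env_grp.create_group("goal_0")
    goal_grp.attrs["text"] = "place the can in the bin"              # text goal
    goal_grp.attrs["image"] = np.zeros((84, 84, 3), dtype=np.uint8)  # image goal
    goal_grp.attrs["low_dim"] = np.zeros(7)                          # low-dim goal

    demo_grp = goal_grp.create_group("demo_0")
    demo_grp.create_dataset("actions", data=np.zeros((100, 7)))      # placeholder trajectory

# Since a task is an (env, goal) pair, enumerating tasks is a two-level walk:
with h5py.File("multitask_dataset.hdf5", "r") as f:
    for env_name, env_grp in f.items():
        for goal_name, goal_grp in env_grp.items():
            print(env_name, goal_name, len(goal_grp))  # number of demos per task
```

One nice property of keeping goal specifications in .attrs is that task enumeration and categorization only require reading attributes, never the demo data itself.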