
Evaluation Metrics Misalignment with Paper Results #1

Open
juneyeeHu opened this issue Jan 15, 2025 · 4 comments

@juneyeeHu

I followed the provided instructions to train the SegSwap model on the processed EgoExo4D dataset. However, the evaluation metrics I obtained (as shown in the image below) do not align with the results reported in the paper.

[image: screenshot of the obtained evaluation metrics]

I processed the dataset as described in the documentation, trained the SegSwap model using the provided training pipeline and hyperparameters, and then evaluated it with the evaluation script. :(

Could there be any additional preprocessing steps not mentioned in the documentation? Is there a possibility that the provided codebase has been updated since the paper was published?

Could you help verify if there are any undocumented steps or issues in the training/evaluation pipeline that might cause the observed discrepancy?

Thank you for your support!

@sanjayharesh
Collaborator

Hi @juneyeeHu,

I re-ran the public code (barring the process_data.py script, as I did not have enough bandwidth) and was able to reproduce the numbers in the paper. However, I ran into a size mismatch error between the predicted and GT masks in the ego->exo evaluation. I'm not sure how you handled it, but we post-process the masks during evaluation by removing the padding so they match the size of the GT masks, and this step was missing from evaluate_egoexo.py. I have pushed an update that should hopefully resolve the issue. I hope to clean up the eval code/instructions a bit more in the coming few days, but this should work for now.
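For reference, a minimal sketch of the kind of un-padding step described above; the function name, the pad size of 480, and the assumption that padding was applied on the bottom/right are mine, and the actual logic lives in the updated evaluate_egoexo.py:

import torch
import torch.nn.functional as F

def unpad_and_resize(pred_mask, gt_h, gt_w, pad_size=480):
    # pred_mask: (pad_size, pad_size) prediction made on the padded square input.
    # The input was resized so its longer side equals pad_size and (assumed)
    # padded on the bottom/right to a square, so crop that region back out.
    scale = pad_size / max(gt_h, gt_w)
    resized_h, resized_w = round(gt_h * scale), round(gt_w * scale)
    cropped = pred_mask[:resized_h, :resized_w]
    # Resize back to the ground-truth resolution before computing metrics.
    restored = F.interpolate(
        cropped[None, None].float(), size=(gt_h, gt_w), mode="nearest"
    )[0, 0]
    return restored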

Leaving a few additional pointers below which might help you debug the setup on your end if you run into further issues:

  1. Make sure you download the latest version (v2) of the EgoExo4D data.
  2. The numbers in the paper are generated on the test set. Make sure you evaluate on the right set.
  3. Make sure all the takes in split.json are downloaded correctly.
  4. After pre-processing and generating pairs, make sure there are ~442k training pairs for ego->exo and ~610k pairs for exo->ego (a quick way to check the counts is sketched below).
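For point 4, a quick sanity check on the pair counts might look like the following; the file layout and names here are assumptions, so adjust them to whatever create_pairs.py actually writes out:

import json
from pathlib import Path

pairs_root = Path("data/pairs")  # assumed location of the generated pairs
for direction in ["ego2exo", "exo2ego"]:
    pair_file = pairs_root / f"{direction}_train.json"  # hypothetical file name
    with open(pair_file) as f:
        pairs = json.load(f)
    print(f"{direction}: {len(pairs)} training pairs")
# Expect roughly 442k pairs for ego->exo and 610k for exo->ego.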

Hope this helps.
-Sanjay

@juneyeeHu
Author

Hi @sanjayharesh ,

Thank you for your response!

I have downloaded the latest version of the EgoExo4D dataset. Since the test set annotations were not publicly available, I conducted my evaluation on the validation set.

Since the code for generating the ground truth (GT) JSON file for evaluation is not publicly available, I referred to the structure provided in the evaluation README to generate the GT JSON file accordingly. I also noticed that the model uses a resolution of 480x480 for both training and evaluation. Given that the original annotation masks are approximately 3000x4000, I resized them to match the required resolution before training.

During training, I encountered an error, which I resolved by adding a condition in Train.py to handle cases where no positive samples exist:

NEG_IDX = batch['negative'].cuda()       # mask marking negative pairs in the batch
POS_IDX = torch.logical_not(NEG_IDX)     # positive pairs are the complement
if not torch.any(POS_IDX):               # skip batches with no positive samples
    continue
optimizer.zero_grad()

I would appreciate any insights or recommendations regarding these modifications. Thanks again for your help!

Best,
Yijun

@sanjayharesh
Collaborator

Hi Yijun,

I see. Sorry for the confusion. The resolution for the exo->ego evaluation is indeed 480x480, but for the ego->exo case we wanted to keep the aspect ratio of the masks consistent with the original, so we resize such that the longer side is 480 (iirc, most masks are horizontal and end up at 270x480). During training we pad the images to 480x480, and during evaluation we "un-pad" to match the original size. So instead of resizing to 480x480, resizing so that the longer side is 480 while keeping the aspect ratio consistent should solve the issue you have been facing. Again, this only applies to the ego->exo case in the SegSwap baseline; exo->ego does not require this, as the ego images/masks are square anyway.
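To make that concrete, here is a small sketch of the resize-then-pad step for the ego->exo masks; the function name and the bottom/right padding are assumptions, and the repo's own preprocessing may differ in detail:

import torch
import torch.nn.functional as F

def resize_keep_aspect_and_pad(mask, target=480):
    # mask: (H, W) tensor, e.g. an original ~3000x4000 annotation mask.
    h, w = mask.shape
    scale = target / max(h, w)                      # longer side becomes 480
    new_h, new_w = round(h * scale), round(w * scale)
    resized = F.interpolate(
        mask[None, None].float(), size=(new_h, new_w), mode="nearest"
    )[0, 0]
    # Pad (assumed bottom/right) to a 480x480 square for training.
    padded = F.pad(resized, (0, target - new_w, 0, target - new_h))
    return padded, (new_h, new_w)                   # keep sizes to "un-pad" at eval

At evaluation time, the (new_h, new_w) region is cropped back out and resized to the original resolution, as in the un-padding sketch earlier in the thread.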

Also, in the last update we released the process_annotation script to generate the ground-truth json file, so please try running the eval again (including the eval_segswap.py script) with the latest update.

As for the second error, I have never seen it before in either the ego->exo or exo->ego case. Are you running with a much smaller batch size? If it happens rarely enough, skipping a few batches should not hurt training much.

Hope the explanation helps with your issue.
-Sanjay

@juneyeeHu
Author

Hi @sanjayharesh,

Thank you for your response and for updating the code!

I would like to clarify a few points regarding the ego → exo case. Specifically, I want to confirm whether the training resolution remains at (480, 480) and whether the original aspect ratio is only preserved during the evaluation stage. Additionally, should the operations in the XMem baseline be aligned with those in the SegSwap baseline? In other words, does the XMem implementation in the ego → exo case require corresponding modifications to maintain consistency?

Furthermore, I noticed that in create_pairs.py, the code does not verify whether annotations exist in both views for each frame involved. However, the XMem baseline takes the intersection of annotated frames from both views. Could you share the reasoning behind this difference? Would it be beneficial to apply the same strategy here?
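For context, the intersection strategy I am referring to would look roughly like this (the dict-based layout and names are illustrative, not the actual create_pairs.py code):

def frames_with_both_views(ego_annotations, exo_annotations):
    # ego_annotations / exo_annotations: hypothetical dicts mapping frame
    # index -> mask path for a single take, one per view.
    common = sorted(set(ego_annotations) & set(exo_annotations))
    # Build pairs only for frames annotated in both views,
    # mirroring the XMem baseline's intersection of annotated frames.
    return [(ego_annotations[f], exo_annotations[f]) for f in common]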

I would greatly appreciate any insights or recommendations regarding these modifications. Thanks again for your help!

Best,
Yijun
