We're Not Using Videos Effectively:
An Updated Domain Adaptive Video Segmentation Baseline

TMLR 2024

Simar Kareer, Vivek Vijaykumar, Harsh Maheshwari, Prithvijit Chattopadhyay,
Judy Hoffman, Viraj Prabhu

Code | arXiv

Abstract

There has been abundant work in unsupervised domain adaptation for semantic segmentation seeking to adapt a model trained on images from a labeled source domain to an unlabeled target domain. While the vast majority of prior work has studied this as a frame-level ImageDA problem, a few VideoDA works have sought to additionally leverage the temporal signal present in adjacent frames. However, VideoDA works have historically studied a distinct set of benchmarks from ImageDA, with minimal cross-benchmarking. In this work, we address this gap. Surprisingly, we find that (1) even after carefully controlling for data and model architecture, modern ImageDA methods strongly outperform VideoDA methods on VideoDA benchmarks (+14.5 mIoU on Viper to Cityscapes-Seq, +19.0 mIoU on Synthia-Seq to Cityscapes-Seq!), and (2) naive combinations of ImageDA and VideoDA techniques do not lead to consistent performance improvements. To avoid siloed progress between ImageDA and VideoDA, we open-source our codebase with support for a comprehensive set of VideoDA and ImageDA methods on a common benchmark. Code available at this link.

Key Results

We study the task of sim-to-real domain adaptation for semantic segmentation in the setting where both the source and target domains contain sequential video frames. Specifically, we consider two common benchmarks: Viper to Cityscapes-Seq and Synthia-Seq to Cityscapes-Seq. First, we compare the efficacy of VideoDA methods against state-of-the-art ImageDA methods. ImageDA methods drastically outperform VideoDA methods, even when controlling for architecture and data!

Overview: Recent domain adaptive video segmentation methods do not compare against state-of-the-art baselines for ImageDA. We perform the first such cross-benchmarking, and find that even after controlling for data and model architecture, ImageDA methods massively outperform VideoDA methods.


Next, we explore hybrid UDA strategies that combine techniques from both lines of work. To do so, we construct methods that toggle each of the following key techniques from the VideoDA literature: consistent mixup + ACCEL, pseudo-label refinement, and video discriminators.


Simplified Training Pipeline: Temporally separated frames are first augmented and then passed through a model h to produce source and target predictions, from which we compute supervised and adaptation losses. On top of this standard self-training pipeline, VideoDA methods typically add one or more of the following techniques: consistent mixup, ACCEL, pseudo-label refinement, and video discriminators.
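To make this concrete, here is a minimal, hypothetical sketch of one such self-training step in PyTorch. The student h, its EMA teacher, the strong_aug transform, and the confidence threshold are illustrative assumptions rather than the exact HRDA/MIC implementation; the VideoDA additions listed above would plug into this loop.

```python
import torch
import torch.nn.functional as F


def self_training_step(h, teacher, src_frames, src_labels, tgt_frames, strong_aug):
    # One step of a standard self-training pipeline (sketch, not our exact code).
    # h: student segmentor, teacher: EMA copy of h, strong_aug: photometric/mixing
    # augmentation; all three are illustrative stand-ins.

    # Supervised loss on labeled source frames.
    src_logits = h(src_frames)
    loss_sup = F.cross_entropy(src_logits, src_labels, ignore_index=255)

    # Pseudo-labels for the unlabeled target frames come from the teacher;
    # low-confidence pixels are ignored (0.968 is an assumed threshold).
    with torch.no_grad():
        probs = torch.softmax(teacher(tgt_frames), dim=1)
        conf, pseudo = probs.max(dim=1)
        pseudo[conf < 0.968] = 255

    # Adaptation loss: the student must match the pseudo-labels on strongly
    # augmented target frames.
    tgt_logits = h(strong_aug(tgt_frames))
    loss_adapt = F.cross_entropy(tgt_logits, pseudo, ignore_index=255)
    return loss_sup + loss_adapt
```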



Name            | PL Refine  | Viper to CS-Seq (mIoU)
HRDA (baseline) | None       | 65.5
+ TPS           | Warp Frame | 64.9
+ DAVSN         | MaxConf    | 65.3
+ UDA-VSS       | Consis     | 64.3
+ MOM           | Consis     | 65.9
+ Custom        | None       | 63.4
+ Custom        | None       | 66.9
+ Custom        | Consis     | 65.8
+ Custom        | None       | 63.7
+ Custom        | Consis     | 65.9
+ Custom        | Consis     | 64.3
+ Custom        | Consis     | 66.0

Hybrid Approaches: We test adding existing VideoDA methods (TPS, DAVSN, UDA-VSS, MOM) to our ImageDA baseline (HRDA with DeepLabV2), as well as custom combinations that toggle video discriminators, ACCEL + consistent mixup, and pseudo-label refinement. In this setting, we find that only ACCEL + consistent mixup provides improvements.
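Since ACCEL + consistent mixup is the one combination that helps, the sketch below illustrates both ideas in PyTorch. It is a rough, hypothetical illustration rather than our exact implementation: ACCEL segments both frames of a temporal pair, flow-warps the earlier prediction onto the current frame, and fuses the two score maps with a learned 1x1 convolution, while consistent mixup applies the same class-mix mask to both frames of the pair so the mixed clip stays temporally coherent. The flow convention, tensor shapes, and the segmentor argument are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def flow_warp(x, flow):
    # Bilinearly sample x (N, C, H, W) at locations displaced by flow (N, 2, H, W),
    # assumed to store (dx, dy); a standard grid_sample-based warp used to align
    # the adjacent-frame prediction with the current frame.
    n, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=x.device),
                            torch.arange(w, device=x.device), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float() + flow
    grid[:, 0] = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0   # normalize x to [-1, 1]
    grid[:, 1] = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0   # normalize y to [-1, 1]
    return F.grid_sample(x, grid.permute(0, 2, 3, 1), align_corners=True)


class AccelFusion(nn.Module):
    # ACCEL-style fusion: per-frame predictions for frame t and frame t-k are
    # aligned with optical flow and merged by a learned 1x1 convolution.
    def __init__(self, segmentor, num_classes):
        super().__init__()
        self.segmentor = segmentor                        # any image model, e.g. HRDA
        self.fuse = nn.Conv2d(2 * num_classes, num_classes, kernel_size=1)

    def forward(self, frame_t, frame_tk, flow):
        logits_t = self.segmentor(frame_t)
        logits_tk = flow_warp(self.segmentor(frame_tk), flow)
        return self.fuse(torch.cat([logits_t, logits_tk], dim=1))


def consistent_mixup(frames_t, frames_tk, masks):
    # Consistent mixup: paste the *same* class-mix mask (N, 1, H, W) onto both
    # frames of each temporal pair (pseudo-labels would be mixed with the same
    # mask), so the mixed pair remains usable as a short video clip.
    mixed_t = masks * frames_t + (1 - masks) * frames_t.roll(1, dims=0)
    mixed_tk = masks * frames_tk + (1 - masks) * frames_tk.roll(1, dims=0)
    return mixed_t, mixed_tk
```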


Finally, we test a variety of pseudo-label refinement strategies. While no single strategy consistently outperforms our baseline across the board, there are sporadic gains. For instance, consistency-based refinement sets the state of the art on Synthia-Seq to Cityscapes-Seq.


Name       | Flow Direction | PL Refine  | Viper to CS-Seq (mIoU) | Synthia-Seq to CS-Seq (mIoU)
Source     | -              | -          | 36.7                   | 30.1
HRDA + MIC | -              | -          | 68.3                   | 75.3
Custom     | Backward       | Warp Frame | 62.9                   | 71.1
Custom     | Backward       | Consis     | 67.7                   | 74.7
Custom     | Backward       | MaxConf    | 65.0                   | 73.8
Custom     | Forward        | Warp Frame | 67.2                   | 72.3
Custom     | Forward        | Consis     | 66.8                   | 74.5
Custom     | Forward        | MaxConf    | 68.2                   | 71.9
Oracle     | -              | -          | 78.9                   | 82.8
Target     | -              | -          | 83.0                   | 84.9
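For reference, the three refinement rules compared in the table (Warp Frame, Consis, MaxConf) can be summarized with the short, hypothetical sketch below. It assumes the teacher's prediction on an adjacent frame has already been flow-warped (forward or backward) onto the current frame; the function name, tensor shapes, and ignore index are ours for illustration, not the codebase's API.

```python
import torch

IGNORE = 255  # assumed ignore index for filtered-out pixels


def refine_pseudo_labels(probs_t, probs_warp, strategy="consis"):
    # probs_t:    teacher softmax output on the current frame          (N, C, H, W)
    # probs_warp: teacher output on an adjacent frame, flow-warped to
    #             the current frame (forward or backward flow)         (N, C, H, W)
    conf_t, pl_t = probs_t.max(dim=1)
    conf_w, pl_w = probs_warp.max(dim=1)

    if strategy == "warp_frame":
        # Use the warped adjacent-frame prediction as the pseudo-label.
        return pl_w
    if strategy == "consis":
        # Keep a pixel only if the two predictions agree; ignore the rest.
        pl = pl_t.clone()
        pl[pl_t != pl_w] = IGNORE
        return pl
    if strategy == "maxconf":
        # At each pixel, trust whichever prediction is more confident.
        return torch.where(conf_w > conf_t, pl_w, pl_t)
    raise ValueError(f"unknown strategy: {strategy}")
```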


Conclusion

The field of VideoDA is currently in an odd place. Image-based approaches have advanced so much that they simply outperform video methods, largely because of techniques such as multi-resolution fusion. However, certain VideoDA techniques still show occasional promise, such as ACCEL and pseudo-label refinement. While we find that no single pseudo-label refinement strategy performs well across settings, such approaches are generally more effective on Synthia-Seq, suggesting suitability to the low-data setting. Still, our initial exploration leads us to believe that static, hand-crafted refinement strategies based on heuristics are likely too brittle to consistently improve performance, and that adaptive, learning-based approaches (like ACCEL) perhaps deserve more consideration. Finally, while we show ImageDA methods to be far superior to VideoDA methods for sim-to-real semantic segmentation adaptation, we leave to future work an investigation into whether this holds more generally. Perhaps VideoDA will only be effective for segmentation on harder domain shifts, given the saturating performance on current benchmarks. And in tasks like action recognition, which require a model to understand a series of frames, video may well be necessary.


To accelerate research in VideoDA, we open-source our code, built on top of MMSegmentation, with support for multi-frame I/O and optical flow, as well as baseline implementations of key VideoDA methods. We believe the lack of a standardized codebase for VideoDA is one of the primary reasons for the siloed progress in the field, and we hope that with our codebase, future work can both easily cross-benchmark methods and develop specialized VideoDA methods on top of the latest and greatest in image-level semantic segmentation.


Citation

@article{kareer2024EffectiveVDA,
  title={We're Not Using Videos Effectively: An Updated Domain Adaptive Video Segmentation Baseline},
  author={Kareer, Simar and Vijaykumar, Vivek and Maheshwari, Harsh and Chattopadhyay, Prithvijit and Hoffman, Judy and Prabhu, Viraj},
  journal={Transactions on Machine Learning Research (TMLR)},
  year={2024}
}