Downstream Task Details
AV-SUPERB consists of seven public datasets across five audio-visual tasks:
  1. Audio Event Classification (AEC):
    We evaluate model performance on two datasets, AudioSet and VGGSound. As noted in our paper, while sound events in VGGSound are guaranteed to be visually present, events in AudioSet are not necessarily present in the video frames.
    For AudioSet, the balanced train set containing roughly 20K 10-second clips is used for fine-tuning. Each clip may contain multiple audio events from 527 event types. We optimize binary cross entropy loss for multi-label classification, and report the mean average precision (mAP) on the balanced evaluation set.
    VGGSound consists of ~200K 10-second clips, split into 183K and 15K clips for training and testing, respectively. Each clip is labelled with one event from 309 event types. We report top-1 accuracy on the testing set.
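
A minimal sketch of the AEC setup described above, assuming frozen upstream features pooled into a single clip-level vector; the feature dimension (1024), the linear head, and the helper names are illustrative, not the benchmark code:

```python
import torch
import torch.nn as nn
from sklearn.metrics import average_precision_score

NUM_EVENTS = 527  # AudioSet event types; VGGSound would use 309 with plain cross entropy

head = nn.Linear(1024, NUM_EVENTS)   # linear probe on pooled upstream features (dim assumed)
criterion = nn.BCEWithLogitsLoss()   # multi-label: one sigmoid per event class

def training_step(clip_features, multi_hot_labels):
    """clip_features: (batch, 1024); multi_hot_labels: (batch, 527) in {0, 1}."""
    logits = head(clip_features)
    return criterion(logits, multi_hot_labels.float())

@torch.no_grad()
def mean_average_precision(all_logits, all_labels):
    """Macro-averaged AP over the 527 classes, as reported on the balanced eval set."""
    scores = torch.sigmoid(all_logits).cpu().numpy()
    return average_precision_score(all_labels.cpu().numpy(), scores, average="macro")
```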

  2. Action Recognition (AR):
    We evaluate model performance on two datasets, Kinetics-Sounds and UCF101. As noted in our paper, while Kinetics-Sounds contains videos with higher correlation between sound and visual frames, visual actions in UCF101 may be accompanied by unrelated audio.
    Kinetics-Sounds is a curated subset of the Kinetics400 dataset with 28K clips, split into 23K training, 1.6K validation, and 3.1K testing videos. The dataset focuses on 32 action classes that are more likely to be present in both visual and audio modalities.
    UCF101 contains 13K clips of 101 action classes. We use the first of the three official train/test splits for evaluation. For both action recognition datasets, downstream models are trained to minimize standard cross-entropy loss, and we report top-1 classification accuracy on their respective testing sets.
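
For the single-label action recognition tasks, the corresponding sketch is a softmax classifier trained with cross-entropy; again the pooled feature dimension and linear head are assumptions:

```python
import torch
import torch.nn as nn

head = nn.Linear(1024, 101)        # 101-way head for UCF101; Kinetics-Sounds uses 32 classes
criterion = nn.CrossEntropyLoss()

def training_step(clip_features, labels):
    return criterion(head(clip_features), labels)

@torch.no_grad()
def top1_accuracy(logits, labels):
    return (logits.argmax(dim=-1) == labels).float().mean().item()
```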

  3. Automatic Speech Recognition (ASR):
    We evaluate model performance on the LRS3-TED dataset, which consists of 433 hours of video derived from online TED talks. We optimize a 2-layer bidirectional LSTM model with a hidden dimension of 1024 using CTC loss for character-level ASR, and report character error rate (CER) on the test set. During inference, we use greedy decoding without external language model re-scoring.
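
The ASR downstream model can be sketched as below. The 2-layer bidirectional LSTM with 1024 hidden units and greedy CTC decoding follow the description above; the upstream feature dimension, character vocabulary size, and blank index are assumptions:

```python
import torch
import torch.nn as nn

BLANK = 0  # CTC blank index (assumed)

class BiLSTMCTC(nn.Module):
    """2-layer BiLSTM with 1024 hidden units per direction, projected to character logits."""
    def __init__(self, feat_dim=1024, vocab_size=32):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, 1024, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * 1024, vocab_size)

    def forward(self, feats):                     # (batch, time, feat_dim)
        out, _ = self.lstm(feats)
        return self.proj(out)                     # (batch, time, vocab_size)

model = BiLSTMCTC()
ctc_loss = nn.CTCLoss(blank=BLANK, zero_infinity=True)

def training_step(feats, feat_lens, targets, target_lens):
    log_probs = model(feats).log_softmax(-1).transpose(0, 1)  # (time, batch, vocab)
    return ctc_loss(log_probs, targets, feat_lens, target_lens)

def greedy_decode(logits):
    """Greedy decoding without an external LM: argmax per frame,
    collapse repeated symbols, then drop blanks."""
    path = logits.argmax(dim=-1).tolist()         # (time,) for a single utterance
    collapsed = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
    return [p for p in collapsed if p != BLANK]
```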

  4. Automatic Speaker Verification (ASV):
    We evaluate model performance on the VoxCeleb2 dataset, which consists of over 1M video clips. To keep the computational cost of evaluating models reasonable, we sample only 5 videos (each containing multiple utterances) per speaker from the dev subset for fine-tuning. One additional video per speaker is used for validation. The official testing set is used to generate target and non-target trials. We optimize the downstream model with the additive margin softmax loss, and report equal error rate (EER) on the testing trials.
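
A minimal sketch of the additive margin softmax objective and the EER metric; the embedding dimension, speaker count, margin, and scale below are assumed values, not the benchmark's exact settings:

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import roc_curve

class AMSoftmaxLoss(nn.Module):
    """Additive margin softmax: cosine scores against per-speaker weight vectors,
    with margin m subtracted from the target logit and scaled by s."""
    def __init__(self, emb_dim=256, n_speakers=5994, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_speakers, emb_dim))
        self.margin, self.scale = margin, scale

    def forward(self, embeddings, labels):
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = self.scale * torch.where(target, cosine - self.margin, cosine)
        return F.cross_entropy(logits, labels)

def equal_error_rate(scores, is_target):
    """EER over the trial list: the operating point where the false accept
    and false reject rates are equal."""
    fpr, tpr, _ = roc_curve(is_target, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2
```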

  5. Emotion Recognition (ER):
    We evaluate model performance on the IEMOCAP dataset. Following conventional evaluation setups, we remove unbalanced classes to perform four-way classification (neutral, happy, sad, angry). We use Session 1 as the testing set and report top-1 accuracy.

Training Details
Dataset           Batch Size   Training Steps
AudioSet          100          5000
VGGSound          128          5000
Kinetics-Sounds   32           50000
UCF101            16           30000
LRS3-TED          32           20000
VoxCeleb2         16           20000
IEMOCAP           32           35000
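
For convenience, the same settings as a small Python mapping (the key names and tuple layout are illustrative; other hyperparameters such as learning rate are not listed here):

```python
# (batch size, training steps) per downstream dataset, mirroring the table above.
TRAINING_CONFIG = {
    "AudioSet":        (100, 5_000),
    "VGGSound":        (128, 5_000),
    "Kinetics-Sounds": (32, 50_000),
    "UCF101":          (16, 30_000),
    "LRS3-TED":        (32, 20_000),
    "VoxCeleb2":       (16, 20_000),
    "IEMOCAP":         (32, 35_000),
}

batch_size, num_steps = TRAINING_CONFIG["VGGSound"]  # e.g. 128, 5000
```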