AV-SUPERB is a collection of benchmarking resources to evaluate the capability of universal shared audio-visual representations for speech and audio processing. AV-SUPERB consists of the following:
- A benchmark of five speech and audio processing tasks built on seven established public datasets,
- A benchmark toolkit designed to evaluate and analyze pretrained model performance on various downstream tasks following conventional evaluation protocols,
- A public leaderboard for submissions and performance tracking on the benchmark.
AV-SUPERB aims to offer the community a standard and comprehensive framework to train, evaluate, and compare the generalizability of universal audio-visual representations on a wide range of tasks. A universal representation can be leveraged to quickly adapt to diverse downstream tasks with minimal architectural change and downstream fine-tuning, reducing the model development cycle time for new tasks. To focus evaluation on the quality of the learned universal representation, AV-SUPERB puts an explicit constraint on the downstream model and limits its parameter size.
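The constrained evaluation protocol can be sketched as follows: the upstream model stays frozen, and only a small downstream head is trained under a parameter budget. The feature dimension, budget value, and all names below are illustrative assumptions, not AV-SUPERB's actual interface:

```python
import numpy as np

FEATURE_DIM = 768         # assumed size of the upstream representation
NUM_CLASSES = 10          # assumed label count for one downstream task
PARAM_BUDGET = 1_000_000  # assumed cap on trainable downstream parameters

rng = np.random.default_rng(0)

def upstream_features(batch):
    """Stand-in for a frozen pretrained model: its weights are never updated,
    so its parameters do not count toward the downstream budget."""
    return rng.standard_normal((len(batch), FEATURE_DIM))

class LinearHead:
    """Minimal downstream model: a single linear layer is the only trainable part."""
    def __init__(self):
        self.W = np.zeros((FEATURE_DIM, NUM_CLASSES))
        self.b = np.zeros(NUM_CLASSES)

    def num_parameters(self):
        return self.W.size + self.b.size

    def predict(self, feats):
        return feats @ self.W + self.b

head = LinearHead()
# The benchmark-style constraint: reject downstream models over the budget.
assert head.num_parameters() <= PARAM_BUDGET, "downstream model exceeds budget"

logits = head.predict(upstream_features(["clip_0", "clip_1"]))
print(logits.shape)  # one score vector per input clip
```

Because the upstream model is shared and frozen across tasks, differences in downstream scores can be attributed to the representation itself rather than to task-specific architecture tuning.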
The ultimate goal of AV-SUPERB is to democratize the advancements in audio-visual learning with powerful, generalizable, and reusable representations. As we gradually release new tasks and open new tracks, we invite researchers to participate in the challenge and advance the research frontier together.