5th Workshop on Image/Video/Audio Quality Assessment in Computer Vision, VLM and Diffusion Model
Held in conjunction with WACV 2026
Workshop Date: Mar 7, 2026
Location: AZ Ballroom Salon 1
Image, video, and audio quality significantly impacts machine learning and computer vision systems, yet remains underexplored by the broader research community. Real-world applications—from streaming services and autonomous vehicles to cashier-less stores and generative AI—critically depend on robust quality assessment and improvement techniques. Despite their importance, most visual learning systems assume high-quality inputs, while in reality, artifacts from capture, compression, transmission, and rendering processes can severely degrade performance and user experience.
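To make this concrete, the minimal Python sketch below (an illustration only, assuming Pillow and NumPy are installed; the file name example.png is a placeholder) re-encodes an image at progressively lower JPEG quality and scores each result with PSNR, a classic full-reference quality metric of the kind this workshop studies:

```python
# Illustrative sketch only: simulate compression degradation and measure it
# with PSNR (peak signal-to-noise ratio). "example.png" is a placeholder path.
import io

import numpy as np
from PIL import Image


def psnr(ref: np.ndarray, test: np.ndarray) -> float:
    """PSNR in dB between two uint8 images of the same shape."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0**2 / mse)


original = Image.open("example.png").convert("RGB")

# Re-encode at increasingly aggressive JPEG quality settings; lower PSNR
# indicates stronger compression artifacts.
for quality in (90, 50, 10):
    buffer = io.BytesIO()
    original.save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    degraded = Image.open(buffer).convert("RGB")
    score = psnr(np.asarray(original), np.asarray(degraded))
    print(f"JPEG quality={quality}: PSNR = {score:.2f} dB")
```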
This workshop is particularly timely given the explosive growth of generative AI, which introduces new challenges in quality assessment for both inputs and outputs. By bringing together researchers from industry and academia, we aim to systematically investigate how quality issues affect various visual learning tasks and develop innovative assessment and mitigation techniques. Building on the success of our previous workshops at WACV (2022-2025), we expect to stimulate new research directions and attract more talent to this critical field, ultimately improving the robustness and reliability of computer vision applications across industries.
This workshop addresses topics related to image/video/audio quality assessment in machine learning, computer vision, VLMs, diffusion models, and other types of generative AI.

Workshop Schedule
| Time | Event | Duration |
|---|---|---|
| 8:20-8:30am | Opening Remarks (Host: Joe Liu) | 10 mins |
| 8:30-9:30am | Keynote Speaker: Gérard G. Medioni (Host: Joe Liu) | 60 mins |
| 9:30-10:15am | Coffee Break | 45 mins |
| 10:15-11:45am | Oral Long Session I (Host: Yarong Feng) | 90 mins |
| 10:15-10:30am | Fast 2DGS: Efficient Image Representation with Deep Gaussian Prior (in person) | 15 mins |
| 10:30-10:45am | REMinD: Balancing Robust Concept Unlearning and Image Quality in Diffusion Models (in person) | 15 mins |
| 10:45-11:00am | Reason Then Ground: Multilingual Text/Logo Grounding on Movie Posters (in person) | 15 mins |
| 11:00-11:15am | VideoForge: Efficient Domain Adaptation for Video Generation Through Quality-Driven Rewards and Enhanced LoRA (in person) | 15 mins |
| 11:15-11:30am | HandSurge: Localized Neural Surgery for Diffusion-Generated Hand Deformity Restoration (in person) | 15 mins |
| 11:30-11:45am | HiFi-Deblur: High-Frequency Intense Image Deblurring with Frequency-Decoupled U-Net and Discrete Wavelet Transform (in person) | 15 mins |
| 11:45am-1:00pm | Lunch Break | 75 mins |
| 1:00-2:00pm | Oral Long Session II (Host: Qipin Chen) | 60 mins |
| 1:00-1:15pm | Motion Blur Detection and Segmentation from Static Image Artworks (in person) | 15 mins |
| 1:15-1:30pm | Transforming Video Subjective Testing with Training, Engagement, and Real-Time Feedback (in person) | 15 mins |
| 1:30-1:45pm | Can You Find the Difference? Visually Identical Image Detection (in person) | 15 mins |
| 1:45-2:00pm | Efficient Deep Demosaicing with Spatially Downsampled Isotropic Networks (in person) | 15 mins |
| 2:00-3:00pm | Keynote Speaker: Sarah Ostadabbas, "Toward Data-Efficient Dynamically-Aware Visual Intelligence" (Host: Joe Liu) | 60 mins |
| 3:00-3:45pm | Coffee Break | 45 mins |
| 3:45-4:48pm | Oral Short Session I (Host: Qipin Chen) | 63 mins |
| 3:45-3:52pm | Diffuse4D: Completing NeRF-Stereo Depth via Diffusion-Driven Restoration in Dynamic Scenes (in person) | 7 mins |
| 3:52-3:59pm | Seeing in the Dark: Synthesizing Underexposure for More Robust Underwater Image Augmentation (in person) | 7 mins |
| 3:59-4:06pm | ViTNT-FIQA: Training-Free Face Image Quality Assessment with Vision Transformers (in person) | 7 mins |
| 4:06-4:13pm | Cost Savings from Automatic Quality Assessment of Generated Images (in person) | 7 mins |
| 4:13-4:20pm | VIBEFACE - Video and Image Biometric Dataset for Evaluation of Faces (in person) | 7 mins |
| 4:20-4:27pm | We Still See Broken Limbs: Towards Anatomical Realism in GenAI via Human Preference Learning (in person) | 7 mins |
| 4:27-4:34pm | JetBench: Quality-Aware Benchmarking of Vision Models for Jet Parameter Classification in Heavy-Ion Physics (in person) | 7 mins |
| 4:34-4:41pm | STEC: A Spatio-Temporal Entropy Coverage Metric for Evaluating Sampled Video Frames (in person) | 7 mins |
| 4:41-4:48pm | SPoRC-VIST: A Benchmark for Evaluating Generative Natural Narrative in Vision-Language Models (in person) | 7 mins |
| 4:48-5:00pm | Closing Remarks (Host: Joe Liu) | 12 mins |
| 5:00-6:03pm | Poster Session + Online Oral Presentations | 63 mins |
| 5:00-5:07pm | From Filters to VLMs: Benchmarking Defogging Methods through Object Detection and Segmentation Performance (online) | 7 mins |
| 5:07-5:14pm | Device-Robust Spectral Grading and Origin Detection from UV-Vis-NIR Images: Towards Practical Gemstone Quality Assessment (online) | 7 mins |
| 5:14-5:21pm | Vision Language Models Learn to Assess Images with Specialists (online) | 7 mins |
| 5:21-5:28pm | When Probe and Gallery are Low Quality: Decreasing Accuracy and Increasing Demographic Disparities in 1:N Identification (online) | 7 mins |
| 5:28-5:35pm | CARLA-Haze: A Synthetic Benchmark for Outdoor Image Dehazing (online) | 7 mins |
| 5:35-5:42pm | Quality-Driven and Diversity-Aware Sample Expansion for Robust Marine Obstacle Segmentation (online) | 7 mins |
| 5:42-5:49pm | Enhancement as Augmentation: Improving Detection in Highly Degraded Underwater Images Through Mixed-Domain Training (online) | 7 mins |
| 5:49-5:56pm | Image-Specific Adaptation of Transformer Encoders for Compute-Efficient Segmentation (online) | 7 mins |
| 5:56-6:03pm | YOLO-OSA: A ShuffleAttention-Enhanced YOLO Model for FOD Detection with Comprehensive Benchmarking on MS COCO (online) | 7 mins |
Zoom information for virtual presentations: TBD
Keynote Speaker: Gérard G. Medioni

Abstract: Gérard Medioni will present an exploration of the cutting-edge technology driving the Prime Video experience. The session opens with three flagship innovations: AI-aided dubbing, video season recaps, and a newly launched NBA feature, each showcasing how Prime Video is pushing the boundaries of content delivery and personalization. Gérard will then take a deeper dive into three areas of active technical development: the challenges and solutions behind delivering streaming video in vertical format for mobile audiences; Prime Video's approach to detecting and classifying audio quality defects; and the unique image quality challenges inherent to livestreaming and how the team is tackling them.
Bio: Gérard G. Medioni is a computer scientist, author, academic, and inventor. He is a vice president and distinguished scientist at Amazon and serves as emeritus professor of Computer Science at the University of Southern California. Medioni has made contributions to computer vision, in particular 3D sensing, surface reconstruction, and object modeling, and has translated his computer vision research into customer-facing inventions and products. He has authored four books, including Emerging Topics in Computer Vision; Multimedia Systems: Algorithms, Standards, and Industry Practices; and A Computational Framework for Segmentation and Grouping, and has published more than 80 journal papers and 200 conference papers, which have drawn over 34,000 citations and an h-index of 88. In addition, he holds 123 patents, including "Visual tracking in video images in unconstrained environments by exploiting on-the-fly context using supporters and distracters" and "Depth mapping based on pattern matching and stereoscopic information," along with patents on Just Walk Out technology and Amazon One. Medioni is a Fellow of the Association for the Advancement of Artificial Intelligence, the Institute of Electrical and Electronics Engineers, the International Association for Pattern Recognition, and the National Academy of Inventors, and a member of the National Academy of Engineering.
Keynote Speaker: Sarah Ostadabbas, "Toward Data-Efficient Dynamically-Aware Visual Intelligence"

Abstract: Despite rapid advances in multimodal foundation models, today's video AI systems still struggle to reason about motion, causality, and physical change, especially in real-world, small-data environments. This talk argues that scaling data and parameters alone yields models that reproduce appearance, but fail to anticipate how the world evolves. Instead, I advocate for a shift toward motion-grounded visual intelligence, where dynamics (not static frames or language priors) form the core representation. I will present recent work from our lab demonstrating how motion provides a low-dimensional bridge between pixels and physics, enabling systems that discover, describe, and generate video through dynamics-aware reasoning. Using examples from our motion-aware zero-prompt video understanding and our physics-grounded generative framework, I show how treating video as a learnable world model, rather than a sequence of images, supports more robust generalization, interpretable reasoning, and physically consistent generation. The talk concludes with a broader vision for Physical AI: data-efficient systems that learn from motion, reason over future states, and operate safely in unconstrained environments such as healthcare, robotics, and human-centered applications.
Bio: Professor Ostadabbas is an associate professor in the Electrical and Computer Engineering Department at Northeastern University (NU) in Boston, Massachusetts, USA. She joined NU in 2016 after completing her postdoctoral research at Georgia Tech and earning her PhD at the University of Texas at Dallas in 2014. At NU, Professor Ostadabbas is Director of the Augmented Cognition Laboratory (ACLab), Director of Women in Engineering (WIE), and Co-Director of the Center for Signal Processing, Imaging, Reasoning, and Learning (SPIRAL). Her research focuses on the convergence of computer vision and machine learning, with particular emphasis on representation learning in visual perception problems. In her applied research, she has contributed significantly to the understanding, detection, and prediction of human and animal behaviors through the modeling of visual motion and its underlying biomechanical factors. Professor Ostadabbas also works in the small-data domain, including medical and military applications where data collection and labeling are costly and protected by strict privacy laws. Her solutions involve deep learning frameworks that operate effectively with limited labeled training data, incorporate domain knowledge for prior learning and synthetic data augmentation, and improve cross-domain generalization by acquiring invariant representations.

Professor Ostadabbas has co-authored over 130 peer-reviewed journal and conference articles and has received research awards from the National Science Foundation (NSF), Department of Defense (DoD), Sony, MathWorks, Amazon AWS, Verizon, Oracle, Biogen, and NVIDIA. She received the NSF CAREER Award (2022) and the Sony Faculty Innovation Award (2023), was runner-up for the Oracle Excellence Award (2023), and was recognized by LDV Capital as one of the 120+ Women Spearheading Advances in Visual Tech and AI (2024). She has served on the organizing committees of many workshops at renowned conferences (CVPR, ECCV, ICCV, ICIP, ICASSP, BioCAS, CHASE, and ICHI) in roles including Lead/Co-Lead Organizer, Program Chair, Board Member, Publicity Co-Chair, Session Chair, Technical Committee Member, and Mentor.
If you have any questions or inquiries, please contact us at wacv2026-image-quality-workshop@amazon.com.