5th Workshop on Image/Video/Audio Quality Assessment in Computer Vision, VLM and Diffusion Model
Held in conjunction with WACV 2026
Workshop Date: Mar 7, 2026
Location: AZ Ballroom Salon 1
Image, video, and audio quality significantly impacts machine learning and computer vision systems, yet remains underexplored by the broader research community. Real-world applications—from streaming services and autonomous vehicles to cashier-less stores and generative AI—critically depend on robust quality assessment and improvement techniques. Despite their importance, most visual learning systems assume high-quality inputs, while in reality, artifacts from capture, compression, transmission, and rendering processes can severely degrade performance and user experience.
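To make this concrete, the minimal Python sketch below (an illustration only, assuming Pillow and NumPy are installed; the file name example.png is a placeholder) re-encodes an image at progressively lower JPEG quality and scores each result with PSNR, a classic full-reference quality metric of the kind this workshop studies:

```python
# Illustrative sketch only: simulate compression degradation and measure it
# with PSNR (peak signal-to-noise ratio). "example.png" is a placeholder path.
import io

import numpy as np
from PIL import Image


def psnr(ref: np.ndarray, test: np.ndarray) -> float:
    """PSNR in dB between two uint8 images of the same shape."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0**2 / mse)


original = Image.open("example.png").convert("RGB")

# Re-encode at increasingly aggressive JPEG quality settings; lower PSNR
# indicates stronger compression artifacts.
for quality in (90, 50, 10):
    buffer = io.BytesIO()
    original.save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    degraded = Image.open(buffer).convert("RGB")
    score = psnr(np.asarray(original), np.asarray(degraded))
    print(f"JPEG quality={quality}: PSNR = {score:.2f} dB")
```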
This workshop is particularly timely given the explosive growth of generative AI, which introduces new challenges in quality assessment for both inputs and outputs. By bringing together researchers from industry and academia, we aim to systematically investigate how quality issues affect various visual learning tasks and develop innovative assessment and mitigation techniques. Building on the success of our previous workshops at WACV (2022-2025), we expect to stimulate new research directions and attract more talent to this critical field, ultimately improving the robustness and reliability of computer vision applications across industries.
This workshop addresses topics related to image/video/audio quality assessment in machine learning, computer vision, VLMs, diffusion models, and other types of generative AI.

Workshop Schedule
| Time | Event | Duration |
|---|---|---|
| 8:20-8:30am | Opening Remarks (Host: Joe Liu) | 10 mins |
| 8:30-9:30am | Keynote Speaker: Gérard G. Medioni (Host: Joe Liu) | 60 mins |
| 9:30-10:15am | Coffee Break | 45 mins |
| 10:15-11:45am | Oral Long Session I (Host: Yarong Feng) | 90 mins |
| 10:15-10:30am | Fast 2DGS: Efficient Image Representation with Deep Gaussian Prior (in person) | 15 mins |
| 10:30-10:45am | REMinD: Balancing Robust Concept Unlearning and Image Quality in Diffusion Models (in person) | 15 mins |
| 10:45-11:00am | Reason Then Ground: Multilingual Text/Logo Grounding on Movie Posters (in person) | 15 mins |
| 11:00-11:15am | VideoForge: Efficient Domain Adaptation for Video Generation Through Quality-Driven Rewards and Enhanced LoRA (in person) | 15 mins |
| 11:15-11:30am | HandSurge: Localized Neural Surgery for Diffusion-Generated Hand Deformity Restoration (in person) | 15 mins |
| 11:30-11:45am | HiFi-Deblur: High-Frequency Intense Image Deblurring with Frequency-Decoupled U-Net and Discrete Wavelet Transform (in person) | 15 mins |
| 11:45am-1:00pm | Lunch Break | 75 mins |
| 1:00-2:00pm | Oral Long Session II (Host: Qipin Chen) | 60 mins |
| 1:00-1:15pm | Motion Blur Detection and Segmentation from Static Image Artworks (in person) | 15 mins |
| 1:15-1:30pm | Transforming Video Subjective Testing with Training, Engagement, and Real-Time Feedback (in person) | 15 mins |
| 1:30-1:45pm | Can You Find the Difference? Visually Identical Image Detection (in person) | 15 mins |
| 1:45-2:00pm | Efficient Deep Demosaicing with Spatially Downsampled Isotropic Networks (in person) | 15 mins |
| 2:00-3:00pm | Keynote Speaker: Sarah Ostadabbas, "Toward Data-Efficient Dynamically-Aware Visual Intelligence" (Host: Joe Liu) | 60 mins |
| 3:00-3:45pm | Coffee Break | 45 mins |
| 3:45-4:48pm | Oral Short Session I (Host: Qipin Chen) | 63 mins |
| 3:45-3:52pm | Diffuse4D: Completing NeRF-Stereo Depth via Diffusion-Driven Restoration in Dynamic Scenes (in person) | 7 mins |
| 3:52-3:59pm | Seeing in the Dark: Synthesizing Underexposure for More Robust Underwater Image Augmentation (in person) | 7 mins |
| 3:59-4:06pm | ViTNT-FIQA: Training-Free Face Image Quality Assessment with Vision Transformers (in person) | 7 mins |
| 4:06-4:13pm | Cost Savings from Automatic Quality Assessment of Generated Images (in person) | 7 mins |
| 4:13-4:20pm | VIBEFACE - Video and Image Biometric Dataset for Evaluation of Faces (in person) | 7 mins |
| 4:20-4:27pm | We Still See Broken Limbs: Towards Anatomical Realism in GenAI via Human Preference Learning (in person) | 7 mins |
| 4:27-4:34pm | JetBench: Quality-Aware Benchmarking of Vision Models for Jet Parameter Classification in Heavy-Ion Physics (in person) | 7 mins |
| 4:34-4:41pm | STEC: A Spatio-Temporal Entropy Coverage Metric for Evaluating Sampled Video Frames (in person) | 7 mins |
| 4:41-4:48pm | SPoRC-VIST: A Benchmark for Evaluating Generative Natural Narrative in Vision-Language Models (in person) | 7 mins |
| 4:48-5:00pm | Closing Remarks (Host: Joe Liu) | 12 mins |
| 5:00-6:03pm | Poster Session + Online Oral Presentations | 63 mins |
| 5:00-5:07pm | From Filters to VLMs: Benchmarking Defogging Methods through Object Detection and Segmentation Performance (online) | 7 mins |
| 5:07-5:14pm | Device-Robust Spectral Grading and Origin Detection from UV-Vis-NIR Images: Towards Practical Gemstone Quality Assessment (online) | 7 mins |
| 5:14-5:21pm | Vision Language Models Learn to Assess Images with Specialists (online) | 7 mins |
| 5:21-5:28pm | When Probe and Gallery are Low Quality: Decreasing Accuracy and Increasing Demographic Disparities in 1:N Identification (online) | 7 mins |
| 5:28-5:35pm | CARLA-Haze: A Synthetic Benchmark for Outdoor Image Dehazing (online) | 7 mins |
| 5:35-5:42pm | Quality-Driven and Diversity-Aware Sample Expansion for Robust Marine Obstacle Segmentation (online) | 7 mins |
| 5:42-5:49pm | Enhancement as Augmentation: Improving Detection in Highly Degraded Underwater Images Through Mixed-Domain Training (online) | 7 mins |
| 5:49-5:56pm | Image-Specific Adaptation of Transformer Encoders for Compute-Efficient Segmentation (online) | 7 mins |
| 5:56-6:03pm | YOLO-OSA: A ShuffleAttention-Enhanced YOLO Model for FOD Detection with Comprehensive Benchmarking on MS COCO (online) | 7 mins |
Zoom information for virtual presentations: TBD
Keynote Speaker: Gérard G. Medioni

Abstract: Gérard Medioni will present an exploration of the cutting-edge technology driving the Prime Video experience. The session opens with three flagship innovations: AI-aided dubbing, video season recaps, and a newly launched NBA feature, each showcasing how Prime Video is pushing the boundaries of content delivery and personalization. Gérard will then take a deeper dive into three areas of active technical development: the challenges and solutions behind delivering streaming video in vertical format for mobile audiences; Prime Video's approach to detecting and classifying audio quality defects; and the unique image quality challenges inherent to livestreaming and how the team is tackling them.
Bio: Gérard G. Medioni is a computer scientist, author, academic, and inventor. He is a vice president and distinguished scientist at Amazon and serves as emeritus professor of Computer Science at the University of Southern California. Medioni has made contributions to computer vision, in particular 3D sensing, surface reconstruction, and object modeling, and has translated his computer vision research into customer-facing inventions and products. He has authored four books, including Emerging Topics in Computer Vision; Multimedia Systems: Algorithms, Standards, and Industry Practices; and A Computational Framework for Segmentation and Grouping, and has published more than 80 journal papers and 200 conference papers, which have drawn over 34,000 citations and an h-index of 88. In addition, he holds 123 patents, including "Visual tracking in video images in unconstrained environments by exploiting on-the-fly context using supporters and distracters" and "Depth mapping based on pattern matching and stereoscopic information," along with patents on Just Walk Out technology and Amazon One. Medioni is a Fellow of the Association for the Advancement of Artificial Intelligence, the Institute of Electrical and Electronics Engineers, the International Association for Pattern Recognition, and the National Academy of Inventors, and a member of the National Academy of Engineering.
Keynote Speaker: Sarah Ostadabbas, "Toward Data-Efficient Dynamically-Aware Visual Intelligence"

Abstract: Despite rapid advances in multimodal foundation models, today's video AI systems still struggle to reason about motion, causality, and physical change, especially in real-world, small-data environments. This talk argues that scaling data and parameters alone yields models that reproduce appearance, but fail to anticipate how the world evolves. Instead, I advocate for a shift toward motion-grounded visual intelligence, where dynamics (not static frames or language priors) form the core representation. I will present recent work from our lab demonstrating how motion provides a low-dimensional bridge between pixels and physics, enabling systems that discover, describe, and generate video through dynamics-aware reasoning. Using examples from our motion-aware zero-prompt video understanding and our physics-grounded generative framework, I show how treating video as a learnable world model, rather than a sequence of images, supports more robust generalization, interpretable reasoning, and physically consistent generation. The talk concludes with a broader vision for Physical AI: data-efficient systems that learn from motion, reason over future states, and operate safely in unconstrained environments such as healthcare, robotics, and human-centered applications.
Bio: Professor Ostadabbas is an associate professor in the Electrical and Computer Engineering Department at Northeastern University (NU) in Boston, Massachusetts, USA. She joined NU in 2016 after completing her postdoctoral research at Georgia Tech and earning her PhD at the University of Texas at Dallas in 2014. At NU, Professor Ostadabbas is Director of the Augmented Cognition Laboratory (ACLab), Director of Women in Engineering (WIE), and Co-Director of the Center for Signal Processing, Imaging, Reasoning, and Learning (SPIRAL). Her research focuses on the convergence of computer vision and machine learning, with particular emphasis on representation learning in visual perception problems. In her applied research, she has contributed significantly to the understanding, detection, and prediction of human and animal behaviors through the modeling of visual motion and its underlying biomechanical factors. Professor Ostadabbas also works in the small-data domain, including medical and military applications where data collection and labeling are costly and protected by strict privacy laws. Her solutions involve deep learning frameworks that operate effectively with limited labeled training data, incorporate domain knowledge for prior learning and synthetic data augmentation, and improve cross-domain generalization by acquiring invariant representations.

Professor Ostadabbas has co-authored over 130 peer-reviewed journal and conference articles and has received research awards from the National Science Foundation (NSF), Department of Defense (DoD), Sony, MathWorks, Amazon AWS, Verizon, Oracle, Biogen, and NVIDIA. She received the NSF CAREER Award (2022) and the Sony Faculty Innovation Award (2023), was runner-up for the Oracle Excellence Award (2023), and was recognized by LDV Capital as one of the 120+ Women Spearheading Advances in Visual Tech and AI (2024). She has served on the organizing committees of many workshops at renowned conferences (CVPR, ECCV, ICCV, ICIP, ICASSP, BioCAS, CHASE, and ICHI) in roles including Lead/Co-Lead Organizer, Program Chair, Board Member, Publicity Co-Chair, Session Chair, Technical Committee Member, and Mentor.
If you have any questions or inquiries, please contact us at wacv2026-image-quality-workshop@amazon.com.