FAIR-TAT: Improving Model Fairness Using Targeted Adversarial Training

Published in WACV 2025

Deep neural networks are susceptible to adversarial attacks and common corruptions, which undermine their robustness. To enhance model resilience against such challenges, Adversarial Training (AT) has emerged as a prominent solution. However, adversarial robustness is often attained at the expense of model fairness during AT, i.e., a disparity in the class-wise robustness of the model: while easily distinguishable classes become more robust to such adversaries, hard-to-detect classes suffer. Recent research has focused on improving model fairness specifically for perturbed images, overlooking accuracy on the more likely non-perturbed (clean) data. Additionally, despite their robustness against the adversaries encountered during training, state-of-the-art adversarially trained models have difficulty maintaining robustness and fairness when confronted with diverse adversarial threats or common corruptions. In this work, we address the above concerns by introducing a novel approach called Fair Targeted Adversarial Training (FAIR-TAT). We show that using targeted adversarial attacks for adversarial training (instead of untargeted attacks) allows for more favorable trade-offs with respect to adversarial fairness. Empirical results validate the efficacy of our approach.
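To illustrate the core distinction the abstract draws, the sketch below contrasts an untargeted with a targeted FGSM-style perturbation on a toy linear softmax classifier. This is a minimal illustration only, not the paper's FAIR-TAT algorithm: the model, shapes, and step size are made up, and a targeted attack simply descends the loss of a chosen target class instead of ascending the loss of the true class.

```python
import numpy as np

# Toy linear classifier: logits = x @ W. All names and shapes here are
# illustrative assumptions, not taken from the FAIR-TAT paper.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))  # 4 input features, 3 classes


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def ce_grad_x(x, y, W):
    """Gradient of cross-entropy loss w.r.t. the input x, for label y."""
    p = softmax(x @ W)
    onehot = np.eye(W.shape[1])[y]
    return W @ (p - onehot)


def fgsm(x, y, W, eps, targeted=False):
    """One sign-gradient step.

    Untargeted: ascend the loss of the true label y (push away from it).
    Targeted:   descend the loss of the target label y (push toward it).
    """
    g = ce_grad_x(x, y, W)
    step = -eps * np.sign(g) if targeted else eps * np.sign(g)
    return x + step


x = rng.normal(size=4)
true_y = int(np.argmax(x @ W))
target_y = (true_y + 1) % 3  # a hypothetical target class

# Untargeted step lowers confidence in the true class ...
x_untgt = fgsm(x, true_y, W, eps=0.05, targeted=False)
# ... targeted step raises confidence in the chosen target class.
x_tgt = fgsm(x, target_y, W, eps=0.05, targeted=True)

p_clean = softmax(x @ W)
p_untgt = softmax(x_untgt @ W)
p_tgt = softmax(x_tgt @ W)
```

Because a targeted attack lets the training procedure choose *which* class each example is pushed toward, it offers a knob for balancing class-wise robustness, which is the lever FAIR-TAT exploits.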

Download paper here