Towards backdoor attacks and defense in robust machine learning models

Category Machine Learning

tldr #

Asst. Prof. Sudipta Chattopadhyay and fellow SUTD researchers, in their study "Towards backdoor attacks and defense in robust machine learning models", published in Computers & Security, examined how to inject and defend against backdoor attacks in robust image classifiers trained with the state-of-the-art projected gradient descent (PGD) method. They found that robust models are highly susceptible to backdoor attacks (67.8% attack success rate) and developed AEGIS, the first backdoor detection technique for PGD-trained robust models, with an average detection rate of 93.7%.


content #

Software systems are all around us, from the operating systems of our computers to search engines to the automation used in industrial applications. At the center of all of this is data, which feeds the machine learning (ML) components found in a wide variety of applications, including self-driving cars and large language models (LLMs). Because so many systems rely on ML components, it is important to guarantee their security and reliability. Yet for ML models trained using robust optimization methods (robust ML models), susceptibility to attack remains largely unknown. One major attack vector is backdoor poisoning, in which compromised training data is fed into the model. Technologies that detect backdoor attacks in standard ML models exist, but robust models behave differently from standard models and rest on different assumptions, so they require different detection methods.

PGD-trained robust models are susceptible to backdoor attacks, with a 67.8% attack success rate

This is the gap that Dr. Sudipta Chattopadhyay, Assistant Professor at the Information Systems Technology and Design (ISTD) Pillar of the Singapore University of Technology and Design (SUTD), aimed to close.

In the study "Towards backdoor attacks and defense in robust machine learning models", published in Computers & Security, Asst. Prof. Chattopadhyay and fellow SUTD researchers studied how to inject and defend against backdoor attacks in robust image classifiers. Specifically, the models in question were trained using the state-of-the-art projected gradient descent (PGD) method.
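For readers unfamiliar with PGD training: the idea is to generate worst-case adversarial perturbations of each training batch and fit the model on those perturbed inputs instead of the clean ones. Below is a minimal PyTorch sketch of this scheme, not the paper's implementation; the hyperparameters (eps, alpha, steps) and function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=7):
    """Projected gradient descent: search for a perturbation of x,
    within an L-infinity ball of radius eps, that maximizes the loss."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()        # gradient ascent step
        x_adv = x + (x_adv - x).clamp(-eps, eps)   # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1)                  # stay in the valid pixel range
    return x_adv.detach()

def robust_training_step(model, optimizer, x, y):
    """One step of PGD (robust) training: train on adversarial examples."""
    model.eval()                                   # freeze batch-norm stats during the attack
    x_adv = pgd_attack(model, x, y)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```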

AEGIS was developed by Asst. Prof. Chattopadhyay and fellow SUTD researchers as the first backdoor detection technique for PGD-trained robust models

The backdoor issue is urgent and dangerous, especially because of how current software pipelines are developed. Chattopadhyay stated, "No one develops an ML model pipeline and data collection from scratch nowadays. They might download training data from the internet or even use a pre-trained model. If the pre-trained model or dataset is poisoned, the resulting software, using these models, will be insecure. Often, only 1% of data poisoning is needed to create a backdoor".

AEGIS is able to detect backdoor attacks with an average detection rate of 93.7%

The difficulty with backdoor attacks is that only the attacker knows the pattern of poisoning. Users cannot search for this poison pattern to determine whether their ML model has been infected, because they do not know what to look for.
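To make the attack concrete, here is a minimal NumPy sketch of one common poisoning strategy: stamping a small trigger patch onto a fraction of the training images and relabelling them as an attacker-chosen target class. The patch shape, location, and target class below are illustrative assumptions, not the specific trigger studied in the paper.

```python
import numpy as np

def poison_dataset(images, labels, target_class=0, poison_rate=0.01, seed=0):
    """Backdoor-poison a training set.

    images: float array of shape (N, H, W, C), values in [0, 1]
    labels: int array of shape (N,)
    """
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(len(images) * poison_rate)      # e.g. 1% of the data
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, -4:, -4:, :] = 1.0                 # white 4x4 trigger, bottom-right corner
    labels[idx] = target_class                     # relabel as the target class
    return images, labels
```

A model trained on such data behaves normally on clean inputs, but at inference time the attacker can stamp the same patch onto any image to force the target prediction.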

"The difficulty of the problem fascinated us. We speculated that the internals of a backdoor model might be different than a clean model," said Chattopadhyay.

To this end, Chattopadhyay investigated backdoor attacks on robust models and found them highly susceptible (67.8% attack success rate). He also found that poisoning a training set creates mixed input distributions for the poisoned class, causing the robust model to learn multiple feature representations for that prediction class. In contrast, a clean model learns only a single feature representation per prediction class.

As little as 1% data poisoning is needed to create a backdoor

Along with fellow researchers, Chattopadhyay used this fact to his advantage to develop AEGIS, the first backdoor detection technique for PGD-trained robust models. Using t-Distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction and Mean Shift Clustering for clustering, AEGIS detects backdoor attacks with an average detection rate of 93.7%.
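The following is a simplified scikit-learn sketch of that detection idea: extract the suspect model's internal feature representations per prediction class, embed them in two dimensions with t-SNE, and count the modes that Mean Shift finds. The feature-extraction step and the one-cluster threshold are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import MeanShift

def count_feature_clusters(features):
    """features: array of shape (n_samples, n_features) holding the
    suspect model's internal representations for ONE prediction class."""
    embedded = TSNE(n_components=2, random_state=0).fit_transform(features)
    labels = MeanShift().fit_predict(embedded)     # one cluster label per sample
    return len(np.unique(labels))

def flag_suspicious_classes(features_by_class):
    """A clean robust model tends to form a single feature cluster per
    class; a poisoned class mixes two input distributions, so its
    features split into multiple clusters."""
    return [cls for cls, feats in features_by_class.items()
            if count_feature_clusters(feats) > 1]
```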

AEGIS uses t-Distributed Stochastic Neighbor Embedding (t-SNE) as its dimensionality reduction technique and Mean Shift Clustering as its clustering method
