3.1. Poisoning Attacks

In this notebook, we experiment with adversarial poisoning attacks. Poisoning attacks are performed at training time: the attacker injects carefully crafted samples into the training set to alter the classifier's decision function and thereby change its behavior at test time. In particular, we analyze backdoor and label-flip attacks.

%%capture --no-stderr
try:
    import secmlt
except ImportError:
    %pip install secml-torch

3.1.1. Backdoor Poisoning Attacks

In backdoor poisoning, the attacker acts on a subset of the training data by:

  • adding a specific pattern (i.e., the trigger);

  • modifying the label to a target class, satisfying the attacker’s goal.

In this way, the model learns a correlation between the trigger and the chosen label. As a consequence, at test time, input samples containing the trigger are misclassified as the target class, while clean samples are still classified correctly.
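The two steps above can be sketched on toy data without any library. This is a minimal illustration, not the SecML-Torch implementation: the names add_trigger and poison_dataset are hypothetical, images are plain nested lists, and the trigger is a single bottom-right pixel.

```python
import random


def add_trigger(image):
    """Return a copy of a 2D image (list of rows) with the bottom-right pixel set to 1.0."""
    patched = [row[:] for row in image]
    patched[-1][-1] = 1.0
    return patched


def poison_dataset(samples, target_label, portion, seed=0):
    """Backdoor a random `portion` of (image, label) pairs.

    Returns the new list of samples and the set of poisoned positions.
    """
    rng = random.Random(seed)
    n_poison = int(len(samples) * portion)
    chosen = set(rng.sample(range(len(samples)), n_poison))
    poisoned = []
    for i, (image, label) in enumerate(samples):
        if i in chosen:
            # Step 1: add the trigger. Step 2: switch the label to the target class.
            poisoned.append((add_trigger(image), target_label))
        else:
            poisoned.append((image, label))
    return poisoned, chosen


# Ten all-black 4x4 toy images with labels 0, 1, 2, 0, 1, 2, ...
clean = [([[0.0] * 4 for _ in range(4)], i % 3) for i in range(10)]
poisoned, chosen = poison_dataset(clean, target_label=1, portion=0.2)
print(f"poisoned positions: {sorted(chosen)}")
```

Only the selected samples change; all others keep their original image and label, which is why accuracy on clean inputs can stay high.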

We will implement a simple patch-based attack against a small neural network trained on MNIST. We first load the CNN used in the paper “BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain” (Gu et al., 2019).

import torch
# you can also try device = "cuda"
device = "cpu"

# Load a model
net = torch.hub.load("maurapintor/mnist_examples", "mnist_model")
net.to(device);

We then load the MNIST training dataset from the TorchVision datasets hub.

%%capture
import torchvision
from torch.utils.data import Subset

dataset_path = "data/datasets/"
N_TRAIN = 10000

training_dataset = torchvision.datasets.MNIST(
    transform=torchvision.transforms.ToTensor(),
    train=True,
    root=dataset_path,
    download=True,
)
training_dataset = Subset(training_dataset, range(N_TRAIN))

We now need a function to inject the backdoor. In particular, we will apply a white 4x4 patch in the bottom-right corner of the images.

def apply_patch(x: torch.Tensor) -> torch.Tensor:
    x[:, 0, 24:28, 24:28] = 1.0
    return x
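Note that apply_patch writes into its argument in place. As a quick sanity check, and as a sketch of a non-mutating alternative, the hypothetical variant apply_patch_copy below (not part of the notebook or of SecML-Torch) clones the batch before patching. It assumes inputs of shape (N, 1, 28, 28) with pixel values in [0, 1].

```python
import torch


def apply_patch_copy(x: torch.Tensor) -> torch.Tensor:
    """Return a patched copy of a batch, leaving the input untouched (illustrative variant)."""
    x = x.clone()
    # White 4x4 square in the bottom-right corner of channel 0.
    x[:, 0, 24:28, 24:28] = 1.0
    return x


batch = torch.zeros(2, 1, 28, 28)  # two all-black toy images
patched = apply_patch_copy(batch)

# Each patched image gains exactly 4 * 4 = 16 white pixels.
print(patched.sum(dim=(1, 2, 3)))
# The original batch is unchanged.
print(batch.sum())
```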

We can now create our poisoned dataset loader.

The BackdoorDatasetPyTorch class of SecML-Torch is a custom PyTorch dataset that applies the backdoor through a user-provided function to a given portion of the training data, and switches the labels of those samples to the desired target label. We set the portion of manipulated training samples to 20%.

The backdoored dataset can thus be wrapped into a PyTorch DataLoader, and used for training the neural network.

from secmlt.adv.poisoning.backdoor import BackdoorDatasetPyTorch
from torch.utils.data import DataLoader


target_label = 1
backdoored_mnist = BackdoorDatasetPyTorch(
    training_dataset,
    data_manipulation_func=apply_patch,
    trigger_label=target_label,
    portion=0.2,
)
training_data_loader = DataLoader(backdoored_mnist, batch_size=64, shuffle=False)

Let’s visualize some of the training images!

import matplotlib.pyplot as plt
from torchvision.utils import make_grid


x_target = []
for x, y in training_data_loader:
    x_target.append(x[y == target_label])
x_target = torch.cat(x_target, dim=0)

grid = make_grid(x_target[:100], nrow=10, normalize=True, pad_value=1.0)
plt.figure(figsize=(10, 10))
plt.imshow(grid.permute(1, 2, 0).cpu())
plt.axis("off")
plt.show()
[Figure: 10x10 grid of the first 100 training images carrying the target label.]

To train the model, we use the SecML-Torch utilities, which require instantiating:

  • a PyTorch optimizer, passing the network parameters;

  • a BasePyTorchTrainer, passing the optimizer and the number of epochs;

  • a BasePyTorchClassifier, wrapping the network and the trainer.

Finally, we only need to call the BasePyTorchClassifier.train method, providing the training data loader.

from torch.optim import Adam
from secmlt.models.pytorch.base_pytorch_nn import BasePyTorchClassifier
from secmlt.models.pytorch.base_pytorch_trainer import BasePyTorchTrainer


optimizer = Adam(lr=1e-3, params=net.parameters())
trainer = BasePyTorchTrainer(optimizer, epochs=1)
model = BasePyTorchClassifier(net, trainer=trainer)
model.train(training_data_loader);

We can now load the MNIST test data and wrap it into a PyTorch DataLoader.

%%capture
N_TEST = 1000

test_dataset = torchvision.datasets.MNIST(
    transform=torchvision.transforms.ToTensor(),
    train=False,
    root=dataset_path,
    download=True,
)
test_dataset = Subset(test_dataset, range(N_TEST))
test_data_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

First, we test the accuracy on the clean test data.

from secmlt.metrics.classification import Accuracy


accuracy = Accuracy()(model, test_data_loader)
print("test accuracy: ", accuracy)
test accuracy:  tensor(0.6330)

We then wrap the test dataset in BackdoorDatasetPyTorch, applying the trigger to all samples, and build a new PyTorch DataLoader to evaluate the model on the backdoored test set.

from secmlt.metrics.classification import AttackSuccessRate


backdoored_test_set = BackdoorDatasetPyTorch(
    test_dataset, data_manipulation_func=apply_patch
)
backdoored_loader = DataLoader(backdoored_test_set, batch_size=64, shuffle=False)

We compute the attack success rate as the fraction of backdoored samples that the model classifies as the target class.

asr = AttackSuccessRate(y_target=target_label)(model, backdoored_loader)
print(f"asr: {asr}")
asr: 0.4300000071525574
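As a reference for what this metric measures, the definition can be written in a few lines of plain Python. This is an illustrative sketch with a hypothetical function name, not the library's AttackSuccessRate implementation; it takes a list of predicted labels for the triggered inputs.

```python
def attack_success_rate(predictions, target_label):
    """Fraction of predictions that equal the attacker's target label."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    hits = sum(1 for p in predictions if p == target_label)
    return hits / len(predictions)


# Toy predictions on four triggered samples: two are classified as the target class 1.
print(attack_success_rate([1, 1, 0, 7], target_label=1))  # 0.5
```

The true labels do not appear in this computation: for a targeted attack, only the predicted class on triggered inputs matters.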

3.1.2. Label-flip Poisoning Attacks

In label-flip poisoning, the attacker can only modify the labels of a training data subset to induce misclassification at test time. This can be done either randomly or based on heuristics.

We will implement the random approach, with the goal of having 0 digits classified as 1 and vice versa.

We first need a label manipulation function.

def flip_label(label):
    if label == 0:
        return 1
    elif label == 1:
        return 0
    return label
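A short check of this function on a toy list of labels (the list is made up for illustration) shows that only classes 0 and 1 are swapped and that applying the flip twice restores the original labels.

```python
def flip_label(label):
    """Swap classes 0 and 1; leave every other class unchanged."""
    if label == 0:
        return 1
    elif label == 1:
        return 0
    return label


labels = [0, 1, 2, 0, 9]
flipped = [flip_label(y) for y in labels]
print(flipped)  # [1, 0, 2, 1, 9]

# Flipping twice gives back the original labels.
restored = [flip_label(y) for y in flipped]
print(restored == labels)  # True
```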

We can now create our poisoned dataset loader.

The PoisoningDatasetPyTorch class of SecML-Torch is a custom PyTorch dataset that can apply a custom manipulation function to the inputs, to their labels, or to both.

We flip the labels of 50% of the samples in each of classes 0 and 1.

The poisoned dataset can thus be wrapped into a PyTorch DataLoader, and used for training the neural network.

import random
from secmlt.adv.poisoning.base_data_poisoning import PoisoningDatasetPyTorch


targets = [training_dataset.dataset.targets[i] for i in training_dataset.indices]
class_0_idxs = [i for i, y in enumerate(targets) if y == 0]
class_1_idxs = [i for i, y in enumerate(targets) if y == 1]
poisoned_indexes = random.sample(class_0_idxs, len(class_0_idxs) // 2) + random.sample(
    class_1_idxs, len(class_1_idxs) // 2)

poisoned_mnist = PoisoningDatasetPyTorch(
    training_dataset,
    label_manipulation_func=flip_label,
    poisoned_indexes=poisoned_indexes,
)
poisoned_data_loader = DataLoader(poisoned_mnist, batch_size=64, shuffle=False)
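Because random.sample draws from the global random state, the set of poisoned samples changes between runs. A seeded variant of the index selection can be sketched as follows; sample_poisoned_indexes is a hypothetical helper, and the toy target list stands in for the MNIST labels.

```python
import random


def sample_poisoned_indexes(targets, classes=(0, 1), fraction=0.5, seed=0):
    """Pick a reproducible `fraction` of the positions of each class in `classes`."""
    rng = random.Random(seed)  # local generator, independent of the global state
    poisoned = []
    for c in classes:
        idxs = [i for i, y in enumerate(targets) if y == c]
        poisoned += rng.sample(idxs, int(len(idxs) * fraction))
    return sorted(poisoned)


toy_targets = [0, 1, 0, 2, 1, 0, 1, 0, 3, 1]
selected = sample_poisoned_indexes(toy_targets)
print(selected)
```

With four samples in each of classes 0 and 1, the helper selects two positions per class, and the same seed always yields the same selection.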

As we want to separately inspect the model accuracy for each class, we write a function that creates a PyTorch DataLoader returning only the samples of a selected class.

from torch.utils.data import SubsetRandomSampler


def get_class_loader(label):
    test_targets = [test_dataset.dataset.targets[i] for i in test_dataset.indices]
    idxs = [i for i, y in enumerate(test_targets) if y == label]
    test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False, sampler=SubsetRandomSampler(idxs))
    return test_loader
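The per-class loaders are used to compute one accuracy value per digit. The quantity itself can be sketched directly from lists of true labels and predictions; per_class_accuracy is a hypothetical helper for illustration, not a SecML-Torch function.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Map each class in y_true to the fraction of its samples predicted correctly."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in sorted(total)}


# Toy example: class 0 is mostly confused with class 1, class 1 is always correct.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [1, 0, 1, 1, 1, 0]
class_acc = per_class_accuracy(y_true, y_pred)
print(class_acc)
```

Looking at classes separately matters here because a label-flip attack on two classes may leave the overall accuracy almost unchanged while the targeted classes degrade sharply.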

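We now train a fresh model on each of two training loaders and evaluate it on the test dataset. Note that the “normal” entry below reuses training_data_loader from Section 3.1.1, which wraps the backdoored dataset rather than clean data; for a clean baseline, build a DataLoader directly over training_dataset instead.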

We print the total accuracy, the class accuracies, and the attack success rate for classes 0 and 1, i.e., the ratio of class 0 samples misclassified as 1 and vice versa.

for k, data_loader in {
    "normal": training_data_loader,
    "poisoned": poisoned_data_loader,
}.items():
    net = torch.hub.load("maurapintor/mnist_examples", "mnist_model")
    net.to(device)
    optimizer = Adam(lr=1e-3, params=net.parameters())
    trainer = BasePyTorchTrainer(optimizer, epochs=1)
    model = BasePyTorchClassifier(net, trainer=trainer)
    model.train(data_loader)
    accuracy = Accuracy()(model, test_data_loader)
    print(f"test accuracy on {k} data: {accuracy.item():.3f}")
    for i in range(10):
        class_test_loader = get_class_loader(i)
        accuracy = Accuracy()(model, class_test_loader)
        print(f"    test accuracy (class {i} only): {accuracy.item():.3f}")
        if i in (0, 1):
            asr = AttackSuccessRate(y_target=0 if i == 1 else 1)(model, class_test_loader)
            print(f"        asr on class {i}: {asr}")
test accuracy on normal data: 0.724
    test accuracy (class 0 only): 0.871
        asr on class 0: 0.0941176488995552
    test accuracy (class 1 only): 1.000
        asr on class 1: 0.0
    test accuracy (class 2 only): 0.767
    test accuracy (class 3 only): 0.374
    test accuracy (class 4 only): 0.600
    test accuracy (class 5 only): 0.736
    test accuracy (class 6 only): 0.897
    test accuracy (class 7 only): 0.596
    test accuracy (class 8 only): 0.764
    test accuracy (class 9 only): 0.660
test accuracy on poisoned data: 0.816
    test accuracy (class 0 only): 0.035
        asr on class 0: 0.8941176533699036
    test accuracy (class 1 only): 0.873
        asr on class 1: 0.0793650820851326
    test accuracy (class 2 only): 0.879
    test accuracy (class 3 only): 0.860
    test accuracy (class 4 only): 0.727
    test accuracy (class 5 only): 0.954
    test accuracy (class 6 only): 0.885
    test accuracy (class 7 only): 0.939
    test accuracy (class 8 only): 0.876
    test accuracy (class 9 only): 0.957