How To Train AI Classification Models

A Text Classification model is an AI model trained to categorize text into predefined categories or classes. It learns from a labeled dataset in which each text sample is associated with a specific category, then analyzes the patterns, features, and relationships in the text to classify new or unseen samples. Text Classification models power applications such as sentiment analysis, spam detection, topic classification, and intent recognition. They are built with machine learning algorithms such as Naive Bayes and Support Vector Machines (SVM), or with deep learning architectures such as Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN). In this article, we will train a Text Classification model to identify toxic text samples.

Create a Python Virtual Environment

A Python virtual environment is an isolated environment where you can install Python packages without affecting your global Python setup. It's a way to manage dependencies on a per-project basis. This is how we will set up our project:

  • Create a directory to store our files. On a Debian-based system we will run:
mkdir AI\ Classifications && cd AI\ Classifications
  • Create a virtual environment:
python3 -m venv .env
  • Activate your virtual environment: (If you close your terminal, you'll have to re-run this command)
source .env/bin/activate

Install the transformers library

Note that the following applies only if you have an NVIDIA GPU. PyTorch and HF Transformers use CUDA to interface with the CUDA cores on your GPU. A CUDA core is much more lightweight than a CPU core: it handles simpler arithmetic, but because a GPU has thousands of these cores, calculations can run in parallel and finish much faster.

#HF Transformers
pip install git+https://github.com/huggingface/transformers
#PyTorch
pip install torch torchvision torchaudio

# Other libraries for machine learning and data parsing
pip install pandas datasets scikit-learn accelerate

If you do not have an NVIDIA GPU, you can use your CPU. However, I would not recommend this: a mid-range NVIDIA GPU such as the GeForce RTX 4070 Ti has 7,680 CUDA cores, while a flagship consumer CPU such as the AMD Ryzen 9 7950X tops out at 16 cores.

#HF Transformers
pip install git+https://github.com/huggingface/transformers
#PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# Other libraries for machine learning and data parsing
pip install pandas datasets scikit-learn accelerate
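
Regardless of which build you installed, you can quickly confirm whether PyTorch can actually see a GPU. This is just a minimal sanity check using PyTorch's standard torch.cuda.is_available() call:

import torch

# Prints True if PyTorch was built with CUDA support and a compatible GPU is visible
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))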

Now we can import the libraries at the top of a file named train.py:

import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, AutoTokenizer

Download a dataset

As an example, we will be using the surge-ai/toxicity dataset from Surge AI's GitHub.

wget https://github.com/surge-ai/toxicity/raw/main/toxicity_en.csv

This is a fairly small dataset with two columns: text and is_toxic. The text column contains an arbitrary string, such as me thinking in my head: mmm pizzaaaa…, and the is_toxic column contains a classification of either Not Toxic or Toxic.
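
Before writing any training code, it can help to peek at the file with pandas to confirm the column layout and label balance. This is purely an optional check:

import pandas as pd

# Load the CSV and inspect the first few rows and the label distribution
df = pd.read_csv('toxicity_en.csv')
print(df.head())
print(df['is_toxic'].value_counts())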

Start with a pre-trained model

For this task, we will use the distilbert-base-uncased model as a starting point instead of starting from scratch. distilbert-base-uncased is a pre-trained model that has been trained on a large amount of text data and has learned to encode the meaning and context of words. Using this pre-trained model, we can leverage its existing knowledge and fine-tune it for our specific text classification task, saving us time and resources.

#Define the distilbert-base-uncased tokenizer to tokenize our dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

#Map expected ids to labels
id = {0: "Not Toxic", 1: "Toxic"}
label = {"Not Toxic": 0, "Toxic": 1}

#Define pre-trained model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id, label2id=label
)

In this code, we used the AutoTokenizer class from the HF transformers library to initialize a tokenizer from the pre-trained DistilBERT model. After initializing the tokenizer, we wrote a function that takes a batch of text data and tokenizes it with that tokenizer. Next, we defined two dictionaries, id and label, which will later be used to map the labels Not Toxic and Toxic to the integers 0 and 1. Finally, we loaded our pre-trained model with AutoModelForSequenceClassification from the HF transformers library.
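
If you are curious what the tokenizer actually produces, here is a small illustrative check (the sample string is arbitrary). Each text is converted to a fixed-length list of token ids, padded up to DistilBERT's maximum input length:

# Tokenize a single string and inspect the first few token ids and tokens
sample = tokenizer("mmm pizzaaaa", padding='max_length', truncation=True)
print(sample['input_ids'][:10])
print(tokenizer.convert_ids_to_tokens(sample['input_ids'][:10]))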

Import the CSV as a HuggingFace Dataset

To train our model, we must give it our data in a format it can understand. In this case, we will be using the datasets, pandas, and scikit-learn libraries to format our data.

# Load the CSV file
df = pd.read_csv('toxicity_en.csv')

#Map to numerical labels (0 and 1)
df['is_toxic'] = df['is_toxic'].map(label)

# Split the dataset into training and validation sets
train_df, valid_df = train_test_split(df, test_size=0.2, random_state=42)

# Convert the data into HF Dataset format
train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
valid_dataset = Dataset.from_pandas(valid_df.reset_index(drop=True))

# Apply the function to the training and validation datasets
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
valid_dataset = valid_dataset.map(tokenize, batched=True, batch_size=len(valid_dataset))

# Rename the label column to "labels", the column name the Trainer expects
train_dataset = train_dataset.rename_column("is_toxic", "labels")
valid_dataset = valid_dataset.rename_column("is_toxic", "labels")

In this code, we loaded toxicity_en.csv into a pandas DataFrame and mapped the is_toxic column to numerical labels. To train our model, we need to split the dataset into a training set and a validation set; here, 20% of the data is held out for validation. Next, we used our previously defined tokenize function to convert the text into DistilBERT's tokens. Finally, we renamed the is_toxic column to labels, the name the Trainer expects.
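
As an optional sanity check, you can confirm the 80/20 split sizes and that the labels are now integers before moving on:

# Confirm the split sizes and that is_toxic was mapped to 0/1
print(len(train_dataset), len(valid_dataset))
print(train_dataset[0]['labels'])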

Define the training arguments

#Define the training arguments
training_args = TrainingArguments(
    output_dir='./my_classification_model',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    remove_unused_columns=False,
    save_strategy='epoch',
)

This is where the code gets somewhat complex, but stick with me here. In this code, we defined the training_args variable, an object that stores the various arguments for the training process. Let's break down each parameter in simple terms:

  • output_dir: Specifies the directory where the training results will be stored, such as the trained model and evaluation metrics.
  • num_train_epochs: Determines the number of times the model will iterate over the entire training dataset during training. Each iteration is called an epoch.
  • per_device_train_batch_size: Specifies the number of training examples (texts) that will be processed together in each iteration. In this case, 16 texts will be processed simultaneously.
  • per_device_eval_batch_size: This is similar to the previous parameter, but it determines the batch size for evaluation, where the model's performance is assessed on a separate validation dataset.
  • warmup_steps: Refers to the number of initial steps in the training process where the learning rate gradually increases. This helps the model to converge more smoothly.
  • weight_decay: Controls the strength of weight decay, which is a regularization technique that prevents the model from overfitting the training data.
  • logging_dir: Specifies the directory where the training logs will be saved, including information about the model's performance during training.
  • remove_unused_columns: Controls whether or not to remove any unused columns from the training and evaluation datasets.
  • save_strategy: The strategy for saving the trained model during training.

Define and initialize the trainer

#Define the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset
)

In this code, we defined the trainer variable, which is an object that takes the following parameters:

  • model: Model we imported earlier that will be trained.
  • args: The training_args object we defined earlier.
  • train_dataset: Contains the labeled text samples to be used for training the model.
  • eval_dataset: Contains separate labeled text samples to evaluate the model's performance during training.

Run the trainer

Add trainer.train() and trainer.save_model('./my_classification_model') to your code, then run it with python3 train.py. Depending on your GPU/CPU performance, this could take a while.
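
The end of train.py should therefore look something like this (the trainer.evaluate() call is optional, but it reports metrics on the validation set once training finishes):

# Fine-tune the model on the training set
trainer.train()

# Optional: report metrics on the validation set after training
print(trainer.evaluate())

# Save the fine-tuned model so test.py can load it later
trainer.save_model('./my_classification_model')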

Test your model

Create a new file called test.py. In this Python file, we will use AutoModelForSequenceClassification and AutoTokenizer to load the model we just trained.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the trained model
model = AutoModelForSequenceClassification.from_pretrained('./my_classification_model')

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Prompt the user for the text to classify
text = input("String for classification: ")

# Tokenize the text and convert to input format expected by the model
inputs = tokenizer(text, return_tensors='pt')

# Get model's prediction
outputs = model(**inputs)

# The model returns an output object; get the logits from it.
logits = outputs.logits

# Convert logits to probabilities
probabilities = logits.softmax(dim=-1)

# Get the predicted label
predicted_label = probabilities.argmax(dim=-1).item()

# Map the predicted label to its string representation
id2label = {0: "Not Toxic", 1: "Toxic"}
predicted_label_str = id2label[predicted_label]

print(f"The text is predicted as: {predicted_label_str}")

In this code, we loaded our trained model, tokenized the input text, ran the model to predict toxicity, and mapped the predicted label to its string representation.

Running the script should produce output similar to this:

(.env) ~/Desktop/AI Classifications$ python3 test.py 
String for classification: You're awesome
The text is predicted as: Not Toxic

Application

Text classification models like the one we built with DistilBERT can be used to keep toxic comments out of a database before they are stored. Penetration testers can also pair the model with a web scraper to gauge employees' public sentiment about an organization. Additionally, the same fine-tuning approach can be adapted to classify websites by security-relevant criteria, making it easier for penetration testers to flag potentially vulnerable sites.
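
As an illustration of the web-scraper idea, here is a minimal sketch that reuses the test.py logic to classify a batch of comments. The scraped_comments list is a hypothetical stand-in for whatever your scraper actually collects:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Hypothetical comments collected by a scraper
scraped_comments = [
    "mmm pizzaaaa",
    "You're awesome",
]

model = AutoModelForSequenceClassification.from_pretrained('./my_classification_model')
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
id2label = {0: "Not Toxic", 1: "Toxic"}

for comment in scraped_comments:
    inputs = tokenizer(comment, return_tensors='pt', truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    label = id2label[logits.argmax(dim=-1).item()]
    print(f"{label}: {comment}")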

If you liked this article, you won't want to miss my guide on self-hosting LLMs with Ollama. Ollama is a user-friendly way to download and run open-source AI models locally. Just click here to read it now, and I’ll see you there shortly. Cheers!