Author: Marco Jeffrey Pansa

Date: 2024-10-04

Tags: huggingface; transformers; nlp; python; ml

Resolving skill issues - The HUGGINGFACE SERIES P1

I have been meaning to start this HF series for a long time. I finally got started thanks to COVID and Merve's spicy take from a couple of days ago, which I captured below. I have seen this happen time and time again; it is not a new phenomenon. These days, people tend to throw LLMs at every problem they encounter without thinking critically about whether there might be a better or cheaper option. Something similar happened when deep learning got hot from 2017 onwards and people forgot there were still things like tree-based methods or simple linear regression. What is convenient about LLMs is obviously that they are versatile enough to cover many use cases, and you definitely get an initial speed boost because you only need to call an API: no training of local models, no nasty deployments, no MLOps, etc.

The problem is you give up so much control over the whole pipeline. You miss out on:

  • Having a fast & cheap locally trained model
  • Protecting your precious data
  • Having complete control / reproducibility (OAI APIs change over time)
  • Learning from your data in the fine-tuning process (usually the DS working on the task gets really valuable insights)
  • Using the trained model for downstream tasks like clustering, etc.

[Image: screenshot of Merve's post]

The goal of this series is to get you comfortable digging deep into the internals of training a state-of-the-art LM so that you can go out and solve real-world business problems. Do not be a wrapper-using soy dev; leverage the full power of open source :)

Let's get started with the "Hello World" of modern NLP: training a sentiment classifier.

In this series, we will begin by using only the Transformers library. As we progress, we'll introduce more low-level details. At some point, you'll want more control over your training pipeline, and knowing how to manipulate every step of the pipeline is crucial if you want to tackle more ambitious projects and get the most out of your models.

We will explore:

  1. How to train models from the Transformers hub using just plain PyTorch
  2. How to strike a good balance of abstraction using the PyTorch Lightning library
  3. Adding monitoring using MLflow
  4. Deploying a simple model using Docker and FastAPI

Stay tuned for further blog posts in this series!

Now, let's get started with our simple Hugging Face tutorial.

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ds = load_dataset("stanfordnlp/imdb")

We will use the very popular IMDB movie reviews dataset from Stanford NLP for fine-tuning. This dataset contains a single feature: the movie review in plain text. Each review is labeled either with a "0", indicating a negative sentiment, or a "1", indicating a positive sentiment.

ds
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
# let's make the datasets a bit smaller (I am training on a Mac)
train_ds = ds["train"].shuffle().select(range(2_000))
test_ds = ds["test"].shuffle().select(range(500))
train_ds[0]["text"][:500]
'Look, I loved the PROPER Anchorman film, but this was reaaaaallly bad. The kind of bad that makes you wish you could get that time back in your life, the kind of bad that makes you think "what on Earth were they thinking to film this in the first place", the kind of bad that makes you wish you\'d taken 50 more minutes when stepping into the kitchen to grab a snack during the film, the kind of bad that makes leprosy look fun, the kind of bad that makes you think you wish you rented a Pauly Shore f'
f'LABEL --> {train_ds[0]["label"]}'
'LABEL --> 0'

So now that we have our dataset loaded and understand what it contains, let's move forward and load both the model and the tokenizer.

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-uncased", num_labels=2)

Before we can use any of our data with the model, we first need to bring it into a representation the model can actually understand. That is the process of tokenization. Basically, we break the text down into smaller pieces. To be more specific, we use the tokenizer vocab to turn our text into a list of integers, where each integer is the index of an entry in the tokenizer vocab. This list of indexes is later used to retrieve the corresponding embedding vectors. I'll go into more depth in another blog post about tokenizers. The best way I have found to see what they do is to write some code and look at the outcomes. So let's do that:

Let's tokenize the word "football"

tokenizer("football")
{'input_ids': [101, 2374, 102], 'attention_mask': [1, 1, 1]}

We can see that we get back a list of 3 integers as our input_ids: [101, 2374, 102]

tokenizer.decode([101]),tokenizer.decode([2374]),tokenizer.decode([102])
('[CLS]', 'football', '[SEP]')

Now we can see that using the tokenizer on the single word "football" gives us back three tokens:

  1. [CLS] - this is a special token that gets placed at the beginning
  2. Token 2374 is a token representing the whole word "football"; in this case the word does not get split into more pieces
  3. [SEP] is a special token called the separator token. It can indicate a new sentence or the end of the sequence
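
To make the "index into the vocab, then look up an embedding" idea more tangible, here is a minimal sketch in plain Python. Everything in it is made up for illustration: toy_vocab, toy_embeddings, encode and embed are hypothetical names, and the ids and vector values have nothing to do with DistilBERT's real vocab or weights.

```python
# Toy vocabulary: token string -> integer id (hypothetical, NOT DistilBERT's real vocab).
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "football": 3, "rocks": 4}

# Toy embedding table: one row of floats per vocab entry (real models learn these values).
toy_embeddings = [
    [0.0, 0.0],  # [PAD]
    [0.1, 0.9],  # [CLS]
    [0.8, 0.2],  # [SEP]
    [0.5, 0.5],  # football
    [0.3, 0.7],  # rocks
]


def encode(words):
    """Wrap the words in [CLS] ... [SEP] and map every token to its vocab id."""
    tokens = ["[CLS]", *words, "[SEP]"]
    return [toy_vocab[token] for token in tokens]


def embed(input_ids):
    """Look up one embedding row per id, which is what an embedding layer does."""
    return [toy_embeddings[i] for i in input_ids]


input_ids = encode(["football"])
vectors = embed(input_ids)
print(input_ids)  # [1, 3, 2]
print(vectors)    # [[0.1, 0.9], [0.5, 0.5], [0.8, 0.2]]
```

The real tokenizer does the same thing at a much larger scale: a vocab with tens of thousands of entries and embedding vectors with hundreds of dimensions.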

Let's look at a few more examples to get some intuition. In general, the more common a word is, the more likely it is to be represented by fewer tokens.

tokenizer("transformers", add_special_tokens=False)
{'input_ids': [19081], 'attention_mask': [1]}
tokenizer("tranfsormers", add_special_tokens=False) # let's ignore special tokens for now
{'input_ids': [25283, 10343, 2953, 16862], 'attention_mask': [1, 1, 1, 1]}

As we can see, the word "transformers" can also be represented as a single token, but what happens if we make a small spelling mistake by accident? We get 4 tokens. But what are they?

tokenizer.decode([25283]),tokenizer.decode([10343]),tokenizer.decode([2953]),tokenizer.decode([16862])
('tran', '##fs', '##or', '##mers')

Essentially, since we made a typo, the tokenizer can no longer find the word in its vocab, so it falls back to a different set of word pieces to represent it. You can read more about this specific tokenization algorithm, called WordPiece, here.
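
The splitting behaviour can be sketched with a greedy longest-match-first loop. This is a simplified illustration of the idea and not the tokenizer's actual implementation: the wordpiece function and the tiny toy_vocab are my own assumptions, chosen so that the result mirrors the pieces we saw above.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocab, just large enough to reproduce the example.
toy_vocab = {"transformers", "trans", "tran", "##fs", "##or", "##mers", "##former", "##s"}

print(wordpiece("transformers", toy_vocab))  # ['transformers']
print(wordpiece("tranfsormers", toy_vocab))  # ['tran', '##fs', '##or', '##mers']
```

Because the loop always tries the longest candidate first, the correctly spelled word wins as a single token even though smaller pieces like "trans" and "##former" are also in the vocab.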

We can use the map function of the dataset instance to apply the tokenizer to both of our datasets. We use padding and truncation to make sure every sample ends up with exactly 512 tokens (the model's maximum sequence length); otherwise we cannot perform operations on batches later. Granted, this wastes compute, and for real training you would pad dynamically to the longest sequence in each batch. But this suffices for now.

train_ds_tok = train_ds.map(lambda samples: tokenizer(samples["text"], padding="max_length", truncation=True), batched=True)
test_ds_tok = test_ds.map(lambda samples: tokenizer(samples["text"], padding="max_length", truncation=True), batched=True)
from transformers import TrainingArguments, Trainer
import numpy as np
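
For comparison, here is a rough sketch of what dynamic padding does, written in plain Python so it runs without any dependencies. The pad_batch helper is hypothetical, pad_token_id=0 is an assumption about the pad id, and 2024 and 2307 are arbitrary placeholder ids. In a real pipeline you would not write this yourself, since Transformers ships a DataCollatorWithPadding class for this purpose.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence to the longest one in this batch and build attention masks."""
    longest = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        padded.append(ids + [pad_token_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": padded, "attention_mask": masks}


batch = pad_batch([[101, 2374, 102], [101, 19081, 2024, 2307, 102]])
print(batch["input_ids"])       # [[101, 2374, 102, 0, 0], [101, 19081, 2024, 2307, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The batch is only as wide as its longest member (5 tokens here) instead of always 512, which is where the compute savings come from.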

Alright, so let's talk a little about the next steps / components. Transformers provides a very high-level API for making training as simple as possible. There are two important pieces to this:

  1. The training arguments: the TrainingArguments instance holds all the important settings that configure the training run, such as output_dir (the directory where the artifacts will be saved), batch sizes for training and eval, learning_rate, weight_decay, logging, etc. Take a look here to learn more about TrainingArguments.

  2. The trainer: the Trainer instance glues all the pieces together and provides a method for starting the actual training. It ties together the model, the tokenized datasets and the training arguments.

Optionally, we can define a compute_metrics function. With the settings used here, the metrics are evaluated on the eval dataset after every epoch. Now I say it's optional, and it is, but without an actual metric there is no way to tell how good the model really is. By default, we only get train and eval losses from the trainer. You could also compute additional metrics like precision / recall / F1 and add them to the returned dict; they will then show up in the training metrics table further down. Try it out!

def compute_accuracy(p):
    # p is an EvalPrediction object
    labels, predictions = p.label_ids, p.predictions
    acc = (np.argmax(predictions, axis=-1) == labels).mean()
    return {"acc": np.round(acc, 4)}
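
If you want precision, recall and F1 next to accuracy, the math is small enough to write by hand. The following is a minimal sketch in plain Python; precision_recall_f1 is a hypothetical helper of mine and not part of Transformers, and the labels and predictions in the example are invented.

```python
def precision_recall_f1(labels, predictions, positive=1):
    """Binary precision, recall and F1 for the given positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)  # true positives
    fp = sum(1 for y, p in pairs if y != positive and p == positive)  # false positives
    fn = sum(1 for y, p in pairs if y == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented example: 2 true positives, 1 false positive, 1 false negative.
metrics = precision_recall_f1(labels=[1, 0, 1, 1, 0], predictions=[1, 0, 0, 1, 1])
print(metrics)
```

Inside a compute_metrics function you could pass p.label_ids and the argmax of p.predictions to this helper and merge the result into the returned dict.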
training_arguments = TrainingArguments(
    output_dir="imdb",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=1e-5,
    weight_decay=1e-3,
    eval_strategy="epoch",
    logging_strategy="epoch",
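    use_mps_device=True,  # train on the Apple Silicon GPU; newer Transformers versions deprecate this flag and pick MPS automatically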
)

trainer = Trainer(
    model=model,
    train_dataset=train_ds_tok,
    eval_dataset=test_ds_tok,
    tokenizer=tokenizer,
    args=training_arguments,
    compute_metrics=compute_accuracy
)

Notice that we are only training on 2,000 examples and evaluating on 500, since this is a tutorial example. I've seen people reach around 92 ± 1% accuracy when training on all the data. At roughly 88%, we are not that far off with only these few samples.

trainer.train()
[315/315 21:18, Epoch 5/5]
Epoch  Training Loss  Validation Loss  Acc
1      0.673300       0.610215         0.692000
2      0.412500       0.331831         0.876000
3      0.235700       0.347854         0.858000
4      0.186000       0.357295         0.860000
5      0.162900       0.331261         0.878000

TrainOutput(global_step=315, training_loss=0.3340808505103702, metrics={'train_runtime': 1280.3092, 'train_samples_per_second': 7.811, 'train_steps_per_second': 0.246, 'total_flos': 1324673986560000.0, 'train_loss': 0.3340808505103702, 'epoch': 5.0})
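
The numbers in this output can be sanity-checked with a bit of arithmetic. The sketch below assumes a single device, no gradient accumulation and that the last partial batch of each epoch is kept; under those assumptions it reproduces the 315 steps and the throughput figures reported above.

```python
import math

num_train_samples = 2_000  # size of train_ds
batch_size = 32            # per_device_train_batch_size
num_epochs = 5             # num_train_epochs

# Assumption: one device, no gradient accumulation, last partial batch is kept.
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

train_runtime = 1280.3092  # seconds, copied from the TrainOutput above
samples_per_second = num_train_samples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)                              # 63 315
print(round(samples_per_second, 3), round(steps_per_second, 3))  # 7.811 0.246
```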

Now that we have finished training, we can run some examples. I let claude.ai write me a few reviews:

star_wars_review_pos = """Attack of the Clones delivers an exhilarating chapter in the Star Wars saga, featuring stunning visual effects 
and epic battles that push the boundaries of cinematic spectacle. The film expertly expands the Star Wars universe,
introducing compelling new characters and planets while deepening the mythology. Ewan McGregor shines as Obi-Wan Kenobi,
bringing charm and gravitas to his role as he investigates a mysterious plot against the Republic. The blossoming romance
between Anakin and Padmé adds a touching emotional core to the grand-scale adventure, setting the stage for the tragic events to come.
"""

star_wars_review_neg = """Star Wars: The Force Awakens is a soulless, uninspired rehash that shamelessly plunders the original trilogy
for nostalgia while offering nothing new or meaningful. The plot is a lazy carbon copy of "A New Hope," complete with yet another Death
Star knockoff, leaving long-time fans feeling cheated and newcomers bewildered. The new characters are flat and uninteresting, with Rey
emerging as an insufferable Mary Sue who masters complex Force abilities without any training. The film's pacing is a mess, lurching from
one contrived action sequence to another, while butchering the legacy of beloved characters like Han Solo. This cynical cash-grab proves
that Disney's acquisition of Lucasfilm was the true disturbance in the Force, effectively killing the magic that once made Star Wars
special."""
tokenized_review = tokenizer(star_wars_review_pos, return_tensors="pt")
tokenized_review.keys()
dict_keys(['input_ids', 'attention_mask'])

We first need to move the tensors over to the GPU, since the model is still sitting on the GPU and both need to live on the same device in order to perform the calculations. Then we feed them through the model.

tokenized_review = {k:v.to("mps") for k,v in tokenized_review.items()}

Using the torch.no_grad() context manager is a good habit to keep, even in small examples like this one. That way you don't forget it in other settings where it might be more costly. It makes sure that this forward pass for inference does not track the extra information needed to compute gradients during training, thus saving precious compute and memory.

model.eval()  # make sure dropout is disabled for inference
with torch.no_grad():
    pred = model(**tokenized_review)  # call the module itself rather than .forward()
pred
SequenceClassifierOutput(loss=None, logits=tensor([[-1.6137,  1.7726]], device='mps:0'), hidden_states=None, attentions=None)
np.argmax(pred.logits.cpu().numpy(), axis=-1).item()
1

Cool! The model successfully identified the review as being positive!!
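
The argmax only tells us which class won. To see how confident the model is, the logits can be pushed through a softmax. Here is a small sketch in plain Python using the two logit values from the output above; the softmax helper is written by hand for illustration, and in PyTorch you could call torch.softmax(pred.logits, dim=-1) instead.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.6137, 1.7726]  # values copied from the model output above
probs = softmax(logits)
print(probs)  # index 0 = negative, index 1 = positive
```

For these logits the positive class ends up at roughly 97%, so the model is quite sure about this review.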

Let's look at the other sample:

tokenized_review = tokenizer(star_wars_review_neg, return_tensors="pt")
tokenized_review.keys()
tokenized_review = {k:v.to("mps") for k,v in tokenized_review.items()}
with torch.no_grad():
    pred = model(**tokenized_review)
np.argmax(pred.logits.cpu().numpy(), axis=-1).item()
0

It got that right as well! The Force Awakens was indeed a miserable one ;)

That's it for this post. Thanks for making it this far and have a great day. You can follow me on X / Twitter to see when I post again :)