from transformers import AutoModelForSequenceClassification
from fastai.text.all import *
from fastai.callback.wandb import *

from fasthugs.learner import TransLearner
from fasthugs.data import TransformersTextBlock, TextGetter, get_splits

from datasets import load_dataset, concatenate_datasets

import wandb
import gc

Introduction

In this blog post we will look at how to combine the power of HuggingFace with the great flexibility of fastai. For this purpose we will finetune distilroberta-base on the General Language Understanding Evaluation (GLUE) benchmark. GLUE consists of eight diverse sequence classification tasks and one regression task.

I'll use fasthugs to make HuggingFace+fastai integration smooth.

Fun fact: the GLUE benchmark was introduced in this paper in 2018 as a tough-to-beat benchmark to challenge NLP systems, and in just about a year the new SuperGLUE benchmark was introduced because the original GLUE had become too easy for the models. To give you a grasp of what we are dealing with, here is a brief summary of the GLUE tasks:

| Name | Task | Description | Size | Metrics |
|------|------|-------------|------|---------|
| cola | Corpus of Linguistic Acceptability | Determine whether it is a grammatical sentence | 8.5k | matthews_corrcoef |
| sst2 | Stanford Sentiment Treebank | Predict the sentiment of a given sentence | 67k | accuracy |
| mrpc | Microsoft Research Paraphrase Corpus | Determine whether the sentences in the pair are semantically equivalent | 3.7k | f1/accuracy |
| stsb | Semantic Textual Similarity Benchmark | Determine similarity score for 2 sentences | 7k | pearsonr/spearmanr |
| qqp | Quora Question Pairs | Determine if 2 questions are the same (paraphrase) | 364k | f1/accuracy |
| mnli | Multi-Genre Natural Language Inference | Predict whether the premise entails, contradicts or is neutral to the hypothesis | 393k | accuracy |
| qnli | Stanford Question Answering Dataset | Determine whether the context sentence contains the answer to the question | 105k | accuracy |
| rte | Recognize Textual Entailment | Determine whether one sentence entails another | 2.5k | accuracy |
| wnli | Winograd Schema Challenge | Predict if the sentence with the pronoun substituted is entailed by the original sentence | 634 | accuracy |

As you can see, some of the datasets here are really small, and we'll look at how one can address that.

Setup

Let's define the main settings for the run in one place:

ds_name = 'glue'
model_name = "distilroberta-base"

max_len = 512
bs = 32
val_bs = bs*2

n_epoch = 4
lr = 2e-5
wd = 0.
opt_func = Adam
diff_lr_decay_factor = 0

To make switching between datasets smooth I define a couple of dictionaries containing per-task information. We need the metrics, the text fields to retrieve data and the number of outputs for the model.

GLUE_TASKS = ["cola", "mnli", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb", "wnli"]
def validate_task():
    assert task in GLUE_TASKS

glue_metrics = {
    'cola':[MatthewsCorrCoef()],
    'sst2':[accuracy],
    'mrpc':[F1Score(), accuracy],
    'stsb':[PearsonCorrCoef(), SpearmanCorrCoef()],
    'qqp': [F1Score(), accuracy],
    'mnli':[accuracy],
    'qnli':[accuracy],
    'rte': [accuracy],
    'wnli':[accuracy],
}

glue_textfields = {
    'cola':['sentence', None],
    'sst2':['sentence', None],
    'mrpc':['sentence1', 'sentence2'],
    'stsb':['sentence1', 'sentence2'],
    'qqp': ['question1', 'question2'],
    'mnli':['premise', 'hypothesis'],
    'qnli':['question', 'sentence'],
    'rte': ['sentence1', 'sentence2'],
    'wnli':['sentence1', 'sentence2'],
}

glue_num_labels = {'mnli':3, 'stsb':1}
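As a quick illustration of how the per-task lookup with a default works, here is a minimal sketch. The helper name `num_labels_for` is hypothetical and not part of the notebook; only the dictionary is copied from above. Note that the lookup key has to be the task variable itself: the literal string `'task'` is not in the dictionary and would always fall back to 2.

```python
# Copied from the post: only tasks that deviate from binary classification are listed.
glue_num_labels = {'mnli': 3, 'stsb': 1}

def num_labels_for(task):
    """Hypothetical helper: number of model outputs for a GLUE task (2 unless listed)."""
    return glue_num_labels.get(task, 2)

for name in ['sst2', 'mnli', 'stsb']:
    print(name, num_labels_for(name))

# Passing the literal string 'task' silently returns the default.
print(glue_num_labels.get('task', 2))
```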

def layerwise_splitter(model):
    emb = L(model.base_model.embeddings)
    layers = L(model.base_model.encoder.layer.children())
    clf = L(m for m in list(model.children())[1:] if params(m))
    groups = emb + layers + clf
    return groups.map(params)

Running a GLUE task

task = 'sst2'; validate_task()
ds = load_dataset(ds_name, task)
valid_ = 'validation_matched' if task=='mnli' else 'validation'
len(ds['train']), len(ds[valid_])
(67349, 872)
train_idx, valid_idx = get_splits(ds, valid=valid_)
train_ds = concatenate_datasets([ds['train'], ds[valid_]])
train_ds[0]
{'idx': 0,
 'label': 0,
 'sentence': 'hide new secretions from the parental units '}

Here I use the number of characters as a proxy for the length of the tokenized text to speed up dls creation.

lens = train_ds.map(lambda s: {'len': sum([len(s[i]) for i in glue_textfields[task] if i])},
                    remove_columns=train_ds.column_names, num_proc=2, keep_in_memory=True)
train_lens = lens.select(train_idx)['len']
valid_lens = lens.select(valid_idx)['len']
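The character-count proxy can be tried out on plain dictionaries without loading any dataset. The sketch below is only an illustration of the idea, not the fasthugs implementation: the helper `char_len` is hypothetical, the field table is a two-task excerpt of `glue_textfields`, and the second sample is made up.

```python
# Text fields per task, copied from the post for two tasks only.
glue_textfields = {
    'sst2': ['sentence', None],
    'rte': ['sentence1', 'sentence2'],
}

def char_len(sample, task):
    """Total number of characters over the task's text fields (None entries are skipped)."""
    return sum(len(sample[field]) for field in glue_textfields[task] if field)

# Toy RTE-style samples; the second one is invented for illustration.
samples = [
    {'sentence1': 'No Weapons of Mass Destruction Found in Iraq Yet.',
     'sentence2': 'Weapons of Mass Destruction Found in Iraq.'},
    {'sentence1': 'A flight crashed.', 'sentence2': 'A plane crashed.'},
]

lens = [char_len(s, 'rte') for s in samples]
# Sample indices ordered from longest to shortest text.
order = sorted(range(len(samples)), key=lambda i: lens[i], reverse=True)
print(lens, order)
```

Character counts only approximate token counts, but they are cheap to compute and usually good enough for grouping samples of similar length into a batch.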
blocks = [TransformersTextBlock(pretrained_model_name=model_name),
          RegressionBlock() if task=='stsb' else CategoryBlock()]
dblock = DataBlock(blocks = blocks,
                   get_x=TextGetter(*glue_textfields[task]),
                   get_y=ItemGetter('label'),
                   splitter=IndexSplitter(valid_idx))
dl_kwargs=[{'res':train_lens}, {'val_res':valid_lens}]
dls = dblock.dataloaders(train_ds, bs=bs, val_bs=val_bs, dl_kwargs=dl_kwargs)
dls.show_batch(max_n=4)
text category
0 ... spiced with humor ('i speak fluent flatula,'advises denlopp after a rather, er, bubbly exchange with an alien deckhand ) and witty updatings ( silver's parrot has been replaced with morph, a cute alien creature who mimics everyone and everything around ) 1
1 stopped thinking about how good it all was, and started doing nothing but reacting to it - feeling a part of its grand locations, thinking urgently as the protagonists struggled, feeling at the mercy of its inventiveness, gasping at its visual delights 1
2 there aren't too many films that can be as simultaneously funny, offbeat and heartwarming ( without a thick shmear of the goo, at least ), but `` elling '' manages to do all three quite well, making it one of the year's most enjoyable releases 1
3 hatfield and hicks make the oddest of couples, and in this sense the movie becomes a study of the gambles of the publishing world, offering a case study that exists apart from all the movie's political ramifications. 1

Single run

The GLUE benchmark contains nine tasks, and systematizing the results can be cumbersome. To make the analysis simpler and much more powerful I will be using the Weights & Biases tracking platform. Even better, thanks to Morgan McGuire (@morg) we have an open W&B project. You just need to log your runs under the glue-benchmark project and set entity="fastai_community", and your results will be added to the pool for further investigation of hyperparameters. The fastest way to start participating is to fork this notebook, as it is set up to run any of the GLUE tasks with minimal changes. There is a lot to try. Gradual unfreezing is reported not to be helpful when finetuning Transformer-based models (for example, see a discussion here). Differential learning rates are used in NLP [1, 2] but are not common practice. Do we need weight decay, and if so, how much and where? Which suggestions from the LR finder work best? These are only a few of the many open questions. An even more interesting one: how does all this scale with dataset and model size?

Deep learning is, as of now, a highly empirical field, and experiments require both some engineering and compute. This post aims to fuel a community effort towards finding empirical truth by joining small forces together. Even if you're new to NLP, do not hesitate to participate and run a couple of experiments while learning along the way!

WANDB_NAME = f'{ds_name}-{task}-{model_name}'
GROUP = f'{ds_name}-{task}-{model_name}-{lr:.0e}'
if diff_lr_decay_factor: GROUP += f"diff_lr_{diff_lr_decay_factor}"
NOTES = f'finetuning {model_name} with {opt_func.__name__} lr={lr:.0e}'
TAGS =[model_name, ds_name, opt_func.__name__]
wandb.init(reinit=True, project="glue-benchmark", entity="fastai_community",
           name=WANDB_NAME, group=GROUP, notes=NOTES, tags=TAGS);
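To see what the run and group names passed to W&B look like, the f-strings can be evaluated on their own. This is a small sketch using the settings from the post; the non-zero `diff_lr_decay_factor` of 0.9 is an assumed value chosen only to show the suffix.

```python
ds_name, task, model_name = 'glue', 'sst2', 'distilroberta-base'
lr = 2e-5
diff_lr_decay_factor = 0.9  # assumed value for illustration; the post's default is 0

run_name = f'{ds_name}-{task}-{model_name}'
# ':.0e' formats the learning rate in scientific notation without decimals.
base_group = f'{ds_name}-{task}-{model_name}-{lr:.0e}'
group = base_group
if diff_lr_decay_factor:
    # Same concatenation as in the post: the suffix is appended without a separator.
    group += f'diff_lr_{diff_lr_decay_factor}'

print(run_name)    # glue-sst2-distilroberta-base
print(base_group)  # glue-sst2-distilroberta-base-2e-05
print(group)       # glue-sst2-distilroberta-base-2e-05diff_lr_0.9
```

Grouping by task, model and learning rate means that repeated runs with the same settings land in the same W&B group, which makes it easier to average over seeds later.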
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=glue_num_labels.get(task, 2))
metrics = glue_metrics[task]
learn = TransLearner(dls, model, metrics=metrics, opt_func=opt_func, splitter=layerwise_splitter)
if diff_lr_decay_factor != 0:
    k = len(layerwise_splitter(model))
    lr = slice(lr*diff_lr_decay_factor**k,lr)

metric_to_monitor = metrics[0].name if isinstance(metrics[0], Metric) else metrics[0].__name__
cbs = [WandbCallback(log_preds=False, log_model=False),
       SaveModelCallback(monitor=metric_to_monitor, fname=f'{model_name}-{task}')]
learn.fit_one_cycle(4, lr, wd=wd, cbs=cbs)
Could not gather input dimensions
epoch train_loss valid_loss accuracy time
0 0.229984 0.264627 0.900229 02:02
1 0.157474 0.251536 0.912844 02:02
2 0.105107 0.252113 0.916284 02:03
3 0.070137 0.278783 0.925459 02:03
Better model found at epoch 0 with accuracy value: 0.9002293348312378.
Better model found at epoch 1 with accuracy value: 0.9128440618515015.
Better model found at epoch 2 with accuracy value: 0.9162843823432922.
Better model found at epoch 3 with accuracy value: 0.9254587292671204.

It's always useful to check your model predictions after training. fastai makes this very simple:

learn.show_results()

text category category_
0 the movie has an infectious exuberance that will engage anyone with a passing interest in the skate/surf culture, the l.a. beach scene and the imaginative ( and sometimes illegal ) ways kids can make a playground out of the refuse of adults. 1 1
1 what really makes it special is that it pulls us into its world, gives us a hero whose suffering and triumphs we can share, surrounds him with interesting characters and sends us out of the theater feeling we've shared a great adventure. 1 1
2 this is a train wreck of an action film -- a stupefying attempt by the filmmakers to force-feed james bond into the mindless xxx mold and throw 40 years of cinematic history down the toilet in favor of bright flashes and loud bangs. 0 0
3 it's one of those baseball pictures where the hero is stoic, the wife is patient, the kids are as cute as all get-out and the odds against success are long enough to intimidate, but short enough to make a dream seem possible. 1 1
4 though perry and hurley make inspiring efforts to breathe life into the disjointed, haphazard script by jay scherick and david ronn, neither the actors nor director reginald hudlin can make it more than fitfully entertaining. 0 1
5 may be far from the best of the series, but it's assured, wonderfully respectful of its past and thrilling enough to make it abundantly clear that this movie phenomenon has once again reinvented itself for a new generation. 1 1
6 despite all evidence to the contrary, this clunker has somehow managed to pose as an actual feature movie, the kind that charges full admission and gets hyped on tv and purports to amuse small children and ostensible adults. 0 0
7 it's inoffensive, cheerful, built to inspire the young people, set to an unending soundtrack of beach party pop numbers and aside from its remarkable camerawork and awesome scenery, it's about as exciting as a sunburn. 0 1
8 but the power of these ( subjects ) is obscured by the majority of the film that shows a stationary camera on a subject that could be mistaken for giving a public oration, rather than contributing to a film's narrative. 0 0

Sweeps

Finding the perfect learning rate for a task isn't easy. Add weight decay, different optimizers, differential learning rates and various schedulers to the mix, and the search for the best hyperparameters becomes a really big task. For that reason there exist automated tools for hyperparameter search. Here we'll look at the sweeps functionality provided by W&B. It not only facilitates hyperparameter tuning but also enables great visualization of the results, which might help with further analysis. Check out the documentation for more details.

def train():
    with wandb.init() as run:
        cfg = run.config
        model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=glue_num_labels.get(task, 2))
        metrics = glue_metrics[task]
        k = len(layerwise_splitter(model))
        # use differential learning rates only when a non-zero decay factor is set
        lr = slice(cfg.lr*cfg.diff_lr_decay_factor**k, cfg.lr) if cfg.diff_lr_decay_factor else cfg.lr
        learn = TransLearner(dls, model, metrics=metrics, opt_func=Adam, splitter=layerwise_splitter)
        learn.fit_one_cycle(n_epoch, lr, wd=cfg.wd, cbs=[WandbCallback(log_preds=False, log_model=False)])
        del learn
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

Here we'll do a grid search over combinations of learning rate, weight decay and differential learning rates. Differential learning rates are specified by a decay factor $\gamma$: the learning rate for layer $l$ is determined as $lr_0 \cdot \gamma^{L-l}$, where $L$ is the total number of layers.
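The formula can be made concrete with a few lines of plain Python. This is a sketch of the formula only: the helper `layerwise_lrs` is hypothetical, and the group count of 8 is an assumption (one embedding group, six transformer layers and one classifier head for distilroberta-base). The notebook itself passes only the two endpoints as a `slice` to fastai, so the exact per-group values fastai assigns may differ from this idealized schedule.

```python
def layerwise_lrs(base_lr, gamma, n_groups):
    """Learning rate per parameter group, from the first (embeddings) to the last (classifier).

    Group l (1-based) gets base_lr * gamma**(n_groups - l), so the last group
    trains at base_lr and every earlier group is scaled down by another factor gamma.
    """
    return [base_lr * gamma ** (n_groups - l) for l in range(1, n_groups + 1)]

# base_lr and gamma are taken from the values used in the post; n_groups=8 is an assumption.
lrs = layerwise_lrs(base_lr=2e-5, gamma=0.9, n_groups=8)
for i, lr_i in enumerate(lrs, start=1):
    print(f'group {i}: {lr_i:.2e}')
```

With a decay factor below 1 the layers closest to the input change the least, which reflects the intuition that low-level features transfer well and need less adjustment.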

metrics = glue_metrics[task]
metric_to_monitor = metrics[0].name if isinstance(metrics[0], Metric) else metrics[0].__name__
sweep_name = f"glue-{task}-sweep"
sweep_config = {
    "project":"glue-benchmark",
    "entity":"fastai_community",
    "name": sweep_name,
    "method": "grid",
    "parameters": {
        "lr": {"values":[1e-5,2e-5,3e-5,5e-5, 1e-4]},
        "wd": {"values":[0.,1e-2,5e-2]},
        "diff_lr_decay_factor":{"values":[0., 0.9, 0.8, 0.7, 0.6]}
    },
    "metric":{"goal": "maximize", "name": metric_to_monitor},
    "early_terminate": {"type": "hyperband", "s": 2, "eta": 3, "max_iter": 40}
}
sweep_id = wandb.sweep(sweep_config)
wandb.agent(sweep_id, function=train)
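Before launching a grid sweep it helps to know how many runs it implies. The sketch below enumerates the same parameter values with `itertools.product`; it does not call W&B at all, and the variable names are only illustrative.

```python
from itertools import product

# Same value lists as in the sweep configuration above.
parameters = {
    'lr': [1e-5, 2e-5, 3e-5, 5e-5, 1e-4],
    'wd': [0.0, 1e-2, 5e-2],
    'diff_lr_decay_factor': [0.0, 0.9, 0.8, 0.7, 0.6],
}

# One dict per hyperparameter combination, in the order product() yields them.
grid = [dict(zip(parameters, combo)) for combo in product(*parameters.values())]

print(len(grid))  # 75 combinations: 5 learning rates x 3 weight decays x 5 decay factors
print(grid[0])
```

Each combination is a full finetuning run, so the size of the grid translates directly into compute; early termination can cut some runs short.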

As a result we get a nice chart which helps to relate hyperparameter combinations to model performance.

[Figure: W&B sweeps chart]

The sweep can be explored interactively at this link: https://wandb.ai/fastai_community/glue-benchmark/sweeps/hc8ytty4.

Another task example: MNLI

The MNLI task is interesting for a couple of reasons. It has the largest training set in the benchmark, so the results of training on MNLI might be useful for smaller tasks, as we will see in a later section. Unlike most of the GLUE tasks, which are formulated as binary classification problems, this one has three categories: entailment, neutral and contradiction. One can argue that solving this kind of problem should involve more "understanding" of the meaning of the text.

task = 'mnli'; validate_task()
ds = load_dataset(ds_name, task)
train_idx, valid_idx = get_splits(ds, valid='validation_matched')
train_ds = concatenate_datasets([ds['train'], ds['validation_matched']])

Each sample contains a premise and a hypothesis; the task is to determine whether the premise entails, contradicts or is neutral to the hypothesis. Let's check out an example:

train_ds[0]
{'hypothesis': 'Product and geography are what make cream skimming work. ',
 'idx': 0,
 'label': 1,
 'premise': 'Conceptually cream skimming has two basic dimensions - product and geography.'}

The data preparation and dataloader construction do not differ much from those for the previous task:

lens = train_ds.map(lambda s: {'len': len(s['premise'])+len(s['hypothesis'])}, remove_columns=train_ds.column_names, num_proc=4, keep_in_memory=True)
train_lens = lens.select(train_idx)['len']
valid_lens = lens.select(valid_idx)['len']

blocks = [TransformersTextBlock(pretrained_model_name=model_name),
          RegressionBlock() if task=='stsb' else CategoryBlock()]
dblock = DataBlock(blocks = blocks,
                   get_x=TextGetter(*glue_textfields[task]),
                   get_y=ItemGetter('label'),
                   splitter=IndexSplitter(valid_idx))
dl_kwargs=[{'res':train_lens}, {'val_res':valid_lens}]
dls = dblock.dataloaders(train_ds, bs=bs, val_bs=val_bs, dl_kwargs=dl_kwargs, num_workers=4)
dls.show_batch(max_n=4)
text text_ category
0 well uh that's kind of obvious i mean they're even carrying it to to where now uh that they advertise on TV you know if your if you uh you know have done this or if you need this uh uh we'll sue for you and you don't have to pay us unless you but then what they don't tell you is that if you if they win you give them at least a third of the of the thing that they win so i don't know it is uh it's getting to be more business now rather than uh actually uh dealing with the crime than with uh um the uh punishment they the the lawyers are just in it for the money i'm i'm convinced i know i i agree with you i think you're real you're very right that the politicians should i think they I think that there should be an equal representation of backgrounds in our politicians. 0
1 um-hum still have a problem with uh you know i haven't come to an absolute conclusion on my opinion on this but and i know other Christians would disagree with me my husband and i are kind of not even in agreement on this but we don't fight over it or anything but you know how can you know the Bible says bless your enemies and bless those that curse you and it's like be gentle unto all men apt to teach patient kind so it's like how can you i don't know for me i don't know you know i can't say that i agree with Vietnam because how can you be gentle unto all men and and then shoot them As a Christian I believe that Vietnam is a necessary war. 2
2 These 1) approving both changes in existing services and the establishment of new services---those are known as classification cases; 2) adjudicating complaints from anyone who believes the Postal Service is not providing rates or services as required by law; 3) issuing advisory opinions when the Postal Service proposes a substantially nationwide change in the nature of its services; and, 4) our mostly recently assigned task, providing Congress with annual reports about the costs and revenues of international mail. The Postal Service undergoing a review across all of its activities. 1
3 After all, in this piece the car-home-and-fire salesman turned global strategist describes the Gulf War as if it were a model of Clausewitzian clarity concerning ultimate goals and acceptable means, forgetting in the process that at the end of that war, the Bush/Powell/Schwarzkopf axis internally disagreed about war issues that had never been articulated for the American Should the U.S. destroy the Iraqi military, invade Baghdad, or topple Hussein even after Iraq was repulsed from Kuwait? Bush, Powell and Schwarzkopf were in full agreement about war issues. 2
WANDB_NAME = f'{ds_name}-{task}-{model_name}'
GROUP = f'{ds_name}-{task}-{model_name}-{lr:.0e}'
NOTES = f'finetuning {model_name} with Adam lr={lr:.0e}'
TAGS =[model_name, ds_name, 'adam', task]

wandb.init(reinit=True, project="glue-benchmark", entity="fastai_community",
           name=WANDB_NAME, group=GROUP, notes=NOTES, tags=TAGS);

The training procedure is also very similar:

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
metrics = glue_metrics[task]
learn = TransLearner(dls, model, metrics=metrics)
metric_to_monitor = metrics[0].name if isinstance(metrics[0], Metric) else metrics[0].__name__
cbs = [WandbCallback(log_preds=False, log_model=False),
       SaveModelCallback(monitor=metric_to_monitor, fname=f'{model_name}-{task}')]
learn.fit_one_cycle(4, lr, wd=wd, cbs=cbs)
Could not gather input dimensions
epoch train_loss valid_loss accuracy time
0 0.532420 0.497427 0.801936 22:54
1 0.447823 0.431625 0.835660 23:02
2 0.384313 0.431362 0.841161 22:58
3 0.297241 0.459461 0.843709 23:11
Better model found at epoch 0 with accuracy value: 0.8019357919692993.
Better model found at epoch 1 with accuracy value: 0.8356596827507019.
Better model found at epoch 2 with accuracy value: 0.8411614894866943.
Better model found at epoch 3 with accuracy value: 0.8437086343765259.
learn.show_results()

text text_ category category_
0 yes they would they just wouldn't be able to own the kind of automobiles that they think they deserve to own or the kind of homes that we think we deserve to own we might have to you know just be able to i think if we a generation went without debt then the next generation like if if our our generation my husband and i we're twenty eight if we lived our lives and didn't become you know indebted like you know our generation before us that um the budget would balance and that we became accustomed to living with what we could afford which we wouldn't be destitute i mean we wouldn't be living on the street by any means but just compared to how spoiled we are we would be in our own minds but i feel like the generation after us would oh man it it Society would be perfect and there would be no more war if we could just rid ourselves of our debt. 1 2
1 and i look back on that and i bought shoes i went shopping i did not need that money i did not need it i didn't need it i shouldn't have even qualified to get it i didn't need it and it would have been a little rough i might have eaten some bologna instead of roast beef out of the deli but i did not need it and as i look back now now we're paying that back i told my son if you have to live in the ghetto to go to college do it but don't take out ten thousand dollars in loans don't do it and i don't i hope don't think he'll have to do that but i just so like we might if we didn't have those loans we could have saved in the last five years the money for that and i believe My friends should look towards me as a model of saving money. 1 1
2 and i look back on that and i bought shoes i went shopping i did not need that money i did not need it i didn't need it i shouldn't have even qualified to get it i didn't need it and it would have been a little rough i might have eaten some bologna instead of roast beef out of the deli but i did not need it and as i look back now now we're paying that back i told my son if you have to live in the ghetto to go to college do it but don't take out ten thousand dollars in loans don't do it and i don't i hope don't think he'll have to do that but i just so like we might if we didn't have those loans we could have saved in the last five years the money for that and i believe I regret taking out loans. 0 1
3 well the first thing for me is i wonder i see a couple of different ways of talking about what privacy is um if privacy is something that disturbs your private state i mean an invasion of privacy is something that disturbs your private state that's one thing and if privacy is something that comes into your private state and extracts information from it in other words finds something out about you that's another and the first kind of invasion of the first type of privacy seems invaded to me in very much everyday in this country but in the second type at least overtly uh where someone comes in and uh finds out information about you that should be private uh does not seem uh um obviously everyday Talking about privacy is a complicated topic, there are a couple different ways of talking about it, for example privacy is something that disturbs your private state... 0 1
4 The rule prohibits the sale of nicotine-containing cigarettes and smokeless tobacco to individuals under the age of 18; requires manufacturers, distributors, and retailers to comply with various conditions regarding the sale and distribution of these products; requires retailers to verify a purchaser's age by photographic identification; prohibits all free samples; limits the distribution of these products through vending machines and self-service displays by permitting such methods of sale only in facilities where access by individuals under 18 is prohibited; limits the advertising and labeling to which children and adolescents are exposed; prohibits promotional, non-tobacco items such as hats and tee shirts; prohibits sponsorship of This rule will make the sale of tobacco products to people under 18 years old legal in every state and Mexico. 2 2
5 yeah the the i mean people like that are crazy i did a study on it though when i was in high school it was one of these things we had to pick a topic to to investigate and at that time i don't think it's like that any more but at that time uh it was very unfair capital punishment was a lot more common and if you tended and it tended to be that if you were ignorant or if you were a foreigner or if you were black or any minority for that matter the chances your chances of of uh getting the death penalty were you know like hundreds of times greater than if you could just communicate well i mean you didn't have to be um you didn't even necessarily have to be white but if you could just communicate and you could come It was something I performed research on during high school. 0 0
6 yeah because you look at the statistics now and i'm sure it's in your your newspapers just like it is in ours that every major city now the increase of crime is is escalating i mean there are more look at the look at the people there are being shot now i mean every day there's there's dozens of dozens of people across the nation they just get blown away for no reason you know stray bullets or California they were going out there and they were shooting and they get these guys and they don't do anything with them so i kind of i kind of agree with you i'm kind of you still in the in the uh prison system "Crime is escalating now in every major city, however there are plans in place now." 1 1
7 i know that you know the further we go from Adam the worse the food is for you but God still somehow makes us all be able to still live i think it's a miracle we're all still alive after so many generations well the last couple of processed foods you know i mean but i don't know i like to i like to my i like to be able to eat really healthy you know what am saying and i guess i'm going to have to wait for the millennium i think though because i do don't think we're going to restore the earth to you know i think Jesus is the only one that can make this earth be restored to what it should be It is miraculous God still provides for us to this day. 0 0
8 i know because i think i've been reading i read this ten years ago that they were having these big uh um rallies and people would be in the streets flashing signs statehood yes and other people would statehood down the statehood it's it down there if you're um familiar with their politics they uh it's very uh i i don't know it's called Latino there they have loudspeakers on their cars and they run down the neighborhood saying vote for you know Pierre he's or uh Pedro uh Pedro he's the best it's it's really kind of comical Ten years ago, they rallies on streets with flashing signs and loudspeakers advertising voting candidates. 0 0

The MNLI task has another, mismatched validation set. The matched set contains in-domain data, while the mismatched one is cross-domain.

valid_mm_dl = dls.test_dl(ds['validation_mismatched'], with_labels=True)
learn.validate(dl=valid_mm_dl)

Notice that there are similar datasets available (e.g. the snli dataset). Those might be used to improve the performance. But for this post I'll limit the scope to GLUE data only and leave the experiments with extra data for upcoming posts.

Low resource tasks

Some datasets are rather small: RTE has only 2.5k samples in the training set. This is not much at all for a nontrivial language task like this one. But we can try a small trick to improve the results. The MNLI task is quite similar and has much more training data. Let's reuse the model trained on it to improve the RTE score. This trick is common practice and was employed in the original RoBERTa paper when reporting GLUE scores.

task = 'rte'; validate_task()

ds = load_dataset(ds_name, task)

valid_ = 'validation_matched' if task=='mnli' else 'validation'
len(ds['train']), len(ds[valid_])

train_idx, valid_idx = get_splits(ds, valid=valid_)
train_ds = concatenate_datasets([ds['train'], ds[valid_]])
train_ds[0]
{'idx': 0,
 'label': 1,
 'sentence1': 'No Weapons of Mass Destruction Found in Iraq Yet.',
 'sentence2': 'Weapons of Mass Destruction Found in Iraq.'}
blocks = [TransformersTextBlock(pretrained_model_name=model_name),
          RegressionBlock() if task=='stsb' else CategoryBlock()]
dblock = DataBlock(blocks = blocks,
                   get_x=TextGetter(*glue_textfields[task]),
                   get_y=ItemGetter('label'),
                   splitter=IndexSplitter(valid_idx))
dls = dblock.dataloaders(train_ds, bs=bs, val_bs=val_bs)
dls.show_batch(max_n=4)
text text_ category
0 No Weapons of Mass Destruction Found in Iraq Yet. Weapons of Mass Destruction Found in Iraq. 1
1 The most recent poll carried out by NOP market research in January revealed that 61% of Britons are opposed to joining the euro. The introduction of the euro has been opposed. 0
2 The disappearance of York University chef Claudia Lawrence is now being treated as suspected murder, North Yorkshire Police said. However detectives said they had not found any proof that the 35-year-old, who went missing on 18 March, was dead. Her father Peter Lawrence made a direct appeal to his daughter to contact him five weeks after she disappeared. His plea came at a news conference held shortly after a £10,000 reward was offered to help find Miss Lawrence. Crimestoppers said the sum they were offering was "significantly higher" than usual because of public interest in the case. Claudia Lawrence is 35 years old. 0
3 A Continental Connection flight from Newark to Buffalo crashed into a house about four to six miles from Buffalo Niagara International Airport on Thursday night, killing 50 people, officials said. Continental Airlines Flight 3407 is a daily commuter flight from Newark Liberty International Airport in Newark, New Jersey to Buffalo, New York, operated under the Continental Connection brand by Virginia-based regional airline Colgan Air. A daily commuter flight crashed in New York. 0
WANDB_NAME = f'{ds_name}-{task}-{model_name}'
GROUP = f'{ds_name}-{task}-{model_name}-{lr:.0e}'
if diff_lr_decay_factor: GROUP += f"diff_lr_{diff_lr_decay_factor}"
NOTES = f'finetuning {model_name} with {opt_func.__name__} lr={lr:.0e}'
TAGS =[model_name, ds_name, opt_func.__name__]

wandb.init(reinit=True, project="fasthugs", entity="fastai_community",
           name=WANDB_NAME, group=GROUP, notes=NOTES, tags=TAGS);
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=glue_num_labels.get(task, 2))
metrics = glue_metrics[task]
learn = TransLearner(dls, model, metrics=metrics, opt_func=opt_func)
try:
    learn.load('distilroberta-base-mnli', with_opt=False, strict=False)
except RuntimeError as e:
    print(e)
Error(s) in loading state_dict for RobertaForSequenceClassification:
	size mismatch for classifier.out_proj.weight: copying a param with shape torch.Size([3, 768]) from checkpoint, the shape in current model is torch.Size([2, 768]).
	size mismatch for classifier.out_proj.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([2]).
if diff_lr_decay_factor != 0:
    k = len(layerwise_splitter(model))
    lr = slice(lr*diff_lr_decay_factor**k,lr)

metric_to_monitor = metrics[0].name if isinstance(metrics[0], Metric) else metrics[0].__name__
cbs = [WandbCallback(log_preds=False, log_model=False),
       SaveModelCallback(monitor=metric_to_monitor, fname=f'{model_name}-{task}')]
learn.fit_one_cycle(10, lr, wd=wd, cbs=cbs, pct_start=0.1)
epoch train_loss valid_loss accuracy time
0 0.569979 0.565890 0.693141 00:30
1 0.511280 0.529077 0.736462 00:31
2 0.409093 0.601690 0.743682 00:31
3 0.265996 0.763166 0.736462 00:31
4 0.171846 0.770063 0.754513 00:32
5 0.098103 0.922156 0.768953 00:32
6 0.067698 1.030401 0.761733 00:31
7 0.048222 1.007513 0.772563 00:31
8 0.034855 1.056370 0.765343 00:32
9 0.021131 1.069907 0.761733 00:32

As one can see, with this simple trick we've improved on the result reported in the HuggingFace model card by some 10%. Pretty nice, huh?

Just to be sure that the improvement is due to using the model finetuned on MNLI, let's do another run starting from vanilla distilroberta:

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=glue_num_labels.get(task, 2))
metrics = glue_metrics[task]
learn = TransLearner(dls, model, metrics=metrics, opt_func=opt_func)
learn.fit_one_cycle(10, lr, wd=wd, cbs=cbs, pct_start=0.1)
Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
epoch train_loss valid_loss accuracy time
0 0.695126 0.691306 0.527076 00:31
1 0.692349 0.692152 0.480144 00:31
2 0.678994 0.641740 0.624549 00:31
3 0.602276 0.600447 0.671480 00:31
4 0.488653 0.662074 0.678700 00:31
5 0.377430 0.683057 0.678700 00:31
6 0.269494 0.967499 0.657040 00:31
7 0.182777 1.016970 0.685921 00:32
8 0.140067 1.038462 0.696751 00:31
9 0.113930 1.068865 0.682310 00:32

The same applies to the STSB task, which has 7k training samples. The performance gain for STSB is not as prominent, but it's still there. You can compare the results for cold and warm starts in this W&B report.
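The two RTE training logs above can be compared directly. The sketch below copies the per-epoch validation accuracies from those logs and reports the best epoch of each run. Keep in mind that this is a single run per setting, so the gap is indicative rather than a statistically robust estimate.

```python
# Validation accuracy per epoch, copied from the two RTE training logs in this post.
warm_start = [0.693141, 0.736462, 0.743682, 0.736462, 0.754513,
              0.768953, 0.761733, 0.772563, 0.765343, 0.761733]  # initialized from the MNLI model
cold_start = [0.527076, 0.480144, 0.624549, 0.671480, 0.678700,
              0.678700, 0.657040, 0.685921, 0.696751, 0.682310]  # vanilla distilroberta-base

best_warm, best_cold = max(warm_start), max(cold_start)
gain = best_warm - best_cold

print(f'warm start: best accuracy {best_warm:.4f} at epoch {warm_start.index(best_warm)}')
print(f'cold start: best accuracy {best_cold:.4f} at epoch {cold_start.index(best_cold)}')
print(f'absolute gain: {100 * gain:.1f} percentage points')
```

The warm-started run is also ahead from the very first epoch, which suggests that the MNLI-finetuned encoder already carries most of what RTE needs.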

Concluding thoughts

With this we have a simple, easy-to-use framework for quick experimentation with LM finetuning. HuggingFace provides us with a huge variety of state-of-the-art Transformers, and fastai facilitates a configurable training loop with a great API. You are welcome to share your comments in the dedicated fastai forums topic, try out fasthugs (I'm happy to hear your opinions and accept feature requests) and finally open this notebook on Colab, select your task and try to set a new best for the model.