Tutorial#
The easiest way to get started with transformers in Determined is to use one of the provided examples. In this tutorial, we will walk through the question answering example to get a better understanding of how to use model-hub for transformers.
The question answering example includes two implementations using the PyTorch API:

qa_trial.py uses the model_hub.huggingface.BaseTransformerTrial parent __init__ function to build the transformers config, tokenizer, and model objects, as well as the optimizer and learning rate scheduler.

qa_beam_search_trial.py overrides the model_hub.huggingface.BaseTransformerTrial parent __init__ function to customize how the transformers config, tokenizer, and model objects are constructed.
To learn the basics, we’ll walk through qa_trial.py. We won’t cover the model definition line-by-line but will highlight the parts that make use of model-hub.
Note
If you are new to Determined, we recommend going through the Quickstart for ML Developers document to get a better understanding of how to use PyTorch in Determined via determined.pytorch.PyTorchTrial.
After this tutorial, if you want to further customize a trial for your own use, you can look at qa_beam_search_trial.py for an example.
Initialize the QATrial#
The __init__ for QATrial is responsible for creating and processing the dataset; building the transformers config, tokenizer, and model; and tokenizing the dataset. The specifications for how these steps should be performed are passed from the PyTorchTrialContext via the hyperparameters and data configuration fields. These fields are set as the hparams and data_config class attributes in model_hub.huggingface.BaseTransformerTrial.__init__(). You can also get them by calling context.get_hparams() and context.get_data_config(), respectively. Note that context.get_hparams() and context.get_data_config() return the hyperparameters and data sections, respectively, of the experiment configuration file squad.yaml.
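For example, the two configuration sections surface inside __init__ roughly as follows. This is a minimal sketch, not verbatim code from qa_trial.py; the pretrained_model_name_or_path key is an assumption about the hyperparameters section, while dataset_name appears in the data section shown later.

# Sketch (not verbatim from qa_trial.py): the two configuration sections as seen
# from inside QATrial.__init__.
hparams = self.context.get_hparams()          # "hyperparameters" section of squad.yaml
data_config = self.context.get_data_config()  # "data" section of squad.yaml

# Illustrative lookups; check squad.yaml for the real field names.
model_name = hparams.get("pretrained_model_name_or_path")  # assumed key
dataset_name = data_config.get("dataset_name")             # "squad" in this example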
Build transformers config, tokenizer, and model#
First, we’ll build the transformers config, tokenizer, and model objects by calling model_hub.huggingface.BaseTransformerTrial.__init__():
super(QATrial, self).__init__(context)
This will parse the hyperparameters and fill the fields of model_hub.huggingface.ConfigKwargs, model_hub.huggingface.TokenizerKwargs, and model_hub.huggingface.ModelKwargs if they are present in the hyperparameters, and then pass them to model_hub.huggingface.build_using_auto() to build the config, tokenizer, and model using transformers autoclasses. You can look at the associated class definitions for the Kwargs objects to see the fields you can pass.
This step needs to be done before we can use the tokenizer to tokenize the dataset. In some cases, you may need to first load the raw dataset and get certain metadata, like the number of classes, before creating the transformers objects (see ner_trial.py for an example).
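If you want to see which hyperparameters each of these Kwargs classes accepts without digging through the source, you can inspect their fields directly. The sketch below assumes they are standard Python dataclasses, like the DatasetKwargs dataclass mentioned later in this tutorial.

import dataclasses

import model_hub.huggingface as hf

# Print the field names each Kwargs dataclass will accept from the hyperparameters.
for kwargs_cls in (hf.ConfigKwargs, hf.TokenizerKwargs, hf.ModelKwargs):
    print(kwargs_cls.__name__, [field.name for field in dataclasses.fields(kwargs_cls)])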
Note
You are not tied to using model_hub.huggingface.build_using_auto() to build the config, tokenizer, and model objects. See qa_beam_search_trial.py for an example of a trial directly calling transformers methods.
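In its simplest form, that direct approach looks something like the sketch below, which uses the standard transformers auto classes. The model name is illustrative, and this is not the actual code from qa_beam_search_trial.py.

from transformers import AutoConfig, AutoModelForQuestionAnswering, AutoTokenizer

# Build the three objects directly with transformers auto classes (illustration only).
model_name = "bert-base-uncased"
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForQuestionAnswering.from_pretrained(model_name, config=config)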
Build the optimizer and LR scheduler#
The model_hub.huggingface.BaseTransformerTrial.__init__() also parses the hyperparameters into model_hub.huggingface.OptimizerKwargs and model_hub.huggingface.LRSchedulerKwargs before passing them to model_hub.huggingface.build_default_optimizer() and model_hub.huggingface.build_default_lr_scheduler(), respectively. These two build methods have the same behavior and configuration options as the transformers Trainer. Again, you can look at the associated class definitions for the Kwargs objects to see the fields you can pass.
Note
You are not tied to using these functions to build the optimizer and LR scheduler. You can easily override the parent __init__ method to use whatever optimizer and LR scheduler you want, as sketched below.
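Such an override could look roughly like the following. This is only a sketch: it assumes the standard wrap_optimizer and wrap_lr_scheduler methods of the PyTorchTrialContext, and the optimizer, scheduler, and learning_rate hyperparameter are arbitrary choices rather than what the example actually uses. Note that the parent __init__ still builds its default optimizer; a full override would replace that step as well.

import torch
import determined.pytorch as det_torch
import model_hub.huggingface as hf

class MyCustomTrial(hf.BaseTransformerTrial):
    def __init__(self, context: det_torch.PyTorchTrialContext) -> None:
        super().__init__(context)  # still builds the config, tokenizer, and model
        # Swap in our own optimizer and LR scheduler (illustrative choices).
        self.optimizer = context.wrap_optimizer(
            torch.optim.AdamW(self.model.parameters(), lr=context.get_hparam("learning_rate"))
        )
        self.lr_scheduler = context.wrap_lr_scheduler(
            torch.optim.lr_scheduler.CosineAnnealingLR(self.optimizer, T_max=1000),
            step_mode=det_torch.LRScheduler.StepMode.STEP_EVERY_BATCH,
        )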
Load the Dataset#
self.raw_datasets = hf.default_load_dataset(self.data_config)
This example uses the helper function model_hub.huggingface.default_load_dataset() to load the SQuAD dataset. The function takes the data_config as input and parses the fields into those expected by the model_hub.huggingface.DatasetKwargs dataclass before passing it to the load_dataset function from Huggingface datasets.
Not all of the fields of model_hub.huggingface.DatasetKwargs apply to every example. For this example, we specify the following fields in squad.yaml for loading the dataset:
data:
  dataset_name: squad
  train_file: null
  validation_file: null
If the dataset you want to use is registered in Huggingface datasets, then you can simply specify the dataset_name. Otherwise, you can set dataset_name: null and pass in your own dataset using train_file and validation_file. There is more guidance on how to use this example with custom data files in qa_trial.py.
Note
You can also bypass model_hub.huggingface.default_load_dataset() and call load_dataset directly for more options.
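For instance, loading the dataset directly would look roughly like this. This is standard Huggingface datasets usage shown purely for illustration; the file names in the commented variant are placeholders.

import datasets

# Load SQuAD straight from the Huggingface hub instead of using default_load_dataset.
raw_datasets = datasets.load_dataset("squad")

# Or point load_dataset at your own files (builder and paths are placeholders):
# raw_datasets = datasets.load_dataset(
#     "json", data_files={"train": "train.json", "validation": "dev.json"}
# )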
Data processing#
Our text data needs to be converted to vectors before we can apply our models to it. This usually involves some preprocessing before passing the result to the tokenizer for vectorization, along with task-specific preprocessing of the targets. model-hub has no prescription for how you should process your data, but all the provided examples implement a build_datasets function to create the tokenized dataset.
Note
The Huggingface transformers and datasets libraries have optimized routines for tokenization that cache results for reuse when possible. We have taken special care to make sure all our examples make use of this functionality. As you start implementing your own Trials, one pitfall to watch out for that prevents efficient caching is passing a function to Dataset.map that contains unserializable objects.
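As a stripped-down sketch of that pattern (not the actual qa_utils preprocessing, which also maps answer character spans to token positions), a build_datasets step might tokenize the raw dataset like this, assuming the tokenizer built earlier is available as self.tokenizer:

import functools

def tokenize_fn(examples, tokenizer, max_length):
    # Tokenize question/context pairs; the column names follow the SQuAD format.
    return tokenizer(
        examples["question"],
        examples["context"],
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )

def build_datasets(self):
    # Bind only picklable arguments (the tokenizer and plain values) rather than the
    # trial or context objects, so Huggingface datasets can hash the mapped function
    # and reuse cached results on later runs.
    return self.raw_datasets.map(
        functools.partial(tokenize_fn, tokenizer=self.tokenizer, max_length=384),
        batched=True,
        remove_columns=self.raw_datasets["train"].column_names,
    )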
Define metrics#
Next, we’ll define the metrics that we wish to compute over the predictions generated for the validation dataset.
# Create metric reducer
metric = datasets.load_metric(
    "squad_v2" if self.data_config.version_2_with_negative else "squad"
)
self.reducer = context.wrap_reducer(
    functools.partial(
        qa_utils.compute_metrics,
        self.data_config,
        self.column_names,
        self.data_processors.post_processing_function,
        self.raw_datasets,
        self.tokenized_datasets,
        self.model,
        metric,
    ),
    for_training=False,
)
We use the metric function associated with the SQuAD dataset from Huggingface datasets and apply it after post-processing the predictions in the qa_utils.compute_metrics function. Determined supports parallel evaluation via custom reducers. The reducer we created above will aggregate predictions across all GPUs and then apply the qa_utils.compute_metrics function to the result.
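As a simpler illustration of the same pattern, consider the hypothetical sketch below (not the SQuAD metric used above). When a plain function is wrapped, Determined calls it at the end of evaluation with the values gathered from every update() call on every slot, so the function only has to turn that list into a metrics dictionary.

# Hypothetical function-based reducer (illustration only).
def compute_accuracy(per_batch_results):
    # per_batch_results: everything passed to reducer.update() on every GPU,
    # gathered together at the end of evaluation.
    correct = sum(n_correct for n_correct, n_total in per_batch_results)
    total = sum(n_total for n_correct, n_total in per_batch_results)
    return {"accuracy": correct / max(total, 1)}

accuracy_reducer = context.wrap_reducer(compute_accuracy, for_training=False)
# In evaluate_batch you would then call, for example:
#     accuracy_reducer.update((num_correct_in_batch, batch_size))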
Fill in the Rest of PyTorchTrial#
The remaining class methods we must implement are determined.pytorch.PyTorchTrial.build_training_data_loader(), determined.pytorch.PyTorchTrial.build_validation_data_loader(), and determined.pytorch.PyTorchTrial.evaluate_batch().
Build the Dataloaders#
The two functions below are responsible for building the dataloaders used for training and validation.
def build_training_data_loader(self) -> det_torch.DataLoader:
    return det_torch.DataLoader(
        self.tokenized_datasets["train"],
        batch_size=self.context.get_per_slot_batch_size(),
        collate_fn=self.collator,
    )

def build_validation_data_loader(self) -> det_torch.DataLoader:
    # Determined's distributed batch sampler interleaves shards on each GPU slot so
    # sample i goes to worker with rank i % world_size. Therefore, we need to re-sort
    # all the samples once we gather the predictions before computing the validation metric.
    return det_torch.DataLoader(
        qa_utils.DatasetWithIndex(self.tokenized_datasets["validation"]),
        batch_size=self.context.get_per_slot_batch_size(),
        collate_fn=self.collator,
    )
There are two things to note:

The batch size passed to the dataloader is context.get_per_slot_batch_size(), which is the effective per-GPU batch size when performing distributed training.

The dataloader returned is a determined.pytorch.DataLoader, which has the same signature as a PyTorch dataloader but automatically handles data sharding and resuming dataloader state when recovering from a fault.
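The DatasetWithIndex wrapper used in the validation dataloader could be as simple as the following sketch; the actual qa_utils implementation may differ. It just attaches each sample's original position so the predictions can be re-sorted after they are gathered across GPUs.

import torch

class DatasetWithIndex(torch.utils.data.Dataset):
    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        sample = dict(self.dataset[idx])
        sample["ind"] = idx  # popped from the batch in evaluate_batch below
        return sample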
Define the Training Routine#
The train_batch method of model_hub.huggingface.BaseTransformerTrial, shown below, is sufficient for this example.
def train_batch(self, batch: Any, epoch_idx: int, batch_idx: int) -> Any:
    # By default, all HF models return the loss in the first element.
    # We do not automatically apply a label smoother for the user.
    # If this is something you want to use, please see how it's
    # applied by transformers.Trainer:
    # https://github.com/huggingface/transformers/blob/v4.3.3/src/transformers/trainer.py#L1324
    outputs = self.model(**batch)
    loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
    self.context.backward(loss)
    self.context.step_optimizer(self.optimizer, self.grad_clip_fn)
    return loss
Define the Evaluation Routine#
Finally, we can define the evaluation routine for this example.
def evaluate_batch(self, batch: det_torch.TorchData, batch_idx: int) -> Dict:
    ind = batch.pop("ind")
    outputs = self.model(**batch)
    if isinstance(outputs, dict):
        predictions = tuple(
            v.detach().cpu().numpy() for k, v in outputs.items() if k not in ("loss", "mems")
        )
    else:
        # outputs is a tuple of tensors; drop the loss and move the rest to numpy.
        predictions = tuple(p.detach().cpu().numpy() for p in outputs[1:])
    self.reducer.update((ind.detach().cpu().numpy(), predictions))
    # Although we are returning the empty dictionary below, we will still get the metrics
    # from the custom reducer that we passed to the context during initialization.
    return {}
After passing the batch through the model and doing some processing to get the predictions, we pass the predictions to reducer.update to aggregate them on each GPU. Once each GPU has exhausted the batches in its dataloader, Determined automatically performs an all-gather operation to collect the predictions on the rank 0 GPU before passing them to the compute_metrics function.
HF Library Versions#
model-hub support for transformers is tied to specific versions of the source library to ensure compatibility. Be sure to use the latest Docker image, which has all the necessary dependencies for using transformers with model-hub. All provided examples already have this Docker image specified:
environment:
  image:
We periodically bump these libraries up to more recent versions of transformers and datasets so you can access the latest upstream features. That said, once you create a trial definition using a particular Docker image, you will not need to upgrade to a new Docker image for your code to continue working with model-hub. Additionally, your code will continue to work with that image even if you use it with a more recent version of the Determined cluster.
Next Steps#
Take a look at qa_beam_search_trial.py for an example of how you can further customize your trial.
Dive into the API.