The easiest way to get started with transformers in Determined is to use one of the provided examples. In this tutorial, we will walk through the question answering example to get a better understanding of how to use model-hub for transformers.

The question answering example includes two implementations of the PyTorch API: qa_trial.py and qa_beam_search_trial.py.

To learn the basics, we’ll walk through qa_trial.py. We won’t cover the model definition line-by-line but will highlight the parts that make use of model-hub.


If you are new to Determined, we recommend going through the Quickstart for ML Developers document to get a better understanding of how to use PyTorch in Determined using determined.pytorch.PyTorchTrial.

After this tutorial, if you want to further customize a trial for your own use, you can look at qa_beam_search_trial.py for an example.

Initialize the QATrial

The __init__ for QATrial is responsible for creating and processing the dataset; building the transformers config, tokenizer, and model; and tokenizing the dataset. The specifications for how to perform these steps are passed from the PyTorchTrialContext via the hyperparameters and data configuration fields. model_hub.huggingface.BaseTransformerTrial.__init__() stores these fields in the hparams and data_config class attributes. You can also get them by calling context.get_hparams() and context.get_data_config() respectively.

Note that context.get_hparams() and context.get_data_config() return the hyperparameters and data sections, respectively, of the experiment configuration file squad.yaml.
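To make the mapping concrete, here is a minimal sketch using a stand-in context object. FakeContext and the hyperparameter names and values are hypothetical and exist only for illustration; the hyperparameters and data section names and the dataset fields are the only parts taken from the configuration described in this tutorial.

```python
from typing import Any, Dict


class FakeContext:
    """Hypothetical stand-in for the trial context; not part of Determined."""

    def __init__(self, experiment_config: Dict[str, Any]) -> None:
        self._config = experiment_config

    def get_hparams(self) -> Dict[str, Any]:
        # Mirrors the idea of returning the `hyperparameters` section.
        return self._config["hyperparameters"]

    def get_data_config(self) -> Dict[str, Any]:
        # Mirrors the idea of returning the `data` section.
        return self._config["data"]


# A dict shaped like a parsed experiment configuration file.
# The hyperparameter names and values are illustrative assumptions.
experiment_config = {
    "hyperparameters": {"learning_rate": 5e-5, "global_batch_size": 16},
    "data": {"dataset_name": "squad", "train_file": None, "validation_file": None},
}

context = FakeContext(experiment_config)
hparams = context.get_hparams()
data_config = context.get_data_config()
print(hparams["learning_rate"], data_config["dataset_name"])
```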

Build transformers config, tokenizer, and model

First, we’ll build the transformers config, tokenizer, and model objects by calling model_hub.huggingface.BaseTransformerTrial.__init__():

        super(QATrial, self).__init__(context)

This call parses the hyperparameters and fills the fields of model_hub.huggingface.ConfigKwargs, model_hub.huggingface.TokenizerKwargs, and model_hub.huggingface.ModelKwargs with any matching values that are present. It then passes these objects to model_hub.huggingface.build_using_auto() to build the config, tokenizer, and model using transformers autoclasses. You can look at the associated class definitions for the Kwargs objects to see the fields you can pass.
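The parsing step can be pictured as filtering a hyperparameter dictionary down to the fields a dataclass declares. The following is a rough sketch of that idea only, not the model-hub implementation: ExampleModelKwargs, its field names, and fill_kwargs are all hypothetical.

```python
import dataclasses
from typing import Any, Dict, Optional, Type, TypeVar

T = TypeVar("T")


@dataclasses.dataclass
class ExampleModelKwargs:
    """Hypothetical kwargs container; field names are illustrative only."""

    pretrained_model_name_or_path: str
    cache_dir: Optional[str] = None


def fill_kwargs(kwargs_cls: Type[T], hparams: Dict[str, Any]) -> T:
    """Build a dataclass from the hyperparameters whose names match its fields."""
    field_names = {f.name for f in dataclasses.fields(kwargs_cls)}
    matching = {k: v for k, v in hparams.items() if k in field_names}
    return kwargs_cls(**matching)


hparams = {
    "pretrained_model_name_or_path": "bert-base-uncased",
    "learning_rate": 5e-5,  # ignored here: not a field of ExampleModelKwargs
}
model_kwargs = fill_kwargs(ExampleModelKwargs, hparams)
print(model_kwargs)
```

Fields that are absent from the hyperparameters simply keep their dataclass defaults, which is why only the values you want to change need to appear in the configuration.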

This step needs to be done before we can use the tokenizer to tokenize the dataset. In some cases, you may need to first load the raw dataset and get certain metadata like the number of classes before creating the transformers objects (see ner_trial.py for example).


You are not tied to using model_hub.huggingface.build_using_auto() to build the config, tokenizer, and model objects. See qa_beam_search_trial.py for an example of a trial directly calling transformers methods.

Build the optimizer and LR scheduler

model_hub.huggingface.BaseTransformerTrial.__init__() also parses the hyperparameters into model_hub.huggingface.OptimizerKwargs() and model_hub.huggingface.LRSchedulerKwargs() before passing them to model_hub.huggingface.build_default_optimizer() and model_hub.huggingface.build_default_lr_scheduler() respectively. These two build methods have the same behavior and configuration options as the transformers Trainer. Again, you can look at the associated class definitions for the Kwargs objects to see the fields you can pass.
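As an illustration of what such a scheduler does, the sketch below computes the learning-rate multiplier for a linear schedule with warmup, which is the default schedule type of the transformers Trainer. It is written in plain Python under that assumption and is not the model-hub code; the function name, base learning rate, and step counts are made up for the example.

```python
def linear_schedule_multiplier(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Learning-rate multiplier: linear warmup from 0 to 1, then linear decay to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5  # illustrative value
for step in (0, 50, 100, 550, 1000):
    lr = base_lr * linear_schedule_multiplier(step, num_warmup_steps=100, num_training_steps=1000)
    print(step, lr)
```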


You are not tied to using these functions to build the optimizer and LR scheduler. You can override the parent __init__ method to use whatever optimizer and LR scheduler you want.

Load the Dataset

        self.raw_datasets = hf.default_load_dataset(self.data_config)

This example uses the helper function model_hub.huggingface.default_load_dataset() to load the SQuAD dataset. The function takes the data_config as input and parses the fields into those expected by the model_hub.huggingface.DatasetKwargs dataclass before passing them to the load_dataset function from Huggingface datasets.

Not all the fields of model_hub.huggingface.DatasetKwargs are always applicable to an example. For this example, we specify the following fields in squad.yaml for loading the dataset:

  dataset_name: squad
  train_file: null
  validation_file: null

If the dataset you want to use is registered in Huggingface datasets then you can simply specify the dataset_name. Otherwise, you can set dataset_name: null and pass your own dataset in using train_file and validation_file. There is more guidance on how to use this example with custom data files in qa_trial.py.
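The choice between a registered dataset and custom files can be sketched as a small decision function. This is a hypothetical helper for illustration, not the logic of default_load_dataset(); the function name, the returned dictionary layout, and the file paths in the usage are all assumptions.

```python
from typing import Any, Dict, Optional


def resolve_dataset_source(data_config: Dict[str, Optional[str]]) -> Dict[str, Any]:
    """Describe where the data comes from, given fields like those in squad.yaml."""
    name = data_config.get("dataset_name")
    if name is not None:
        # A registered dataset: the name alone is enough.
        return {"kind": "registered", "name": name}

    # Otherwise fall back to user-supplied files.
    data_files = {
        split: path
        for split, path in (
            ("train", data_config.get("train_file")),
            ("validation", data_config.get("validation_file")),
        )
        if path is not None
    }
    if not data_files:
        raise ValueError("Set dataset_name, or provide train_file and/or validation_file.")
    return {"kind": "files", "data_files": data_files}


registered = resolve_dataset_source(
    {"dataset_name": "squad", "train_file": None, "validation_file": None}
)
custom = resolve_dataset_source(
    # Hypothetical file paths, used only to exercise the second branch.
    {"dataset_name": None, "train_file": "train.json", "validation_file": "dev.json"}
)
print(registered, custom)
```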


You can also bypass model_hub.huggingface.default_load_dataset() and call load_dataset directly for more options.

Data processing

Our text data needs to be converted to vectors before we can apply our models to it. This usually involves some preprocessing before the result is passed to the tokenizer for vectorization, along with task-specific preprocessing of the targets. model-hub does not prescribe how you should process your data, but all the provided examples implement a build_datasets function to create the tokenized dataset.


The Huggingface transformers and datasets libraries have optimized routines for tokenization that cache results for reuse where possible. We have taken special care to make sure all our examples make use of this functionality. As you start implementing your own Trials, one pitfall that prevents efficient caching is passing a function that contains unserializable objects to Dataset.map.
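One way to spot this pitfall early is to check whether the objects a processing function depends on can be serialized at all. The sketch below uses the standard pickle module as a rough proxy; the datasets library has its own hashing mechanism that may behave differently, so treat this as an approximation. The helper name and the argument values are hypothetical.

```python
import pickle
import threading
from typing import Any, Dict, List


def find_unpicklable(fn_kwargs: Dict[str, Any]) -> List[str]:
    """Return the names of arguments that the standard pickle module cannot serialize."""
    bad = []
    for name, value in fn_kwargs.items():
        try:
            pickle.dumps(value)
        except Exception:
            bad.append(name)
    return bad


# Plain data serializes fine; an object holding OS-level state (here, a lock) does not.
safe_kwargs = {"max_length": 384, "stride": 128}
risky_kwargs = {"max_length": 384, "lock": threading.Lock()}

print(find_unpicklable(safe_kwargs))
print(find_unpicklable(risky_kwargs))
```

Keeping the arguments of a mapped function to plain values such as numbers, strings, and column names makes it more likely that results can be cached and reused.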

Define metrics

Next, we’ll define the metrics that we wish to compute over the predictions generated for the validation dataset.

        # Create metric reducer
        metric = datasets.load_metric(
            "squad_v2" if self.data_config.version_2_with_negative else "squad"
        )

        self.reducer = context.wrap_reducer(
            ...  # Arguments elided here; see qa_trial.py for the full call.
        )

We use the metric function associated with the SQuAD dataset from huggingface datasets and apply it after post-processing the predictions in the qa_utils.compute_metrics function.

Determined supports parallel evaluation via custom reducers. The reducer we created above will aggregate predictions across all GPUs then apply the qa_utils.compute_metrics function to the result.
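The collect-then-combine pattern behind a custom reducer can be sketched in plain Python. SimpleReducer below is a hypothetical stand-in that simulates two workers in a single process; it is not Determined's reducer class, and mean_prediction is a toy metric standing in for qa_utils.compute_metrics.

```python
from typing import Callable, Dict, List, Tuple

Prediction = Tuple[int, float]  # (sample index, predicted value)


class SimpleReducer:
    """Hypothetical stand-in for a wrapped reducer: collect per worker, then combine."""

    def __init__(self, compute_metrics: Callable[[List[Prediction]], Dict[str, float]]) -> None:
        self.compute_metrics = compute_metrics
        self.collected: List[Prediction] = []

    def update(self, batch_predictions: List[Prediction]) -> None:
        # Called once per validation batch on each worker.
        self.collected.extend(batch_predictions)

    @staticmethod
    def combine(reducers: List["SimpleReducer"]) -> Dict[str, float]:
        # Simulates gathering every worker's predictions before computing metrics once.
        gathered = [p for r in reducers for p in r.collected]
        return reducers[0].compute_metrics(gathered)


def mean_prediction(predictions: List[Prediction]) -> Dict[str, float]:
    values = [value for _, value in predictions]
    return {"mean": sum(values) / len(values)}


workers = [SimpleReducer(mean_prediction) for _ in range(2)]
workers[0].update([(0, 1.0), (2, 3.0)])
workers[1].update([(1, 2.0), (3, 4.0)])
metrics = SimpleReducer.combine(workers)
print(metrics)
```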

Fill in the Rest of PyTorchTrial

The remaining class methods we must implement are determined.pytorch.PyTorchTrial.build_training_data_loader(), determined.pytorch.PyTorchTrial.build_validation_data_loader(), and determined.pytorch.PyTorchTrial.evaluate_batch().

Build the Dataloaders

The two functions below are responsible for building the dataloaders used for training and validation.

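    def build_training_data_loader(self) -> det_torch.DataLoader:
        return det_torch.DataLoader(
            ...,  # Dataset and remaining arguments elided here; see qa_trial.py.
            batch_size=self.context.get_per_slot_batch_size(),
        )

    def build_validation_data_loader(self) -> det_torch.DataLoader:
        # Determined's distributed batch sampler interleaves shards on each GPU slot so
        # sample i goes to worker with rank i % world_size.  Therefore, we need to re-sort
        # all the samples once we gather the predictions before computing the validation metric.
        return det_torch.DataLoader(
            ...,  # Dataset and remaining arguments elided here; see qa_trial.py.
            batch_size=self.context.get_per_slot_batch_size(),
        )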

There are two things to note:

  • The batch size passed to the dataloader is context.get_per_slot_batch_size(), which is the effective per-GPU batch size when performing distributed training.

  • The dataloader returned is a determined.pytorch.DataLoader, which has the same signature as PyTorch dataloaders but automatically handles data sharding and resuming dataloader state when recovering from a fault.
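The interleaved sharding mentioned in the validation dataloader comment, and the reason predictions must be re-sorted, can be illustrated without any GPU. The sketch below assumes that sample i goes to rank i % world_size, as that comment states, and that the global batch size is split evenly across slots; both helper functions are hypothetical and written only for this example.

```python
from typing import List


def shard_indices(num_samples: int, world_size: int, rank: int) -> List[int]:
    """Indices seen by one worker when sample i goes to rank i % world_size."""
    return [i for i in range(num_samples) if i % world_size == rank]


def per_slot_batch_size(global_batch_size: int, num_slots: int) -> int:
    """Assumed relationship: the global batch is split evenly across slots."""
    return global_batch_size // num_slots


world_size = 2
shards = [shard_indices(6, world_size, rank) for rank in range(world_size)]
# Gathering worker by worker yields an interleaved order, so sort to restore it.
gathered = [i for shard in shards for i in shard]
restored = sorted(gathered)
print(shards, gathered, restored, per_slot_batch_size(16, world_size))
```

This is why the evaluation code keeps each sample's index alongside its prediction: the index is what makes the original order recoverable after the gather.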

Define the Training Routine

The train_batch method inherited from model_hub.huggingface.BaseTransformerTrial, shown below, is sufficient for this example.

    def train_batch(self, batch: Any, epoch_idx: int, batch_idx: int) -> Any:
        # By default, all HF models return the loss in the first element.
        # We do not automatically apply a label smoother for the user.
        # If this is something you want to use, please see how it's
        # applied by transformers.Trainer:
        # https://github.com/huggingface/transformers/blob/v4.3.3/src/transformers/trainer.py#L1324
        outputs = self.model(**batch)
        loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
        # Compute gradients before stepping the optimizer.
        self.context.backward(loss)
        self.context.step_optimizer(self.optimizer, self.grad_clip_fn)
        return loss

Define the Evaluation Routine

Finally, we can define the evaluation routine for this example.

    def evaluate_batch(self, batch: det_torch.TorchData, batch_idx: int) -> Dict:
        ind = batch.pop("ind")
        outputs = self.model(**batch)
        if isinstance(outputs, dict):
            predictions = tuple(
                v.detach().cpu().numpy() for k, v in outputs.items() if k not in ("loss", "mems")
            )
        else:
            # Tuple outputs: skip the loss in the first element and convert the rest.
            predictions = tuple(v.detach().cpu().numpy() for v in outputs[1:])

        self.reducer.update((ind.detach().cpu().numpy(), predictions))
        # Although we are returning the empty dictionary below, we will still get the metrics from
        # custom reducer that we passed to the context during initialization.
        return {}

After passing the batch through the model and doing some processing to get the predictions, we pass the predictions to reducer.update to aggregate them on each GPU. Once each GPU has exhausted the batches in its dataloader, Determined automatically performs an all-gather operation to collect the predictions on the rank 0 GPU before passing them to the compute_metrics function.
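The handling of dict-style versus tuple-style model outputs can be tried in isolation with plain Python values standing in for tensors. The helper below is a hypothetical sketch and not part of the example code; it follows the same convention as the routines above, where the loss is the first element of a tuple output or the "loss" key of a dict output.

```python
from typing import Any, Dict, Tuple, Union


def split_outputs(outputs: Union[Dict[str, Any], Tuple[Any, ...]]) -> Tuple[Any, Tuple[Any, ...]]:
    """Return (loss, predictions) from dict-style or tuple-style model outputs."""
    if isinstance(outputs, dict):
        loss = outputs.get("loss")
        predictions = tuple(v for k, v in outputs.items() if k not in ("loss", "mems"))
    else:
        # Tuple-style outputs: first element is the loss, the rest are predictions.
        loss, predictions = outputs[0], tuple(outputs[1:])
    return loss, predictions


# Lists of floats stand in for tensors of start and end logits.
dict_outputs = {"loss": 0.7, "start_logits": [0.1, 0.9], "end_logits": [0.8, 0.2]}
tuple_outputs = (0.7, [0.1, 0.9], [0.8, 0.2])

print(split_outputs(dict_outputs))
print(split_outputs(tuple_outputs))
```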

HF Library Versions

model-hub support for transformers is tied to specific versions of the source library to ensure compatibility. Be sure to use the latest Docker image with all the necessary dependencies for transformers with model-hub. All provided examples already specify this Docker image.


We periodically bump these libraries up to more recent versions of transformers and datasets so you can access the latest upstream features. That said, once you create a trial definition using a particular Docker image, you will not need to upgrade to a new Docker image for your code to continue working with model-hub. Additionally, your code will continue to work with that image even if you use it with a more recent version of the Determined cluster.

Next Steps