Porting Checklist

When you port your code to Determined, walk through this checklist to ensure your code does not conflict with the Determined library.

Removing Pinned GPUs

Determined handles scheduling jobs on available slots. However, you need to let the Determined library handle choosing the GPUs.

Take this script as an example. It has the following code to configure the GPU:

if args.gpu is not None:
    print("Use GPU: {} for training".format(args.gpu))

Any use of args.gpu should be removed.

Removing Distributed Training Code

To run distributed training outside Determined, you need code that handles the logic of launching processes, moving models to pinned GPUs, sharding data, and reducing metrics. You need to remove this code so that it does not conflict with the Determined library.

Take this script as an example. It has the following code to initialize the process group:

if args.distributed:
    if args.dist_url == "env://" and args.rank == -1:
        args.rank = int(os.environ["RANK"])
    if args.multiprocessing_distributed:
        # For multiprocessing distributed training, rank needs to be the
        # global rank among all the processes
        args.rank = args.rank * ngpus_per_node + gpu
    dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                            world_size=args.world_size, rank=args.rank)

This example also has the following code to set up CUDA and convert the model to a distributed one.

if not torch.cuda.is_available():
    print('using CPU, this will be slow')
elif args.distributed:
    # For multiprocessing distributed, DistributedDataParallel constructor
    # should always set the single device scope, otherwise,
    # DistributedDataParallel will use all available devices.
    if args.gpu is not None:
        torch.cuda.set_device(args.gpu)
        model.cuda(args.gpu)
        # When using a single GPU per process and per
        # DistributedDataParallel, we need to divide the batch size
        # ourselves based on the total number of GPUs we have
        args.batch_size = int(args.batch_size / ngpus_per_node)
        args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
    else:
        model.cuda()
        # DistributedDataParallel will divide and allocate batch_size to all
        # available GPUs if device_ids are not set
        model = torch.nn.parallel.DistributedDataParallel(model)
elif args.gpu is not None:
    torch.cuda.set_device(args.gpu)
    model = model.cuda(args.gpu)
else:
    # DataParallel will divide and allocate batch_size to all available GPUs
    if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
        model.features = torch.nn.DataParallel(model.features)
        model.cuda()
    else:
        model = torch.nn.DataParallel(model).cuda()

This code is unnecessary in the trial definition. When we create the model, we will wrap it with self.context.wrap_model(model), which converts the model to a distributed one if needed. We will also automatically set up Horovod for you. If you would like to access the rank (typically used to view per-GPU training), you can get it by calling self.context.distributed.rank.
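
As a rough sketch, the model setup inside the trial's __init__ reduces to something like the following (the class name ImageNetTrial and the choice of resnet18 are illustrative, and the required data-loader and training methods are omitted):

import torchvision.models as models
from determined.pytorch import PyTorchTrial, PyTorchTrialContext

class ImageNetTrial(PyTorchTrial):  # illustrative name
    def __init__(self, context: PyTorchTrialContext) -> None:
        self.context = context
        # wrap_model() moves the model to the slot assigned to this process
        # and converts it to a distributed model when the experiment uses
        # more than one slot; no .cuda() or DistributedDataParallel calls
        # are needed here.
        self.model = self.context.wrap_model(models.resnet18())
        # Per-process rank, e.g. for logging from the chief process only.
        if self.context.distributed.rank == 0:
            print("model created on the chief process")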

To handle data loading in distributed training, this example has the code below:

traindir = os.path.join(args.data, 'train')
valdir = os.path.join(args.data, 'val')
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

train_dataset = datasets.ImageFolder(
    traindir,
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize,
    ]))

# Handle distributed sampler for distributed training.
if args.distributed:
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
else:
    train_sampler = None

This should be removed, since we will use a distributed data loader automatically if you follow the instructions for build_training_data_loader() and build_validation_data_loader().
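
A sketch of the replacement, which lives inside the trial class, could look like this (self.data_dir is an assumed attribute holding the dataset path; determined.pytorch.DataLoader accepts the same arguments as torch.utils.data.DataLoader and shards the data across slots for you):

import os
from determined.pytorch import DataLoader
from torchvision import datasets, transforms

def build_training_data_loader(self) -> DataLoader:
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    train_dataset = datasets.ImageFolder(
        os.path.join(self.data_dir, "train"),  # self.data_dir is an assumed attribute
        transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ]))
    # No DistributedSampler: the DataLoader above is sharded across slots
    # automatically, and the batch size is the per-slot portion of the
    # global batch size from the experiment configuration.
    return DataLoader(train_dataset,
                      batch_size=self.context.get_per_slot_batch_size(),
                      shuffle=True)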

Getting Hyperparameters from PyTorchTrialContext

Take the following code as an example.

def __init__(self, context: PyTorchTrialContext):
    self.context = context
    if args.pretrained:
        print("=> using pre-trained model '{}'".format(args.arch))
        model = models.__dict__[args.arch](pretrained=True)
    else:
        print("=> creating model '{}'".format(args.arch))
        model = models.__dict__[args.arch]()

args.arch is a hyperparameter. You should define the hyperparameter space in the experiment configuration and use self.context.get_hparams(), which gives you access to all the hyperparameters for the current trial. By doing so, you get better tracking in the WebUI, especially for experiments that use a searcher.
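
For example, here is a sketch of the same model construction using hyperparameters from the context (it assumes arch and pretrained are defined under the hyperparameters section of the experiment configuration):

def __init__(self, context: PyTorchTrialContext) -> None:
    self.context = context
    # All hyperparameters from the experiment configuration are returned
    # as a dictionary.
    hparams = self.context.get_hparams()
    arch = hparams["arch"]  # e.g. "resnet18"; assumed to be defined in the config
    if hparams.get("pretrained", False):
        print("=> using pre-trained model '{}'".format(arch))
        model = models.__dict__[arch](pretrained=True)
    else:
        print("=> creating model '{}'".format(arch))
        model = models.__dict__[arch]()
    self.model = self.context.wrap_model(model)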