This part of the documentation describes how to train a Tensorpack model in PEDL.
There are three steps needed to define a Tensorpack model in PEDL using a Standard Model Definition:
- Define a make_data_loaders() function to specify data access and any preprocessing in the data pipeline.
- Optionally, subclass the abstract class Evaluator. This part of the interface defines the validation process if the validation dataflow is not specified in make_data_loaders().
- Subclass the abstract class TensorpackTrial. This part of the interface defines the deep learning model, including the graph, loss, and optimizers.
Data Loading via make_data_loaders()
A PEDL user prescribes data access in TensorpackTrial by writing a make_data_loaders() function. This function should return a pair of tp.DataFlow objects, the first for the training set and the second for the validation set. Alternatively, this function can return a single tp.DataFlow object to use for training; in that case, a subclass of Evaluator must be provided to define the validation process.
def make_data_loaders(experiment_config, hparams):
    ...
    return trainDataset, valDataset
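To make the contract concrete, here is a minimal, hypothetical sketch of the function's shape. Plain Python lists stand in for the tp.DataFlow objects a real implementation would build, and the "validation_fraction" hyperparameter is purely illustrative, not part of PEDL:

```python
# A sketch of the make_data_loaders() contract. Plain lists stand in for
# tp.DataFlow objects; "validation_fraction" is an invented hyperparameter.
def make_data_loaders(experiment_config, hparams):
    data = list(range(100))  # placeholder for the real dataset
    n_val = int(len(data) * hparams.get("validation_fraction", 0.2))
    train_dataset = data[:-n_val]  # first portion used for training
    val_dataset = data[-n_val:]    # held-out portion used for validation
    return train_dataset, val_dataset

train, val = make_data_loaders({}, {"validation_fraction": 0.1})
```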
For cases where the dataset is too large to be stored locally, PEDL supports downloading data from Google Cloud Storage (GCS). To use this feature, replace code that opens a local file, e.g.:
import cv2

image = cv2.imread(filename)
with a call to GCS:
import cv2
import numpy as np
from google.cloud import storage

from pedl.util import download_gcs_blob_with_backoff

c = storage.Client.create_anonymous_client()
gcs_bucket_name = "bucket_name"
bucket = c.get_bucket(gcs_bucket_name)
blob = bucket.blob(filename)
s = download_gcs_blob_with_backoff(blob)
image = cv2.imdecode(np.asarray(bytearray(s), dtype=np.uint8), cv2.IMREAD_COLOR)
download_gcs_blob_with_backoff implements backoff, a standard error-handling strategy for networked applications in which a client retries a failed request with increasing delays between attempts. This strategy is suggested when reading data from GCS to handle transient network failures and HTTP 429 and 5xx error codes.
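The retry pattern behind download_gcs_blob_with_backoff can be sketched in a few lines. This is a standalone illustration, not PEDL's implementation; the retry count and delay constants are invented for the example:

```python
import random
import time

# Illustrative retry-with-backoff helper (not PEDL's actual code).
# request_fn is any zero-argument callable that may fail transiently.
def call_with_backoff(request_fn, n_retries=5):
    for attempt in range(n_retries):
        try:
            return request_fn()
        except IOError:
            if attempt == n_retries - 1:
                raise  # out of retries: surface the error to the caller
            # The delay grows exponentially with each failure, plus random
            # jitter so that many clients do not retry in lockstep.
            time.sleep(0.1 * 2 ** attempt + random.uniform(0, 0.1))
```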
Instead of defining the validation dataflow in make_data_loaders, PEDL users can subclass Evaluator to perform validation. Using Evaluator allows users to manually specify the validation graph, as well as custom code for computing validation metrics. Users must define the following abstract methods:
- set_up_graph(self, trainer): Builds the validation graph. The trainer argument is the Tensorpack trainer instance used during training.
- compute_validation_metrics(self): Defines the process for computing validation metrics. This function returns the validation metrics.
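As an illustration of this interface, here is a schematic subclass. The base class below is a local stand-in for PEDL's abstract Evaluator (which would be imported from PEDL), and the metric computation is invented for the example:

```python
class Evaluator:  # local stand-in for PEDL's abstract Evaluator base class
    def set_up_graph(self, trainer):
        raise NotImplementedError

    def compute_validation_metrics(self):
        raise NotImplementedError


class SquaredErrorEvaluator(Evaluator):
    """Illustrative Evaluator computing mean squared error over pairs."""

    def __init__(self, val_pairs):
        self.val_pairs = val_pairs  # list of (prediction, label) pairs
        self.trainer = None

    def set_up_graph(self, trainer):
        # In real code this would build the TensorFlow validation graph
        # using the Tensorpack trainer instance that PEDL passes in.
        self.trainer = trainer

    def compute_validation_metrics(self):
        # Return the validation metrics as a name -> value mapping.
        errors = [(p - y) ** 2 for p, y in self.val_pairs]
        return {"mean_squared_error": sum(errors) / len(errors)}
```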
Tensorpack trials are created by subclassing the abstract class TensorpackTrial. Users must define the following abstract methods that will specify the deep learning model associated with a trial in the experiment, as well as how to subsequently train and evaluate it:
- build_model(self, hparams, trainer_type): Builds and returns the Tensorpack model (tp.ModelDesc) to be used during training.
- validation_metrics(self, hparams): If the validation dataflow is specified in make_data_loaders, this function returns a list of metric names that will be evaluated on the validation data set (e.g., "cross_entropy_loss"). Otherwise, this function returns an instance of Evaluator.
- training_metrics(self, hparams): Specifies the training metrics that should be tracked.
- tensorpack_callbacks(self, hparams): Returns a list of Tensorpack callbacks to use during training. Users often control their learning rate schedule via these callbacks (e.g., with ScheduleSetter).
- tensorpack_monitors(self, hparams): Returns a list of Tensorpack monitors to use during training.
- load_backbone_weights(self, hparams): Returns the file path of the backbone weights, which are loaded prior to training.
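Putting the interface together, a schematic subclass might look like the following. The base class here is a local stand-in for PEDL's TensorpackTrial, and the method bodies are placeholders rather than a real model:

```python
class TensorpackTrial:  # local stand-in for PEDL's abstract base class
    pass


class MyTrial(TensorpackTrial):
    def build_model(self, hparams, trainer_type):
        # Real code would return a tp.ModelDesc defining the graph,
        # loss, and optimizer; a string placeholder stands in here.
        return "tp.ModelDesc placeholder"

    def validation_metrics(self, hparams):
        # Metric names evaluated on the validation dataflow.
        return ["cross_entropy_loss"]

    def training_metrics(self, hparams):
        # Training metrics to track.
        return ["cross_entropy_loss"]

    def tensorpack_callbacks(self, hparams):
        # No callbacks in this sketch; see the ScheduleSetter example below.
        return []

    def tensorpack_monitors(self, hparams):
        return []
```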
Learning Rate Schedule
When training models using Tensorpack, users often choose to control learning rate schedules via callbacks.
TensorpackTrial provides PEDL users with ScheduleSetter, which subclasses the Tensorpack Callback class. (ScheduleSetter can be used for any hyperparameter, but is most commonly used to control the learning rate.) ScheduleSetter takes a list of SchedulePoint() objects, which define how the value of the hyperparameter changes during training.
- SchedulePoint(point, value, interp): Specifies that the value of the hyperparameter should be exactly value at the end of step point. interp can either be None, to specify that the value should remain the same until the next point, or "interp", to specify that it should be linearly interpolated between this point and the next.
- ScheduleSetter(param, schedule): Defines the parameter name (e.g., "learning_rate") and a list of schedule points.
An example use case where we reduce the initial learning rate by a factor of 10 every 1000 steps:
from pedl.frameworks.tensorflow.tensorpack_trial import SchedulePoint, ScheduleSetter, TensorpackTrial

def make_schedule():
    init_lr = 0.1
    schedule = []
    for idx in range(10):
        mult = 0.1 ** idx
        schedule.append(SchedulePoint(1000 * idx, init_lr * mult))
    return schedule

class YourTrial(TensorpackTrial):
    ...

    def tensorpack_callbacks(self, hparams):
        return [ScheduleSetter("learning_rate", make_schedule())]
When training in a distributed setting, TensorpackTrial supports a performance optimization that shortens the training time by reducing communication. PEDL users can enable this feature by setting the hyperparameter aggregation_frequency to be greater than 1:
hyperparameters:
  ...
  aggregation_frequency: 4
Aggregation frequency controls how often gradient updates are communicated between workers. This modifies the effective training batch size (the number of training samples processed per gradient update); thus, PEDL users are encouraged to scale their learning rate proportionally with the aggregation frequency:
new_learning_rate = original_learning_rate * aggregation_frequency.
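As a trivial worked instance of this rule (the helper name is ours, not a PEDL API):

```python
# Linear learning-rate scaling for aggregation frequency, per the rule above.
def scaled_learning_rate(original_lr, aggregation_frequency):
    # Each synchronized update now aggregates gradients from
    # aggregation_frequency batches, so the effective batch size (and
    # hence the appropriate step size) grows proportionally.
    return original_lr * aggregation_frequency

scaled_learning_rate(0.1, 4)  # a base rate of 0.1 becomes 0.4
```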