Cluster Configuration

This YAML file provides the cluster configuration in PEDL. In particular, this file contains the following fields:

  • scheduler: Specifies how PEDL schedules tasks to agents.

    • fit: the scheduling policy to use when assigning tasks to agents in the cluster.

      • best: the best-fit policy ensures that tasks will be preferentially "packed" together on the smallest number of agents.

      • worst: the worst-fit policy ensures that tasks will be placed on under-utilized agents.
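
    As a minimal sketch of the scheduler section described above, the fragment below selects the best-fit policy. It assumes that scheduler is a top-level key of the file and that fit takes the literal values best or worst, as the list suggests.

    ```yaml
    # Sketch only: pack tasks onto as few agents as possible.
    scheduler:
      fit: best   # use "worst" to spread tasks across under-utilized agents
    ```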

  • provisioner: Specifies the configuration of dynamic agents.

    • master_url: the full URL of the master, in the format scheme://host:port. The scheme must be either http or https. Rather than hardcoding the IP address, we advise setting the host to one of the following aliases. If the master is deployed on EC2: local-ipv4, public-ipv4, local-hostname, or public-hostname. If the master is deployed on GCP: internal-ip or external-ip. Which alias to choose depends on your network configuration. On master startup, the alias is replaced with its real value. Defaults to http as the scheme, the local IP address as the host, and 8080 as the port.

    • startup_script: startup script for agents. This script runs as soon as an agent instance starts up. For example, it can be used to format and mount a disk. Defaults to an empty string.

    • agent_docker_network: the Docker network to use for the PEDL agent and task containers. If this is set to "host", Docker host-mode networking will be used instead. The default value is "pedl".

    • max_idle_agent_period: length of the waiting period before terminating an idle agent instance. This string is a sequence of decimal numbers, each with optional fraction and a unit suffix, such as "30s", "1h", or "1m30s". Valid time units are "s", "m", "h".

    • provider: aws: Specifies running dynamic agents on AWS. (Required)

      • region: the region of the AWS resources used by PEDL. We advise setting this region to be the same region as the PEDL master for better network performance. Defaults to the same region as the master.

      • root_volume_size: size of the root volume of the PEDL agent in GB. We recommend at least 100GB. Defaults to 200.

      • image_id: the AMI ID of the PEDL agent that was shared with you. (Required)

      • tag_key: key for tagging the PEDL agent instances. Defaults to managed-by.

      • tag_value: value for tagging the PEDL agent instances. Defaults to the master instance ID if the master is on EC2; otherwise, defaults to determined-ai-pedl.

      • instance_name: name to set for the PEDL agent instances. Defaults to determined-ai-agent.

      • ssh_key_name: the name of the SSH key registered with AWS for SSH access to the agent instances. (Required)

      • iam_instance_profile_arn: the Amazon Resource Name (ARN) of the IAM instance profile to attach to the agent instances.

      • network_interface: network interface to set for the PEDL agent instances.

        • public_ip: whether to assign a public IP address to the PEDL agent instances. See Network Requirements for instructions on whether an external IP should be set. Defaults to false.

        • security_group_id: the ID of the security group to attach to the PEDL agent instances. This is the one you identified or created in Network Requirements. Defaults to the default security group of the specified VPC.

        • subnet_id: the ID of the subnet to run the PEDL agents in. Defaults to the default subnet of the default VPC.

      • max_instances: max number of PEDL agent instances. Defaults to 5.

      • instance_type: type of instance for the PEDL agents. We only support P3 and P2 type instances. Defaults to p3.8xlarge.
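
    The AWS fields above could be combined as in the following sketch. It is an illustration rather than a reference configuration: it assumes that the provider-specific fields sit next to provider as siblings under provisioner, and that an alias is written in the host position of master_url. The AMI ID, key name, security group ID, and subnet ID are hypothetical placeholders, and the region and idle period are example values only.

    ```yaml
    # Sketch only: dynamic agents on AWS. Identifiers below are placeholders.
    provisioner:
      master_url: http://local-ipv4:8080    # alias is resolved on master startup
      agent_docker_network: pedl
      max_idle_agent_period: 10m            # example value; units are s, m, h
      provider: aws
      region: us-west-2                     # example; ideally the master's region
      root_volume_size: 200                 # GB
      image_id: ami-0123456789abcdef0       # placeholder for the shared agent AMI
      ssh_key_name: my-ssh-key              # placeholder key name
      network_interface:
        public_ip: false
        security_group_id: sg-0123456789abcdef0   # placeholder
        subnet_id: subnet-0123456789abcdef0       # placeholder
      instance_type: p3.8xlarge
      max_instances: 5
    ```

    Only image_id, ssh_key_name, and provider are marked as required above; the remaining fields are shown with their documented defaults where one exists.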

    • provider: gcp: Specifies running dynamic agents on GCP. (Required)

      • base_config: base configuration of the instance resource, which is merged with the fields below to construct the GCP instance insert request. See REST Resource: instances (https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert) for details.
      • project: the project id of the GCP resources used by PEDL. Defaults to the project of the master.
      • zone: the zone of the GCP resources used by PEDL. Defaults to the zone of the master.

      • boot_disk_size: size of the root volume of the PEDL agent in GB. We recommend at least 100GB. Defaults to 200.

      • boot_disk_source_image: the boot disk source image of the PEDL agent that was shared with you. To use a specific version of the PEDL agent image from a specific project, it should be set in the format: projects/<project-id>/global/images/<image-id>. (Required)

      • label_key: key for labeling the PEDL agent instances. Defaults to managed-by.

      • label_value: value for labeling the PEDL agent instances. Defaults to the master instance name if the master is on GCP; otherwise, defaults to determined-ai-pedl.

      • name_prefix: name prefix to set for the PEDL agent instances. The name of each PEDL agent instance is the concatenation of the name prefix and a pet name. Defaults to the master instance name if the master is on GCP; otherwise, defaults to determined-ai-pedl.

      • network_interface: network configuration for the PEDL agent instances. See the GCP API Access section for the suggested configuration. (Required)

        • network: network resource for the PEDL agent instances. The network configuration should specify the project id of the network. It should be set in the format: projects/<project>/global/networks/<network>. (Required)

        • subnetwork: subnetwork resource for the PEDL agent instances. The subnet configuration should specify the project id and the region of the subnetwork. It should be set in the format: projects/<project>/regions/<region>/subnetworks/<subnetwork>. (Required)

        • external_ip: whether to assign an external IP address to the PEDL agent instances. See Network Requirements for instructions on whether an external IP should be set. Defaults to false.

      • network_tags: an array of network tags used to apply firewall rules to the PEDL agent instances. These are the tags you identified or created in System Requirements - Firewall Rules. Defaults to an empty array.

      • service_account: service account for the PEDL agent instances. See the GCP API Access section for suggested configuration.

        • email: email of the service account for the PEDL agent instances. Defaults to an empty string.

        • scopes: list of scopes authorized for the PEDL agent instances. As suggested in GCP API Access, we recommend you set the scopes to ["https://www.googleapis.com/auth/cloud-platform"]. Defaults to ["https://www.googleapis.com/auth/cloud-platform"].

      • instance_type: type of instance for the PEDL agents.

        • machine_type: type of machine for the PEDL agents. Defaults to n1-standard-32.

        • gpu_type: type of GPU for the PEDL agents. Defaults to nvidia-tesla-v100.

        • gpu_num: number of GPUs for the PEDL agents. Defaults to 4.

      • max_instances: max number of PEDL agent instances. Defaults to 5.
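
    A comparable sketch for GCP follows. As before, it assumes that the provider-specific fields are siblings of provider under provisioner. The angle-bracket segments are left as unfilled placeholders in the formats given above, and the idle period is an example value.

    ```yaml
    # Sketch only: dynamic agents on GCP. Angle-bracket values must be filled in.
    provisioner:
      master_url: http://internal-ip:8080   # alias is resolved on master startup
      max_idle_agent_period: 10m            # example value; units are s, m, h
      provider: gcp
      boot_disk_size: 200                   # GB
      boot_disk_source_image: "projects/<project-id>/global/images/<image-id>"
      network_interface:
        network: "projects/<project>/global/networks/<network>"
        subnetwork: "projects/<project>/regions/<region>/subnetworks/<subnetwork>"
        external_ip: false
      network_tags: []
      service_account:
        scopes: ["https://www.googleapis.com/auth/cloud-platform"]
      instance_type:
        machine_type: n1-standard-32
        gpu_type: nvidia-tesla-v100
        gpu_num: 4
      max_instances: 5
    ```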

  • checkpoint_storage: Specifies where model checkpoints will be stored. A checkpoint contains the architecture and weights of the model being trained. PEDL currently supports four kinds of checkpoint storage, identified by the type subfield: gcs, hdfs, s3, and shared_fs.

    • type: gcs: Checkpoints are stored on Google Cloud Storage (GCS). Authentication is done using GCP's "Application Default Credentials" approach. When using PEDL inside Google Compute Engine (GCE), the simplest approach is to ensure that the VMs used by PEDL run as a service account that has the "Storage Object Admin" role on the GCS bucket being used for checkpoints. As an alternative (or when running outside of GCE), you can add the appropriate service account credentials to your container (e.g., via a bind-mount), and then set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the container path where the credentials are located.

      • bucket: The GCS bucket name to use.
    • type: hdfs: Checkpoints are stored in HDFS using the WebHDFS API for reading and writing checkpoint resources.

      • hdfs_url: Hostname or IP address of HDFS namenode, prefixed with protocol, followed by WebHDFS port on namenode. Multiple namenodes are allowed as a semicolon-separated list (e.g., "http://namenode1:50070;http://namenode2:50070").
      • hdfs_path: The prefix path where all checkpoints will be written to and read from. The resources of each checkpoint will be saved in a subdirectory of hdfs_path, where the subdirectory name is the checkpoint's UUID.
      • user: An optional string value that indicates the user to use for all read and write requests. If left unspecified, the default user of the trial runner container will be used.
      • kerberos (Experimental): An optional boolean value indicating that Kerberos is enabled on the HDFS cluster (defaults to false). If true, Kerberos authentication will be used when connecting to HDFS. Kerberos authentication cannot be combined with the user configuration option. Please see the security/kerberos section for more information about configuring Kerberos.
    • type: s3: Checkpoints are stored in Amazon S3.

      • bucket: The S3 bucket name to use.
      • access_key: The AWS access key to use.
      • secret_key: The AWS secret key to use.
      • endpoint_url: The optional endpoint to use for S3 clones, e.g., http://127.0.0.1:8080/.
    • type: shared_fs: Checkpoints are written to a directory on the agent's file system. The assumption is that the system administrator has arranged for the same directory to be mounted at every agent host, and for the content of this directory to be the same on all agent hosts (e.g., by using a distributed or network file system such as GlusterFS or NFS).

      • host_path: The file system path on each agent to use.
      • container_path: The optional file system path to use as the mount point in the trial runner container. Defaults to /pedl_shared_fs.
      • storage_path: The optional path where checkpoints will be written to and read from. Must be a subdirectory of the host_path or an absolute path containing the host_path. If unset, checkpoints are written to and read from the host_path.
      • propagation: (Advanced users only) Optional propagation behavior for replicas of the bind-mount. Defaults to rprivate.
    • When an experiment finishes, the system will optionally delete some checkpoints to reclaim space. The save_experiment_best, save_trial_best and save_trial_latest parameters specify which checkpoints to save. See the documentation on Checkpoint Garbage Collection for more details.
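
    The storage types above might be written as in the sketch below. Each YAML document (separated by ---) is an alternative, and a real file would contain only one checkpoint_storage section. The sketch assumes that the type-specific fields are siblings of type. Bucket names, paths, and credentials are hypothetical placeholders.

    ```yaml
    # Alternative 1: a shared file system mounted at the same path on every agent.
    checkpoint_storage:
      type: shared_fs
      host_path: /mnt/checkpoints           # placeholder path on each agent
      container_path: /pedl_shared_fs       # documented default
    ---
    # Alternative 2: Amazon S3 or an S3-compatible endpoint.
    checkpoint_storage:
      type: s3
      bucket: my-checkpoint-bucket          # placeholder bucket name
      access_key: "<aws-access-key>"        # placeholder
      secret_key: "<aws-secret-key>"        # placeholder
      endpoint_url: http://127.0.0.1:8080/  # optional; only for S3 clones
    ---
    # Alternative 3: HDFS over WebHDFS with two namenodes.
    checkpoint_storage:
      type: hdfs
      hdfs_url: "http://namenode1:50070;http://namenode2:50070"
      hdfs_path: /pedl/checkpoints          # placeholder prefix path
    ```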