AWS Spot Instances¶
This document describes how to use AWS spot instances with Determined. Spot instances can be much cheaper than on-demand instances (up to 90% cheaper, but more often 70-80%) but they are unreliable, so software that runs on spot instances must be fault tolerant. Unfortunately, deep learning code is often not written with fault tolerance in mind, preventing many practitioners from using spot instances easily. Because Determined was built with fault tolerance as a core feature, it works seamlessly when run on top of spot instances, allowing users to reduce their training costs by setting a single flag in the Cluster Configuration.
Determined handles most details of working with spot instances, but knowing how spot instances work at a high level is helpful for understanding the behavior of Determined when provisioning spot instances.
On-Demand vs Spot¶
The standard way to create an EC2 instance is to create an on-demand instance. On-demand instances always have the same price and once you have created one, you will keep that instance indefinitely.
The number of EC2 instances desired by customers varies over time. AWS wants to have enough hardware so that during peak usage periods, all customers are able to get as many on-demand instances as they need. This means that during non-peak periods (which is most of the time), AWS has extra hardware capacity sitting around unused. AWS makes these extra instances available to customers via the spot market as spot instances. They can be rented at a much reduced cost, but the trade-off is that, if AWS needs instances to satisfy on-demand users, they can reclaim your spot instance and give it to the on-demand user. If all available instances are in use by on-demand customers, you will not be able to launch any spot instances.
Because spot capacity depends on supply and demand, your ability to launch spot instances will vary by region/availability zone, as well as instance type. GPU instances, particularly those with the most modern GPUs, are often in high demand and they may sometimes be difficult to get as spot instances.
Using Spot Instances with Determined¶
Because they can be reclaimed, using spot instances requires you to have
good fault tolerance built into your software. Determined was built with
fault-tolerance as a core feature, so using spot instances is usually as
easy as configuring a resource pool with
true in the master configuration. Here is a fragment of a master
configuration file that defines a resource pool with up to 10 p2.8xlarge
resource_pools: - pool_name: aws-spot-p2-8xlarge provider: type: aws max_instances: 10 instance_type: p2.8xlarge spot: true # ...
det deploy to install Determined on AWS, you can also
--spot when running
det deploy aws up to cause the
default CPU and GPU resource pools to use spot instances.
AWS might not always have the capacity to fulfill spot requests. When
AWS is out of capacity, the Determined cluster will wait until AWS has
capacity and can fulfill the requests. This will be visible in the
master logs. A useful approach is to configure the master with two
resource pools, one that uses spot instances and another that uses
on-demand instances. This allows users to easily select whether to use
spot or on-demand instances for a given job by setting
resources.resource_pool appropriately in their experiment
Unlike on-demand instances, the market price of a spot instance varies. Once the spot instance has been created, the hourly cost to run that instance is constant, but if you try to create another spot instance, the price may have changed. The spot price is typically around 70% less than the on-demand price (see AWS’s spot advisor for up-to-date information), but it can technically rise as high as the on-demand price.
AWS and Determined allow you to specify the maximum price that you are
willing to pay for a spot instance. If the market price is above that
number, the spot instance will not be created until the price falls
under your maximum price. You can set this value via the
spot_max_price field in the master configuration. The market price
being above your
spot_max_price is another reason why Determined may
not be creating spot instances when you expect it to. If this is
preventing Determined from creating spot instances, this will be visible
in the master logs.
Many users want to reduce costs by using spot, but have deadlines and
are not willing to delay their experiments. In this case, it may be best
to not set
spot_max_price and pay whatever the market price is. Your
mileage may vary, but at Determined, we have regularly seen 70% cost
reductions when using V100s, without specifying a