# Worker Managers
Worker Managers allow you to automatically spin up VMs and start CodaLab workers on them to run your staged jobs.
We support the following Worker Managers:
| Name | Description |
|---|---|
| aws-batch | Worker manager for submitting jobs to AWS Batch. |
| azure-batch | Worker manager for submitting jobs to Azure Batch. |
| slurm-batch | Worker manager for submitting jobs using Slurm Batch. |
| kubernetes | Worker manager for submitting jobs to a Kubernetes cluster. |
## Setting a shared cache
To use a shared cache among workers, have all the workers use the same working directory by specifying the same path for `--work-dir`. The working directory is set by `--worker-work-dir-prefix` when starting a worker manager. The dependency managers can be used over NFS, so a working directory can be on a network disk.
```bash
cl-worker-manager --worker-work-dir-prefix /juice slurm-batch --cpus 4 --gpus 1 --memory-mb 16000
```
In the example worker manager command above, `/juice` is a directory on a network disk.
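Because every worker started under the same prefix uses that shared directory, several worker managers can share one cache. A minimal sketch (the resource values are illustrative):

```bash
# Both managers point --worker-work-dir-prefix at the same NFS mount (/juice),
# so the workers they start share one dependency cache.
cl-worker-manager --worker-work-dir-prefix /juice slurm-batch --cpus 4 --gpus 0 --memory-mb 8000
cl-worker-manager --worker-work-dir-prefix /juice slurm-batch --cpus 4 --gpus 1 --memory-mb 16000
```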
## AWS Batch Worker Manager

### Configure AWS Batch (one-time setup)
1. Authenticate AWS on the command line:
    - Install the CLI: `pip install awscli`.
    - Authenticate by running `aws configure` and filling out the form.
2. Create a launch template for EC2 instances by running:

    ```bash
    aws ec2 --region <region> create-launch-template --cli-input-json file://lt.json
    ```

    Your launch template `lt.json` should look something like this:
    ```json
    {
        "LaunchTemplateName": "increase-root-volume",
        "LaunchTemplateData": {
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/xvda",
                    "Ebs": {
                        "Encrypted": true,
                        "VolumeSize": <Desired volume size in GB as an integer>,
                        "VolumeType": "gp2"
                    }
                }
            ]
        }
    }
    ```
3. Log on to the AWS console.
4. In the upper right corner, select the region.
5. Type `Batch` in the search bar and click `Batch` under `Services`.
6. Create a compute environment:
    - Click `Compute environments` and then `Create`.
    - Specify a name for `Compute Environment Name`.
    - Under `Instance Configuration`, select `On-Demand` or `Spot`.
    - Specify the type of EC2 instances under the `Allowed Instance Types` dropdown menu.
    - Under `Additional Settings`, select the launch template you created.
    - Click `Create compute environment`.
7. Configure a job queue:
    - Click `Job queues` and then `Create`.
    - Give your job queue a name.
    - Under `Connected compute environments`, select the compute environment from the previous step.
    - Click `Create`.
8. Wait for the job queue and compute environment to reach a status of `VALID` (see the CLI check below).
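If you prefer the command line for that last check, the AWS CLI can report the same statuses (a sketch; substitute the names you chose above):

```bash
# Both should report "VALID" once AWS finishes provisioning.
aws batch describe-compute-environments \
    --compute-environments <compute environment name> \
    --query 'computeEnvironments[].status'
aws batch describe-job-queues \
    --job-queues <job queue name> \
    --query 'jobQueues[].status'
```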
### Start an AWS Batch Worker Manager

Use the AWS Batch Worker Manager to start the worker manager by passing in `aws-batch` for the type of worker manager. Pass in the name of the job queue for `--job-queue`.
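Below is a sketch of how to start one, modeled on the Azure example later on this page (the shared flags are the same across worker manager types; run `cl-worker-manager aws-batch --help` for the exact set of `aws-batch` options):

```bash
cl-worker-manager --server https://worksheets.codalab.org --min-workers 0 --max-workers 8 \
    --min-seconds-between-workers 300 --sleep-time 120 \
    aws-batch --job-queue <Name of the job queue>
```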
## Azure Batch Worker Manager

### Configure Azure Batch (one-time setup)
1. Log on to the Azure portal using your credentials.
2. Go to the Batch account where you want to start your workers. If you don't have a Batch account, create one through the Azure portal.
3. Click `Keys` and take note of `Batch account`, `URL`, and `Primary access key`, as you will need this information to start the worker manager.
4. Next, create and configure a Batch Pool and a Batch Job.
### How to create a Batch Pool
1. Go to `Pools` and select `Add`.
2. Under `Pool Detail`:
    - For `Pool ID`, give your Pool a unique name.
    - Skip `Display Name`.
3. Under `Operating System`:
    - Keep `Image Type` as `Marketplace`.
    - For `Publisher`, select `microsoft-azure-batch`.
    - For `Offer`, select `ubuntu-server-container`.
    - For `Sku`, select `20-04-lts`.
    - Toggle `Container configuration` to `Custom`.
    - Make sure `Container type` is `Docker compatible`.
4. Under `Node Size`, select the appropriate VM size. For example, for a CPU-only pool you could select `Standard D3_v2 (4 vCPUs, 14 GB Memory)`; for a GPU pool, you could select `Standard NC6 (6 vCPUs, 56 GB Memory)`.
5. Under `Scale`:
    - Toggle `Mode` to `Auto scale`.
    - Set `AutoScale Evaluation Interval` to an appropriate time. This controls how often the pool autoscales.
    - For `Formula`, create your custom autoscale formula based on your compute needs. The following is an example:
    ```
    // The pool size is adjusted based on the number of tasks in the queue.
    // The variables prepended with '$' in this formula are Azure service-defined variables.
    // Adjust the min and max number of VMs accordingly.
    minNumberOfVMs = 1;
    maxNumberOfVMs = 15;
    // Samples are obtained every 30 seconds over a 5 minute interval.
    pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(5 * TimeInterval_Minute);
    // If we have more than 50 percent of the data points, we use the history average of the number of tasks.
    // It is bad practice to simply use the last sample, as it can be stale and not indicative of the current situation.
    pendingTaskSamples = pendingTaskSamplePercent < 50 ? minNumberOfVMs : avg($PendingTasks.GetSample(5 * TimeInterval_Minute));
    pendingTaskSamples = max(pendingTaskSamples, minNumberOfVMs);
    $TargetDedicatedNodes = min(pendingTaskSamples, maxNumberOfVMs);
    // Set node deallocation mode - keep nodes active only until tasks finish.
    $NodeDeallocationOption = taskcompletion;
    ```
6. Create the Batch Pool by clicking `OK`.
7. Select `Pools` and ensure that the Batch Pool you just created shows up on the page.
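If you later want to change the formula without clicking through the portal, the Azure CLI can apply it directly. A sketch, assuming you are authenticated against the Batch account (e.g., via `az login` and `az batch account login`) and `formula.txt` holds the formula above:

```bash
# Apply (or update) the autoscale formula on an existing pool.
az batch pool autoscale enable \
    --pool-id <Pool ID> \
    --auto-scale-formula "$(cat formula.txt)" \
    --auto-scale-evaluation-interval "PT5M"
```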
### How to create a Batch Job
1. Go to `Jobs` and select `Add`.
2. For `Job ID`, give your Job a unique ID. For example, you can give it a name with the format `{environment}-{resource type}`, where `environment` is either `prod` or `dev` and resource type is either `gpu` or `cpu` (e.g., `prod-cpu`).
3. For `Pool`, select the corresponding Batch Pool.
4. Create the Batch Job by clicking `OK`.
5. Select `Jobs` and ensure that your Batch Job shows up on the page.
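Equivalently, the Batch Job can be created from the Azure CLI (a sketch; assumes the same authentication as above):

```bash
# Create a Batch Job bound to the pool created earlier.
az batch job create --id <Job ID> --pool-id <Pool ID>
```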
### Start an Azure Batch Worker Manager

Use the Azure Batch Worker Manager to start the worker manager by passing in `azure-batch` for the type of worker manager.

Below is an example of how to start a worker manager:
```bash
cl-worker-manager --server https://worksheets.codalab.org --min-workers 0 --max-workers 8 \
    --min-seconds-between-workers 300 --sleep-time 120 --worker-pass-down-termination \
    --worker-exit-on-exception --worker-exit-after-num-runs 1 azure-batch \
    --account-name <Azure Batch account name> --account-key <Azure Batch account key> \
    --service-url <Azure Batch service URL> --log-container-url <URL of the Azure Storage container to store the worker logs> \
    --job-id <Name of the Batch Job> --cpus <Number of CPUs on VM> --gpus <Number of GPUs on VM> --memory-mb <Amount of memory on VM in MB>
```
### Checking worker logs in Azure

#### For a running bundle
1. Go to the bundle view page of the bundle and get the worker ID from the `remote` field. The remote field is in the form `<hostname>-<worker ID>`.
2. Log in to `portal.azure.com` and go to your Batch account.
3. Under `Features`, select `Jobs`.
4. Select the Batch Job your worker was running on.
5. Search for the task by typing `cl_worker_<worker ID>` into the search bar.
6. Open the task for the log files of the running worker.
#### For a failed bundle
1. Go to the bundle view page of the bundle and get the worker ID from the `remote` field. The remote field is in the form `<hostname>-<worker ID>`.
2. Log in to `portal.azure.com` and go to the storage account.
3. Under `Blob service`, select `Containers`.
4. Select the Batch Job your worker was running on.
5. Search for the blob by typing `cl_worker_<worker ID>` into the search bar.
6. Open the blob and select `Edit` to view the logs in the browser. Select `Download` to download the file.
### Force kill an Azure Batch worker

Sometimes, if a bundle cannot be killed, you may want to force kill the Azure Batch worker. Note: this will kill all other bundles that are running on this worker, so only do this if you absolutely need to (i.e., if the bundle cannot be stopped otherwise). To do so:

1. Follow the steps in the previous section to get the worker ID of the running bundle, then navigate to the corresponding task in the Azure console.
2. Click `Terminate` to terminate the worker.
3. Look through the logs, if useful, and file a GitHub issue related to the problem this particular worker was having.
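The same termination can also be done from the Azure CLI (a sketch; assumes you are authenticated against the Batch account):

```bash
# Terminating the worker's task kills every bundle running on that worker.
az batch task terminate --job-id <Name of the Batch Job> --task-id cl_worker_<worker ID>
```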
## Kubernetes Batch Worker Manager

### Configure GKE (one-time setup)

#### Setting up gcloud and kubectl
1. The Cloud SDK is needed to manage GKE clusters. Follow Google's installation instructions to install the Cloud SDK.
2. Log in to gcloud by running `gcloud auth login`.
3. Set the project by running `gcloud config set project <project ID>` (e.g., `gcloud config set project hai-gcp-natural-language`).
4. Install kubectl by running `gcloud components install kubectl`.
#### Creating a cluster
The CodaLab Kubernetes worker manager creates pods in the GKE cluster to run jobs. Create a cluster with `gcloud container clusters create`; the GKE documentation describes the available parameters and their values.

We will create two node pools when starting a GKE cluster:

- The default pool comprises a single e2-standard machine that runs essential, non-GPU jobs (e.g., running the NFS server).
- The GPU pool runs CodaLab workers.

Creating separate pools for GPU vs. non-GPU jobs allows the GPU pool to scale down to 0 when there aren't any running CodaLab jobs.

Below is an example of how to create a GKE cluster with a separate GPU pool:
```bash
gcloud container clusters create codalab-worker-manager-cluster \
    --zone us-west1-a \
    --machine-type e2-standard-4 \
    --disk-type=pd-ssd \
    --disk-size 100GB \
    --num-nodes 1 \
    --image-type UBUNTU \
    --scopes=cloud-platform,gke-default

gcloud beta container node-pools create gpu-pool \
    --cluster codalab-worker-manager-cluster \
    --zone us-west1-a \
    --machine-type n1-standard-8 \
    --disk-type=pd-ssd \
    --disk-size 256GB \
    --num-nodes 0 \
    --min-nodes 0 \
    --max-nodes 8 \
    --enable-autoscaling \
    --image-type UBUNTU \
    --accelerator type=nvidia-tesla-p100,count=1 \
    --scopes=cloud-platform,gke-default \
    --spot
```
The commands above create an auto-scaling cluster in `us-west1-a` with no n1-standard-8 nodes at initialization and the option to auto-scale up to 8 Spot nodes with P100 GPUs.

If you do not specify a cluster version with the `--cluster-version` argument, GCP will create the cluster with the default version in the Stable channel.

Setting `--scopes=cloud-platform,gke-default` is required to configure Nvidia drivers and dependencies for the nodes.

Also, note that only 74% of the VM's memory is available to a CodaLab worker (see the GKE documentation on allocatable memory for more information).
Next, run the following to ensure that the cluster scales down more aggressively:

```bash
yes Y | gcloud beta container clusters update codalab-worker-manager-cluster \
    --autoscaling-profile optimize-utilization --zone us-west1-a
```
#### Deleting a GKE cluster

To delete a cluster, simply run:

```bash
yes Y | gcloud container clusters delete codalab-worker-manager-cluster --zone us-west1-a
```
#### Managing a GKE cluster

Use kubectl to manage the cluster and pods. Here is a list of common kubectl commands:

```bash
kubectl describe nodes    # Detailed state of each node, including GPU capacity
kubectl describe pods     # Detailed state and events of each pod
kubectl delete pods <pod> # Delete a pod
kubectl get pods          # List pods in the default namespace
kubectl get pods -A       # List pods in all namespaces
kubectl logs <pod> -c <container>  # Print the logs of a container in a pod
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml  # Inspect autoscaler status
```

You can also manage a GKE cluster in the GCP console by going to the Kubernetes Engine page.
#### Installing Nvidia Driver and Dependencies
Nvidia drivers and the nvidia-container-runtime tool by default are not installed in most GCP virtual machines. Therefore, we need to create node initializers that download and install these dependencies on existing nodes and future nodes that are brought up by auto-scaling.
Additionally, the NVIDIA Device Plugin is required to expose the GPUs of the nodes in your cluster. The NVIDIA Device Plugin is a daemonset that automatically enumerates the number of GPUs on each node of the cluster and allows pods to be run on GPUs.
For more information on bootstrapping GKE nodes with DaemonSets, see the GKE documentation.

To set this up:
1. Go to the `gcp` directory of this repository: `cd docs/gcp`.
2. Run `kubectl apply -f cm-entrypoint.yaml && kubectl apply -f daemon-set.yaml`.
3. Create the Nvidia Device Plugin by running `kubectl create -f nvidia-device-plugin.yaml`.
4. To verify that the Nvidia drivers are installed correctly:
    - Go to the GCP Compute Engine console.
    - Find a virtual machine with a GPU that belongs to your GKE cluster and connect to the machine by clicking the `SSH` button.
    - In the terminal session, run `sudo nvidia-smi` to see if the driver can communicate with the GPU.
    - Run `sudo docker run --runtime=nvidia --rm nvidia/cuda:11.0-base nvidia-smi` and check that the output is the same as the output in the previous step.
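You can also confirm from kubectl that the cluster exposes GPUs to pods. A sketch (the column expression is one common way to query allocatable GPUs):

```bash
# GPU nodes should report an allocatable nvidia.com/gpu count of at least 1.
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```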
#### Optional: Setting up a Network File System (NFS) server

You can attach additional storage by creating an NFS server.

To set this up:
1. Create a compute disk named `pd` in GCP by running `gcloud compute disks create --size=<Size of disk in GB>GB --zone=us-west1-a --type pd-ssd pd`.
2. Go to the `docs/gcp` directory of this repository: `cd docs/gcp`.
3. Run `kubectl apply -f nfs-server.yaml && kubectl apply -f nfs-service.yaml && kubectl get svc nfs-server`. This will output the IP address of the cluster.
4. Update `nfs-pv.yaml` with the IP address from step 3.
5. Run `kubectl apply -f nfs-pv.yaml`.
6. Take note of the persistent volume name and the volume mount path specified in `nfs-pv.yaml`, as you will need these to start the worker manager.
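As a quick sanity check, verify that the persistent volume and NFS service exist:

```bash
kubectl get pv               # The volume from nfs-pv.yaml should be listed
kubectl get svc nfs-server   # Re-prints the NFS server's cluster IP
```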
#### Authentication and setting up a service account

A GCP service account and cluster certificate are required to authenticate and run Kubernetes commands through the worker manager.

To create a service account:
1. Run `cd codalab-worksheets && kubectl apply -f docs/gcp/service-account.yaml --namespace default`.
2. Then, run `kubectl get secrets --namespace default`.
3. Get the auth token by running `kubectl describe secret/codalab-service-account-secret`.
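If you want just the decoded token value (e.g., to pass to `--auth-token` below), the following one-liner extracts it (a sketch; assumes the secret name above):

```bash
kubectl get secret codalab-service-account-secret --namespace default \
    -o jsonpath='{.data.token}' | base64 --decode
```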
To get the cluster certificate:

1. Go to the GKE console and click on the newly created GKE cluster.
2. Under `Cluster Basics`, find the `Endpoint` field.
3. Take note of the endpoint URL, as this is needed to start the worker manager.
4. Next, click on `Show Cluster Certificate`, copy the entire contents, and place it in a file called `gke.crt`.
### Start a Kubernetes Batch Worker Manager

Use the Kubernetes Batch Worker Manager to start the worker manager by passing in `kubernetes` for the type of worker manager.
At this point, four things are required to start a Kubernetes worker manager:

- A running Kubernetes cluster with Nvidia drivers installed
- An auth token
- The path to the GKE cluster certificate
- The endpoint URL of the cluster host

You can start a Kubernetes worker manager manually by using the `cl-worker-manager` command.

Below is an example of how to start a worker manager:
```bash
cl-worker-manager --server https://worksheets.codalab.org --min-workers 0 --max-workers 8 \
    --min-seconds-between-workers 300 --sleep-time 120 --worker-pass-down-termination \
    --worker-exit-on-exception --worker-exit-after-num-runs 1 kubernetes \
    --cert-path <Path to gke.crt> --auth-token <Auth token> --cluster-host <Endpoint URL of cluster host> \
    --cpus <Number of CPUs on VM> --gpus <Number of GPUs on VM> --memory-mb <Amount of memory on VM in MB>
```
If you want to mount an NFS volume to the worker pods, additionally specify the following arguments (see the example after this list):

- `--nfs-volume-name`: Name of the persistent volume
- `--worker-work-dir-prefix`: NFS volume mount path
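For example, appended to the `kubernetes` arguments above (the volume name and mount path here are illustrative; use the values from your `nfs-pv.yaml`):

```bash
cl-worker-manager --server https://worksheets.codalab.org --min-workers 0 --max-workers 8 kubernetes \
    --cert-path <Path to gke.crt> --auth-token <Auth token> --cluster-host <Endpoint URL of cluster host> \
    --nfs-volume-name nfs-pv --worker-work-dir-prefix /mnt/nfs
```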
### Checking worker logs in GCP

1. Go to the bundle view page of the bundle and get the worker ID from the `remote` field.
2. Go to the GKE console.
3. In the `Cluster` dropdown menu, specify the GKE cluster the worker is running on.
4. Click on the pod with the name `cl-worker-<ID of worker from step 1>`.
5. Click on the `Logs` tab to view the worker logs.
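Alternatively, the same logs are available from the command line (pod names follow the `cl-worker-<worker ID>` pattern from step 4):

```bash
kubectl logs cl-worker-<worker ID>
```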