# Worker Managers
Worker Managers allow you to automatically spin up VMs and start CodaLab workers on them to run your staged jobs.
We support the following Worker Managers:
| Name | Description |
|---|---|
| aws-batch | Worker manager for submitting jobs to AWS Batch. |
| azure-batch | Worker manager for submitting jobs to Azure Batch. |
| slurm-batch | Worker manager for submitting jobs using Slurm Batch. |
| kubernetes | Worker manager for submitting jobs to a Kubernetes cluster. |
## Setting a shared cache
To use a shared cache among workers, have all the workers use the same working directory by specifying the same path for `--work-dir`. The working directory is set by `--worker-work-dir-prefix` when starting a worker manager. The dependency managers can be used over NFS, so a working directory can be on a network disk.
```bash
cl-worker-manager --worker-work-dir-prefix /juice slurm-batch --cpus 4 --gpus 1 --memory-mb 16000
```
In the example worker manager command above, `/juice` is a directory on a network disk.
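Because every worker started under the same prefix uses that shared directory, several worker managers can share one cache. A minimal sketch (the resource values are illustrative):

```bash
# Both managers point --worker-work-dir-prefix at the same NFS mount (/juice),
# so the workers they start share one dependency cache.
cl-worker-manager --worker-work-dir-prefix /juice slurm-batch --cpus 4 --gpus 0 --memory-mb 8000
cl-worker-manager --worker-work-dir-prefix /juice slurm-batch --cpus 4 --gpus 1 --memory-mb 16000
```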
## AWS Batch Worker Manager

### Configure AWS Batch (one-time setup)
1. Authenticate AWS on the command line:
    - Install the CLI: `pip install awscli`.
    - Authenticate by running `aws configure` and filling out the form.
2. Create a launch template for EC2 instances by running:

    ```bash
    aws ec2 --region <region> create-launch-template --cli-input-json file://lt.json
    ```

    Your launch template `lt.json` should look something like this:
    ```json
    {
        "LaunchTemplateName": "increase-root-volume",
        "LaunchTemplateData": {
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/xvda",
                    "Ebs": {
                        "Encrypted": true,
                        "VolumeSize": <Desired volume size in GB as an integer>,
                        "VolumeType": "gp2"
                    }
                }
            ]
        }
    }
    ```
3. Log on to the AWS console.
4. In the upper right corner, select the region.
5. Type `Batch` in the search bar and click `Batch` under `Services`.
6. Create a compute environment:
    - Click `Compute environments` and then `Create`.
    - Specify a name for `Compute Environment Name`.
    - Under `Instance Configuration`, select `On-Demand` or `Spot`.
    - Specify the type of EC2 instances under the `Allowed Instance Types` dropdown menu.
    - Under `Additional Settings`, select the launch template you created.
    - Click `Create compute environment`.
7. Configure a job queue:
    - Click `Job queues` and then `Create`.
    - Give your job queue a name.
    - Under `Connected compute environments`, select the compute environment from the previous step.
    - Click `Create`.
8. Wait for the job queue and compute environment to reach a status of `VALID` (see the CLI check below).
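If you prefer the command line for that last check, the AWS CLI can report the same statuses (a sketch; substitute the names you chose above):

```bash
# Both should report "VALID" once AWS finishes provisioning.
aws batch describe-compute-environments \
    --compute-environments <compute environment name> \
    --query 'computeEnvironments[].status'
aws batch describe-job-queues \
    --job-queues <job queue name> \
    --query 'jobQueues[].status'
```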
### Start an AWS Batch Worker Manager

Use the AWS Batch Worker Manager to start the worker manager by passing in `aws-batch` for the type of worker manager. Pass in the name of the job queue for `--job-queue`.
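Below is a sketch of how to start one, modeled on the Azure example later on this page (the shared flags are the same across worker manager types; run `cl-worker-manager aws-batch --help` for the exact set of `aws-batch` options):

```bash
cl-worker-manager --server https://worksheets.codalab.org --min-workers 0 --max-workers 8 \
    --min-seconds-between-workers 300 --sleep-time 120 \
    aws-batch --job-queue <Name of the job queue>
```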
## Azure Batch Worker Manager

### Configure Azure Batch (one-time setup)
1. Log on to the Azure portal using your credentials.
2. Go to the Batch account where you want to start your workers. If you don't have a Batch account, create one through the Azure portal.
3. Click `Keys` and take note of `Batch account`, `URL`, and `Primary access key`, as you will need this information to start the worker manager.
4. Next, create and configure a Batch Pool and a Batch Job.
### How to create a Batch Pool
1. Go to `Pools` and select `Add`.
2. Under `Pool Detail`:
    - For `Pool ID`, give your Pool a unique name.
    - Skip `Display Name`.
3. Under `Operating System`:
    - Keep `Image Type` as `Marketplace`.
    - For `Publisher`, select `microsoft-azure-batch`.
    - For `Offer`, select `ubuntu-server-container`.
    - For `Sku`, select `20-04-lts`.
    - Toggle `Container configuration` to `Custom`.
    - Make sure `Container type` is `Docker compatible`.
4. Under `Node Size`, select the appropriate VM size. For example, for a CPU-only pool you could select `Standard D3_v2 (4 vCPUs, 14 GB Memory)`; for a GPU pool, you could select `Standard NC6 (6 vCPUs, 56 GB Memory)`.
5. Under `Scale`:
    - Toggle `Mode` to `Auto scale`.
    - Set `AutoScale Evaluation Interval` to an appropriate time. This controls how often the pool autoscales.
    - For `Formula`, create your custom autoscale formula based on your compute needs. The following is an example:
    ```
    // The pool size is adjusted based on the number of tasks in the queue.
    // The variables prepended with '$' in this formula are Azure service-defined variables.
    // Adjust the min and max number of VMs accordingly.
    minNumberOfVMs = 1;
    maxNumberOfVMs = 15;
    // Samples are obtained every 30 seconds over a 5 minute interval.
    pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(5 * TimeInterval_Minute);
    // If we have more than 50 percent of the data points, we use the history average of the number of tasks.
    // It is bad practice to simply use the last sample, as it can be stale and not indicative of the current situation.
    pendingTaskSamples = pendingTaskSamplePercent < 50 ? minNumberOfVMs : avg($PendingTasks.GetSample(5 * TimeInterval_Minute));
    pendingTaskSamples = max(pendingTaskSamples, minNumberOfVMs);
    $TargetDedicatedNodes = min(pendingTaskSamples, maxNumberOfVMs);
    // Set node deallocation mode - keep nodes active only until tasks finish.
    $NodeDeallocationOption = taskcompletion;
    ```
6. Create the Batch Pool by clicking `OK`.
7. Select `Pools` and ensure that the Batch Pool you just created shows up on the page.
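If you later want to change the formula without clicking through the portal, the Azure CLI can apply it directly. A sketch, assuming you are authenticated against the Batch account (e.g., via `az login` and `az batch account login`) and `formula.txt` holds the formula above:

```bash
# Apply (or update) the autoscale formula on an existing pool.
az batch pool autoscale enable \
    --pool-id <Pool ID> \
    --auto-scale-formula "$(cat formula.txt)" \
    --auto-scale-evaluation-interval "PT5M"
```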
### How to create a Batch Job
1. Go to `Jobs` and select `Add`.
2. For `Job ID`, give your Job a unique ID. For example, you can give it a name with the format `{environment}-{resource type}`, where `environment` is either `prod` or `dev` and resource type is either `gpu` or `cpu` (e.g., `prod-cpu`).
3. For `Pool`, select the corresponding Batch Pool.
4. Create the Batch Job by clicking `OK`.
5. Select `Jobs` and ensure that your Batch Job shows up on the page.
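Equivalently, the Batch Job can be created from the Azure CLI (a sketch; assumes the same authentication as above):

```bash
# Create a Batch Job bound to the pool created earlier.
az batch job create --id <Job ID> --pool-id <Pool ID>
```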
### Start an Azure Batch Worker Manager

Use the Azure Batch Worker Manager to start the worker manager by passing in `azure-batch` for the type of worker manager.

Below is an example of how to start a worker manager:
```bash
cl-worker-manager --server https://worksheets.codalab.org --min-workers 0 --max-workers 8 \
    --min-seconds-between-workers 300 --sleep-time 120 --worker-pass-down-termination \
    --worker-exit-on-exception --worker-exit-after-num-runs 1 azure-batch \
    --account-name <Azure Batch account name> --account-key <Azure Batch account key> \
    --service-url <Azure Batch service URL> --log-container-url <URL of the Azure Storage container to store the worker logs> \
    --job-id <Name of the Batch Job> --cpus <Number of CPUs on VM> --gpus <Number of GPUs on VM> --memory-mb <Amount of memory on VM in MB>
```
### Checking worker logs in Azure

#### For a running bundle
1. Go to the bundle view page of the bundle and get the worker ID from the `remote` field. The remote field is in the form `<hostname>-<worker ID>`.
2. Log in to `portal.azure.com` and go to your Batch account.
3. Under `Features`, select `Jobs`.
4. Select the Batch Job your worker was running on.
5. Search for the task by typing `cl_worker_<worker ID>` into the search bar.
6. Open the task for the log files of the running worker.
#### For a failed bundle
1. Go to the bundle view page of the bundle and get the worker ID from the `remote` field. The remote field is in the form `<hostname>-<worker ID>`.
2. Log in to `portal.azure.com` and go to the storage account.
3. Under `Blob service`, select `Containers`.
4. Select the Batch Job your worker was running on.
5. Search for the blob by typing `cl_worker_<worker ID>` into the search bar.
6. Open the blob and select `Edit` to view the logs in the browser. Select `Download` to download the file.
### Force kill an Azure Batch worker

Sometimes, if a bundle cannot be killed, you may want to force kill the Azure Batch worker. Note: this will kill all other bundles that are running on this worker, so only do this if you absolutely need to (i.e., if the bundle cannot be stopped otherwise). To do so:

1. Follow the steps in the previous section to get the worker ID of the running bundle, then navigate to the corresponding task in the Azure console.
2. Click `Terminate` to terminate the worker.
3. Look through the logs, if useful, and file a GitHub issue related to the problem this particular worker was having.
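The same termination can also be done from the Azure CLI (a sketch; assumes you are authenticated against the Batch account):

```bash
# Terminating the worker's task kills every bundle running on that worker.
az batch task terminate --job-id <Name of the Batch Job> --task-id cl_worker_<worker ID>
```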
## Kubernetes Batch Worker Manager

### Configure GKE (one-time setup)

#### Setting up gcloud and kubectl
1. The Cloud SDK is needed to manage GKE clusters. Follow Google's installation instructions to install the Cloud SDK.
2. Log in to gcloud by running `gcloud auth login`.
3. Set the project by running `gcloud config set project <project ID>` (e.g., `gcloud config set project hai-gcp-natural-language`).
4. Install kubectl by running `gcloud components install kubectl`.
#### Creating a cluster
The CodaLab Kubernetes worker manager creates pods in the GKE cluster to run jobs. Create a cluster with `gcloud container clusters create`; the GKE documentation describes the available parameters and their values.

We will create two node pools when starting a GKE cluster:

- The default pool comprises a single e2-standard machine that runs essential, non-GPU jobs (e.g., running the NFS server).
- The GPU pool runs CodaLab workers.

Creating separate pools for GPU vs. non-GPU jobs allows the GPU pool to scale down to 0 when there aren't any running CodaLab jobs.

Below is an example of how to create a GKE cluster with a separate GPU pool:
```bash
gcloud container clusters create codalab-worker-manager-cluster \
    --zone us-west1-a \
    --machine-type e2-standard-4 \
    --disk-type=pd-ssd \
    --disk-size 100GB \
    --num-nodes 1 \
    --image-type UBUNTU \
    --scopes=cloud-platform,gke-default

gcloud beta container node-pools create gpu-pool \
    --cluster codalab-worker-manager-cluster \
    --zone us-west1-a \
    --machine-type n1-standard-8 \
    --disk-type=pd-ssd \
    --disk-size 256GB \
    --num-nodes 0 \
    --min-nodes 0 \
    --max-nodes 8 \
    --enable-autoscaling \
    --image-type UBUNTU \
    --accelerator type=nvidia-tesla-p100,count=1 \
    --scopes=cloud-platform,gke-default \
    --spot
```
The commands above create an auto-scaling cluster in `us-west1-a` with no n1-standard-8 nodes at initialization and the option to auto-scale up to 8 Spot nodes with P100 GPUs.

If you do not specify a cluster version with the `--cluster-version` argument, GCP will create the cluster with the default version in the Stable channel.

Setting `--scopes=cloud-platform,gke-default` is required to configure Nvidia drivers and dependencies for the nodes.

Also, note that only 74% of the VM's memory is available to a CodaLab worker (see the GKE documentation on allocatable memory for more information).
Next, run the following to ensure that the cluster scales down more aggressively:

```bash
yes Y | gcloud beta container clusters update codalab-worker-manager-cluster \
    --autoscaling-profile optimize-utilization --zone us-west1-a
```
#### Deleting a GKE cluster

To delete a cluster, simply run:

```bash
yes Y | gcloud container clusters delete codalab-worker-manager-cluster --zone us-west1-a
```
#### Managing a GKE cluster

Use kubectl to manage the cluster and pods. Here is a list of common kubectl commands:

```bash
kubectl describe nodes    # Detailed state of each node, including GPU capacity
kubectl describe pods     # Detailed state and events of each pod
kubectl delete pods <pod> # Delete a pod
kubectl get pods          # List pods in the default namespace
kubectl get pods -A       # List pods in all namespaces
kubectl logs <pod> -c <container>  # Print the logs of a container in a pod
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml  # Inspect autoscaler status
```

You can also manage a GKE cluster in the GCP console by going to the Kubernetes Engine page.
#### Installing Nvidia Driver and Dependencies
Nvidia drivers and the nvidia-container-runtime tool by default are not installed in most GCP virtual machines. Therefore, we need to create node initializers that download and install these dependencies on existing nodes and future nodes that are brought up by auto-scaling.
Additionally, the NVIDIA Device Plugin is required to expose the GPUs of the nodes in your cluster. The NVIDIA Device Plugin is a daemonset that automatically enumerates the number of GPUs on each node of the cluster and allows pods to be run on GPUs.
For more information on bootstrapping GKE nodes with DaemonSets, see the GKE documentation.

To set this up:
1. Go to the `gcp` directory of this repository: `cd docs/gcp`.
2. Run `kubectl apply -f cm-entrypoint.yaml && kubectl apply -f daemon-set.yaml`.
3. Create the Nvidia Device Plugin by running `kubectl create -f nvidia-device-plugin.yaml`.
4. To verify that the Nvidia drivers are installed correctly:
    - Go to the GCP Compute Engine console.
    - Find a virtual machine with a GPU that belongs to your GKE cluster and connect to the machine by clicking the `SSH` button.
    - In the terminal session, run `sudo nvidia-smi` to see if the driver can communicate with the GPU.
    - Run `sudo docker run --runtime=nvidia --rm nvidia/cuda:11.0-base nvidia-smi` and check that the output is the same as the output in the previous step.
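You can also confirm from kubectl that the cluster exposes GPUs to pods. A sketch (the column expression is one common way to query allocatable GPUs):

```bash
# GPU nodes should report an allocatable nvidia.com/gpu count of at least 1.
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```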
#### Optional: Setting up a Network File System (NFS) server

You can attach additional storage by creating an NFS server.

To set this up:
1. Create a compute disk named `pd` in GCP by running `gcloud compute disks create --size=<Size of disk in GB>GB --zone=us-west1-a --type pd-ssd pd`.
2. Go to the `docs/gcp` directory of this repository: `cd docs/gcp`.
3. Run `kubectl apply -f nfs-server.yaml && kubectl apply -f nfs-service.yaml && kubectl get svc nfs-server`. This will output the IP address of the cluster.
4. Update `nfs-pv.yaml` with the IP address from step 3.
5. Run `kubectl apply -f nfs-pv.yaml`.
6. Take note of the persistent volume name and the volume mount path specified in `nfs-pv.yaml`, as you will need these to start the worker manager.
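As a quick sanity check, verify that the persistent volume and NFS service exist:

```bash
kubectl get pv               # The volume from nfs-pv.yaml should be listed
kubectl get svc nfs-server   # Re-prints the NFS server's cluster IP
```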
#### Authentication and setting up a service account

A GCP service account and cluster certificate are required to authenticate and run Kubernetes commands through the worker manager.

To create a service account:
1. Run `cd codalab-worksheets && kubectl apply -f docs/gcp/service-account.yaml --namespace default`.
2. Then, run `kubectl get secrets --namespace default`.
3. Get the auth token by running `kubectl describe secret/codalab-service-account-secret`.
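If you want just the decoded token value (e.g., to pass to `--auth-token` below), the following one-liner extracts it (a sketch; assumes the secret name above):

```bash
kubectl get secret codalab-service-account-secret --namespace default \
    -o jsonpath='{.data.token}' | base64 --decode
```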
To get the cluster certificate:

1. Go to the GKE console and click on the newly created GKE cluster.
2. Under `Cluster Basics`, find the `Endpoint` field.
3. Take note of the endpoint URL, as this is needed to start the worker manager.
4. Next, click on `Show Cluster Certificate`, copy the entire contents, and place it in a file called `gke.crt`.
### Start a Kubernetes Batch Worker Manager

Use the Kubernetes Batch Worker Manager to start the worker manager by passing in `kubernetes` for the type of worker manager.
At this point, four things are required to start a Kubernetes worker manager:

- A running Kubernetes cluster with Nvidia drivers installed
- An auth token
- The path to the GKE cluster certificate
- The endpoint URL of the cluster host

You can start a Kubernetes worker manager manually by using the `cl-worker-manager` command.

Below is an example of how to start a worker manager:
```bash
cl-worker-manager --server https://worksheets.codalab.org --min-workers 0 --max-workers 8 \
    --min-seconds-between-workers 300 --sleep-time 120 --worker-pass-down-termination \
    --worker-exit-on-exception --worker-exit-after-num-runs 1 kubernetes \
    --cert-path <Path to gke.crt> --auth-token <Auth token> --cluster-host <Endpoint URL of cluster host> \
    --cpus <Number of CPUs on VM> --gpus <Number of GPUs on VM> --memory-mb <Amount of memory on VM in MB>
```
If you want to mount an NFS volume to the worker pods, additionally specify the following arguments (see the example after this list):

- `--nfs-volume-name`: Name of the persistent volume
- `--worker-work-dir-prefix`: NFS volume mount path
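For example, appended to the `kubernetes` arguments above (the volume name and mount path here are illustrative; use the values from your `nfs-pv.yaml`):

```bash
cl-worker-manager --server https://worksheets.codalab.org --min-workers 0 --max-workers 8 kubernetes \
    --cert-path <Path to gke.crt> --auth-token <Auth token> --cluster-host <Endpoint URL of cluster host> \
    --nfs-volume-name nfs-pv --worker-work-dir-prefix /mnt/nfs
```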
### Checking worker logs in GCP

1. Go to the bundle view page of the bundle and get the worker ID from the `remote` field.
2. Go to the GKE console.
3. In the `Cluster` dropdown menu, specify the GKE cluster the worker is running on.
4. Click on the pod with the name `cl-worker-<ID of worker from step 1>`.
5. Click on the `Logs` tab to view the worker logs.
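Alternatively, the same logs are available from the command line (pod names follow the `cl-worker-<worker ID>` pattern from step 4):

```bash
kubectl logs cl-worker-<worker ID>
```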