Creating a batch compute service
Overview
A batch service is a service that makes it possible to run jobs on a homogeneous cluster managed by a batch scheduler. The batch scheduler receives requests that ask for a number of compute nodes, with a number of cores per compute node, and a duration. Requests wait in a queue and, using a range of possible batch scheduling algorithms, are dispatched to the requested compute resources in a space-sharing manner. Therefore, a job submitted to the service experiences a “queue waiting time” period (the length of which depends on the load on the service) followed by an “execution time” period. In typical batch-scheduler fashion, a running job is forcefully terminated when it reaches its requested duration (i.e., the job fails). If, instead, the job completes before the requested duration, it succeeds. In both cases, the job’s allocated compute resources are reclaimed by the batch scheduler.
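The queueing and termination behavior described above can be illustrated with a small, self-contained sketch. It is not WRENCH code: the `JobRequest` and `JobOutcome` types and the `simulate_fcfs` function are hypothetical names introduced here. It assumes strict FCFS without backfilling, all jobs submitted at time 0, and requests expressed in whole nodes only.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical types for illustration only; none of this is WRENCH API.
struct JobRequest {
    int nodes;              // number of compute nodes requested
    double requested_time;  // requested duration, in seconds
    double actual_time;     // time the job would really need, in seconds
};

struct JobOutcome {
    double wait_time;  // "queue waiting time" period
    double run_time;   // "execution time" period
    bool succeeded;    // false if the job was killed at its requested duration
};

// Toy model: strict FCFS (no backfilling) on total_nodes identical nodes,
// with every job submitted at time 0 in queue order.
std::vector<JobOutcome> simulate_fcfs(int total_nodes,
                                      const std::vector<JobRequest>& queue) {
    std::vector<double> free_at(total_nodes, 0.0);  // time at which each node is free
    std::vector<JobOutcome> outcomes;
    double prev_start = 0.0;  // FCFS: a job never starts before its predecessor

    for (const auto& job : queue) {
        assert(job.nodes >= 1 && job.nodes <= total_nodes);

        // Put the earliest-available nodes first.
        std::sort(free_at.begin(), free_at.end());

        // The job can start once its nodes-th earliest node is free.
        double start = std::max(prev_start, free_at[job.nodes - 1]);

        // A job that reaches its requested duration is terminated and fails.
        double run = std::min(job.actual_time, job.requested_time);
        bool ok = job.actual_time < job.requested_time;

        // Space sharing: the nodes are held for the whole run, then reclaimed.
        for (int i = 0; i < job.nodes; ++i) {
            free_at[i] = start + run;
        }

        outcomes.push_back({start, run, ok});
        prev_start = start;
    }
    return outcomes;
}
```

For instance, on 4 nodes, a 2-node job that finishes early starts immediately, while a following 4-node job must wait for it to release its nodes and is killed if it needs more time than it requested.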
A batch service also supports so-called “pilot jobs”, i.e., jobs that are submitted to the service, with requested resources and duration, but without specifying at submission time which workflow tasks/operations should be performed by the job. Instead, once the job starts it exposes to its submitter a bare-metal service. This service is available only for the requested duration, and can be used in any manner by the submitter. This allows late binding of workflow tasks to compute resources.
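As a rough illustration of late binding, the sketch below models a pilot job as a time-limited pool of cores to which tasks are bound only after the job has started. The `PilotJob` type and its `try_bind` method are hypothetical and are not part of the WRENCH API. Rejecting tasks that would not finish before the pilot job expires is one policy a submitter might apply, assumed here for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model for illustration only; not WRENCH API.
struct PilotJob {
    double start_time;  // when the batch scheduler started the pilot job
    double duration;    // requested duration, in seconds
    int cores;          // total cores held by the pilot job
    int busy_cores;     // cores already given to bound tasks
    std::vector<std::string> bound_tasks;

    PilotJob(double start, double dur, int c)
        : start_time(start), duration(dur), cores(c), busy_cores(0) {}

    // Time at which the batch scheduler reclaims the resources.
    double expiry() const { return start_time + duration; }

    // Late binding: the task is chosen at time `now`, not at submission time.
    // Assumed policy: accept a task only if the pilot job is running, enough
    // cores are idle, and the task would finish before the pilot job expires.
    // (This sketch never releases cores when a task completes.)
    bool try_bind(const std::string& task, int task_cores,
                  double task_runtime, double now) {
        assert(task_cores > 0 && task_runtime >= 0.0);
        if (now < start_time) return false;                  // not started yet
        if (now + task_runtime >= expiry()) return false;    // would be killed
        if (busy_cores + task_cores > cores) return false;   // no idle cores
        busy_cores += task_cores;
        bound_tasks.push_back(task);
        return true;
    }
};
```

The point of the sketch is that the submitter requests only resources and a duration up front, and decides what to run on them once they become available.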
Creating a batch compute service
In WRENCH, a batch service is defined by the wrench::BatchComputeService class. An instantiation of a batch service requires the following parameters:
The name of a host on which to start the service;
A list (std::vector) of hostnames (all cores and all RAM of each host are available to the batch service);
A mount point (corresponding to a disk attached to the host) for the scratch space, i.e., storage local to the batch service (used to store workflow files, as needed, during job executions); and
Maps (std::map) of configurable properties (wrench::BatchComputeServiceProperty) and configurable message payloads (wrench::BatchComputeServiceMessagePayload).
The example below creates an instance of a batch service that runs on host Gateway and provides access to 4 hosts (using all their cores and RAM), with scratch space on the disk mounted at path /scratch/ at host Gateway. Furthermore, the batch scheduling algorithm is set to FCFS (First-Come-First-Serve), and the payload of the message with which the service answers resource description requests is set to 1 KiB (1024 bytes):
auto batch_cs = simulation->add(
    new wrench::BatchComputeService(
        "Gateway",                                // host on which the service runs
        {"Node1", "Node2", "Node3", "Node4"},     // compute hosts managed by the service
        "/scratch/",                              // mount point of the scratch space
        {{wrench::BatchComputeServiceProperty::BATCH_SCHEDULING_ALGORITHM, "fcfs"}},
        {{wrench::BatchComputeServiceMessagePayload::RESOURCE_DESCRIPTION_ANSWER_MESSAGE_PAYLOAD, 1024}}));
See the documentation of wrench::BatchComputeServiceProperty and wrench::BatchComputeServiceMessagePayload for all possible configuration options.
Also see the simulators in the examples/workflow_api/basic-examples/batch-*/ and examples/action_api/batch-*/ directories, which use batch compute services.