WRENCH
1.11
Cyberinfrastructure Simulation Workbench
In WRENCH's terminology, an execution controller is software that makes all decisions and takes all actions for executing some application workflow using cyberinfrastructure services. It is thus a crucial component in every WRENCH simulator. WRENCH does not provide any execution controller implementation, but provides the means for developing custom ones. This page is meant to provide high-level and detailed information about implementing an execution controller in WRENCH. Full API details are provided in the Developer API Reference.
An execution controller implementation needs to use many WRENCH classes, which are accessed by including a single header file:
An execution controller implementation must derive from the wrench::ExecutionController class, which means that it must override the virtual main() member function. A typical implementation of this function goes through a simple loop as follows:
In the next three sections, we give details on how to implement the above. To provide context, we make frequent references to the execution controllers implemented as part of the example simulators in the examples/ directory. These are followed by a few sections that highlight features and functionality relevant to execution controller development.
Services that the execution controller can use are typically passed to its constructor. Most service classes provide member functions to get information about the capabilities and properties of the services. For instance, a wrench::ComputeService
has a wrench::ComputeService::getNumHosts()
member function that returns how many compute hosts the service has access to in total. A wrench::StorageService
has a wrench::StorageService::getFreeSpace()
member function to find out how many bytes of free space are available on it. And so on...
To take a concrete example, consider the execution controller implementation in examples/basic-examples/batch-bag-of-tasks/TwoTasksAtATimeBatchWMS.cpp. This WMS finds out the compute speed of the cores of the compute nodes available to a wrench::BatchComputeService as:
Member function wrench::ComputeService::getCoreFlopRate()
returns a map of core compute speeds indexed by hostname (the map thus has one element per compute node available to the service). Since the compute nodes of a batch compute service are homogeneous, the above code simply grabs the core speed value of the first element in the map.
It is important to note that these member functions actually involve communication with the service, and thus incur overhead that is part of the simulation (just as, in the real world, one would contact a running service over the network to request information). This is why the line of code above, in that example execution controller, is executed once and the core compute speed is stored in the core_flop_rate variable, to be reused by the execution controller throughout its execution.
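The "query once, then reuse" pattern described above can be sketched in a self-contained way. The FakeBatchService and CachingController classes below are hypothetical stand-ins written for this illustration (they are not WRENCH classes), and the double value type for core speeds is an assumption:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a compute service; NOT a WRENCH class.
// Each call to getCoreFlopRate() stands for one (simulated) round trip.
class FakeBatchService {
public:
    std::map<std::string, double> getCoreFlopRate() {
        ++num_queries;  // count how often the service is contacted
        return {{"node1", 1e9}, {"node2", 1e9}};
    }
    int num_queries = 0;
};

// Hypothetical controller that contacts the service once and caches the answer.
class CachingController {
public:
    explicit CachingController(FakeBatchService &service) : service_(service) {}

    // Returns the core speed, querying the service only on the first call.
    double coreFlopRate() {
        if (!cached_) {
            auto rates = service_.getCoreFlopRate();
            assert(!rates.empty());
            // Nodes are assumed homogeneous, so the first entry is enough.
            core_flop_rate_ = rates.begin()->second;
            cached_ = true;
        }
        return core_flop_rate_;
    }

    // Estimates a task's runtime (in seconds) from its amount of work.
    double estimateRuntime(double flops) { return flops / coreFlopRate(); }

private:
    FakeBatchService &service_;
    bool cached_ = false;
    double core_flop_rate_ = 0.0;
};
```

However many runtime estimates are computed, the stand-in service is contacted a single time, which is the point of storing the value in a member variable.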
An execution controller can have many and complex interactions with services, especially with compute and storage services. In this section, we describe how WRENCH makes these interactions relatively easy, providing examples for each kind of interaction for each kind of service.
As expected, each service type provides its own API. For instance, a network proximity service provides member functions to query the service's host distance databases. The Developer API Reference provides all necessary documentation, which also explains which member functions are synchronous and which are asynchronous (in which case some event will occur in the future). However, the WRENCH developer will find that many member functions that one would expect are nowhere to be found. For instance, the compute services do not have (public) member functions for submitting jobs for execution!
The rationale for the above is that many member functions need to be asynchronous so that the execution controller can use services concurrently. For instance, an execution controller could submit a job to two distinct compute services asynchronously, and then wait for the service which completes its job first and cancel the job on the other service. Exposing this asynchronicity to the execution controller would require that the WRENCH developer use data structures to perform the necessary bookkeeping of ongoing service interactions, and process incoming control messages from the services on the (simulated) network or alternatively register many callbacks. Instead, WRENCH provides managers. One can think of managers as separate threads that handle all asynchronous interactions with services, and which have been implemented for your convenience to make interacting with services easy.
There are two managers: a job manager (class wrench::JobManager) and a data movement manager (class wrench::DataMovementManager). The base wrench::ExecutionController class provides two member functions for instantiating and starting these managers: wrench::ExecutionController::createJobManager() and wrench::ExecutionController::createDataMovementManager().
Creating one or both of these managers is typically the first thing an execution controller does. For instance, the execution controller in examples/basic-examples/bare-metal-data-movement/DataMovementWMS.cpp starts by doing:
Each manager has its own documented API, and is discussed further in sections below.
The possible interactions between an execution controller and a storage service include:
The first four interactions above are done by calling member functions of the wrench::StorageService class. The last two are done via a Data Movement Manager, i.e., by calling member functions of the wrench::DataMovementManager class. Some of these member functions take an optional wrench::FileRegistryService argument, in which case they will also update entries in a file registry service (e.g., removing an entry when a file is deleted).
See this page for concrete examples of interactions with a wrench::SimpleStorageService
.
The main activity of an execution controller is to execute workflow tasks on compute services. Rather than submitting tasks directly to compute services, an execution controller must create "jobs", which can comprise multiple tasks and involve data copy/deletion operations. The job abstraction is powerful and greatly simplifies the task of an execution controller while affording flexibility.
There are three kinds of jobs in WRENCH: wrench::CompoundJob, wrench::StandardJob, and wrench::PilotJob.
A Compound Job is simply a set of actions to be performed, with possible control dependencies between actions. It is the most generic, flexible, and expressive kind of job. See the API documentation for the wrench::CompoundJob class and the examples in the examples/action_api directory. The other types of jobs below are actually implemented internally as compound jobs. The Compound Job abstraction is the most recent addition to the WRENCH API, and vastly expands the list of possible things that an execution controller can do. Because it is more recent, the reader will find that more of the examples in these documents and in the examples directory use standard jobs (described below). All these examples could easily be rewritten using the more generic compound job abstraction.
A Standard Job is a specific kind of job designed for workflow applications. In its most complete form, a standard job specifies:
std::shared_ptr<wrench::WorkflowTask> tasks to execute, so that each task without all its predecessors in the set is ready;
std::map of <std::shared_ptr<wrench::DataFile>, std::shared_ptr<wrench::StorageService>> pairs that specifies from which storage services particular input files should be read and to which storage services output files should be written;
Any of the above can actually be empty, and in the extreme a standard job can do nothing.
A Pilot Job (sometimes called a "placeholder job" in the literature) is a concept that is mostly relevant for batch scheduling. In a nutshell, it is a job that allows late binding of tasks to resources. It is submitted to a compute service (provided that service supports pilot jobs), and when it starts it just looks to the execution controller like a short-lived wrench::BareMetalComputeService to which compound and/or standard jobs can be submitted.
All jobs are created via the job manager, which provides wrench::JobManager::createCompoundJob(), wrench::JobManager::createStandardJob(), and wrench::JobManager::createPilotJob() member functions (the job manager is thus a job factory).
In addition to member functions for job creation, the job manager also provides the following:
wrench::JobManager::submitJob(): asynchronous submission of a job to a compute service.
wrench::JobManager::terminateJob(): synchronous termination of a previously submitted job.
The next section gives examples of interactions with each kind of compute service.
Click on the following links to see detailed descriptions and examples of how jobs are submitted to each compute service type:
Interaction with a file registry service is straightforward and done by directly calling member functions of the wrench::FileRegistryService
class. Note that often file registry service entries are managed automatically, e.g., via calls to wrench::DataMovementManager
and wrench::StorageService
member functions. So often an execution controller does not need to interact with the file registry service.
Adding/removing an entry to a file registry service is done as follows:
The wrench::FileLocation
class is a convenient abstraction for a file copy available at some storage service (with optionally a directory path at that service).
Retrieving all entries for a given file is done as follows:
If a network proximity service is running, it is possible to retrieve entries for a file sorted by non-decreasing proximity from some reference host. Returned entries are stored in a (sorted) std::map
where the keys are network distances to the reference host. For instance:
See the documentation of wrench::FileRegistryService
for more API member functions.
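Since a std::map iterates in non-decreasing key order, a map keyed by network distance directly yields the closest entry first. The following self-contained sketch illustrates this; the EntriesByDistance alias, the use of plain strings to name locations, and both helper functions are assumptions made for this illustration, not the WRENCH types:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in: keys are network distances (in seconds) to a
// reference host, values name the storage location holding a file copy.
using EntriesByDistance = std::map<double, std::string>;

// Returns the location closest to the reference host.
std::string closestEntry(const EntriesByDistance &entries) {
    assert(!entries.empty());
    return entries.begin()->second;  // smallest key comes first
}

// Returns all locations no farther than max_distance, nearest first.
std::vector<std::string> entriesWithin(const EntriesByDistance &entries,
                                       double max_distance) {
    std::vector<std::string> result;
    for (const auto &entry : entries) {
        if (entry.first > max_distance) {
            break;  // keys are sorted, so every later entry is farther away
        }
        result.push_back(entry.second);
    }
    return result;
}
```

Note that a std::map holds a single value per key, so in this sketch two locations at exactly the same distance would collide; a std::multimap would avoid that.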
Querying a network proximity service is straightforward. For instance, to obtain a measure of the network distance between hosts "Host1" and "Host2", one simply does:
This distance corresponds to half the round-trip-time, in seconds, between the two hosts. If the service is configured to use the Vivaldi coordinate-based system, as in our example above, this distance is actually derived from network coordinates, as computed by the Vivaldi algorithm. In this case, one can actually ask for these coordinates for any given host:
See the documentation of wrench::NetworkProximityService
for more API member functions.
Because the execution controller performs asynchronous operations, it needs to wait for and react to events. This is done by calling the wrench::ExecutionController::waitForAndProcessNextEvent() member function implemented by the base wrench::ExecutionController class. A call to this member function blocks until some event occurs and then calls a callback member function. The possible event classes all derive from the wrench::ExecutionEvent class, and an execution controller can override the callback member function for each possible event (the default member function does nothing but print some log message). These overridable callback member functions are:
wrench::ExecutionController::processEventCompoundJobCompletion(): react to a compound job completion
wrench::ExecutionController::processEventCompoundJobFailure(): react to a compound job failure
wrench::ExecutionController::processEventStandardJobCompletion(): react to a standard job completion
wrench::ExecutionController::processEventStandardJobFailure(): react to a standard job failure
wrench::ExecutionController::processEventPilotJobStart(): react to a pilot job beginning execution
wrench::ExecutionController::processEventPilotJobExpiration(): react to a pilot job expiration
wrench::ExecutionController::processEventFileCopyCompletion(): react to a file copy completion
wrench::ExecutionController::processEventFileCopyFailure(): react to a file copy failure
Each member function above takes an event object as a parameter. In the case of failure, the event includes a wrench::FailureCause object, which can be accessed to analyze (or just display) the root cause of the failure.
Consider the execution controller in examples/basic-examples/bare-metal-bag-of-tasks/TwoTasksAtATimeWMS.cpp. At each iteration of its main loop it does:
In this simple example, only one of two events could occur at this point: a standard job completion or a standard job failure. As a result, this execution controller overrides the two corresponding member functions as follows:
You may note some difference between the above code and that in examples/basic-examples/bare-metal-bag-of-tasks/TwoTasksAtATimeWMS.cpp
. This is for clarity purposes, and especially because we have not yet explained how WRENCH does message logging. See an upcoming section about logging.
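The callback mechanism itself can be illustrated with a self-contained sketch. All types below (Event, JobCompletionEvent, JobFailureEvent, ControllerBase, CountingController) are hypothetical stand-ins written for this illustration and are not the WRENCH classes; in particular, the stand-in returns false when no event is pending, whereas a real controller would block:

```cpp
#include <cassert>
#include <memory>
#include <queue>
#include <string>
#include <utility>

// Hypothetical event hierarchy (stand-ins, not WRENCH classes).
struct Event { virtual ~Event() = default; };
struct JobCompletionEvent : Event { std::string job_name; };
struct JobFailureEvent : Event { std::string job_name; std::string cause; };

// Hypothetical base class: dispatches each event to an overridable callback.
class ControllerBase {
public:
    virtual ~ControllerBase() = default;

    // Queues an event, as a service or manager would in a simulation.
    void post(std::shared_ptr<Event> event) { pending_.push(std::move(event)); }

    // Takes the next event and calls the matching callback.
    // Returns false if no event is pending.
    bool waitForAndProcessNextEvent() {
        if (pending_.empty()) return false;
        auto event = pending_.front();
        pending_.pop();
        if (auto done = std::dynamic_pointer_cast<JobCompletionEvent>(event)) {
            processEventJobCompletion(*done);
        } else if (auto failed = std::dynamic_pointer_cast<JobFailureEvent>(event)) {
            processEventJobFailure(*failed);
        }
        return true;
    }

protected:
    // Default callbacks do nothing; subclasses override the ones they need.
    virtual void processEventJobCompletion(const JobCompletionEvent &) {}
    virtual void processEventJobFailure(const JobFailureEvent &) {}

private:
    std::queue<std::shared_ptr<Event>> pending_;
};

// Hypothetical controller overriding the two callbacks it cares about.
class CountingController : public ControllerBase {
public:
    int completed = 0;
    std::string last_failure_cause;

protected:
    void processEventJobCompletion(const JobCompletionEvent &) override { ++completed; }
    void processEventJobFailure(const JobFailureEvent &event) override {
        last_failure_cause = event.cause;
    }
};
```

The main loop only calls waitForAndProcessNextEvent(); all event-specific logic lives in the overridden callbacks.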
While the above callbacks are convenient, sometimes it is desirable to do things more manually, that is, to wait for an event and then process it in the code of the main loop of the execution controller rather than in a callback member function. This is done by calling the wrench::ExecutionController::waitForNextEvent() member function. For instance, the execution controller in examples/basic-examples/bare-metal-data-movement/DataMovementWMS.cpp does it as:
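As a self-contained illustration of this manual style (not the code of that example), the sketch below retrieves an event and inspects its dynamic type in the loop body. The event types and both functions are hypothetical stand-ins, and the free function named waitForNextEvent here simply pops a queue instead of blocking:

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical event hierarchy (stand-ins, not WRENCH classes).
struct Event { virtual ~Event() = default; };
struct FileCopyCompletedEvent : Event {};
struct FileCopyFailedEvent : Event { std::string cause; };

using EventQueue = std::deque<std::shared_ptr<Event>>;

// Stand-in for a blocking wait: returns the next queued event.
std::shared_ptr<Event> waitForNextEvent(EventQueue &pending) {
    assert(!pending.empty());
    auto event = pending.front();
    pending.pop_front();
    return event;
}

// One iteration of a main loop that handles the event itself,
// without going through callbacks.
std::string handleOneEvent(EventQueue &pending) {
    auto event = waitForNextEvent(pending);
    if (std::dynamic_pointer_cast<FileCopyCompletedEvent>(event)) {
        return "copy completed";
    }
    if (auto failure = std::dynamic_pointer_cast<FileCopyFailedEvent>(event)) {
        return "copy failed: " + failure->cause;
    }
    // Any other event type is unexpected at this point in the loop.
    throw std::runtime_error("unexpected event type");
}
```

Compared with callbacks, this keeps the control flow in one place, at the cost of explicit type checks.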
Most member functions in the WRENCH Developer API throw exceptions. In fact, most of the code fragments above should be in try-catch clauses, catching these exceptions.
Some exceptions correspond to failures during the simulated workflow executions (i.e., errors that would occur in a real-world execution and are thus part of the simulation). Each such exception contains a wrench::FailureCause object, which can be accessed to understand the root cause of the execution failure. Other exceptions (e.g., std::invalid_argument, std::runtime_error) are thrown as well, which are used for detecting misuses of the WRENCH API or internal WRENCH errors.
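The distinction between the two families of exceptions can be sketched as follows. This is a minimal, self-contained illustration: the FailureCause struct, the ExecutionFailure exception class, and the copyFile() and tryCopy() functions are hypothetical stand-ins, not WRENCH's actual types or signatures.

```cpp
#include <cassert>
#include <exception>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a failure cause (not the WRENCH class).
struct FailureCause { std::string description; };

// Hypothetical exception for failures that are part of the simulation.
class ExecutionFailure : public std::exception {
public:
    explicit ExecutionFailure(std::shared_ptr<FailureCause> cause)
        : cause_(std::move(cause)) {}
    std::shared_ptr<FailureCause> getCause() const { return cause_; }
    const char *what() const noexcept override { return "execution failure"; }

private:
    std::shared_ptr<FailureCause> cause_;
};

// Hypothetical operation that can fail in both ways.
void copyFile(double file_size, double free_space) {
    if (file_size < 0) {
        // Misuse of the API: a programming error, not a simulated failure.
        throw std::invalid_argument("copyFile(): negative file size");
    }
    if (file_size > free_space) {
        // A failure that would also occur in a real-world execution.
        throw ExecutionFailure(
            std::make_shared<FailureCause>(FailureCause{"not enough space"}));
    }
}

// Wraps the call in a try-catch clause and reports what happened.
std::string tryCopy(double file_size, double free_space) {
    try {
        copyFile(file_size, free_space);
        return "ok";
    } catch (const ExecutionFailure &e) {
        return "simulated failure: " + e.getCause()->description;
    } catch (const std::invalid_argument &e) {
        return std::string("API misuse: ") + e.what();
    }
}
```

A controller would typically react to the first kind of exception (e.g., by retrying elsewhere), while the second kind usually indicates a bug to fix in the simulator code.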
The wrench::Simulation class provides many member functions to discover information about the (simulated) hardware platform and interact with it. It also provides other useful information about the simulation itself, such as the current simulation date. Some of these member functions are static, but others are not. The wrench::ExecutionController class includes a simulation object. Thus, the execution controller can call member functions on the this->simulation object. For instance, this fragment of code shows how an execution controller can figure out the current simulated date and then check that a host exists (given a hostname) and, if so, set its pstate (power state) to the highest possible setting.
See the documentation of the wrench::Simulation class for all details. Specifically regarding host pstates, see the example execution controller in examples/basic-examples/cloud-bag-of-tasks-energy/TwoTasksAtATimeCloudWMS.cpp, which interacts with host pstates (and the examples/basic-examples/cloud-bag-of-tasks-energy/four_hosts_energy.xml platform description file which defines pstates).
It is typically desirable for the execution controller to print log output to the terminal. This is easily accomplished using the wrench::WRENCH_INFO()
, wrench::WRENCH_DEBUG()
, and wrench::WRENCH_WARN()
macros, which are used just like C's printf()
. Each of these macros corresponds to a different logging level in SimGrid. See the SimGrid logging documentation for all details.
Furthermore, one can change the color of the log messages with the wrench::TerminalOutput::setThisProcessLoggingColor()
member function, which takes as parameter a color specification:
wrench::TerminalOutput::COLOR_BLACK
wrench::TerminalOutput::COLOR_RED
wrench::TerminalOutput::COLOR_GREEN
wrench::TerminalOutput::COLOR_YELLOW
wrench::TerminalOutput::COLOR_BLUE
wrench::TerminalOutput::COLOR_MAGENTA
wrench::TerminalOutput::COLOR_CYAN
wrench::TerminalOutput::COLOR_WHITE
When inspecting the code of the execution controllers in the example simulators you will find many examples of calls to wrench::WRENCH_INFO()
. The logging is per .cpp
file, each of which corresponds to a declared logging category. For instance, in examples/basic-examples/batch-bag-of-tasks/TwoTasksAtATimeBatchWMS.cpp
, you will find the typical pattern:
The name of the logging category, in this case custom_wms, can then be passed to the --log command-line argument. For instance, invoking the simulator with additional argument --log=custom_wms.threshold=info will make it so that only those WRENCH_INFO statements in TwoTasksAtATimeBatchWMS.cpp will be printed (in green!).
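To make the notions of printf-like formatting and of a logging threshold concrete, here is a minimal self-contained sketch. It is not how WRENCH or SimGrid implement logging: the LogLevel enumeration and the formatLog() function are hypothetical, and the sketch returns the formatted message as a string rather than printing it.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical logging levels, ordered from least to most severe.
enum class LogLevel { Debug = 0, Info = 1, Warn = 2 };

// Formats a message like printf() would, but only if its level is at or
// above the threshold; otherwise returns an empty string.
// Messages longer than the internal buffer are truncated.
std::string formatLog(LogLevel threshold, LogLevel level, const char *fmt, ...) {
    if (level < threshold) {
        return "";  // filtered out, as with a threshold set on a category
    }
    char buffer[256];
    va_list args;
    va_start(args, fmt);
    std::vsnprintf(buffer, sizeof(buffer), fmt, args);
    va_end(args);
    return buffer;
}
```

With a threshold of Info, a Debug message is dropped while Info and Warn messages are formatted, which mirrors the effect of a threshold=info setting on a logging category.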