Frequently Asked Questions

The following sections provide solutions for some common issues.

General

What are appropriate use cases?

These are common use cases:

Distributed computing
Embarrassingly parallel algorithms
Coarse-grained parallelism of single-threaded applications
Running work on remote machines that have a limited resource (for instance, software licenses)

What is the ideal system configuration for the coordinator?

The coordinator's main responsibilities are passing messages between machines on the network and storing jobs along with all of their data. The ideal system therefore has a fast network connection and enough memory to hold many jobs; 2 GB of free memory is a good rule of thumb.

What is the ideal system configuration for the agent?

The agents are the machines that do the actual work, so the ideal configuration depends on the type of resources your jobs need. For instance, a machine with many cores can execute more tasks in parallel.

Tasks

I get an error that says I can't connect to the Coordinator. How do I troubleshoot this?

Work through the following checks:

Check that the coordinator is running at the hostname and port your client is using.
Ping the coordinator's machine to see if it is running and reachable.
Check whether a firewall is blocking the coordinator's port.
Check whether your client application is timing out.
Search the coordinator logs for "timed-out" and "ERROR".
See the Troubleshooting documentation.
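The port and log checks above can be scripted. The following is a minimal Python sketch, not part of the product: the helper names can_connect and find_problem_lines are hypothetical, and the only details taken from this FAQ are the "timed-out" and "ERROR" search strings.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def find_problem_lines(lines, markers=("timed-out", "ERROR")):
    """Return the log lines that contain any of the given markers."""
    return [line.rstrip("\n") for line in lines if any(m in line for m in markers)]
```

A False result from can_connect while ping succeeds suggests that the coordinator is not listening on that port or that a firewall is blocking it. To scan a log, pass an open file object (or any iterable of lines) to find_problem_lines.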

Under what account is the host running?

Each host runs under the same user account as the agent that started it.

Why does task progress sometimes not get back to the client?

For performance reasons, task progress is polled every 200 milliseconds. If you report progress more often than every 200 milliseconds, some of your progress updates will not be sent back to the client. If you need to send data back to the client with guaranteed delivery, use message passing (see Communicate with tasks by sending messages).
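To see why updates go missing, consider the following Python simulation. It is only a sketch under a stated assumption: the poller is modeled as reading the most recent progress value at each 200 millisecond tick, which may not match the product's internals exactly. The function name observed_progress is hypothetical.

```python
def observed_progress(reports, poll_interval_ms=200):
    """Simulate a poller that samples the latest progress value at each tick.

    reports: list of (time_ms, value) pairs sorted by time.
    Returns the values the client would see (assumed "latest value wins" model).
    """
    if not reports:
        return []
    seen = []
    idx = 0
    latest = None
    last_time = reports[-1][0]
    # Round the final report time up to the next polling tick.
    end = -(-last_time // poll_interval_ms) * poll_interval_ms
    tick = poll_interval_ms
    while tick <= end:
        updated = False
        # Consume every report made up to this tick; only the last one survives.
        while idx < len(reports) and reports[idx][0] <= tick:
            latest = reports[idx][1]
            idx += 1
            updated = True
        if updated:
            seen.append(latest)
        tick += poll_interval_ms
    return seen


# Progress reported every 50 ms: only every fourth update is observed.
fast_reports = [(50 * i, i) for i in range(1, 9)]
print(observed_progress(fast_reports))  # [4, 8]
```

In this model, a task that reports no more often than the polling interval has every update observed, while a task that reports every 50 milliseconds loses three out of four.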

Why do the changes I make to my task and task environment not seem to take effect?

You are likely using a constant Id for your task environments. If the assemblies in the task environment are already resident in a host process, changes to a task environment with the same Id will not take effect until the host process recycles. Whenever you update your task assemblies, make sure you assign a new Id value.
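One way to avoid forgetting to change the Id is to derive it from the contents of the assemblies, so that any rebuild produces a different value. The Python sketch below illustrates the idea only. It assumes the Id can be any unique GUID-like value; the helper name environment_id is hypothetical, and how the value is assigned to the task environment is not shown.

```python
import hashlib
import uuid
from pathlib import Path


def environment_id(assembly_paths):
    """Derive a deterministic GUID from the names and bytes of the given files.

    The same files always give the same Id; changing any file gives a new Id.
    """
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in assembly_paths):
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")  # separator between file name and file contents
        digest.update(path.read_bytes())
    return uuid.uuid5(uuid.NAMESPACE_OID, digest.hexdigest())
```

With this approach, unchanged assemblies keep the same Id, so a host that already has them loaded can keep using them, while any modified assembly yields a new Id.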

Why does it seem like the memory footprint of the host never decreases?

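By default, the host process uses .NET's server garbage collection, which favors throughput and tends to hold on to memory rather than release it promptly to the operating system. If you want to change the type of garbage collection the host uses (for example, through the gcServer setting in the runtime section), edit the host's app.config file, found by default at C:\Program Files\AGI\STK Parallel Computing Server 2.9\Agent\bin\AGI.Parallel.Host.exe.config.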

Is it possible to avoid sending the assemblies needed by the job?

There are two ways to avoid sending assemblies. With either approach, you must manually ensure that the host process can resolve the assembly, for instance by installing the assembly in the Global Assembly Cache (GAC) on each of the agent machines.

The first way is to use the ExcludedDependencies property to exclude all the assemblies required by your job.

The second way relies on the fact that the API will not send an assembly that is installed in the GAC on the client (submitter) machine. Install the assembly in the GAC on the client machine, and also on each of the agent machines.

Failure scenarios

What happens if a task fails or throws an exception?

The host catches all user exceptions that a task throws. The task's status is set to Failed, the host process that executed the task recycles, and the host writes the reason for the exception to the host log.

What happens if the host process dies?

A host process can die for a number of reasons outside the host's control, such as a user killing the host process or a task abruptly exiting the process. This type of error is identified as a task interruption (Interrupted). The task is retried on another host process, up to the number of attempts specified by MaxTaskInterruptedRetryAttempts. If the task exceeds this count, its status is set to Failed.
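The retry rule can be modeled in a few lines of Python. This is a sketch of the behavior as described here, not the coordinator's actual code. It assumes that the first run does not count as a retry, so a task may run up to MaxTaskInterruptedRetryAttempts + 1 times. The function name final_status is hypothetical.

```python
def final_status(attempt_outcomes, max_retry_attempts):
    """Return (status, retries_used) for a sequence of per-attempt outcomes.

    attempt_outcomes: iterable of "Completed" or "Interrupted", one per run.
    max_retry_attempts: stands in for MaxTaskInterruptedRetryAttempts.
    """
    retries = 0
    for outcome in attempt_outcomes:
        if outcome == "Completed":
            return "Completed", retries
        # The host process died while the task was running (Interrupted).
        if retries == max_retry_attempts:
            return "Failed", retries  # retry budget exhausted
        retries += 1  # the task is retried on another host process
    raise ValueError("ran out of outcomes before the task finished")


print(final_status(["Interrupted", "Interrupted", "Completed"], 5))  # ('Completed', 2)
print(final_status(["Interrupted"] * 3, 2))                          # ('Failed', 2)
```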

What happens when an agent machine dies?

All tasks assigned to that agent are reassigned to another agent and restarted, and the agent is removed from the list of available agents. The coordinator identifies an agent as dead if the agent does not respond to an internal signal, or "heartbeat", within 30 seconds (configurable).
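The heartbeat rule and the reassignment can be illustrated with a small Python model. It is a sketch only: the function names, the data layout, and the round-robin choice of target agent are assumptions made for illustration. Only the 30-second default comes from this FAQ.

```python
def split_agents(last_heartbeat, now, timeout_s=30.0):
    """Partition agents into (alive, dead) by time since their last heartbeat.

    last_heartbeat: dict mapping agent name to the time (seconds) it last responded.
    """
    alive = {agent for agent, t in last_heartbeat.items() if now - t <= timeout_s}
    dead = set(last_heartbeat) - alive
    return alive, dead


def reassign_tasks(assignments, alive):
    """Move the tasks of agents not in `alive` onto live agents (round-robin)."""
    result = {agent: list(tasks) for agent, tasks in assignments.items() if agent in alive}
    for agent in alive:
        result.setdefault(agent, [])
    orphaned = [task for agent, tasks in assignments.items()
                if agent not in alive for task in tasks]
    targets = sorted(result)
    if orphaned and not targets:
        raise RuntimeError("no live agents available to take over tasks")
    for i, task in enumerate(orphaned):
        result[targets[i % len(targets)]].append(task)
    return result


heartbeats = {"a": 100.0, "b": 65.0, "c": 98.0}
alive, dead = split_agents(heartbeats, now=100.0)
print(sorted(dead))  # ['b']  (35 seconds of silence exceeds the 30-second limit)
print(reassign_tasks({"a": ["t1"], "b": ["t2", "t3"], "c": []}, alive))
# {'a': ['t1', 't2'], 'c': ['t3']}
```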

What happens when the coordinator dies?

All agents connected to the coordinator detect that the coordinator is no longer running. Each agent cancels any tasks assigned to it and then enters a retry loop, periodically checking whether the coordinator has restarted.
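A retry loop of this kind can be sketched in Python as follows. The function name, the 5-second interval, and the injected is_coordinator_up callable are all assumptions for illustration; the agent's real interval and detection mechanism are not described here.

```python
import time


def wait_for_coordinator(is_coordinator_up, retry_interval_s=5.0,
                         max_checks=None, sleep=time.sleep):
    """Poll until the coordinator responds; return the number of checks made.

    is_coordinator_up: callable returning True once the coordinator is reachable.
    max_checks: optional limit on checks; None means retry indefinitely.
    Returns None if the limit is reached without success.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if is_coordinator_up():
            return checks
        sleep(retry_interval_s)  # wait before checking again
    return None
```

Passing sleep as a parameter keeps the loop easy to test without real delays.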

What happens when the client dies before all tasks are done?

The messages sent to the client are simply ignored. By default, the tasks still run until completion. You can choose to cancel tasks once the client disconnects by specifying the CancelOnClientDisconnection option.