Frequently Asked Questions

The following sections provide solutions for some common issues.

General

What are appropriate use cases?

These are common use cases:

  • Distributed computing

  • Embarrassingly parallel algorithms

  • Coarse-grained parallelism of single-threaded applications

  • Running work on remote machines that have a limited resource (for instance, software licenses)

What is the ideal system configuration for the coordinator?

The coordinator’s main responsibilities are passing messages between machines on the network and storing the jobs along with all their necessary data. The ideal system therefore has a fast network connection and enough memory to hold many jobs; 2 GB of free memory is a good rule of thumb.

What is the ideal system configuration for the agent?

The agents are the machines that do the actual work, so the ideal system depends on the type of resources your jobs need. For instance, a machine with many cores can execute more tasks in parallel.

Tasks

I get an error that says I can’t connect to the Coordinator. How do I troubleshoot this?

Work through the following checks:

  • Confirm that the coordinator is running at the hostname and port your client is using.

  • Ping the coordinator’s machine to see if it is running.

  • Check whether a firewall is blocking the coordinator’s port.

  • Check whether your client application is timing out.

  • Search the coordinator logs for “timed-out” and “ERROR”.

  • See the Troubleshooting documentation for further steps.
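
The reachability checks above can be partly scripted. The following is a minimal sketch using only the Python standard library. It attempts a plain TCP connection, which shows whether something is listening on the port and whether a firewall or timeout is in the way. It does not speak the coordinator's own protocol, and the function name can_reach is hypothetical.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A False result covers a refused connection, a timeout (often a firewall
    silently dropping packets), and a hostname that does not resolve.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # socket.timeout, ConnectionRefusedError and socket.gaierror
        # are all subclasses of OSError.
        return False
```

For example, can_reach("coordinator.example.com", 9000) would test a placeholder hostname and port; substitute the values your client is configured with.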

Under what account is the host running?

Each host runs under the same user account as the agent that started it.

Why does task progress sometimes not get back to the client?

For performance reasons, task progress is polled every 200 milliseconds. If you report progress more often than once every 200 milliseconds, some of your progress updates will not be sent back to the client.
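
One way to work with this interval is to throttle updates inside the task so that it does not spend time on updates that would be dropped. The sketch below is an illustration only: ThrottledProgress and its report callback are hypothetical names, and report stands in for whatever progress call your task actually uses.

```python
import time
from typing import Callable, Optional


class ThrottledProgress:
    """Forward progress updates at most once per min_interval seconds."""

    def __init__(
        self,
        report: Callable[[float], None],
        min_interval: float = 0.2,  # matches the 200 ms polling interval
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._report = report
        self._min_interval = min_interval
        self._clock = clock
        self._last_sent: Optional[float] = None

    def update(self, fraction: float, force: bool = False) -> bool:
        """Report fraction if enough time has passed; return True if sent.

        Pass force=True for the final update so that completion is always
        reported, regardless of when the previous update went out.
        """
        now = self._clock()
        due = self._last_sent is None or now - self._last_sent >= self._min_interval
        if force or due:
            self._report(fraction)
            self._last_sent = now
            return True
        return False
```

A task would create one ThrottledProgress per run, call update() as often as it likes inside its loop, and finish with update(1.0, force=True).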

Why do the changes I make to my task and task environment not seem to take effect?

You are likely using a constant unique_id for your task environments. If the assemblies in a task environment are already resident in a host process, changes to a task environment with the same unique_id will not take effect until the host process recycles. Whenever you update your task assemblies, make sure you assign a new unique_id value.
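
A simple way to guarantee a new unique_id after every change is to derive it from the contents of the files in the task environment. The sketch below assumes that unique_id accepts an arbitrary string, which the text above does not confirm. The function environment_id and the "env" prefix are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def environment_id(files: Iterable[Path], prefix: str = "env") -> str:
    """Build an identifier that changes whenever any file's name or bytes change."""
    digest = hashlib.sha256()
    # Sort so that the result does not depend on the order the files are listed in.
    for path in sorted(Path(f) for f in files):
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")  # separator between the name and the contents
        digest.update(path.read_bytes())
    # 16 hex characters keep the identifier short while remaining
    # very unlikely to collide between builds.
    return f"{prefix}-{digest.hexdigest()[:16]}"
```

Because the identifier depends only on file names and contents, rebuilding identical assemblies yields the same value, while any real change yields a new one.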

Failure Scenarios

What happens if a task fails or throws an exception?

The host catches all user exceptions a task can throw. The task’s status will be set to FAILED and the host process that executed the task will recycle. The host will write the exception reason to the host log.

What happens if the host process dies?

A host process could die for a number of reasons outside of the host’s control. Examples include a user killing the host process or the task code abruptly exiting the process. This type of error is identified as a task interruption (INTERRUPTED). The task is retried on another process up to the number of times specified in max_interrupted_retry_attempts. If the task exceeds this count, the task’s status will be set to FAILED.
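
The retry rule can be modeled in a few lines. This is a sketch of the behavior described above, not the scheduler's actual code. It assumes that max_interrupted_retry_attempts counts retries after the first attempt rather than total attempts. HostProcessDied, Status, and run_with_interruption_retries are hypothetical names.

```python
from enum import Enum
from typing import Callable


class Status(Enum):
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


class HostProcessDied(Exception):
    """Hypothetical signal that the host process died while running the task."""


def run_with_interruption_retries(
    run_once: Callable[[], None],
    max_interrupted_retry_attempts: int,
) -> Status:
    """Run a task, retrying after each interruption until the limit is exceeded."""
    retries = 0
    while True:
        try:
            run_once()  # in the real system each retry runs on another process
            return Status.COMPLETED
        except HostProcessDied:
            if retries >= max_interrupted_retry_attempts:
                # The task has used up its retries, so it is marked FAILED.
                return Status.FAILED
            retries += 1
```

Under this assumption, a limit of 2 allows up to three attempts in total: the original run plus two retries.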

What happens when an agent machine dies?

All of the agent’s tasks are reassigned to another agent and restarted. The agent is removed from the list of available agents. The coordinator identifies an agent as dead if the agent does not respond to an internal signal, or “heartbeat”, within 30 seconds (configurable).
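
Heartbeat-based failure detection of this kind can be sketched as follows. This is a generic illustration of the technique, not the coordinator's implementation. HeartbeatMonitor and its methods are hypothetical names, and the 30-second default mirrors the value quoted above.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Track the last heartbeat per agent and report agents that went silent."""

    def __init__(
        self,
        timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def beat(self, agent_id: str) -> None:
        """Record that agent_id responded just now."""
        self._last_seen[agent_id] = self._clock()

    def reap_dead(self) -> List[str]:
        """Remove and return agents whose last heartbeat is older than timeout."""
        now = self._clock()
        dead = [
            agent_id
            for agent_id, seen in self._last_seen.items()
            if now - seen > self._timeout
        ]
        for agent_id in dead:
            # A dead agent leaves the list of available agents.
            del self._last_seen[agent_id]
        return dead
```

A caller would invoke reap_dead() periodically and reassign the tasks of every agent it returns.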

What happens when the coordinator dies?

All agents connected to the coordinator will detect that the coordinator is not running. Any tasks assigned to an agent are canceled. Each agent then goes into a retry loop, checking periodically whether the coordinator has restarted.
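
A retry loop of this kind might look like the sketch below. The check interval is not stated above, so the 5-second default is an assumption. wait_for_coordinator and is_up are hypothetical names, with is_up standing in for whatever connection check the agent performs.

```python
import time
from typing import Callable, Optional


def wait_for_coordinator(
    is_up: Callable[[], bool],
    interval: float = 5.0,  # assumed value; the real interval is not documented here
    sleep: Callable[[float], None] = time.sleep,
    max_checks: Optional[int] = None,
) -> bool:
    """Poll is_up() until it returns True; return False if max_checks runs out.

    With max_checks=None the loop keeps polling indefinitely.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if is_up():
            return True
        checks += 1
        sleep(interval)  # wait before checking again
    return False
```

The sleep and max_checks parameters exist mainly to make the loop testable; a long-running agent would typically leave both at their defaults.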

What happens when the client dies before all tasks are done?

The messages sent to the client are simply ignored. By default, the tasks still run until completion. You can choose to cancel tasks once the client disconnects by specifying the cancel_on_client_disconnection option.