The following sections provide solutions for some common issues.

General

What are appropriate use cases?

These are common use cases:

  • Distributed computing
  • Embarrassingly parallel algorithms
  • Coarse-grained parallelism of single-threaded applications
  • Running work on remote machines that have a limited resource (for instance, software licenses)

What is the ideal system configuration for the coordinator?

The coordinator's main responsibilities are passing messages between machines on the network and storing the jobs along with all their necessary data. The ideal system therefore has a fast network connection and enough memory to hold a large number of jobs; 2GB of free memory is a good rule of thumb.

What is the ideal system configuration for the agent?

The agents are the machines that do the actual work. Thus, the ideal system would depend on what type of resources your jobs need. For instance, a machine with many cores will execute more tasks in parallel.

Tasks

I get an error that says I can't connect to the Coordinator. How do I troubleshoot this?

Work through the following checks:

  • Confirm that the coordinator is running on the hostname and port your client is configured to use.
  • Ping the coordinator's machine to verify that it is reachable.
  • Check whether a firewall is blocking the coordinator's port.
  • Check whether your client application is timing out.
  • Search the coordinator logs for "timed-out" and "ERROR".
  • See the Troubleshooting documentation.
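
The port check described above can be scripted. The following is a minimal Python sketch that uses only the standard library; it is not part of the STK Scalability API, and the hostname and port shown are placeholders rather than defaults taken from this documentation.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder values: substitute your coordinator's hostname and port.
    print(can_reach("localhost", 9090))
```

A False result only tells you that the port could not be reached from this machine. It does not distinguish between a stopped coordinator, a wrong hostname, and a firewall, so the remaining checks still apply.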

I get an error which says "WARNING: Detected that task class (MyClass) is not run from a jar. Please see documentation". How do I troubleshoot this?

This message warns that the task is being run directly from class files instead of from a jar. This causes problems because STK Scalability cannot send the jars needed to load the task class remotely. To fix this, package your class code and all its dependencies into a jar, and run the application with that jar on the classpath. You can find instructions in the Programmer's Guide or the Tutorial. If you have ensured that the host can load the necessary jar files, you can suppress this message by setting the AGI_STKSCALABILITY_IGNORE_JAR_WARNING environment variable.

Under what account is the host running?

Each host runs under the same user account as the agent that started it.

Why does task progress sometimes not get back to the client?

For performance reasons, task progress is polled every 200 milliseconds. If you report progress more often than that, some of your progress updates will not be sent back to the client. If you need to send data back to the client with guaranteed delivery, use message passing instead (see Communicate with tasks by sending messages).
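
Because only one progress value per polling interval can reach the client, a task can avoid wasted calls by throttling its own reports. The following Python sketch illustrates the idea. The ThrottledProgress class, its report callback, and the injectable clock are hypothetical names introduced for this example and are not part of the STK Scalability API; only the 200 millisecond interval comes from the text above.

```python
import time
from typing import Callable, Optional


class ThrottledProgress:
    """Forward progress to `report` at most once per `min_interval` seconds."""

    def __init__(
        self,
        report: Callable[[int], None],   # hypothetical progress callback
        min_interval: float = 0.2,       # 200 ms, matching the polling interval
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._report = report
        self._min_interval = min_interval
        self._clock = clock
        self._last_sent: Optional[float] = None

    def update(self, percent: int) -> bool:
        """Send `percent` if enough time has passed; return True if it was sent."""
        now = self._clock()
        if self._last_sent is not None and now - self._last_sent < self._min_interval:
            return False  # too soon: this update is dropped
        self._report(percent)
        self._last_sent = now
        return True
```

A final value can still be dropped by this throttle if it arrives too soon after the previous one, so anything that must reach the client belongs in a message rather than in a progress update.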

Why do the changes I make to my task and task environment seem to not take effect?

You are likely using a constant Id for your task environments. If the assemblies in the task environment are resident in a host process, any changes to the task environment with the same identity will not take effect until the host process recycles. Whenever you update your task assemblies, make sure you assign a new Id value.
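
One way to avoid forgetting to change the Id is to derive it from the contents of the files in the task environment, so that the Id changes whenever a file changes and stays the same otherwise. The following Python sketch shows that idea in isolation. It assumes the Id can be set to a UUID-like value; the function names are hypothetical and nothing here is part of the STK Scalability API.

```python
import hashlib
import uuid
from pathlib import Path
from typing import Iterable, Tuple


def id_from_contents(items: Iterable[Tuple[str, bytes]]) -> uuid.UUID:
    """Derive a stable identifier from (file name, file bytes) pairs."""
    digest = hashlib.sha256()
    for name, data in sorted(items):
        digest.update(name.encode("utf-8"))
        digest.update(data)
    # A UUID holds 16 bytes, so keep the first 16 bytes of the hash.
    return uuid.UUID(bytes=digest.digest()[:16])


def environment_id(paths: Iterable[Path]) -> uuid.UUID:
    """Convenience wrapper that reads the files from disk."""
    return id_from_contents((p.name, p.read_bytes()) for p in paths)
```

If you do not need a stable value, generating a fresh random identifier (for example with uuid.uuid4()) each time you rebuild also guarantees a new Id.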

Failure scenarios

What happens if a task fails or throws an exception?

The host catches all user exceptions a task can throw. The task's status will be set to FAILED and the host process that executed the task will recycle. The host will write the exception reason to the host log.

What happens if the host process dies?

A host process can die for a number of reasons outside of the host's control. Examples include a user killing the host process or the task suddenly exiting from the process. This type of error is identified as a task interruption (INTERRUPTED). The task is retried on another process, up to the number of times specified in MaxTaskInterruptedRetryAttempts. If the task exceeds this count, the task's status is set to FAILED.
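
The retry rule can be pictured with a small simulation. This Python sketch is not the coordinator's implementation. It assumes that the limit counts retries after the first attempt, so a limit of 2 allows three attempts in total, and it uses "COMPLETED" as a placeholder name for the success status.

```python
from typing import Callable


def run_with_interruption_retries(attempt: Callable[[], bool], max_retries: int) -> str:
    """Run `attempt` until it succeeds or is interrupted too many times.

    `attempt` returns True if the task finished and False if the host
    process died (an interruption). `max_retries` plays the role of
    MaxTaskInterruptedRetryAttempts under the assumption stated above.
    """
    interruptions = 0
    while True:
        if attempt():
            return "COMPLETED"  # placeholder status name
        interruptions += 1
        if interruptions > max_retries:
            return "FAILED"
```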

What happens when an agent machine dies?

All the tasks of the agent are reassigned to another agent and the tasks are restarted. The agent is removed from the list of available agents. The coordinator identifies an agent as dead if the agent does not respond to an internal signal or "heartbeat" within 30 seconds (configurable).
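
Heartbeat-based detection of this kind can be sketched in a few lines of Python. This is an illustration of the general technique and not the coordinator's actual code; the function name and the dictionary of last-seen timestamps are assumptions made for the example, while the 30 second default comes from the text above.

```python
from typing import Dict, List


def find_dead_agents(
    last_heartbeat: Dict[str, float],  # agent name -> time of last heartbeat (seconds)
    now: float,
    timeout: float = 30.0,
) -> List[str]:
    """Return the agents whose last heartbeat is older than `timeout` seconds."""
    return sorted(
        name for name, seen in last_heartbeat.items() if now - seen > timeout
    )
```

For example, with heartbeats last seen at 100.0 and 125.0 seconds and the current time at 131.0, only the first agent has been silent for more than 30 seconds.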

What happens when the coordinator dies?

All agents connected to the coordinator detect that the coordinator is no longer running. Any tasks that are assigned to an agent are canceled. Each agent then goes into a retry loop, checking periodically whether the coordinator has restarted.
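
A retry loop of this kind looks roughly like the following Python sketch. It is a generic illustration and not the agent's implementation. The is_up callback, the 5 second interval, and the optional max_checks limit are assumptions made for the example; the source does not state the agent's actual interval.

```python
import time
from typing import Callable, Optional


def wait_for_coordinator(
    is_up: Callable[[], bool],               # hypothetical reachability check
    interval: float = 5.0,                   # assumed delay between checks
    max_checks: Optional[int] = None,        # None means keep checking forever
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `is_up` until it returns True or `max_checks` is exhausted."""
    checks = 0
    while max_checks is None or checks < max_checks:
        if is_up():
            return True
        checks += 1
        sleep(interval)
    return False
```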

What happens when the client dies before all tasks are done?

The messages sent to the client are simply ignored. By default, the tasks still run until completion. You can choose to cancel tasks once the client disconnects by specifying the CancelOnClientDisconnection option.