Frequently Asked Questions
The following sections provide solutions for some common issues.
These are common use cases:
The coordinator's main responsibilities are passing messages between machines on the network and storing jobs along with all of their necessary data. The ideal coordinator machine therefore has a fast network connection and enough memory to hold many jobs; 2 GB of free memory is a good rule of thumb.
The agents are the machines that do the actual work. The ideal agent machine therefore depends on the type of resources your jobs need. For instance, a machine with many cores can execute more tasks in parallel.
Check that the coordinator is running at the hostname and port you specified. Ping the coordinator's machine to see whether it is up. Check whether a firewall is blocking the coordinator's port. Check whether your client application is timing out. Search the coordinator logs for "timed-out" and "ERROR". For further help, see the Troubleshooting documentation.
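The first three checks can be partly automated from the client machine. The sketch below is only an illustration: `CoordinatorProbe` and `isReachable` are hypothetical names, not part of the STK Parallel Computing Server API, and the code uses nothing beyond the standard `java.net` classes. A `false` result means the TCP connection could not be opened, which may indicate a stopped coordinator, a wrong hostname or port, or a firewall.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper; not part of the STK Parallel Computing Server API. */
final class CoordinatorProbe {
    private CoordinatorProbe() {}

    /** Returns true if a TCP connection to host:port opens within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException | IllegalArgumentException e) {
            // Refused, timed out, unresolved hostname, or invalid port number.
            return false;
        }
    }
}
```

Call it with the hostname and port your client is configured to use, for example `CoordinatorProbe.isReachable(host, port, 2000)`. A successful connection shows only that something is listening on that port, so the coordinator logs are still worth checking.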
This message warns that the task is being run directly from class files instead of from a jar. This causes issues because STK Parallel Computing Server cannot send the jars that are needed to load the task class remotely. To fix this, package your class code and all its dependencies into a jar and run the application with the jar on the classpath. You can find instructions in the Programmer's Guide or the Tutorial. If you have ensured that the host can load the necessary jar files, you can suppress this message by setting the AGI_PARALLEL_IGNORE_JAR_WARNING environment variable.
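To see for yourself whether a task class was loaded from a jar or from a directory of class files, you can inspect its code source with the standard Java API. This is a heuristic sketch, not the check the product performs; `JarCheck` and its methods are hypothetical names introduced for illustration.

```java
import java.net.URL;
import java.security.CodeSource;

/** Hypothetical helper; not part of the STK Parallel Computing Server API. */
final class JarCheck {
    private JarCheck() {}

    /** Returns true if the given code-source path points at a .jar file. */
    static boolean isLocationJar(String location) {
        return location != null && location.toLowerCase().endsWith(".jar");
    }

    /** Returns true if the JVM reports that the class was loaded from a .jar file. */
    static boolean isLoadedFromJar(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null) {
            // Classes loaded by the bootstrap loader have no code source.
            return false;
        }
        URL location = source.getLocation();
        return location != null && isLocationJar(location.getPath());
    }
}
```

Passing your own task class to `JarCheck.isLoadedFromJar` returns `false` when the application is started from an IDE output folder or a `classes` directory, which is the situation the warning describes.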
Hosts run as the same user as the agent that started them.
For performance reasons, task progress is polled every 200 milliseconds. If you report progress more often than every 200 milliseconds, some of your progress updates will not be sent back to the client. If you need to send data back to the client with guaranteed delivery, use message passing instead (see "Communicate with tasks by sending messages").
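Because progress is only sampled every 200 milliseconds, a task gains nothing by reporting more often than that. One way to avoid the wasted calls is to throttle reports inside the task. The sketch below is an assumption-laden illustration: `ProgressThrottle` is a hypothetical class, and the actual progress-reporting call of the API is deliberately left out.

```java
/** Hypothetical helper; not part of the STK Parallel Computing Server API. */
final class ProgressThrottle {
    private final long minIntervalMillis;
    private long lastReportMillis;
    private boolean reportedOnce;

    ProgressThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Returns true if a report at nowMillis is at least minIntervalMillis
     * after the previous accepted report (the first report always passes).
     */
    boolean shouldReport(long nowMillis) {
        if (!reportedOnce || nowMillis - lastReportMillis >= minIntervalMillis) {
            reportedOnce = true;
            lastReportMillis = nowMillis;
            return true;
        }
        return false;
    }
}
```

Inside a task loop, create one `new ProgressThrottle(200)` and report progress only when `shouldReport(System.currentTimeMillis())` returns `true`. The final value should still be reported unconditionally when the task finishes.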
You are likely using a constant Id for your task environments. If the assemblies in the task environment are resident in a host process, any changes to the task environment with the same identity will not take effect until the host process recycles. Whenever you update your task assemblies, make sure you assign a new Id value.
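One way to avoid a stale Id is to derive it from the contents of the files in the task environment, so that it changes exactly when the files change. The sketch below assumes the Id can be set from a `java.util.UUID`; check the API reference for the actual type. `EnvironmentIds` and `fromContents` are hypothetical names, not part of the product API.

```java
import java.io.ByteArrayOutputStream;
import java.util.UUID;

/** Hypothetical helper; not part of the STK Parallel Computing Server API. */
final class EnvironmentIds {
    private EnvironmentIds() {}

    /**
     * Derives a name-based UUID from the bytes of the task environment's files.
     * Identical contents give the same UUID; changed contents give a new one.
     */
    static UUID fromContents(byte[]... artifacts) {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (byte[] artifact : artifacts) {
            all.write(artifact, 0, artifact.length);
        }
        return UUID.nameUUIDFromBytes(all.toByteArray());
    }
}
```

You could read each jar with `java.nio.file.Files.readAllBytes` and pass the results to `fromContents`. If that is more machinery than you need, `UUID.randomUUID()` on every deployment also satisfies the requirement of a new Id value.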
The host catches all user exceptions a task can throw. The task's status will be set to FAILED and the host process that executed the task will recycle. The host will write the exception reason to the host log.
A host process can die for a number of reasons outside of the host's control. Examples include a user killing the host process or the task suddenly exiting from the process. This type of error is identified as a task interruption (INTERRUPTED). The task is retried on another process, up to the number of times specified in MaxTaskInterruptedRetryAttempts. If the task exceeds this count, the task's status is set to FAILED.
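The retry rule can be pictured as a small counter. The following sketch is a simplified model for illustration only, not the server's implementation: `InterruptionPolicy` is a hypothetical class, and it assumes that a task is retried after each interruption until the configured number of retries is used up, after which the next interruption marks it FAILED.

```java
/** Simplified, hypothetical model of the interruption retry rule. */
final class InterruptionPolicy {
    enum Outcome { RETRY, FAILED }

    private final int maxRetryAttempts; // stands in for MaxTaskInterruptedRetryAttempts
    private int retriesUsed;

    InterruptionPolicy(int maxRetryAttempts) {
        this.maxRetryAttempts = maxRetryAttempts;
    }

    /** Called each time the task's host process dies unexpectedly. */
    Outcome onInterrupted() {
        if (retriesUsed < maxRetryAttempts) {
            retriesUsed++;
            return Outcome.RETRY; // reschedule on another host process
        }
        return Outcome.FAILED; // retry budget exhausted
    }
}
```

Under this model, with a limit of 2 the first two interruptions lead to retries and the third sets the task to FAILED.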
All the tasks of the agent are reassigned to another agent and the tasks are restarted. The agent is removed from the list of available agents. The coordinator identifies an agent as dead if the agent does not respond to an internal signal or "heartbeat" within 30 seconds (configurable).
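Heartbeat-based failure detection of this kind can be sketched in a few lines. The code below is a generic illustration under stated assumptions, not the coordinator's actual code: `HeartbeatMonitor` is a hypothetical class, timestamps are passed in explicitly so the behavior is easy to follow, and an agent counts as dead once more than the timeout has elapsed since its last heartbeat.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Generic, hypothetical sketch of heartbeat-based failure detection. */
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new TreeMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records that the named agent responded at nowMillis. */
    void recordHeartbeat(String agent, long nowMillis) {
        lastSeen.put(agent, nowMillis);
    }

    /** Removes and returns agents whose last heartbeat is older than the timeout. */
    List<String> removeDeadAgents(long nowMillis) {
        List<String> dead = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > timeoutMillis) {
                dead.add(entry.getKey());
                it.remove(); // no longer in the list of available agents
            }
        }
        return dead;
    }
}
```

With a 30-second timeout (`new HeartbeatMonitor(30_000)`), an agent that last responded at time 0 is still considered alive at 30,000 ms and is removed at the first check after that. A real coordinator would then reassign that agent's tasks.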
All agents connected to the coordinator will detect that the coordinator is not running. Any tasks assigned to those agents are canceled. Each agent then goes into a retry loop, checking periodically whether the coordinator has been restarted.
The messages sent to the client are simply ignored. By default, the tasks still run until completion. You can choose to cancel tasks once the client disconnects by specifying the CancelOnClientDisconnection option.
STK Parallel Computing Server 2.9 API for Java