Frequently Asked Questions
The following sections provide solutions for some common issues.
The coordinator's main responsibilities are passing messages between machines on the network and storing jobs along with all of their necessary data. The ideal coordinator machine therefore has a fast network connection and enough memory to hold many jobs; 2 GB of free memory is a good rule of thumb.
The agents are the machines that do the actual work, so the ideal agent machine depends on the type of resources your jobs need. For instance, a machine with many cores can execute more tasks in parallel.
If the client cannot connect to the coordinator, work through the following checks. Confirm that the coordinator is running at the hostname and port the client is using. Ping the coordinator's machine to confirm that it is reachable. Check whether a firewall is blocking the coordinator's port. Check whether your client application is timing out. Search the coordinator logs for "timed-out" and "ERROR". For further steps, see the Troubleshooting documentation.
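The port check above can be scripted. The following is a minimal sketch in Python using only the standard library; it is not part of the product's API. The function name `can_reach` is made up for illustration, and the hostname and port in the usage note are placeholders.

```python
import socket


def can_reach(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        # create_connection resolves the hostname and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `can_reach("coordinator-host", 9090)` (both values are placeholders) returns False when nothing is listening on that port or a firewall drops the connection. A True result only shows that something accepts TCP connections there, not that it is the coordinator.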
Host processes run as the same user as the agent that started them.
For performance reasons, task progress is polled every 200 milliseconds. If you report progress more often than every 200 milliseconds, some of your progress updates will not be sent back to the client. If you need to send data back to the client with guaranteed delivery, use message passing instead; see Communicate with tasks by sending messages.
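To see why updates are lost, the sketch below simulates a poller that reads only the most recent progress value at each tick. It is an illustration of the behavior described above, not the server's implementation; the function name and the (timestamp, value) input format are assumptions.

```python
def polled_updates(updates, poll_interval_ms=200):
    """Simulate a poller that sees only the latest progress value at each tick.

    updates: list of (timestamp_ms, progress) pairs, sorted by timestamp.
    Returns the (tick_ms, progress) pairs the poller would observe.
    """
    observed = []
    if not updates:
        return observed
    i = 0
    latest = None
    last_seen = None
    tick = poll_interval_ms
    while True:
        # Consume every update reported up to and including this tick;
        # only the last one survives.
        while i < len(updates) and updates[i][0] <= tick:
            latest = updates[i][1]
            i += 1
        # Record the value only if it changed since the previous tick.
        if latest is not None and latest != last_seen:
            observed.append((tick, latest))
            last_seen = latest
        if i >= len(updates):
            break
        tick += poll_interval_ms
    return observed
```

With progress reported every 50 ms, only every fourth value is observed, which is why data that must arrive should travel as a message rather than as a progress update.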
You are likely using a constant Id for your task environments. If the assemblies in the task environment are resident in a host process, any changes to the task environment with the same identity will not take effect until the host process recycles. Whenever you update your task assemblies, make sure you assign a new Id value.
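One way to avoid forgetting to change the Id is to derive it from the assembly contents, so that it changes whenever a file changes. The sketch below shows this idea in Python. It is a substitute technique for illustration, not something the product documentation prescribes; the function name is hypothetical, and the type the real API expects for an Id may differ from the hex string returned here.

```python
import hashlib
from pathlib import Path


def environment_id(assembly_paths):
    """Derive an identifier from the names and bytes of the given files.

    Any change to a file's contents produces a different identifier.
    """
    digest = hashlib.sha256()
    # Sort so the result does not depend on the order the paths were given in.
    for path in sorted(str(p) for p in assembly_paths):
        digest.update(Path(path).name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()
```

The trade-off is that every submission reads and hashes the files; in exchange, an unchanged build keeps its Id and a rebuilt assembly gets a new one automatically.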
By default, the host process uses .NET's server garbage collection. To change the type of garbage collection the host uses, edit the host's app.config file, found by default at C:\Program Files\AGI\STK Parallel Computing Server 2.9\Agent\bin\AGI.Parallel.Host.exe.config. In .NET configuration files, server garbage collection is controlled by the gcServer element in the runtime section.
There are two ways to avoid sending assemblies. With either one, you must manually ensure that the host process can resolve the assembly, for instance by installing the assembly in the Global Assembly Cache (GAC) on each of the agent machines.
The first way is to use the ExcludedDependencies property to exclude all the assemblies required by your job.
The second way relies on the fact that the API does not send an assembly that is installed in the GAC on the client (submitter) machine. Install the assembly in the GAC on the client machine, and also on each of the agent machines.
The host catches all user exceptions a task can throw. The task's status will be set to Failed and the host process that executed the task will recycle. The host will write the exception reason to the host log.
A host process could die for a number of reasons outside of the host's control. Examples include a user killing the host process or the task suddenly exiting from the process. This type of error is identified as a task interruption (Interrupted). The task is retried on another process up to the value specified in MaxTaskInterruptedRetryAttempts. If the task exceeds this count, the task's status will be set to Failed.
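The retry rule for interrupted tasks can be sketched as follows. This is an illustrative Python model, not the server's code: `run_once` is a hypothetical callable standing in for one execution of the task, and whether the first run counts toward the limit is an assumption (here it does not, so a limit of 2 allows up to three runs).

```python
def run_with_interruption_retries(run_once, max_interrupted_retry_attempts):
    """Run a task, retrying while it is interrupted, up to the given limit.

    run_once: callable returning "Completed" or "Interrupted".
    Returns (final_status, number_of_runs).
    """
    runs = 0
    retries_used = 0
    while True:
        runs += 1
        outcome = run_once()
        if outcome != "Interrupted":
            return outcome, runs
        # Assumption: the initial run is not counted as a retry.
        if retries_used >= max_interrupted_retry_attempts:
            return "Failed", runs
        retries_used += 1
```

A task that is interrupted on every run ends as Failed once the retries are used up, while a task that succeeds on a later run reports its normal status.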
When an agent dies, all of its tasks are reassigned to another agent and restarted, and the agent is removed from the list of available agents. The coordinator identifies an agent as dead if the agent does not respond to an internal signal, or "heartbeat", within 30 seconds (configurable).
If the coordinator stops running, all agents connected to it detect that it is no longer running. Any tasks assigned to an agent are canceled. The agent then enters a retry loop, periodically checking whether the coordinator has restarted.
If the client disconnects, messages sent to it are simply ignored. By default, the tasks still run to completion. You can choose to cancel tasks once the client disconnects by specifying the CancelOnClientDisconnection option.
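Heartbeat-based detection of a dead agent can be modeled in a few lines. The Python sketch below is a simplified illustration under stated assumptions: the function names are invented, timestamps are plain seconds, and the round-robin choice of the new agent is an assumption, since the text above does not say how the coordinator picks one.

```python
def find_dead_agents(last_heartbeat, now, timeout_s=30.0):
    """Return agents whose last heartbeat is older than timeout_s seconds.

    last_heartbeat: dict mapping agent name to last heartbeat time (seconds).
    """
    return sorted(a for a, t in last_heartbeat.items() if now - t > timeout_s)


def reassign_tasks(assignments, dead_agents):
    """Drop dead agents and hand their tasks to the remaining agents.

    assignments: dict mapping agent name to a list of task ids.
    Returns a new dict; the input is not modified.
    """
    dead = set(dead_agents)
    live = sorted(a for a in assignments if a not in dead)
    result = {a: list(assignments[a]) for a in live}
    orphaned = [t for a in sorted(dead) for t in assignments.get(a, [])]
    if orphaned and not live:
        raise RuntimeError("no live agents left to take over tasks")
    # Assumed policy: spread orphaned tasks round-robin over the live agents.
    for i, task in enumerate(orphaned):
        result[live[i % len(live)]].append(task)
    return result
```

Reassigned tasks start over on the new agent, so any partial work done on the dead agent is lost.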
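The agent's retry loop can be pictured as a periodic check with a pause between attempts. The following Python sketch is illustrative only: `is_up` is a hypothetical callable that reports whether the coordinator is reachable, and the 5-second interval and the optional `max_checks` limit are assumptions, not documented values.

```python
import time


def wait_for_coordinator(is_up, retry_interval_s=5.0, max_checks=None,
                         sleep=time.sleep):
    """Poll is_up() until it returns True.

    Returns the number of checks performed, or None if max_checks was
    reached without success. With max_checks=None the loop never gives up.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if is_up():
            return checks
        # Pause before the next check so the loop does not spin.
        sleep(retry_interval_s)
    return None
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real waiting.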
STK Parallel Computing Server 2.9 API for .NET