Troubleshooting
The topics in this section describe ways to diagnose problems that might arise and techniques for capturing information when they do occur.
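This section explains how to view task output, read the log files, diagnose application logic, serialization, and assembly-loading errors, measure task execution time, and investigate blocked tasks.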
Tasks can write debugging statements to the log and to standard output. To view them, first open the Coordinator Tray Application if it is not already running. By default it is located at C:\Program Files\AGI\STK Parallel Computing Server 2.9\Coordinator\bin\AGI.Parallel.CoordinatorTray.exe. The tray application must be started before the task is submitted. Once it is running, every task submitted to the Coordinator is displayed.
Open the task monitor to view the list of submitted, running, and completed tasks.
Double-click a task to view its standard output and standard error. For example, a task can write a number of trace statements to standard output.
    [Serializable]
    public class YourTask : Task
    {
        public override void Execute()
        {
            // Do something
            Console.WriteLine("Task did something");

            // Do another thing
            Console.WriteLine("Task did another thing");

            // Dare to do that thing
            Console.WriteLine("Task dared to do that thing");
        }
    }
When a task is finished, the trace statements can be viewed in the task properties window. This can be an easy way to troubleshoot many issues within tasks.
Caution |
---|
The task monitoring applications do not work with the embedded job scheduler. |
The Coordinator, Agent, and Hosts all write logging information to disk, which can be useful when troubleshooting. The logs include system messages, the times at which tasks change state, and any errors encountered. User-defined messages can also be logged; see Log messages in task. There are two ways to view the log files: through the GUI monitoring applications, or by opening the log files directly.
The Coordinator Monitor GUI provides access to the host log files. Start the task monitor, right-click a task, and click Show Host Log.
If the GUI monitoring applications are not available, open the log files manually. The default location of the host log files is C:\ProgramData\AGI\STK Parallel Computing Server 2.9\logs. Each log file is named "host-" followed by the process ID of the host. For example, host-3048.log is the log for the host process with process ID 3048.
Tip |
---|
In Windows Explorer, sort the log files by modification date in descending order. The most recent logs should correspond to the most recent tasks. |
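When many hosts have run, the same newest-first sort can be done in code. The following is a minimal sketch that uses only standard .NET file APIs; the class and method names (`HostLogFinder`, `NewestHostLogs`, `ProcessIdFromLogName`) are hypothetical helpers, not part of the product API, and the `host-*.log` pattern is assumed from the naming convention described above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of the STK Parallel Computing Server API.
public static class HostLogFinder
{
    // Returns up to 'count' host log files, most recently modified first.
    public static List<FileInfo> NewestHostLogs(string logDirectory, int count)
    {
        return new DirectoryInfo(logDirectory)
            .GetFiles("host-*.log")
            .OrderByDescending(f => f.LastWriteTimeUtc)
            .Take(count)
            .ToList();
    }

    // Extracts the process ID from a name such as "host-3048.log".
    // Returns null when the name does not follow that convention.
    public static int? ProcessIdFromLogName(string fileName)
    {
        const string prefix = "host-";
        string stem = Path.GetFileNameWithoutExtension(fileName);
        if (stem == null || !stem.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
        {
            return null;
        }

        int pid;
        return int.TryParse(stem.Substring(prefix.Length), out pid) ? pid : (int?)null;
    }
}
```

For example, `HostLogFinder.NewestHostLogs(logDirectory, 5)` returns the five most recently written host logs in the given directory, and `ProcessIdFromLogName` maps each file back to its host process ID.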
When troubleshooting an application logic error, it can be beneficial to simplify the problem into smaller pieces. Here are some useful tips:
Check the status of the task. Did it fail? Was there an exception? In many cases, Task.StandardError contains a clue to the problem.
Check the log files. Are there any errors? Search the logs for the string "ERROR". Also check the exit code of the host process: the agent writes a log entry when a process exits with an unexpected exit code. If the process exits gracefully, the exit code is not logged.
Check whether there are any related messages in the Windows Event Viewer.
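The log search suggested above can also be scripted. This is a minimal sketch that assumes only that error entries contain the literal string "ERROR", as stated above; `LogErrorScanner` and `FindErrorLines` are hypothetical names, and the exact layout of a log line is not assumed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the STK Parallel Computing Server API.
public static class LogErrorScanner
{
    // Returns "lineNumber: text" for every line that contains the marker (case-sensitive).
    public static List<string> FindErrorLines(string logPath, string marker = "ERROR")
    {
        var hits = new List<string>();
        int lineNumber = 0;

        // FileShare.ReadWrite allows reading while another process still has the log open for writing.
        using (var stream = new FileStream(logPath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (var reader = new StreamReader(stream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lineNumber++;
                if (line.Contains(marker))
                {
                    hits.Add(lineNumber + ": " + line);
                }
            }
        }

        return hits;
    }
}
```

Keeping the line number with each hit makes it easy to open the log at that position and read the surrounding entries for context.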
If an exception similar to the one below is thrown, either the task or the task environment instance is not serializable.

    AGI.Parallel.Infrastructure.Serialization.MessageSerializationException was unhandled
      Message="Could not serialize object graph"
      Source="AGI.Parallel.Infrastructure"
      StackTrace:
        at AGI.Parallel.Infrastructure.Serialization.DefaultSerializationStrategy.SerializeObjectInternal(Object graph, ISerializationFormatter formatter)
        at AGI.Parallel.Infrastructure.Serialization.DefaultSerializationStrategy.SerializeTask(Task task, TaskEnvironmentIdentification environmentIdentification)
        at AGI.Parallel.Client.JobSubmissionFactory.CreateJobSubmissionMessage(JobSubmissionParameters submissionParameters)
        at AGI.Parallel.Client.CoordinatorProxy.SubmitJob(Guid jobId, JobSubmissionParameters submissionParameters, Action`1 jobSubmittedCallback, Action`1 taskCompletedCallback, Action`1 taskUpdatedCallback, TaskProgressEventHandler taskProgressUpdatedCallback)
        at AGI.Parallel.Client.JobDispatcher.Submit(Job job)
        at AGI.Parallel.Client.ClusterJobScheduler.SubmitJob(Job job, Action`1 taskCompletedCallback, Action`4 taskStateChangedCallback, Action`1 jobSubmittedCallback, Action jobCompletedCallback, Action`2 taskProgressCallback)
        at AGI.Parallel.Client.Job.Submit()
Tasks and the task environment are serialized using .NET's BinaryFormatter. To diagnose a serialization problem, set the submission code aside and isolate the serialization step: serialize the task or task environment directly with BinaryFormatter and see whether the problem persists. For instance, the following code snippet tests serialization of an object:

    Task yourTask = new YourTask();
    BinaryFormatter binaryFormatter = new BinaryFormatter();
    using (ChunkedMemoryStream ms = new ChunkedMemoryStream())
    {
        // Check if this throws an exception.
        binaryFormatter.Serialize(ms, yourTask);
    }
Tip |
---|
If there is a field in the object graph that is non-serializable (for instance, a third-party type), check whether it really needs to be serialized. If not, exclude the field from serialization by marking it with the System.NonSerializedAttribute attribute, as in the C# example that follows this tip. |

    [Serializable]
    public class DoNotSerializeProcess : Task
    {
        // Mark process as NonSerialized to avoid the serialization issues.
        [NonSerialized]
        private Process process;

        public DoNotSerializeProcess()
        {
            this.process = Process.GetCurrentProcess();
        }

        public override void Execute()
        {
        }
    }
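To find candidate fields before running the full serialization test, reflection can list the instance fields whose declared type is not marked serializable. The sketch below is only a first-pass aid under stated assumptions: it inspects declared field types one level deep, skips interface-typed fields (whose runtime type cannot be known statically), and does not follow the object graph, so the BinaryFormatter test remains the real check. `SerializationInspector` and `SampleWithProcess` are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Reflection;

// Hypothetical helper, not part of the STK Parallel Computing Server API.
public static class SerializationInspector
{
    // Lists instance fields (including private fields of base classes) whose declared
    // type is not serializable and that are not marked [NonSerialized].
    public static List<string> FindSuspectFields(Type type)
    {
        var suspects = new List<string>();
        const BindingFlags flags = BindingFlags.Instance | BindingFlags.Public |
                                   BindingFlags.NonPublic | BindingFlags.DeclaredOnly;

        for (Type t = type; t != null && t != typeof(object); t = t.BaseType)
        {
            foreach (FieldInfo field in t.GetFields(flags))
            {
                if (field.IsNotSerialized)
                {
                    continue; // Already excluded from serialization.
                }

                Type fieldType = field.FieldType;
                if (!fieldType.IsSerializable && !fieldType.IsInterface)
                {
                    suspects.Add(t.Name + "." + field.Name + " : " + fieldType.FullName);
                }
            }
        }

        return suspects;
    }
}

// Illustrative type only: one problematic field, one excluded field, one harmless field.
[Serializable]
public class SampleWithProcess
{
    private Process process;                  // Process is not serializable.
    [NonSerialized] private Process skipped;  // Excluded, so not reported.
    private int count;                        // Primitive types are serializable.
}
```

Calling `SerializationInspector.FindSuspectFields(typeof(SampleWithProcess))` reports only the `process` field. Each reported field is a candidate either for `[NonSerialized]` or for replacement with a serializable representation.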
Another possible issue is that the task payload is too big. The limit for a serialized object is 2 GB. Measure the serialized size of a .NET object with the following code:

    Task yourTask = new YourTask();
    BinaryFormatter binaryFormatter = new BinaryFormatter();
    Array array;
    using (ChunkedMemoryStream ms = new ChunkedMemoryStream())
    {
        binaryFormatter.Serialize(ms, yourTask);
        array = ms.ToArray();
    }
    Console.WriteLine("Serialized size is: " + array.Length);
If a missing-assembly exception such as the one below is thrown, or if the task loads the wrong assembly, check the host log.
AGI.Parallel.Infrastructure.Serialization.MessageSerializationException: Could not deserialize object graph ---> System.Runtime.Serialization.SerializationException: Unable to find assembly 'YourAssembly, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null'.
The host writes an entry to its log every time it tries to resolve an assembly, so the host log is the first place to look. For more control over which assemblies are sent and which are not, see Job.ExcludedDependencies and Job.AdditionalDependencies. More instructions can be found at Explicitly manage .NET assemblies.
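When deciding which assemblies to include or exclude, it can help to see what the assembly containing the task references directly. The following is a minimal sketch using standard .NET reflection; `DependencyLister` is a hypothetical helper, and the output is only a starting point for comparison with the names in the host log, not a statement of what the scheduler actually sends.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Hypothetical helper, not part of the STK Parallel Computing Server API.
public static class DependencyLister
{
    // Returns the full names of the assemblies directly referenced by the
    // assembly that defines 'type', sorted for easy comparison.
    public static List<string> DirectReferences(Type type)
    {
        return type.Assembly
            .GetReferencedAssemblies()
            .Select(name => name.FullName)
            .OrderBy(fullName => fullName, StringComparer.Ordinal)
            .ToList();
    }
}
```

The full names have the same form as the one in the exception message ("Name, Version=..., Culture=..., PublicKeyToken=..."), so a version or public key token mismatch between the client machine and the host can be spotted by comparing the strings.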
There are a few tools provided to measure the execution times of tasks.
You can get the total run time of a task from task.GetProperty<DateTime>(TaskProperties.HostStartTime) and task.GetProperty<DateTime>(TaskProperties.HostEndTime). The returned values are the start and stop times of the task's Execute method, so their difference is the run time. This does not include the task environment's setup time.

    DateTime startTime = task.GetProperty<DateTime>(TaskProperties.HostStartTime);
    DateTime endTime = task.GetProperty<DateTime>(TaskProperties.HostEndTime);
    Console.WriteLine("Task start time: " + startTime);
    Console.WriteLine("Task end time: " + endTime);
    Console.WriteLine("Duration: " + endTime.Subtract(startTime));
The log files also record the times of the events that occur in the system.
Finally, the duration time of tasks can also be viewed using the Task Monitor.
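The tools above report the duration of a task as a whole. To see where the time goes inside Execute, one option is to time individual steps with the standard .NET Stopwatch and write the results to standard output, where they appear alongside the task's other trace statements. This is a minimal sketch and not one of the product's tools; `StepTimer` is a hypothetical helper.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of the STK Parallel Computing Server API.
public static class StepTimer
{
    // Runs the step, writes its elapsed time to standard output, and returns the elapsed time.
    // The time is written even if the step throws, so a failing step is still visible.
    public static TimeSpan Time(string stepName, Action step)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        try
        {
            step();
        }
        finally
        {
            stopwatch.Stop();
            Console.WriteLine("{0} took {1}", stepName, stopwatch.Elapsed);
        }

        return stopwatch.Elapsed;
    }
}
```

Inside a task, a call such as `StepTimer.Time("Load input", () => LoadInput());` (where `LoadInput` stands for one of the task's own methods) adds one line per step to the task's standard output.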
If a task doesn't make any progress in its Execute method, it is likely blocked. Often the reason is application specific, for instance a deadlock bug that needs to be fixed in the task. In other cases, the cause may be less intuitive.
Is there a UI prompt in the task? When legacy application code is ported to a task and a UI prompt is not removed, the task waits forever for input that never arrives.
Are there issues with user rights? If the agent runs as the SYSTEM user, does the task depend on anything that this account cannot access?
Setting Job.TaskExecutionTimeout to a reasonable value is a good way to let a blocked task fail gracefully.
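To help answer the UI prompt and user rights questions above, a task can report the context it is actually running in. The sketch below uses only standard `System.Environment` properties; `ExecutionContextReport` is a hypothetical helper, and writing its result to standard output at the start of Execute is a suggestion rather than a product feature.

```csharp
using System;

// Hypothetical helper, not part of the STK Parallel Computing Server API.
public static class ExecutionContextReport
{
    // Describes the account and session the current process is running under.
    public static string Describe()
    {
        return string.Format(
            "User: {0}\\{1}, interactive: {2}, working directory: {3}",
            Environment.UserDomainName,
            Environment.UserName,
            Environment.UserInteractive,
            Environment.CurrentDirectory);
    }
}
```

Calling `Console.WriteLine(ExecutionContextReport.Describe());` as the first statement of Execute makes the account visible in the task's standard output. A non-interactive session is a hint that any dialog shown by the task would never be answered.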
STK Parallel Computing Server 2.9 API for .NET