Troubleshooting¶
The topics in this section describe ways to diagnose problems that might arise and techniques for capturing information when they do occur.
Overview¶
Viewing the Tracing Information with the Task Monitor¶
Tasks can write debugging statements to the log and standard output. First, if it is not already running, open the Coordinator Tray Application. The default location is:
C:\Program Files\AGI\STK Parallel Computing Server 2.9\Coordinator\bin\AGI.Parallel.CoordinatorTray.exe.
Start the tray application before submitting the task. Once it is running, any task submitted to the Coordinator is displayed.
Open the task monitor to view the list of submitted and running/completed tasks:
Double-click a task to view its standard output and standard error. For example, a task can write a number of trace statements to standard output.
class YourTask:
    def execute(self):
        # do something
        print("Task did something")
        # do another thing
        print("Task did another thing")
        # dare to do that thing
        print("Task dared to do that thing")
When a task is finished, the trace statements can be viewed in the task properties window. This can be an easy way to troubleshoot many issues within tasks.
Viewing Log Files¶
The Coordinator, Agent, and Hosts all write logging information to disk. This can be useful when troubleshooting issues. Information in the log includes system messages, times when tasks change state, and any error information encountered. User-defined messages can also be logged. See instructions at Log Messages In Task. There are two ways to view the log files: through the GUI monitoring applications, or by opening the log files directly.
The Coordinator Monitor GUI provides the host log files. Start the task monitor, right-click a task, and click Show Host Log.
If the GUI monitoring applications are not available, view the log files by opening the files manually. The default location of the host log files is C:\ProgramData\AGI\STK Parallel Computing Server 2.9\logs. The naming convention of a log file is “python-” with the host process id appended. For example, python-3048.log would be the log for the host process with the process id of 3048.
Note
In Windows Explorer, sort the log files by modification date in descending order. The most recent logs should correspond to the most recent tasks.
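When Explorer is not convenient, for example in a remote session, the same ordering can be produced with a short script. This is a minimal sketch, assuming the default log directory quoted above; the helper name newest_logs is hypothetical and not part of the product.

```python
from pathlib import Path

# Default host log directory quoted in this section; adjust for your install.
LOG_DIR = Path(r"C:\ProgramData\AGI\STK Parallel Computing Server 2.9\logs")


def newest_logs(log_dir, limit=5):
    """Return up to `limit` *.log files in log_dir, most recently modified first."""
    logs = [p for p in Path(log_dir).glob("*.log") if p.is_file()]
    logs.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return logs[:limit]


if __name__ == "__main__":
    if LOG_DIR.is_dir():
        for path in newest_logs(LOG_DIR):
            print(path.name)
    else:
        print(f"Log directory not found: {LOG_DIR}")
```

The first names printed should belong to the host processes that ran the most recent tasks.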
Simplify the Problem¶
When troubleshooting an application logic error, it can be beneficial to simplify the problem into smaller pieces. Here are some useful tips:
Add only a single task to the job.
Simulate the calling sequence.
First make sure it works on a local machine. Incrementally add machines to the cluster until the problem is found.
If it works on one machine but not another, check that it is not a user rights issue.
See if the same problem exists when executing the code directly, without executing it in a job/task.
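The last tip, executing the code directly, can be sketched as follows. The harness calls execute() in the current process and captures what the task prints and any exception it raises. YourTask here is a stand-in and run_directly is a hypothetical helper; neither comes from the product API.

```python
import contextlib
import io
import traceback


class YourTask:
    """Stand-in task; replace with the task under investigation."""

    def execute(self):
        print("Task did something")
        self.result = 42


def run_directly(task):
    """Call task.execute() in-process and return (stdout_text, error_text)."""
    out = io.StringIO()
    error = ""
    with contextlib.redirect_stdout(out):
        try:
            task.execute()
        except Exception:
            # Keep the full traceback so the failure can be inspected afterwards.
            error = traceback.format_exc()
    return out.getvalue(), error


if __name__ == "__main__":
    stdout_text, error_text = run_directly(YourTask())
    print("stdout:", stdout_text.strip())
    print("error:", error_text.strip() or "<none>")
```

If the task fails here as well, the problem is in the task logic rather than in job submission or the cluster.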
Check Task Status¶
Check the status of the task. Did it fail? Was there an exception? In many cases, task.standard_error will contain a clue to the problem.
Check the log files for errors by searching for the string “ERROR”. Also check the exit code of the host process: the agent writes a log entry when a process exits with an unexpected exit code. If the process exits gracefully, the exit code is not logged.
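Searching every log file for “ERROR” by hand gets tedious, so a small script can help. The sketch below assumes the logs are plain text files ending in .log, as in the naming convention described earlier; find_error_lines is a hypothetical helper name.

```python
from pathlib import Path

# Default host log directory quoted in this section; adjust for your install.
LOG_DIR = Path(r"C:\ProgramData\AGI\STK Parallel Computing Server 2.9\logs")


def find_error_lines(log_dir, needle="ERROR"):
    """Yield (file name, line number, text) for each log line containing needle."""
    for path in sorted(Path(log_dir).glob("*.log")):
        # errors="replace" keeps the scan going if a log holds unexpected bytes.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    yield path.name, number, line.rstrip()


if __name__ == "__main__":
    if LOG_DIR.is_dir():
        for name, number, text in find_error_lines(LOG_DIR):
            print(f"{name}:{number}: {text}")
    else:
        print(f"Log directory not found: {LOG_DIR}")
```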
Object Serialization Problems¶
If you get an error message that looks like the following, either the task or task environment is not serializable.
AttributeError: Can't pickle local object 'Task.execute.<locals>.<something_unserializable>'
or
TypeError: Cannot serialize socket object
Tasks and the task environment are serialized using Python’s pickle module. To diagnose serialization issues, set the submission code aside and concentrate on serialization alone: try to serialize the task or task environment using only pickle and see if the problem persists. The Python documentation for the pickle module lists the types that can be pickled.
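A pickle-only check can be as small as the sketch below. GoodTask, BadTask, and check_picklable are hypothetical names used for illustration. BadTask holds an open socket, one of the object types pickle refuses to serialize.

```python
import pickle
import socket


class GoodTask:
    """Holds only plain data, so it pickles cleanly."""

    def __init__(self):
        self.value = 3

    def execute(self):
        self.result = self.value * 2


class BadTask:
    """Holds a socket, which pickle cannot serialize."""

    def __init__(self):
        self.connection = socket.socket()

    def execute(self):
        pass


def check_picklable(obj):
    """Return None if obj survives a pickle round trip, else the exception."""
    try:
        pickle.loads(pickle.dumps(obj))
    except Exception as exc:
        return exc
    return None


if __name__ == "__main__":
    bad = BadTask()
    for task in (GoodTask(), bad):
        problem = check_picklable(task)
        status = "OK" if problem is None else f"FAILED: {problem!r}"
        print(f"{type(task).__name__}: {status}")
    bad.connection.close()
```

When the check fails, remove or replace attributes one at a time until the round trip succeeds; the last attribute removed is the one that cannot be serialized.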
Measuring Task Performance¶
There are a few tools provided to measure the execution times of tasks.
You can get the total run time of a task using TaskProperties.HOST_START_TIME and TaskProperties.HOST_END_TIME. These are the start and stop times of the task’s execute method. They do not include the task environment’s setup time.
Sample output:
Task #1 ran on MyComputerName from 2019-10-08 09:50:31.782120 to 2019-10-08 09:50:36.782179
Task #2 ran on MyComputerName from 2019-10-08 09:50:36.810140 to 2019-10-08 09:50:41.811087
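To turn a start and end time into a duration, subtract one from the other. The sketch below works on the timestamp text from the sample output and assumes it uses the format shown there. If the property values are already datetime objects in your environment, the parsing step can be skipped. duration_seconds is a hypothetical helper name.

```python
from datetime import datetime

# Start and end values copied from the sample output above.
SAMPLES = {
    "Task #1": ("2019-10-08 09:50:31.782120", "2019-10-08 09:50:36.782179"),
    "Task #2": ("2019-10-08 09:50:36.810140", "2019-10-08 09:50:41.811087"),
}

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def duration_seconds(start_text, end_text):
    """Return the elapsed seconds between two timestamp strings."""
    start = datetime.strptime(start_text, TIMESTAMP_FORMAT)
    end = datetime.strptime(end_text, TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


if __name__ == "__main__":
    for name, (start_text, end_text) in SAMPLES.items():
        print(f"{name} took {duration_seconds(start_text, end_text):.6f} s")
```

For the sample values, both tasks ran for just over five seconds.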
The log files also record timestamps for the events that occur in the system.
Finally, task durations can also be viewed using the Task Monitor.
Diagnose Blocking Tasks¶
If a task doesn’t make any progress in the execute method, it is likely blocked. In many cases the reason is application specific, for instance a deadlock bug that needs to be fixed in the task. In other cases, the cause is less obvious:
Is there a UI prompt in the task? There can be issues when a task is ported from a legacy application to a task and a UI prompt is not removed, causing the task to wait forever.
Are there issues with user rights? If the agent runs as the SYSTEM user, check whether the task behaves differently under that account.
Setting Job.task_execution_timeout to a reasonable value is a good way to allow a graceful failure.
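Before submitting a job, it can help to check locally whether execute() returns at all. The sketch below is a local diagnostic only and is not how Job.task_execution_timeout is implemented: it runs execute() in a daemon thread and reports whether it finished within a limit. finishes_within, QuickTask, and StuckTask are hypothetical names.

```python
import threading


class QuickTask:
    def execute(self):
        pass  # returns immediately


class StuckTask:
    """Simulates a blocked task by waiting on an event nobody sets."""

    def __init__(self):
        self.release = threading.Event()

    def execute(self):
        self.release.wait()


def finishes_within(task, timeout_seconds):
    """Run task.execute() in a daemon thread; return True if it ends in time.

    A task that raises also counts as finished, because its thread ends.
    """
    worker = threading.Thread(target=task.execute, daemon=True)
    worker.start()
    worker.join(timeout_seconds)
    return not worker.is_alive()


if __name__ == "__main__":
    print("QuickTask finished:", finishes_within(QuickTask(), 1.0))
    stuck = StuckTask()
    print("StuckTask finished:", finishes_within(stuck, 0.5))
    stuck.release.set()  # let the simulated task end
```

A thread cannot be forcibly stopped, so this approach only detects the hang; the daemon flag ensures a stuck thread does not keep the interpreter alive on exit.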