Isolating Runtime Faults with Callstack Debugging using TAU

Traditional debugging tools do not fully support state inspection while examining failures in multi-language applications written in a combination of Python, C++, C, and Fortran. When an application experiences a runtime fault, such as a numerical or memory error, it is difficult to relate the location of the fault to the original source code and examine the performance of the application. We present a tool that can help identify the nature and location of runtime errors in a multi-language program at the point of failure. This debugging tool, integrated in the TAU Performance System®, isolates the fault by capturing the signal associated with it and reports the program callstack. It captures the performance data at the point of failure, stores detailed information for each frame in the callstack, and generates a file that may be shipped back to the developers for further analysis. The tool works on parallel programs, providing feedback about every process regardless of whether it experienced the fault. This paper describes the tool and demonstrates its application to the multi-language CREATE-AV applications Kestrel and Helios. The tool is useful to both software developers and users experiencing runtime software issues, as the file output may be exchanged between the user and the development team without disclosing potentially sensitive application data.
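The general idea of capturing a signal and recording the callstack can be illustrated with a minimal Python sketch. This is not TAU's implementation and uses none of its APIs; it relies only on the Python standard library. The names `build_report`, `on_fault`, `REPORT`, and the output file name are hypothetical, and SIGUSR1 (POSIX only) is assumed as a harmless stand-in for a real fault signal such as SIGSEGV or SIGFPE, which cannot be handled reliably from Python-level code.

```python
import json
import os
import signal
import tempfile
import time
import traceback

START = time.perf_counter()   # crude stand-in for performance data
REPORT = {}                   # hypothetical: filled in by the handler

def build_report(signum, frame):
    """Describe the signal and every frame of the interrupted callstack."""
    frames = [
        {"file": fs.filename, "line": fs.lineno, "function": fs.name}
        for fs in traceback.extract_stack(frame)
    ]
    return {
        "pid": os.getpid(),
        "signal": signal.Signals(signum).name,
        "elapsed_s": time.perf_counter() - START,
        "callstack": frames,
    }

def on_fault(signum, frame):
    """Signal handler: record the report and write it to a per-process file."""
    REPORT.clear()
    REPORT.update(build_report(signum, frame))
    # Hypothetical file name; one file per process, as in a parallel run.
    path = os.path.join(tempfile.gettempdir(), f"callstack.{os.getpid()}.json")
    with open(path, "w") as out:
        json.dump(REPORT, out, indent=2)

# Assumption: SIGUSR1 stands in for a genuine fault signal.
signal.signal(signal.SIGUSR1, on_fault)

def inner():
    signal.raise_signal(signal.SIGUSR1)  # simulate the fault

def outer():
    inner()
```

Calling `outer()` triggers the handler, which writes a small JSON file containing the signal name, the process ID, and one entry per callstack frame. The file holds only source locations and timing, not application data, which mirrors the abstract's point about exchanging output without disclosing sensitive data. For genuine crashes in a Python process, the standard `faulthandler` module is the more robust choice.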

Citation

Sameer Shende, Allen D. Malony, John C. Linford, Andrew Wissink, and Stephen Adamec. Isolating Runtime Faults with Callstack Debugging using TAU. Proceedings of the 2012 IEEE High Performance Extreme Computing Conference (HPEC’12). Waltham, MA. Sept 10–12, 2012.

Downloads

Paper
Slides