Efficient Parallel Runtime Bounds Checking with the TAU Performance System

Memory errors, such as an invalid memory access, misaligned allocation, or write to deallocated memory, are among the most difficult problems to debug because popular debugging tools do not fully support state inspection when examining failures. This is particularly true for applications written in a combination of Python, C++, C, and Fortran. We present a tool that can help identify and debug memory errors in a multi-language program at the point of failure. Integrated in the TAU Performance System (R), this debugging tool allocates pages of protected memory immediately before and after dynamic memory allocations. Accessing these “guard pages” raises an error signal that causes TAU to capture performance data at the point of failure, store detailed information for each frame in the callstack, and generate a file that may be sent to the developers for analysis. The tool works on parallel programs, providing feedback about every process regardless of whether it experienced the fault. It is useful both to software developers and to users who encounter memory errors, because the output file may be exchanged between the user and the development team without disclosing potentially sensitive application data. This paper describes the tool and demonstrates its application to the multi-language CREATE-AV applications Kestrel and Helios. Since those codes are export-controlled, we present results from an analogous code written specifically for testing, with structure and content derived from Helios and Kestrel. The performance and debugging data from the analogous code closely match the data obtained from the CREATE-AV codes.
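The guard-page mechanism described above can be illustrated with a minimal C sketch. This is not TAU's implementation: the helper names `guarded_alloc`, `guarded_free`, and `write_faults` are hypothetical, and the sketch assumes a POSIX system that provides `mmap`, `mprotect`, and `sigaction`. It places one inaccessible page on each side of an allocation and reports whether a single write raises a fault signal.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_jmp;

/* Signal handler: jump back to the probe instead of terminating. */
static void on_fault(int sig) {
    (void)sig;
    siglongjmp(fault_jmp, 1);
}

/* Number of whole pages needed to hold `size` bytes (at least one). */
static size_t data_pages_for(size_t size, size_t page) {
    size_t n = (size + page - 1) / page;
    return n == 0 ? 1 : n;
}

/* Hypothetical helper: returns `size` usable bytes with a PROT_NONE page
 * before and after. The buffer ends exactly at the trailing guard page, so
 * a one-byte overrun faults immediately; the cost is that the returned
 * pointer is only as aligned as `size` allows, and small underruns land in
 * unused slack before reaching the leading guard page. */
static void *guarded_alloc(size_t size) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t n = data_pages_for(size, page);
    size_t total = (n + 2) * page;
    unsigned char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return NULL;
    if (mprotect(base, page, PROT_NONE) != 0 ||
        mprotect(base + (n + 1) * page, page, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }
    return base + (n + 1) * page - size;
}

/* Releases a block from guarded_alloc; the caller supplies the same size. */
static void guarded_free(void *ptr, size_t size) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t n = data_pages_for(size, page);
    unsigned char *end = (unsigned char *)ptr + size; /* trailing guard start */
    munmap(end - (n + 1) * page, (n + 2) * page);
}

/* Writes one byte to `addr`; returns 1 if the write faulted, 0 otherwise. */
static int write_faults(volatile unsigned char *addr) {
    struct sigaction sa, old_segv, old_bus;
    volatile int faulted = 0;
    int rc;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGSEGV, &sa, &old_segv);
    assert(rc == 0);
    rc = sigaction(SIGBUS, &sa, &old_bus); /* some systems report SIGBUS */
    assert(rc == 0);
    if (sigsetjmp(fault_jmp, 1) == 0) {
        *addr = 0xAB;
    } else {
        faulted = 1;
    }
    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    return faulted;
}
```

Under these assumptions, a caller that obtains `buf = guarded_alloc(64)` can write to `buf[63]` normally, while a write to `buf[64]` touches the trailing guard page and is reported by `write_faults`. A production tool would record diagnostic state in the handler rather than simply jumping back, as the abstract describes for TAU.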


John C. Linford, Sameer Shende, Allen D. Malony, Andrew Wissink. Efficient Parallel Runtime Bounds Checking with the TAU Performance System. Proceedings of the 2013 IEEE High Performance Extreme Computing Conference (HPEC'13). Waltham, MA. Sept. 10–12, 2013.