The TAU Performance System is a powerful and highly versatile ecosystem of profiling and tracing tools for performance analysis of parallel programs at all scales. Developed over almost two decades, TAU has evolved with each new generation of HPC systems and presently scales efficiently to hundreds of thousands of cores on the largest machines in the world. TAU has helped many projects scale up successfully on systems at the Oak Ridge Leadership Computing Facility (OLCF), the National Energy Research Scientific Computing Center (NERSC), the Argonne Leadership Computing Facility (ALCF), and others. In one case, TAU helped reduce the runtime of the IRMHD INCITE code from 528 hours to 70 hours.
This tutorial will focus on performance data collection, analysis, and optimization of Python applications. It will introduce profiling and debugging support in TAU and cover performance evaluation of parallel programs written in pure Python or in Python mixed with Fortran, C++, and/or C. It will also cover parallel performance analysis of applications that use MPI, OpenMP, and other parallel runtime environments via packages like mpi4py. The common case of Python as a high-level “glue” language for high-performance components will be covered extensively. We will demonstrate different techniques for program instrumentation and highlight TAU's support for memory debugging and I/O evaluation. The hands-on portion of the tutorial will guide developers through the instrumentation, measurement, and analysis steps in TAU. Performance data will include MPI timings, runtime bounds-checking results, I/O and memory measurements, and hardware performance counters from PAPI. The tutorial will also demonstrate how TAU's instrumentation and analysis tools may be used with external tools such as Score-P, Scalasca, OTF2, PAPI, and Vampir.