Acumem Freja

Manual

Version 15055

2011-03-11

All rights reserved.


Rogue Wave Software, Inc.
5500 Flatiron Parkway
Boulder, CO 80301


Table of Contents

1. Introduction
1.1. Overview
1.2. Technology
1.3. Limitations
2. Running Acumem Freja
2.1. Installation
2.1.1. Linux System Requirements
2.1.2. Windows System Requirements
2.1.3. Installing Acumem Freja on Linux
2.1.4. Installing the License
2.2. Using the Graphical User Interface
2.2.1. Sampling an Application
2.2.2. Generating a Report from a Sample File
2.2.3. Sampling and Generating a Report
2.2.4. Viewing an Existing Report
2.2.5. Advanced Sampling Settings
2.2.6. Advanced Report Settings
2.3. Using the Command Line Tools
2.3.1. Sampling an Application
2.3.2. Creating a Report
2.3.3. Viewing a Report
2.4. Common Report Settings
2.4.1. Cache Performance
2.4.2. Analysis of software prefetch instructions
2.4.3. Threading Advice - Inter-Cache Communication
2.4.4. Threading Advice - Inter-Socket Communication
2.5. Advanced Use
2.5.1. Burst Sampling
2.5.2. Sampling Start Conditions
2.5.3. Sampling Stop Conditions
2.5.4. Sample Files
3. Introduction to Caches
3.1. Motivation for Caches
3.2. Cache Lines and Cache Size
3.3. Replacement Policies
3.4. Cache Misses
3.5. Data Locality
3.6. Prefetching
3.6.1. Software Prefetching
3.6.2. Hardware Prefetching
3.7. Multithreading and Cache Coherence
3.8. Fetch Ratio
3.9. Upgrade Ratio
3.10. Write-Back Ratio
3.11. Memory Bandwidth
4. Acumem Freja Concepts
4.1. Issues
4.2. Loops
4.3. Instruction Groups
4.4. Last Writer
4.5. Fetch Utilization
4.6. Write-Back Utilization
4.7. Communication Utilization
4.8. Utilization Corrected Fetch Ratio
4.9. Utilization Corrected Write-Back Ratio
4.10. Hardware Prefetch Probability
4.11. Access Randomness
4.12. Call Stack
4.13. Sample Period
5. Memory Performance Problems and Solutions
5.1. Data Layout Problems
5.1.1. Partially Used Structures
5.1.2. Too Large Data Types
5.1.3. Alignment Problems
5.1.4. Dynamic Memory Allocation
5.2. Data Access Pattern Problems
5.2.1. Inefficient Loop Nesting
5.2.2. Random Access Pattern
5.2.3. Unexploited Data Reuse Opportunities
5.3. Non-Temporal Data
5.3.1. Example of Non-Temporal Data Optimization
5.3.2. Singlethreaded Uses of Non-Temporal Hints
5.3.3. Multithreaded Uses of Non-Temporal Hints
5.3.4. Concurrent Uses of Non-Temporal Hints
5.3.5. Types of Non-Temporal Hint Instructions
5.3.6. Using Non-Temporal Hint Instructions
5.4. Multithreading Problems
5.4.1. False Sharing
5.4.2. Poor Communication Utilization
5.5. Common Data Structures
5.5.1. Arrays
5.5.2. Linked Lists
5.5.3. Trees
5.5.4. Hash Tables
5.6. Final Remedies
6. Optimization Workflow
6.1. Initial State: Correct, Measurable Program, Good Test Case
6.2. Avoid Unnecessary Memory Accesses
6.3. Optimize Data Layout
6.4. Optimize Access Patterns
6.5. Utilize Reuse Opportunities
6.6. Use Non-Temporal Hints for Data without Temporal Reuse
6.7. Avoid False Sharing
6.8. Avoid Communication between Caches (Coherence Traffic)
6.9. Hide Remaining Misses
7. Reading the Report
7.1. Statistics
7.1.1. Reading the Statistics
7.1.2. Reading the Diagrams
7.2. The Report Layout
7.3. The Summary Frame
7.3.1. The Summary Tab
7.3.2. The Loops Tab
7.3.3. The Bandwidth Issues Tab
7.3.4. The Latency Issues Tab
7.3.5. The Multi-Threading Issues Tab
7.3.6. The Pollution Issues Tab
7.3.7. The Files Tab
7.3.8. The Execution Tab
7.3.9. The About/Help Tab
7.4. The Issue Frame
7.4.1. Statistics
7.4.2. Instructions
7.4.3. Loop Details
7.4.4. Issue Details
7.5. The Source Code Frame
8. Issue Reference
8.1. Utilization Issues
8.1.1. Fetch Utilization
8.1.2. Write-Back Utilization
8.1.3. Communication Utilization
8.2. Inefficient Loop Nesting
8.3. Random Access Pattern
8.4. Loop Fusion
8.5. Blocking
8.6. Software Prefetch Issues
8.6.1. Prefetch Unnecessary
8.6.2. Prefetch too Distant
8.6.3. Prefetch too Close
8.7. Fetch Hot-Spot
8.8. Write-back Hot-Spot
8.9. Non-Temporal Store Possible
8.10. Non-Temporal Data
8.11. False Sharing
8.12. Communication Hot-Spot
9. Credits
9.1. libelf
9.2. libdwarf
9.3. libgd-2.0.34
9.4. OpenSSL
9.5. klibc
A. Sampling MPI Applications
A.1. Introduction
A.2. Sampling of MPI Applications
B. Cross-Architecture Analysis
B.1. Introduction
B.2. Supported Non-x86 Processors
B.3. Considerations for Accurate Cross-Architecture Analysis
B.4. Sampling the Required Cache Line Size
B.5. x86-centric Issues
B.5.1. Non-Temporal Data
B.5.2. Non-Temporal Store Possible
B.6. Considerations for Specific Processors
C. Acumem License Server
C.1. Overview
C.2. Ordering Licenses
C.3. Installing the License Server
C.4. Installing Licenses
C.5. Maintenance
C.5.1. Stopping the Server
C.5.2. Starting the Server
C.5.3. Listing Current Leases
I. Command Reference
internal — GUI for Acumem Freja sampling and report generation
sample — sample the memory access pattern of a process and generate a sample file
report — generate a report from a sample file
view — start a report viewer

List of Figures

2.1. Overview of the GUI
2.2. Processor Model Selector
2.3. Advanced Sampling Settings
2.4. Advanced Report Settings
3.1. Example System
3.2. Cache Coherence, Example 1
3.3. Cache Coherence, Example 2
3.4. Cache Coherence, Example 3
3.5. Cache Coherence, Example 4
3.6. Cache Coherence, Example 5
3.7. Cache Coherence, Example 6
3.8. Cache Coherence, Example 7
5.1. Data Layout Example
5.2. Good Utilization
5.3. Poor Utilization
5.4. Unused Fields
5.5. No Unused Fields
5.6. Poor Internal Alignment
5.7. Good Internal Alignment
5.8. External Alignment
5.9. Dynamic Memory Allocation
5.10. Inefficient Loop Nesting
5.11. Efficient Loop Nesting
5.12. False Sharing Example, Step 1
5.13. False Sharing Example, Step 2
5.14. False Sharing Example, Step 3
5.15. False Sharing Example, Step 4
5.16. False Sharing Example, Step 5
5.17. False Sharing Example, Step 6
5.18. False Sharing Example, Fixed
5.19. Matrix Accesses with False Sharing
5.20. Matrix Accesses without False Sharing
7.1. Issue Statistics Section
7.2. Summary Statistics
7.3. Issue Statistics
7.4. Loop Statistics
7.5. Instruction Group Statistics
7.6. Fetch/Miss Ratio Diagram
7.7. Write-Back Ratio Diagram
7.8. Utilization Diagram
7.9. Report Outline
7.10. The Summary Tab
7.11. The Loops Tab
7.12. The Bandwidth Issues Tab
7.13. The Latency Issues Tab
7.14. The Multi-Threading Issues Tab
7.15. The Pollution Issues Tab
7.16. The Files Tab
7.17. The Execution Tab
7.18. The About/Help Tab
7.19. Issue Statistic Sections
7.20. Instructions with Collapsed Call Stack
7.21. Instructions with Expanded Call Stack
7.22. Loop
7.23. Source Code with Collapsed Lines
7.24. Source Code with Expanded Lines
8.1. Fetch Utilization Issue
8.2. Write-Back Utilization Issue
8.3. Communication Utilization Issue
8.4. Inefficient Loop Nesting Issue
8.5. Random Access Pattern Issue
8.6. Loop Fusion Issue
8.7. Blocking Issue
8.8. Prefetch Unnecessary Issue
8.9. Prefetch too Distant Issue
8.10. Prefetch too Close Issue
8.11. Fetch Hot-Spot Issue
8.12. Write-back Hot-Spot Issue
8.13. Non-Temporal Store Possible Issue
8.14. Non-Temporal Data Issue
8.15. False Sharing Issue
8.16. Communication Hot-Spot Issue
A.1. MPI Sampling Overview

List of Examples

1. Starting an application in the sampler
2. Attaching to a running process
3. Waiting for a process and sampling for a fixed time
4. Burst sampling a long running application
5. Analyzing sample files using autodetected CPU models
6. Specifying a CPU model
7. Using custom thread to cache mappings