Poly3D Case Study: The Impact of Cache Misses on Performance of CPU-Intensive Applications

It is often assumed that the performance of CPU-bound applications (e.g., computational science and engineering simulations) scales more or less linearly with the CPU clock rate. Users are often surprised when new hardware yields a smaller performance improvement than expected. This study profiles a thermal dynamics application and shows how factors other than CPU speed influence overall application performance.
Published on: Mar 4, 2016
Published in: Software      
Source: www.slideshare.net


Transcripts - Poly3D Case Study: The Impact of Cache Misses on Performance of CPU-Intensive Applications

  • 1. Poly3D Case Study: The Impact of Processor Cache Misses on Performance of CPU-Intensive Applications
      Zoran Kulina, Staff Software Engineer (C/C++ Support)
      IBM Software Group, Compilation Technologies
      IBM Toronto Lab | September 19, 2003
      © 2002 IBM Corporation
  • 2. Table of contents (Performance Review of Poly3D on p630, © 2003 IBM Corporation)
      - Background
      - Poly3D Profiling Results
      - Benchmarks
      - Summary
  • 3. Background
      It is often assumed that the performance of CPU-bound applications (e.g., computational science and engineering simulations) scales more or less linearly with the CPU clock rate. Users are often surprised when new hardware yields a smaller performance improvement than expected. This study profiles Poly3D (a thermal dynamics simulator) and shows how factors other than CPU speed influence overall application performance.
      The study is the outcome of my recent service engagement. I was tasked with benchmarking application performance to determine why a computer with more than twice the clock rate failed to deliver the expected performance gain.
  • 4. Background: The Story
      - The client purchased a new p630 server for their CPU-intensive simulation application.
      - Roughly 2x the clock rate (1000 vs. 450 MHz)
      - 4x the memory (8 vs. 2 GB)
      - They expected at least a two-fold increase in application speed.
      - They got an improvement of only about 40-50%.
  • 5. Background: Machine Specifications

                     44P-170              p630-6C4
      Processor      POWER3-II (1-way)    POWER4 (1-way)
      Clock rate     450 MHz              1000 MHz
      Memory         2 GB                 8 GB
      L1 cache       96 KB                96 KB
      L2 cache       8 MB                 1.44 MB
      L3 cache       none                 32 MB
  • 6. Background: Poly3D Overview
      - Fluid dynamics simulation program
      - Written in C (originally Fortran 77)
      - Single-threaded
      - Uses about 100 MB of memory
      - Performs dot product calculations and other matrix operations
      - Runs for ~150 seconds on 44P-170
      - Runs for ~96 seconds on p630
  • 7. Background: My Method
      - Run a representative workload on both the old and the new server.
      - Ensure nothing else runs concurrently on the system.
      - Collect hardware utilization metrics using the pmcount utility.
      - Summarize and compare the pmcount metrics on the old vs. the new server.
      - Gather the officially published benchmarks (SPEC2000 and LINPACK) for both systems. The challenge here is to find matching server configurations.
      - Determine how our pmcount metrics compare to the official benchmarks.
      - Take corrective action as needed.
  • 8. Background: My Method (cont’d)
      Possible causes:
      - CPU: the clock is more than 2x faster, so any shortfall has to come from caching, memory, or I/O.
      - Disk I/O: Poly3D is CPU-intensive and mainly performs floating-point calculations, so disk I/O is not the likely bottleneck. SAN throughput is nearly identical on both systems anyway.
      - Memory: p630 has 4x as much memory, so memory capacity is not a likely bottleneck.
      - Cache: p630 actually has less L2 cache than 44P-170 (1.44 MB vs. 8 MB). This is something that we want to keep an eye on.
  • 9. Table of contents: Background, Poly3D Profiling Results, Benchmarks, Summary
  • 10. Poly3D Profiling Results: Memory Access Distribution

      Event                 Description                            Hits   Share (chart)
      PM_DATA_FROM_L2       Data loaded from L2 cache       940,204,334   82%
      PM_DATA_FROM_L3 *     Data loaded from L3 cache        63,310,703    6%
      PM_DATA_FROM_L3.5 *   Data loaded from L3.5 cache      55,488,257    5%
      PM_DATA_FROM_MEM      Data loaded from memory          77,330,243    7%
      Total                                               1,136,333,537

      * Total L3 cache access = L3 + L3.5
      Obtained using the pmcount utility on p630-6C4.
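
The totals on this slide can be rechecked with a few lines of arithmetic. The sketch below is illustrative only: it uses the four hit counts from the table, and the variable names are ours, not part of pmcount. The computed shares may differ from the chart labels by a rounding step.

```python
# Load counts per source, copied from the pmcount table for p630-6C4.
loads = {
    "L2 cache": 940_204_334,
    "L3 cache": 63_310_703,
    "L3.5 cache": 55_488_257,
    "Memory": 77_330_243,
}

total = sum(loads.values())

# Fraction of all loads served by each level.
shares = {source: count / total for source, count in loads.items()}

# The slide folds L3 and L3.5 together as "total L3 cache access".
l3_total = loads["L3 cache"] + loads["L3.5 cache"]

for source, count in loads.items():
    print(f"{source:<11} {count:>13,} {shares[source]:6.1%}")
print(f"{'Total':<11} {total:>13,}")
print(f"{'L3 + L3.5':<11} {l3_total:>13,}")
```

The combined L3 figure computed here is the access count that the Observations slide multiplies by the per-access cost.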
  • 11. Poly3D Profiling Results: Processor Time Distribution

      Activity                   Cycles   Seconds   Share (chart)
      L2 cache access    59,930,474,670     59.93   63%
      L3 cache access    11,879,896,000     11.88   13%
      Memory access      23,199,072,900     23.20   24%
      Total              95,009,443,570     95.01

      Obtained using the pmcount utility on p630-6C4.
  • 12. Poly3D Profiling Results: Observations
      - Memory access constitutes a significant proportion of the execution time (24%).
      - Cost of one L3 cache access = ~100 cycles
      - Cost of one memory access = ~300 cycles
      - 118,798,960 L3 accesses x 100 cycles = 11,879,896,000 cycles (11.9 seconds at 1 GHz)
      - 77,330,243 memory accesses x 300 cycles = 23,199,072,900 cycles (23.2 seconds at 1 GHz)
      - A total of 35,078,968,900 cycles, or 35.1 seconds, is spent on L3 cache and memory accesses.
      - This portion of the work takes less time on 44P-170 because of its much larger L2 cache.
      - The time for the remaining work is expected to shrink as the clock speed increases.
      - The target of 70 seconds or less was achieved on a 1 GHz p690, which, thanks to its larger L3 cache, accessed memory about eight times less often than p630 (9 vs. 77 million accesses).
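
The cost model behind these observations is a simple product of access counts and per-access latencies. A minimal sketch, assuming the approximate costs quoted on the slide (100 cycles per L3 access, 300 cycles per memory access) and a 1 GHz clock; the constant and function names are illustrative:

```python
CLOCK_HZ = 1_000_000_000   # p630 clock rate, 1 GHz
L3_COST_CYCLES = 100       # approximate cost of one L3 access (from the slide)
MEM_COST_CYCLES = 300      # approximate cost of one memory access (from the slide)

l3_accesses = 63_310_703 + 55_488_257   # L3 + L3.5 loads
mem_accesses = 77_330_243               # loads served from main memory

l3_cycles = l3_accesses * L3_COST_CYCLES
mem_cycles = mem_accesses * MEM_COST_CYCLES
stall_cycles = l3_cycles + mem_cycles


def seconds(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to wall-clock seconds at the given clock rate."""
    return cycles / clock_hz


print(f"L3 accesses:     {l3_cycles:,} cycles = {seconds(l3_cycles):.2f} s")
print(f"Memory accesses: {mem_cycles:,} cycles = {seconds(mem_cycles):.2f} s")
print(f"Combined:        {stall_cycles:,} cycles = {seconds(stall_cycles):.2f} s")
```

Because the per-access costs are round estimates, the resulting seconds should be read as an order-of-magnitude guide rather than a measurement.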
  • 13. Table of contents: Background, Poly3D Profiling Results, Benchmarks, Summary
  • 14. Benchmarks: SPEC CPU2000 and LINPACK Results

                          SPEC CPU2000                       LINPACK (Mflop/s)
                          int    int_base   fp     fp_base   DP     TPP     HPC
      44P-170             346    333        434    426       503    1,440   ---
      p630-6C4            639    624        886    843       842    2,172   ---
      Improvement ratio   1.85   1.87       2.04   1.98      1.67   1.51    ---

      Source: IBM eServer pSeries and IBM RS/6000 Performance Report
      - A greater improvement ratio is shown for CPU-intensive benchmarks, i.e. SPEC CPU2000.
      - A lower improvement ratio is shown for memory-intensive benchmarks, i.e. LINPACK.
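
Each improvement ratio in the table is the p630-6C4 score divided by the 44P-170 score. The following sketch recomputes them from the published scores; the dictionary layout and key names are ours, chosen for illustration:

```python
# Published scores as (44P-170, p630-6C4); higher is better.
scores = {
    "SPEC int": (346, 639),
    "SPEC int_base": (333, 624),
    "SPEC fp": (434, 886),
    "SPEC fp_base": (426, 843),
    "LINPACK DP": (503, 842),
    "LINPACK TPP": (1440, 2172),
}

# Improvement ratio = new score / old score, rounded as on the slide.
ratios = {name: round(new / old, 2) for name, (old, new) in scores.items()}

for name, ratio in ratios.items():
    print(f"{name:<14} {ratio:.2f}")
```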
  • 15. Benchmarks: LINPACK Overview
      - LINear equations software PACKage
      - Developed by Dr. Jack Dongarra, University of Tennessee
      - Consists of algorithms that solve a dense system of linear equations using Gaussian elimination
      - Uses a matrix of order 100 for the DP benchmark and a matrix of order 1000 for the TPP benchmark
      - Used by TOP500 Supercomputer Sites (www.top500.org)
      - Used to test overall performance rather than just CPU clock rate
      - Memory reference and CPU usage patterns are similar to those of Poly3D
      - Problems being solved are similar to those of Poly3D
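
For readers unfamiliar with the kernel being benchmarked, here is a minimal sketch of solving a dense linear system by Gaussian elimination with partial pivoting. It is a plain-Python illustration of the technique, not the LINPACK code itself, and the function name `solve` and the 3x3 example system are ours.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]

    for col in range(n):
        # Partial pivoting: use the row with the largest magnitude in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]

        # Eliminate the entries below the pivot.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]

    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


a = [[2.0, 1.0, -1.0],
     [-3.0, -1.0, 2.0],
     [-2.0, 1.0, 2.0]]
b = [8.0, -11.0, -3.0]
x = solve(a, b)
print(x)
```

The inner elimination loop sweeps across whole matrix rows for every pivot, which is why this kind of workload exercises the cache and memory system as much as the floating-point units once the matrix no longer fits in cache.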
  • 16. Benchmarks: LINPACK (cont’d)
      - Theoretical peak performance is determined by counting the number of floating-point operations (flops) that can be completed in one second.
      - Theoretical peak performance does not take into account factors such as data movement between different levels of memory, cache misses, pipeline start-ups, memory load, bus speed, and others.
      - Actual performance reflects those factors, and it also depends on application code efficiency, compiler optimization, operating system, hardware characteristics, etc.

                          DP (Mflop/s)   TPP (Mflop/s)   Theoretical peak (Mflop/s)   Poly3D (seconds)
      44P-170             503            1,440           1,800                        150
      p630-6C4            842            2,172           4,000                        96
      Improvement ratio   1.67           1.51            2.22                         1.56

      Source: Performance of Various Computers Using Standard Linear Equations Software, Dr. Jack Dongarra
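
One way to read this table is to express each measured LINPACK result as a fraction of the theoretical peak and to set the Poly3D speedup next to the peak ratio. The sketch below does that with the figures from the table; the dictionary keys and the helper `fraction_of_peak` are illustrative names, not part of any benchmark tooling.

```python
# Figures from the table: Mflop/s for dp, tpp and peak; seconds for Poly3D.
systems = {
    "44P-170": {"dp": 503, "tpp": 1440, "peak": 1800, "poly3d_seconds": 150},
    "p630-6C4": {"dp": 842, "tpp": 2172, "peak": 4000, "poly3d_seconds": 96},
}


def fraction_of_peak(system, benchmark):
    """Measured Mflop/s divided by the theoretical peak for one system."""
    entry = systems[system]
    return entry[benchmark] / entry["peak"]


old, new = systems["44P-170"], systems["p630-6C4"]
peak_ratio = new["peak"] / old["peak"]
# Poly3D is measured in seconds, so the speedup is old time / new time.
poly3d_speedup = old["poly3d_seconds"] / new["poly3d_seconds"]

for name in systems:
    dp = fraction_of_peak(name, "dp")
    tpp = fraction_of_peak(name, "tpp")
    print(f"{name}: DP at {dp:.0%} of peak, TPP at {tpp:.0%} of peak")
print(f"Peak ratio {peak_ratio:.2f}, Poly3D speedup {poly3d_speedup:.2f}")
```

Under these figures the newer system reaches a smaller fraction of its peak on both LINPACK runs, which fits the slide's point that peak numbers ignore data movement and cache misses.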
  • 17. Table of contents: Background, Poly3D Profiling Results, Benchmarks, Summary
  • 18. Summary
      - Poly3D's memory reference pattern causes a high cache miss rate and extensive data movement between main memory and the CPU.
      - A smaller L2 cache and a high L3 cache miss rate make Poly3D go to main memory more often on p630 than on 44P-170.
      - A significant portion of the execution is limited by the speed of main memory.
      - The total amount of memory used by Poly3D is greater than the system cache.
      - The Poly3D improvement ratio is consistent with LINPACK.
      - The difference between actual and peak performance for the p630 LINPACK benchmark is consistent with other systems.
      - A single benchmark should not be used to judge the overall performance of a system. Rather, a set of specialized benchmarks can measure overall performance more accurately.
  • 19. Sources
      - IBM eServer pSeries and IBM RS/6000 Performance Report (June 2003)
        http://www.ibm.com/servers/eserver/pseries/hardware/system_perf.pdf
      - Performance of Various Computers Using Standard Linear Equations Software, Jack Dongarra
        http://www.netlib.org/benchmark/performance.ps
      - Frequently Asked Questions on the Linpack Benchmark
        http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
      - The LINPACK Benchmark: Past, Present, and Future, Dongarra, Luszczek, and Petitet
        http://www.netlib.org/utk/people/JackDongarra/PAPERS/hpl.pdf
