Improving throughput of simultaneous multithreaded (SMT) processors using shareable resource signatures and hardware thread priorities
In this dissertation we present a methodology for predicting the best priority pair for a given co-schedule of two application threads. Our approach exploits resource-utilization information that is collected during an application thread's execution in single-threaded mode. This information provides insight into how much of each resource shared by threads executing concurrently in simultaneous multithreading (SMT) mode remains available for use by a co-scheduled application thread. The main contributions of this dissertation are: (1) Demonstration of the efficacy of using non-default hardware thread priority pairs to improve SMT core throughput: using a POWER5 simulator, we show that equal (default) priorities are not the best for 82% of the 263 application trace-pairs studied. (2) The concept of a "Shareable Resource Signature": this signature characterizes an application's utilization of critical shareable SMT core resources during a specified execution time interval when executed in single-threaded mode. (3) A best priority pair prediction methodology: given the shareable resource signatures of an application-thread pair, we present a methodology to predict the best priority pair for that pair when it is co-scheduled to run in SMT mode. (4) An implementation and validation of the methodology for the IBM POWER5 processor, which shows the following: (a) 17 of 10,000 possible signatures are sufficient to characterize 95.6% of the execution times of a set of applications that consists of 20 SPEC CPU2006 benchmarks (1 data input), three NAS NPB benchmarks (3 data inputs), and 10 PETSc KSP solvers (12 data inputs). The cgs and lsqr PETSc KSP solvers have signatures that are independent of input data, while one of the three NAS NPB benchmarks (bt-mz) has a signature that is independent of the input data.
(b) For 21 co-schedules of applications, each with a signature that characterizes 95% of its execution time, our validation study shows the following: (i) Predicted best priorities yield equal or higher throughput than default priorities for all but one of the 21 co-schedules. Initial results showed that two co-schedules, (462.libquantum, 437.leslie3d) and (bt-mz.A, lu-mz.A), experienced a throughput loss of 7.46% and 20.05%, respectively, at predicted priorities, as compared to that achieved at default priorities. Further investigation shows that mapping the co-schedule (bt-mz.A, lu-mz.A) to hardware threads (5, 4), instead of (4, 5), and executing it with the predicted best priorities results in a 3.56% higher throughput as compared to default priorities, in contrast to the 20.05% throughput loss experienced when it is executed on hardware threads (4, 5). Although we have not verified it, one possible reason for this is that the processor core favors one hardware thread over the other. Re-executing the co-schedule (462.libquantum, 437.leslie3d) on hardware threads (5, 4), instead of (4, 5), again results in predicted priorities yielding lower throughput than the default priorities. Thus, we claim that predicted best priorities yield equal or higher throughput than default priorities for 20 of the 21 co-schedules studied, and for the outlier the throughput loss is 7.46%. (ii) Using non-default priorities improves throughput. The default priority pair yields the best throughput for only six of the 21 co-schedules. For the remaining 15, the default priority pair yields throughput that is between 0.74% and 14.10% lower than that achieved with the best priority pair. (iii) Using the predicted best priority pair, rather than default priorities, improves throughput or at least provides throughput equal to that achieved with default priorities. For 11 of the 21 co-schedules, the default and predicted priorities yield equal throughput.
For nine of the 21, predicted priorities yield throughput that is between 0.59% and 16.42% higher than that achieved with default priorities. For two of these nine co-schedules, the predicted priority pair yields a throughput improvement of less than 5%; for three, the improvement is between 5% and 10%; and for the other four, the improvement is greater than 10%. (iv) Using predicted best priority pairs appears to be most applicable to floating-point "intensive" applications: for eight co-schedules comprising applications for which the utilization of the floating-point unit exceeds that of the fixed-point unit by 10% or more, the predicted priority pairs, as compared to the default priorities, yield a throughput improvement between 3.56% and 16.42%. This result indicates that the methodology for predicting best priority pairs is most applicable to applications for which floating-point unit utilization exceeds that of the fixed-point unit by at least 10%. (Abstract shortened by UMI.)
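The throughput comparisons reported above can be illustrated with a minimal sketch. It is not the dissertation's prediction methodology: it only shows how a best priority pair might be picked from measured throughputs and how the percentage figures are computed. The throughput values, the function names, and the use of (4, 4) as the default priority pair are assumptions made for illustration.

```python
# Hypothetical sketch: choose the best priority pair from measured
# throughputs and express the difference from the default pair in percent.
# All numbers below are invented; they are not results from the dissertation.

DEFAULT_PAIR = (4, 4)  # assumed default (equal) priority pair


def best_priority_pair(throughput):
    """Return the priority pair with the highest measured throughput."""
    return max(throughput, key=throughput.get)


def percent_gain(new, base):
    """Throughput improvement of `new` over `base`, in percent of `base`."""
    return 100.0 * (new - base) / base


def percent_loss(value, best):
    """How far `value` falls below `best`, in percent of `best`."""
    return 100.0 * (best - value) / best


# Made-up normalized throughputs for one co-schedule of two threads.
measured = {
    (4, 4): 1.00,
    (5, 4): 1.05,
    (4, 5): 0.92,
    (6, 4): 1.10,
}

best = best_priority_pair(measured)
gain_over_default = percent_gain(measured[best], measured[DEFAULT_PAIR])
default_loss = percent_loss(measured[DEFAULT_PAIR], measured[best])

print(f"best pair: {best}")
print(f"gain over default: {gain_over_default:.2f}%")
print(f"default is {default_loss:.2f}% below best")
```

Note that the two percentages use different baselines: an improvement "over default priorities" is relative to the default throughput, whereas a default pair that is "lower than the best priority pair" is measured relative to the best throughput.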
Meswani, Mitesh R., "Improving throughput of simultaneous multithreaded (SMT) processors using shareable resource signatures and hardware thread priorities" (2009). ETD Collection for University of Texas, El Paso. AAI3390618.