◇◇新语丝(www.xys.org)(xys.dxiong.com)(xys.dropin.org)(xys-reader.org)◇◇

The Latest Research Output of the School of Computer Science, National University of Defense Technology: Two Patchwork Masterpieces

Author: 卿本佳人

A "patchwork paper" is an article stitched together from bits and pieces copied from here, there, and everywhere.

DENG Qingying, ZHANG Minxuan (Professor 张民选), and JIANG Jiang (Associate Professor 蒋江), of the School of Computer Science and the National Key Laboratory for Parallel and Distributed Processing at the National University of Defense Technology in Changsha, Hunan, recently had two masterworks appear at the International Symposium on Parallel and Distributed Processing and Applications (ISPA 2007) and the 2nd International Workshop on Parallel and Distributed Multimedia Computing (ParDMCom 2007), both published in Lecture Notes in Computer Science:

1. A Parallel Infrastructure on Dynamic EPIC SMT and Its Speculation Optimization, LNCS Vol. 4742, 2007, pp. 235-244 (http://www.springerlink.com/content/f1344v5831754223/?p=6ecffa251a2e45debb32ef1338980002&pi=0) ("Deng paper 1")

2. Register File Management and Compiler Optimization on EDSMT, LNCS Vol. 4743, 2007, pp. 394-403 (http://www.springerlink.com/content/g186714062138906/?p=6ecffa251a2e45debb32ef1338980002&pi=1) ("Deng paper 2")

Each of these two masterpieces copies from at least four or five works by other people, and the two papers also partly duplicate each other. The works plagiarized by Deng et al. include the following:

1. "The Future of Microprocessors" (the "Olukotun paper") by Olukotun and Hammond, ACM Queue, vol. 3, no. 7, September 2005 (http://www.acmqueue.com/modules.php?name=Content&pa=showpage&pid=326)

2. "Intel Itanium Architecture Software Developer's Manual, Volume 1: Application Architecture, Revision 2.2, January 2006" (the "Intel manual") by Intel (http://download.intel.com/design/Itanium/manuals/24531705.pdf)

3. "Optimizations for the Intel Itanium Architecture Register Stack" (the "Settle paper") by Settle, Connors, Hoflehner, and Lavery, Proceedings of the 1st Conference on Code Generation and Optimization, March 2003 (http://rogue.colorado.edu/draco/papers/cgo-03-register.pdf)

4. "Tuning Compiler Optimizations for Simultaneous Multithreading" ("Lo paper 1") by Lo, Eggers, Levy, Parekh, and Tullsen, Proceedings of the 30th Annual International Symposium on Microarchitecture, December 1997, pp. 114-124 (http://www.cs.washington.edu/research/smt/papers/smtcompiler.pdf) (listed in Deng paper 1 as Ref. 7)

5. "Software-Directed Register Deallocation for Simultaneous Multithreaded Processors" ("Lo paper 2") by Lo, Parekh, Eggers, Levy, and Tullsen, University of Washington Technical Report #UW-CSE-97-12-01 (http://www.cs.washington.edu/research/smt/papers/register.TR.ps); a revised version appeared in IEEE Transactions on Parallel and Distributed Systems, Volume 10, Issue 9, September 1999

6. "OpenUH: An Optimizing, Portable OpenMP Compiler" by Liao, Hernandez, Chapman, Chen, and Zheng (www2.cs.uh.edu/~copper/openuh.pdf) (listed in Deng paper 1 as Ref. 9)

7. "OpenUH Compiler Suite User's Guide" (http://www2.cs.uh.edu/~openuh/OpenUHUserGuide.pdf)

Instances of plagiarism are everywhere one looks. A few examples follow:

1. Deng paper 1, Section 1:

The combination of limited instruction parallelism suitable for superscalar issue, practical limits to pipelining, and a "power ceiling" limited by practical cooling limitations has limited future speed increases within conventional processor cores to the basic Moore's law improvement rate of the underlying transistors.

Processor designers must find new ways to effectively utilize the increasing transistor budgets in high-end silicon chips to improve performance in ways to minimize both additional power usage and design complexity. And it is also useful to examine the problem from the point of view of different performance requirements.

Copied from the Olukotun paper:

The combination of limited instruction parallelism suitable for superscalar issue, practical limits to pipelining, and a power ceiling limited by practical cooling limitations has limited future speed increases within conventional processor cores to the basic Moore's law improvement rate of the underlying transistors.

...

Processor designers must find new ways to effectively utilize the increasing transistor budgets in high-end silicon chips to improve performance in ways to minimize both additional power usage and design complexity. ... so it is useful to examine the problem from the point of view of these different performance requirements.
2. Deng paper 1, Section 2.2:

Explicitly Parallel Instruction Computing (EPIC) architectures developed by HP and Intel allow the compiler to express program instruction level parallelism directly to the hardware to deal with increasing memory latencies and penalties. Specifically, the Itanium architecture deploys a number of EPIC techniques which enable the compiler to represent control speculation, data dependence speculation, and predication to enhance performance. These techniques have individually been shown to be very effective in dealing with memory penalties. In addition to these techniques, the Itanium architecture provides a virtual register stack to reduce the penalty of memory accesses associated with procedure calls and to leverage the performance advantages of a large register file.

Copied from the Settle paper, Section 1:

Explicitly Parallel Instruction Computing (EPIC) architectures allow the compiler to express program instruction level parallelism directly to the hardware to deal with increasing memory latencies and penalties. Specifically, the Itanium architecture deploys a number of EPIC techniques which enable the compiler to represent control speculation, data dependence speculation, and predication [3] to enhance performance. These techniques have individually been shown to be very effective [2] in dealing with memory penalties. In addition to these techniques, the Itanium architecture provides a virtual register stack to reduce the penalty of memory accesses associated with procedure calls and to leverage the performance advantages of a large register file.
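For readers who want to see concretely what the passage above (in both versions) is describing, here is a minimal C sketch of Itanium-style control speculation. The flag nat and the recovery branch only model the NaT-bit/chk.s contract; the names speculative_load and consumer are illustrative stand-ins invented for this sketch, not IA-64 intrinsics or anyone's actual code.

    #include <stdbool.h>
    #include <stddef.h>

    /* Non-speculative form: the load is control-dependent on the test.
     *     if (p != NULL) x = *p;
     * Control speculation hoists the load above the branch. On Itanium the
     * hoisted load would be an ld.s whose faults are deferred into a NaT
     * bit, checked later by chk.s; the code below only models that contract. */
    static int speculative_load(const int *p, bool *nat)
    {
        if (p == NULL) {       /* real hardware defers the fault instead   */
            *nat = true;       /* result marked invalid (models the NaT bit) */
            return 0;
        }
        *nat = false;
        return *p;             /* executed before we know it is needed     */
    }

    int consumer(const int *p, int fallback)
    {
        bool nat;
        int x = speculative_load(p, &nat);  /* hoisted above the branch    */
        /* ...independent work can overlap the load latency here...        */
        if (p != NULL) {                    /* the original branch         */
            if (nat)                        /* models chk.s: in this toy the
                                               check never fires, but this is
                                               where recovery would sit    */
                x = *p;                     /* recovery: redo the load     */
            return x;
        }
        return fallback;       /* wrong path: the speculative load wasted  */
    }

The point of the mechanism is exactly what the Settle paper says: the compiler, not the hardware, decides to start the load early, and the architecture supplies a way to defer and later check any fault that early execution might raise.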
3. Deng paper 1, Section 4:

Today's optimizing compilers rely on aggressive code scheduling to hide instruction latencies. In global scheduling techniques, such as trace scheduling or hyperblock scheduling, instructions from a predicted branch path may be moved above a conditional branch, so that their execution becomes speculative. If at runtime, the other branch path is taken, then the speculative instructions are useless and potentially waste processor resources.

On in-order superscalars or VLIW machines, software speculation is necessary, because the hardware provides no scheduling assistance. On an EDSMT processor, multithreading is also used to hide latencies. (As the number of SMT threads is increased, instruction throughput also increases.) Therefore, the latency-hiding benefits of software speculative execution may be needed less, or even be unnecessary, and the additional instruction overhead introduced by incorrect speculations may degrade performance.

Our experiments were designed to evaluate the appropriateness of software speculative execution ... The results highlight two factors that determine its effectiveness for EDSMT: static branch prediction accuracy and instruction throughput.

Correctly-speculated instructions have no instruction overhead; incorrectly-speculated instructions, however, add to the dynamic instruction count. Therefore, speculative execution is more beneficial for applications that have high speculation accuracy, e.g., loop-based programs with either profile-driven or state-of-the-art static branch prediction.

Fig. 6 compares the dynamic instruction counts between (profile-driven) speculative and non-speculative versions of our applications. Small increases in the dynamic instruction count indicate that the compiler has been able to accurately predict which paths will be executed. Consequently, speculation may incur no penalties. Higher increases in dynamic instruction count, on the other hand, mean wrong-path speculations, and a probable loss in SMT performance.

While instruction overhead influences the effectiveness of speculation, it is not the only factor. The level of instruction throughput in programs without speculation is also important, because it determines how easily speculative overhead can be absorbed. With sufficient instruction issue bandwidth (low IPC), incorrect speculations may cause no harm; with higher per-thread ILP or more threads, software speculation should be less profitable, because incorrectly-speculated instructions are more likely to compete with useful instructions for processor resources (in particular, fetch bandwidth and functional unit issue). Figure 7 contains the instruction throughput for each of the applications. For some programs IPC is higher with software speculation, indicating some degree of absorption of the speculation overhead. In others, it is lower, because of additional hardware resource conflicts, most notably L1 cache misses.

Speculative instruction overhead (related to static branch prediction accuracy) and instruction throughput together explain the speedups illustrated in Fig. 8. When both factors were high (bizp2), speedups without software speculation were greatest. If one factor was low or only moderate, speedups were minimal or nonexistent (perlbmk had only speculation overhead). Without either factor (vortex), software speculation helped performance, and for the same reasons it benefits other architectures -- it hid latencies and executed the speculative instructions in otherwise idle functional units. For these applications (and a few others as well), as more threads are used, the advantage of turning off speculation generally becomes even larger. Additional threads provide more parallelism, and therefore, speculative instructions are more likely to compete with useful instructions for processor resources.

The bottom line is that, while loop-based applications should be compiled with software speculative execution, non-loop applications should be compiled without it. Doing so either improves EDSMT program performance or maintains its current level -- performance is never hurt.

Copied from Lo paper 1, Section 6:

Today's optimizing compilers rely on aggressive code scheduling to hide instruction latencies. In global scheduling techniques, such as trace scheduling [22] or hyperblock scheduling [23], instructions from a predicted branch path may be moved above a conditional branch, so that their execution becomes speculative. If at runtime, the other branch path is taken, then the speculative instructions are useless and potentially waste processor resources.

On in-order superscalars or VLIW machines, software speculation is necessary, because the hardware provides no scheduling assistance. On an SMT processor (whose execution core is an out-of-order superscalar), not only are instructions dynamically scheduled and speculatively executed by the hardware, but multithreading is also used to hide latencies. (As the number of SMT threads is increased, instruction throughput also increases.) Therefore, the latency-hiding benefits of software speculative execution may be needed less, or even be unnecessary, and the additional instruction overhead introduced by incorrect speculations may degrade performance.

Our experiments were designed to evaluate the appropriateness of software speculative execution for an SMT processor. The results highlight two factors that determine its effectiveness for SMT: static branch prediction accuracy and instruction throughput.

Correctly-speculated instructions have no instruction overhead; incorrectly-speculated instructions, however, add to the dynamic instruction count. Therefore, speculative execution is more beneficial for applications that have high speculation accuracy, e.g., loop-based programs with either profile-driven or state-of-the-art static branch prediction.

Table 5 compares the dynamic instruction counts between (profile-driven) speculative and non-speculative versions of our applications. Small increases in the dynamic instruction count indicate that the compiler (with the assistance of profiling information) has been able to accurately predict which paths will be executed. Consequently, speculation may incur no penalties. Higher increases in dynamic instruction count, on the other hand, mean wrong-path speculations, and a probable loss in SMT performance.

While instruction overhead influences the effectiveness of speculation, it is not the only factor. The level of instruction throughput in programs without speculation is also important, because it determines how easily speculative overhead can be absorbed. With sufficient instruction issue bandwidth (low IPC), incorrect speculations may cause no harm; with higher per-thread ILP or more threads, software speculation should be less profitable, because incorrectly-speculated instructions are more likely to compete with useful instructions for processor resources (in particular, fetch bandwidth and functional unit issue). Table 6 contains the instruction throughput for each of the applications. For some programs IPC is higher with software speculation, indicating some degree of absorption of the speculation overhead. In others, it is lower, because of additional hardware resource conflicts, most notably L1 cache misses.

Speculative instruction overhead (related to static branch prediction accuracy) and instruction throughput together explain the speedups (or lack thereof) illustrated in Figure 4. When both factors were high (the non-loop-based fft, li, and LU), speedups without software speculation were greatest, ranging up to 22%. If one factor was low or only moderate, speedups were minimal or nonexistent (the SPECfp95 applications, radix and water-nsquared had only high IPC; go, m88ksim and perl had only speculation overhead). Without either factor, software speculation helped performance, and for the same reasons it benefits other architectures -- it hid latencies and executed the speculative instructions in otherwise idle functional units.

The bottom line is that, while loop-based applications should be compiled with software speculative execution, non-loop applications should be compiled without it. Doing so either improves SMT program performance or maintains its current level -- performance is never hurt.
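The trade-off both versions of this passage describe can be made concrete with a small hand-schedule in C. This is only a source-level suggestion of what a trace or hyperblock scheduler does at the instruction level (a real compiler could not legally hoist an arbitrary call), and expensive() is a hypothetical placeholder for a long-latency computation, not a function from either paper.

    /* Hypothetical long-latency computation on the predicted path. */
    extern long expensive(long a);

    /* Non-speculative schedule: expensive() runs only when its path
     * is actually taken.                                             */
    long unscheduled(long a, int predicted_taken)
    {
        if (predicted_taken)
            return a + expensive(a);   /* predicted (frequent) path  */
        return a - 1;                  /* other path                 */
    }

    /* Speculative schedule: the scheduler moves expensive() above the
     * branch so its latency overlaps earlier work. If the static
     * prediction is right, the latency is hidden at no cost; if it is
     * wrong, the call becomes exactly the extra dynamic instructions
     * the passage describes, and on an SMT/EDSMT core those wasted
     * instructions compete with other threads for fetch bandwidth and
     * functional units.                                               */
    long scheduled(long a, int predicted_taken)
    {
        long t = expensive(a);         /* speculative: maybe useless  */
        if (predicted_taken)
            return a + t;              /* correct: latency hidden     */
        return a - 1;                  /* wrong path: t is overhead   */
    }

With one thread and idle issue slots, the wasted work in scheduled() is nearly free; with many threads keeping the machine busy, it displaces useful instructions, which is why the quoted text recommends disabling software speculation for non-loop code.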
4. Deng paper 2, Section 4:

MTRM focused on supporting the effective sharing of registers in an EDSMT processor, using register renaming to permit multiple threads to share a single global register file. In this way, one thread with high register pressure can benefit when other threads have low register demands. Unfortunately, existing register renaming techniques cannot fully exploit the potential of a shared register file. In particular, while existing hardware is effective at allocating physical registers, it has only limited ability to identify register deallocation points; therefore hardware must free registers conservatively, possibly wasting registers that could be better utilized.

There are two types of dead registers can be deallocated: (1) registers allocated to idle hardware contexts, and (2) registers in active contexts whose last use has already retired.

To address the second type of dead registers, those in active threads, we investigate two mechanisms that allow the compiler to communicate last-use information to the processor, so that the renaming hardware can deallocate registers more aggressively. Without this information, the hardware must conservatively deallocate registers only after they are redefined.

1) Special Bit communicates last-use information to the hardware via dedicated instruction bits (...), with the dual benefits of immediately identifying last uses and requiring no instruction overhead. It can serve as an upper bound on performance improvements that can be attained with the compiler's static last-use information.

2) Special Instruction is a more realistic implementation of Special Bit. Rather than specifying last uses in the instruction itself, it uses a separate instruction to specify one or two registers to be freed. Our compiler generates a Free Register instruction (an unused opcode in the IA-64 ISA) immediately after any instruction containing a last register use (if the register is not also redefined by the same instruction). Like Special Bit, it frees registers as soon as possible, but with an additional cost in dynamic instruction overhead.

Current renaming hardware provides mechanisms for register deallocation (i.e., returning physical registers to the Idle Physic Register Number Queue) and can perform many deallocations each cycle.

Copied from Lo paper 2, Section 1:

Our techniques focus on supporting the effective sharing of registers in an SMT processor, using register renaming to permit multiple threads to share a single global register file. In this way, one thread with high register pressure can benefit when other threads have low register demands. Unfortunately, existing register renaming techniques cannot fully exploit the potential of a shared register file. In particular, while existing hardware is effective at allocating physical registers, it has only limited ability to identify register deallocation points; therefore hardware must free registers conservatively, possibly wasting registers that could be better utilized.

We propose software support to expedite the deallocation of two types of dead registers: (1) registers allocated to idle hardware contexts, and (2) registers in active contexts whose last use has already retired.

To address the second type of dead registers, those in active threads, we investigate five mechanisms that allow the compiler to communicate last-use information to the processor, so that the renaming hardware can deallocate registers more aggressively. Without this information, the hardware must conservatively deallocate registers only after they are redefined.

Copied from Lo paper 2, Section 4.2:

1. Free Register Bit communicates last-use information to the hardware via dedicated instruction bits, with the dual benefits of immediately identifying last uses and requiring no instruction overhead. ... it can serve as an upper bound on performance improvements that can be attained with the compiler's static last-use information.

2. Free Register is a more realistic implementation of Free Register Bit. Rather than specifying last uses in the instruction itself, it uses a separate instruction to specify one or two registers to be freed. Our compiler generates a Free Register instruction (an unused opcode in the Alpha ISA) immediately after any instruction containing a last register use (if the register is not also redefined by the same instruction). Like Free Register Bit, it frees registers as soon as possible, but with an additional cost in dynamic instruction overhead.

...

Current renaming hardware provides mechanisms for register deallocation (i.e., returning physical registers to the free register list) and can perform many deallocations each cycle.
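As a rough model of the mechanism both passages describe, the sketch below implements a toy rename table in C with an explicit free-register operation. The table sizes and names are invented for illustration, and real renaming hardware performs these steps in parallel each cycle (and frees an overwritten register only when the redefining instruction retires); sequential code can only suggest the bookkeeping.

    #include <assert.h>

    #define ARCH_REGS 8   /* architectural registers (toy size)       */
    #define PHYS_REGS 16  /* shared physical register file (toy size) */

    static int map[ARCH_REGS];        /* architectural -> physical     */
    static int free_list[PHYS_REGS];  /* free register list, i.e. the  */
    static int free_top;              /* papers' idle-register queue   */

    static void rename_init(void)
    {
        for (int a = 0; a < ARCH_REGS; a++)
            map[a] = -1;                       /* nothing defined yet  */
        free_top = 0;
        for (int p = 0; p < PHYS_REGS; p++)
            free_list[free_top++] = p;
    }

    /* Conservative hardware rule: the old physical register backing an
     * architectural register returns to the free list only when that
     * architectural register is REDEFINED.                             */
    static int rename_def(int arch)
    {
        assert(free_top > 0);                  /* none free: stall     */
        int phys = free_list[--free_top];      /* allocate a new reg   */
        if (map[arch] >= 0)
            free_list[free_top++] = map[arch]; /* old reg freed only now */
        map[arch] = phys;
        return phys;
    }

    /* Compiler-assisted rule (the Special Bit / Free Register idea):
     * the compiler marks the LAST USE, so the dead physical register
     * is reclaimed immediately instead of sitting idle until the next
     * redefinition.                                                    */
    static void free_register(int arch)
    {
        if (map[arch] >= 0) {
            free_list[free_top++] = map[arch];
            map[arch] = -1;                    /* dead until redefined */
        }
    }

The window between a value's last use and the next redefinition of its architectural register is exactly the waste the quoted passages describe; free_register() closes that window, at the cost (in the Free Register / Special Instruction variant) of extra dynamic instructions.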
5. Deng paper 2, Section 6:

The two register deallocation schemes are compared in Fig. 6 and Fig. 7, which charts their speedup versus no explicit register deallocation under different thread configurations. The Special Bit bars show that register deallocation can (potentially) improve performance significantly for small register files (80% on average, but ranging as high as 120%). The Special Bit results highlight the most attractive outcome of register deallocation: by improving register utilization, an EDSMT processor with small register files can achieve large register file performance. With multiple register contexts, an EDSMT processor need not double its architectural registers if they are effectively shared. Our results show that an 4-context EDSMT with the effective RSE and compiler assistant can alleviate physical register pressure.

Special Bit is more effective at reducing the number of dead registers, because it deallocates them more promptly, at their last uses. When registers are a severe bottleneck with small register files, Special Instruction has a good result; while instruction overhead will cause a low performance with larger register files and applications with low register usage.

Copied from Lo paper 2, Section 4.2:

The five register deallocation schemes are compared in Figure 6, which charts their speedup versus no explicit register deallocation. The Free Register Bit bars show that register deallocation can (potentially) improve performance significantly for small register files (77% on average, but ranging as high as 195%). The Free Register Bit results highlight the most attractive outcome of register deallocation: by improving register utilization, an SMT processor with small register files can achieve large register file performance, as shown in Figure 7. ... With multiple register contexts, an SMT processor need not double its architectural registers if they are effectively shared.

Our results show that an 8-context SMT with an FSR register file ... needs only 96 additional registers to alleviate physical register pressure ... Free Register is more effective at reducing the number of dead registers, because it deallocates them more promptly, at their last uses. When registers are a severe bottleneck, as in ... with small register files, Free Register outperforms Free Register Mask. Free Register Mask, on the other hand, incurs less instruction overhead; therefore it is preferable with larger register files and applications with low register usage.

The Deng papers state that their research was funded by the 863 Program and the National Natural Science Foundation of China. So the people's hard-earned money is being used to groom these scholars to produce this kind of "research"... Give the money back to the country!

(XYS20071031)

◇◇新语丝(www.xys.org)(xys.dxiong.com)(xys.dropin.org)(xys-reader.org)◇◇