Buddy Threading in Distributed Applications on Simultaneous Multi-Threading Processors

Date

2005-04-19

Abstract

Modern processors provide a multitude of opportunities for instruction-level parallelism that most current applications cannot fully utilize. To increase processor core execution efficiency, modern processors can execute instructions from two or more tasks simultaneously in the functional units, increasing the rate of instructions executed per cycle (IPC). These processors implement simultaneous multi-threading (SMT), which increases processor efficiency through thread-level parallelism, but problems can arise due to cache conflicts and CPU resource starvation. Consider high-end applications typically running on clusters of commodity computers. Each compute node is sending, receiving, and calculating data for some application. Non-SMT processors must compute data, context switch, communicate that data, context switch, compute more data, and so on. The computation phases often utilize floating-point functional units, while communication primarily uses integer functional units. Until recently, communication libraries were not able to take complete advantage of this parallelism due to the lack of SMT hardware. This thesis explores the feasibility of exploiting this natural compute/communicate parallelism in distributed applications, especially for applications that are not optimized for the constraints imposed by SMT hardware. This research explores hardware and software thread synchronization primitives to reduce inter-thread communication latency and operating system context switch time in order to maximize a program's ability to compute and communicate simultaneously. In particular, this work investigates reducing inter-thread communication latency through hardware synchronization primitives that allow threads to 'instantly' notify each other of changes in program state. We also describe a thread-promoting buddy scheduler that ensures threads are always co-scheduled, providing an application exclusive use of all processor resources and reducing context switch overhead, inter-thread communication latency, and scheduling overhead. Finally, we describe the design and implementation of a modified MPI over Channel (MPICH) library that allows legacy applications to take advantage of SMT processor parallelism. We conclude with an evaluation of these techniques using several ASCI benchmarks. Overall, we show that compute-communicate application performance can be further improved by taking advantage of the native parallelism provided by SMT processors. To fully exploit this advantage, these applications must be written to overlap communication with computation as much as possible.
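
The hardware notification mechanism described above (see the MONITOR and MWAIT keywords below) can be illustrated with a short sketch. The C fragment that follows is not taken from the thesis; it is a minimal, assumed example of the standard MONITOR/MWAIT wait/notify pattern on an SMT Pentium 4, where the waiting thread sleeps inside the core until its sibling writes a monitored cache line. On the Pentium 4 these instructions are privileged, which is why an approach such as Kernel Mode Linux is needed to issue them from application-level code; the flag layout and loop structure here are illustrative assumptions.

    /* Minimal sketch (not the thesis implementation) of MONITOR/MWAIT-based
     * thread notification between two SMT sibling threads. */
    #include <stdint.h>

    /* Shared flag, padded to its own cache line so unrelated stores do not
     * trigger spurious wake-ups. */
    static volatile uint32_t wake_flag __attribute__((aligned(64)));

    static inline void cpu_monitor(const volatile void *addr)
    {
        /* MONITOR: EAX = linear address, ECX = extensions, EDX = hints. */
        asm volatile("monitor" : : "a"(addr), "c"(0), "d"(0));
    }

    static inline void cpu_mwait(void)
    {
        /* MWAIT: EAX = hints, ECX = extensions. */
        asm volatile("mwait" : : "a"(0), "c"(0));
    }

    /* Waiter: block cheaply inside the core until the flag is raised. */
    void wait_for_notification(void)
    {
        while (!wake_flag) {
            cpu_monitor(&wake_flag);   /* arm the address monitor           */
            if (!wake_flag)            /* re-check to close the race window */
                cpu_mwait();           /* sleep until the monitored line is written */
        }
        wake_flag = 0;                 /* consume the notification */
    }

    /* Notifier: a plain store to the monitored line wakes the sibling thread. */
    void notify_sibling(void)
    {
        wake_flag = 1;
    }

The concluding point, that applications must overlap communication with computation to benefit from SMT, corresponds to the usual non-blocking MPI pattern sketched below. This is an illustrative fragment, not the modified MPICH library described in the abstract; the ring exchange, buffer names, and sizes are assumptions.

    /* Illustrative compute/communicate overlap with non-blocking MPI:
     * post the transfers, do the floating-point work that does not depend
     * on them, then wait. */
    #include <mpi.h>

    #define N 1024

    void exchange_and_compute(double *send_buf, double *recv_buf,
                              double *interior, int rank, int size)
    {
        int next = (rank + 1) % size;
        int prev = (rank - 1 + size) % size;
        MPI_Request reqs[2];

        /* Post communication first; the SMT sibling thread can make
         * progress on it while this thread computes. */
        MPI_Irecv(recv_buf, N, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(send_buf, N, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Floating-point work that is independent of the incoming data. */
        for (int i = 0; i < N; i++)
            interior[i] = interior[i] * 0.5 + 1.0;

        /* Block only after the overlappable work is finished. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        /* Work that depends on the received data. */
        for (int i = 0; i < N; i++)
            interior[i] += recv_buf[i];
    }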

Keywords

Kernel Mode Linux, Pentium 4, ASCI Purple Benchmarks, MWAIT, MONITOR, Thread Synchronization

Degree

MS

Discipline

Computer Science
