Browsing by Author "Dr. Yan Solihin, Committee Chair"
Now showing 1 - 5 of 5
- Characterization of Context Switch Effects on L2 Cache (2007-04-16) Eker, Abdulaziz; Dr. Yan Solihin, Committee Chair; Dr. Suleyman Sair, Committee Member; Dr. Ed Gehringer, Committee Member
  Multitasking is common in most systems. In order to use the processor resources efficiently, a multitasking system schedules processes to run for certain intervals by switching (saving and restoring) their contexts. However, since processes bring their own data to the cache when they are running, context switching causes each process to suffer from more misses. The behavior of L2 cache misses due to context switches with different cache configurations, working-set sizes, and process priorities is not well understood. Analysis of this behavior will give insights into the causes of these misses and ways to mitigate them. The first contribution of this paper is the characterization of the context switch effect on L2 cache in relation to process priorities. The paper also characterizes the context switch effect with various cache configurations, including the size and associativity of the cache. Finally, it defines two types of misses that occur due to context switches. Replacement context switch misses occur when a process' working set is replaced by an interfering process. Reorder context switch misses occur due to reordering of lines by an interfering process, i.e., moving lines from more recently used to less recently used positions. Based on the characterization results, we found that the number of context switch misses increases with lower priorities. On average, a process with the lowest priority suffers 15.4 times more L2 cache misses due to the context switch effect than when there is no time-sharing, while the process with the highest priority suffers only 1.2 times more misses. We also observed that the impact of context switches is affected more by the priority of the process itself than by the priority of the interfering process. We also found that an increase in associativity increases reorder context switch misses. Finally, the highest number of context switch misses occurs when the size of a process' working set is close to the cache size.
- Design and Analysis of Lock-free Data Structures (2007-07-31) Sarkar, Abhik; Dr. Yan Solihin, Committee Chair; Dr. Suleyman Sair, Committee Member; Dr. Ed Gehringer, Committee Member
  The advent of multi-processor systems has motivated programmers to develop multi-threaded and multi-process applications on shared memory data structures. In these applications, multiple processes read and update a shared data structure concurrently, which may lead to race conditions resulting in incoherent memory. To ensure exclusivity of access to this shared memory, programmers have been using locks. Lock-based concurrency is a pessimistic approach that assumes that conflicts among concurrent processes occur frequently. However, if few conflicts occur, lock-based concurrency unnecessarily reduces concurrency. One solution to improve concurrency is to allow non-conflicting processes to execute in parallel. This can be achieved by using fine-grained locks, but programming with them is complex and error-prone. Consequently, researchers have devised an optimistic concurrency mechanism known as lock-free algorithms. Various lock-free libraries have been developed that are either data structure specific or universal constructs. However, these lock-free libraries have been restricted to simple data structures that do not meet the requirements of a real-world application. This work focuses on implementing a lock-free data structure suitable for a real-world application. The suitability of various implementations is analyzed. Design choices are made based upon the requirements and the suitability of lock-free implementations to the specific application considered. Finally, the performance of the lock-free and lock-based implementations is compared. In addition, certain insights into the reduced complexity and the atomicity of the lock-free implementation, which make it robust, are discussed.
- Development of a Cycle Level, Full System, x86 Microprocessor Simulator (2008-03-24) Gambhir, Mohit; Dr. Yan Solihin, Committee Chair; Dr. Eric Rotenberg, Committee Member; Dr. Vincent W. Freeh, Committee Member
  Although x86 processors are the most popular processors in commercial and scientific working environments, there is a scarcity of open source microprocessor simulators that can enable researchers to experiment with new x86-based microprocessor and memory system designs. Also, most of the simulators that exist today are user space simulators that do not profile the operating system code that gets executed when interrupts and system calls are invoked while an application is running. This work involves the development of a cycle level, full system, x86 microprocessor simulator called MYSim. One of the biggest challenges involved in developing an x86-based processor simulator is that the x86 instruction set is complex. Its complexities include variable length instructions that may take a varying number of cycles to decode. Also, the operands in an x86 instruction may reside in registers or in memory or both. These complexities make the x86 instruction set architecture (ISA) particularly hard to simulate. MYSim is an execution-driven simulator that divides the simulation into two parts: the first part is the functional simulator or emulator, which actually executes the simulated application as well as the OS code, and the second part is the timing simulator, which models the timing of the application. MYSim uses Bochs (an open source x86 emulator) as the functional simulator, which emulates x86 processors, hardware devices, memory, etc. and enables the execution of various operating systems and software within the emulation. MYSim's timing simulator is ported from SESC (SuperESCalar Simulator), a simulator that initially supported the MIPS ISA and was modified, as part of this work, to support the x86 ISA. The functional simulator executes the next x86 instruction, breaks it into μops, and feeds those μops to the timing simulator. The timing simulator models a full out-of-order pipeline with branch prediction, caches, buses, and most major components that must be simulated in order to model the timing of modern microprocessors accurately.
- Predicting Compiler Optimization Performance for High-Performance Computing Applications (2005-08-30) Venkatagiri, Radha; Dr. Yan Solihin, Committee Chair; Dr. Gregory T. Byrd, Committee Member; Dr. Rada Y. Chirkova, Committee Member
  High performance computing application developers often spend a large amount of time tuning their applications. Despite the advances in compilers and compiler optimization techniques, tuning efforts are still largely manual and require much trial and error. One of the reasons for this is that many compiler optimizations do not provide a performance gain in all cases. Complicating the problem further is the fact that many compiler optimizations help performance in some cases, but hurt performance in other cases in the same application. To make matters worse, an optimization may help the performance of an application when it runs with a specific input set, but hurt the performance of the same application when it runs with a different input set. The central question that this work deals with is whether machine learning techniques can be used to automate compiler optimization selection. Artificial Neural Networks (ANN) and Decision Trees (DT) are modelled, trained, and used to predict whether loop unrolling optimizations should be applied or not for loops of serial programs. Simple loop characteristics, such as iteration count, nesting level, and body size, are collected and used as input to the ANN or DT. A very simple microbenchmark is used to train the ANN, and this is used to predict the benefit of loop unrolling across different NAS (Serial Version) benchmarks. We find that an ANN trained using the microbenchmark accurately predicts whether loop unrolling is beneficial in 62% of the cases. BT predicts correctly if loop unrolling is beneficial in 82% of the cases. Furthermore, we find that benchmarks such as FT, which perform poorly when tested with an ANN trained with the microbenchmark, yield accurate results in 69% of the cases when tested using an ANN trained with loops from other NAS benchmarks. Decision trees used to classify loops from the NAS benchmarks (as benefiting from loop unrolling or not) were found to have an accuracy of 79.54%. A DT built using the microbenchmark correctly classified NAS loops 53% of the time. Although the results show promise, we believe that to accurately automate compiler optimization selection, more complex loops may need to be modeled in the microbenchmark and many other factors may need to be taken into account in characterizing each loop nest.
- Predicting Loop Unrolling Impact in OpenMP Programs Using Machine Learning (2005-08-16) Poojary, Vikram; Dr. Gregory Byrd, Committee Member; Dr. Edward Gehringer, Committee Member; Dr. Yan Solihin, Committee Chair
  Performance tuning of high performance numerical code is an important process which is still largely performed manually. While recent research in automated performance tuning has proposed run-time application configuration and compilation, most compilers in use today do not support such run-time features. As a result, a performance tuner's role is limited to selecting the right compiler optimizations for a given application and environment in which the application runs. Because many compiler optimizations do not give performance benefits in all cases, performance tuners must tediously test each optimization on their applications under a wide range of scenarios. Therefore, it is desirable to automate compiler optimization selection in order to avoid or at least reduce the tuning effort. This thesis deals with the question of whether machine learning techniques can be used to automate compiler optimization selection. It presents a case study in which an Artificial Neural Network (ANN) and a Decision Tree (DT) are constructed, trained, and used to predict whether, for a given loop nest in a shared memory parallel program, loop unrolling optimization should be applied or not. Simple characteristics of the loop nests, such as the nesting level, iteration count, and body size, are collected and used as input to the ANN or DT. The ANN and DT were trained with loop nests from some OpenMP-based NAS parallel benchmarks, and are used to predict the benefit of loop unrolling across different benchmarks, and across different numbers of parallel threads. Various training methods were tried, and in the best case, ANN predicts correctly whether loop unrolling is beneficial in 62% of the cases, whereas DT predicts correctly whether loop unrolling is beneficial in 56% of the cases. Although the results show promise, we believe that to accurately automate compiler optimization selection, many other factors may need to be taken into account in characterizing each loop nest, due to complex interactions of loop unrolling with memory hierarchy, data layout, thread partitioning, and instruction-level parallelism.
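
The replacement and reorder context switch misses defined in the first abstract of this listing (Eker, 2007) can be illustrated with a small simulation. The following is a minimal Python sketch and not the thesis's methodology. It assumes a single fully associative LRU set, and the names `simulate` and `classify`, the `(pid, line)` trace format, and the classification rule are all hypothetical choices made for illustration: a miss that would have been a hit without time-sharing is counted as a replacement miss if the line was evicted by the interfering process, and as a reorder miss if the line survived the interfering process but was later evicted by the process's own fill.

```python
from collections import OrderedDict


def simulate(trace, ways):
    """Run (pid, line) accesses through one fully associative LRU set.

    Returns one (hit, evicted_by) pair per access, where evicted_by is the
    pid whose fill most recently evicted the accessed line (None if the
    line has never been evicted).
    """
    cache = OrderedDict()  # keys ordered from LRU (first) to MRU (last)
    evicted_by = {}        # (pid, line) -> pid whose fill evicted it
    results = []
    for pid, line in trace:
        key = (pid, line)
        hit = key in cache
        if hit:
            cache.move_to_end(key)  # promote to MRU position
        else:
            if len(cache) >= ways:
                victim, _ = cache.popitem(last=False)  # evict LRU line
                evicted_by[victim] = pid
            cache[key] = True
        results.append((hit, evicted_by.get(key)))
    return results


def classify(trace, ways, pid=0):
    """Count context switch misses of process `pid` by type (assumed rule)."""
    own = [access for access in trace if access[0] == pid]
    alone = [hit for hit, _ in simulate(own, ways)]
    shared = [result for access, result in zip(trace, simulate(trace, ways))
              if access[0] == pid]
    counts = {"replacement": 0, "reorder": 0}
    for hit_alone, (hit_shared, evictor) in zip(alone, shared):
        if hit_alone and not hit_shared:  # miss caused by time-sharing
            kind = "reorder" if evictor == pid else "replacement"
            counts[kind] += 1
    return counts


# Process 0 touches A, B, C; process 1 interferes with X, Y; process 0 resumes.
trace = [(0, "A"), (0, "B"), (0, "C"),
         (1, "X"), (1, "Y"),
         (0, "D"), (0, "C"), (0, "B"), (0, "A")]
print(classify(trace, ways=4))  # {'replacement': 1, 'reorder': 1}
```

In this toy trace, line A is evicted directly by the interfering process (a replacement miss), while line B is only pushed toward the LRU position and is then evicted by process 0's own fill of D (a reorder miss).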
