
Research Projects:



Performance and Organization of Modern DRAM Architectures

In response to the growing gap between memory access time and processor speed, DRAM manufacturers have created several new DRAM architectures. To determine their performance characteristics, we have simulated seven commercial DRAM architectures in a high-performance setting, connected to a fast, out-of-order, eight-way superscalar processor with lockup-free caches: Fast Page Mode, Extended Data Out, Synchronous, Enhanced Synchronous, Synchronous Link, Rambus, and Direct Rambus DRAMs. Among other conclusions, we have found and quantified the following: (a) contemporary DRAM technologies are addressing the memory bandwidth problem but not the memory latency problem; (b) the memory latency problem is closely tied to current mid- to high-performance memory bus speeds (100 MHz), which are inadequate for current high-performance DRAM designs; (c) there is a significant degree of locality in the addresses presented to the primary memory system; this locality is exploited well by DRAM designs that are multi-banked internally and therefore have more than one row buffer; and (d) exploiting this locality will become a critical factor in future systems when memory buses widen, exposing the row access time as a limiting factor.
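A minimal sketch of the kind of row-buffer model that underlies such simulations may make point (c) concrete; the bank count, row size, and timing numbers below are illustrative round figures, not parameters of any of the DRAMs we measured. An access that hits an open row pays only the column access time, while an access to a different row must first precharge and activate, which is why multi-banked designs with several row buffers benefit from address locality.

    #include <stdint.h>

    /* Illustrative open-page timing model: one row buffer per bank.
     * All numbers are made-up round figures, not measured values.   */
    #define NUM_BANKS    4
    #define ROW_SHIFT   14            /* assume 16 KB rows                    */
    #define T_PRECHARGE 30            /* ns: close the currently open row     */
    #define T_ACTIVATE  30            /* ns: move the new row into the buffer */
    #define T_COLUMN    20            /* ns: read a column out of the buffer  */

    static int64_t open_row[NUM_BANKS] = { -1, -1, -1, -1 };

    /* Latency in ns of a single access under an open-page policy. */
    int dram_access_latency(uint64_t addr)
    {
        uint64_t row  = addr >> ROW_SHIFT;
        int      bank = (int)(row % NUM_BANKS);   /* rows interleaved across banks */

        if (open_row[bank] == (int64_t)row)
            return T_COLUMN;                      /* row-buffer hit  */

        open_row[bank] = (int64_t)row;            /* row-buffer miss */
        return T_PRECHARGE + T_ACTIVATE + T_COLUMN;
    }

With more internal banks, more distinct rows can be held open at once, so the address locality noted in (c) turns more accesses into row-buffer hits.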

More on this topic can be found here.


Hardware-Software Co-Design of an Experimental Real-Time Operating System and Microcontroller Architecture

A hardware-software co-design strategy is being used to create an environment for real-time embedded systems, comprising a real-time operating system (RTOS) for microcontrollers and a set of hardware mechanisms that enhance a microarchitecture's real-time capabilities. The project validates this experimental system through extensive testing on a simulated processor and measures the cost-effectiveness of the hardware architecture extensions over a wide range of design choices. The central contribution on the hardware side is the addition of multiple "nanoprocessors" and software-managed caches to the main processor core; each nanoprocessor is a software-configurable finite state machine (FSM) with limited processing capability, but it has access to most of the processor state. These hardware mechanisms are orthogonal to the underlying microarchitecture and can be implemented inexpensively even in a low-cost microcontroller. The RTOS implements all of its high-overhead operating system functions as FSMs that are mapped onto this configurable hardware. Each nanoprocessor becomes a personal assistant to the operating system, performing mundane tasks such as prioritizing interrupts, filtering input/output, or acting as a high-resolution timer. This leaves the main processor core's full processing time available to applications.
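As a sketch of what "mapping an operating-system function onto a nanoprocessor" might look like, the fragment below expresses interrupt prioritization as a small state-transition table that the RTOS would load into one of the configurable FSMs. The structure layout and the loading call are hypothetical, invented for illustration; they are not the actual hardware interface.

    #include <stdint.h>

    /* Hypothetical nanoprocessor configuration: one rule per FSM transition. */
    typedef struct {
        uint8_t  current_state;   /* state in which the rule applies          */
        uint16_t event_mask;      /* interrupt lines that trigger the rule    */
        uint8_t  next_state;      /* state the FSM moves to                   */
        uint8_t  action;          /* what the nanoprocessor does on the event */
        uint8_t  priority;        /* priority attached to the posted event    */
    } nano_rule_t;

    enum { ACT_NONE, ACT_POST_EVENT, ACT_MASK_IRQ };

    /* Prioritize two interrupt sources without waking the main core:
     * the timer (line 0) is posted at a higher priority than the UART (line 3). */
    static const nano_rule_t irq_prioritizer[] = {
        { 0, 1u << 0, 0, ACT_POST_EVENT, 0 },   /* timer: priority 0 (highest) */
        { 0, 1u << 3, 0, ACT_POST_EVENT, 5 },   /* UART:  priority 5           */
    };

    /* Hypothetical RTOS call that writes the table into nanoprocessor 'id'. */
    extern int nano_load_fsm(int id, const nano_rule_t *rules, int n_rules);

Because the configuration is just data, the same FSM hardware could instead be loaded with a table that filters input/output or implements a high-resolution timer.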

This work is in collaboration with Dr. Dave Stewart.


High-Performance Microarchitectures

At the University of Michigan, I was involved in the
PUMA Processor Project, a DARPA-funded project with 4 faculty and roughly 25 graduate students. The goal was to build a 1GHz PowerPC processor in GaAs (gallium arsenide), an exotic material that supports very high clock speeds.

The architecture portion of the project looked at high-performance architecture techniques such as software-managed address translation and runahead processing that would map well to the limited resources available in GaAs. We also looked at using advanced packaging techniques such as multi-chip modules to allow high-speed interconnect between multiple chips, which would allow advanced, multi-chip cache designs.

At Maryland, I am interested in furthering this research; there are a large number of high-performance architecture techniques that can be applied to burgeoning research areas. One such area is embedded systems: the problems that embedded-systems designers face are similar to those of working in an exotic material such as GaAs, in that one must limit the use of die-area resources. In GaAs the limit comes from the physical constraints of the material; in embedded systems it comes from cost, since the cost of manufacturing a processor grows roughly as the cube of the die area, and embedded systems must be inexpensive if they are to be pervasive.
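As a rough illustration of that cubic rule of thumb (the proportionality constant is arbitrary and omitted):

    \text{cost} \propto A^{3}, \qquad
    \frac{\text{cost}(A/2)}{\text{cost}(A)} = \left(\frac{1}{2}\right)^{3} = \frac{1}{8}

so halving the die area cuts manufacturing cost by roughly a factor of eight, which is why area-frugal techniques matter in both settings.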


Flexible, Software-Defined Architectures

Software-defined architectures allow system software to reconfigure implementation details of the hardware platform. For example, caches can be placed under the control of software, allowing software to determine on a page-by-page basis what to cache and where to place it in the cache. The interrupt model can be under software control, allowing software to specify the level of support required for each application; for instance, real-time systems can require more support from the hardware to guarantee reaction times, and such hardware support can significantly reduce overhead compared with equivalent software implementations. Another example is software-defined address translation for embedded systems that use memory management.
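A minimal sketch of what page-by-page software control of the cache could look like follows; the field names and the policy-update call are assumptions made for illustration, not an existing interface.

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative per-page cache policy chosen by system software. */
    typedef struct {
        uint32_t vpn;        /* virtual page number                          */
        bool     cacheable;  /* software may keep a page out of the cache    */
        uint8_t  partition;  /* region of the cache the page is confined to  */
    } page_policy_t;

    /* Hypothetical hook that pushes one page's policy down to the hardware. */
    extern void set_page_policy(const page_policy_t *p);

    /* Example: keep a streaming I/O buffer out of the cache entirely, and
     * confine a hot lookup table to partition 1 so other data never evicts it. */
    void configure_pages(uint32_t io_buffer_vpn, uint32_t table_vpn)
    {
        page_policy_t io  = { .vpn = io_buffer_vpn, .cacheable = false, .partition = 0 };
        page_policy_t tab = { .vpn = table_vpn,     .cacheable = true,  .partition = 1 };
        set_page_policy(&io);
        set_page_policy(&tab);
    }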

The benefits include reduced power consumption and simple support for backward compatibility. Power consumption is reduced through simplification of the hardware; one often arrives at a software-defined design by replacing a fixed hardware structure with a simple hardware hook into software routines. Backward compatibility is similar to architecture emulation; both are supported by a software-oriented design. A software-oriented design is a least common denominator of hardware support for some function. As such, it represents an ideal platform for the emulation of other designs, making architecture emulation and backward compatibility with older hardware (which is simply a special case of architecture emulation) trivial to implement.

The impact of this research is to allow hardware designs to be cheaper, lower-power, and more flexible, and to enable system software to more easily span heterogeneous hardware platforms.


Software & Hardware for Real-Time Memory Management

Thirty years ago, state-of-the-art general-purpose computer systems typically had less physical memory than a program needed to run; programmers explicitly hand-coded memory overlays to reuse sections of physical memory once the program no longer needed them. Besides being inconvenient, programming for an explicit memory configuration made programs less portable and more susceptible to error. In response to this problem, virtual memory was invented to automate the allocation of physical memory, making hand-coded overlays unnecessary. This programming paradigm was very successful; an application programmer could write programs independently of the memory configuration. The programs ran on systems with extremely limited memories, and ran as fast as programs with hand-coded memory overlays. The time to write a program decreased dramatically, as did programming errors. Today, most modern systems support memory management in both software and hardware. Most processors include a translation lookaside buffer (TLB), an on-chip memory structure that caches mapping information to speed up translation for the most recently used pages.

Today's embedded systems resemble the state-of-the-art in general-purpose computing thirty years ago; memory is scarce and programmers either code overlays by hand, or do not use overlays at all. Since embedded applications are typically written from the ground up for each hardware platform, application development takes longer than in general-purpose computing, and programming errors are much more frequent.

The use of real-time memory management in embedded applications could solve these problems, as virtual memory did in general-purpose computing thirty years ago. However, memory management has not become widespread in embedded systems, for several reasons: translation lookaside buffers and hardware-managed physical caches complicate worst-case timing analysis, most embedded systems have no disk to serve as a backing store, and page tables are perceived as too expensive for systems with very little physical memory.

Despite these apparent drawbacks, memory management can be implemented efficiently and inexpensively on a system with a limited amount of physical memory, without auxiliary storage, and in real time. Translation lookaside buffers are not necessary and can be replaced by software-managed virtual caches, whose timing analysis is far simpler than that of a hardware-managed physical cache. A backing store is not needed; unused pages can be compressed or discarded rather than paged to disk. We have designed a memory management system that requires less than 0.1% of the available memory for a page table. We have developed analytical cache models and software memory-management techniques that dramatically simplify timing analysis. We have also developed hardware caching mechanisms that give the software total control over the cache, providing worst-case determinism and greatly simplifying analysis.
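The sketch below shows, under several simplifying assumptions, how these pieces can fit together: a compact page table sized at about 0.1% of the memory it maps, a software miss handler in place of a TLB, and compression or zero-fill in place of a disk-based backing store. The sizes, field names, and helper routines are illustrative, not the actual design.

    #include <stdint.h>

    /* Illustrative real-time memory management without a TLB or a disk:
     * a compact page table, a software miss handler, and compression or
     * zero-fill instead of a backing store.  All names and sizes are
     * assumptions for this sketch, not the actual design. */

    #define PAGE_SIZE  4096u
    #define NUM_PAGES  2048u              /* maps 8 MB of virtual space          */

    typedef struct {
        uint32_t frame      : 20;         /* physical frame number               */
        uint32_t present    : 1;          /* frame holds the current contents    */
        uint32_t compressed : 1;          /* contents are in the compressed pool */
    } pte_t;

    static pte_t page_table[NUM_PAGES];   /* 8 KB: about 0.1% of the 8 MB it maps */

    extern uint32_t alloc_frame(void);                              /* assumed */
    extern void     zero_fill_frame(uint32_t frame);                /* assumed */
    extern void     decompress_page(uint32_t vpn, uint32_t frame);  /* assumed */

    /* Invoked by the virtual-cache / translation miss trap handler. */
    uint32_t translate_on_miss(uint32_t vaddr)
    {
        uint32_t vpn = vaddr / PAGE_SIZE;
        pte_t   *pte = &page_table[vpn];

        if (!pte->present) {
            pte->frame = alloc_frame();
            if (pte->compressed)
                decompress_page(vpn, pte->frame);   /* restore saved contents  */
            else
                zero_fill_frame(pte->frame);        /* page had been discarded */
            pte->present = 1;
        }
        return pte->frame * PAGE_SIZE + (vaddr % PAGE_SIZE);
    }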

