ISMM 2016 – Proceedings

ACM SIGPLAN International Symposium on Memory Management (ISMM 2016), June 14, 2016, Santa Barbara, CA, USA

Frontmatter

Title Page

It is our great pleasure to welcome you to Santa Barbara and to ISMM 2016, the premier forum for research in memory management.

Concurrent Memory Management

Block-Free Concurrent GC: Stack Scanning and Copying
Erik Österlund and Welf Löwe
(Linnaeus University, Sweden)
On-the-fly garbage collectors (GCs) are the state of the art in concurrent GC today. All work is done concurrently with the application, but GC phases are separated by blocking handshakes, so progress relies on the scheduler letting application threads (mutators) reach GC checkpoints and reply. A non-blocking GC must therefore do away with blocking handshakes. We propose a new non-blocking handshake to replace them. It guarantees scheduling-independent, operation-level progress without blocking, though it requires some additional OS support: it allows bounded waiting on threads that are currently running on a processor, independently of threads that are not running. We use this non-blocking handshake in two GC algorithms, one for stack scanning and one for copying objects. By solving hard open theory problems, given this OS support, they pave the way for a future completely non-blocking GC. We integrated the algorithms into the G1 GC of OpenJDK for Java. Across the DaCapo benchmarks, GC pause times were reduced to 12.5% of the original G1's on average, and for a memory-intensive benchmark, 99.99th-percentile latency dropped from 174 ms to 0.67 ms. The improved latency comes at the cost of 15% lower throughput.
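To make the role of the handshake concrete, here is a minimal C++ sketch of the conventional blocking handshake that the paper replaces (not the authors' block-free algorithm); Mutator, checkpoint, and the thread registry are illustrative names:

    #include <atomic>
    #include <thread>
    #include <vector>

    struct Mutator {
        std::atomic<bool> handshake_pending{false};
    };

    std::vector<Mutator*> mutators;  // hypothetical registry of mutator threads

    // Collector side: request a handshake, then block until every mutator replies.
    void blocking_handshake() {
        for (Mutator* m : mutators)
            m->handshake_pending.store(true, std::memory_order_release);
        for (Mutator* m : mutators)
            while (m->handshake_pending.load(std::memory_order_acquire))
                std::this_thread::yield();  // progress depends on the scheduler
    }

    // Mutator side: polled at the GC checkpoints compiled into application code.
    void checkpoint(Mutator* self) {
        if (self->handshake_pending.load(std::memory_order_acquire)) {
            // ...scan own stack or otherwise cooperate with the GC...
            self->handshake_pending.store(false, std::memory_order_release);
        }
    }

The yield loop is exactly where a descheduled mutator can stall the collector indefinitely; the paper's non-blocking handshake bounds the wait for threads that are running and makes the collector independent of threads that are not.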

Characterizing Emerging Heterogeneous Memory
Du Shen, Xu Liu, and Felix Xiaozhu Lin
(College of William and Mary, USA; Purdue University, USA)
Heterogeneous memory (HM, also known as hybrid memory) has become popular in emerging parallel architectures due to its programming flexibility and energy efficiency. Unlike a traditional memory subsystem, HM consists of fast and slow components. The fast memory usually lacks hardware management support, which puts the burden of explicit data placement on programmers and compilers. HM thus presents both opportunities and challenges for programming parallel code, and it is important to understand how to use it well and what benefits to expect from it. Prior work studies HM principally through simulators and lacks analysis on real hardware. To address this gap, this paper studies HM experimentally on a real system, the TI KeyStone II. We make three contributions. First, we develop a set of parallel benchmarks to characterize the performance and power efficiency of HM; it is the first benchmark suite with OpenMP 4.0 features that runs on real HM architectures. Second, we build a profiling tool that guides data placement in HM by analyzing memory access patterns and giving high-level feedback at the source-code level. Third, we apply the resulting data placement optimizations to our benchmarks and evaluate how effectively HM boosts performance and saves energy.
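The placement problem the tool addresses can be illustrated with a small C++ sketch; fast_alloc and slow_alloc are hypothetical placement calls (stubbed with malloc here) standing in for allocation from the fast and slow components of a real HM system:

    #include <cstdlib>

    // Hypothetical placement calls; on real HM hardware these would return
    // memory from the fast and slow components, respectively.
    void* fast_alloc(std::size_t n) { return std::malloc(n); }
    void* slow_alloc(std::size_t n) { return std::malloc(n); }

    void stencil(std::size_t n) {
        // Small, heavily reused data benefits from fast-memory placement...
        double* coeff = static_cast<double*>(fast_alloc(16 * sizeof(double)));
        // ...while a large streaming array has too little reuse to justify
        // occupying the scarce fast memory.
        double* grid = static_cast<double*>(slow_alloc(n * sizeof(double)));
        // ...compute on coeff and grid...
        std::free(coeff);
        std::free(grid);
    }

Deriving this kind of per-object decision from observed access patterns, and reporting it at the source level, is the job of the paper's profiler.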

Hardware Support for Protective and Collaborative Cache Sharing
Raj Parihar, Jacob Brock, Chen Ding, and Michael C. Huang
(University of Rochester, USA)
Shared caches are generally optimized to maximize overall throughput, fairness, or both, among multiple competing programs. In shared environments and compute clouds, however, users are often unrelated to each other, and an overall gain in throughput does not justify an individual loss. This paper explores cache management policies that allow conservative sharing, protecting the cache occupancy of individual programs, yet enable full cache utilization whenever there is an opportunity to do so. We propose a hardware-based mechanism called cache rationing. Each program is assigned a portion of the shared cache as its ration. The hardware protects the ration so that it cannot be taken away by peer programs while in use; a program can exceed its ration, but only by occupying space that another program leaves unused within its own. We show that rationing provides both good resource protection and full utilization of the shared cache for a variety of co-runs.
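One plausible reading of the rationing decision on a miss, sketched in C++ (the paper's actual hardware mechanism and occupancy bookkeeping will differ; all names here are illustrative):

    constexpr int kPrograms = 4;
    int ration[kPrograms];     // lines each program is entitled to
    int occupancy[kPrograms];  // lines each program currently holds

    struct Line { int owner; bool valid; /* tag, data, ... */ };

    // Choose a victim way for program p on a miss in a set of `ways` lines.
    int choose_victim(Line set[], int ways, int p) {
        // 1. Free lines are always fair game.
        for (int w = 0; w < ways; ++w)
            if (!set[w].valid) return w;
        // 2. A program still under its ration may reclaim space from a peer
        //    that has borrowed beyond its own ration.
        if (occupancy[p] < ration[p])
            for (int w = 0; w < ways; ++w) {
                int q = set[w].owner;
                if (q != p && occupancy[q] > ration[q]) return w;
            }
        // 3. Otherwise p replaces one of its own lines: rationed lines that
        //    peers are actively using are protected.
        for (int w = 0; w < ways; ++w)
            if (set[w].owner == p) return w;
        return -1;  // no legal victim in this set; fall back to another policy
    }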

Fast Non-intrusive Memory Reclamation for Highly-Concurrent Data Structures
Dave Dice, Maurice Herlihy, and Alex Kogan
(Oracle Labs, USA; Brown University, USA)
Current memory reclamation mechanisms for highly-concurrent data structures present an awkward trade-off. Techniques such as epoch-based reclamation perform well when all threads are running on dedicated processors, but the delay or failure of a single thread will prevent any other thread from reclaiming memory. Alternatives such as hazard pointers are highly robust, but they are expensive because they require a large number of memory barriers. This paper proposes three novel ways to alleviate the costs of the memory barriers associated with hazard pointers and related techniques. The new proposals are backward-compatible with existing code that uses hazard pointers, and they move the cost of memory management from the principal code path to the infrequent memory reclamation procedure, significantly reducing or eliminating memory barriers executed on the principal code path. The proposals include (1) exploiting the operating system's memory protection ability, (2) exploiting certain x86 hardware features to trigger memory barriers only when needed, and (3) a novel hardware-assisted mechanism, called a hazard lookaside buffer (HLB), that allows a reclaiming thread to query whether there are hazard pointers that need to be flushed to memory. We evaluate our proposals using a few fundamental data structures (linked lists and skiplists) and libcuckoo, a recent high-throughput hash-table library, and show significant improvements over the hazard pointer technique.
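The barrier the paper attacks sits on the hazard-pointer publication path. A minimal C++ sketch of the classic protocol, simplified to a single hazard slot per thread:

    #include <atomic>

    struct Node { std::atomic<Node*> next{nullptr}; int value{0}; };

    std::atomic<Node*> head{nullptr};
    thread_local std::atomic<Node*> hazard{nullptr};  // this thread's published slot

    Node* protect() {
        Node* p;
        do {
            p = head.load(std::memory_order_acquire);
            hazard.store(p, std::memory_order_relaxed);           // publish the hazard
            std::atomic_thread_fence(std::memory_order_seq_cst);  // the costly barrier
        } while (p != head.load(std::memory_order_acquire));      // re-validate
        return p;  // safe to dereference until the hazard slot is cleared
    }

The fence makes the publication visible to a reclaimer before the pointer is dereferenced; all three proposals remove it from this hot path by letting the infrequent reclaimer force or query other threads' hazard slots instead.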

Non-traditional and Datacenter-Scale Memory Systems

Understanding and Improving JVM GC Work Stealing at the Data Center Scale
Wessam Hassanein
(Google, USA)
Garbage collection (GC) is critical to the performance of managed run-time systems such as the OpenJDK Java Virtual Machine (JVM). With a large number of latency-sensitive applications written in Java, JVM performance is essential, and since Java application servers run in data centers on large numbers of multi-core machines, load balancing in multi-threaded GC phases is critical. Dynamic load balancing in the JVM's GC is achieved through work stealing, a well-known and effective method of balancing tasks across threads. This paper analyzes the JVM's work-stealing behavior and introduces a novel work-stealing technique that improves performance, GC CPU utilization, and scalability, and reduces the cost of jobs running in Google's data centers. We analyze both the DaCapo benchmark suite and Google's data-center jobs. The Gmail front-end server shows a 15-20% reduction in GC CPU time and a 5% CPU performance improvement. Across a sample of ~59K jobs, GC CPU utilization improves by 38% geomean (12% weighted geomean), GC pause time by 16% geomean (20% weighted geomean), and full-GC pause time by 34% geomean (12% weighted geomean).
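For readers unfamiliar with the mechanism, the shape of a GC work-stealing loop is sketched below in C++ (a lock-based stand-in for the JVM's lock-free task queues; the paper's improved steal policy is not shown):

    #include <deque>
    #include <mutex>
    #include <random>
    #include <vector>

    struct TaskQueue {
        std::deque<int> q;   // GC tasks, e.g. references to gray objects
        std::mutex m;        // a real implementation uses a lock-free deque
        bool pop(int& t) {
            std::lock_guard<std::mutex> g(m);
            if (q.empty()) return false;
            t = q.back(); q.pop_back(); return true;   // owner takes from one end
        }
        bool steal(int& t) {
            std::lock_guard<std::mutex> g(m);
            if (q.empty()) return false;
            t = q.front(); q.pop_front(); return true; // thieves take from the other
        }
    };

    void gc_worker(int self, std::vector<TaskQueue>& queues) {
        std::mt19937 rng(self);
        int task;
        for (;;) {
            if (queues[self].pop(task)) { /* process(task) */ continue; }
            bool stolen = false;         // local queue empty: try random victims
            for (int tries = 0; tries < 8 && !stolen; ++tries) {
                int victim = static_cast<int>(rng() % queues.size());
                if (victim != self) stolen = queues[victim].steal(task);
            }
            if (stolen) { /* process(task) */ continue; }
            break;  // a real phase runs a termination protocol here instead
        }
    }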

Persistence Programming Models for Non-volatile Memory
Hans-J. Boehm and Dhruva R. Chakrabarti
(HP Labs, USA)
It is expected that DRAM memory will be augmented, and perhaps eventually replaced, by one of several up-and-coming memory technologies. These are all non-volatile, in that they retain their contents without power. This allows primary memory to be used as a fast disk replacement. It also enables more aggressive programming models that directly leverage persistence of primary memory. However, it is challenging to maintain consistency of memory in such an environment. There is no consensus on the right programming model for doing so, and subtle differences can have large, and sometimes surprising, effects on the implementation and its performance.
The existing literature describes multiple programming systems that provide point solutions for selectively persisting user data structures. Real progress in this area requires choosing a programming model, a choice we cannot reasonably make without a real understanding of the design space; point solutions are insufficient. We systematically explore what we consider to be the most promising part of the space, precisely defining semantics and identifying implementation costs. This allows us to be much more explicit and precise about semantic and implementation trade-offs that were usually glossed over in prior work. It also exposes some promising new design alternatives.
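One point in that design space is the widely used flush-and-publish idiom, sketched here in C++ with x86 clwb/sfence intrinsics (requires CLWB hardware and compiler support; this illustrates the kind of semantics the paper makes precise, not its proposed model):

    #include <immintrin.h>

    struct Record { long payload; bool valid; };

    void persist_publish(Record* r, long v) {
        r->payload = v;
        _mm_clwb(&r->payload);  // write the data back toward persistence...
        _mm_sfence();           // ...and order it before the flag's writeback
        r->valid = true;        // publication point: the record becomes visible
        _mm_clwb(&r->valid);
        _mm_sfence();           // after this, data and flag are both durable
    }

Getting such orderings right by hand, for every data structure, is exactly the burden a well-chosen persistence programming model is meant to remove.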

CBufs: Efficient, System-Wide Memory Management and Sharing
Yuxin Ren, Gabriel Parmer, Teo Georgiev, and Gedare Bloom
(George Washington University, USA; Howard University, USA)
Modern systems are composed of many different protection domains separating privilege levels, subsystems, users, clients, and software of differing levels of assurance. System-wide memory management must consider not only allocation to single processes, but also efficient sharing of data across protection domains and the allocation of memory based on the performance of applications that span multiple protection domains. This paper introduces the CBuf system for the global management of virtual and physical memory, including zero-copy sharing between protection domains. We present the design and implementation of both garbage collection techniques that enable efficient sharing and policies that balance memory between protection domains to satisfy system and application constraints such as quality of service. We show that a CBuf-enabled webserver achieves a throughput speedup of more than 2.5× while using less processing time than Apache on Linux, and that the system can deliberately control system throughput through intelligent memory allocation.
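The flavor of zero-copy sharing involved can be approximated with plain POSIX shared memory, as in the C++ sketch below; CBufs' contribution, the system-wide tracking, garbage collection, and allocation policy over such mappings, is omitted:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    void* share_buffer(const char* name, size_t len) {
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0) return nullptr;
        if (ftruncate(fd, len) != 0) { close(fd); return nullptr; }
        // Every protection domain that maps `name` sees the same physical
        // pages, so data moves between domains without copying.
        void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? nullptr : p;
    }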

A Bounded Memory Allocator for Software-Defined Global Address Spaces
François Gindraud, Fabrice Rastello, Albert Cohen, and François Broquedis
(Joseph Fourier University, France; Inria, France; Grenoble INP, France)
This paper presents a memory allocator targeting manycore architectures with distributed memory. These devices, of the Multi-Processor System-on-Chip (MPSoC) family, are composed of multiple nodes linked by an on-chip network; most nodes have multiple processors sharing a small local memory. While MPSoCs typically excel in their performance-per-watt ratio, they remain hard to program due to multilevel parallelism, explicit resource and memory management, and hardware constraints (limited memory, network topology). Typical programming frameworks for MPSoCs leave much target-specific work to the programmer: combining threads or node-local OpenMP, software caching, and explicit message passing (and sometimes routing), with non-standard interfaces. More abstract, automatic frameworks exist, but they target large-scale clusters and do not model the hardware constraints of MPSoCs. The memory allocator described in this paper is one component of a larger runtime system, called Givy, that supports dynamic task graphs with automatic software caching and data-driven execution on MPSoCs. To simplify the programmer's view of memory, both runtime and program data objects live in a Global Address Space (GAS). To avoid address collisions when objects are dynamically allocated, and to manage virtual memory mappings across nodes, a GAS-aware memory allocator is required. This paper proposes such an allocator with the following properties: (1) it is free of inter-node synchronization; (2) its node-local performance matches that of state-of-the-art shared-memory allocators; (3) it provides node-local mechanisms to implement inter-node software caching within a GAS; (4) it is well suited to small memory systems (a few MB per node).
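Property (1), collision-free allocation without inter-node synchronization, can be obtained by statically partitioning the virtual address space by node id. A C++ sketch under assumed constants (GAS_BASE and NODE_RANGE are invented, and Linux mmap with MAP_FIXED_NOREPLACE, which needs a recent kernel, stands in for Givy's actual layout and target):

    #include <cstddef>
    #include <cstdint>
    #include <sys/mman.h>

    constexpr std::uintptr_t GAS_BASE   = 0x100000000000ULL;  // assumed layout
    constexpr std::uintptr_t NODE_RANGE = 1ULL << 36;         // VA slice per node

    // Bump allocation inside this node's private slice of the GAS: the
    // resulting addresses are globally unique by construction, so no
    // inter-node messages are needed to agree on them.
    void* gas_alloc(int node_id, std::uintptr_t& node_cursor, std::size_t n) {
        std::uintptr_t base = GAS_BASE + std::uintptr_t(node_id) * NODE_RANGE;
        void* p = mmap(reinterpret_cast<void*>(base + node_cursor), n,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
        node_cursor += (n + 4095) & ~std::uintptr_t(4095);  // page-align cursor
        return p == MAP_FAILED ? nullptr : p;
    }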

Modeling, Characterization, and Tools

Rust as a Language for High Performance GC Implementation
Yi Lin, Stephen M. Blackburn, Antony L. Hosking, and Michael Norrish
(Australian National University, Australia; Data61, Australia; Purdue University, USA)
High performance garbage collectors build upon performance-critical low-level code, typically exhibit multiple levels of concurrency, and are prone to subtle bugs. Implementing, debugging, and maintaining such collectors can therefore be extremely challenging. The choice of implementation language is a crucial consideration when building a collector. Typically, the drive for performance and the need for efficient support of low-level memory operations lead to the use of low-level languages like C or C++, which offer little by way of safety and software engineering benefits. This risks undermining the robustness and flexibility of the collector design. Rust's ownership model, lifetime specification, and reference borrowing deliver safety guarantees through a powerful static checker with little runtime overhead. These features make Rust a compelling candidate for a collector implementation language, but they come with restrictions that threaten expressiveness and efficiency. We describe our experience implementing an Immix garbage collector in Rust and in C. We discuss the benefits of Rust, the obstacles encountered, and how we overcame them. Our Immix implementation in Rust performs almost identically to its C counterpart on microbenchmarks, and outperforms the popular BDW collector on the gcbench microbenchmark. We find that Rust's safety features do not create significant barriers to implementing a high performance collector. Though memory managers are usually considered low-level, our high performance implementation relies on very little unsafe code, with the vast majority of the implementation benefiting from Rust's safety. We see our experience as a compelling proof-of-concept of Rust as an implementation language for high performance garbage collection.
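To make the "very little unsafe code" claim concrete: the hot allocation path of an Immix-style collector is a bump pointer, sketched below in C++ for consistency with the other examples. In Rust, only the raw pointer arithmetic in a path like this needs an unsafe block; the surrounding collector logic remains statically checked. (The sketch is illustrative, not the authors' code.)

    #include <cstddef>
    #include <cstdint>

    struct BumpAllocator {
        std::uintptr_t cursor = 0;  // next free byte in the current block
        std::uintptr_t limit  = 0;  // end of the current block

        void* alloc(std::size_t size, std::size_t align) {
            std::uintptr_t start = (cursor + align - 1) & ~(align - 1);
            if (start + size > limit) return nullptr;  // slow path: fetch a block
            cursor = start + size;
            return reinterpret_cast<void*>(start);     // raw-pointer step: the
        }                                              // one "unsafe" part in Rust
    };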

Prescient Memory: Exposing Weak Memory Model Behavior by Looking into the Future
Man Cao, Jake Roemer, Aritra Sengupta, and Michael D. Bond
(Ohio State University, USA)
Shared-memory parallel programs are hard to get right. A major challenge is that language and hardware memory models allow unexpected, erroneous behaviors for executions containing data races. Researchers have introduced dynamic analyses that expose weak memory model behaviors, but these approaches cannot expose behaviors due to loading a "future value" -- a value written by a program store that executes after the program load that uses the value. This paper presents prescient memory (PM), a novel dynamic analysis that exposes behaviors due to future values. PM speculatively returns a future value at a program load, and tries to validate the speculative value at a later store. To enable PM to expose behaviors due to future values in real application executions, we introduce a novel approach that increases the chances of using and successfully validating future values, by profiling and predicting future values and guiding execution. Experiments show that our approach is able to uncover a few previously unknown behaviors due to future values in benchmarked versions of real applications. Overall, PM overcomes a key limitation of existing approaches, broadening the scope of program behaviors that dynamic analyses can expose.
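In caricature, PM's instrumentation looks like the C++ sketch below: a load may return a predicted future value, and a later store to the same address validates the speculation. The prediction itself (here the `predicted` argument) is what the profiling-and-guidance machinery supplies; all names are illustrative, not the paper's implementation.

    #include <cstdint>
    #include <unordered_map>

    struct Speculation { std::int64_t value; bool validated; };
    std::unordered_map<std::int64_t*, Speculation> pending;  // per-execution log

    std::int64_t pm_load(std::int64_t* addr, bool speculate, std::int64_t predicted) {
        if (!speculate) return *addr;
        pending[addr] = {predicted, false};  // remember the promised value
        return predicted;                    // expose the future value now
    }

    void pm_store(std::int64_t* addr, std::int64_t v) {
        auto it = pending.find(addr);
        if (it != pending.end() && it->second.value == v)
            it->second.validated = true;  // a real future write matched the guess
        *addr = v;
    }

    // Speculations still unvalidated at the end of the run are discarded as
    // failed; only validated ones correspond to exposed weak-memory behaviors.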

Rethinking a Heap Hierarchy as a Cache Hierarchy: A Higher-Order Theory of Memory Demand (HOTM)
Pengcheng Li, Hao Luo, and Chen Ding
(University of Rochester, USA)
Modern memory allocators divide the available memory among different threads and object size classes. They are governed by many parameters that are interrelated and affect one another, and existing solutions set them with heuristics that cannot serve all applications equally well. This paper presents a theory of memory demand. The theory enables the global optimization of heap parameters for an application. The paper evaluates the theory and the optimization using multi-threaded micro-benchmarks as well as real applications, including Apache, the Ghostscript interpreter, and a database benchmarking tool, and shows that the global optimization theoretically outperforms three typical heuristics by 15% to 113%.

Liveness-Based Garbage Collection for Lazy Languages
Prasanna Kumar K., Amitabha Sanyal, and Amey Karkare
(IIT Bombay, India; IIT Kanpur, India)
We consider the problem of reducing the memory required to run lazy first-order functional programs. Our approach is to analyze programs for liveness of heap-allocated data. The result of the analysis is used to preserve only live data—a subset of reachable data—during garbage collection. The result is an increase in the garbage reclaimed and a reduction in the peak memory requirement of programs. While this technique has already been shown to yield benefits for eager first-order languages, the lack of a statically determinable execution order and the presence of closures pose new challenges for lazy languages. These require changes both in the liveness analysis itself and in the design of the garbage collector. To show the effectiveness of our method, we implemented a copying collector that uses the results of the liveness analysis to preserve live objects, both evaluated values and closures. Our experiments confirm that for programs running with a liveness-based garbage collector, there is a significant decrease in peak memory requirements. In addition, a sizable reduction in the number of collections ensures that, in spite of using a more complex garbage collector, the execution times of programs running with the liveness-based collector remain comparable to those with a reachability-based collector.
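The collector-side change can be sketched in C++: during the copying scan, a per-object liveness result (here a hypothetical bitmask produced by the analysis) decides which fields are traced, so reachable-but-dead data is simply not preserved.

    #include <cstdint>

    struct Obj {
        std::uint32_t live_mask;  // from the analysis: bit i set => field i live
        int nfields;
        Obj* fields[8];
    };

    Obj* copy(Obj* o) { /* Cheney-style copy with forwarding, elided */ return o; }

    void scan(Obj* o) {
        for (int i = 0; i < o->nfields; ++i) {
            if (o->live_mask & (1u << i)) {
                if (o->fields[i]) o->fields[i] = copy(o->fields[i]);  // live: keep
            } else {
                o->fields[i] = nullptr;  // dead though reachable: reclaimed
            }
        }
    }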
