2017 ACM SIGPLAN International Symposium on Memory Management (ISMM 2017),
June 18, 2017,
Barcelona, Spain
Frontmatter
Message from the Chairs
Welcome to the 2017 ACM SIGPLAN International Symposium on Memory Management (ISMM).
ISMM 2017 is co-located with PLDI 2017 in Barcelona, Spain and takes place at the premises of the Universitat Politècnica de Catalunya (UPC), located in the south of Barcelona, close to the airport and to the city centre.
Keynote
Bridging the Gap between Memory Performance and Massive Parallelism: The Critical Role of Programming Systems Innovations (Keynote)
Xipeng Shen
(North Carolina State University, USA)
This talk examines some trends in the modern development of memory systems and their relations with the massive parallelism in processors and applications. It then draws on some recent work on GPUs to explain the important role of programming systems in bridging the gap; it particularly emphasizes the importance of innovations for enabling better software controllability, more software elasticity, and inter-thread data locality enhancements. The talk further discusses the implications brought to programming systems by the increasingly blurred boundaries among memory, storage, and processing.
@InProceedings{ISMM17p1,
author = {Xipeng Shen},
title = {Bridging the Gap between Memory Performance and Massive Parallelism: The Critical Role of Programming Systems Innovations (Keynote)},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {1--1},
doi = {},
year = {2017},
}
Garbage Collection
NG2C: Pretenuring Garbage Collection with Dynamic Generations for HotSpot Big Data Applications
Rodrigo Bruno, Luís Picciochi Oliveira, and Paulo Ferreira
(INESC-ID, Portugal; University of Lisbon, Portugal; Feedzai, Portugal)
Big Data applications suffer from unpredictable and unacceptably high pause times due to Garbage Collection (GC). This is the case in latency-sensitive applications such as on-line credit-card fraud detection, graph-based computing for analysis on social networks, etc. Such pauses compromise latency requirements of the whole application stack and result from applications' aggressive buffering/caching of data, exposing an ill-suited GC design, which assumes that most objects will die young and does not consider that applications hold large amounts of middle-lived data in memory.
To avoid such pauses, we propose NG2C, a new GC algorithm that combines pretenuring with user-defined dynamic generations. By being able to allocate objects into different generations, NG2C is able to group objects with similar lifetime profiles in the same generation. By allocating objects with similar lifetime profiles close to each other, i.e., in the same generation, we avoid object promotion (copying between generations) and heap fragmentation (which leads to heap compactions), both responsible for most of the duration of HotSpot GC pause times.
NG2C is implemented for the OpenJDK 8 HotSpot Java Virtual Machine, as an extension of the Garbage First GC. We evaluate NG2C using Cassandra, Lucene, and GraphChi with three different GCs: Garbage First (G1), Concurrent Mark Sweep (CMS), and NG2C. Results show that NG2C decreases the worst observable GC pause time by up to 94.8% for Cassandra, 85.0% for Lucene, and 96.45% for GraphChi, when compared to current collectors (G1 and CMS). In addition, NG2C has no negative impact on application throughput or memory usage.
@InProceedings{ISMM17p2,
author = {Rodrigo Bruno and Luís Picciochi Oliveira and Paulo Ferreira},
title = {NG2C: Pretenuring Garbage Collection with Dynamic Generations for HotSpot Big Data Applications},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {2--13},
doi = {},
year = {2017},
}
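NG2C itself lives inside the HotSpot JVM, but the grouping idea in the abstract (allocating objects with similar lifetime profiles close together so they can be reclaimed without promotion or compaction) can be sketched outside the JVM with per-group bump-pointer arenas. The C sketch below is only a conceptual analogue; all names (`arena_t`, `arena_alloc`, `arena_reset`) are illustrative and not from the paper.

```c
#include <stddef.h>

/* Conceptual analogue of lifetime-based grouping: each lifetime group
 * ("generation") is a bump-pointer arena, so same-lifetime objects sit
 * close together, and the whole group is reclaimed by one pointer
 * reset: no per-object promotion (copying) and no fragmentation
 * across groups. */
#define ARENA_BYTES 4096

typedef struct {
    unsigned char buf[ARENA_BYTES];
    size_t used;
} arena_t;

/* Bump-allocate n bytes from this group's arena. */
void *arena_alloc(arena_t *a, size_t n) {
    n = (n + 7) & ~(size_t)7;  /* keep allocations 8-byte aligned */
    if (a->used + n > ARENA_BYTES) return NULL;
    void *p = a->buf + a->used;
    a->used += n;
    return p;
}

/* Reclaim every object in the group at once. */
void arena_reset(arena_t *a) { a->used = 0; }
```

A short-lived group (say, per-request data) gets its own arena and is dropped in one step, while a longer-lived group (session data) lives in another arena, untouched.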
Type-Assisted Automatic Garbage Collection for Lock-Free Data Structures
Albert Mingkun Yang and
Tobias Wrigstad
(Uppsala University, Sweden)
We introduce Isolde, an automatic garbage collection scheme designed specifically for managing memory in lock-free data structures, such as stacks, lists, maps and queues. Isolde exists as a plug-in memory manager, designed to sit on top of another memory manager and use its allocator and reclaimer (if one exists). Isolde treats a lock-free data structure as a logical heap, isolated from the rest of the program. This allows garbage collection outside of Isolde to take place without affecting the lock-free data structure. Isolde further manages objects allocated on an Isolde heap in a fully concurrent manner, allowing garbage collection to incrementally remove garbage without stopping other threads from doing work.
@InProceedings{ISMM17p14,
author = {Albert Mingkun Yang and Tobias Wrigstad},
title = {Type-Assisted Automatic Garbage Collection for Lock-Free Data Structures},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {14--24},
doi = {},
year = {2017},
}
Clever Data Tricks
A Marshalled Data Format for Pointers in Relocatable Data Blocks
Nick Vrvilo, Lechen Yu, and Vivek Sarkar
(Rice University, USA)
As future computing hardware progresses towards extreme-scale technology, new challenges arise for addressing heterogeneous compute and memory resources, for providing application resilience in the presence of more frequent failures, and for working within strict energy constraints. While C++ has gained popularity in recent years within the HPC community, some concepts of object-oriented program design may be at odds with the techniques we use to address the challenges of extreme-scale computing.
In this work, we focus on the challenges related to using aggregate data structures that include pointer values within a programming model where the runtime may frequently relocate data, and traditional serialization techniques are not practical. We propose and evaluate a marshalled encoding for relocatable data blocks, and present a C++ library and other tools to simplify the work of the application programmer developing new applications or porting existing applications to such emerging programming models.
@InProceedings{ISMM17p25,
author = {Nick Vrvilo and Lechen Yu and Vivek Sarkar},
title = {A Marshalled Data Format for Pointers in Relocatable Data Blocks},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {25--35},
doi = {},
year = {2017},
}
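The abstract does not spell out the encoding, but a common marshalled format for pointers in relocatable blocks stores each intra-block pointer as a byte offset from the block base, so the runtime can move the block without a pointer fix-up pass. A minimal C sketch under that assumption (all names hypothetical):

```c
#include <stdint.h>

/* Illustrative format: a pointer to another location in the same block
 * is marshalled as a byte offset from the block base, so the whole
 * block can be memcpy'd to a new address and the "pointers" stay
 * valid with no fix-up pass. */
typedef uintptr_t rel_ptr;  /* offset from the block base, not an address */

typedef struct node {
    rel_ptr next;  /* marshalled intra-block pointer */
    int value;
} node_t;

/* Marshal an intra-block pointer relative to the block base. */
rel_ptr rel_encode(void *base, void *target) {
    return (uintptr_t)target - (uintptr_t)base;
}

/* Unmarshal an offset against the block's current base address. */
void *rel_decode(void *base, rel_ptr off) {
    return (char *)base + off;
}
```

Linking two nodes inside a buffer, copying the buffer with `memcpy` to simulate relocation, and decoding `next` against the copy's base reaches the relocated target.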
Flexible and Efficient Memory Object Metadata
Zhengyang Liu and
John Criswell
(Beijing University of Posts and Telecommunications, China; University of Rochester, USA)
Compiler-based tools can protect software from attack and find bugs within programs. To support programs written in type-unsafe languages such as C, such tools need to add code into a program that must, at run time, take a pointer into a memory object and locate metadata for that memory object. Current methods of locating metadata are either flexible (supporting metadata of varying sizes) at the expense of speed and scalability, or fast (e.g., by using shadow tables) at the cost of flexibility (metadata is small and must always be the same size).
This paper presents a new method of attaching metadata to memory objects, named Padding Area MetaData (PAMD), that is both flexible and efficient. Metadata can be any size, and different memory objects can have different sized metadata. While flexible, the algorithm for finding the metadata given a pointer into the memory object takes constant time. Our method extends Baggy Bounds with Accurate Checking (BBAC) which attaches constant-sized metadata to memory objects for performing precise dynamic bounds checks. Our design supports variable-sized metadata, and our implementation supports larger programs.
We evaluated the performance and scalability of PAMD using dynamic bounds checking as an exemplar of our method. Our results show that our method adds at most 33% overhead to an identical dynamic bounds checking tool that trades precision for performance by using a simple shadow table. Our results also show that our method, while having the same flexibility as splay trees, performs significantly faster and scales better as a program allocates more memory.
@InProceedings{ISMM17p36,
author = {Zhengyang Liu and John Criswell},
title = {Flexible and Efficient Memory Object Metadata},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {36--46},
doi = {},
year = {2017},
}
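As a rough illustration of padding-area metadata, the C sketch below uses a single fixed, size-aligned slot so that any interior pointer recovers the metadata with one mask, in constant time. The layout and names are assumptions for the demo, not PAMD's actual design (which supports variable-sized metadata and different slot sizes).

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative layout: each object lives in a slot whose size is a
 * power of two and whose start is aligned to that size.  The metadata
 * sits in the slot's padding, so any interior pointer recovers it
 * with a single mask, in constant time. */
#define SLOT_SIZE 256  /* one fixed slot size keeps the sketch simple */

typedef struct {
    size_t object_size;  /* example metadata: precise object bounds */
} meta_t;

typedef struct {
    meta_t meta;              /* metadata in the padding area */
    unsigned char payload[];  /* object data follows */
} slot_t;

/* Allocate a SLOT_SIZE-aligned slot and stash metadata in its padding. */
void *pamd_alloc(size_t object_size) {
    slot_t *s = aligned_alloc(SLOT_SIZE, SLOT_SIZE);
    if (!s) return NULL;
    s->meta.object_size = object_size;
    return s->payload;
}

/* Constant-time metadata lookup from any pointer into the object. */
meta_t *pamd_meta(void *interior) {
    uintptr_t base = (uintptr_t)interior & ~(uintptr_t)(SLOT_SIZE - 1);
    return &((slot_t *)base)->meta;
}
```

Given a pointer `p` returned by `pamd_alloc(100)`, both `pamd_meta(p)` and `pamd_meta(p + 57)` reach the same metadata, which is what makes interior-pointer bounds checks cheap.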
Shadow State Encoding for Efficient Monitoring of Block-Level Properties
Kostyantyn Vorobyov,
Julien Signoles, and Nikolai Kosmatov
(CEA LIST, France)
Memory shadowing associates addresses from an application's memory to values stored in a disjoint memory space called shadow memory. At runtime, shadow values store metadata about the application memory locations they are mapped to. Shadow state encodings -- the structure of shadow values and their interpretation -- vary across different tools. Encodings used by state-of-the-art monitoring tools have proven useful for tracking memory at the byte level, but cannot address properties related to memory block boundaries. Tracking block boundaries is, however, crucial for spatial memory safety analysis, where a spatial violation, such as an out-of-bounds access, may dereference an allocated location belonging to an adjacent block or a different struct member.
This paper describes two novel shadow state encodings which capture block-boundary-related properties. These encodings have been implemented in E-ACSL, a runtime verification tool for C programs. Initial experiments involving checking the validity of pointer and array accesses in computationally intensive runs of programs selected from the SPEC CPU benchmarks demonstrate runtime and memory overheads comparable to those of state-of-the-art memory debuggers.
@InProceedings{ISMM17p47,
author = {Kostyantyn Vorobyov and Julien Signoles and Nikolai Kosmatov},
title = {Shadow State Encoding for Efficient Monitoring of Block-Level Properties},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {47--58},
doi = {},
year = {2017},
}
Hybrid Memory Systems
Analyzing Memory Management Methods on Integrated CPU-GPU Systems
Mohammad Dashti and
Alexandra Fedorova
(University of British Columbia, Canada)
Heterogeneous systems that integrate a multicore CPU and a GPU on the same die are ubiquitous. On these systems, both the CPU and GPU share the same physical memory as opposed to using separate memory dies. Although integration eliminates the need to copy data between the CPU and the GPU, arranging transparent memory sharing between the two devices can carry large overheads. Memory on CPU/GPU systems is typically managed by a software framework such as OpenCL or CUDA, which includes a runtime library, and communicates with a GPU driver. These frameworks offer a range of memory management methods that vary in ease of use, consistency guarantees and performance. In this study, we analyze some of the common memory management methods of the most widely used software frameworks for heterogeneous systems: CUDA, OpenCL 1.2, OpenCL 2.0, and HSA, on NVIDIA and AMD hardware. We focus on performance/functionality trade-offs, with the goal of exposing their performance impact and simplifying the choice of memory management methods for programmers.
@InProceedings{ISMM17p59,
author = {Mohammad Dashti and Alexandra Fedorova},
title = {Analyzing Memory Management Methods on Integrated CPU-GPU Systems},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {59--69},
doi = {},
year = {2017},
}
Continuous Checkpointing of HTM Transactions in NVM
Ellis Giles,
Kshitij Doshi, and Peter Varman
(Rice University, USA; Intel, USA)
This paper addresses the challenges of coupling byte-addressable non-volatile memory (NVM) and hardware transactional memory (HTM) in high-performance transaction processing. We first show that HTM transactions can be ordered using existing processor instructions without any hardware changes. In contrast, existing solutions posit changes to HTM mechanisms in the form of special instructions or modified functionality. We exploit the ordering mechanism to design a novel persistence method that decouples HTM concurrency from back-end NVM operations. Failure atomicity is achieved using redo logging coupled with aliasing to guard against mistimed cache evictions. Our algorithm uses efficient lock-free mechanisms with bounded static memory requirements.
We evaluated our approach using both micro-benchmarks and benchmarks from the STAMP suite, and showed that it compares well with standard (volatile) HTM transactions. We also showed that it yields significant gains in throughput and latency in comparison with persistent transactional locking.
@InProceedings{ISMM17p70,
author = {Ellis Giles and Kshitij Doshi and Peter Varman},
title = {Continuous Checkpointing of HTM Transactions in NVM},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {70--81},
doi = {},
year = {2017},
}
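The redo-logging-with-aliasing idea can be pictured in a purely software form, with no real HTM or NVM involved: transactional stores go to a redo log, reads are serviced from the log (aliasing), and home locations change only at commit. All names below are illustrative, and the single-threaded sketch elides the paper's lock-free machinery and actual flush instructions.

```c
#include <stddef.h>

/* Software-only illustration: stores are redirected to a redo log and
 * read back from it, so uncommitted data can never reach the "home"
 * array (standing in for NVM) via a mistimed cache eviction; only
 * tx_commit applies the log to home locations. */
#define HOME_SIZE 16
#define LOG_SIZE  64

int tx_home[HOME_SIZE];  /* stand-in for durable NVM state */
static struct { size_t idx; int val; } redo_log[LOG_SIZE];
static size_t log_len;

void tx_write(size_t idx, int val) {  /* store: append to the log */
    redo_log[log_len].idx = idx;
    redo_log[log_len].val = val;
    log_len++;
}

int tx_read(size_t idx) {  /* load: newest log entry wins (aliasing) */
    for (size_t i = log_len; i > 0; i--)
        if (redo_log[i - 1].idx == idx) return redo_log[i - 1].val;
    return tx_home[idx];
}

void tx_commit(void) {  /* conceptually: flush log, then apply it */
    for (size_t i = 0; i < log_len; i++)
        tx_home[redo_log[i].idx] = redo_log[i].val;
    log_len = 0;
}
```

Between `tx_write` and `tx_commit`, the transaction sees its own write while the home array is untouched, which is the failure-atomicity property the paper's design enforces against NVM.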
RTHMS: A Tool for Data Placement on Hybrid Memory System
Ivy Bo Peng, Roberto Gioiosa, Gokcen Kestor, Pietro Cicotti, Erwin Laure, and Stefano Markidis
(KTH, Sweden; Pacific Northwest National Laboratory, USA; University of California at San Diego, USA)
Traditional scientific and emerging data analytics applications require fast, power-efficient, large, and persistent memories. Combining all these characteristics within a single memory technology is expensive and hence future supercomputers will feature different memory technologies side by side. However, it is a complex task to program hybrid-memory systems and to identify the best object-to-memory mapping. We envision that programmers will probably resort to using default configurations that require only minimal interventions on the application code or system settings. In this work, we argue that intelligent, fine-grained data placement can achieve higher performance than default setups.
We present an algorithm for data placement on hybrid-memory systems. Our algorithm is based on a set of single-object allocation rules and global data placement decisions. We also present RTHMS, a tool that implements our algorithm and provides recommendations about the object-to-memory mapping. Our experiments on a hybrid memory system, an Intel Knights Landing processor with DRAM and HBM, show that RTHMS is able to achieve higher performance than the default configuration. We believe that RTHMS will be a valuable tool for programmers working on complex hybrid-memory systems.
@InProceedings{ISMM17p82,
author = {Ivy Bo Peng and Roberto Gioiosa and Gokcen Kestor and Pietro Cicotti and Erwin Laure and Stefano Markidis},
title = {RTHMS: A Tool for Data Placement on Hybrid Memory System},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {82--91},
doi = {},
year = {2017},
}
A Deeper Look
"What's in a Name?" Going Beyond Allocation Site Names in Heap Analysis
Vini Kanvar and Uday P. Khedker
(IIT Bombay, India)
A points-to analysis computes a sound abstraction of heap memory, conventionally using a name-based abstraction that summarizes runtime memory by grouping locations using the names of allocation sites: all concrete heap locations allocated by the same statement are grouped together. The locations in the same group are treated alike, i.e., a pointer to any one location of the group is assumed to point to every location in the group, leading to an over-approximation of points-to relations.
We propose an access-based abstraction that partitions each name-based group of locations into equivalence classes at every program point using an additional criterion of the sets of access paths (chains of pointer indirections) reaching the locations in the memory. The intuition is that locations that are both allocated and accessed alike should be grouped into the same equivalence class. Since the access paths in the memory could reach different locations at different program points, our groupings change flow-sensitively, unlike the name-based groupings. This creates a more precise view of the memory. Theoretically, it is strictly more precise than the name-based abstraction except in some trivial cases; practically, it is far more precise.
Our empirical measurements show the benefits of our tool Access-Based Heap Analyzer (ABHA) on SPEC CPU 2006 and heap-manipulating SV-COMP benchmarks. ABHA, which is field-, flow-, and context-sensitive, scales to 20 kLoC and can improve precision by up to 99% (in terms of the number of aliases). Additionally, ABHA allows any user-defined summarization of an access path to be plugged in; we have implemented and evaluated four summarization techniques. ABHA can also act as a front-end to TVLA, a parametrized shape analyzer, in order to automate its parametrization by generating predicates that capture the program behaviour more accurately.
@InProceedings{ISMM17p92,
author = {Vini Kanvar and Uday P. Khedker},
title = {"What's in a Name?" Going Beyond Allocation Site Names in Heap Analysis},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {92--103},
doi = {},
year = {2017},
}
A Refinement Hierarchy for Free List Memory Allocators
Bin Fang and Mihaela Sighireanu
(East China Normal University, China; University Paris Diderot, France; CNRS, France)
Existing implementations of dynamic memory allocators (DMA) employ a large spectrum of policies and techniques. The formal specifications of these techniques are quite complicated in isolation and very complex when combined. Therefore, formal reasoning about a specific DMA implementation is difficult for automatic tools and mostly single-use.
This paper proposes a solution to this problem by providing formal models for a full class of DMA, the free list class. To obtain manageable formal reasoning and reusable formal models, we organize these models in a hierarchy ranked by refinement relations. We prove the soundness of models and refinement relations using an off-the-shelf theorem prover. We demonstrate that our hierarchy is a basis for an algorithm theory for the class of free list DMA: it abstracts various existing implementations of DMA and leads to new DMA implementations. We illustrate its application to model-based code generation, testing, run-time verification, and static analysis.
@InProceedings{ISMM17p104,
author = {Bin Fang and Mihaela Sighireanu},
title = {A Refinement Hierarchy for Free List Memory Allocators},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {104--114},
doi = {},
year = {2017},
}
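For readers unfamiliar with the free list class the paper formalizes, here is a toy member of it: a first-fit free-list allocator in C with block splitting and one-step coalescing. Real members of the class vary the fit policy (first/best/next fit), list structure, and coalescing strategy; this sketch is illustrative, not taken from the paper.

```c
#include <stddef.h>

/* Toy free-list allocator: first-fit search over an address-ordered
 * list of block headers, splitting oversized blocks and coalescing
 * with the successor on free. */
#define HEAP_SIZE 4096

typedef struct header {
    size_t size;          /* payload bytes in this block */
    int free;             /* nonzero if the block is free */
    struct header *next;  /* next block in address order */
} header_t;

static _Alignas(header_t) unsigned char heap[HEAP_SIZE];
static header_t *head;

void fl_init(void) {
    head = (header_t *)heap;
    head->size = HEAP_SIZE - sizeof(header_t);
    head->free = 1;
    head->next = NULL;
}

void *fl_malloc(size_t size) {
    size = (size + 7) & ~(size_t)7;  /* keep headers 8-byte aligned */
    for (header_t *h = head; h; h = h->next) {
        if (!h->free || h->size < size) continue;
        /* split if the remainder can hold a header plus some payload */
        if (h->size >= size + sizeof(header_t) + 8) {
            header_t *rest = (header_t *)((unsigned char *)(h + 1) + size);
            rest->size = h->size - size - sizeof(header_t);
            rest->free = 1;
            rest->next = h->next;
            h->next = rest;
            h->size = size;
        }
        h->free = 0;
        return h + 1;  /* payload starts right after the header */
    }
    return NULL;  /* out of memory */
}

void fl_free(void *p) {
    header_t *h = (header_t *)p - 1;
    h->free = 1;
    if (h->next && h->next->free) {  /* coalesce with the next block */
        h->size += sizeof(header_t) + h->next->size;
        h->next = h->next->next;
    }
}
```

Even this minimal variant exhibits the behaviors the paper's models must capture, e.g. that freeing a block makes its exact address available again to a later, smaller first-fit request.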
Avoiding Consistency Exceptions Under Strong Memory Models
Minjia Zhang, Swarnendu Biswas, and
Michael D. Bond
(Microsoft Research, USA; University of Texas at Austin, USA; Ohio State University, USA)
Shared-memory languages and systems generally provide weak or undefined semantics for executions with data races. Prior work has proposed memory consistency models that ensure well-defined, easy-to-understand semantics based on region serializability (RS), but the resulting system may throw a consistency exception in the presence of a data race. Consistency exceptions can occur unexpectedly even in well-tested programs, hurting availability and thus limiting the practicality of RS-based memory models.
To our knowledge, this paper is the first to consider the problem of availability for memory consistency models that throw consistency exceptions. We first extend existing approaches that enforce RSx, a memory model based on serializability of synchronization-free regions (SFRs), to avoid region conflicts and thus consistency exceptions. These new approaches demonstrate both the potential for and limitations of avoiding consistency exceptions under RSx. To improve availability further, we introduce (1) a new memory model called RIx based on isolation of SFRs and (2) a new approach called Avalon that provides RIx. We demonstrate two variants of Avalon that offer different performance–availability tradeoffs for RIx.
An evaluation on real Java programs shows that this work’s novel approaches are able to reduce consistency exceptions, thereby improving the applicability of strong memory consistency models. Furthermore, the approaches provide compelling points in the performance–availability tradeoff space for memory consistency enforcement. RIx and Avalon thus represent a promising direction for tackling the challenge of availability under strong consistency models that throw consistency exceptions.
@InProceedings{ISMM17p115,
author = {Minjia Zhang and Swarnendu Biswas and Michael D. Bond},
title = {Avoiding Consistency Exceptions Under Strong Memory Models},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {115--127},
doi = {},
year = {2017},
}