ACM SIGPLAN International Symposium on Memory Management (ISMM 2015)
June 14, 2015, Portland, OR, USA
Frontmatter
Foreword
It is our great pleasure to welcome you to Portland, and to ISMM 2015, the premier specialized forum for research in memory management. ISMM 2015 solicited both long and short papers on topics including but not limited to:

• Memory allocation and de-allocation
• Garbage collection algorithms and implementations
• Compiler analyses and tools to aid memory management
• Empirical analysis of heap-intensive programs
• Formal analysis and verification of heap-intensive programs
• Memory system design and analysis
• Verification of memory management algorithms
• Memory management for new or non-traditional systems
• Development and evaluation of open-source implementations related to memory management

Prospective authors submitted 25 papers. Each paper received 4 or more reviews, at least 2 of which were from program committee (PC) members. All authors were invited to submit a response to the initial reviews, which reviewers considered during online and in-person discussions. PC members were allowed to submit papers, which were evaluated entirely by external review committee (ERC) members. Only one of the 25 submissions was co-authored by a PC member. The PC met in person to decide on the remaining 24 papers. Submissions were double-blind, and author identity was not revealed during the meeting, except in cases where the PC chair felt that anonymity might be harming the paper's chances. Overall, the PC and ERC accepted 12 papers, 2 of which were accepted conditionally with an assigned shepherd.

We thank those who volunteered their time to ISMM'15, and the organizers of the ACM Federated Computing Research Conference, with which ISMM'15 is affiliated. We especially thank the members of the PC and ERC for their time and effort in reviewing and considering the submitted papers. We also thank Karin Strauss, our keynote speaker, for her talk on Tolerating Holes in Wearable Memories. Enjoy!

Antony Hosking, ISMM'15 General Chair, Purdue University, USA
Mike Bond, ISMM'15 Program Chair, Ohio State University, USA
Mobile Systems
Controlling Physical Memory Fragmentation in Mobile Systems
Sang-Hoon Kim, Sejun Kwon, Jin-Soo Kim, and Jinkyu Jeong
(KAIST, South Korea; Sungkyunkwan University, South Korea)
The adoption of hardware-accelerated features (e.g., hardware codecs) improves the performance and quality of mobile devices, but it also revives the need for contiguous memory allocation. Physical memory in mobile systems, however, is highly fragmented due to the frequent spawning and exiting of processes and the lack of a proactive anti-fragmentation scheme. As a result, the allocation of large, contiguous I/O buffers suffers from this fragmentation, incurring high CPU usage and power consumption.
This paper presents a proactive anti-fragmentation approach that groups pages with the same lifetime and stores them in fixed-size contiguous regions. When a process is killed to secure free memory, a set of contiguous regions is freed, and subsequent contiguous memory allocations can be satisfied without additional overhead. Our prototype implementation on a Nexus 10 tablet with the Android kernel shows that the proposed scheme greatly alleviates fragmentation, thereby reducing the I/O buffer allocation time, associated CPU usage, and energy consumption.
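The grouping idea can be shown in miniature: pages requested on behalf of one process are carved from a region owned by that process, so the process's exit frees whole regions at once. The following C sketch is only an illustration of that policy; the region size, the global list, and the function names are assumptions, not the paper's kernel implementation.

    #include <stdlib.h>

    #define REGION_PAGES 512            /* pages per fixed-size region (assumption) */
    #define PAGE_SIZE    4096

    /* A region holds pages with the same expected lifetime (one owning process). */
    struct region {
        int   owner;                    /* process id owning every page inside */
        int   used;                     /* pages handed out so far */
        char *base;                     /* REGION_PAGES * PAGE_SIZE bytes */
        struct region *next;
    };

    static struct region *regions;      /* all live regions (toy global list) */

    /* Allocate a page for `pid`, always from a region owned by that pid. */
    static char *alloc_page(int pid)
    {
        for (struct region *r = regions; r; r = r->next)
            if (r->owner == pid && r->used < REGION_PAGES)
                return r->base + (r->used++) * PAGE_SIZE;

        struct region *r = malloc(sizeof *r);
        r->owner = pid;
        r->used  = 1;
        r->base  = aligned_alloc(PAGE_SIZE, REGION_PAGES * PAGE_SIZE);
        r->next  = regions;
        regions  = r;
        return r->base;
    }

    /* When a process dies, whole regions become contiguous free memory again. */
    static void process_exit(int pid)
    {
        struct region **p = &regions;
        while (*p) {
            if ((*p)->owner == pid) {
                struct region *dead = *p;
                *p = dead->next;
                free(dead->base);       /* returns REGION_PAGES contiguous pages */
                free(dead);
            } else {
                p = &(*p)->next;
            }
        }
    }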
@InProceedings{ISMM15p1,
author = {Sang-Hoon Kim and Sejun Kwon and Jin-Soo Kim and Jinkyu Jeong},
title = {Controlling Physical Memory Fragmentation in Mobile Systems},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {1--14},
doi = {},
year = {2015},
}
Don't Race the Memory Bus: Taming the GC Leadfoot
Ahmed Hussein, Antony L. Hosking, Mathias Payer, and Christopher A. Vick
(Purdue University, USA; Qualcomm, USA)
Dynamic voltage and frequency scaling (DVFS) is ubiquitous on mobile devices as a mechanism for saving energy. Reducing the clock frequency of a processor allows a corresponding reduction in power consumption, as does turning off idle cores. Garbage collection is a canonical example of the sort of memory-bound workload that best responds to such scaling. Here, we explore the impact of frequency scaling for garbage collection in a real mobile device running Android's Dalvik virtual machine, which uses a concurrent collector. By controlling the frequency of the core on which the concurrent collector thread runs we can reduce power significantly. Running established multi-threaded benchmarks shows that total processor energy can be reduced up to 30%, with end-to-end performance loss of at most 10%.
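On Linux-based devices like the one studied here, per-core frequencies are exposed through the cpufreq sysfs interface. The C sketch below shows how a runtime might pin its concurrent collector thread to one core and lower that core's frequency during a collection cycle; the core number, the target frequencies, and the assumption that the userspace governor is active (and that the process has the needed privileges) are illustrative, and the Dalvik-internal mechanics are not described in the abstract.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <pthread.h>

    /* Pin the calling (collector) thread to `cpu`. */
    static int pin_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    /* Request a frequency for `cpu` via the cpufreq sysfs interface.
     * Requires the `userspace` governor and suitable privileges; a real
     * runtime would save and restore the previous setting. */
    static int set_cpu_khz(int cpu, long khz)
    {
        char path[128];
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
        FILE *f = fopen(path, "w");
        if (!f) return -1;
        fprintf(f, "%ld\n", khz);
        return fclose(f);
    }

    /* Collector thread: run slow while tracing concurrently, restore after. */
    static void *collector_main(void *arg)
    {
        (void)arg;
        pin_to_cpu(3);                /* dedicate core 3 to GC (assumption) */
        set_cpu_khz(3, 600000);       /* e.g. 600 MHz during concurrent marking */
        /* ... concurrent collection work ... */
        set_cpu_khz(3, 1900000);      /* restore nominal frequency */
        return NULL;
    }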
@InProceedings{ISMM15p15,
author = {Ahmed Hussein and Antony L. Hosking and Mathias Payer and Christopher A. Vick},
title = {Don't Race the Memory Bus: Taming the GC Leadfoot},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {15--27},
doi = {},
year = {2015},
}
New Memory Management Algorithms
Data Structure Aware Garbage Collector
Nachshon Cohen and Erez Petrank
(Technion, Israel)
Garbage collection may benefit greatly from knowledge about program behavior, but most managed languages do not provide means for the programmer to deliver such knowledge. In this work we propose a very simple interface that requires minor programmer effort and achieves substantial performance and scalability improvements. In particular, we focus on the common use of data structures or collections for organizing data on the heap. We let the program notify the collector which classes represent nodes of data structures and also when such nodes are being removed from their data structures. The data-structure aware (DSA) garbage collector uses this information to improve performance, locality, and load balancing. Experience shows that this interface requires a minor modification of the application. Measurements show that for some significant benchmarks this interface can dramatically reduce the time spent on garbage collection and also improve the overall program performance.
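The abstract does not spell out the interface, but its two notifications suggest a shape like the following hedged C sketch: the application registers its data-structure node classes and reports removals, and the collector records the hints for later use. All names and the hint buffer are hypothetical, not the paper's actual API.

    #include <stddef.h>

    /* Tell the collector that instances of this class are data-structure
     * nodes (registration side of the hypothetical interface). */
    static void gc_register_node_class(const char *class_name) { (void)class_name; }

    #define REMOVED_MAX 4096
    static void *removed_nodes[REMOVED_MAX];   /* hint buffer for the collector */
    static size_t removed_count;

    /* Tell the collector that `node` was unlinked from its data structure;
     * the collector can treat it as likely-dead and exploit the structure
     * information for locality and load balancing. */
    static void gc_node_removed(void *node)
    {
        if (removed_count < REMOVED_MAX)
            removed_nodes[removed_count++] = node;
    }

    /* Application side: a list remove needs only one extra call,
     * illustrating the "minor programmer effort" the abstract claims. */
    struct list_node { struct list_node *next; void *payload; };

    static void *list_pop(struct list_node **head)
    {
        struct list_node *n = *head;
        if (!n) return NULL;
        *head = n->next;
        gc_node_removed(n);
        return n->payload;
    }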
@InProceedings{ISMM15p28,
author = {Nachshon Cohen and Erez Petrank},
title = {Data Structure Aware Garbage Collector},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {28--40},
doi = {},
year = {2015},
}
SuperMalloc: A Super Fast Multithreaded Malloc for 64-bit Machines
Bradley C. Kuszmaul
(Massachusetts Institute of Technology, USA)
SuperMalloc is an implementation of malloc(3) originally designed for x86 Hardware Transactional Memory (HTM). It turns out that the same design decisions also make it fast even without HTM. For the malloc-test benchmark, which is one of the most difficult workloads for an allocator, with one thread SuperMalloc is about 2.1 times faster than the best of DLmalloc, JEmalloc, Hoard, and TBBmalloc; with 8 threads and HTM, SuperMalloc is 2.75 times faster; and on 32 threads without HTM SuperMalloc is 3.4 times faster. SuperMalloc generally compares favorably with the other allocators on speed, scalability, speed variance, memory footprint, and code size, and it achieves these performance advantages using less than half as much code as the alternatives.
SuperMalloc exploits the fact that although physical memory is always precious, virtual address space on a 64-bit machine is relatively cheap. It allocates 2 MiB chunks, each of which contains objects all of the same size. To translate chunk numbers to chunk metadata, SuperMalloc uses a simple array (most of which is uncommitted to physical memory). SuperMalloc takes care to avoid associativity conflicts in the cache: most of the size classes are a prime number of cache lines, and nonaligned huge accesses are randomly aligned within a page. Objects are allocated from the fullest non-full page in the appropriate size class. For each size class, SuperMalloc employs a 10-object per-thread cache, a per-CPU cache that holds about a level-2-cache's worth of objects per size class, and a global cache organized to allow the movement of many objects between a per-CPU cache and the global cache using O(1) instructions. SuperMalloc prefetches everything it can before starting a critical section, which makes the critical sections run fast and, for HTM, improves the odds that the transaction will commit.
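The "simple array" translation can be sketched directly: reserve virtual address space for a metadata slot for every possible chunk and let the OS commit pages only where the array is actually touched. A minimal C sketch, assuming 2 MiB chunks and a 48-bit address space; SuperMalloc's real metadata is richer than one word per chunk.

    #include <stdint.h>
    #include <sys/mman.h>

    #define CHUNK_BITS 21                         /* 2 MiB chunks */
    #define CHUNK_OF(p) ((uintptr_t)(p) >> CHUNK_BITS)

    /* One slot per possible chunk in a 48-bit address space:
     * 2^27 slots * 4 bytes = 512 MiB of *virtual* space, of which only
     * the few pages actually touched ever get committed by the OS. */
    #define NCHUNKS (1ULL << (48 - CHUNK_BITS))

    static uint32_t *chunk_sizeclass;             /* size class of each chunk */

    static int metadata_init(void)
    {
        chunk_sizeclass = mmap(NULL, NCHUNKS * sizeof(uint32_t),
                               PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                               -1, 0);
        return chunk_sizeclass == MAP_FAILED ? -1 : 0;
    }

    /* O(1) pointer-to-metadata translation: index the flat array. */
    static uint32_t sizeclass_of(void *p)
    {
        return chunk_sizeclass[CHUNK_OF(p)];
    }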
@InProceedings{ISMM15p41,
author = {Bradley C. Kuszmaul},
title = {SuperMalloc: A Super Fast Multithreaded Malloc for 64-bit Machines},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {41--55},
doi = {},
year = {2015},
}
Concurrent Compaction using a Field Pinning Protocol
Erik Österlund and Welf Löwe
(Linnaeus University, Sweden)
Compaction of memory in long-running systems has always been important, and the latency of compaction increases in today's systems with high memory demands and large heaps. To deal with this problem, we present a lock-free protocol that allows copying to proceed concurrently with the running application, which radically reduces compaction latencies. It provides theoretical progress guarantees for both copying and application threads while remaining practical, with a performance overhead of 15% on average. The algorithm paves the way for a future lock-free garbage collector.
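The abstract leaves the protocol itself to the paper; the toy C sketch below, loosely in the style of hazard pointers, conveys only the flavor of field pinning: a mutator announces the field it is about to access so a concurrent copier will skip it, and the copier announces its copy so late pins back off and retry. The single pin slot, the one-mutator/one-copier simplification, and all names are assumptions, not the paper's protocol.

    #include <stdatomic.h>
    #include <stdbool.h>

    static _Atomic(void *) pin_slot;          /* field the mutator has pinned */
    static _Atomic bool    copying;           /* copier is relocating a field */

    /* Mutator: announce the pin, then validate that no copy is in flight.
     * With seq_cst atomics, if the mutator sees copying == false, the
     * copier is guaranteed to see the pin and skip this field. */
    static void *mutator_read_field(_Atomic(void *) *field)
    {
        void *v;
        for (;;) {
            atomic_store(&pin_slot, field);           /* announce */
            if (!atomic_load(&copying)) {             /* validate */
                v = atomic_load(field);
                break;
            }
            atomic_store(&pin_slot, NULL);            /* back off, retry */
        }
        atomic_store(&pin_slot, NULL);                /* unpin */
        return v;
    }

    /* Copier: declare the copy, then skip any field that is pinned.
     * Returning false means "retry this field later" -- someone always
     * makes progress, which is the lock-freedom the abstract claims. */
    static bool copier_copy_field(_Atomic(void *) *from, _Atomic(void *) *to)
    {
        atomic_store(&copying, true);
        bool ok = (atomic_load(&pin_slot) != from);
        if (ok)
            atomic_store(to, atomic_load(from));
        atomic_store(&copying, false);
        return ok;
    }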
@InProceedings{ISMM15p56,
author = {Erik Österlund and Welf Löwe},
title = {Concurrent Compaction using a Field Pinning Protocol},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {56--69},
doi = {},
year = {2015},
}
Managed Languages
Stop and Go: Understanding Yieldpoint Behavior
Yi Lin, Kunshan Wang, Stephen M. Blackburn, Antony L. Hosking, and Michael Norrish
(Australian National University, Australia; Purdue University, USA; NICTA, Australia)
Yieldpoints are critical to the implementation of high performance garbage collected languages, yet the design space is not well understood. Yieldpoints allow a running program to be interrupted at well-defined points in its execution, facilitating exact garbage collection, biased locking, on-stack replacement, profiling, and other important virtual machine behaviors. In this paper we identify and evaluate yieldpoint design choices, including previously undocumented designs and optimizations. One of the designs we identify opens new opportunities for very low overhead profiling. We measure the frequency with which yieldpoints are executed and establish a methodology for evaluating the common case execution time overhead. We also measure the median and worst case time-to-yield. We find that Java benchmarks execute about 100M yieldpoints per second, of which about 1/20000 are taken. The average execution time overhead for untaken yieldpoints on the VM we use ranges from 2.5% to close to zero on modern hardware, depending on the design, and we find that the designs trade off total overhead with worst case time-to-yield. This analysis gives new insight into a critical but overlooked aspect of garbage collector implementation, and identifies a new optimization and new opportunities for very low overhead profiling.
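Two of the classic designs in this space can be sketched in a few lines of C: a conditional yieldpoint that polls a flag, and an unconditional one that loads from a page the VM revokes to trap all threads at once. Function names and the 4 KiB page size are assumptions; a real compiler inlines these checks at method entries and loop back-edges.

    #include <stdatomic.h>
    #include <sys/mman.h>

    /* Placeholder for the runtime's safepoint action (GC handshake, OSR, ...). */
    static void vm_yield(void) { /* ... */ }

    /* Design 1: conditional ("polling flag") yieldpoint. Untaken cost is
     * one relaxed load plus an untaken branch. */
    static _Atomic int yield_requested;

    static inline void yieldpoint_cond(void)
    {
        if (atomic_load_explicit(&yield_requested, memory_order_relaxed))
            vm_yield();
    }

    /* Design 2: unconditional ("page-protection") yieldpoint: a plain load
     * from a dedicated poll page, no branch at all. To stop the world, the
     * VM mprotect()s the page PROT_NONE; each thread then faults at its next
     * yieldpoint and the SIGSEGV handler calls vm_yield(). */
    static volatile char *poll_page;

    static void poll_page_init(void)
    {
        poll_page = mmap(NULL, 4096, PROT_READ,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }

    static inline void yieldpoint_trap(void)
    {
        (void)*poll_page;                        /* one load, zero branches */
    }

    static void request_yield_trap(void)
    {
        mprotect((void *)poll_page, 4096, PROT_NONE);
    }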
@InProceedings{ISMM15p70,
author = {Yi Lin and Kunshan Wang and Stephen M. Blackburn and Antony L. Hosking and Michael Norrish},
title = {Stop and Go: Understanding Yieldpoint Behavior},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {70--80},
doi = {},
year = {2015},
}
Safe and Efficient Hybrid Memory Management for Java
Codruţ Stancu, Christian Wimmer, Stefan Brunthaler, Per Larsen, and Michael Franz
(University of California at Irvine, USA; Oracle Labs, USA)
Java uses automatic memory management, usually implemented as a garbage-collected heap. That lifts the burden of manually allocating and deallocating memory, but it can incur significant runtime overhead and increase the memory footprint of applications. We propose a hybrid memory management scheme that utilizes region-based memory management to deallocate objects automatically on region exits. Static program analysis detects allocation sites that are safe for region allocation, i.e., the static analysis proves that the objects allocated at such a site are not reachable after the region exit. A regular garbage-collected heap is used for objects that are not region allocatable.
The region allocation exploits the temporal locality of object allocation. Our analysis uses coarse-grained source code annotations to disambiguate objects with non-overlapping lifetimes, and maps them to different memory scopes. Region-allocated memory does not require garbage collection, as the regions are simply deallocated when they go out of scope. The region allocation technique is backed by a garbage collector that manages memory that is not region allocated.
We give a detailed description of the analysis, present experimental results showing that as much as 78% of the memory is region allocatable, and discuss how our hybrid memory management system can be implemented efficiently with respect to both space and time.
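A minimal C sketch of the region half of the scheme: allocation sites the analysis proves safe bump-allocate into a scoped region that is freed wholesale on exit, while everything else (including region overflow, in this toy) falls back to the collected heap. Sizes and names are illustrative assumptions.

    #include <stdlib.h>
    #include <stddef.h>

    struct region {
        char *base, *top, *end;
    };

    static void *gc_alloc(size_t n) { return malloc(n); }   /* stand-in GC heap */

    static struct region region_enter(size_t bytes)
    {
        struct region r;
        r.base = r.top = malloc(bytes);
        r.end  = r.base + bytes;
        return r;
    }

    /* Used only for allocation sites the static analysis proved do not
     * escape the region. */
    static void *region_alloc(struct region *r, size_t n)
    {
        n = (n + 7) & ~(size_t)7;                 /* 8-byte alignment */
        if (r->top + n > r->end)
            return gc_alloc(n);                   /* overflow: fall back to GC */
        void *p = r->top;
        r->top += n;
        return p;
    }

    /* Region exit deallocates every region-allocated object at once:
     * no tracing, no per-object work. */
    static void region_exit(struct region *r)
    {
        free(r->base);
        r->base = r->top = r->end = NULL;
    }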
@InProceedings{ISMM15p81,
author = {Codruţ Stancu and Christian Wimmer and Stefan Brunthaler and Per Larsen and Michael Franz},
title = {Safe and Efficient Hybrid Memory Management for Java},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {81--92},
doi = {},
year = {2015},
}
A Partial Read Barrier for Efficient Support of Live Object-Oriented Programming
Eliot Miranda and Clément Béra
(Cadence Design Systems, USA; INRIA, France)
Live programming, originally introduced by Smalltalk and Lisp and now gaining popularity in contemporary systems such as Swift, requires on-the-fly support for object schema migration, such that the layout of objects may be changed while the program is being run and developed at the same time. In Smalltalk, schema migration is supported by two primitives: one that answers a collection of all instances of a class, and one, called the become primitive, that exchanges the identities of pairs of objects. Existing instances are collected, copies using the new schema are created, state is copied from old to new, and the two are exchanged with become, effecting the schema migration. Historically, the implementation of become has either required an extra level of indirection between an object's address and its body, slowing down slot access, or a sweep of all objects, a very slow operation on large heaps. Spur, a new object representation and memory manager for Smalltalk-like languages, has neither of these deficiencies. It uses direct pointers but still provides a fast become operation in large heaps, thanks to forwarding objects, which when read conceptually answer another object, and a partial read barrier that avoids the cost of explicitly checking for forwarding objects on the vast majority of object accesses.
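A hedged C sketch of the forwarding idea, not Spur's actual object format: become turns an object into a forwarder, and only the few operations that must see the real object (such as fetching the class word for a message send) pay for following the chain, which is what makes the read barrier "partial".

    #include <stdbool.h>

    /* Toy object header; the forwarded bit and layout are illustrative. */
    typedef struct obj {
        unsigned    forwarded : 1;     /* set when this body is a forwarder */
        struct obj *target;            /* where the object now lives */
        /* ... class pointer, slots ... */
    } obj_t;

    /* become (one direction shown): turn `old` into a forwarder so every
     * reference to it conceptually answers `new`, without sweeping the heap
     * or adding an indirection to every object. */
    static void become_forward(obj_t *old, obj_t *new)
    {
        old->forwarded = true;
        old->target    = new;
    }

    /* The partial read barrier: ordinary slot reads skip this entirely; a
     * forwarder instead fails a cheap check the VM already performs (e.g.
     * the class test before a send) and reaches this slow path lazily. */
    static obj_t *follow_forwarding(obj_t *o)
    {
        while (o->forwarded)
            o = o->target;
        return o;
    }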
@InProceedings{ISMM15p93,
author = {Eliot Miranda and Clément Béra},
title = {A Partial Read Barrier for Efficient Support of Live Object-Oriented Programming},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {93--104},
doi = {},
year = {2015},
}
Memento Mori: Dynamic Allocation-Site-Based Optimizations
Daniel Clifford, Hannes Payer, Michael Stanton, and Ben L. Titzer
(Google, Germany)
Languages that lack static typing are ubiquitous in the world of mobile and web applications. The rapid rise of larger applications like interactive web GUIs, games, and cryptography presents a new range of implementation challenges for modern virtual machines to close the performance gap between typed and untyped languages. While all languages can benefit from efficient automatic memory management, languages like JavaScript present extra challenges with innocent-looking but difficult features like dynamically-sized arrays, deletable properties, and prototypes. Optimizing such languages requires complex dynamic techniques, including more radical object layout strategies such as dynamically evolving representations for arrays. This paper presents a general approach for gathering temporal allocation-site feedback that tackles both the general problem of object lifetime estimation and the optimization of these problematic language features. We introduce a new implementation technique in which allocation mementos, processed by the garbage collector and runtime system, efficiently tie objects back to allocation sites in the program and dynamically estimate object lifetime, representation, and size to inform three optimizations: pretenuring, pretransitioning, and presizing. Unlike previous work on pretenuring, our system utilizes allocation mementos to achieve fully dynamic allocation-site-based pretenuring in a production system. We implement all of our techniques in V8, a high-performance virtual machine for JavaScript, and demonstrate solid performance improvements across a range of benchmarks.
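A minimal C sketch of the memento mechanism as the abstract describes it: each young-space allocation is followed by a small memento pointing at its allocation site, and the collector uses per-site survivor counts to flip a site to pretenured allocation. The field names, nursery layout, and 85% cutoff are assumptions, not V8's actual values.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdlib.h>

    /* Per-site feedback record. */
    struct allocation_site {
        size_t allocated;         /* objects created at this site */
        size_t survived;          /* observed alive by the scavenger */
        bool   pretenure;         /* decision: allocate in old space directly */
    };

    /* A memento trails the object in the nursery, tying it to its site. */
    struct memento {
        struct allocation_site *site;
    };

    static char  nursery[1 << 20];                  /* toy young generation */
    static char *nursery_top = nursery;

    static void *allocate_with_memento(size_t size, struct allocation_site *s)
    {
        if (s->pretenure)
            return malloc(size);                    /* stand-in for old space */
        void *obj = nursery_top;                    /* overflow check omitted */
        nursery_top += size;
        struct memento *m = (struct memento *)nursery_top;  /* memento follows */
        m->site = s;
        nursery_top += sizeof *m;
        s->allocated++;
        return obj;
    }

    /* Scavenger: credit each survivor's site; switch the site to pretenuring
     * once most of its objects outlive the nursery. */
    static void note_survivor(struct memento *m)
    {
        struct allocation_site *s = m->site;
        if (++s->survived * 100 >= s->allocated * 85)   /* assumed cutoff */
            s->pretenure = true;
    }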
@InProceedings{ISMM15p105,
author = {Daniel Clifford and Hannes Payer and Michael Stanton and Ben L. Titzer},
title = {Memento Mori: Dynamic Allocation-Site-Based Optimizations},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {105--117},
doi = {},
year = {2015},
}
Optimizing Garbage Collection
Recycling Trash in Cache
Jonathan Shidal, Ari J. Spilo, Paul T. Scheid, Ron K. Cytron, and Krishna M. Kavi
(Washington University in St. Louis, USA; University of North Texas, USA)
The disparity between processing and storage speeds can be bridged in part by reducing the traffic into and out of the slower memory components. Some recent studies reduce such traffic by determining dead data in cache, showing that a significant fraction of writes can be squashed before they make the trip toward slower memory. In this paper, we examine a technique for eliminating traffic in the other direction, specifically the traffic induced by dynamic storage allocation. We consider recycling dead storage in cache to satisfy a program's storage-allocation requests. We first evaluate the potential for recycling under favorable circumstances, where the associated logic can run at full speed with no impact on the cache's normal behavior. We then consider a more practical implementation, in which the associated logic executes independently from the cache's critical path. Here, the cache's performance is unfettered by recycling, but the operations necessary to determine dead storage and recycle such storage execute as time is available. Finally, we present the design and analysis of a hardware implementation that scales well with cache size without sacrificing too much performance.
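The paper's recycling logic lives in the cache hardware, so no faithful software rendition is possible; the C sketch below is only an analogue of the intent: satisfy allocation requests from just-freed blocks, which are the blocks most likely to still be cache-resident, sparing a round trip to slower memory. The block size and pool depth are arbitrary assumptions.

    #include <stddef.h>
    #include <stdlib.h>

    #define BLOCK_SIZE   64          /* one cache line (assumption) */
    #define TRASH_DEPTH  32          /* small pool of "dead" blocks (assumption) */

    static void *trash[TRASH_DEPTH]; /* LIFO: hottest block comes back first */
    static int   trash_top;

    static void *recycle_alloc(void)
    {
        if (trash_top > 0)
            return trash[--trash_top];                 /* reuse dead storage */
        return aligned_alloc(BLOCK_SIZE, BLOCK_SIZE);  /* cold path: new line */
    }

    static void recycle_free(void *p)
    {
        if (trash_top < TRASH_DEPTH)
            trash[trash_top++] = p;                    /* keep it for reuse */
        else
            free(p);                                   /* overflow: really free */
    }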
@InProceedings{ISMM15p118,
author = {Jonathan Shidal and Ari J. Spilo and Paul T. Scheid and Ron K. Cytron and Krishna M. Kavi},
title = {Recycling Trash in Cache},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {118--130},
doi = {},
year = {2015},
}
Reducing Pause Times with Clustered Collection
Cody Cutler and Robert Morris
(Massachusetts Institute of Technology, USA)
Each full garbage collection in a program with millions of objects can pause the program for multiple seconds. Much of this work is typically repeated, as the collector re-traces parts of the object graph that have not changed since the last collection. Clustered Collection reduces full collection pause times by eliminating much of this repeated work. Clustered Collection identifies clusters: regions of the object graph that are reachable from a single "head" object, so that reachability of the head implies reachability of the whole cluster. As long as it is not written, a cluster need not be re-traced by successive full collections. The main design challenge is coping with program writes to clusters while ensuring safe, complete, and fast collections. In some cases program writes require clusters to be dissolved, but in most cases Clustered Collection can handle writes without having to re-trace the affected cluster. Clustered Collection chooses clusters likely to suffer few writes and to yield high savings from re-trace avoidance. Clustered Collection is implemented as modifications to the Racket collector. Measurements of the code and data from the Hacker News web site (which suffers from significant garbage collection pauses) and a Twitter-like application show that Clustered Collection decreases full collection pause times by factors of three and six, respectively. This improvement is possible because both applications have gigabytes of live data, modify only a small fraction of it, and usually write in ways that do not result in cluster dissolution. Identifying clusters takes more time than a full collection, but happens much less frequently than full collection.
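A toy C sketch of the bookkeeping the abstract implies: each cluster records whether it has been written since the last full collection, the write barrier sets that bit, and a clean cluster contributes only its remembered outgoing references instead of being re-traced. Field names and the dissolution path are assumptions, not the Racket implementation.

    #include <stdbool.h>
    #include <stddef.h>

    struct cluster {
        void  *head;              /* reachable head implies whole cluster live */
        bool   dirty;             /* written since last full collection */
        void **out_edges;         /* remembered references leaving the cluster */
        size_t n_out;
    };

    /* Map an object to its cluster, or NULL if unclustered (stubbed here). */
    static struct cluster *cluster_of(void *obj) { (void)obj; return NULL; }

    /* Write barrier: a store into a clustered object marks the cluster so
     * the next full collection re-traces it (or, in the hard cases the
     * abstract mentions, dissolves it). */
    static void write_barrier(void *obj, void **slot, void *value)
    {
        struct cluster *c = cluster_of(obj);
        if (c)
            c->dirty = true;
        *slot = value;
    }

    /* Full collection: a clean cluster whose head is marked need not be
     * re-traced; its remembered out_edges stand in for its interior. */
    static void trace_cluster(struct cluster *c, void (*mark)(void *))
    {
        if (!c->dirty) {
            for (size_t i = 0; i < c->n_out; i++)
                mark(c->out_edges[i]);
            return;                               /* interior skipped */
        }
        /* dirty: fall back to tracing every object in the cluster ... */
    }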
@InProceedings{ISMM15p131,
author = {Cody Cutler and Robert Morris},
title = {Reducing Pause Times with Clustered Collection},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {131--142},
doi = {},
year = {2015},
}
The Judgment of Forseti: Economic Utility for Dynamic Heap Sizing of Multiple Runtimes
Callum Cameron, Jeremy Singer, and David Vengerov
(University of Glasgow, UK; Oracle, USA)
We introduce the FORSETI system, a principled approach to holistic memory management. It permits a sysadmin to specify the total physical memory resource that may be shared between all concurrent virtual machines on a physical node. FORSETI models heap size versus application throughput for each virtual machine and seeks to maximize the combined throughput of the set of VMs based on concepts from economic utility theory. We evaluate the FORSETI system using a standard Java managed runtime, i.e., OpenJDK. Our results demonstrate that FORSETI enables dramatic reductions (up to 5x) in heap footprint without compromising application execution times.
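One plausible formulation of the optimization described here, written in LaTeX; the notation is assumed rather than taken from the paper: u_i(h_i) is VM i's modeled throughput at heap size h_i, M is the configured total physical memory budget, and h_i^min is the smallest heap on which VM i can run.

    \begin{aligned}
      \max_{h_1,\dots,h_n} \quad & \sum_{i=1}^{n} u_i(h_i) \\
      \text{subject to} \quad    & \sum_{i=1}^{n} h_i \le M,
                                   \qquad h_i \ge h_i^{\min} \;\; \forall i
    \end{aligned}

At such an optimum the marginal utilities u_i'(h_i) equalize across VMs, so memory flows to whichever runtime gains the most throughput per additional byte, which matches the economic-utility framing of the abstract.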
@InProceedings{ISMM15p143,
author = {Callum Cameron and Jeremy Singer and David Vengerov},
title = {The Judgment of Forseti: Economic Utility for Dynamic Heap Sizing of Multiple Runtimes},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {143--156},
doi = {},
year = {2015},
}