2018 ACM SIGPLAN International Symposium on Memory Management (ISMM 2018),
June 18, 2018,
Philadelphia, PA, USA
Frontmatter
Message from the Chairs
It is our great pleasure to welcome you to the ACM SIGPLAN International Symposium on Memory Management (ISMM) 2018. ISMM will be held in Philadelphia, Pennsylvania, U.S.A., on June 18th, 2018, and is co-located with the Programming Language Design and Implementation (PLDI) conference. ISMM 2018 consists of one keynote speech by Richard L. Hudson, entitled “Getting to Go”, and nine research presentations divided into three sessions.
Committees
This document details the people who were responsible for ISMM 2018's organization. We list the Program and General Chairs, and members of the Steering Committee, the Program Committee, the External Review Committee, and the external reviewers.
Reference Counting and Techniques for C-Family Languages
Detailed Heap Profiling
Stuart Byma and James R. Larus
(EPFL, Switzerland)
Modern software systems heavily use the memory heap. As systems grow more complex and compute with increasing amounts of data, it can be difficult for developers to understand how their programs actually use the bytes that they allocate on the heap and whether improvements are possible. To answer this question of heap usage efficiency, we have built a new, detailed heap profiler called Memoro. Memoro uses a combination of static instrumentation, subroutine interception, and runtime data collection to build a clear picture of exactly when and where a program performs heap allocation, and crucially how it actually uses that memory. Memoro also introduces a new visualization application that can distill collected data into scores and visual cues that allow developers to quickly pinpoint and eliminate inefficient heap usage in their software. Our evaluation and experience with several applications demonstrate that Memoro can reduce heap usage and produce runtime improvements of 10%.
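To make the interception idea concrete, here is a minimal, hypothetical sketch (not Memoro's actual implementation, which builds on compiler instrumentation): the global operator new is overridden to record per-call-site allocation counts and bytes, using the caller's return address as a crude allocation-site key. The __builtin_return_address intrinsic assumes GCC or Clang.

#include <atomic>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <new>

namespace {
struct SiteStats {
  std::atomic<void*> site{nullptr};        // caller address identifying the allocation site
  std::atomic<std::size_t> count{0};       // number of allocations from this site
  std::atomic<std::size_t> bytes{0};       // total bytes requested from this site
};

constexpr int kMaxSites = 64;
SiteStats g_sites[kMaxSites];              // fixed-size table: the hooks never allocate

void record(void* site, std::size_t size) {
  for (int i = 0; i < kMaxSites; ++i) {
    void* expected = nullptr;
    if (g_sites[i].site.load() == site ||
        g_sites[i].site.compare_exchange_strong(expected, site)) {
      g_sites[i].count.fetch_add(1);
      g_sites[i].bytes.fetch_add(size);
      return;
    }
  }
}
}  // namespace

// Interception point: every scalar 'new' in the program funnels through here.
void* operator new(std::size_t size) {
  void* p = std::malloc(size);
  if (!p) throw std::bad_alloc();
  record(__builtin_return_address(0), size);  // GCC/Clang intrinsic identifying the call site
  return p;
}

void operator delete(void* p) noexcept { std::free(p); }

int main() {
  for (int i = 0; i < 1000; ++i) delete new int(i);  // one hot allocation site
  for (int i = 0; i < kMaxSites; ++i)
    if (void* s = g_sites[i].site.load())
      std::printf("site %p: %zu allocations, %zu bytes\n", s,
                  g_sites[i].count.load(), g_sites[i].bytes.load());
}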
@InProceedings{ISMM18p1,
author = {Stuart Byma and James R. Larus},
title = {Detailed Heap Profiling},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {1--13},
doi = {10.1145/3210563.3210564},
year = {2018},
}
FRC: A High-Performance Concurrent Parallel Deferred Reference Counter for C++
Charles Tripp, David Hyde, and Benjamin Grossman-Ponemon
(Terrain Data, USA; Stanford University, USA)
We present FRC, a high-performance concurrent parallel reference counter for unmanaged languages. It is well known that high-performance garbage collectors help developers write memory-safe, highly concurrent systems and data structures. While C++, C, and other unmanaged languages are used in high-performance applications, adding concurrent memory management to these languages has proven to be difficult. Unmanaged languages like C++ use pointers instead of references, and have uncooperative mutators which do not pause easily at a safe point. Thus, scanning mutator stack root references is challenging.
FRC only defers decrements and does not require mutator threads to pause during collection. By deferring only decrements, FRC avoids much of the synchronization overhead of a fully-deferred implementation. Root references are scanned without interrupting the mutator by publishing these references to a thread-local array. FRC's performance can exceed that of the C++ standard library's shared pointer by orders of magnitude. FRC's thread-safety guarantees and low synchronization overhead enable significant throughput gains for concurrently-readable shared data structures.
We describe the components of FRC, including our static tree router data structure: a novel barrier which improves the scalability of parallel collection workers. FRC's performance is evaluated on several concurrent data structures. We release FRC and our tests as open-source code and expect FRC will be useful for many concurrent C++ software systems.
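As a rough illustration of the decrement-deferral idea (a hypothetical sketch, not the FRC library itself): increments are applied eagerly, while decrements are pushed into a thread-local buffer and only applied, with possible reclamation, at a later collection step.

#include <atomic>
#include <cstdio>
#include <vector>

struct Counted {
  std::atomic<int> refcount{1};   // starts owned by its creator
  int value = 0;
};

// Per-thread buffer of decrements that have not been applied yet.
thread_local std::vector<Counted*> deferred_decrements;

void acquire(Counted* obj) {
  obj->refcount.fetch_add(1, std::memory_order_relaxed);  // increments are applied eagerly
}

void release(Counted* obj) {
  deferred_decrements.push_back(obj);                      // decrements are only recorded
}

// At a later collection step, apply the buffered decrements and reclaim any
// object whose count drops to zero.
void process_deferred() {
  for (Counted* obj : deferred_decrements) {
    if (obj->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      std::printf("reclaiming object with value %d\n", obj->value);
      delete obj;
    }
  }
  deferred_decrements.clear();
}

int main() {
  Counted* obj = new Counted();   // refcount = 1
  obj->value = 42;
  acquire(obj);                   // refcount = 2
  release(obj);                   // deferred
  release(obj);                   // deferred
  process_deferred();             // both decrements applied; object reclaimed
}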
@InProceedings{ISMM18p14,
author = {Charles Tripp and David Hyde and Benjamin Grossman-Ponemon},
title = {FRC: A High-Performance Concurrent Parallel Deferred Reference Counter for C++},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {14--28},
doi = {10.1145/3210563.3210569},
year = {2018},
}
Distributed Garbage Collection for General Graphs
Steven R. Brandt, Hari Krishnan, Costas Busch, and Gokarna Sharma
(Louisiana State University, USA; Kent State University, USA)
We propose a scalable, cycle-collecting, decentralized, reference counting garbage collector with partial tracing. The algorithm is based on the Brownbridge system but uses four different types of references to label edges. Memory usage is O(log n) bits per node, where n is the number of nodes in the graph. The algorithm assumes an asynchronous network model with a reliable reordering channel. It collects garbage in O(E_a) time, where E_a is the number of edges in the induced subgraph. The algorithm uses termination detection to manage the distributed computation, a unique identifier to break the symmetry among multiple collectors, and a transaction-based approach when multiple collectors conflict. Unlike existing algorithms, ours is not centralized, does not require barriers, does not require migration of nodes, does not require back-pointers on every edge, and is stable against concurrent mutation.
@InProceedings{ISMM18p29,
author = {Steven R. Brandt and Hari Krishnan and Costas Busch and Gokarna Sharma},
title = {Distributed Garbage Collection for General Graphs},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {29--44},
doi = {10.1145/3210563.3210572},
year = {2018},
}
Optimizing for the Web and the Cloud
Hardware-Software Co-optimization of Memory Management in Dynamic Languages
Mohamed Ismail and G. Edward Suh
(Cornell University, USA)
Dynamic programming languages are becoming increasingly popular, yet often show a significant performance slowdown compared to static languages. In this paper, we study the performance overhead of automatic memory management in dynamic languages. We propose to improve the performance and memory bandwidth usage of dynamic languages by co-optimizing garbage collection overhead and cache performance for newly-initialized and dead objects. Our study shows that less frequent garbage collection results in a large number of cache misses for initial stores to new objects. We solve this problem by directly placing uninitialized objects into on-chip caches without off-chip memory accesses. We further optimize the garbage collection by reducing unnecessary cache pollution and write-backs through partial tracing that invalidates dead objects between full garbage collections. Experimental results on PyPy and V8 show that less frequent garbage collection along with our optimizations can significantly improve the performance of dynamic languages.
@InProceedings{ISMM18p45,
author = {Mohamed Ismail and G. Edward Suh},
title = {Hardware-Software Co-optimization of Memory Management in Dynamic Languages},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {45--58},
doi = {10.1145/3210563.3210566},
year = {2018},
}
Dynamic Vertical Memory Scalability for OpenJDK Cloud Applications
Rodrigo Bruno, Paulo Ferreira, Ruslan Synytsky, Tetiana Fydorenchyk, Jia Rao, Hang Huang, and Song Wu
(INESC-ID, Portugal; University of Lisbon, Portugal; Jelastic, USA; University of Texas at Arlington, USA; Huazhong University of Science and Technology, China)
The cloud is an increasingly popular platform to deploy applications, as it lets cloud users provision resources for their applications as needed. Furthermore, cloud providers are now starting to offer a "pay-as-you-use" model in which users are only charged for the resources that are actually used, instead of paying for a statically sized instance. This new model allows cloud users to save money and cloud providers to better utilize their hardware.
However, applications running on top of runtime environments such as the Java Virtual Machine (JVM) cannot benefit from this new model because they cannot dynamically adapt the amount of resources they use at runtime. In particular, if an application needs more memory than what was initially predicted at launch time, the JVM will not allow the application to grow its memory beyond the maximum value defined at launch time. In addition, the JVM will hold on to memory that is no longer being used by the application. This lack of dynamic vertical scalability negates the benefits of the "pay-as-you-use" model and forces users to over-provision resources and lose money on unused ones.
We propose a new JVM heap sizing strategy that allows the JVM to dynamically scale its memory utilization according to the application's needs. First, we provide a configurable limit on how much the application can grow its memory. This limit is dynamic and can be changed at runtime, as opposed to the current static limit that can only be set at launch time. Second, we adapt current garbage collection policies that control how much the heap can grow and shrink to better fit what is currently being used by the application.
The proposed solution is implemented in the OpenJDK 9 HotSpot JVM, the latest release of OpenJDK. Changes were also introduced inside the Parallel Scavenge collector and the Garbage First collector (the new default collector in HotSpot). Evaluation experiments using real workloads and data show that dynamic vertical memory scalability can be achieved with negligible throughput and memory overhead. This allows users to save significant amounts of money by not paying for unused resources, and cloud providers to better utilize their physical machines.
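The heap sizing idea can be sketched generically (an illustrative model, not the authors' HotSpot changes; the field names and ratios are made up): after each collection the committed heap size is moved toward the live-set size plus some headroom, and it is capped by a limit that can be raised or lowered at runtime rather than fixed at launch.

#include <algorithm>
#include <cstddef>
#include <cstdio>

struct HeapSizingPolicy {
  std::size_t current_max;       // runtime-adjustable limit (bytes), unlike a fixed -Xmx
  double min_free_ratio = 0.20;  // keep at least this much headroom above live data
  double max_free_ratio = 0.50;  // avoid committing more free space than this

  // Called after a collection with the amount of live data; returns the new
  // committed heap size, never exceeding the current (dynamic) limit.
  std::size_t resize(std::size_t live_bytes) const {
    std::size_t lo = static_cast<std::size_t>(live_bytes * (1.0 + min_free_ratio));
    std::size_t hi = static_cast<std::size_t>(live_bytes * (1.0 + max_free_ratio));
    return std::min((lo + hi) / 2, current_max);
  }
};

int main() {
  HeapSizingPolicy policy{1UL << 30};                            // 1 GiB limit at launch
  std::printf("committed: %zu MiB\n", policy.resize(400UL << 20) >> 20);
  policy.current_max = 2UL << 30;                                // limit raised at runtime
  std::printf("committed: %zu MiB\n", policy.resize(1200UL << 20) >> 20);
}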
@InProceedings{ISMM18p59,
author = {Rodrigo Bruno and Paulo Ferreira and Ruslan Synytsky and Tetiana Fydorenchyk and Jia Rao and Hang Huang and Song Wu},
title = {Dynamic Vertical Memory Scalability for OpenJDK Cloud Applications},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {59--70},
doi = {10.1145/3210563.3210567},
year = {2018},
}
OMR: Out-of-Core MapReduce for Large Data Sets
Gurneet Kaur, Keval Vora, Sai Charan Koduru, and Rajiv Gupta
(University of California at Riverside, USA; Simon Fraser University, Canada)
While single-machine MapReduce systems can squeeze out maximum performance from the available multi-cores, they are often limited by the size of main memory and can thus only process small datasets. Our experience shows that the state-of-the-art single-machine in-memory MapReduce system Metis frequently experiences out-of-memory crashes. Even though today's computers are equipped with efficient secondary storage devices, these frameworks do not utilize them, mainly because disk access latencies are much higher than those for main memory. Consequently, the single-machine setup of the Hadoop system performs much more slowly when presented with datasets larger than main memory. Moreover, such frameworks also require tuning many parameters, which puts an added burden on the programmer. In this paper we present OMR, an Out-of-core MapReduce system that not only handles datasets far larger than the size of main memory but also guarantees linear scaling with growing data sizes. OMR actively minimizes the amount of data read from or written to disk via on-the-fly aggregation, and it uses block-sequential disk read/write operations whenever disk accesses become necessary to avoid running out of memory. We theoretically prove OMR's linear scalability and empirically demonstrate it by processing datasets that are up to 5x larger than main memory. Our experiments show that, in comparison to the standalone single-machine setup of the Hadoop system, OMR delivers far higher performance. In contrast to Metis, OMR avoids out-of-memory crashes for large datasets and also delivers higher performance when datasets are small enough to fit in main memory.
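For illustration, a tiny sketch of the two mechanisms the abstract highlights (hypothetical code, not OMR's implementation): map output is aggregated on the fly in a bounded in-memory table, and the table is spilled to disk as one sequential block when it reaches its memory budget.

#include <cstddef>
#include <fstream>
#include <string>
#include <unordered_map>

class OutOfCoreCombiner {
  std::unordered_map<std::string, long> table_;  // partial aggregates (e.g., word counts)
  std::size_t budget_;                           // maximum number of in-memory entries
  int spill_id_ = 0;

 public:
  explicit OutOfCoreCombiner(std::size_t budget) : budget_(budget) {}

  // On-the-fly aggregation: combine each pair immediately instead of buffering raw pairs.
  void emit(const std::string& key, long value) {
    table_[key] += value;
    if (table_.size() >= budget_) spill();
  }

  // Spill the whole table with one sequential write, then reuse the memory.
  void spill() {
    std::ofstream out("spill_" + std::to_string(spill_id_++) + ".txt");
    for (const auto& kv : table_) out << kv.first << ' ' << kv.second << '\n';
    table_.clear();
  }
};

int main() {
  OutOfCoreCombiner combiner(/*budget=*/1000);
  const char* words[] = {"out", "of", "core", "map", "reduce", "out"};
  for (const char* w : words) combiner.emit(w, 1);
  combiner.spill();  // a full system would then merge the spill files in the reduce phase
}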
@InProceedings{ISMM18p71,
author = {Gurneet Kaur and Keval Vora and Sai Charan Koduru and Rajiv Gupta},
title = {OMR: Out-of-Core MapReduce for Large Data Sets},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {71--83},
doi = {10.1145/3210563.3210568},
year = {2018},
}
Analyzing the Cache and Scheduling
mPart: Miss-Ratio Curve Guided Partitioning in Key-Value Stores
Daniel Byrne, Nilufer Onder, and Zhenlin Wang
(Michigan Technological University, USA)
Web applications employ key-value stores to cache the data that is most commonly accessed. The cache improves a web application's performance by serving requests from memory and avoiding fetches from the backend database. Since memory space is limited, maximizing memory utilization is key to delivering the best possible performance. This has led to the use of multi-tenant systems that allow applications to share cache space. In addition, application data access patterns change over time, so the system should be adaptive in its memory allocation.
In this work, we address both multi-tenancy (where a single cache is used for multiple applications) and dynamic workloads (changing access patterns) using a model that relates the cache size to the application miss ratio, known as a miss ratio curve. Intuitively, the larger the cache, the less likely the system will need to fetch the data from the database. Our efficient, online construction of the miss ratio curve allows us to determine a near-optimal memory allocation given the available system memory, while adapting to changing data access patterns.
We show that our model outperforms an existing state-of-the-art sharing model, Memshare, in terms of overall cache hit ratio, and does so at a lower time cost. For a typical system, the overall hit ratio is consistently 1 percentage point greater, and 99.9th percentile latency is reduced by as much as 2.9% under standard web application workloads containing millions of requests.
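One hedged sketch of how miss ratio curves can guide partitioning (a generic greedy marginal-gain allocator under the assumption of monotone curves, not the mPart system itself): memory is handed out in fixed-size chunks, and each chunk goes to the tenant whose curve predicts the largest drop in miss ratio from one more chunk.

#include <cstddef>
#include <cstdio>
#include <vector>

// mrc[i] = predicted miss ratio when a tenant holds i chunks of cache memory.
using MissRatioCurve = std::vector<double>;

// Hand out memory one chunk at a time to the tenant whose curve predicts the
// largest drop in miss ratio for one more chunk.
std::vector<int> partition(const std::vector<MissRatioCurve>& mrcs, int total_chunks) {
  std::vector<int> alloc(mrcs.size(), 0);
  for (int c = 0; c < total_chunks; ++c) {
    int best = -1;
    double best_gain = 0.0;
    for (std::size_t t = 0; t < mrcs.size(); ++t) {
      if (alloc[t] + 1 >= static_cast<int>(mrcs[t].size())) continue;   // curve exhausted
      double gain = mrcs[t][alloc[t]] - mrcs[t][alloc[t] + 1];          // marginal benefit
      if (gain > best_gain) { best_gain = gain; best = static_cast<int>(t); }
    }
    if (best < 0) break;   // no tenant benefits from more memory
    ++alloc[best];
  }
  return alloc;
}

int main() {
  // Two hypothetical tenants: one saturates quickly, the other needs more memory.
  std::vector<MissRatioCurve> mrcs = {{1.0, 0.4, 0.3, 0.28, 0.27},
                                      {1.0, 0.9, 0.7, 0.4, 0.2}};
  std::vector<int> alloc = partition(mrcs, 6);
  for (std::size_t t = 0; t < alloc.size(); ++t)
    std::printf("tenant %zu: %d chunks\n", t, alloc[t]);
}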
@InProceedings{ISMM18p84,
author = {Daniel Byrne and Nilufer Onder and Zhenlin Wang},
title = {mPart: Miss-Ratio Curve Guided Partitioning in Key-Value Stores},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {84--95},
doi = {10.1145/3210563.3210571},
year = {2018},
}
Prediction and Bounds on Shared Cache Demand from Memory Access Interleaving
Jacob Brock, Chen Ding, Rahman Lavaee, Fangzhou Liu, and Liang Yuan
(University of Rochester, USA; Institute of Computing Technology at Chinese Academy of Sciences, China)
Cache in multicore machines is often shared, and the cache performance depends on how memory accesses belonging to different programs interleave with one another. The full range of performance possibilities includes all possible interleavings, which are too numerous to be studied by experiments for any mix of non-trivial programs.
This paper presents a theory to characterize the effect of memory access interleaving due to parallel execution of non-data-sharing programs. The theory uses an established metric called the footprint (which can be used to calculate miss ratios in fully-associative LRU caches) to measure cache demand, and considers the full range of interleaving possibilities. The paper proves a lower bound for footprints of interleaved traces, and then formulates an upper bound in terms of the footprints of the constituent traces. It also shows the correctness of footprint composition used in a number of existing techniques, and places precise bounds on its accuracy.
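For reference, the established footprint metric that the paper builds on can be stated as follows (this restates prior footprint theory, not the new interleaving bounds proved in this paper): for a trace a_1 ... a_n, fp(w) is the number of distinct data blocks seen in a window of w consecutive accesses, averaged over all n - w + 1 windows, and the miss ratio of a fully associative LRU cache of size c is read off from the slope of fp.

\[
  fp(w) \;=\; \frac{1}{n - w + 1} \sum_{i=1}^{n-w+1} \bigl|\{\, a_i, a_{i+1}, \ldots, a_{i+w-1} \,\}\bigr|,
  \qquad
  mr(c) \;=\; fp(w + 1) - fp(w) \quad \text{where } fp(w) = c .
\]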
@InProceedings{ISMM18p96,
author = {Jacob Brock and Chen Ding and Rahman Lavaee and Fangzhou Liu and Liang Yuan},
title = {Prediction and Bounds on Shared Cache Demand from Memory Access Interleaving},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {96--108},
doi = {10.1145/3210563.3210565},
year = {2018},
}
Balanced Double Queues for GC Work-Stealing on Weak Memory Models
Michihiro Horie, Hiroshi Horii, Kazunori Ogata, and Tamiya Onodera
(IBM Research, Japan)
Work-stealing is promising for scheduling and balancing parallel workloads. It is widely applicable to middleware, libraries, and the runtime systems of programming languages. OpenJDK uses work-stealing for copying garbage collection (GC) to balance copying tasks among GC threads. Each thread has its own queue to store tasks. When a thread has no task in its queue, it acts as a thief and attempts to steal a task from another thread's queue. However, this work-stealing algorithm requires expensive memory fences for pushing, popping, and stealing tasks, especially on weak memory models such as POWER and ARM. To address this problem, we propose a work-stealing algorithm that uses double queues. Each GC thread has a public queue that is accessible from other GC threads and a private queue that is only accessible by itself. Pushing and popping tasks in the private queue are free from expensive memory fences. The key point of our algorithm is a mechanism that maintains load balance on the basis of these double queues. We developed a prototype implementation for parallel GC in OpenJDK 8 for ppc64le. We evaluated our algorithm by using SPECjbb2015, SPECjvm2008, TPC-DS, and Apache DayTrader.
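As a rough sketch of the double-queue structure (simplified and hypothetical, not the OpenJDK patch; a real public queue would be a lock-free deque with carefully placed fences, whereas a mutex is used here for brevity): the owner pushes and pops on a purely thread-local private queue with no synchronization, and periodically publishes a few tasks to a shared public queue so that idle threads still find work to steal.

#include <deque>
#include <mutex>
#include <optional>
#include <vector>

struct Task { void* obj_to_copy; };   // a pending copying task for one object

class DoubleQueue {
  std::vector<Task> private_q_;   // owner-only: pushes and pops need no synchronization
  std::deque<Task> public_q_;     // shared with thieves; a mutex keeps this sketch simple
  std::mutex public_mu_;

 public:
  void push(Task t) {
    private_q_.push_back(t);                  // fence-free fast path
    if (private_q_.size() > 64) publish(8);   // keep some work visible to idle threads
  }

  std::optional<Task> pop() {                 // owner's LIFO pop, also fence-free
    if (private_q_.empty()) return steal();   // fall back to the shared queue
    Task t = private_q_.back();
    private_q_.pop_back();
    return t;
  }

  std::optional<Task> steal() {               // called by idle (thief) threads
    std::lock_guard<std::mutex> lock(public_mu_);
    if (public_q_.empty()) return std::nullopt;
    Task t = public_q_.front();
    public_q_.pop_front();
    return t;
  }

 private:
  void publish(int n) {                       // move n tasks from private to public
    std::lock_guard<std::mutex> lock(public_mu_);
    for (int i = 0; i < n && !private_q_.empty(); ++i) {
      public_q_.push_back(private_q_.back());
      private_q_.pop_back();
    }
  }
};

int main() {
  DoubleQueue q;
  for (int i = 0; i < 100; ++i) q.push(Task{nullptr});
  while (auto t = q.pop()) { /* copy the object referenced by t->obj_to_copy */ }
}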
@InProceedings{ISMM18p109,
author = {Michihiro Horie and Hiroshi Horii and Kazunori Ogata and Tamiya Onodera},
title = {Balanced Double Queues for GC Work-Stealing on Weak Memory Models},
booktitle = {Proc.\ ISMM},
publisher = {ACM},
pages = {109--119},
doi = {10.1145/3210563.3210570},
year = {2018},
}