Workshop ISMM 2026 – Author Index |
|
| Amit, Nadav |
Nadav Amit (Technion, Israel) While huge pages can dramatically reduce address translation overhead, their use for executable code remains limited by structural barriers in binary formats, page cache management, and loader mechanisms. Existing solutions copy code into anonymous memory, abandoning file-backed semantics and breaking cross-process sharing, memory reclamation, and debugging tools. Emerging kernel support for file-backed huge pages remains insufficient without relinking and modifying loaders, impractical for deployed and closed-source software. We present Hugifier, a userspace solution that enables huge page mappings for existing executables while preserving file-backed memory semantics--executable memory sharing, paging, and debugging all continue to work correctly. Hugifier combines a binary transformation that aligns code segments to huge page boundaries without disassembly or relinking, with a runtime component that ensures huge page mappings on current kernels. Evaluation shows over 90% iTLB miss reduction and up to 11.3% speedup--exceeding libhugetlbfs by up to 5.3% and Linux's current RO-THP by 7.3%; against a stronger "aligned" baseline modeling future kernel/loader support without segment separation, the binary transformation still adds 5.5%. |
|
| Arnold, Anthony |
Anthony Arnold and Mark Marron (University of Kentucky, USA) Garbage Collectors (GCs) are a critical component of a modern application stack. Long pauses, large memory consumption, and high CPU usage can unexpectedly occur with certain workloads or series of events. These behaviors can make systems unresponsive, make it impossible to run them in resource-constrained environments, and are often very difficult to debug and fix – as their appearance may be intermittent. In fact, recent theoretical work has shown that, for existing mainstream languages, these issues are unavoidable and, regardless of the GC design or implementation, there will always be workloads that cause them to occur! Intriguingly, the pathological scenarios that are needed to cause these GC behavioral issues are a result of a specific set of language features – mutability and cyclic data structures. In this paper we present a novel language and garbage-collector co-design. The collector is designed to leverage programming language features to construct a system that has provably bounded collection pauses, incurs a fixed-constant memory overhead, and ensures starvation freedom for the application. |
|
| Bagchi, Soham |
Soham Bagchi, Sanya Srivastava, Reese Levine, Tyler Sorensen, Ryan Stutsman, and Vijay Nagarajan (University of Utah, USA; Duke University, USA; University of California at Santa Cruz, USA) Modern heterogeneous processors like the NVIDIA Grace-Hopper Superchip tightly integrate CPU and GPU cores across a cache-coherent interconnect, with an implicit assumption that independently compiled CPU and GPU code can safely interact via shared memory. Yet the memory consistency and coherence of such systems remain empirically unvalidated. This paper presents the first systematic study of consistency and coherence on the Grace-Hopper. We empirically validate that the system enforces the Compound Memory Consistency Model (CMCM)---a theoretical prerequisite for correct independent compilation---using a novel heterogeneous litmus testing methodology spanning 1,960 test variants. We further introduce Value Propagation tests to reverse-engineer the underlying coherence mechanisms, revealing that internal GPU coherence relies on write-throughs and self-invalidations rather than classical writer-initiated invalidations, while global CPU-GPU coherence is maintained via directory-based invalidations consistent with an AMBA CHI-like protocol. These results establish the CMCM as a concrete architectural target for heterogeneous systems and provide the first empirical characterization of GPU and CPU-GPU coherence mechanisms in a commercial heterogeneous processor. |
|
| Berger, Emery D. |
Nicolas van Kempen and Emery D. Berger (University of Massachusetts at Amherst, USA; Amazon Web Services, USA) Programmers using native languages such as C, C++, or Rust can implement custom memory allocation strategies to improve execution time. In their paper titled "Reconsidering Custom Memory Allocation" almost 25 years ago, Berger et al. showed that while per-class allocators provide no significant speedups over a state-of-the-art general-purpose allocator, region-based allocators can improve execution time by allocating and freeing objects in bulk. This paper revisits that work on a modern hardware platform with modern general-purpose allocators to evaluate whether their conclusions still hold. It also augments the benchmark suite with two large real-world applications (Clang and Blender), and introduces a methodology to explore the effect of memory fragmentation on locality in general-purpose allocators. Our results support and extend the original conclusions, demonstrating the locality advantages of region-based custom memory allocators. |
|
| Blackburn, Stephen M. |
Hayley Patton and Stephen M. Blackburn (Australian National University, Australia; Google, Australia) Offset-Vector Compaction (OVC) algorithms, such as the Compressor, are among the most widely deployed garbage collectors today, yet they have remained largely unexplored by the literature since the earliest algorithms were published two decades ago. Although an implementation in the Android Runtime is publicly available and a number of other variations on the algorithm have been developed or proposed, these have not been analyzed or evaluated with recent benchmarks or compared with modern server collectors. This gap in the literature may curtail innovation and exploration of new ideas while reinforcing unchallenged conventional wisdom. This paper makes six major contributions. We: i) introduce a taxonomy of offset-vector compaction (OVC) algorithms; ii) review the history of OVC algorithms, including the Compressor, discovering designs that were previously unreported; iii) implement a family of offset-vector compactors in OpenJDK; iv) implement and evaluate for the first time the One Pass (OP) Compactor; v) introduce a branch-free algorithm for computing forwarding information; and vi) conduct the first evaluation of offset-vector compactors on OpenJDK with modern workloads. We find that some of the variations we explore offer improvements over the original Compressor algorithm. We also find that, interestingly, the optimization proposed by the one pass (OP) Compactor does not improve over the Compressor. Our implementations are open source, allowing others to build on them. We hope that our analysis and evaluation of this important family of garbage collectors will enable future innovation and improved designs. |
|
| D'Alessio, Edoardo |
Edoardo D'Alessio, Mohamed Husain Noor Mohamed, Xiaoguang Wang, and Binoy Ravindran (University of Illinois at Chicago, USA; Virginia Tech, USA) Recent Linux memory-management interfaces make it practical to revisit distributed shared memory (DSM) as a deployable runtime substrate for conventional multithreaded software. We present Stretch, a userspace fault-driven page-granularity DSM runtime that combines userfaultfd-based fault interception, centralized MSI-style coherence (i.e., Modified, Shared, or Invalid), and CRIU-based thread placement to extend a process across multiple machines. Missing-page and write-protection faults are translated into fetch, invalidation, and ownership-transfer operations, while distributed barriers and coarse-grained mutexes reuse the same mechanism. Stretch supports both automatic tracking of anonymous regions and an explicit tracked-region mode that focuses coherence on genuinely shared memory. We evaluate Stretch on coherence microbenchmarks and seven Phoenix workloads on four CloudLab servers. The results show that the computation-to-page-fault ratio is the dominant performance predictor. Compute-intensive workloads such as Matrix Multiply achieve up to 3.39× speedup, whereas fine-grained write sharing in KMeans triggers invalidation storms that defeat page-granularity DSM. RDMA reduces key coherence operations by up to 4× relative to TCP when the server has sufficient CPU provisioning, but its advantage largely disappears when the centralized server is CPU-bound. These results identify both the practical regime in which a Linux-based fault-driven DSM is effective and the limitations that remain fundamental at page granularity. |
|
| Feng, Kai |
Kai Feng, Huanting Wang, Jeremy Singer, and Zheng Wang (University of Glasgow, UK; University of Leeds, UK) Memory safety is a critical issue in embedded systems. Although high-level languages like MicroPython simplify IoT development, their C-based runtimes remain vulnerable to memory errors triggered by Python code or native extensions. The CHERI (Capability Hardware Enhanced RISC Instructions) architecture offers hardware-enforced memory safety, but its effectiveness for exposing latent bugs in real-world interpreters has not yet been fully explored. We present diffCHERI:FruitFly, a novel differential testing framework for systematically uncovering memory defects in MicroPython across conventional (x86/ARM) and CHERI-enabled (Arm Morello) platforms. We mine historic vulnerabilities from diverse Python runtimes to extract recurring stress patterns, then use a large language model to generate new test programs, and apply Concrete Syntax Tree (CST) mutation to diversify inputs. During 24 hours of automated testing, our framework executed 8,189 generated programs on MicroPython v1.20 and the development branch, identifying 40 distinct defects in the conventional build and 51 in the CHERI port. Memory errors that caused silent corruption or weak symptoms on conventional hardware were converted into precise capability faults on CHERI. These results show that CHERI not only shrinks the attack surface but also serves as an effective memory safety oracle for revealing latent vulnerabilities in embedded interpreters. |
|
| Jonnalagadda, Ravi Shankar |
Bijan Tabatabai, Eishan Mirakhur, Ravi Shankar Jonnalagadda, Vinicius Petrucci, Rohit Sehgal, Jus Singh, and Michael M. Swift (University of Wisconsin-Madison, USA; Micron Technology, USA) CXL memory devices increase the memory capacity and bandwidth available to a server, at the cost of higher access latency. Prior research focused on how to make use of the expanded memory capacity provided by CXL while minimizing the impact of its higher access latency. However, most current memory management techniques to achieve this, such as memory tiering, do not make effective use of the expanded bandwidth provided by CXL memory. These techniques place frequently accessed data in local memory, so bandwidth intensive applications will saturate the local bandwidth and leave the remote bandwidth unutilized. We instead focus on making use of the expanded bandwidth provided by CXL memory. We design a system, Dynamic Memory Interleaving (DMI), that monitors the bandwidth utilization of the machine and chooses when and how to dynamically interleave data between local and remote memory to maximize bandwidth utilization under changing application demands for bandwidth. With DMI, bandwidth intensive applications perform up to 26% faster than the state-of-the-art bandwidth-aware tiering system. DMI also performs similarly to state-of-the-art tiering systems when running latency sensitive applications. |
|
| Levine, Reese |
Soham Bagchi, Sanya Srivastava, Reese Levine, Tyler Sorensen, Ryan Stutsman, and Vijay Nagarajan (University of Utah, USA; Duke University, USA; University of California at Santa Cruz, USA) Modern heterogeneous processors like the NVIDIA Grace-Hopper Superchip tightly integrate CPU and GPU cores across a cache-coherent interconnect, with an implicit assumption that independently compiled CPU and GPU code can safely interact via shared memory. Yet the memory consistency and coherence of such systems remain empirically unvalidated. This paper presents the first systematic study of consistency and coherence on the Grace-Hopper. We empirically validate that the system enforces the Compound Memory Consistency Model (CMCM)---a theoretical prerequisite for correct independent compilation---using a novel heterogeneous litmus testing methodology spanning 1,960 test variants. We further introduce Value Propagation tests to reverse-engineer the underlying coherence mechanisms, revealing that internal GPU coherence relies on write-throughs and self-invalidations rather than classical writer-initiated invalidations, while global CPU-GPU coherence is maintained via directory-based invalidations consistent with an AMBA CHI-like protocol. These results establish the CMCM as a concrete architectural target for heterogeneous systems and provide the first empirical characterization of GPU and CPU-GPU coherence mechanisms in a commercial heterogeneous processor. |
|
| Marron, Mark |
Anthony Arnold and Mark Marron (University of Kentucky, USA) Garbage Collectors (GCs) are a critical component of a modern application stack. Long pauses, large memory consumption, and high CPU usage can unexpectedly occur with certain workloads or series of events. These behaviors can make systems unresponsive, make it impossible to run them in resource-constrained environments, and are often very difficult to debug and fix – as their appearance may be intermittent. In fact, recent theoretical work has shown that, for existing mainstream languages, these issues are unavoidable and, regardless of the GC design or implementation, there will always be workloads that cause them to occur! Intriguingly, the pathological scenarios that are needed to cause these GC behavioral issues are a result of a specific set of language features – mutability and cyclic data structures. In this paper we present a novel language and garbage-collector co-design. The collector is designed to leverage programming language features to construct a system that has provably bounded collection pauses, incurs a fixed-constant memory overhead, and ensures starvation freedom for the application. |
|
| Mirakhur, Eishan |
Bijan Tabatabai, Eishan Mirakhur, Ravi Shankar Jonnalagadda, Vinicius Petrucci, Rohit Sehgal, Jus Singh, and Michael M. Swift (University of Wisconsin-Madison, USA; Micron Technology, USA) CXL memory devices increase the memory capacity and bandwidth available to a server, at the cost of higher access latency. Prior research focused on how to make use of the expanded memory capacity provided by CXL while minimizing the impact of its higher access latency. However, most current memory management techniques to achieve this, such as memory tiering, do not make effective use of the expanded bandwidth provided by CXL memory. These techniques place frequently accessed data in local memory, so bandwidth intensive applications will saturate the local bandwidth and leave the remote bandwidth unutilized. We instead focus on making use of the expanded bandwidth provided by CXL memory. We design a system, Dynamic Memory Interleaving (DMI), that monitors the bandwidth utilization of the machine and chooses when and how to dynamically interleave data between local and remote memory to maximize bandwidth utilization under changing application demands for bandwidth. With DMI, bandwidth intensive applications perform up to 26% faster than the state-of-the-art bandwidth-aware tiering system. DMI also performs similarly to state-of-the-art tiering systems when running latency sensitive applications. |
|
| Nagarajan, Vijay |
Soham Bagchi, Sanya Srivastava, Reese Levine, Tyler Sorensen, Ryan Stutsman, and Vijay Nagarajan (University of Utah, USA; Duke University, USA; University of California at Santa Cruz, USA) Modern heterogeneous processors like the NVIDIA Grace-Hopper Superchip tightly integrate CPU and GPU cores across a cache-coherent interconnect, with an implicit assumption that independently compiled CPU and GPU code can safely interact via shared memory. Yet the memory consistency and coherence of such systems remain empirically unvalidated. This paper presents the first systematic study of consistency and coherence on the Grace-Hopper. We empirically validate that the system enforces the Compound Memory Consistency Model (CMCM)---a theoretical prerequisite for correct independent compilation---using a novel heterogeneous litmus testing methodology spanning 1,960 test variants. We further introduce Value Propagation tests to reverse-engineer the underlying coherence mechanisms, revealing that internal GPU coherence relies on write-throughs and self-invalidations rather than classical writer-initiated invalidations, while global CPU-GPU coherence is maintained via directory-based invalidations consistent with an AMBA CHI-like protocol. These results establish the CMCM as a concrete architectural target for heterogeneous systems and provide the first empirical characterization of GPU and CPU-GPU coherence mechanisms in a commercial heterogeneous processor. |
|
| Nikolopoulos, Dimitrios |
Yunqi Shen and Dimitrios Nikolopoulos (Virginia Tech, USA) GPU unified memory simplifies programming by automatically migrating pages between CPUs and GPUs, but page faults trigger migrations with hundreds of microseconds to millisecond-scale latency, stalling thousands of threads. We target this bottleneck with a page prefetching framework that predicts future faults by modeling stride (delta) transitions in addition to addresses. Across GPU workloads, we observe that while individual fault addresses may be unique, the sequence of strides between them exhibits strong regularity, including recurring multi-step and oscillatory patterns. We design a hybrid prefetcher centered on a novel stride Markov predictor that learns transition probabilities between consecutive strides, with an address Markov predictor acting as a fallback to capture direct page-to-page locality when stride-based patterns are insufficient. Both predictors share a similar data structure and throttling and pruning strategy, minimizing additional complexity while bounding predictor state and bandwidth pollution. We prototype our prefetcher as a non-invasive runtime layer requiring no modifications to GPU kernels or applications. Evaluation on diverse workloads shows that the hybrid predictor achieves up to 88% accuracy with modest pollution, and conservative speedups of up to 1.48×. These results demonstrate that stride-aware Markov prediction is a practical and effective mechanism for mitigating unified-memory bottlenecks while preserving programming simplicity. |
|
| Noor Mohamed, Mohamed Husain |
Edoardo D'Alessio, Mohamed Husain Noor Mohamed, Xiaoguang Wang, and Binoy Ravindran (University of Illinois at Chicago, USA; Virginia Tech, USA) Recent Linux memory-management interfaces make it practical to revisit distributed shared memory (DSM) as a deployable runtime substrate for conventional multithreaded software. We present Stretch, a userspace fault-driven page-granularity DSM runtime that combines userfaultfd-based fault interception, centralized MSI-style coherence (i.e., Modified, Shared, or Invalid), and CRIU-based thread placement to extend a process across multiple machines. Missing-page and write-protection faults are translated into fetch, invalidation, and ownership-transfer operations, while distributed barriers and coarse-grained mutexes reuse the same mechanism. Stretch supports both automatic tracking of anonymous regions and an explicit tracked-region mode that focuses coherence on genuinely shared memory. We evaluate Stretch on coherence microbenchmarks and seven Phoenix workloads on four CloudLab servers. The results show that the computation-to-page-fault ratio is the dominant performance predictor. Compute-intensive workloads such as Matrix Multiply achieve up to 3.39× speedup, whereas fine-grained write sharing in KMeans triggers invalidation storms that defeat page-granularity DSM. RDMA reduces key coherence operations by up to 4× relative to TCP when the server has sufficient CPU provisioning, but its advantage largely disappears when the centralized server is CPU-bound. These results identify both the practical regime in which a Linux-based fault-driven DSM is effective and the limitations that remain fundamental at page granularity. |
|
| Patton, Hayley |
Hayley Patton and Stephen M. Blackburn (Australian National University, Australia; Google, Australia) Offset-Vector Compaction (OVC) algorithms, such as the Compressor, are among the most widely deployed garbage collectors today, yet they have remained largely unexplored by the literature since the earliest algorithms were published two decades ago. Although an implementation in the Android Runtime is publicly available and a number of other variations on the algorithm have been developed or proposed, these have not been analyzed or evaluated with recent benchmarks or compared with modern server collectors. This gap in the literature may curtail innovation and exploration of new ideas while reinforcing unchallenged conventional wisdom. This paper makes six major contributions. We: i) introduce a taxonomy of offset-vector compaction (OVC) algorithms; ii) review the history of OVC algorithms, including the Compressor, discovering designs that were previously unreported; iii) implement a family of offset-vector compactors in OpenJDK; iv) implement and evaluate for the first time the One Pass (OP) Compactor; v) introduce a branch-free algorithm for computing forwarding information; and vi) conduct the first evaluation of offset-vector compactors on OpenJDK with modern workloads. We find that some of the variations we explore offer improvements over the original Compressor algorithm. We also find that, interestingly, the optimization proposed by the one pass (OP) Compactor does not improve over the Compressor. Our implementations are open source, allowing others to build on them. We hope that our analysis and evaluation of this important family of garbage collectors will enable future innovation and improved designs. |
|
| Petrucci, Vinicius |
Bijan Tabatabai, Eishan Mirakhur, Ravi Shankar Jonnalagadda, Vinicius Petrucci, Rohit Sehgal, Jus Singh, and Michael M. Swift (University of Wisconsin-Madison, USA; Micron Technology, USA) CXL memory devices increase the memory capacity and bandwidth available to a server, at the cost of higher access latency. Prior research focused on how to make use of the expanded memory capacity provided by CXL while minimizing the impact of its higher access latency. However, most current memory management techniques to achieve this, such as memory tiering, do not make effective use of the expanded bandwidth provided by CXL memory. These techniques place frequently accessed data in local memory, so bandwidth intensive applications will saturate the local bandwidth and leave the remote bandwidth unutilized. We instead focus on making use of the expanded bandwidth provided by CXL memory. We design a system, Dynamic Memory Interleaving (DMI), that monitors the bandwidth utilization of the machine and chooses when and how to dynamically interleave data between local and remote memory to maximize bandwidth utilization under changing application demands for bandwidth. With DMI, bandwidth intensive applications perform up to 26% faster than the state-of-the-art bandwidth-aware tiering system. DMI also performs similarly to state-of-the-art tiering systems when running latency sensitive applications. |
|
| Ravindran, Binoy |
Edoardo D'Alessio, Mohamed Husain Noor Mohamed, Xiaoguang Wang, and Binoy Ravindran (University of Illinois at Chicago, USA; Virginia Tech, USA) Recent Linux memory-management interfaces make it practical to revisit distributed shared memory (DSM) as a deployable runtime substrate for conventional multithreaded software. We present Stretch, a userspace fault-driven page-granularity DSM runtime that combines userfaultfd-based fault interception, centralized MSI-style coherence (i.e., Modified, Shared, or Invalid), and CRIU-based thread placement to extend a process across multiple machines. Missing-page and write-protection faults are translated into fetch, invalidation, and ownership-transfer operations, while distributed barriers and coarse-grained mutexes reuse the same mechanism. Stretch supports both automatic tracking of anonymous regions and an explicit tracked-region mode that focuses coherence on genuinely shared memory. We evaluate Stretch on coherence microbenchmarks and seven Phoenix workloads on four CloudLab servers. The results show that the computation-to-page-fault ratio is the dominant performance predictor. Compute-intensive workloads such as Matrix Multiply achieve up to 3.39× speedup, whereas fine-grained write sharing in KMeans triggers invalidation storms that defeat page-granularity DSM. RDMA reduces key coherence operations by up to 4× relative to TCP when the server has sufficient CPU provisioning, but its advantage largely disappears when the centralized server is CPU-bound. These results identify both the practical regime in which a Linux-based fault-driven DSM is effective and the limitations that remain fundamental at page granularity. |
|
| Sehgal, Rohit |
Bijan Tabatabai, Eishan Mirakhur, Ravi Shankar Jonnalagadda, Vinicius Petrucci, Rohit Sehgal, Jus Singh, and Michael M. Swift (University of Wisconsin-Madison, USA; Micron Technology, USA) CXL memory devices increase the memory capacity and bandwidth available to a server, at the cost of higher access latency. Prior research focused on how to make use of the expanded memory capacity provided by CXL while minimizing the impact of its higher access latency. However, most current memory management techniques to achieve this, such as memory tiering, do not make effective use of the expanded bandwidth provided by CXL memory. These techniques place frequently accessed data in local memory, so bandwidth intensive applications will saturate the local bandwidth and leave the remote bandwidth unutilized. We instead focus on making use of the expanded bandwidth provided by CXL memory. We design a system, Dynamic Memory Interleaving (DMI), that monitors the bandwidth utilization of the machine and chooses when and how to dynamically interleave data between local and remote memory to maximize bandwidth utilization under changing application demands for bandwidth. With DMI, bandwidth intensive applications perform up to 26% faster than the state-of-the-art bandwidth-aware tiering system. DMI also performs similarly to state-of-the-art tiering systems when running latency sensitive applications. |
|
| Shen, Yunqi |
Yunqi Shen and Dimitrios Nikolopoulos (Virginia Tech, USA) GPU unified memory simplifies programming by automatically migrating pages between CPUs and GPUs, but page faults trigger migrations with hundreds of microseconds to millisecond-scale latency, stalling thousands of threads. We target this bottleneck with a page prefetching framework that predicts future faults by modeling stride (delta) transitions in addition to addresses. Across GPU workloads, we observe that while individual fault addresses may be unique, the sequence of strides between them exhibits strong regularity, including recurring multi-step and oscillatory patterns. We design a hybrid prefetcher centered on a novel stride Markov predictor that learns transition probabilities between consecutive strides, with an address Markov predictor acting as a fallback to capture direct page-to-page locality when stride-based patterns are insufficient. Both predictors share a similar data structure and throttling and pruning strategy, minimizing additional complexity while bounding predictor state and bandwidth pollution. We prototype our prefetcher as a non-invasive runtime layer requiring no modifications to GPU kernels or applications. Evaluation on diverse workloads shows that the hybrid predictor achieves up to 88% accuracy with modest pollution, and conservative speedups of up to 1.48×. These results demonstrate that stride-aware Markov prediction is a practical and effective mechanism for mitigating unified-memory bottlenecks while preserving programming simplicity. |
|
| Singer, Jeremy |
Kai Feng, Huanting Wang, Jeremy Singer, and Zheng Wang (University of Glasgow, UK; University of Leeds, UK) Memory safety is a critical issue in embedded systems. Although high-level languages like MicroPython simplify IoT development, their C-based runtimes remain vulnerable to memory errors triggered by Python code or native extensions. The CHERI (Capability Hardware Enhanced RISC Instructions) architecture offers hardware-enforced memory safety, but its effectiveness for exposing latent bugs in real-world interpreters has not yet been fully explored. We present diffCHERI:FruitFly, a novel differential testing framework for systematically uncovering memory defects in MicroPython across conventional (x86/ARM) and CHERI-enabled (Arm Morello) platforms. We mine historic vulnerabilities from diverse Python runtimes to extract recurring stress patterns, then use a large language model to generate new test programs, and apply Concrete Syntax Tree (CST) mutation to diversify inputs. During 24 hours of automated testing, our framework executed 8,189 generated programs on MicroPython v1.20 and the development branch, identifying 40 distinct defects in the conventional build and 51 in the CHERI port. Memory errors that caused silent corruption or weak symptoms on conventional hardware were converted into precise capability faults on CHERI. These results show that CHERI not only shrinks the attack surface but also serves as an effective memory safety oracle for revealing latent vulnerabilities in embedded interpreters. |
|
| Singh, Jus |
Bijan Tabatabai, Eishan Mirakhur, Ravi Shankar Jonnalagadda, Vinicius Petrucci, Rohit Sehgal, Jus Singh, and Michael M. Swift (University of Wisconsin-Madison, USA; Micron Technology, USA) CXL memory devices increase the memory capacity and bandwidth available to a server, at the cost of higher access latency. Prior research focused on how to make use of the expanded memory capacity provided by CXL while minimizing the impact of its higher access latency. However, most current memory management techniques to achieve this, such as memory tiering, do not make effective use of the expanded bandwidth provided by CXL memory. These techniques place frequently accessed data in local memory, so bandwidth intensive applications will saturate the local bandwidth and leave the remote bandwidth unutilized. We instead focus on making use of the expanded bandwidth provided by CXL memory. We design a system, Dynamic Memory Interleaving (DMI), that monitors the bandwidth utilization of the machine and chooses when and how to dynamically interleave data between local and remote memory to maximize bandwidth utilization under changing application demands for bandwidth. With DMI, bandwidth intensive applications perform up to 26% faster than the state-of-the-art bandwidth-aware tiering system. DMI also performs similarly to state-of-the-art tiering systems when running latency sensitive applications. |
|
| Sorensen, Tyler |
Soham Bagchi, Sanya Srivastava, Reese Levine, Tyler Sorensen, Ryan Stutsman, and Vijay Nagarajan (University of Utah, USA; Duke University, USA; University of California at Santa Cruz, USA) Modern heterogeneous processors like the NVIDIA Grace-Hopper Superchip tightly integrate CPU and GPU cores across a cache-coherent interconnect, with an implicit assumption that independently compiled CPU and GPU code can safely interact via shared memory. Yet the memory consistency and coherence of such systems remain empirically unvalidated. This paper presents the first systematic study of consistency and coherence on the Grace-Hopper. We empirically validate that the system enforces the Compound Memory Consistency Model (CMCM)---a theoretical prerequisite for correct independent compilation---using a novel heterogeneous litmus testing methodology spanning 1,960 test variants. We further introduce Value Propagation tests to reverse-engineer the underlying coherence mechanisms, revealing that internal GPU coherence relies on write-throughs and self-invalidations rather than classical writer-initiated invalidations, while global CPU-GPU coherence is maintained via directory-based invalidations consistent with an AMBA CHI-like protocol. These results establish the CMCM as a concrete architectural target for heterogeneous systems and provide the first empirical characterization of GPU and CPU-GPU coherence mechanisms in a commercial heterogeneous processor. |
|
| Srivastava, Sanya |
Soham Bagchi, Sanya Srivastava, Reese Levine, Tyler Sorensen, Ryan Stutsman, and Vijay Nagarajan (University of Utah, USA; Duke University, USA; University of California at Santa Cruz, USA) Modern heterogeneous processors like the NVIDIA Grace-Hopper Superchip tightly integrate CPU and GPU cores across a cache-coherent interconnect, with an implicit assumption that independently compiled CPU and GPU code can safely interact via shared memory. Yet the memory consistency and coherence of such systems remain empirically unvalidated. This paper presents the first systematic study of consistency and coherence on the Grace-Hopper. We empirically validate that the system enforces the Compound Memory Consistency Model (CMCM)---a theoretical prerequisite for correct independent compilation---using a novel heterogeneous litmus testing methodology spanning 1,960 test variants. We further introduce Value Propagation tests to reverse-engineer the underlying coherence mechanisms, revealing that internal GPU coherence relies on write-throughs and self-invalidations rather than classical writer-initiated invalidations, while global CPU-GPU coherence is maintained via directory-based invalidations consistent with an AMBA CHI-like protocol. These results establish the CMCM as a concrete architectural target for heterogeneous systems and provide the first empirical characterization of GPU and CPU-GPU coherence mechanisms in a commercial heterogeneous processor. |
|
| Stutsman, Ryan |
Soham Bagchi, Sanya Srivastava, Reese Levine, Tyler Sorensen, Ryan Stutsman, and Vijay Nagarajan (University of Utah, USA; Duke University, USA; University of California at Santa Cruz, USA) Modern heterogeneous processors like the NVIDIA Grace-Hopper Superchip tightly integrate CPU and GPU cores across a cache-coherent interconnect, with an implicit assumption that independently compiled CPU and GPU code can safely interact via shared memory. Yet the memory consistency and coherence of such systems remain empirically unvalidated. This paper presents the first systematic study of consistency and coherence on the Grace-Hopper. We empirically validate that the system enforces the Compound Memory Consistency Model (CMCM)---a theoretical prerequisite for correct independent compilation---using a novel heterogeneous litmus testing methodology spanning 1,960 test variants. We further introduce Value Propagation tests to reverse-engineer the underlying coherence mechanisms, revealing that internal GPU coherence relies on write-throughs and self-invalidations rather than classical writer-initiated invalidations, while global CPU-GPU coherence is maintained via directory-based invalidations consistent with an AMBA CHI-like protocol. These results establish the CMCM as a concrete architectural target for heterogeneous systems and provide the first empirical characterization of GPU and CPU-GPU coherence mechanisms in a commercial heterogeneous processor. |
|
| Swift, Michael M. |
Bijan Tabatabai, Eishan Mirakhur, Ravi Shankar Jonnalagadda, Vinicius Petrucci, Rohit Sehgal, Jus Singh, and Michael M. Swift (University of Wisconsin-Madison, USA; Micron Technology, USA) CXL memory devices increase the memory capacity and bandwidth available to a server, at the cost of higher access latency. Prior research focused on how to make use of the expanded memory capacity provided by CXL while minimizing the impact of its higher access latency. However, most current memory management techniques to achieve this, such as memory tiering, do not make effective use of the expanded bandwidth provided by CXL memory. These techniques place frequently accessed data in local memory, so bandwidth intensive applications will saturate the local bandwidth and leave the remote bandwidth unutilized. We instead focus on making use of the expanded bandwidth provided by CXL memory. We design a system, Dynamic Memory Interleaving (DMI), that monitors the bandwidth utilization of the machine and chooses when and how to dynamically interleave data between local and remote memory to maximize bandwidth utilization under changing application demands for bandwidth. With DMI, bandwidth intensive applications perform up to 26% faster than the state-of-the-art bandwidth-aware tiering system. DMI also performs similarly to state-of-the-art tiering systems when running latency sensitive applications. |
|
| Tabatabai, Bijan |
Bijan Tabatabai, Eishan Mirakhur, Ravi Shankar Jonnalagadda, Vinicius Petrucci, Rohit Sehgal, Jus Singh, and Michael M. Swift (University of Wisconsin-Madison, USA; Micron Technology, USA) CXL memory devices increase the memory capacity and bandwidth available to a server, at the cost of higher access latency. Prior research focused on how to make use of the expanded memory capacity provided by CXL while minimizing the impact of its higher access latency. However, most current memory management techniques to achieve this, such as memory tiering, do not make effective use of the expanded bandwidth provided by CXL memory. These techniques place frequently accessed data in local memory, so bandwidth intensive applications will saturate the local bandwidth and leave the remote bandwidth unutilized. We instead focus on making use of the expanded bandwidth provided by CXL memory. We design a system, Dynamic Memory Interleaving (DMI), that monitors the bandwidth utilization of the machine and chooses when and how to dynamically interleave data between local and remote memory to maximize bandwidth utilization under changing application demands for bandwidth. With DMI, bandwidth intensive applications perform up to 26% faster than the state-of-the-art bandwidth-aware tiering system. DMI also performs similarly to state-of-the-art tiering systems when running latency sensitive applications. |
|
| Van Kempen, Nicolas |
Nicolas van Kempen and Emery D. Berger (University of Massachusetts at Amherst, USA; Amazon Web Services, USA) Programmers using native languages such as C, C++, or Rust can implement custom memory allocation strategies to improve execution time. In their paper titled "Reconsidering Custom Memory Allocation" almost 25 years ago, Berger et al. showed that while per-class allocators provide no significant speedups over a state-of-the-art general-purpose allocator, region-based allocators can improve execution time by allocating and freeing objects in bulk. This paper revisits that work on a modern hardware platform with modern general-purpose allocators to evaluate whether their conclusions still hold. It also augments the benchmark suite with two large real-world applications (Clang and Blender), and introduces a methodology to explore the effect of memory fragmentation on locality in general-purpose allocators. Our results support and extend the original conclusions, demonstrating the locality advantages of region-based custom memory allocators. |
|
| Wang, Huanting |
Kai Feng, Huanting Wang, Jeremy Singer, and Zheng Wang (University of Glasgow, UK; University of Leeds, UK) Memory safety is a critical issue in embedded systems. Although high-level languages like MicroPython simplify IoT development, their C-based runtimes remain vulnerable to memory errors triggered by Python code or native extensions. The CHERI (Capability Hardware Enhanced RISC Instructions) architecture offers hardware-enforced memory safety, but its effectiveness for exposing latent bugs in real-world interpreters has not yet been fully explored. We present diffCHERI:FruitFly, a novel differential testing framework for systematically uncovering memory defects in MicroPython across conventional (x86/ARM) and CHERI-enabled (Arm Morello) platforms. We mine historic vulnerabilities from diverse Python runtimes to extract recurring stress patterns, then use a large language model to generate new test programs, and apply Concrete Syntax Tree (CST) mutation to diversify inputs. In 24 hours of automated testing, our framework executed 8,189 generated programs on MicroPython v1.20 and the development branch, identifying 40 distinct defects in the conventional build and 51 in the CHERI port. Memory errors that caused silent corruption or weak symptoms on conventional hardware were converted into precise capability faults on CHERI. These results show that CHERI not only shrinks the attack surface but also serves as an effective memory safety oracle for revealing latent vulnerabilities in embedded interpreters. |
|
| Wang, Xiaoguang |
Edoardo D'Alessio, Mohamed Husain Noor Mohamed, Xiaoguang Wang, and Binoy Ravindran (University of Illinois at Chicago, USA; Virginia Tech, USA) Recent Linux memory-management interfaces make it practical to revisit distributed shared memory (DSM) as a deployable runtime substrate for conventional multithreaded software. We present Stretch, a userspace fault-driven page-granularity DSM runtime that combines userfaultfd-based fault interception, centralized MSI-style coherence (i.e., Modified, Shared, or Invalid), and CRIU-based thread placement to extend a process across multiple machines. Missing-page and write-protection faults are translated into fetch, invalidation, and ownership-transfer operations, while distributed barriers and coarse-grained mutexes reuse the same mechanism. Stretch supports both automatic tracking of anonymous regions and an explicit tracked-region mode that focuses coherence on genuinely shared memory. We evaluate Stretch on coherence microbenchmarks and seven Phoenix workloads on four CloudLab servers. The results show that the computation-to-page-fault ratio is the dominant performance predictor. Compute-intensive workloads such as Matrix Multiply achieve up to 3.39× speedup, whereas fine-grained write sharing in KMeans triggers invalidation storms that defeat page-granularity DSM. RDMA reduces key coherence operations by up to 4× relative to TCP when the server has sufficient CPU provisioning, but its advantage largely disappears when the centralized server is CPU-bound. These results identify both the practical regime in which a Linux-based fault-driven DSM is effective and the limitations that remain fundamental at page granularity. |
|
| Wang, Zheng |
Kai Feng, Huanting Wang, Jeremy Singer, and Zheng Wang (University of Glasgow, UK; University of Leeds, UK) Memory safety is a critical issue in embedded systems. Although high-level languages like MicroPython simplify IoT development, their C-based runtimes remain vulnerable to memory errors triggered by Python code or native extensions. The CHERI (Capability Hardware Enhanced RISC Instructions) architecture offers hardware-enforced memory safety, but its effectiveness for exposing latent bugs in real-world interpreters has not yet been fully explored. We present diffCHERI:FruitFly, a novel differential testing framework for systematically uncovering memory defects in MicroPython across conventional (x86/ARM) and CHERI-enabled (Arm Morello) platforms. We mine historic vulnerabilities from diverse Python runtimes to extract recurring stress patterns, then use a large language model to generate new test programs, and apply Concrete Syntax Tree (CST) mutation to diversify inputs. In 24 hours of automated testing, our framework executed 8,189 generated programs on MicroPython v1.20 and the development branch, identifying 40 distinct defects in the conventional build and 51 in the CHERI port. Memory errors that caused silent corruption or weak symptoms on conventional hardware were converted into precise capability faults on CHERI. These results show that CHERI not only shrinks the attack surface but also serves as an effective memory safety oracle for revealing latent vulnerabilities in embedded interpreters. |
30 authors