ASPLOS 2023
28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS 2023)
Powered by
Conference Publishing Consulting

28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS 2023), March 25–29, 2023, Vancouver, BC, Canada

ASPLOS 2023 – Proceedings

Contents - Abstracts - Authors

Frontmatter

Title Page


ASPLOS 2023 Message from the General Chair
It is an honor to welcome you to Vancouver, British Columbia for the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2023). We hope those able to attend in person will enjoy their time in Vancouver!

ASPLOS 2023 Volume III Program Chairs’ Message
It is our pleasure to introduce the third and final volume of ASPLOS ’23. For the first time, ASPLOS has embarked on a new multi-deadline review model. ASPLOS ’23 featured 3 deadlines spaced throughout the year and published papers in three volumes. Multiple deadlines are meant to encourage authors to submit their papers when ready and to facilitate the selection of some papers for revision. In this, our final program chairs’ message, we will provide details on the execution of the third submission cycle along with a detailed discussion of the entire ASPLOS ’23 process.

Committees


Sponsors


Keynotes

Direct Mind-Machine Teaming (Keynote)
Abhishek Bhattacharjee ORCID logo
(Yale University, USA)
Direct mind-machine teaming will help us treat brain disorders, augment the healthy brain, and shed light on how the brain as an organ gives rise to the mind. Delivering on this promise requires the design of computer systems that delicately balance the tight power, latency, and bandwidth trade-offs needed to decode brain activity, stimulate biological neurons, and control assistive devices most effectively.
This talk presents my group’s design of a standardized and general computer architecture for future brain interfacing. Our design enables the treatment of several neurological disorders (most notably, epilepsy and movement disorders) and lays the groundwork for brain interfacing techniques that can help augment cognitive control and decision-making in the healthy brain. Central to our design is end-to-end hardware acceleration, from the microarchitectural to the distributed system level. Key insights are undergirded via detailed physical synthesis models and chip tape-outs in a 12nm CMOS process.

Publisher's Version
Language Models: The Most Important Compute Challenge of Our Time (Keynote)
Bryan Catanzaro ORCID logo
(NVIDIA, USA)
ChatGPT recently became one of the fastest growing new applications in history, thanks to its intriguing text generation capabilities that are able to answer questions, write poetry, and even problem solve. Large Language Models are now being integrated in fundamental ways into products around the tech industry. The possibilities are extraordinary, but much research remains to make these systems reliable and trustworthy, as well as integrate them into applications seamlessly. Additionally, the computational challenges behind large language modeling are also quite important. Systems for training and deploying these models must be highly scalable and run at extreme efficiency, because the amount of work necessary to converge a model can be extraordinarily large. The cost of deploying these models is a barrier to their deployment and must be lowered significantly. In this talk, I’ll discuss the work we have been doing at NVIDIA to optimize systems for Large Language Model training and inference, and highlight some of the challenges that remain for future work.

Publisher's Version

Papers

ABNDP: Co-optimizing Data Access and Load Balance in Near-Data Processing
Boyu Tian ORCID logo, Qihang Chen ORCID logo, and Mingyu Gao ORCID logo
(Tsinghua University, China; Shanghai Qi Zhi Institute, Shanghai, China)
Near-Data Processing (NDP) has been a promising architectural paradigm to address the memory wall challenge for data-intensive applications. Typical NDP systems based on 3D-stacked memories contain massive parallel processing units, each of which can access its local memory as well as other remote memory regions in the system. In such an architecture, minimizing remote data accesses and achieving computation load balance exhibit a fundamental tradeoff, where existing solutions can only improve one but sacrifice the other. We propose ABNDP, which leverages novel hardware-software co-optimizations to simultaneously alleviate these two issues without making tradeoffs. ABNDP uses a novel and efficient distributed DRAM cache to allow additional data caching locations in the system, where the computation load at the original data hotspots can be distributed and balanced, without significantly increasing remote accesses. ABNDP also adopts a hybrid task scheduling policy that considers both the remote access cost and the load imbalance impact, and exploits the flexibility of the multiple data caching locations to decide the best computation place. Our evaluation shows that ABNDP successfully achieves the two goals of minimizing remote access cost and maintaining load balance, and significantly outperforms the baseline systems in terms of both performance (1.7×) and energy consumption (25%).

Publisher's Version
Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling
Toluwanimi O. Odemuyiwa ORCID logo, Hadi Asghari-Moghaddam ORCID logo, Michael Pellauer ORCID logo, Kartik Hegde ORCID logo, Po-An Tsai ORCID logo, Neal C. Crago ORCID logo, Aamer Jaleel ORCID logo, John D. Owens ORCID logo, Edgar Solomonik ORCID logo, Joel S. Emer ORCID logo, and Christopher W. FletcherORCID logo
(University of California at Davis, Davis, USA; University of Illinois at Urbana-Champaign, USA; NVIDIA, USA; Massachusetts Institute of Technology, USA)
Tensor algebra involving multiple sparse operands is severely memory bound, making it a challenging target for acceleration. Furthermore, irregular sparsity complicates traditional techniques—such as tiling—for ameliorating memory bottlenecks. Prior sparse tiling schemes are sparsity unaware: they carve tensors into uniform coordinate-space shapes, which leads to low-occupancy tiles and thus lower exploitable reuse. To address these challenges, this paper proposes dynamic reflexive tiling (DRT), a novel tiling method that improves data reuse over prior art for sparse tensor kernels, unlocking significant performance improvement opportunities. DRT’s key idea is dynamic sparsity-aware tiling. DRT continuously re-tiles sparse tensors at runtime based on the current sparsity of the active regions of all input tensors, to maximize accelerator buffer utilization while retaining the ability to co-iterate through tiles of distinct tensors.
Through an extensive evaluation over a set of SuiteSparse matrices, we show how DRT can be applied to multiple prior accelerators with different dataflows (ExTensor, OuterSPACE, MatRaptor), improving their performance (by 3.3×, 5.1× and 1.6×, respectively) while adding negligible area overhead. We apply DRT to higher-order tensor kernels to reduce DRAM traffic by 3.9× and 16.9× over a CPU implementation and prior-art tiling scheme, respectively. Finally, we show that the technique is portable to software, with an improvement of 7.29× and 2.94× in memory overhead compared to untiled sparse-sparse matrix multiplication (SpMSpM).

Publisher's Version
APEX: A Framework for Automated Processing Element Design Space Exploration using Frequent Subgraph Analysis
Jackson Melchert ORCID logo, Kathleen Feng ORCID logo, Caleb Donovick ORCID logo, Ross Daly ORCID logo, Ritvik Sharma ORCID logo, Clark Barrett ORCID logo, Mark A. Horowitz ORCID logo, Pat Hanrahan ORCID logo, and Priyanka Raina ORCID logo
(Stanford University, USA)
The architecture of a coarse-grained reconfigurable array (CGRA) processing element (PE) has a significant effect on the performance and energy-efficiency of an application running on the CGRA. This paper presents APEX, an automated approach for generating specialized PE architectures for an application or an application domain. APEX first analyzes application domain benchmarks using frequent subgraph mining to extract commonly occurring computational subgraphs. APEX then generates specialized PEs by merging subgraphs using a datapath graph merging algorithm. The merged datapath graphs are translated into a PE specification from which we automatically generate the PE hardware description in Verilog along with a compiler that maps applications to the PE. The PE hardware and compiler are inserted into a flexible CGRA generation and compilation toolchain that allows for agile evaluation of CGRAs. We evaluate APEX for two domains, machine learning and image processing. For image processing applications, our automatically generated CGRAs with specialized PEs achieve from 5% to 30% less area and from 22% to 46% less energy compared to a general-purpose CGRA. For machine learning applications, our automatically generated CGRAs consume 16% to 59% less energy and 22% to 39% less area than a general-purpose CGRA. This work paves the way for creation of application domain-driven design-space exploration frameworks that automatically generate efficient programmable accelerators, with a much lower design effort for both hardware and compiler generation.

Publisher's Version
Beyond Static Parallel Loops: Supporting Dynamic Task Parallelism on Manycore Architectures with Software-Managed Scratchpad Memories
Lin Cheng ORCID logo, Max Ruttenberg ORCID logo, Dai Cheol Jung ORCID logo, Dustin Richmond ORCID logo, Michael Taylor ORCID logo, Mark Oskin ORCID logo, and Christopher Batten ORCID logo
(Cornell University, USA; University of Washington, USA; University of California at Santa Cruz, Santa Cruz, USA)
Manycore architectures integrate hundreds of cores on a single chip by using simple cores and simple memory systems usually based on software-managed scratchpad memories (SPMs). However, such architectures are notoriously challenging to program, since the programmers need to manually manage all aspects of data movement and synchronization for both correctness and performance. We argue that this manycore programmability challenge is one of the key barriers to achieving the promise of manycore architectures. At the same time, the dynamic task parallel programming model is enjoying considerable success in addressing the programmability challenge of multi-core processors with tens of complex cores and hardware cache coherence.
Conventional wisdom suggests a work-stealing runtime, which forms the core of most dynamic task parallel programming models, is ill-suited for manycore architectures. In this work, we demonstrate that a work-stealing runtime is not just feasible on manycore architectures with SPMs, but such a runtime can also significantly improve the performance of irregular workloads when executing on these architectures. We also explore three optimizations that allow the runtime to leverage unused SPM space for further performance benefit. Our dynamic task parallel programming framework achieves 1.2–28.5× speedup on workloads that benefit from our techniques, and only induces minimal overhead for workloads that do not.

Publisher's Version
CaQR: A Compiler-Assisted Approach for Qubit Reuse through Dynamic Circuit
Fei Hua ORCID logo, Yuwei Jin ORCID logo, Yanhao Chen ORCID logo, Suhas Vittal ORCID logo, Kevin Krsulich ORCID logo, Lev S. Bishop ORCID logo, John Lapeyre ORCID logo, Ali Javadi-Abhari ORCID logo, and Eddy Z. Zhang ORCID logo
(Rutgers University, USA; Georgia Institute of Technology, USA; IBM, USA)
Quantum measurement is important to quantum computing as it extracts out the outcome of the circuit at the end of the computation. Previously, all measurements have to be done at the end of the circuit. Otherwise, it will incur significant errors. But it is not the case now. Recently IBM starts supporting dynamic circuit through hardware (instead of software by simulator). With mid-circuit hardware measurement, we can improve circuit efficacy and fidelity from three aspects: (a) reduced qubit usage, (b) reduced swap insertion, and (c) improved fidelity. We demonstrate this using real-world applications Bernstein Verizani on real hardware and show that circuit resource usage can be improved by 60%, and circuit fidelity can be improved by 15%. We design a compiler-assisted tool that can find and exploit the tradeoff between qubit reuse, fidelity, gate count, and circuit duration. We also developed a method for identifying whether qubit reuse will be beneficial for a given application. We evaluated our method on a representative set of important applications. We can reduce resource usage by up to 80% and improve circuit fidelity by up to 20%.

Publisher's Version
CaT: A Solver-Aided Compiler for Packet-Processing Pipelines
Xiangyu Gao ORCID logo, Divya Raghunathan ORCID logo, Ruijie Fang ORCID logo, Tao Wang ORCID logo, Xiaotong Zhu ORCID logo, Anirudh Sivaraman ORCID logo, Srinivas Narayana ORCID logo, and Aarti Gupta ORCID logo
(New York University, USA; Princeton University, USA; Rutgers University, USA)
Compiling high-level programs to high-speed packet-processing pipelines is a challenging combinatorial optimization problem. The compiler must configure the pipeline’s resources to match the semantics of the program’s high-level specification, while packing all of the program’s computation into the pipeline’s limited resources. State of the art approaches tackle individual aspects of this problem. Yet, they miss opportunities to produce globally high-quality outcomes within reasonable compilation times. We develop a framework to decompose the compilation problem for such pipelines into three phases—making extensive use of solver engines (e.g., ILP, SMT, and program synthesis) to simplify the development of these phases. Transformation rewrites programs to use more abundant pipeline resources, avoiding scarce ones. Synthesis breaks complex transactional code into configurations of pipelined compute units. Allocation maps the program’s compute and memory to the pipeline’s hardware resources. We prototype these ideas in a compiler, CaT, which targets (1) the Tofino programmable switch pipeline and (2) Menshen, a cycle-accurate simulator of a Verilog description of the RMT pipeline. CaT can handle programs that existing compilers cannot currently run on pipelines and generates code faster than existing compilers, where the generated code uses fewer pipeline resources.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Characterizing and Optimizing End-to-End Systems for Private Inference
Karthik Garimella ORCID logo, Zahra Ghodsi ORCID logo, Nandan Kumar Jha ORCID logo, Siddharth Garg ORCID logo, and Brandon Reagen ORCID logo
(New York University, USA; Purdue University, USA)
In two-party machine learning prediction services, the client’s goal is to query a remote server’s trained machine learning model to perform neural network inference in some application domain. However, sensitive information can be obtained during this process by either the client or the server, leading to potential collection, unauthorized secondary use, and inappropriate access to personal information. These security concerns have given rise to Private Inference (PI), in which both the client’s personal data and the server’s trained model are kept confidential. State-of-the-art PI protocols consist of a pre-processing or offline phase and an online phase that combine several cryptographic primitives: Homomorphic Encryption (HE), Secret Sharing (SS), Garbled Circuits (GC), and Oblivious Transfer (OT). Despite the need and recent performance improvements, PI remains largely arcane today and is too slow for practical use.
This paper addresses PI’s shortcomings with a detailed characterization of a standard high-performance protocol to build foundational knowledge and intuition in the systems community. Our characterization pinpoints all sources of inefficiency – compute, communication, and storage. In contrast to prior work, we consider inference request arrival rates rather than studying individual inferences in isolation and we find that the pre-processing phase cannot be ignored and is often incurred online as there is insufficient downtime to hide pre-compute latency. Finally, we leverage insights from our characterization and propose three optimizations to address the storage (Client-Garbler), computation (layer-parallel HE), and communication (wireless slot allocation) overheads. Compared to the state-of-the-art PI protocol, these optimizations provide a total PI speedup of 1.8 × with the ability to sustain inference requests up to a 2.24 × greater rate. Looking ahead, we conclude our paper with an analysis of future research innovations and their effects and improvements on PI latency.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Cohort: Software-Oriented Acceleration for Heterogeneous SoCs
Tianrui Wei ORCID logo, Nazerke Turtayeva ORCID logo, Marcelo Orenes-Vera ORCID logo, Omkar Lonkar ORCID logo, and Jonathan BalkindORCID logo
(University of California at Berkeley, Berkeley, USA; University of California at Santa Barbara, Santa Barbara, USA; Princeton University, Princeton, USA)
Philosophically, our approaches to acceleration focus on the extreme. We must optimise accelerators to the maximum, leaving software to fix any hardware-software mismatches. Today’s software abstractions for programming accelerators leak hardware details, requiring changes to data formats and manual memory and coherence management, among other issues. This harms generality and requires deep hardware knowledge to efficiently program accelerators, a state which we consider hardware-oriented.
This paper proposes Software-Oriented Acceleration (SOA), where software uses existing abstractions, like software shared-memory queues, to interact with accelerators. We introduce the Cohort engine which exploits these queues’ standard semantics to efficiently connect producers and consumers in software with accelerators with minimal application changes. Accelerators are even usable in chains which can be runtime reconfigured by software. Cohort significantly reduces the burden to add new accelerators while maintaining system-level guarantees. We implement a Cohort FPGA prototype which supports SOA applications running on multicore Linux. Our evaluation shows speedups for Cohort over traditional approaches ranging from 1.83× to 8.38× over MMIO, and from 1.69× to 11.24× for DMA baselines. Our software-oriented batching optimisations within Cohort also improve performance from 2.32× to 8.10×, demonstrating the power of SOA.

Publisher's Version
Coyote: A Compiler for Vectorizing Encrypted Arithmetic Circuits
Raghav Malik ORCID logo, Kabir Sheth ORCID logo, and Milind KulkarniORCID logo
(Purdue University, USA)
Fully Homomorphic Encryption (FHE) is a scheme that allows a computational circuit to operate on encrypted data and produce a result that, when decrypted, yields the result of the unencrypted computation. While FHE enables privacy-preserving computation, it is extremely slow. However, the mathematical formulation of FHE supports a SIMD-like execution style, so recent work has turned to vectorization to recover some of the missing performance. Unfortunately, these approaches do not work well for arbitrary computations: they do not account for the high cost of rotating vector operands to allow data to be used in multiple operations. Hence, the cost of rotation can outweigh the benefits of vectorization.
This paper presents Coyote, a new approach to vectorizing encrypted circuits that specifically aims to optimize the use of rotations. It tackles the scheduling and data layout problems simultaneously, operating at the level of subcircuits that can be vectorized without incurring excessive data movement overhead. By jointly searching for good vectorization and lane placement, Coyote finds schedules that avoid sacrificing one for the other. This paper shows that Coyote is effective at vectorizing computational kernels while minimizing rotations, thus finding efficient vector schedules and smart rotation schemes to achieve substantial speedups.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
DefT: Boosting Scalability of Deformable Convolution Operations on GPUs
Edward Hanson ORCID logo, Mark Horton ORCID logo, Hai (Helen) Li ORCID logo, and Yiran Chen ORCID logo
(Duke University, Durham, USA)
Deformable Convolutional Networks (DCN) have been proposed as a powerful tool to boost the representation power of Convolutional Neural Networks (CNN) in computer vision tasks via adaptive sampling of the input feature map. Much like vision transformers, DCNs utilize a more flexible inductive bias than standard CNNs and have also been shown to improve performance of particular models. For example, drop-in DCN layers were shown to increase the AP score of Mask RCNN by 10.6 points while introducing only 1% additional parameters and FLOPs, improving the state-of-the-art model at the time of publication. However, despite evidence that more DCN layers placed earlier in the network can further improve performance, we have not seen this trend continue with further scaling of deformations in CNNs, unlike for vision transformers. Benchmarking experiments show that a realistically sized DCN layer (64H×64W, 64 in-out channel) incurs a 4× slowdown on a GPU platform, discouraging the more ubiquitous use of deformations in CNNs. These slowdowns are caused by the irregular input-dependent access patterns of the bilinear interpolation operator, which has a disproportionately low arithmetic intensity (AI) compared to the rest of the DCN. To address the disproportionate slowdown of DCNs and enable their expanded use in CNNs, we propose DefT, a series of workload-aware optimizations for DCN kernels. DefT identifies performance bottlenecks in DCNs and fuses specific operators that are observed to limit DCN AI. Our approach also uses statistical information of DCN workloads to adapt the workload tiling to the DCN layer dimensions, minimizing costly out-of-boundary input accesses. Experimental results show that DefT mitigates up to half of DCN slowdown over the current-art PyTorch implementation. This translates to a layerwise speedup of up to 134% and a reduction of normalized training time of 46% on a fully DCN-enabled ResNet model.

Publisher's Version
Disaggregated RAID Storage in Modern Datacenters
Junyi Shu ORCID logo, Ruidong Zhu ORCID logo, Yun Ma ORCID logo, Gang Huang ORCID logo, Hong Mei ORCID logo, Xuanzhe Liu ORCID logo, and Xin Jin ORCID logo
(Peking University, China)
RAID (Redundant Array of Independent Disks) has been widely adopted for decades, as it provides enhanced throughput and redundancy beyond what a single disk can offer. Today, enabled by fast datacenter networks, accessing remote block devices with acceptable overhead (i.e. disaggregated storage) becomes a reality (e.g., for serverless applications). Combining RAID with remote storage can provide the same benefits while creating better fault tolerance and flexibility than its monolithic counterparts. The key challenge of disaggregated RAID is to handle extra network traffic generated by RAID, which can consume a vast amount of NIC bandwidth. We present dRAID, a disaggregated RAID system that achieves near-optimal read and write throughput. dRAID exploits peer-to-peer disaggregated data access to reduce bandwidth consumption in both normal and degraded states. It employs non-blocking multi-stage writes to maximize inter-node parallelism, and applies pipelined I/O processing to maximize inter-device parallelism. We introduce bandwidth-aware reconstruction for better load balancing. We show that dRAID provides up to 3× bandwidth improvement. The results on a lightweight object store show that dRAID brings 1.5×-2.35× throughput improvement on various workloads.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
DrGPUM: Guiding Memory Optimization for GPU-Accelerated Applications
Mao Lin ORCID logo, Keren Zhou ORCID logo, and Pengfei Su ORCID logo
(University of California at Merced, Merced, USA; OpenAI, USA)
GPUs are widely used in today’s computing platforms to accelerate applications in various domains. However, scarce GPU memory resources are often the dominant limiting factor in strengthening the applicability of GPU computing. In this paper, we propose DrGPUM, the first profiler that systematically investigates patterns of memory inefficiencies in GPU-accelerated applications. The strength of DrGPUM, when compared to a large class of existing GPU profilers, is its ability to (1) correlate problematic memory usage with data objects and GPU APIs, (2) identify and categorize object-level and intra-object memory inefficiencies, and (3) provide rich insights to guide memory optimization.
DrGPUM works on fully-optimized and unmodified GPU binaries, requires no modification to hardware or OS, and features a user-friendly GUI, which makes it attractive to use in production. Our evaluation with well-known benchmarks and applications shows DrGPUM’s effectiveness in identifying memory inefficiencies with moderate overhead. Eliminating these inefficiencies requires less than nine source lines of code modifications and yields significant reductions in peak memory usage (up to 83%) and/or significant performance improvements (up to 2.48×). Our optimization patches have been confirmed by application developers and upstreamed to their repositories.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Efficient Compactions between Storage Tiers with PrismDB
Ashwini Raina ORCID logo, Jianan Lu ORCID logo, Asaf Cidon ORCID logo, and Michael J. Freedman ORCID logo
(Princeton University, USA; Columbia University, USA)
In recent years, emerging storage hardware technologies have focused on divergent goals: better performance or lower cost-per-bit. Correspondingly, data systems that employ these technologies are typically optimized either to be fast (but expensive) or cheap (but slow). We take a different approach: by architecting a storage engine to natively utilize two tiers of fast and low-cost storage technologies, we can achieve a Pareto efficient balance between performance and cost-per-bit.
This paper presents the design and implementation of PrismDB, a novel key-value store that exploits two extreme ends of the spectrum of modern NVMe storage technologies (3D XPoint and QLC NAND) simultaneously. Our key contribution is how to efficiently migrate and compact data between two different storage tiers. Inspired by the classic cost-benefit analysis of log cleaning, we develop a new algorithm for multi-tiered storage compaction that balances the benefit of reclaiming space for hot objects in fast storage with the cost of compaction I/O in slow storage. Compared to the standard use of RocksDB on flash in datacenters today, PrismDB’s average throughput on tiered storage is 3.3× faster, its read tail latency is 2× better, and it is 5× more durable using equivalently-priced hardware.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Efficient Scheduler Live Update for Linux Kernel with Modularization
Teng Ma ORCID logo, Shanpei Chen ORCID logo, Yihao Wu ORCID logo, Erwei Deng ORCID logo, Zhuo Song ORCID logo, Quan Chen ORCID logo, and Minyi Guo ORCID logo
(Alibaba Group, China; Shanghai Jiao Tong University, China)
The scheduler is a critical component of the operating system (OS)and is tightly coupled with Linux. Production-level clouds often host various workloads, and these workloads require different schedulers to achieve high performance. Thus the capability of updating the scheduler lively without rebooting the OS is crucial for the production environments. However, emerging live update techniques only apply for the fine-grained function-level updates or require extra constraints such as microkernel. It fails to update the entire heavy process scheduler subsystem lively. We therefore propose Plugsched to enable scheduler live update, and there are two key novelties. First of all, with the idea of modularization, Plugsched decouples the scheduler from the Linux kernel to be an independent module; Secondly, Plugsched uses the data rebuild technique to migrate the state from the old scheduler to the new one. This scheme can be directly applied to the Linux kernel scheduler in production environments without modifying kernel code. Unlike current function-level live update solutions, Plugsched allows developers to update the entire scheduler subsystem and modify internal scheduler data via the rebuilding technique. Moreover, an optimized stack inspection method is introduced to further effectively reduce the downtime due to the update. Experimental and production results show that Plugsched can effectively update kernel scheduler lively and the downtime is less than tens of milliseconds.

Publisher's Version Published Artifact Artifacts Available
eHDL: Turning eBPF/XDP Programs into Hardware Designs for the NIC
Alessandro Rivitti ORCID logo, Roberto Bifulco ORCID logo, Angelo Tulumello ORCID logo, Marco Bonola ORCID logo, and Salvatore Pontarelli ORCID logo
(Axbryd, Italy; University of Rome Tor Vergata, Italy; NEC Laboratories Europe, Germany; CNIT, Italy; Sapienza University of Rome, Italy)
Scaling network packet processing performance to meet the increasing speed of network ports requires software programs to carefully leverage the network devices’ hardware features. This is a complex task for network programmers, who need to learn and deal with the heterogeneity of device architectures, and re-think their software to leverage them. In this paper we make first steps to reverse this design process, enabling the automatic generation of tailored hardware designs starting from a network packet processing program. We introduce eHDL, a high-level synthesis tool that automatically generates hardware pipelines from unmodified Linux’s eBPF/XDP programs. eHDL is designed to enable software developers to directly define and implement the hardware functions they need in the NIC. We prototype eHDL targeting a Xilinx Alveo U50 FPGA NIC, and evaluate it with a set of 5 eBPF/XDP programs. Our results show that the generated pipelines are efficient in terms of required hardware resources, using only 6.5%-13.3% of the FPGA, and always achieve the line rate forwarding throughput with about 1 microsecond of per-packet forwarding latency. Compared to other network-specific high-level synthesis tool, eHDL enables software programmers with no hardware expertise to describe stateful functions that operate on the entire packet data. Compared to alternative processor-based solutions that perform eBFP/XDP offloading to a NIC, eHDL provides 10-100x higher throughput.

Publisher's Version
Exit-Less, Isolated, and Shared Access for Virtual Machines
Kenichi Yasukata ORCID logo, Hajime Tazaki ORCID logo, and Pierre-Louis Aublin ORCID logo
(IIJ Research Laboratory, Japan)
This paper explores Exit-Less, Isolated, and Shared Access (ELISA), a novel in-memory object sharing scheme for Virtual Machines (VMs). ELISA has the isolation advantage over the shared memory directly exposed to guest VMs while its overhead is smaller than that of host-interposition relying on the costly exit from the VM context. In a nutshell, ELISA isolates shared in-memory objects by Extended Page Table (EPT) separation, and a guest VM accesses them by switching the EPT context using VMFUNC, a low-overhead CPU instruction of Intel CPUs. Our experiment shows that the overhead of ELISA is 3.5 times smaller than that of VMCALL-oriented host-interposition. We demonstrate the benefits of ELISA through two use cases; by replacing VMCALL with ELISA, a VM networking system and an in-memory key-value store exhibit 163% and 64% higher performance respectively.

Publisher's Version
Finding Unstable Code via Compiler-Driven Differential Testing
Shaohua Li ORCID logo and Zhendong Su ORCID logo
(ETH Zurich, Switzerland)
Unstable code refers to code that has inconsistent or unstable run-time semantics due to undefined behavior (UB) in the program. Compilers exploit UB by assuming that UB never occurs, which allows them to generate efficient but potentially semantically inconsistent binaries. Practitioners have put great research and engineering effort into designing dynamic tools such as sanitizers for frequently occurring UBs. However, it remains a big challenge how to detect UBs that are beyond the reach of current techniques.
In this paper, we introduce compiler-driven differential testing (CompDiff), a simple yet effective approach for finding unstable code in C/C++ programs. CompDiff relies on the fact that when compiling unstable code, different compiler implementations may produce semantically inconsistent binaries. Our main approach is to examine the outputs of different binaries on the same input. Discrepancies in outputs may signify the existence of unstable code. To detect unstable code in real-world programs, we also integrate CompDiff into AFL++, the most widely-used and actively-maintained general-purpose fuzzer.
Despite its simplicity, CompDiff is effective in practice: on the Juliet benchmark programs, CompDiff uniquely detected 1,409 bugs compared to sanitizers; on 23 popular open-source C/C++ projects, CompDiff-AFL++ uncovered 78 new bugs, 52 of which have been fixed by developers and 36 cannot be detected by sanitizers. Our evaluation also reveals the fact that CompDiff is not designed to replace current UB detectors but to complement them.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Flexagon: A Multi-dataflow Sparse-Sparse Matrix Multiplication Accelerator for Efficient DNN Processing
Francisco Muñoz-Martínez ORCID logo, Raveesh Garg ORCID logo, Michael Pellauer ORCID logo, José L. AbellánORCID logo, Manuel E. Acacio ORCID logo, and Tushar Krishna ORCID logo
(Universidad de Murcia, Spain; Georgia Institute of Technology, USA; NVIDIA, USA)
Sparsity is a growing trend in modern DNN models.
Existing Sparse-Sparse Matrix Multiplication (SpMSpM) accelerators are tailored to a particular SpMSpM dataflow (i.e., Inner Product, Outer Product or Gustavson's), which determines their overall efficiency. We demonstrate that this static decision inherently results in a suboptimal dynamic solution. This is because different SpMSpM kernels show varying features (i.e., dimensions, sparsity pattern, sparsity degree), which makes each dataflow better suited to different data sets.
In this work we present Flexagon, the first SpMSpM reconfigurable accelerator that is capable of performing SpMSpM computation by using the particular dataflow that best matches each case. Flexagon accelerator is based on a novel Merger-Reduction Network (MRN) that unifies the concept of reducing and merging in the same substrate, increasing efficiency. Additionally, Flexagon also includes a new L1 on-chip memory organization, specifically tailored to the different access characteristics of the input and output compressed matrices. Using detailed cycle-level simulation of contemporary DNN models from a variety of application domains, we show that Flexagon achieves average performance benefits of 4.59x, 1.71x, and 1.35x with respect to the state-of-the-art SIGMA-like, SpArch-like and GAMMA-like accelerators (265%, 67%, and 18%, respectively, in terms of average performance/area efficiency).

Publisher's Version
Going beyond the Limits of SFI: Flexible and Secure Hardware-Assisted In-Process Isolation with HFI
Shravan Narayan ORCID logo, Tal Garfinkel ORCID logo, Mohammadkazem Taram ORCID logo, Joey Rudek ORCID logo, Daniel Moghimi ORCID logo, Evan Johnson ORCID logo, Chris Fallin ORCID logo, Anjo Vahldiek-Oberwagner ORCID logo, Michael LeMay ORCID logo, Ravi Sahita ORCID logo, Dean Tullsen ORCID logo, and Deian Stefan ORCID logo
(University of California at San Diego, San Diego, USA; University of Texas at Austin, Austin, USA; Purdue University, USA; Fastly, USA; Intel Labs, Germany; Intel Labs, USA; Rivos, USA)
We introduce Hardware-assisted Fault Isolation (HFI), a simple extension to existing processors to support secure, flexible, and efficient in-process isolation. HFI addresses the limitations of existing software-based isolation (SFI) systems including: runtime overheads, limited scalability, vulnerability to Spectre attacks, and limited compatibility with existing code. HFI can seamlessly integrate with current SFI systems (e.g., WebAssembly), or directly sandbox unmodified native binaries. To ease adoption, HFI relies only on incremental changes to the data and control path of existing high-performance processors. We evaluate HFI for x86-64 using the gem5 simulator and compiler-based emulation on a mix of real and synthetic workloads.

Publisher's Version
GRACE: A Scalable Graph-Based Approach to Accelerating Recommendation Model Inference
Haojie Ye ORCID logo, Sanketh Vedula ORCID logo, Yuhan Chen ORCID logo, Yichen Yang ORCID logo, Alex Bronstein ORCID logo, Ronald Dreslinski ORCID logo, Trevor Mudge ORCID logo, and Nishil Talati ORCID logo
(University of Michigan, USA; Technion, Israel)
The high memory bandwidth demand of sparse embedding layers continues to be a critical challenge in scaling the performance of recommendation models. While prior works have exploited heterogeneous memory system designs and partial embedding sum memoization techniques, they offer limited benefits. This is because prior designs either target a very small subset of embeddings to simplify their analysis or incur a high processing cost to account for all embeddings, which does not scale with the large sizes of modern embedding tables. This paper proposes GRACE-a lightweight and scalable graph-based algorithm-system co-design framework to significantly improve the embedding layer performance of recommendation models. GRACE proposes a novel Item Co-occurrence Graph (ICG) that scalably records item co-occurrences. GRACE then presents a new system-aware ICG clustering algorithm to find frequently accessed item combinations of arbitrary lengths to compute and memoize their partial sums. High-frequency partial sums are stored in a software-managed cache space to reduce memory traffic and improve the throughput of computing sparse features. We further present a cache data layout and low-cost address computation logic to efficiently lookup item embeddings and their partial sums. Our evaluation shows that GRACE significantly outperforms the state-of-the-art techniques SPACE and MERCI by 1.5x and 1.4x, respectively.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Graphene: An IR for Optimized Tensor Computations on GPUs
Bastian Hagedorn ORCID logo, Bin Fan ORCID logo, Hanfeng Chen ORCID logo, Cris Cecka ORCID logo, Michael Garland ORCID logo, and Vinod Grover ORCID logo
(NVIDIA, Germany; NVIDIA, USA)
Modern GPUs accelerate computations and data movements of multi-dimensional tensors in hardware. However, expressing optimized tensor computations in software is extremely challenging even for experts. Languages like CUDA C++ are centered around flat buffers in one-dimensional memory and lack reasonable abstractions for multi-dimensional data and threads. Existing tensor IRs are not expressive enough to represent the complex data-to-thread mappings required by the GPU tensor instructions.
In this paper, we introduce Graphene, an intermediate representation (IR) for optimized tensor computations on GPUs. Graphene is a low-level target language for tensor compilers and performance experts while being closer to the domain of tensor computations than languages offering the same level of control such as CUDA C++ and PTX. In Graphene, multi-dimensional data and threads are represented as first-class tensors. Graphene’s tensors are hierarchically decomposable into tiles allowing to represent optimized tensor computations as mappings between data and thread tiles.
We evaluate Graphene using some of the most important tensor computations in deep learning today, including GEMM, Multi-Layer Perceptron (MLP), Layernorm, LSTM, and Fused Multi-Head Attention (FMHA). We show that Graphene is capable of expressing all optimizations required to achieve the same practical peak performance as existing library implementations. Fused kernels beyond library routines expressed in Graphene significantly improve the end-to-end inference performance of Transformer networks and match or outperform the performance of cuBLAS(Lt), cuDNN, and custom handwritten kernels.

Publisher's Version
Heron: Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators
Jun Bi ORCID logo, Qi Guo ORCID logo, Xiaqing Li ORCID logo, Yongwei Zhao ORCID logo, Yuanbo Wen ORCID logo, Yuxuan Guo ORCID logo, Enshuai Zhou ORCID logo, Xing Hu ORCID logo, Zidong Du ORCID logo, Ling Li ORCID logo, Huaping Chen ORCID logo, and Tianshi Chen ORCID logo
(University of Science and Technology of China, China; Institute of Computing Technology at Chinese Academy of Sciences, China; Cambricon Technologies, China; Institute of Software at Chinese Academy of Sciences, China)
Deep Learning Accelerators (DLAs) are effective to improve both performance and energy efficiency of compute-intensive deep learning algorithms. A flexible and portable mean to exploit DLAs is using high-performance software libraries with well-established APIs, which are typically either manually implemented or automatically generated by exploration-based compilation approaches. Though exploration-based approaches significantly reduce programming efforts, they fail to find optimal or near-optimal programs from a large but low-quality search space because the massive inherent constraints of DLAs cannot be accurately characterized.
In this paper, we propose Heron, a novel exploration-based approach, to efficiently generate high-performance libraries of DLAs. The key is to automatically (rather than manually) enforce massive sophisticated while accurate constraints through the entire program generation including constrained space generation and constrained space exploration. By conducting static analysis on compute, sophisticated constraints are automatically generated to properly characterize inherent constraints of DLAs, and thus greatly prune invalid program candidates to produce a high-quality constrained search space. To efficiently explore the resultant search space, we further propose a novel constraint-based genetic algorithm, which features that the evolutionary process is conducted on formulated constraint satisfactory problems instead of concrete solutions. Thus, the sophisticated constraints of the search space are strictly preserved during the entire exploration process. We conduct extensive experiments on 3 representative DLAs, i.e., NVIDIA TensorCore, Intel DL Boost Acceleration, and TVM Versatile Tensor Accelerator. Experimental results demonstrate that Heron averagely achieves 2.71x speedup over four state-of-the-art automatic generation approaches. Also, compared to vendor-provided hand-tuned libraries, Heron achieves 2.00x speedup on average.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
Homunculus: Auto-Generating Efficient Data-Plane ML Pipelines for Datacenter Networks
Tushar Swamy ORCID logo, Annus Zulfiqar ORCID logo, Luigi Nardi ORCID logo, Muhammad Shahbaz ORCID logo, and Kunle Olukotun ORCID logo
(Stanford University, USA; Purdue University, USA; Lund University, Sweden)
Support for Machine Learning (ML) applications in networking has significantly improved over the last decade. The availability of public datasets and programmable switching fabrics (including low-level languages to program them) presents a full-stack to the programmer for deploying in-network ML. However, the diversity of tools involved, coupled with complex optimization tasks of ML model design and hyperparameter tuning while complying with the network constraints (like throughput and latency), puts the onus on the network operator to be an expert in ML, network design, and programmable hardware.
We present Homunculus, a high-level framework that enables network operators to specify their ML requirements in a declarative rather than imperative way. Homunculus takes as input the training data and accompanying network and hardware constraints, and automatically generates and installs a suitable model onto the underlying switching target. It performs model design-space exploration, training, and platform code-generation as compiler stages, leaving network operators to focus on acquiring high-quality network data. Our evaluations on real-world ML applications show that Homunculus’s generated models achieve up to 12% better F1 scores compared to hand-tuned alternatives, while operating within the resource limits of the underlying targets. We further demonstrate the high performance and increased reactivity (seconds to nanoseconds) of the generated models on emerging per-packet ML platforms to showcase Homunculus’s timely and practical significance.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Hyperscale Hardware Optimized Neural Architecture Search
Sheng Li ORCID logo, Garrett Andersen ORCID logo, Tao Chen ORCID logo, Liqun Cheng ORCID logo, Julian Grady ORCID logo, Da Huang ORCID logo, Quoc V. Le ORCID logo, Andrew Li ORCID logo, Xin Li ORCID logo, Yang Li ORCID logo, Chen Liang ORCID logo, Yifeng Lu ORCID logo, Yun Ni ORCID logo, Ruoming Pang ORCID logo, Mingxing Tan ORCID logo, Martin Wicke ORCID logo, Gang Wu ORCID logo, Shengqi Zhu ORCID logo, Parthasarathy Ranganathan ORCID logo, and Norman P. Jouppi ORCID logo
(Google, USA; Apple, USA; Waymo, USA)
Recent advances in machine learning have leveraged dramatic increases in computational power, a trend expected to continue in the future. This paper introduces the first Hyperscale Hardware Optimized Neural Architecture Search (H2O-NAS) to automatically design accurate and performant machine learning models tailored to the underlying hardware architecture. H2O-NAS consists of three key components: a new massively parallel “one-shot” search algorithm with intelligent weight sharing, which can scale to search spaces of O(10280) and handle large volumes of production traffic; hardware-optimized search spaces for diverse ML models on heterogeneous hardware; and a novel two-phase hybrid performance model and a multi-objective reward function optimized for large scale deployments.
H2O-NAS has been implemented around state-of-the-art machine learning models (e.g. convolutional models, vision transformers, and deep learning recommendation models) and deployed at zettaflop scale in production. Our results demonstrate significant improvements in performance (22% ∼ 56%) and energy efficiency (17% ∼25%) at same or better quality. Our solution is designed for largescale deployment, streamlining privacy and security processes and reducing manual overhead. This facilitates a smooth and automated transition from research to production.

Publisher's Version
Infinity Stream: Portable and Programmer-Friendly In-/Near-Memory Fusion
Zhengrong Wang ORCID logo, Christopher Liu ORCID logo, Aman Arora ORCID logo, Lizy John ORCID logo, and Tony NowatzkiORCID logo
(University of California at Los Angeles, Los Angeles, USA; University of Texas at Austin, Austin, USA)
In-memory computing with large last-level caches is promising to dramatically alleviate data movement bottlenecks and expose massive bitline-level parallelization opportunities. However, key challenges from its unique execution model remain unsolved: automated parallelization, transparently orchestrating data transposition/alignment/broadcast for bit-serial logic, and mixing in-/near-memory computing. Most importantly, the solution should be programmer friendly and portable across platforms.
Our key innovation is an execution model and intermediate representation (IR) that enables hybrid CPU-core, in-memory, and near-memory processing. Our IR is the tensor dataflow graph (tDFG), which is a unified representation of in-memory and near-memory computation. The tDFG exposes tensor-data structure information so that the hardware and runtime can automatically orchestrate data management for bitserial execution, including runtime data layout transformations. To enable microarchitecture portability, we use a two-phase, JIT-based compilation approach to dynamically lower the the tDFG to in-memory commands.
Our design, infinity stream, is evaluated on a cycle-accurate simulator. Across data-processing workloads with fp32, it achieves 2.6× speedup and 75% traffic reduction over a state-of-the-art near-memory computing technique, with 2.4× energy efficiency.

Publisher's Version
In-Network Aggregation with Transport Transparency for Distributed Training
Shuo Liu ORCID logo, Qiaoling Wang ORCID logo, Junyi Zhang ORCID logo, Wenfei Wu ORCID logo, Qinliang Lin ORCID logo, Yao Liu ORCID logo, Meng Xu ORCID logo, Marco Canini ORCID logo, Ray C. C. Cheung ORCID logo, and Jianfei He ORCID logo
(Huawei Technologies, China; Peking University, China; Sun Yat-sen University, China; King Abdullah University of Science and Technology, Saudi Arabia; City University of Hong Kong, China)
Recent In-Network Aggregation (INA) solutions offload the all-reduce operation onto network switches to accelerate and scale distributed training (DT). On end hosts, these solutions build custom network stacks to replace the transport layer. The INA-oriented network stack cannot take advantage of the state-of-the-art performant transport layer implementation, and also causes complexity in system development and operation.
We design a transport-transparent INA primitive named NetReduce for modern multi-rack data centers. NetReduce runs beneath the transport layer. The switch performs aggregation operations but preserves data transmission connections. The host uses RoCE as its transport layer to deliver gradient messages and receive aggregation results. NetReduce achieves performance gains from both INA and RoCE: linear scalability, traffic reduction, and bandwidth freeing-up from INA — high throughput, low latency, and low CPU overhead from RoCE. For jobs spanning several multi-GPU machines, we also devise parallel all-reduce based on NetReduce to make use of intra-machine and inter-machine bandwidth efficiently. We prototype NetReduce on an FPGA board attached to an Ethernet switch. We compare NetReduce with existing programmable switch-based solutions and justify the FPGA-based design choice. We evaluate NetReduce’s performance by training typical Deep Neural Network models on single-GPU and multi-GPU testbeds. NetReduce inter-operates with the existing Ethernet transport layer, is training-framework friendly, accelerates network-intensive DT jobs effectively (e.g., 70% for AlexNet), reduces CPU overheads (e.g., only one core for transmission), and is cost-effective (e.g., only 2.40% more capital expense and 0.68% more power consumption making 12.3-57.9% more performance acceleration).

Publisher's Version
Kodan: Addressing the Computational Bottleneck in Space
Bradley Denby ORCID logo, Krishna Chintalapudi ORCID logo, Ranveer Chandra ORCID logo, Brandon Lucia ORCID logo, and Shadi Noghabi ORCID logo
(Carnegie Mellon University, USA; Microsoft Research, USA)
Decreasing costs of deploying space vehicles to low-Earth orbit have fostered an emergence of large constellations of satellites. However, high satellite velocities, large image data quantities, and brief ground station contacts create a data downlink challenge. Orbital edge computing (OEC), which filters data at the space edge, addresses this downlink bottleneck but shifts the challenge to the inelastic computational capabilities onboard satellites. In this work, we present Kodan: an OEC system that maximizes the utility of saturated satellite downlinks while mitigating the computational bottleneck. Kodan consists of two phases. A one-time transformation step uses a reference implementation of a satellite data analysis application, along with a representative dataset, to produce specialized ML models targeted for deployment to the space edge. After deployment to a target satellite, a runtime system dynamically selects the best specialized models for each data sample to maximize valuable data downlinked within the constraints of the computational bottleneck. By intelligently filtering low-value data and prioritizing high-value data for transmit via the saturated downlink, Kodan increases the data value density between 89 and 97 percent.

Publisher's Version
LEGO: Empowering Chip-Level Functionality Plug-and-Play for Next-Generation IoT Devices
Chong Zhang ORCID logo, Songfan Li ORCID logo, Yihang Song ORCID logo, Qianhe Meng ORCID logo, Minghua Chen ORCID logo, YanXu Bai ORCID logo, Li Lu ORCID logo, and Hongzi Zhu ORCID logo
(University of Electronic Science and Technology of China, China; Shanghai Jiao Tong University, China)
Versatile Internet of Things (IoT) applications call for re-configurable IoT devices that can easily extend new functionality on demand. However, the heterogeneity of functional chips brings difficulties in device customization, leading to inadequate flexibility. In this paper, we propose LEGO, a novel architecture for chip-level re-configurable IoT devices that supports plug-and-play with Commercial Off-The-Shelf (COTS) chips. To combat the heterogeneity of functional chips, we first design a novel Unified Chip Description Language (UCDL) with meta-operation and chip specifications to access various types of functional chips uniformly. Then, to achieve chips plug-and-play, we build up a novel platform and shift all chip control logic to the gateway, which makes IoT devices entirely decoupled from specific applications and does not need to make any changes when plugging in new functional chips. Finally, to handle communications overheads, we built up a novel orchestration architecture for gateway instructions, which minimizes instruction transmission frequency in remote chip control. We implement the prototype and conduct extensive evaluations with 100+ types of COTS functional chips. The results show that new functional chips can be automatically accessed by the system within 0.13 seconds after being plugged in, and only bringing 0.53 kb of communication load on average, demonstrating the efficacy of LEGO design.

Publisher's Version
Mapping Very Large Scale Spiking Neuron Network to Neuromorphic Hardware
Ouwen Jin ORCID logo, Qinghui Xing ORCID logo, Ying LiORCID logo, Shuiguang Deng ORCID logo, Shuibing He ORCID logo, and Gang Pan ORCID logo
(Zhejiang University, China)
Neuromorphic hardware is a multi-core computer system specifically designed to run Spiking Neuron Network (SNN) applications. As the scale of neuromorphic hardware increases, it becomes very challenging to efficiently map a large SNN to hardware. In this paper, we proposed an efficient approach to map very large scale SNN applications to neuromorphic hardware, aiming to reduce energy consumption, spike latency, and on-chip network communication congestion. The approach consists of two steps. Firstly, it solves the initial placement using the Hilbert curve, a space-filling curve with unique properties that are particularly suitable for mapping SNNs. Secondly, the Force Directed (FD) algorithm is developed to optimize the initial placement. The FD algorithm formulates the connections of clusters as tension forces, thus converts the local optimization of placement as a force analysis problem. The proposed approach is evaluated with the scale of 4 billion neurons, which is more than 200 times larger than previous research. The results show that our approach achieves state-of-the-art performance, significantly exceeding existing approaches.

Publisher's Version
Mosaic Pages: Big TLB Reach with Small Pages
Krishnan Gosakan ORCID logo, Jaehyun Han ORCID logo, William Kuszmaul ORCID logo, Ibrahim N. Mubarek ORCID logo, Nirjhar Mukherjee ORCID logo, Karthik Sriram ORCID logo, Guido Tagliavini ORCID logo, Evan West ORCID logo, Michael A. Bender ORCID logo, Abhishek Bhattacharjee ORCID logo, Alex Conway ORCID logo, Martin Farach-Colton ORCID logo, Jayneel Gandhi ORCID logo, Rob Johnson ORCID logo, Sudarsun Kannan ORCID logo, and Donald E. Porter ORCID logo
(Rutgers University, USA; University of North Carolina at Chapel Hill, USA; Massachusetts Institute of Technology, USA; Carnegie Mellon University, USA; Yale University, USA; Stony Brook University, USA; VMware Research, USA; Meta, USA)
The TLB is increasingly a bottleneck for big data applications. In most designs, the number of TLB entries are highly constrained by latency requirements, and growing much more slowly than the working sets of applications. Many solutions to this problem, such as huge pages, perforated pages, or TLB coalescing, rely on physical contiguity for performance gains, yet the cost of defragmenting memory can easily nullify these gains. This paper introduces mosaic pages, which increase TLB reach by compressing multiple, discrete translations into one TLB entry. Mosaic leverages virtual contiguity for locality, but does not use physical contiguity. Mosaic relies on recent advances in hashing theory to constrain memory mappings, in order to realize this physical address compression without reducing memory utilization or increasing swapping. This paper presents a full-system prototype of Mosaic, in gem5 and modified Linux. In simulation and with comparable hardware to a traditional design, mosaic reduces TLB misses in several workloads by 6-81%. Our results show that Mosaic’s constraints on memory mappings do not harm performance, we never see conflicts before memory is 98% full in our experiments — at which point, a traditional design would also likely swap. Once memory is over-committed, Mosaic swaps fewer pages than Linux in most cases. Finally, we present timing and area analysis for a verilog implementation of the hashing function required on the critical path for the TLB, and show that on a commercial 28nm CMOS process; the circuit runs at a maximum frequency of 4 GHz, indicating that a mosaic TLB is unlikely to affect clock frequency.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
MP-Rec: Hardware-Software Co-design to Enable Multi-path Recommendation
Samuel Hsia ORCID logo, Udit Gupta ORCID logo, Bilge Acun ORCID logo, Newsha Ardalani ORCID logo, Pan Zhong ORCID logo, Gu-Yeon Wei ORCID logo, David Brooks ORCID logo, and Carole-Jean Wu ORCID logo
(Harvard University, USA; Meta AI, USA)
Deep learning recommendation systems serve personalized content under diverse tail-latency targets and input-query loads. In order to do so, state-of-the-art recommendation models rely on terabyte-scale embedding tables to learn user preferences over large bodies of contents. The reliance on a fixed embedding representation of embedding tables not only imposes significant memory capacity and bandwidth requirements but also limits the scope of compatible system solutions. This paper challenges the assumption of fixed embedding representations by showing how synergies between embedding representations and hardware platforms can lead to improvements in both algorithmic- and system performance. Based on our characterization of various embedding representations, we propose a hybrid embedding representation that achieves higher quality embeddings at the cost of increased memory and compute requirements. To address the system performance challenges of the hybrid representation, we propose MP-Rec — a co-design technique that exploits heterogeneity and dynamic selection of embedding representations and underlying hardware platforms.
On real system hardware, we demonstrate how matching custom accelerators, i.e., GPUs, TPUs, and IPUs, with compatible embedding representations can lead to 16.65× performance speedup. Additionally, in query-serving scenarios, MP-Rec achieves 2.49× and 3.76× higher correct prediction throughput and 0.19% and 0.22% better model quality on a CPU-GPU system for the Kaggle and Terabyte datasets, respectively.

Publisher's Version
NosWalker: A Decoupled Architecture for Out-of-Core Random Walk Processing
Shuke Wang ORCID logo, Mingxing Zhang ORCID logo, Ke Yang ORCID logo, Kang Chen ORCID logo, Shaonan Ma ORCID logo, Jinlei Jiang ORCID logo, and Yongwei Wu ORCID logo
(Tsinghua University, China; Beijing HaiZhi XingTu Technology, China)
Out-of-core random walk system has recently attracted a lot of attention as an economical way to run billions of walkers over large graphs. However, existing out-of-core random walk systems are all built upon general out-of-core graph processing frameworks, and hence do not take advantage of the unique properties of random walk applications. Different from traditional graph analysis algorithms, the sampling process of random walk can be decoupled from the processing of the walkers. It enables the system to reserve only pre-sample results in memory, which are typically much smaller than the entire edge set. Moreover, in random walk, it is not the number of walkers but the number of steps moved per second that dominates the overall performance. Thus, with independent walkers, there is no need to process all the walkers simultaneously.
In this paper, we present NosWalker, an out-of-core random walk system that replaces the graph oriented scheduling with a decoupled system architecture that provides walker oriented scheduling. NosWalker is able to adaptively generate walkers and flexibly adjust the distribution of reserved pre-sample results in memory. Instead of processing all the walkers at once, NosWalker only tries its best to keep a few walkers able to continuously move forward. Experimental results show that NosWalker can achieve up to two orders of magnitude speedup compared to state-of-the-art out-of-core random walk systems. In particular, NosWalker demonstrates superior performance when the memory capacity can only hold about 10%-50% of the graph data, which can be a common case when the user needs to run billions of walkers over large graphs.

Publisher's Version
Occamy: Elastically Sharing a SIMD Co-processor across Multiple CPU Cores
Zhongcheng Zhang ORCID logo, Yan Ou ORCID logo, Ying Liu ORCID logo, Chenxi WangORCID logo, Yongbin Zhou ORCID logo, Xiaoyu Wang ORCID logo, Yuyang Zhang ORCID logo, Yucheng Ouyang ORCID logo, Jiahao Shan ORCID logo, Ying Wang ORCID logo, Jingling Xue ORCID logo, Huimin Cui ORCID logo, and Xiaobing Feng ORCID logo
(Institute of Computing Technology at Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; HiSilicon Technologies, China; UNSW, Sydney, Australia)
SIMD extensions are widely adopted in multi-core processors to exploit data-level parallelism. However, when co-running workloads on different cores, compute-intensive workloads cannot take advantage of the underutilized SIMD lanes allocated to memoryintensive workloads, reducing the overall performance. This paper proposes Occamy, a SIMD co-processor that can be shared by multiple CPU cores, so that their co-running workloads can spatially share its SIMD lanes. The key idea is to enable elastic spatial sharing by dynamically partitioning all the SIMD lanes across different workloads based on their phase behaviors, so that each workload may execute in variable-length SIMD mode. We also introduce an Occamy compiler to support such variable-length vectorization by analyzing such phase behaviors and generating the vectorized code that works with varying vector lengths. We demonstrate that Occamy can improve SIMD utilization, and consequently, performance over three representative SIMD architectures, with negligible chip area cost.

Publisher's Version
Persistent Memory Disaggregation for Cloud-Native Relational Databases
Chaoyi Ruan ORCID logo, Yingqiang Zhang ORCID logo, Chao Bi ORCID logo, Xiaosong Ma ORCID logo, Hao Chen ORCID logo, Feifei Li ORCID logo, Xinjun Yang ORCID logo, Cheng Li ORCID logo, Ashraf Aboulnaga ORCID logo, and Yinlong Xu ORCID logo
(University of Science and Technology of China, China; Alibaba Group, China; Qatar Computing Research Institute, Qatar; Hamad Bin Khalifa University, Qatar)
The recent emergence of commodity persistent memory (PM) hardware has altered the landscape of the storage hierarchy. It brings multi-fold benefits to database systems, with its large capacity, low latency, byte addressability, and persistence. However, PM has not been incorporated into the popular disaggregated architecture of cloud-native databases.
In this paper, we present PilotDB, a cloud-native relational database designed to fully utilize disaggregated PM resources. PilotDB possesses a new disaggregated DB architecture that allows compute nodes to be computation-heavy yet data-light, as enabled by large buffer pools and fast data persistence offered by remote PMs. We then propose a suite of novel mechanisms to facilitate RDMA-friendly remote PM accesses and minimize operations involving CPUs on the computation-light PM nodes. In particular, PilotDB adopts a novel compute-node-driven log organization that reduces network/PM bandwidth consumption and a log-pull design that enables fast, optimistic remote PM reads aggressively bypassing the remote PM node CPUs. Evaluation with both standard SQL benchmarks and a real-world production workload demonstrates that PilotDB (1) achieves excellent performance as compared to the best-performing baseline using local, high-end resources, (2) significantly outperforms a state-of-the-art DRAM-disaggregation system and the PM-disaggregation solution adapted from it, (3) enables faster failure recovery and cache buffer warm-up, and (4) offers superior cost-effectiveness.

Publisher's Version
PipeSynth: Automated Synthesis of Microarchitectural Axioms for Memory Consistency
Chase Norman ORCID logo, Adwait Godbole ORCID logo, and Yatin A. Manerkar ORCID logo
(University of California at Berkeley, Berkeley, USA; University of Michigan, USA)
Formal verification can help ensure the correctness of today’s processors. However, such formal verification requires formal specifications of the processors being verified. Today, these specifications are mostly written by hand, which is tedious and error-prone. Furthermore, architects and hardware engineers generally do not have formal methods experience, making it even harder for them to write formal specifications. Existing methods for the automated synthesis of formal microarchitectural specifications utilise RTL implementations of processors for their synthesis, preventing their usage until RTL implementation of the processor has completed. This hampers the effectiveness of formal verification for processors, as catching design bugs pre-RTL can reduce verification overhead and overall development time.
In response, we present PipeSynth, an automated formal methodology and tool for the synthesis of µspec microarchitectural ordering axioms from small example programs (litmus tests) and microarchitectural execution traces. PipeSynth helps architects automatically generate formal specifications for their microarchitectures before RTL is even written, enabling greater use of formal verification on today’s microarchitectures. We evaluate PipeSynth’s capability to synthesise single axioms and multiple axioms at the same time across four microarchitectures. Our evaluated microarchitectures include an out-of-order processor and one with a non-traditional coherence protocol. In single-axiom synthesis, PipeSynth is capable of synthesising replacement axioms for 42 out of 46 axioms from our evaluated microarchitectures in under 2 hours per axiom. When doing multi-axiom synthesis, we are able to synthesise an entire microarchitectural specification for the in-order Multi-V-scale processor in under 1 hour, and can synthesise at least 4 nontrivial axioms at the same time for our other microarchitectures.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Protect the System Call, Protect (Most of) the World with BASTION
Christopher Jelesnianski ORCID logo, Mohannad Ismail ORCID logo, Yeongjin Jang ORCID logo, Dan Williams ORCID logo, and Changwoo Min ORCID logo
(Virginia Tech, USA; Oregon State University, USA)
System calls are a critical building block in many serious security attacks, such as control-flow hijacking and privilege escalation attacks. Security-sensitive system calls (e.g., execve, mprotect), especially play a major role in completing attacks. Yet, few defense efforts focus to ensure their legitimate usage, allowing attackers to maliciously leverage system calls in attacks.
In this paper, we propose a novel System Call Integrity, which enforces the correct use of system calls throughout runtime. We propose three new contexts enforcing (1) which system call is called and how it is invoked (Call Type), (2) how a system call is reached (Control Flow), and (3) that arguments are not corrupted (Argument Integrity). Our defense mechanism thwarts attacks by breaking the critical building block in their attack chains.
We implement BASTION, as a compiler and runtime monitor system, to demonstrate the efficacy of the three system call contexts. Our security case study shows that BASTION can effectively stop all the attacks including real-world exploits and recent advanced attack strategies. Deploying BASTION on three popular system call-intensive programs, NGINX, SQLite, and vsFTPd, we show BASTION is secure and practical, demonstrating overhead of 0.60%, 2.01%, and 1.65%, respectively.

Publisher's Version
Re-architecting I/O Caches for Emerging Fast Storage Devices
Mohammadamin Ajdari ORCID logo, Pouria Peykani Sani ORCID logo, Amirhossein Moradi ORCID logo, Masoud Khanalizadeh Imani ORCID logo, Amir Hossein Bazkhanei ORCID logo, and Hossein Asadi ORCID logo
(HPDS Research, Iran; Sharif University of Technology, Iran)
I/O caching has widely been used in enterprise storage systems to enhance the system performance with minimal cost. Using Solid-State Drives (SSDs) as an I/O caching layer on the top of arrays of Hard Disk Drives (HDDs) has been well studied in numerous studies. With emergence of ultra fast storage devices, recent studies suggest to use them as an I/O cache layer on top of mainstream SSDs in I/O intensive applications. Our detailed analysis shows despite significant potential of ultra-fast storage devices, existing I/O cache architectures may act as a major performance bottleneck in enterprise storage systems, which prevents to take advantage of the device full performance potentials.
In this paper, using an enterprise-grade all-flash storage system, we first present a thorough analysis on the performance of I/O cache modules when ultra-fast memories are used as a caching layer on top of mainstream SSDs. Unlike traditional SSD-based caching on HDD arrays, we show the use of ultra-fast memory as an I/O cache device on SSD arrays exhibit completely unexpected performance behavior. As an example, we show two popular cache architectures exhibit similar throughput due to performance bottleneck on the traditional SSD/HDD devices, but with ultra-fast memory on SSD arrays, their true potential is released and show 5× performance difference. We then propose an experimental evaluation framework to systematically examine the behavior of I/O cache modules on emerging ultra-fast devices. Our framework enables system architects to examine performance-critical design choices including multi-threading, locking granularity, promotion logic, cache line size, and flushing policy. We further offer several optimizations using the proposed framework, integrate the proposed optimizations, and evaluate them on real use-cases. The experiments on an industry-grade storage system show our I/O cache architecture optimally configured by the proposed framework provides up to 11× higher throughput and up to 30× improved tail latency over a non-optimal architecture.

Publisher's Version
Reconfigurable Virtual Memory for FPGA-Driven I/O
Joshua Landgraf ORCID logo, Matthew Giordano ORCID logo, Esther Yoon ORCID logo, and Christopher J. Rossbach ORCID logo
(University of Texas at Austin, USA; Katana Graph, USA)
FPGAs are increasingly used to accelerate modern applications, and cloud providers offer FPGA platforms on-demand with a variety of FPGAs, I/O peripherals, and memory options. FPGA vendors expose I/O with low-level interfaces that limit application portability. Current approaches to abstracting these interfaces trade level of abstraction against performance.
We present FSRF, File System for Reconfigurable Fabrics, which abstracts FPGA I/O at a high level without sacrificing performance. Rather than exposing platform-specific I/O interfaces, FSRF enables files to be mapped directly into FPGA virtual memory from the host. On the FPGA, powerful OS-managed virtual memory hardware provides performant access to FPGA-local resources and helps coordinate access to remote data. FSRF leverages reconfigurability to specialize its virtual memory implementation to applications, including selecting between SRAM and DRAM TLBs, adapting FPGA DRAM striping, and tuning DMA I/O. On Amazon F1 FPGAs, FSRF outperforms techniques from FPGA OSes such as Coyote and AmorphOS with improvements of up to 64× and 2.3×, respectively (+75% and +27% geometric mean), and performance close to that of physical addressing (90% geometric mean).

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
RepCut: Superlinear Parallel RTL Simulation with Replication-Aided Partitioning
Haoyuan Wang ORCID logo and Scott Beamer ORCID logo
(University of California at Santa Cruz, Santa Cruz, USA)
Register transfer level (RTL) simulation is an invaluable tool for developing, debugging, verifying, and validating hardware designs. Despite the parallel nature of hardware, existing parallel RTL simulators yield speedups unattractive for practical application due to high communication and synchronization costs incurred by typical circuit topologies.
We present RepCut, a novel parallel RTL simulation methodology. RepCut is enabled by our replication-aided partitioning approach that cuts the circuit into balanced partitions with small overlaps. By replicating the overlaps, RepCut eliminates problematic data dependences between partitions and significantly reduces expensive synchronization overhead between parallel threads. RepCut outperforms state-of-the-art simulators, and when simulating a large system-on-chip with multiple out-of-order cores, it achieves a 27.10× speedup (superlinear) using 24 threads with only a 3.81% replication cost.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Rosebud: Making FPGA-Accelerated Middlebox Development More Pleasant
Moein Khazraee ORCID logo, Alex Forencich ORCID logo, George C. Papen ORCID logo, Alex C. Snoeren ORCID logo, and Aaron Schulman ORCID logo
(Massachusetts Institute of Technology, USA; University of California at San Diego, San Diego, USA)
We introduce an approach to designing FPGA-accelerated middleboxes that simplifies development, debugging, and performance tuning by decoupling the tasks of hardware-accelerator implementation and software-application programming. Rosebud is a framework that links hardware accelerators to a high-performance packet processing pipeline through a standardized hardware/software interface. This separation of concerns allows hardware developers to focus on optimizing custom accelerators while freeing software programmers to reuse, configure, and debug accelerators in a fashion akin to software libraries. We show the benefits of the Rosebud framework by building a firewall based on a large blacklist and porting the Pigasus IDS pattern-matching accelerator in less than a month. Our experiments demonstrate that Rosebud delivers high performance, serving ∼200 ‍Gbps of traffic while adding only 0.7–7 microseconds of latency.

Publisher's Version Published Artifact Info Artifacts Available Artifacts Functional
Simulator Independent Coverage for RTL Hardware Languages
Kevin LaeuferORCID logo, Vighnesh Iyer ORCID logo, David Biancolin ORCID logo, Jonathan Bachrach ORCID logo, Borivoje Nikolić ORCID logo, and Koushik Sen ORCID logo
(University of California at Berkeley, Berkeley, USA; SiFive, USA; JITX, USA)
We demonstrate a new approach to implementing automated coverage metrics including line, toggle, and finite state machine coverage. Each metric is implemented through a compiler pass with a report generator. They are decoupled from the backend simulation, emulation, or formal verification tool through a simple API designed around a single new cover primitive. Our prototype for the Chisel hardware construction language demonstrates support across three software simulators, the FPGA-accelerated FireSim simulator and a formal tool. We demonstrate collecting line coverage while booting Linux with FireSim at a target frequency of 65MHz. By construction, coverage can be trivially merged across backends.

Publisher's Version Published Artifact Info Artifacts Available Artifacts Functional Results Reproduced
Skybox: Open-Source Graphic Rendering on Programmable RISC-V GPUs
Blaise Tine ORCID logo, Varun Saxena ORCID logo, Santosh Srivatsan ORCID logo, Joshua R. Simpson ORCID logo, Fadi Alzammar ORCID logo, Liam Cooper ORCID logo, and Hyesoon Kim ORCID logo
(Georgia Institute of Technology, USA; California Polytechnic State University, USA)
Graphics rendering remains one of the most compute intensive and memory bound applications of GPUs and has been driving their push for performance and energy efficiency since its inception. Early GPU architectures focused only on accelerating graphics rendering and implemented dedicated fixed- function rasterizer hardware to speed-up their rendering pipeline. As GPUs have become more programmable and ubiquitous in other application domains such as scientific computing, machine learning, graph analytics, and crypto-currency, generalizing GPU microarchitectures for area and power efficiency becomes necessary, especially for mobile and IoT devices. In this work, we present Skybox, a full-stack open-source GPU architecture with integrated software, compiler, hardware, and simulation environment, that enables end-to-end GPU research. Using Skybox, we explore the design space of software versus hardware graphics rendering and propose and hybrid micro-architecture that accelerates the state-of-the art Vulkan graphics API. Skybox also introduces novel compiler and system optimizations to support its unique RISC-V ISA baseline. We evaluated Skybox on high- end Altera and also Xilinx FPGAs. We were able to generate and execute a 32 cores (512 threads) Skybox graphics processor on Altera Stratix 10 FPGA, delivering a peak fill rate of 3.7 GPixels at 230 MHz. Skybox is the first open-source full-stack GPU software and hardware implementation that supports the Vulkan API

Publisher's Version
Snape: Reliable and Low-Cost Computing with Mixture of Spot and On-Demand VMs
Fangkai Yang ORCID logo, Lu Wang ORCID logo, Zhenyu Xu ORCID logo, Jue Zhang ORCID logo, Liqun Li ORCID logo, Bo Qiao ORCID logo, Camille Couturier ORCID logo, Chetan Bansal ORCID logo, Soumya Ram ORCID logo, Si Qin ORCID logo, Zhen Ma ORCID logo, Íñigo Goiri ORCID logo, Eli Cortez ORCID logo, Terry Yang ORCID logo, Victor Rühle ORCID logo, Saravan Rajmohan ORCID logo, Qingwei Lin ORCID logo, and Dongmei Zhang ORCID logo
(Microsoft Research, China; Microsoft 365, France; Microsoft 365, USA; Microsoft Azure, USA; Microsoft 365, China; Microsoft 365, UK)
Cloud providers often have resources that are not being fully utilized, and they may offer them at a lower cost to make up for the reduced availability of these resources. However, customers may be hesitant to use such offerings (such as spot VMs) as making trade-offs between cost and resource availability is not always straightforward. In this work, we propose Snape (Spot On-demand Perfect Mixture), an intelligent framework to optimize the cost and resource availability by dynamically mixing on-demand VMs with spot VMs. Through a detailed characterization based on real production traces, we verify that the eviction of spot VMs is predictable to some extent. Snape also leverages constrained reinforcement learning to adjust the mixture policy online. Experiments across different configurations show that Snape achieves 44% savings compared to using only on-demand VMs while maintaining 99.96% availability, which is 2.77% higher than using only spot VMs.

Publisher's Version
Space-Efficient TREC for Enabling Deep Learning on Microcontrollers
Jiesong Liu ORCID logo, Feng Zhang ORCID logo, Jiawei Guan ORCID logo, Hsin-Hsuan Sung ORCID logo, Xiaoguang Guo ORCID logo, Xiaoyong Du ORCID logo, and Xipeng ShenORCID logo
(Renmin University of China, China; North Carolina State University, USA)
Deploying deep neural networks (DNNs) for a resource-constrained environment and achieving satisfactory performance is challenging. It is especially so on microcontrollers for their stringent space and computing power. This paper focuses on new ways to make TREC, an optimization recently proposed to enable computation reuse in DNNs, space and time efficient on Microcontrollers. The solution maximizes the performance benefits while keeping the DNN accuracy stable. Experiments show that the solution eliminates over 96% computations in DNNs and makes them fit well into microcontrollers, producing 3.4-5× speedups with only marginal accuracy loss.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning
Zihao Ye ORCID logo, Ruihang Lai ORCID logo, Junru Shao ORCID logo, Tianqi Chen ORCID logo, and Luis Ceze ORCID logo
(University of Washington, USA; Carnegie Mellon University, USA; OctoML, USA)
Sparse tensors are rapidly becoming critical components of modern deep learning workloads. However, developing high-performance sparse operators can be difficult and tedious, and existing vendor libraries cannot satisfy the escalating demands from new operators. Sparse tensor compilers simplify the development of operators, but efficient sparse compilation for deep learning remains challenging because a single sparse format cannot maximize hardware efficiency, and single-shot compilers cannot keep up with latest hardware and system advances. In this paper, we observe that the key to addressing both these challenges is to leverage composable formats and composable transformations. We propose SparseTIR, a sparse tensor compilation abstraction that offers composable formats and composable transformations for deep learning workloads. SparseTIR constructs a search space over these composable components for performance tuning. With these improvements, SparseTIR obtains consistent performance speedups vs vendor libraries on GPUs for single operators: 1.20-2.34x for GNN operators, 1.05-2.98x for sparse attention operators, and 0.56-7.45x for sparse convolution operators. SparseTIR also accelerates end-to-end GNNs by 1.08-1.52x for GraphSAGE training, and 4.20-40.18x for RGCN inference.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development
Zujun Tan ORCID logo, Yebin Chon ORCID logo, Michael Kruse ORCID logo, Johannes Doerfert ORCID logo, Ziyang Xu ORCID logo, Brian Homerding ORCID logo, Simone Campanoni ORCID logo, and David I. August ORCID logo
(Princeton University, USA; Argonne National Laboratory, USA; Northwestern University, USA)
Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but profitability can be limited by not having facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
TeraHeap: Reducing Memory Pressure in Managed Big Data Frameworks
Iacovos G. KolokasisORCID logo, Giannos Evdorou ORCID logo, Shoaib Akram ORCID logo, Christos Kozanitis ORCID logo, Anastasios Papagiannis ORCID logo, Foivos S. Zakkak ORCID logo, Polyvios Pratikakis ORCID logo, and Angelos Bilas ORCID logo
(University of Crete, Greece; ICS-FORTH, Greece; Australian National University, Australia; Isovalent, USA; Red Hat, UK)
Big data analytics frameworks, such as Spark and Giraph, need to process and cache massive amounts of data that do not always fit on the managed heap. Therefore, frameworks temporarily move long-lived objects outside the managed heap (off-heap) on a fast storage device. However, this practice results in (1) high serialization/deserialization (S/D) cost and (2) high memory pressure when off-heap objects are moved back to the heap for processing.
In this paper, we propose TeraHeap, a system that eliminates S/D overhead and expensive GC scans for a large portion of the objects in big data frameworks. TeraHeap relies on three concepts. (1) It eliminates S/D cost by extending the managed runtime (JVM) to use a second high-capacity heap (H2) over a fast storage device. (2) It offers a simple hint-based interface, allowing big data analytics frameworks to leverage knowledge about objects to populate H2. (3) It reduces GC cost by fencing the garbage collector from scanning H2 objects while maintaining the illusion of a single managed heap.
We implement TeraHeap in OpenJDK and evaluate it with 15 widely used applications in two real-world big data frameworks, Spark and Giraph. Our evaluation shows that for the same DRAM size, TeraHeap improves performance by up to 73% and 28% compared to native Spark and Giraph, respectively. Also, it provides better performance by consuming up to 4.6× and 1.2× less DRAM capacity than native Spark and Giraph, respectively. Finally, it outperforms Panthera, a state-of-the-art garbage collector for hybrid memories, by up to 69%.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional
The Sparse Abstract Machine
Olivia Hsu ORCID logo, Maxwell Strange ORCID logo, Ritvik Sharma ORCID logo, Jaeyeon Won ORCID logo, Kunle Olukotun ORCID logo, Joel S. Emer ORCID logo, Mark A. Horowitz ORCID logo, and Fredrik Kjølstad ORCID logo
(Stanford University, USA; Massachusetts Institute of Technology, USA; NVIDIA, USA)
We propose the Sparse Abstract Machine (SAM), an abstract machine model for targeting sparse tensor algebra to reconfigurable and fixed-function spatial dataflow accelerators. SAM defines a streaming dataflow abstraction with sparse primitives that encompass a large space of scheduled tensor algebra expressions. SAM dataflow graphs naturally separate tensor formats from algorithms and are expressive enough to incorporate arbitrary iteration orderings and many hardware-specific optimizations. We also present Custard, a compiler from a high-level language to SAM that demonstrates SAM's usefulness as an intermediate representation. We automatically bind from SAM to a streaming dataflow simulator. We evaluate the generality and extensibility of SAM, explore the performance space of sparse tensor algebra optimizations using SAM, and show SAM's ability to represent dataflow hardware.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Towards an Adaptable Systems Architecture for Memory Tiering at Warehouse-Scale
Padmapriya Duraisamy ORCID logo, Wei Xu ORCID logo, Scott Hare ORCID logo, Ravi Rajwar ORCID logo, David Culler ORCID logo, Zhiyi Xu ORCID logo, Jianing Fan ORCID logo, Christopher KennellyORCID logo, Bill McCloskey ORCID logo, Danijela Mijailovic ORCID logo, Brian Morris ORCID logo, Chiranjit Mukherjee ORCID logo, Jingliang Ren ORCID logo, Greg Thelen ORCID logo, Paul Turner ORCID logo, Carlos Villavieja ORCID logo, Parthasarathy Ranganathan ORCID logo, and Amin Vahdat ORCID logo
(Google, USA; University of California at Berkeley, Berkeley, USA)
Fast DRAM increasingly dominates infrastructure spend in large scale computing environments and this trend will likely worsen without an architectural shift. The cost of deployed memory can be reduced by replacing part of the conventional DRAM with lower cost albeit slower memory media, thus creating a tiered memory system where both tiers are directly addressable and cached. But, this poses numerous challenges in a highly multi-tenant warehouse-scale computing setting. The diversity and scale of its applications motivates an application-transparent solution in the general case, adaptable to specific workload demands.
This paper presents TMTS(Transparent Memory Tiering System), an application-transparent memory tiering management system that implements an adaptive, hardware-guided architecture to dynamically optimize access to the various directly-addressed memory tiers without faults. TMTS has been deployed at scale for two years serving thousands of production services, successfully meeting service level objectives (SLOs) across diverse application classes in the fleet. The solution is developed in terms of system level metrics it seeks to optimize and evaluated across the diverse workload mix to guide advanced policies embodied in a user-level agent. It sustains less than 5% overall performance degradation while replacing 25% of DRAM with a much slower medium.

Publisher's Version
TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory
Hasan Al Maruf ORCID logo, Hao Wang ORCID logo, Abhishek Dhanotia ORCID logo, Johannes Weiner ORCID logo, Niket Agarwal ORCID logo, Pallab Bhattacharya ORCID logo, Chris Petersen ORCID logo, Mosharaf Chowdhury ORCID logo, Shobhit Kanaujia ORCID logo, and Prakash Chauhan ORCID logo
(University of Michigan, USA; NVIDIA, USA; Meta, USA)
The increasing demand for memory in hyperscale applications has led to memory becoming a large portion of the overall datacenter spend. The emergence of coherent interfaces like CXL enables main memory expansion and offers an efficient solution to this problem. In such systems, the main memory can constitute different memory technologies with varied characteristics. In this paper, we characterize memory usage patterns of a wide range of datacenter applications across the server fleet of Meta. We, therefore, demonstrate the opportunities to offload colder pages to slower memory tiers for these applications. Without efficient memory management, however, such systems can significantly degrade performance.
We propose a novel OS-level application-transparent page placement mechanism (TPP) for CXL-enabled memory. TPP employs a lightweight mechanism to identify and place hot/cold pages to appropriate memory tiers. It enables a proactive page demotion from local memory to CXL-Memory. This technique ensures a memory headroom for new page allocations that are often related to request processing and tend to be short-lived and hot. At the same time, TPP can promptly promote performance-critical hot pages trapped in the slow CXL-Memory to the fast local memory, while minimizing both sampling overhead and unnecessary migrations. TPP works transparently without any application-specific knowledge and can be deployed globally as a kernel release.
We evaluate TPP with diverse memory-sensitive workloads in the production server fleet with early samples of new x86 CPUs with CXL 1.1 support. TPP makes a tiered memory system performant as an ideal baseline (<1% gap) that has all the memory in the local tier. It is 18% better than today’s Linux, and 5–17% better than existing solutions including NUMA Balancing and AutoTiering. Most of the TPP patches have been merged in the Linux v5.18 release while the remaining ones are just pending for more discussion.

Publisher's Version
Transparent Runtime Change Handling for Android Apps
Zizhan Chen ORCID logo and Zili Shao ORCID logo
(Chinese University of Hong Kong, China)
Mobile devices often face runtime configuration changes, such as screen orientation changes, screen resizing, and language switching. The current Android design adopts a restarting-based solution to load the corresponding resources according to the new configuration. Therefore, application developers must explicitly deal with state preservation and restoration brought about by the activity restarting. Otherwise, the runtime change will cause state loss and even app crash issues. To solve the runtime change handling issues, we propose RCHDroid, a transparent runtime change handling approach for apps at the Android system level. When a configuration change occurs, we do not restart the activity, but instead, create a new activity according to the new configuration and migrate states from the old one to the new one, while putting the old activity into an inactive mode. We propose a lazy-migration scheme to handle asynchronous tasks that remain working on the old activity, which migrates the result events when the asynchronous tasks return. We have implemented a prototype of RCHDroid with real hardware and released the source code for public access. Overall, using RCHDroid, existing apps can handle runtime configuration changes without any modifications and save the runtime change handling time by 25.46%.

Publisher's Version Published Artifact Info Artifacts Available Artifacts Functional Results Reproduced
Untangle: A Principled Framework to Design Low-Leakage, High-Performance Dynamic Partitioning Schemes
Zirui Neil Zhao ORCID logo, Adam MorrisonORCID logo, Christopher W. FletcherORCID logo, and Josep TorrellasORCID logo
(University of Illinois at Urbana-Champaign, USA; Tel Aviv University, Israel)
Partitioning a hardware structure dynamically among multiple security domains leaks some information but can deliver high performance. To understand the performance-security tradeoff of dynamic partitioning, it would be useful to formally quantify the leakage of these schemes. Unfortunately, this is hard, as what partition resizing decisions are made and when they are made are entangled.
In this paper, we present Untangle, a novel framework for constructing low-leakage and high-performance dynamic partitioning schemes. Untangle formally splits the leakage into leakage from deciding what resizing action to perform (action leakage) and leakage from deciding when the resizing action occurs (scheduling leakage). Based on this breakdown, Untangle introduces a set of principles that decouple program timing from the action leakage. Moreover, Untangle introduces a new way to model the scheduling leakage without analyzing program timing. With these techniques, Untangle quantifies the leakage in a dynamic resizing scheme more tightly than prior work. To demonstrate Untangle, we apply it to dynamically partition the last-level cache. On average, workloads leak 78% less under Untangle than under a conventional dynamic partitioning approach, for the same workload performance.

Publisher's Version
Verification of Nondeterministic Quantum Programs
Yuan Feng ORCID logo and Yingte Xu ORCID logo
(University of Technology Sydney, Australia)
Nondeterministic choice is a useful program construct that provides a way to describe the behaviour of a program without specifying the details of possible implementations. It supports the stepwise refinement of programs, a method that has proven useful in software development. Nondeterminism has also been introduced in quantum programming, and termination of nondeterministic quantum programs has been extensively analysed. In this paper, we go beyond termination analysis to investigate the verification of nondeterministic quantum programs where properties are given by sets of hermitian operators on the associated Hilbert space. Hoare-type logic systems for partial and total correctness are proposed which turn out to be both sound and relatively complete with respect to their corresponding semantic correctness. To show the utility of these proof systems, we analyse some quantum algorithms such as quantum error correction scheme, Deutsch algorithm, and a nondeterministic quantum walk. Finally, a proof assistant prototype is implemented to aid in the automated reasoning of nondeterministic quantum programs.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced
Vidi: Record Replay for Reconfigurable Hardware
Gefei ZuoORCID logo, Jiacheng MaORCID logo, Andrew QuinnORCID logo, and Baris KasikciORCID logo
(University of Michigan, USA; University of California at Santa Cruz, Santa Cruz, USA; Google, USA)
Developers are turning to heterogeneous computing devices, such as Field Programmable Gate Arrays (FPGAs), to accelerate data center and cloud computing workloads. FPGAs enable rapid prototyping and should facilitate an agile software-like development workflow to fix correctness bugs, performance issues, and security vulnerabilities. Unfortunately, hardware development still does not have a vast ecosystem of tools needed to support the agile hardware development vision. The capability to record and replay FPGA executions would constitute a key building block that will inspire the development of many tools, similar to what record/replay did for software. However, building a practical record/replay tool for FPGA is challenging; existing approaches either record too much or too little information and cannot support real-world executions.
In this paper, we present VIDI, the first record/replay system for real-world FPGA applications running on hardware. VIDI is based on the observation that widely-used communication protocols have well-defined input/output transactions to hide cycle-specific information from developers, which enables a more efficient design than heavyweight cycle-accurate record/replay approaches. VIDI proposes (1) the transaction determinism insight to track and enforce only necessary orderings of transaction events across record and replay, and (2) the coarse-grained input recording mechanism to record transaction-level information. We evaluate VIDI on Amazon EC2 F1 instances with 10 applications and two use cases (debugging, testing) and find that it incurs on average low performance slowdown (1.98%) and resource overhead (5.48%), making it practical for real-world deployments.

Publisher's Version Published Artifact Artifacts Available Artifacts Functional Results Reproduced

proc time: 11.15