27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2022)
Powered by
Conference Publishing Consulting

27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2022), February 28 – March 4, 2022, Lausanne, Switzerland

ASPLOS 2022 – Preliminary Table of Contents

Contents - Abstracts - Authors


Title Page

Message from the Chairs




Debugging in the Brave New World of Reconfigurable Hardware
Jiacheng MaORCID logo, Gefei ZuoORCID logo, Kevin Loughlin, Haoyang Zhang, Andrew Quinn, and Baris KasikciORCID logo
(University of Michigan, USA; University of California at Santa Cruz, USA)
Software and hardware development cycles have traditionally been quite distinct. Software allows post-deployment patches, which leads to a rapid development cycle. In contrast, hardware bugs that are found after fabrication are extremely costly to fix (and sometimes even unfixable), so the traditional hardware development cycle involves massive investment in extensive simulation and formal verification. Reconfigurable hardware, such as a Field Programmable Gate Array (FPGA), promises to propel hardware development towards an agile software-like development approach, since it enables a hardware developer to patch bugs that are detected during on-chip testing or in production. Unfortunately, FPGA programmers lack bug localization tools amenable to this rapid development cycle, since past tools mainly find bugs via simulation and verification. To develop hardware bug localization tools for a rapid development cycle, a thorough understanding of the symptoms, root causes, and fixes of hardware bugs is needed.
In this paper, we first study bugs in existing FPGA designs and produce a testbed of reliably-reproducible bugs. We classify the bugs according to their intrinsic properties, symptoms, and root causes. We demonstrate that many hardware bugs are comparable to software bug counterparts, and would benefit from similar techniques for bug diagnosis and repair. Based upon our findings, we build a novel collection of hybrid static/dynamic program analysis and monitoring tools for debugging FPGA designs, showing that our tools enable a software-like development cycle by effectively reducing developers' manual efforts for bug localization.

Article Search
GenStore: An In-Storage Processing System for Genome Sequence Analysis
Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Arvid Gollwitzer, Ataberk Olgun, Haiyu Mao, Can Firtina, Damla Senol Cali, Nour Almadhoun Alserr, Rachata Ausavarungnirun, Nandita Vijaykumar, Mohammed Alser, and Onur Mutlu
(ETH Zurich, Switzerland; TOBB University of Economics and Technology, Turkey; Carnegie Mellon University, USA; King Mongkut's University of Technology North Bangkok, Thailand; University of Toronto, Canada)

Article Search
HAMMER: Boosting Fidelity of Noisy Quantum Circuits by Exploiting Hamming Behavior of Erroneous Outcomes
Swamit Tannu, Poulami Das, Ramin Ayanzadeh, and Moinuddin QureshiORCID logo
(University of Wisconsin-Madison, USA; Georgia Institute of Technology, USA)

Article Search
SOL: Safe On-Node Learning in Cloud Platforms
Yawen Wang, Daniel Crankshaw, Neeraja J. Yadwadkar, Daniel Berger, Christos Kozyrakis, and Ricardo Bianchini
(Stanford University, USA; Microsoft Research, USA; University of Texas at Austin, USA)
Cloud platforms run many software agents on each server node. These agents manage all aspects of node operation, and in some cases frequently collect data and make decisions. Unfortunately, their behavior is typically based on pre-defined static heuristics or offline analysis; they do not leverage on-node machine learning (ML). In this paper, we first characterize the spectrum of node agents in Azure, and identify the classes of agents that are most likely to benefit from on-node ML. We then propose SOL, an extensible framework for designing ML-based agents that are safe and robust to the range of failure conditions that occur in production. SOL provides a simple API to agent developers and manages the scheduling and running of the agent-specific functions they write. We illustrate the use of SOL by implementing three ML-based agents that manage CPU cores, node power, and memory placement. Our experiments show that (1) ML substantially improves our agents, and (2) SOL ensures that agents operate safely under a variety of failure conditions. We conclude that ML-based agents show significant potential and that SOL can help build them.

Article Search
SparseCore: Stream ISA and Processor Specialization for Sparse Computation
Gengyu Rao, Jingji Chen, Jason Yik, and Xuehai Qian
(University of Southern California, USA)
Computation on sparse data is becoming increasingly important for many applications. Recent sparse computation accelerators are designed for specific algorithm/application, making them inflexible with software optimizations. This paper proposes SparseCore, the first general-purpose processor extension for sparse computation that can flexibly accelerate complex code patterns and fast-evolving algorithms. We extend the instruction set architecture (ISA) to make stream or sparse vector first-class citizens, and develop efficient architectural components to support the stream ISA. The novel ISA extension intrinsically operates on streams, realizing both efficient data movement and computation. The simulation results show that SparseCore achieves significant speedups for sparse tensor computation and graph pattern computation.

Article Search
Accelerating Task-Parallel Workloads by Recovering Program Structure
Vidushi Dadu and Tony Nowatzki
(University of California at Los Angeles, USA)

Article Search
LILLIPUT: A Lightweight Low-Latency Lookup-Table Based Decoder for Near-Term Quantum Error Correction
Poulami Das, Aditya Locharla, and Cody Jones
(Georgia Institute of Technology, USA; Google, USA)

Article Search
ValueExpert: Exploring Value Patterns in GPU-Accelerated Applications
Keren Zhou, Yueming Hao, John Mellor-Crummey, Xiaozhu Meng, and Xu Liu
(Rice University, USA; North Carolina State University, USA; Oak Ridge National Laboratory, USA)
General-purpose GPUs have become common in modern computing systems to accelerate applications in many domains, including machine learning, high-performance computing, and autonomous driving. However, inefficiencies abound in GPU-accelerated applications, which prevent them from obtaining bare-metal performance. Performance tools play an important role in understanding performance inefficiencies in complex code bases. Many GPU performance tools pinpoint time-consuming code and provide high-level performance insights but overlook one important performance issue---value-related inefficiencies, which exist in many GPU code bases. In this paper, we present ValueExpert, a novel tool to pinpoint value-related inefficiencies in GPU applications.
ValueExpert monitors application execution to capture values produced and used by each load and store operation in GPU kernels, recognizes multiple value patterns, and provides intuitive optimization guidance. We address systemic challenges in collecting, maintaining, and analyzing voluminous performance data from many GPU threads to make ValueExpert applicable to complex applications. We evaluate ValueExpert on a wide range of well-tuned benchmarks and applications, including PyTorch, Darknet, LAMMPS, Castro, and many others. ValueExpert is able to identify previously unknown performance issues and provide suggestions for nontrivial performance improvements with typically less than five lines of code changes. We verify our optimizations with application developers and upstream fixes to their repositories.

Article Search
INFless: A Native Serverless System for Low-Latency, High-Throughput Inference
Yanan Yang, Laiping Zhao, Yiming Li, Huanyu Zhang, Jie Li, Mingyang Zhao, Xingzhen Chen, and Keqiu Li
(Tianjin University, China;, n.n.)

Article Search
SRAM Has No Chill: Exploiting Power Domain Separation to Steal On-Chip Secrets
Jubayer Mahmod and Matthew Hicks
(Virginia Tech, USA)
The abundance of embedded systems and smart devices increases the risk of physical memory disclosure attacks. One such classic non-invasive attack exploits dynamic RAM's temperature-dependent ability to retain information across power cycles---known as a cold boot attack. When exposed to low temperatures, DRAM cells preserve their state for a short time without power, mimicking non-volatile memories in that time frame. Attackers exploit this physical phenomenon to gain access to a system's secrets, leading to data theft from encrypted storage. To prevent cold boot attacks, programmers hide secrets on-chip in Static Random-Access Memory (SRAM); by construction, on-chip SRAM is isolated from external probing and has little intrinsic capacitance, making it robust against cold boot attacks.
While it is the case that SRAM protects against traditional cold boot attacks, we show that there is another way to retain information in on-chip SRAM across power cycles and software changes. This paper presents Volt Boot, an attack that demonstrates a vulnerability of on-chip volatile memories due to the physical separation common to modern system-on-chip power distribution networks. Volt Boot leverages asymmetrical power states (e.g., on vs. off) to force SRAM state retention across power cycles, eliminating the need for traditional cold boot attack enablers, such as low-temperature or intrinsic data retention time. Using several modern ARM Cortex-A devices, we demonstrate the effectiveness of the attack in caches, registers, and iRAMs. Unlike other forms of SRAM data retention attacks, Volt Boot retrieves data with 100% accuracy---without any complex post-processing.

Article Search
The Benefits of General-Purpose On-NIC Memory
Boris Pismenny, Liran Liss, Adam MorrisonORCID logo, and Dan Tsafrir
(Technion, Israel; NVIDIA, n.n.; Tel Aviv University, Israel; VMware Research, Israel)

Article Search
MineSweeper: A "clean sweep" for Drop-In Use-after-Free Prevention
Marton Erdos ORCID logo, Sam AinsworthORCID logo, and Timothy M. JonesORCID logo
(University of Cambridge, UK; University of Edinburgh, UK)
Low-level languages, which require manual memory management from the programmer, remain in wide use for performance-critical applications. Memory-safety bugs are common, and now a major source of exploits. In particular, a use-after-free bug occurs when an object is erroneously deallocated, whilst pointers to it remain active in memory, and those (dangling) pointers are later used to access the object. An attacker can reallocate the memory area backing an erroneously freed object, then overwrite its contents, injecting carefully chosen data into the host program, thus altering its execution and achieving privilege escalation.
We present MineSweeper, a system to mitigate use-after-free vulnerabilities by retaining freed allocations in a quarantine, until no pointers to them remain in program memory, thus preventing their reallocation until it is safe. MineSweeper performs efficient linear sweeps of memory to identify quarantined items that have no dangling pointers to them, and thus can be safely reallocated. This allows MineSweeper to be significantly more efficient than previous transitive marking procedure techniques.
MineSweeper, attached to JeMalloc, improves security at an acceptable overhead in memory footprint (11.1% on average) and an execution-time cost of only 5.4% (geometric mean for SPEC CPU2006), with 9.6% additional threaded CPU usage. These figures considerably improve on the state-of-the-art for non-probabilistic drop-in temporal-safety systems, and make MineSweeper the only such scheme suitable for deployment in real-world production environments.

Article Search
CoolEdge: Hotspot-Relievable Warm Water Cooling for Energy-Efficient Edge Datacenters
Qiangyu Pei ORCID logo, Shutong Chen ORCID logo, Qixia Zhang ORCID logo, Xinhui Zhu, Fangming LiuORCID logo, Ziyang Jia, Yishuo Wang, and Yongjie Yuan
(Huazhong University of Science and Technology, China)
As the computing frontier drifts to the edge, edge datacenters play a crucial role in supporting various real-time applications. Different from cloud datacenters, the requirements of proximity to end-users, high density, and heterogeneity, present new challenges to cool the edge datacenters efficiently. Although warm water cooling has become a promising cooling technique for this infrastructure, the one-size-fits-all cooling control would lower the cooling efficiency considerably because of the severe thermal imbalance across servers, hardware, and even inside one hardware component in an edge datacenter. In this work, we propose CoolEdge, a hotspot-relievable warm water cooling system for improving the cooling efficiency and saving costs of edge datacenters. Specifically, through the elaborate design of water circulations, CoolEdge can dynamically adjust the water temperature and flow rate for each heterogeneous hardware component to eliminate the hardware-level hotspots. By redesigning cold plates, CoolEdge can quickly disperse the chip-level hotspots without manual intervention. We further quantify the power saving achieved by the warm water cooling theoretically, and propose a custom-designed cooling solution to decide an appropriate water temperature and flow rate periodically. Based on a hardware prototype and real-world traces from SURFsara, the evaluation results show that CoolEdge reduces the cooling energy by 81.81% and 71.92%, respectively, compared with conventional and state-of-the-art water cooling systems.

Article Search
Vector Instruction Selection for Digital Signal Processors using Program Synthesis
Maaz Bin Safeer Ahmad, Andrew Adams, Shoaib Kamil, Alvin Cheung, and Alexander J. Root
(University of Washington, USA; Adobe, USA; University of California at Berkeley, USA; Massachusetts Institute of Technology, USA)

Article Search
Paulihedral: A Generalized Block-Wise Compiler Optimization Framework for Quantum Simulation Kernels
Gushu LiORCID logo, Anbang Wu, Yunong Shi, Ali Javadi-Abhari, Yufei Ding ORCID logo, and Yuan Xie
(University of California at Santa Barbara, USA; Amazon, USA; IBM, USA)
The quantum simulation kernel is an important subroutine appearing as a very long gate sequence in many quantum programs. In this paper, we propose Paulihedral, a block-wise compiler framework that can deeply optimize this subroutine by exploiting high-level program structure and optimization opportunities. Paulihedral first employs a new Pauli intermediate representation that can maintain the high-level semantics and constraints in quantum simulation kernels. This naturally enables new large-scale optimizations that are hard to implement at the low gate-level. In particular, we propose two technology-independent instruction scheduling passes, and two technology-dependent code optimization passes which reconcile the circuit synthesis, gate cancellation, and qubit mapping stages of the compiler. Experimental results show that Paulihedral can outperform state-of-the-art compiler infrastructures in a wide-range of applications on both near-term superconducting quantum processors and future fault-tolerant quantum computers.

Article Search
Randomized Row-Swap: Mitigating Row Hammer by Breaking Spatial Correlation between Aggressor and Victim Rows
Gururaj SaileshwarORCID logo, Bolin Wang, Moinuddin QureshiORCID logo, and Prashant Nair
(Georgia Institute of Technology, USA; University of British Columbia, Canada)

Article Search
FaaSFlow: Enable Efficient Workflow Execution for Function-as-a-Service
Zijun Li ORCID logo, Yushi Liu ORCID logo, Linsong GuoORCID logo, Quan Chen ORCID logo, Jiagan Cheng ORCID logo, Wenli Zheng ORCID logo, and Minyi Guo ORCID logo
(Shanghai Jiao Tong University, China)
Serverless computing (Function-as-a-Service) provides fine-grain resource sharing by running functions (or Lambdas) in containers. Data-dependent functions are required to be invoked following a pre-defined logic, which is known as serverless workflows. However, our investigation shows that the traditional master-worker based workflow execution architecture performs poorly in serverless context. One significant overhead results from the master-side workflow schedule pattern, with which the functions are triggered in the master node and assigned to worker nodes for execution. Besides, the data movement between workers also reduces the throughput.
To this end, we present a worker-side workflow schedule pattern for serverless workflow execution. Following the design, we implement FaaSFlow to enable efficient workflow execution in the serverless context. Besides, we propose an adaptive storage library FaaStore that enables fast data transfer between functions on the same node without through the database. Experiment results show that FaaSFlow effectively mitigates the workflow scheduling overhead by 74.6% on average and data transmission overhead by 95% at most. When the network bandwidth fluctuates, FaaSFlow-FaaStore reduces the throughput degradation by 23.0%, and is able to multiply the utilization of network bandwidth by 1.5X-4X.

Article Search
Every Walk’s a Hit: Making Page Walks Single-Access Cache Hits
Chang Hyun ParkORCID logo, Ilias Vougioukas ORCID logo, Andreas Sandberg ORCID logo, and David Black-Schaffer ORCID logo
(Uppsala University, Sweden; Arm Research, UK)
As memory capacity has outstripped TLB coverage, large data applications suffer from frequent page table walks. We investigate two complementary techniques for addressing this cost: reducing the number of accesses required and reducing the latency of each access. The first approach is accomplished by opportunistically "flattening" the page table: merging two levels of traditional 4 KB page table nodes into a single 2 MB node, thereby reducing the table's depth and the number of indirections required to traverse it. The second is accomplished by biasing the cache replacement algorithm to keep page table entries during periods of high TLB miss rates, as these periods also see high data miss rates and are therefore more likely to benefit from having the smaller page table in the cache than to suffer from increased data cache misses.
We evaluate these approaches for both native and virtualized systems and across a range of realistic memory fragmentation scenarios, describe the limited changes needed in our kernel implementation and hardware design, identify and address challenges related to self-referencing page tables and kernel memory allocation, and compare results across server and mobile systems using both academic and industrial simulators for robustness.
We find that flattening does reduce the number of accesses required on a page walk (to 1.0), but its performance impact (+2.3%) is small due to Page Walker Caches (already 1.5 accesses). Prioritizing caching has a larger effect (+6.8%), and the combination improves performance by +9.2%. Flattening is more effective on virtualized systems (4.4 to 2.8 accesses, +7.1% performance), due to 2D page walks. By combining the two techniques we demonstrate a state-of-the-art +14.0% performance gain and -8.7% dynamic cache energy and -4.7% dynamic DRAM energy for virtualized execution with very simple hardware and software changes.

Article Search
JSONSki: Streaming Semi-structured Data with Bit-Parallel Fast-Forwarding
Lin Jiang and Zhijia Zhao
(University of California at Riverside, USA)
Semi-structured data, like JSON, is the de facto standard for data communication on the Web and data representation in document-based data stores. Streaming analytics combines semi-structured data parsing and query evaluation into one pass to avoid generating any parse tree. Though promising, its conventional design requires to parse the data stream in detail character by character, which fundamentally limits the efficiency of streaming analytics.
This work addresses the above challenge by revealing a wide range of opportunities to fast-forward over some data substructures that are irrelevant to the query evaluation. However, identifying the irrelevant substructures itself may need detailed parsing. To resolve such a “chicken-egg” issue, an alternative yet faster way to identifying the irrelevant substructures is desired. For this purpose, this work presents a highly bit-parallel solution that intensively utilizes bitwise and SIMD operations to identify the irrelevant substructures during the streaming. It includes a new streaming model—recursive-descent streaming, for an easy adoption of various fast-forward optimizations, a concept—structural intervals, for partitioning the data stream, and a group of bit-parallel algorithms for implementing different fast-forward cases. The solution is packaged into a JSON streaming framework, called JSONSki. It offers a set of APIs that can be integrated into the streaming to dynamically fast-forward over different cases of irrelevant substructures. Evaluation using real-world datasets and standard path queries shows that JSONSki can achieve significant speedups over existing JSON processing tools while taking a minimum memory footprint.

Article Search
Parallel Virtualized Memory Translation
Jovan Stojkovic, Dimitrios Skarlatos, Apostolos Kokolis, Tianyin Xu ORCID logo, and Josep Torrellas
(University of Illinois at Urbana-Champaign, USA; Carnegie Mellon University, USA; Facebook, USA)
A major reason why nested or virtualized address translations are slow is because current systems organize page tables in a multi-level tree that is accessed in a sequential manner. A nested translation may potentially require up to twenty-four sequential memory accesses. To address this problem, this paper presents the first page table design that supports parallel nested address translation. The design is based on using hashed page tables (HPTs) for both guest and host. However, directly extending a native HPT design to a nested environment leads to minor gains. Instead, our design solves a new set of challenges that appear in nested environments. Our scheme eliminates all but three of the potentially twenty-four sequential steps of a nested translation—while judiciously limiting the number of parallel memory accesses issued to avoid over-consuming cache bandwidth. As a result, compared to conventional nested radix tables, our design speeds-up the execution of a set of applications by an average of 1.19x (for 4KB pages) and 1.24x (when huge pages are used). In addition, we also show a migration path from current nested radix page tables to our design.

Article Search
Astraea: Towards QoS-Aware and Resource-Efficient Multi-stage GPU Services
Wei Zhang, Quan Chen ORCID logo, Kaihua Fu, Ningxin Zheng, Zhiyi Huang, Jingwen Leng ORCID logo, and Minyi Guo ORCID logo
(Shanghai Jiao Tong University, China; Microsoft Research, China; University of Otago, New Zealand)
Multi-stage user-facing applications on GPUs are widely-used nowa- days, and are often implemented to be microservices. Prior re- search works are not applicable to ensuring QoS of GPU-based microservices due to the different communication patterns and shared resource contentions. We propose Astraea to manage GPU microservices considering the above factors. In Astraea, a microser- vice deployment policy is used to maximize the supported peak service load while ensuring the required QoS. To adaptively switch the communication methods between microservices according to different deployments, we propose an auto-scaling GPU communi- cation framework. The framework automatically scales based on the currently used hardware topology and microservice location, and adopts global memory-based techniques to reduce intra-GPU communication. Astraea increases the supported peak load by up to 82.3% while achieving the desired 99%-ile latency target compared with state-of-the-art solutions.

Article Search
ProSE: The Architecture and Design of a Protein Discovery Engine
Eyes Robson ORCID logo, Ceyu XuORCID logo, and Lisa Wu WillsORCID logo
(University of California at Berkeley, USA; Duke University, USA)
Protein language models have enabled breakthrough approaches to protein structure prediction, function annotation, and drug discovery. A primary limitation to the widespread adoption of these powerful models is the high computational cost associated with the training and inference of these models, especially at longer sequence lengths. We present the architecture, microarchitecture, and hardware implementation of a protein design and discovery accelerator, ProSE (Protein Systolic Engine). ProSE has a collection of custom heterogeneous systolic arrays and special functions that process transfer learning model inferences efficiently. The architecture marries SIMD-style computations with systolic array architectures, optimizing coarse-grained operation sequences across model layers to achieve efficiency without sacrificing generality. ProSE performs Protein BERT inference at up to 6.9× speedup and 48× power efficiency (performance/Watt) compared to one NVIDIA A100 GPU. ProSE achieves up to 5.5 × (12.7×) speedup and 173× (249×) power efficiency compared to TPUv3 (TPUv2).

Article Search
AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures
Zhen Zheng, Xuanda Yang, Pengzhan Zhao, Guoping Long, Kai Zhu, Feiwen Zhu, Wenyi Zhao, Xiaoyong Liu, Jun Yang, Jidong Zhai, Shuaiwen Leon Song, and Wei Lin
(Alibaba Group, China; Alibaba, China; Tsinghua University, China; University of Sydney, Australia; University of Washington, USA)

Article Search
Pinned Loads: Taming Speculative Loads in Secure Processors
Zirui Neil Zhao ORCID logo, Houxiang Ji, Adam MorrisonORCID logo, Darko MarinovORCID logo, and Josep Torrellas
(University of Illinois at Urbana-Champaign, USA; Tel Aviv University, Israel)

Article Search
Memory-Harvesting VMs in Cloud Platforms
Alexander Fuerst ORCID logo, Stanko Novakovic, Inigo Goiri, Gohar Irfan Chaudhry, Prateek Sharma, Kapil Arya, Kevin Broas, Eugene Bak, Mehmet Iyigun, and Ricardo Bianchini
(Indiana University, USA; Microsoft Research, USA; Microsoft Azure, USA)
loud platforms monetize their spare capacity by renting “Spot” virtual machines (VMs) that can be evicted in favor of higher-priority VMs. Recent work has shown that resource-harvesting VMs are more effective at exploiting spare capacity than Spot VMs, while also reducing the number of evictions. However, the prior work focused on harvesting CPU cores while keeping memory size fixed. This wastes a substantial monetization opportunity and may even limit the ability of harvesting VMs to leverage spare cores. Thus, in this paper, we explore memory harvesting and its challenges in real cloud platforms, namely its impact on VM creation time, NUMA spanning, and page fragmentation. We start by characterizing the amount and dynamics of the spare memory in Azure. We then design and implement memory-harvesting VMs (MHVMs), introducing new techniques for memory buffering, batching, and pre-reclamation. To demonstrate the use of MHVMs, we also extend a popular cluster scheduling framework (Hadoop) and a FaaS platform to adapt to them. Our main results show that (1) there is plenty of scope for memory harvesting in real platforms; (2) MHVMs are effective at mitigating the negative impacts of harvesting; and (3) our extensions of Hadoop and FaaS successfully hide the MHVMs’ varying memory size from the users’ data-processing jobs and functions. We conclude that memory harvesting has great potential for practical deployment and users can save up to 93% of their costs when running workloads on MHVMs.

Article Search
Aries: A Data Plane Architecture for Per-Packet ML
Tushar Swamy, Alexander Rucker, Muhammad Shahbaz, Ishan Gaur, and Kunle Olukotun
(Stanford University, USA; Purdue University, USA)
Emerging applications---cloud computing, the internet of things, and augmented/virtual reality---demand responsive, secure, and scalable datacenter networks. These networks currently implement simple, per-packet, data-plane heuristics (e.g., ECMP and sketches) under a slow, millisecond-latency control plane that runs data-driven performance and security policies. However, to meet applications' service-level objectives (SLOs) in a modern data center, networks must bridge the gap between line-rate, per-packet execution and complex decision making.
In this work, we present the design and implementation of Taurus, a data plane for line-rate inference. Taurus adds custom hardware based on a flexible, parallel-patterns (MapReduce) abstraction to programmable network devices, such as switches and NICs; this new hardware uses pipelined SIMD parallelism to enable per-packet MapReduce operations (e.g., inference). Our evaluation of a Taurus switch ASIC---supporting several real-world models---shows that Taurus operates orders of magnitude faster than a server-based control plane while increasing area by 3.8% and latency for line-rate ML models by up to 221 ns. Furthermore, our Taurus FPGA prototype achieves full model accuracy and detects two orders of magnitude more events than a state-of-the-art control-plane anomaly-detection system.

Article Search Info
ImperIO: Block IO Control for Containers in the Datacenter
Tejun Heo, Dan Schatzberg, Andy Newell, Song Liu, Saravanan Dhakshinamurthy, Iyswarya Narayanan, Josef Bacik, Chris Mason, Chunqiang Tang, and Dimitrios Skarlatos
(Facebook, n.n.; Facebook, USA; Carnegie Mellon University, USA)
Resource isolation is a fundamental requirement in datacenter environments. However, our production experience in Meta’s large-scale datacenters shows that existing IO control mechanisms for block storage are inadequate in containerized environments. IO control needs to provide proportional resources to containers while taking into account the hardware heterogeneity of storage devices and the idiosyncrasies of the workloads deployed in datacenters. The speed of modern SSDs requires IO control to execute with low-overheads. Furthermore, IO control should strive for work conservation, take into account the interactions with the memory management subsystem, and avoid priority inversions that lead to isolation failures. To address these challenges, this paper presents IOCost, an IO control solution that is designed for containerized environments and provides scalable, work-conserving, and low-overhead IO control for heterogeneous storage devices and diverse workloads in datacenters. IOCost performs offline profiling to build a device model and uses it to estimate device occupancy of each IO request. To minimize runtime overhead, it separates IO control into a fast per-IO issue path and a slower periodic planning path. A novel work-conserving budget donation algorithm enables containers to dynamically share unused budget. We have deployed IOCost across the entirety of Meta’s datacenters comprised of millions of ma- chines, upstreamed IOCost to the Linux kernel, and open-sourced our device-profiling tools. IOCost has been running in production for two years, providing IO control for Meta’s fleet. We describe the design of IOCost and share our experience deploying it at scale.

Article Search
A One-for-All and $O(V\log(V))$-Cost Solution for Parallel Merge Style Operations on Sorted Key-Value Arrays
Bangyan Wang ORCID logo, Lei Deng, Fei Sun, Guohao Dai, Liu Liu ORCID logo, Yu Wang, and Yuan Xie
(University of California at Santa Barbara, USA; Tsinghua University, China; Alibaba DAMO Academy, n.n.)
The processing of sorted key-value arrays using a “merge style operation (MSO)” is a very basic and important problem in domains like scientific computing, deep learning, database, graph analysis, sorting, set-operation etc. MSOs dominate the execution time in some important applications like SpGEMM and graph mining. For example, sparse vector addition as an MSO takes up to 98% execution time in SpGEMM in our experiment. For this reason, accelerating MSOs on CPU, GPU, and accelerators using parallel execution has been extensively studied but the solutions in prior work have three major limitations. (1) They treat different MSOs as isolated problems using incompatible methods and an unified solution is still lacking. (2) They do not have the flexibility to support variable key/value sizes and value calculations in the runtime given a fixed hardware design. (3) They require a quadratic hardware cost (O(V2)) for given parallelism V in most cases.
To address above three limitations, we make the following efforts. (1) We present a one-for-all solution to support all interested MSOs based on a unified abstraction model “restricted zip machine (RZM)”. (2) We propose a set of composable and parallel primitives for RZM to provide the flexibility to support variable key/value sizes and value calculations. (3) We provide the hardware design to implement the proposed primitives using only O(Vlog(V)) resource. With the above techniques, a flexible and efficient solution for MSOs has been built. Our design can be used either as a drop-in replacement of the merge unit in prior accelerators to reduce the cost from O(V2) to O(Vlog(V)), or as an extension to the SIMD ISA of CPU and GPU. In our evaluation on CPU, when V=16 (512-bit SIMD, 32-bit element), we achieve significant speedup on a range of representative kernels including set operations (8.4×), database joins (7.3×), sparse vector/matrix/tensor addition/multiplication on real/complex numbers (6.5×), merge sort (8.0× over scalar, 3.4× over the state-of-the-art SIMD), and SpGEMM (4.4× over the best one in the baseline collection).

Article Search
Revizor: Testing Black-Box CPUs against Speculation Contracts
Oleksii Oleksenko, Christof Fetzer, Boris Köpf, and Mark Silberstein ORCID logo
(TU Dresden, Germany; Microsoft Research, UK; Technion, Israel)
Speculative vulnerabilities such as Spectre and Meltdown expose speculative execution state that can be exploited to leak information across security domains via side-channels. Such vulnerabilities often stay undetected for a long time as we lack the tools for systematic testing of CPUs to find them.
In this paper, we propose an approach to automatically detect microarchitectural information leakage in commercial black-box CPUs. We build on speculation contracts, which we employ to specify the permitted side effects of program execution on the CPU's microarchitectural state. We propose a Model-based Relational Testing (MRT) technique to empirically assess the CPU compliance with these specifications.
We implement MRT in a testing framework called Revizor, and showcase its effectiveness on real Intel x86 CPUs. Revizor automatically detects violations of a rich set of contracts, or indicates their absence. A highlight of our findings is that Revizor managed to automatically surface Spectre, MDS, and LVI, as well as several previously unknown variants.

Article Search
FINGERS: Exploiting Fine-Grained Parallelism in Graph Mining Accelerators
Qihang Chen, Boyu Tian, and Mingyu Gao ORCID logo
(Tsinghua University, China)
Graph mining is an emerging application of high importance and also with high complexity, thus requiring efficient hardware acceleration. Current accelerator designs only utilize coarse-grained parallelism, leaving large room for further optimizations. Our key insight is to fully exploit fine-grained parallelism to overcome the existing issues of hardware underutilization, inefficient resource provision, and limited single-thread performance under imbalanced loads. Targeting pattern-aware graph mining algorithms, we first comprehensively identify and analyze the abundant fine-grained parallelism at the branch, set, and segment levels during search tree exploration and set operations. We then propose a novel graph mining accelerator, FINGERS, which effectively exploits these multiple levels of fine-grained parallelism to achieve significant performance improvements. FINGERS mainly enhances the design of each single processing element with parallel compute units for set operations, and efficient techniques for task scheduling, load balancing, and data aggregation. FINGERS outperforms the state-of-the-art design by 2.8× on average and up to 8.9× with the same chip area. We also demonstrate that different patterns and different graphs exhibit drastically different parallelism opportunities, justifying the necessity of exploiting all levels of fine-grained parallelism in FINGERS.

Article Search
TMO: Transparent Memory Offloading in Heterogeneous Datacenters
Johannes Weiner, Niket Agarwal, Dan Schatzberg, Leon Yang, Hao Wang, Blaise Sanouillet, Bikash Sharma, Tejun Heo, Mayank Jain, Chunqiang Tang, and Dimitrios Skarlatos
(Facebook, n.n.; Facebook, USA; Carnegie Mellon University, USA)
The unrelenting growth of the memory needs of emerging datacenter applications, along with ever increasing cost and volatility of DRAM prices, has led to DRAM being a major infrastructure expense. Alternative technologies, such as NVMe SSDs and upcoming NVM devices, offer higher capacity than DRAM at a fraction of the cost and power. One promising approach is to transparently offload colder memory to cheaper memory technologies via kernel or hypervisor techniques. The key challenge, however, is to develop a datacenter-scale solution that is robust in dealing with diverse workloads and large performance variance of different offload devices such as compressed memory, SSD, and NVM. This paper presents TMO, Meta’s transparent memory offloading solution for heterogeneous datacenter environments. TMO introduces a new Linux kernel mechanism that directly measures in realtime the lost work due to resource shortage across CPU, memory, and I/O. Guided by this information and without any prior application knowledge, TMO automatically adjusts how much memory to offload to heterogeneous devices (e.g., compressed memory or SSD) according to the device’s performance characteristics and the application’s sensitivity to memory-access slowdown. TMO holistically identifies offloading opportunities from not only the application containers but also the sidecar containers that provide infrastructure-level functions. To maximize memory savings, TMO targets both anonymous memory and file cache, and balances the swap-in rate of anonymous memory and the reload rate of file pages that were recently evicted from the file cache. TMO has been running in production for more than a year, and has saved between 20-32% of the total memory across millions of servers in our large datacenter fleet. We have successfully upstreamed TMO into the Linux kernel.

Article Search
Serverless Computing on Heterogeneous Computers
Dong DuORCID logo, Qingyuan Liu, Xueqiang Jiang, Yubin Xia, Binyu Zang, and Haibo Chen ORCID logo
(Shanghai Jiao Tong University, China)
Existing serverless computing platforms are built upon homogeneous computers, limiting the function density and restricting serverless computing to limited scenarios. We introduce Molecule, the first serverless computing system utilizing heterogeneous computers. Molecule enables both general-purpose devices (e.g., Nvidia DPU) and domain-specific accelerators (e.g., FPGA and GPU) for serverless applications that significantly improve function density (50% higher) and application performance (up to 34.6x). To achieve these results, we first propose XPU-Shim, a distributed shim to bridge the gap between underlying multi-OS systems (when using general-purpose devices) and our serverless runtime (i.e., Molecule). We further introduce vectorized sandbox, a sandbox abstraction to abstract hardware heterogeneity (when using domain-specific accelerators). Moreover, we also review state-of-the-art serverless optimizations on startup and communication latency and overcome the challenges to implement them on heterogeneous computers. We have implemented Molecule on real platforms with Nvidia DPUs and Xilinx FPGAs and evaluate it using benchmarks and real-world applications.

Article Search
ShEF: Shielded Enclaves for Cloud FPGAs
Mark Zhao ORCID logo, Mingyu Gao ORCID logo, and Christos Kozyrakis
(Stanford University, USA; Tsinghua University, China)
FPGAs are now used in public clouds to accelerate a wide range of applications, including many that operate on sensitive data such as financial and medical records. We present ShEF, a trusted execution environment (TEE) for cloud-based reconfigurable accelerators. ShEF is independent from CPU-based TEEs and allows secure execution under a threat model where the adversary can control all software running on the CPU connected to the FPGA, has physical access to the FPGA, and can compromise the FPGA interface logic of the cloud provider. ShEF provides a secure boot and remote attestation process that relies solely on existing FPGA mechanisms for root of trust. It also includes a Shield component that provides secure access to data while the accelerator is in use. The Shield is highly customizable and extensible, allowing users to craft a bespoke security solution that fits their accelerator's memory access patterns, bandwidth, and security requirements at minimum performance and area overheads. We describe a prototype implementation of ShEF for existing cloud FPGAs, map ShEF to a performant and secure storage application, and measure the performance benefits of customizable security using five additional accelerators.

Article Search
A Tree Clock Data Structure for Causal Orderings in Concurrent Executions
Umang Mathur, Andreas PavlogiannisORCID logo, Hünkar Can Tunç ORCID logo, and Mahesh Viswanathan
(National University of Singapore, Singapore; Aarhus University, Denmark; University of Illinois at Urbana-Champaign, USA)

Article Search
NASPipe: High Performance and Reproducible Pipeline Parallel Supernet Training via Causal Synchronous Parallel
Shixiong Zhao, Fanxin Li, Xusheng Chen, Tianxiang Shen, Li Chen, Sen Wang, Nicholas Zhang, Cheng Li, and Heming Cui
(University of Hong Kong, China; Huawei Technologies, China; University of Science and Technology of China, China)

Article Search
EXAMINER: Automatically Locating Inconsistent Instructions between Real Devices and CPU Emulators for ARM
Muhui Jiang, Tianyi Xu, Yajin Zhou, Yufeng Hu, Ming Zhong, Lei Wu, Xiapu Luo, and Kui Ren
(Hong Kong Polytechnic University, China; Zhejiang University, China)
Emulators are widely used to build dynamic analysis frameworks due to its fine-grained tracing capability, full system monitoring functionality, and scalability of running on different operating systems and architectures. However, whether emulators are consistent with real devices is unknown. To understand this problem, we aim to automatically locate inconsistent instructions, which behave differently between emulators and real devices.
We target the ARM architecture, which provides machine-readable specifications. Based on the specification, we propose a sufficient test case generator by designing and implementing the first symbolic execution engine for the ARM architecture specification language (ASL). We generate 2,774,649 representative instruction streams and conduct differential testing between four ARM real devices in different architecture versions (i.e., ARMv5, ARMv6, ARMv7, and ARMv8) and three state-of-the-art emulators (i.e., QEMU, Unicorn, and Angr). We locate a huge number of inconsistent instruction streams (171,858 for QEMU, 223,264 for unicorn, and 120,169 for Angr). We find that undefined implementation in ARM manual and bugs of emulators are the major causes of inconsistencies. Furthermore, we discover 12 bugs, which influence commonly used instructions (e.g., BLX). With the inconsistent instructions, we build three security applications and demonstrate the capability of these instructions on detecting emulators, anti-emulation, and anti-fuzzing.

Article Search
Client-Optimized Algorithms and Acceleration for Encrypted Compute Offloading
McKenzie van der HagenORCID logo and Brandon Lucia ORCID logo
(Carnegie Mellon University, USA)
Homomorphic Encryption (HE) enables secure cloud offload processing on encrypted data. HE schemes are limited in the complexity and type of operations they can perform, motivating client-aided implementations that distribute computation between client (unencrypted) and server (encrypted). Prior client-aided systems optimize server performance, ignoring client costs: client-aided models put encryption and decryption on the critical path and require communicating large ciphertexts. We introduce Client-aided HE for Opaque Compute Offloading (CHOCO), a client-optimized system for encrypted offload processing. CHOCO reduces ciphertext size, reducing communication and computing costs through HE parameter minimization and through “rotational redundancy”, a new HE algorithm optimization. We present Client-aided HE for Opaque Compute Offloading Through Accelerated Cryptographic Operations (CHOCO-TACO), an accelerator for HE encryption and decryption, making client-aided HE feasible for even resource-constrained clients. CHOCO supports two popular HE schemes (BFV and CKKS) and several applications, including DNNs, PageRank, KNN, and K-Means. CHOCO reduces communication by up to 2948× over prior work. With CHOCO-TACO client enc-/decryption is up to 1094× faster and uses up to 648× less energy.

Article Search
DOTA: Detect and Omit Weak Attentions for Scalable Transformer Acceleration
Zheng Qu ORCID logo, Liu Liu ORCID logo, Fengbin Tu ORCID logo, Zhaodong Chen ORCID logo, Yufei Ding ORCID logo, and Yuan Xie ORCID logo
(University of California at Santa Barbara, USA)
Transformer Neural Networks have demonstrated leading performance in many applications spanning over language understanding, image processing, and generative modeling. Despite the impressive performance, long-sequence Transformer processing is expensive due to quadratic computation complexity and memory consumption of self-attention. In this paper, we present DOTA, an algorithm-architecture co-design that effectively addresses the challenges of scalable Transformer inference. Based on the insight that not all connections in an attention graph are equally important, we propose to jointly optimize a lightweight Detector with the Transformer model to accurately detect and omit weak connections during runtime. Furthermore, we design a specialized system architecture for end-to-end Transformer acceleration using the proposed attention detection mechanism. Experiments on a wide range of benchmarks demonstrate the superior performance of DOTA over other solutions. In summary, DOTA achieves 152.6x and 4.5x performance speedup and orders of magnitude energy-efficiency improvements over GPU and customized hardware, respectively.

Article Search
QUEST: Systematically Approximating Quantum Circuits for Higher Output Fidelity
Tirthak Patel, Ed Younis, Costin Iancu, Wibe de Jong, and Devesh Tiwari
(Northeastern University, USA; Lawrence Berkeley National Laboratory, USA)
We present QUEST, a procedure to systematically generate approximations for quantum circuits to reduce their CNOT gate count. Our approach employs circuit partitioning for scalability with procedures to 1) reduce circuit length using approximate synthesis, 2) improve fidelity by running circuits that represent key samples in the approximation space, and 3) reason about approximation upper bound. Our evaluation results indicate that our approach of "dissimilar" approximations provides close fidelity to the original circuit. Overall, the results indicate that QUEST can reduce CNOT gate count by 30-80% on ideal systems and decrease the impact of noise on existing and near-future quantum systems.

Article Search
PLD: Fast FPGA Compilation to Make Reconfigurable Acceleration Compatible with Modern Incremental Refinement Software Development
Yuanlong Xiao ORCID logo, Eric Micallef, Andrew Butt ORCID logo, Matthew Hofmann, Marc Alston, Matthew Goldsmith, Andrew Merczynski-Hait, and Andre DeHon ORCID logo
(University of Pennsylvania, USA)
FPGA-based accelerators are demonstrating significant absolute performance and energy efficiency compared with general-purpose CPUs. While FPGA computations can now be described in standard, programming languages, like C, development for FPGAs accelerators remains tedious and inaccessible to modern software engineers. Slow compiles (potentially taking tens of hours) inhibit the rapid, incremental refinement of designs that is the hallmark of modern software engineering. To address this issue, we introduce separate compilation and linkage into the FPGA design flow, providing faster design turns more familiar to software development. To realize this flow, we provide abstractions, compiler options, and compiler flow that allow the same C source code to be compiled to processor cores in seconds and to FPGA regions in minutes, providing the missing -O0 and -O1 options familiar in software development. This raises the FPGA programming level and standardizes the programming experience, bringing FPGA-based accelerators into a more familiar software platform ecosystem for software engineers.

Article Search
UCS²: Untrusted Cores in a Shared System
Nils Asmussen, Sebastian Haas, Carsten Weinhold, Till Miemietz, and Michael Roitzsch
(Barkhausen Institut, Germany)

Article Search
Enzian: An Open, General, CPU/FPGA Platform for Systems Software Research
David Cock, Mothy Roscoe, Gustavo Alonso, Abishek Ramdas, Reto Achermann, Adam Turowski, Anastasiia Ruzhanskaia, Dario Korolija, Nora Hossle, Kristina Martsenko, Melissa Licciardello, Michael Giardino, Daniel Schwyn, Zhenhao He, Lukas Humbel, and Mohsen Awaida
(ETH Zurich, Switzerland; University of British Columbia, Canada)

Article Search
NVAlloc: Rethinking Heap Metadata Management in Persistent Memory Allocators
Zheng Dang ORCID logo, Shuibing He, Peiyi Hong, Zhenxin Li, Xuechen Zhang, Xian-He Sun, and Gang Chen
(Zhejiang University, China; Washington State University, USA; Illinois Institute of Technology, USA)
Persistent memory allocation is a fundamental building block for developing high-performance and in-memory applications. Existing persistent memory allocators suffer from suboptimal heap organizations that introduce repeated cache line flushes and small random accesses in persistent memory. Worse, many allocators use static slab segregation resulting in a dramatic increase in memory consumption when allocation request size is changed. In this paper, we design a novel allocator, named NVAlloc, to solve the above issues simultaneously. First, NVAlloc eliminates cache line reflushes by mapping contiguous data blocks in slabs to interleaved metadata entries stored in different cache lines. Second, it writes small metadata units to a persistent bookkeeping log in a sequential pattern to remove random heap metadata accesses in persistent memory. Third, instead of using static slab segregation, it supports slab morphing, which allows slabs to be transformed between size classes to significantly improve slab usage. NVAlloc is complementary to the existing consistency models. Results on 6 benchmarks demonstrate that NVAlloc improves the performance of state-of-the-art persistent memory allocators by up to 6.4x and 57x for small and large allocations, respectively. Using NVAlloc reduces memory usage by up to 57.8%. Besides, we integrate NVAlloc in a persistent FPTree. Compared to the state-of-the-art allocators, NVAlloc improves the performance of this application by up to 3.1x.

Article Search
Understanding and Exploiting Optimal Function Inlining
Theodoros Theodoridis ORCID logo, Tobias Grosser ORCID logo, and Zhendong Su ORCID logo
(ETH Zurich, Switzerland; University of Edinburgh, UK)
Inlining is a core transformation in optimizing compilers. It replaces a function call (call site) with the body of the called function (callee). It helps reduce function call overhead and binary size, and more importantly, enables other optimizations. The problem of inlining has been extensively studied, but it is far from being solved; predicting which inlining decisions are beneficial is nontrivial due to interactions with the rest of the compiler pipeline. Previous work has mainly focused on designing heuristics for better inlining decisions and has not investigated optimal inlining, i.e., exhaustively finding the optimal inlining decisions. Optimal inlining is necessary for identifying and exploiting missed opportunities and evaluating the state of the art. This paper fills this gap through an extensive empirical analysis of optimal inlining using the SPEC2017 benchmark suite. Our novel formulation drastically reduces the inlining search space size (from 2349 down to 225) and allows us to exhaustively evaluate all inlining choices on 1,135 SPEC2017 files. We show a significant gap between the state-of-the-art strategy in LLVM and optimal inlining when optimizing for binary size, an important, deterministic metric independent of workload (in contrast to performance, another important metric). Inspired by our analysis, we introduce a simple, effective autotuning strategy for inlining that outperforms the state of the art by 7% on average (and up to 28%) on SPEC2017, 15% on the source code of LLVM itself, and 10% on the source code of SQLite. This work highlights the importance of exploring optimal inlining by providing new, actionable insight and an effective autotuning strategy that is of practical utility.

Article Search
CRISP: Critical Slice Prefetching
Heiner LitzORCID logo, Grant Ayers, and Parthasarathy Ranganathan
(University of California at Santa Cruz, USA; Google, USA)
The high access latency of DRAM continues to be a performance challenge for contemporary microprocessor systems. Prefetching is a well-established technique to address this problem, however, existing implemented designs fail to provide any performance benefits in the presence of irregular memory access patterns. The hardware complexity of prior techniques that can predict irregular memory accesses such as runahead execution has proven untenable for implementation in real hardware. We propose a lightweight mechanism to hide the high latency of irregular memory access patterns by leveraging criticality-based scheduling. In particular, our technique executes delinquent loads and their load slices as early as possible, hiding a significant fraction of their latency. Furthermore, we observe that the latency induced by branch mispredictions and other high latency instructions can be hidden with a similar approach. Our proposal only requires minimal hardware modifications by performing memory access classification, load and branch slice extraction, as well as priority analysis exclusively in software. As a result, our technique is feasible to implement, introducing only a simple new instruction prefix while requiring minimal modifications of the instruction scheduler. Our technique increases the IPC of memory-latency-bound applications by up to 38% and by 8.4% on average.

Article Search
BiSon-e: A Lightweight and High-Performance Accelerator for Narrow Integer Linear Algebra Computing on the Edge
Enrico Reggiani ORCID logo, Cristóbal Ramírez Lazo ORCID logo, Roger Figueras Bagué ORCID logo, Adrián Cristal ORCID logo, Mauro Olivieri ORCID logo, and Osman Sabri Unsal ORCID logo
(Barcelona Supercomputing Center, Spain; Sapienza University of Rome, Italy)
Linear algebra computational kernels based on byte and sub-byte integer data formats are at the base of many classes of applications, ranging from Deep Learning to Pattern Matching. Porting the computation of these applications from cloud to edge and mobile devices would enable significant improvements in terms of security, safety, and energy efficiency. However, despite their low memory and energy demands, their intrinsically high computational intensity makes the execution of these workloads challenging on highly resource-constrained devices. In this paper, we present BiSon-e, a novel RISC-V based architecture that accelerates linear algebra kernels based on narrow integer computations on edge processors by performing Single Instruction Multiple Data (SIMD) operations on off-the-shelf scalar Functional Units (FUs). Our novel architecture is built upon the binary segmentation technique, which allows to significantly reduce the memory footprint and the arithmetic intensity of linear algebra kernels requiring narrow data sizes. We integrate BiSon-e into a complete System-on-Chip (SoC) based on RISC-V, synthesized and Place&Routed in 65nm and 22nm technologies, introducing a negligible 0.07% area overhead with respect to the baseline architecture. Our experimental evaluation shows that, when computing the Convolution and Fully-Connected layers of the AlexNet and VGG-16 Convolutional Neural Networks (CNNs) with 8-, 4-, and 2-bit, our solution gains up to 5.6×, 13.9× and 24× in execution time compared to the scalar implementation of a single RISC-V core, and improves the energy efficiency of string matching tasks by 5× when compared to a RISC-V-based Vector Processing Unit (VPU).

Article Search
DAGguise: Mitigating Memory Timing Side Channels
Peter W. Deutsch ORCID logo, Yuheng Yang, Thomas Bourgeat, Jules Drean, Joel S. Emer, and Mengjia Yan
(Massachusetts Institute of Technology, USA; NVIDIA, USA)

Article Search
HeteroGen: Transpiling C to Heterogeneous HLS Code with Automated Test Generation and Program Repair
Qian Zhang, Jiyuan Wang, Guoqing Harry Xu, and Miryung Kim
(University of California at Los Angeles, USA)
Despite the trend of incorporating heterogeneity and specialization in hardware, the development of heterogeneous applications is limited to a handful of engineers with deep hardware expertise. We propose HeteroGen that takes C/C++ code as input and automatically generates an HLS version with test behavior preservation and better performance. Key to the success of HeteroGen is adapting the idea of search-based program repair to the heterogeneous computing domain, while addressing two technical challenges. First, the turn-around time of HLS compilation and simulation is much longer than the usual C/C++ compilation and execution time; therefore, HeteroGen applies pattern-oriented program edits guided by common fix patterns and their dependences. Second, behavior and performance checking requires testing, but test cases are often unavailable. Thus, HeteroGen auto-generates test inputs suitable for checking C to HLS-C conversion errors, while providing high branch coverage for the original C code.
An evaluation of HeteroGen shows that it produces an HLS-compatible version for nine out of ten real-world heterogeneous applications fully automatically, applying up to 438 lines of edits to produce an HLS version 1.63x faster than the original version.

Article Search Info
CryoWire: Wire-Driven Microarchitecture Designs for Cryogenic Computing
Dongmoon Min ORCID logo, Yujin Chung ORCID logo, Ilkwon Byun ORCID logo, Junpyo Kim ORCID logo, and Jangwoo Kim ORCID logo
(Seoul National University, South Korea)
Cryogenic computing, which runs a computer device at an extremely low temperature, is promising thanks to its significant reduction of wire resistance as well as leakage current. Recent studies on cryogenic computing have focused on various architectural units including the main memory, cache, and CPU core running at 77K. However, little research has been conducted to fully exploit the fast cryogenic wires, even though the slow wires are becoming more serious performance bottleneck in modern processors. In this paper, we propose a CPU microarchitecture which extensively exploits the fast wires at 77K. For this goal, we first introduce our validated cryogenic-performance models for the CPU pipeline and network on chip (NoC), whose performance can be significantly limited by the slow wires. Next, based on the analysis with the models, we architect CryoSP and CryoBus as our pipeline and NoC designs to fully exploit the fast wires. Our evaluation shows that our cryogenic computer equipped with both microarchitectures achieves 3.82 times higher system-level performance compared to the conventional computer system thanks to the 96% higher clock frequency of CryoSP and five times lower on-chip communication latency of CryoBus.

Article Search
IceBreaker: Warming Serverless Functions Better with Heterogeneity
Rohan Basu Roy, Tirthak Patel, and Devesh Tiwari
(Northeastern University, USA)
Serverless computing, an emerging computing model, relies on ``warming up'' functions prior to its anticipated execution for faster and cost-effective service to users. Unfortunately, warming up functions can be inaccurate and incur prohibitively expensive cost during the warmup period (i.e., keep-alive cost). In this paper, we introduce IceBreaker, a novel technique that reduces the service time and the ``keep-alive'' cost by composing a system with heterogeneous nodes (costly and cheaper). IceBreaker does so by dynamically determining the cost-effective node type to warm up a function based on the function's time-varying probability of the next invocation. By employing heterogeneity, IceBreaker allows for more number of nodes under the same cost budget and hence, keeps more number of functions warm and reduces the wait time during high load. Our real-system evaluation confirms that IceBreaker reduces the overall keep-alive cost by 45% and execution time by 27% using representative serverless applications and industry-grade workload trace. IceBreaker is the first technique to employ and leverage the idea of mixing expensive and cheaper nodes to improve both service time and keep-alive cost for serverless functions -- opening up a new research avenue of serverless computing on heterogeneous servers for researchers and practitioners.

Article Search
Tree Traversal Synthesis Using Domain-Specific Symbolic Compilation
Yanju Chen, Junrui Liu, Yu Feng, and Rastislav Bodik
(University of California at Santa Barbara, USA; University of Washington, USA)

Article Search
VELTAIR: Towards High-Performance Multi-tenant Deep Learning Services via Adaptive Compilation and Scheduling
Zihan Liu ORCID logo, Jingwen Leng ORCID logo, Zhihui Zhang ORCID logo, Quan Chen ORCID logo, Chao Li ORCID logo, and Minyi Guo ORCID logo
(Shanghai Jiao Tong University, China)
Deep learning (DL) models have achieved great success in many application domains. As such, many industrial companies such as Google and Facebook have acknowledged the importance of multi-tenant DL services. Although the multi-tenant service has been studied in conventional workloads, it is not been deeply studied on deep learning service, especially on general-purpose hardware.
In this work, we systematically analyze the opportunities and challenges of providing multi-tenant deep learning services on the general-purpose CPU architecture from the aspects of scheduling granularity and code generation. We propose an adaptive granularity scheduling scheme to both guarantee resource usage efficiency and reduce the scheduling conflict rate. We also propose an adaptive compilation strategy, by which we can dynamically and intelligently pick a program with proper exclusive and shared resource usage to reduce overall interference-induced performance loss. Compared to the existing works, our design can serve more requests under the same QoS target in various scenarios (e.g., +71%, +62%, +45% for light, medium, and heavy workloads, respectively), and reduce the averaged query latency by 50%.

Article Search
Who Goes First? Detecting Go Concurrency Bugs via Message Reordering
Ziheng Liu ORCID logo, Shihao Xia, Yu Liang, Linhai Song, and Hong Hu
(Pennsylvania State University, USA)

Article Search
GPUReplay: A 50-KB GPU Stack for Client ML
Heejin ParkORCID logo and Felix Xiaozhu Lin
(Purdue University, USA; University of Virginia, USA)
GPUReplay (GR) is a novel way for deploying GPU-accelerated computation on mobile and embedded devices. It addresses high complexity of a modern GPU stack for deployment ease and security. The idea is to record GPU executions on the full GPU stack ahead of time and replay the executions on new input at run time. We address key challenges towards making GR feasible, sound, and practical to use. The resultant replayer is a drop-in replacement of the original GPU stack. It is tiny (50 KB of executable), robust (replaying long executions without divergence), portable (running in a commodity OS, in TEE, and baremetal), and quick to launch (speeding up startup by up to two orders of magnitude). We show that GPUReplay works with a variety of integrated GPU hardware, GPU APIs, ML frameworks, and 33 neural network (NN) implementations for inference or training. The code is available at

Article Search
Efficiently Detecting Concurrency Bugs in Persistent Memory Programs
Zhangyu Chen, Yu Hua, Yongle Zhang, and Luochangqi Ding
(Huazhong University of Science and Technology, China; Purdue University, USA)
Due to the salient DRAM-comparable performance, TB-scale capacity, and non-volatility, persistent memory (PM) provides new opportunities for large-scale in-memory computing with instant crash recovery. However, programming PM systems is error-prone due to the existence of crash-consistency bugs, which are challenging to diagnose especially with concurrent programming widely adopted in PM applications to exploit hardware parallelism. Existing bug detection tools for DRAM-based concurrency issues cannot detect PM crash-consistency bugs because they are oblivious to PM operations and PM consistency. On the other hand, existing PM-specific debugging tools only focus on sequential PM programs and cannot effectively detect crash-consistency issues hidden in concurrent executions.
In order to effectively detect crash-consistency bugs that only manifest in concurrent executions, we propose PMRace, the first PM-specific concurrency bug detection tool. We identify and define two new types of concurrent crash-consistency bugs: PM Inter-thread Inconsistency and PM Synchronization Inconsistency. In particular, PMRace adopts PM-aware and coverage-guided fuzz testing to explore PM program executions. For PM Inter-thread Inconsistency, which denotes the data inconsistency hidden in thread interleavings, PMRace performs PM-aware interleaving exploration and thread scheduling to drive the execution towards executions that reveal such inconsistencies. For PM Synchronization Inconsistency between persisted synchronization variables and program data, PMRace identifies the inconsistency during interleaving exploration. The post-failure validation reduces the false positives that come from custom crash recovery mechanisms. PMRace has found 14 bugs (10 new bugs) in real-world concurrent PM systems including PM-version memcached.

Article Search
Invisible Bits: Hiding Secret Messages in SRAM’s Analog Domain
Jubayer Mahmod and Matthew Hicks
(Virginia Tech, USA)
Electronic devices are increasingly the subject of inspection by authorities. While encryption hides secret messages, it does not hide the transmission of those secret messages---in fact, it calls attention to them. Thus, an adversary, seeing encrypted data, turns to coercion to extract the credentials required to reveal the secret message. Steganographic techniques hide secret messages in plain sight, providing the user with plausible deniability, removing the threat of coercion.
This paper unveils Invisible Bits a new steganographic technique that hides secret messages in the analog domain of Static Random Access Memory (SRAM) embedded within a computing device. Unlike other memory technologies, the power-on state of SRAM reveals the analog-domain properties of its individual cells. We show how to quickly and systematically change the analog-domain properties of SRAM cells to encode data in the analog domain and how to reveal those changes by capturing SRAM's power-on state. Experiments with commercial devices show that Invisible Bits provides over 90% capacity---two orders-of-magnitude more than previous on-chip steganographic approaches, while retaining device functionality---even when the device undergoes subsequent normal operation or is shelved for months. Experiments also show that adversaries cannot differentiate between devices with encoded messages and those without. Lastly, we show how to layer encryption and error correction on top of our message encoding scheme in an end-to-end demonstration.

Article Search
Eavesdropping User Credentials via GPU Side Channels on Smartphones
Boyuan Yang ORCID logo, Ruirong Chen, Kai Huang, Jun Yang ORCID logo, and Wei GaoORCID logo
(University of Pittsburgh, USA)
Graphics Processing Unit (GPU) on smartphones is an effective target for hardware attacks. In this paper, we present a new side channel attack on mobile GPUs of Android smartphones, allowing an unprivileged attacker to eavesdrop the user's credentials, such as login usernames and passwords, from their inputs through on-screen keyboard. Our attack targets on Qualcomm Adreno GPUs and investigate the amount of GPU overdraw when rendering the popups of user's key presses of inputs. Such GPU overdraw caused by each key press corresponds to unique variations of selected GPU performance counters, from which these key presses can be accurately inferred. Experiment results from practical use on multiple models of Android smartphones show that our attack can correctly infer more than 80% of user's credential inputs, but incur negligible amounts of computing overhead and network traffic on the victim device. To counter this attack, this paper suggests mitigations of access control on GPU performance counters, or applying obfuscations on the values of GPU performance counters.

Article Search Video
GPM: Leveraging Persistent Memory from a GPU
Shweta Pandey, Aditya K Kamath, and Arkaprava Basu
(Indian Institute of Science, India)
The GPU is a key computing platform for many application domains. While the new non-volatile memory technology has brought the promise of byte-addressable persistence (a.k.a., persistent memory, or PM) to CPU applications, the same, unfortunately, is beyond the reach of GPU programs.
We take three key steps toward enabling GPU programs to access PM directly. First, enable direct access to PM from within a GPU kernel without needing to modify the hardware. Next, we demonstrate three classes of GPU-accelerated applications that benefit from PM. In the process, we create a workload suite with nine such applications. We then create a GPU library, written in CUDA, to support logging, checkpointing, and primitives for native persistence for programmers to easily leverage PM.

Article Search
FlexOS: Towards Flexible OS Isolation
Hugo Lefeuvre ORCID logo, Vlad-Andrei Bădoiu, Alexander Jung ORCID logo, Stefan Lucian Teodorescu, Sebastian Rauch, Felipe Huici, Costin Raiciu, and Pierre Olivier
(University of Manchester, UK; University Politehnica of Bucharest, Romania; Lancaster University, UK;, Germany; KIT, Germany; NEC Laboratories, Germany; Correct Networks, Romania)
At design time, modern operating systems are locked in a specific safety and isolation strategy that mixes one or more hardware/software protection mechanisms (e.g. user/kernel separation); revisiting these choices after deployment requires a major refactoring effort. This rigid approach shows its limits given the wide variety of modern applications' safety/performance requirements, when new hardware isolation mechanisms are rolled out, or when existing ones break.
We present FlexOS, a novel OS allowing users to easily specialize the safety and isolation strategy of an OS at compilation/deployment time instead of design time. This modular LibOS is composed of fine-grained components that can be isolated via a range of hardware protection mechanisms with various data sharing strategies and additional software hardening. The OS ships with an exploration technique helping the user navigate the vast safety/performance design space it unlocks. We implement a prototype of the system and demonstrate, for several applications (Redis/Nginx/SQLite), FlexOS' vast configuration space as well as the efficiency of the exploration technique: we evaluate 80 FlexOS configurations for Redis and show how that space can be probabilistically subset to the 5 safest ones under a given performance budget. We also show that, under equivalent configurations, FlexOS performs similarly or better than existing solutions which use fixed safety configurations.

Preprint Info
Creating Concise and Efficient Dynamic Analyses with ALDA
Xiang Cheng and David Devecsery
(Georgia Institute of Technology, USA)

Article Search
Suppressing ZZ Crosstalk of Quantum Computers through Pulse and Scheduling Co-Optimization
Lei Xie, Jidong Zhai, ZhenXing Zhang, Jonathan Allcock, Shengyu Zhang, and Yi-Cong Zheng ORCID logo
(Tsinghua University, China; Tencent Quantum Laboratory, China)
Noise is a significant obstacle to quantum computing, and ZZ crosstalk is one of the most destructive types of noise affecting superconducting qubits. Previous approaches to suppressing ZZ crosstalk have mainly relied on specific chip design that can complicate chip fabrication and aggravate decoherence. To some extent, special chip design can be avoided by relying on pulse optimization to suppress ZZ crosstalk. However, existing approaches are non-scalable, as their required time and memory grow exponentially with the number of qubits involved.
To address the above problems, we propose a scalable approach by co-optimizing pulses and scheduling. We optimize pulses to offer an ability to suppress ZZ crosstalk surrounding a gate, and then design scheduling strategies to exploit this ability and achieve suppression across the whole circuit. A main advantage of such co-optimization is that it does not require special hardware support. Besides, we implement our approach as a general framework that is compatible with different pulse optimization methods. We have conducted extensive evaluations by simulation and on a real quantum computer. Simulation results show that our proposal can improve the fidelity of quantum computing on 4∼12 qubits by up to 81× (11× on average). Ramsey experiments on a real quantum computer also demonstrate that our method can eliminate the effect of ZZ crosstalk to a great extent.

Article Search
Clio: A Hardware-Software Co-Designed Disaggregated Memory System
Zhiyuan Guo, Yizhou Shan, Xuhao Luo, Yutong Huang, and Yiying Zhang
(University of California at San Diego, USA)

Article Search
CirFix: Automatically Repairing Defects in Hardware Design Code
Hammad Ahmad, Yu Huang, and Westley Weimer
(University of Michigan, USA)
This paper presents CirFix, a framework for automatically repairing defects in hardware designs implemented in languages like Verilog. We propose a novel fault localization approach based on assignments to wires and registers, and a fitness function tailored to the hardware domain to bridge the gap between software-level automated program repair and hardware descriptions. We also present a benchmark suite of 32 defect scenarios corresponding to a variety of hardware projects. Overall, CirFix produces plausible repairs for 21/32 and correct repairs for 16/32 of the defect scenarios. This repair rate is comparable to that of successful program repair approaches for software, indicating CirFix is effective at bringing over the benefits of automated program repair to the hardware domain for the first time.

Article Search
Finding Missed Optimizations through the Lens of Dead Code Elimination
Theodoros Theodoridis ORCID logo, Manuel Rigger ORCID logo, and Zhendong Su ORCID logo
(ETH Zurich, Switzerland)
Compilers are foundational software development tools and incorporate increasingly sophisticated optimizations. Due to their complexity, it is difficult to systematically identify opportunities for improving them. Indeed, the automatic discovery of missed optimizations has been an important and significant challenge. The few existing approaches either cannot accurately pinpoint missed optimizations or target only specific analyses. This paper tackles this challenge by introducing a novel, effective approach that --- in a simple and general manner --- automatically identifies a wide range of missed optimizations. Our core insight is to leverage dead code elimination (DCE) to both analyze how well compilers optimize code and identify missed optimizations: (1) insert ``optimization markers'' in the basic blocks of a given program, (2) compute the program's live/dead basic blocks using the ``optimization markers'', and (3) identify missed optimizations from how well compilers eliminate dead blocks. We essentially exploit that, since DCE heavily depends on the rest of the optimization pipeline, through the lens of DCE, one can systematically quantify how well compilers optimize code. We conduct an extensive analysis of GCC and LLVM using our approach, which (1) provides quantitative and qualitative insights regarding their optimization capabilities, and (2) uncovers a diverse set of missed optimizations. Our results also lead to 84 bug reports for GCC and LLVM, of which 62 have already been confirmed or fixed, demonstrating our work's strong practical utility. We expect that the simplicity and generality of our approach will make it widely applicable for understanding compiler performance and finding missed optimizations. This work opens and initiates this promising direction.

Article Search
Temporal and SFQ Pulse-Streams Encoding for Area-Efficient Superconducting Accelerators
Patricia Gonzalez-GuerreroORCID logo, Meriam Gay Bautista, Darren Lyles, and George Michelogiannakis
(Lawrence Berkeley National Laboratory, USA)
Superconducting technology is a prime candidate for the future of computing. However, current superconducting prototypes are limited to small-scale examples due to stringent area constraints and complex architectures inspired from voltage-level encoding in CMOS; this is at odds with the ps-wide Single Quantum Flux (SFQ) pulses used in superconductors to carry information. In this work, we propose a wave-pipelined Unary SFQ (U-SFQ) architecture that leverages the advantages of two data representations: pulse-streams and Race Logic (RL). We introduce novel building blocks such as multipliers, adders, and memory cells, which leverage the natural properties of SFQ pulses to mitigate area constraints. We then design and simulate three popular hardware accelerators: i) a Processing Element (PE), typically used in spatial architectures; ii) A dot-product-unit (DPU), one of the most popular accelerators in artificial neural networks and digital signal processing (DSP); and iii) A Finite Impulse Response (FIR) filter, a popular and computationally demanding DSP accelerator. The proposed U-SFQ building blocks require up to 200× fewer JJs compared to their SFQ binary counterparts, exposing an area-delay trade-off. This work mitigates the stringent area constraints of superconducting technology.

Article Search
Yashme: Detecting Persistency Races
Hamed Gorjiara, Guoqing Harry Xu, and Brian Demsky
(University of California at Irvine, USA; University of California at Los Angeles, USA)
Persistent memory (PM) or Non-Volatile Random-Access Memory (NVRAM) hardware such as Intel’s Optane memory product promises to transform how programs store and manipulate information. Ensuring that persistent memory programs are crash consistent is a major challenge. We present a novel class of crash consistency bugs for persistent memory programs, which we call persistency races. Persistency races can cause non-atomic stores to be made partially persistent. Persistency races arise due to the interaction of standard compiler optimizations with persistent memory semantics.
We present Yashme, the first detector for persistency races. A major challenge is that in order to detect persistency races, the execution must crash in a very narrow window between a store with a persistency race and its corresponding cache flush operation, making it challenging for naïve techniques to be effective. Yashme overcomes this challenge with a novel technique for detecting races in executions that are prefixes of the pre-crash execution. This technique enables Yashme to effectively find persistency races even if the injected crashes do not fall into that window. We have evaluated Yashme on a range of persistent memory benchmarks and have found 26 real persistency races that have never been reported before.

Article Search Info
A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators
Dan Zhang, Safeen Huda, Ebrahim Songhori, Kartik Prabhu, Quoc Le, Anna Goldie, and Azalia Mirhoseini
(Google Brain, USA; Google, USA; Stanford University, USA)

Article Search
One Size Does Not Fit All: Security Hardening of MIPS Embedded Systems via Static Binary Debloating for Shared Libraries
Haotian Zhang ORCID logo, Mengfei Ren ORCID logo, Yu Lei, and Jiang Ming ORCID logo
(University of Texas at Arlington, USA)
Embedded systems have become prominent targets for cyberattacks. To exploit firmware’s memory corruption vulnerabilities, cybercriminals harvest reusable code gadgets from the large shared library codebase (e.g., uClibc). Unfortunately, unlike their desktop counterparts, embedded systems lack essential computing resources to enforce security hardening techniques. Recently, we have witnessed a surge of software debloating as a new defense mechanism against code-reuse attacks; it erases unused code to significantly diminish the possibilities of constructing reusable gadgets. Because of the single firmware image update style, static library debloating shows promise to fortify embedded systems without compromising performance and forward compatibility. However, static library debloating on stripped binaries (e.g., firmware’s shared libraries) is still an enormous challenge. In this paper, we show that this challenge is not insurmountable for MIPS firmware. We develop a novel system, named uTrimmer, to identify and wipe out unused basic blocks from shared libraries’ binary code, without causing additional runtime overhead or memory consumption. We propose a new method to identify address-taken blocks/functions, which further help us maintain an inter-procedural control flow graph to conservatively include library code that could be potentially used by firmware. By capturing address access patterns for position-independent code, we circumvent the challenge of determining code-pointer targets and safely eliminate unused code. We run uTrimmer to debloat shared libraries for SPEC CPU2017 benchmarks, popular firmware applications (e.g., Apache, BusyBox, and OpenSSL), and a real-world wireless router firmware image. Our experiments show that not only does uTrimmer deliver functional programs, but also it can cut the exposed code surface and eliminate various reusable code gadgets remarkably. uTrimmer’s debloating capability can compete with the static linking results.

Article Search
Domain Specific Run Time Optimization for Software Data Planes
Sebastiano MianoORCID logo, Alireza Sanaee ORCID logo, Fulvio Risso ORCID logo, Gábor Rétvári ORCID logo, and Gianni Antichi ORCID logo
(Queen Mary University of London, UK; Politecnico di Torino, Italy; Budapest University of Technology and Economics, Hungary)
State-of-the-art approaches to design, develop and optimize software packet-processing programs are based on static compilation: the compiler's input is a description of the forwarding plane semantics and the output is a binary that can accommodate any control plane configuration or input traffic.
In this paper, we demonstrate that tracking control plane actions and packet-level traffic dynamics at run time opens up new opportunities for code specialization. We present Morpheus, a system working alongside static compilers that continuously optimizes the targeted networking code. We introduce a number of new techniques, from static code analysis to adaptive code instrumentation, and we implement a toolbox of domain specific optimizations that are not restricted to a specific data plane framework or programming language. We apply Morpheus to several eBPF and DPDK programs including Katran, Facebook's production-grade load balancer. We compare Morpheus against state-of-the-art optimization frameworks and show that it can bring up to 2x throughput improvement, while halving the 99th percentile latency.

Article Search Info
Path-Sensitive and Alias-Aware Typestate Analysis for Detecting OS Bugs
Tuo Li, Jia-Ju Bai, Yulei SuiORCID logo, and Shi-Min Hu
(Tsinghua University, China; University of Technology Sydney, Australia)
Operating system (OS) is the cornerstone for modern computer systems. It manages devices and provides fundamental service for user-level applications. Thus, detecting bugs in OSes is important to improve reliability and security of computer systems. Static typestate analysis is a common technique for detecting different types of bugs, but it is often inaccurate or unscalable for large-size OS code, due to imprecision of identifying alias relationships as well as high costs of typestate tracking and path-feasibility validation.
In this paper, we present PATA, a novel path-sensitive and aliasaware typestate analysis framework to detect OS bugs. To improve the precision of identifying alias relationships in OS code, PATA performs a path-based alias analysis based on control-flow paths and access paths. With these alias relationships, PATA reduces the costs of typestate tracking and path-feasibility validation, to boost the efficiency of path-sensitive typestate analysis for bug detection. We have evaluated PATA on the Linux kernel and three popular IoT OSes (Zephyr, RIOT and TencentOS-tiny) to detect three common types of bugs (null-pointer dereferences, uninitialized variable accesses and memory leaks). PATA finds 574 real bugs with a false positive rate of 28%. 206 of these bugs have been confirmed by the developers of the four OSes.We also compare PATA to seven state-of-the-art static approaches (Cppcheck, Coccinelle, Smatch,CSA, Infer, Saber and SVF). PATA finds many real bugs missed by them, with a lower false positive rate.

Article Search
CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation
Brian Suchy, Souradip Ghosh, Aaron Nelson, Zhen Huang, Drew Kersnar, Siyuan Chai, Michael Cuevas, Alex Bernat, Gaurav Chaudhary, Nikos Hardavellas, Simone Campanoni, and Peter Dinda
(Northwestern University, USA)
Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE, or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management.

Article Search
REVAMP: A Systematic Framework for Heterogeneous CGRA Realization
Thilini Kaushalya Bandara ORCID logo, Dhananjaya Wijerathne ORCID logo, Tulika Mitra ORCID logo, and Li-Shiuan Peh ORCID logo
(National University of Singapore, Singapore)
Coarse-Grained Reconfigurable Architectures (CGRAs) provide an excellent balance between performance, energy efficiency, and flexibility. However, increasingly sophisticated applications, especially on the edge devices, demand even better energy efficiency for longer battery life.
Most CGRAs adhere to a canonical structure where a homogeneous set of processing elements and memories communicate through a regular interconnect due to the simplicity of the design. Unfortunately, the homogeneity leads to substantial idle resources while mapping irregular applications and creates inefficiency. We plan to mitigate the inefficiency by systematically and judiciously introducing heterogeneity in CGRAs in tandem with appropriate compiler support.
We propose REVAMP, an automated design space exploration framework that helps architects uncover and add pertinent heterogeneity to a diverse range of originally homogeneous CGRAs when fed with a suite of target applications. REVAMP explores a comprehensive set of optimizations encompassing compute, network, and memory heterogeneity, thereby converting a uniform CGRA into a more irregular architecture with improved energy efficiency. As CGRAs are inherently software scheduled, any micro-architectural optimizations need to be partnered with corresponding compiler support, which is challenging with heterogeneity. The REVAMP framework extends compiler support for efficient mapping of loop kernels on the derived heterogeneous CGRA architectures.
We showcase REVAMP on three state-of-the-art homogeneous CGRAs, demonstrating how REVAMP derives a heterogeneous variant of each homogeneous architecture, with its corresponding compiler optimizations. Our results show that the derived heterogeneous architectures achieve up to 52.4% power reduction, 38.1% area reduction, and 36% average energy reduction over the corresponding homogeneous versions with minimal performance impact for the selected kernel suite.

Article Search
RSSD: Defend against New Ransomware Attacks with Efficient Hardware-Assisted Logging and Post-Attack Analysis
Benjamin Reidys, Peng Liu, and Jian Huang
(University of Illinois at Urbana-Champaign, USA; Pennsylvania State University, USA)
Encryption ransomware has become a notorious malware. It encrypts user data on storage devices like solid-state drives (SSDs) and demands a ransom to restore data for users. To bypass existing defenses, ransomware would keep evolving and performing new attack models. For instance, we identify and validate three new attacks, including (1) garbage-collection (GC) attack that exploits storage capacity and keeps writing data to trigger GC and force SSDs to release the retained data; (2) timing attack that intentionally slows down the pace of encrypting data and hides its I/O patterns to escape existing defense; (3) trimming attack that utilizes the trim command available in SSDs to physically erase data.
To enhance the robustness of SSDs against these attacks, we propose RSSD, a ransomware-aware SSD. It redesigns the flash management of SSDs for enabling the hardware-assisted logging, which can conservatively retain older versions of user data and received storage operations in time order with low overhead. It also employs hardware-isolated NVMe over Ethernet to expand local storage capacity by transparently offloading the logs to remote cloud/servers in a secure manner. RSSD enables post-attack analysis by building a trusted evidence chain of storage operations to assist the investigation of ransomware attacks. We develop RSSD with a real-world SSD FPGA board. Our evaluation shows that RSSD can defend against new and future ransomware attacks, while introducing negligible performance overhead.

Article Search
Software-Defined Address Mapping: A Case on 3D Memory
Jialiang Zhang, Michael Swift, and Jing (Jane) Li
(University of Pennsylvania, USA; University of Wisconsin-Madison, USA)
3D-stacking memory such as High-Bandwidth Memory (HBM) and Hybrid Memory Cube (HMC) provides orders of magnitude more bandwidth and significantly increased channel-level parallelism (CLP) due to its new parallel memory architecture. However, it is challenging to fully exploit the abundant CLP for performance as the bandwidth utilization is highly dependent on address mapping in the memory controller. Unfortunately, CLP is very sensitive to a program’s data access pattern, which is not made available to OS/hardware by existing mechanisms.
In this work, we address these challenges with software-defined address mapping (SDAM) that, for the first time, enables user program to obtain a direct control of the low-level memory hardware in a more intelligent and fine-grained manner. In particular, we develop new mechanisms that can effectively communicate a program’s data access properties to the OS and hardware and to use it to control data placement in hardware. To guarantee correctness and reduce overhead in storage and performance, we extend Linux kernel and C-language memory allocators to support multiple address mappings. For advanced system optimization, we develop machine learning methods that can automatically identify access patterns of major variables in a program and cluster these with similar access patterns to reduce the overhead for SDAM. We demonstrate the benefits of our design on real system prototype, comprising (1) a RISC-V processor, near memory accelerators and HBM modules using Xilinx FPGA platform, and (2) modified Linux and glibc. Our evaluation on standard CPU benchmarks and data-intensive benchmarks (for both CPU and accelerators) demonstrates 1.41×, 1.84× speedup on CPU and 2.58× on near memory accelerators in our system with SDAM compared to a baseline system that uses a fixed address mapping.

Article Search
Protecting Adaptive Sampling from Information Leakage on Low-Power Sensors
Tejas Kannan and Henry HoffmannORCID logo
(University of Chicago, USA)
Adaptive sampling is a powerful family of algorithms for managing energy consumption on low-power sensors. These algorithms use captured measurements to control the sensor's collection rate, leading to near-optimal error under energy constraints. Adaptive sampling's data-driven nature, however, comes at a cost in privacy. In this work, we demonstrate how the collection rates of general adaptive policies leak information about captured measurements. Further, individual adaptive policies display this leakage on multiple tasks. This result presents a challenge in maintaining privacy for sensors using energy-efficient batched communication. In this context, the size of measurement batches exposes the sampling policy's collection rate. Thus, an attacker who monitors the encrypted link between sensor and server can use message lengths to uncover information about the captured values. We address this side-channel by introducing a framework called Adaptive Group Encoding (AGE) that protects any periodic adaptive sampler. AGE uses quantization to encode all batches as fixed-length messages, making message sizes independent of the collection rate. AGE reduces the quantization error through a series of transformations. The proposed framework preserves the low error of adaptive sampling while preventing information leakage and incurring negligible energy overhead.

Article Search
FlexDriver: A Network Driver for Your Accelerator
Haggai EranORCID logo, Maxim Fudim, Gabi Malka, Gal Shalom, Noam Cohen, Amit Hermony, Dotan Levi, Liran Liss, and Mark Silberstein ORCID logo
(NVIDIA, Israel; Technion, Israel)
We propose a new system design for connecting hardware and FPGA accelerators to the network, allowing the accelerator to directly control commodity Network Interface Cards (NICs) without using the CPU. This enables us to solve the key challenge of leveraging existing NIC hardware offloads such as virtualization, tunneling, and RDMA for accelerator networking. Our approach supports a diverse set of use cases, from direct network access for disaggregated accelerators to inline-acceleration of the network stack, all without the complex networking logic in the accelerator.
To demonstrate the feasibility of this approach, we build FlexDriver (FLD), an on-accelerator hardware module that implements a NIC data-plane driver. Our main technical contribution is a mechanism that compresses the NIC control structures by two orders of magnitude, allowing FLD to achieve high networking scalability with low die area cost and no bandwidth interference with the accelerator logic.
The prototype for NVIDIA Innova-2 FPGA SmartNICs showcases our design’s utility for three different accelerators: a disaggregated LTE cipher, an IP-defragmentation inline accelerator, and an IoT cryptographic-token authentication offload. These accelerators reach 25 Gbps line rate and leverage the NIC for RDMA processing, VXLAN tunneling, and traffic shaping without CPU involvement.

Article Search
RecShard: Statistical Feature-Based Memory Optimization for Industry-Scale Neural Recommendation
Geet Sethi, Bilge Acun, Niket Agarwal, Christos Kozyrakis, Caroline TrippelORCID logo, and Carole-Jean Wu
(Stanford University, USA; Facebook, USA)

Article Search
Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads
Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sabet, Saeed Maleki, Youshan Miao ORCID logo, Madan Musuvathi, Todd Mytkowicz, and Olli Saarikivi ORCID logo
(University of Massachusetts at Amherst, USA; Ohio State University, USA; Chinese Academy of Sciences, China; University of California at Riverside, USA; Microsoft Research, USA; Microsoft Research, China)

Article Search
Continuous Address Space Layout Re-randomization for Linux Drivers
Ruslan Nikolaev, Hassan Nadeem, Cathlyn Stone, and Binoy Ravindran
(Virginia Tech, USA; Pennsylvania State University, USA)
While address space layout randomization (ASLR) has been extensively studied for user-space programs, the corresponding OS kernel's KASLR support remains very limited, making the kernel vulnerable to just-in-time (JIT) return-oriented programming (ROP) attacks. Furthermore, commodity OSs such as Linux restrict their KASLR range to 32 bits due to architectural constraints (e.g., x86-64 only supports 32-bit immediate operands for most instructions), which makes them vulnerable to even unsophisticated brute-force ROP attacks due to low entropy. Most in-kernel pointers remain static, exacerbating the problem when pointers are leaked.
Adelie, our kernel defense mechanism, overcomes KASLR limitations, increases KASLR entropy, and makes successful ROP attacks on the Linux kernel much harder to achieve. First, Adelie enables the position-independent code (PIC) model so that the kernel and its modules can be placed anywhere in the 64-bit virtual address space, at any distance apart from each other. Second, Adelie implements stack re-randomization and address encryption on modules. Finally, Adelie enables efficient continuous KASLR for modules by using the PIC model to make it (almost) impossible to inject ROP gadgets through these modules regardless of gadget's origin.
Since device drivers (typically compiled as modules) are often developed by third parties and are typically less tested than core OS parts, they are also often more vulnerable. By fully re-randomizing device drivers, the last two contributions together prevent most JIT ROP attacks since vulnerable modules are very likely to be a starting point of an attack. Furthermore, some OS instances in virtualized environments are specifically designated to run device drivers, where drivers are the primary target of JIT ROP attacks. Using a GCC plugin that we developed, we automatically modify different kinds of kernel modules. Since the prior art tackles only user-space programs, we solve many challenges unique to the kernel code. Our evaluation shows high efficiency of Adelie's approach: the overhead of the PIC model is completely negligible and re-randomization cost remains reasonable for typical use cases.

Article Search
ViK: Practical Mitigation of Temporal Memory Safety Violations through Object ID Inspection
Haehyun Cho ORCID logo, Jinbum Park, Adam Oest, Tiffany Bao, Ruoyu (Fish) Wang, Yan Shoshitaishvili, Adam Doupe, and Gail-Joon Ahn
(Soongsil University, South Korea; Samsung Research, n.n.; PayPal, USA; Arizona State University, USA; Samsung Research, USA)

Article Search

proc time: 5.43