PPAA 2015 – Proceedings

Computing at the Speed of Trading (Keynote)
Neil Bartlett
(IBM, Canada)
When the word trading comes up in conversation, most people think first of the stock markets and the frantic traders of movies and television. In reality, many of the largest deals are done over the counter, which exposes each party to the contract to the other’s financial circumstances. This raises the question: what is my potential future exposure (PFE) if “the other guy” (the counterparty) defaults? Sophisticated measures like PFE and the related credit valuation adjustment (CVA) allow firms to monitor their exposure to others, to limit it, and to ensure their capital will support the deals they make. This analysis involves forecasting deal values far into the future, examining legal agreements between the two firms, and evaluating the deal itself. Algorithms in this domain use thousands of scenarios as well as complex aggregation and pricing techniques, all across hundreds of future time points, to produce actionable risk metrics. In this talk I’ll discuss some of the complexities of the problem, show how it can be broken down into efficient computational chunks, and delve into our recent experiments with parallelizing the aggregation across scenarios and time points using OpenMP to enhance real-time performance.
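As a rough illustration of the aggregation step described in this abstract, the sketch below computes a PFE profile from a grid of simulated portfolio values. It is a minimal sketch under stated assumptions, not the method presented in the talk: the function name `pfe_profile`, the nearest-rank quantile rule, and the input layout (one row per scenario, one column per time point) are all hypothetical choices made for this example.

```python
import math


def pfe_profile(values, quantile=0.95):
    """Return one PFE number per future time point.

    values[s][t] is the simulated (already netted) portfolio value in
    scenario s at time point t. This layout is an assumption of the sketch.
    """
    if not 0.0 < quantile <= 1.0:
        raise ValueError("quantile must be in (0, 1]")
    if not values:
        return []
    n_scenarios = len(values)
    n_times = len(values[0])
    profile = []
    # Each time point is independent of the others, so this outer loop is
    # the kind of loop a parallel runtime could split across threads.
    for t in range(n_times):
        # Exposure is the positive part of the value: we only lose money
        # on default if the counterparty owes us.
        exposures = sorted(max(row[t], 0.0) for row in values)
        # Nearest-rank quantile across scenarios.
        rank = max(1, math.ceil(quantile * n_scenarios))
        profile.append(exposures[rank - 1])
    return profile


if __name__ == "__main__":
    simulated = [
        [-1.0, 2.0],
        [3.0, -4.0],
        [5.0, 6.0],
        [0.0, 1.0],
    ]
    print(pfe_profile(simulated, quantile=0.75))
```

Because each time point only reads its own column of the grid, the per-time-point work can be distributed without synchronization; a production system would also have to account for netting and collateral agreements, which this sketch leaves out.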

Large Scale Machine Learning for Response Prediction (Invited Talk)
Bo Long
(LinkedIn, USA)
I describe LASER, a scalable response prediction platform currently used as part of a social network advertising system. LASER enables the familiar logistic regression model to be applied to very large scale response prediction problems, including ones beyond advertising. Though the underlying model is well understood, we apply a whole-system approach to address model accuracy, scalability, explore-exploit, and real-time inference. To facilitate training with both large numbers of training examples and high-dimensional features on commodity clustered hardware, we employ the Alternating Direction Method of Multipliers (ADMM). Because online advertising applications are much less static than classical presentations of response prediction, LASER employs a number of techniques that allow it to adapt in real time.

Analytics Applications on the Cloud: Business Potential, Solution Requirements, and Research Opportunities (Invited Talk)
Gaurav Rewari and Rahul Kapoor
(Numerify, USA)
The rapid adoption of cloud applications like SalesForce, ServiceNow, NetSuite, and Marketo has opened up an interesting opportunity for vendors of analytical applications and BI middleware, as these modern cloud data sources are not well served by existing BI tools and applications. For customers, moving part or all of their data to the cloud also means that any existing on-premise warehouses and analytical applications are partially or fully defunct, and with the increased acceptance of moving application-level functionality to the cloud there is little interest in upgrading the defunct on-premise offerings. A new class of software vendors, including Numerify, steps into this gap by providing a cloud-based, end-to-end solution for data extraction, transformation, and warehouse-based analytical applications, delivered directly to the business user in select domains (e.g. IT Service Management, Human Resources, and Financials). This talk summarizes the overall business potential for analytics on the cloud, expands on “analytical applications” with some examples, discusses engineering challenges in implementing cloud-based analytics solutions while highlighting some advances in ETL techniques, and points to potential research opportunities in this space.

A Pattern Oriented Approach for Designing Scalable Analytics Applications (Invited Talk)
Matthew Dixon
(University of San Francisco, USA)
The biggest gain in fast processing of big data will most likely result from mapping computation onto clusters of machines while exploiting per-processor parallelism by means of vector instructions and multi-threading. As a result, a new generation of parallel computing infrastructure is needed, driven by the need for application scalability through harnessing growing computing resources in the public cloud, in private data centers and custom clusters, and even on the desktop.
This paper uses Our Pattern Language (OPL) to guide the design of a pattern-oriented software framework for analytics applications that enables scalability, flexibility, modularity, and portability. Using a compute-intensive financial application as a motivating example, we demonstrate how following a pattern-oriented design approach leads to parallel code in which the domains of concern for modeling and for mapping the computations to the architecture are cleanly delineated, and which is portable across architectures and accessible to both application developers and systems developers. In future work, we seek to demonstrate this software framework by architecting and developing a portable and scalable version of QuantLib, a popular open-source quantitative finance library.

Mahout on Heterogeneous Clusters using HadoopCL
Xiangyu Li, Max Grossman, and David Kaeli
(Northeastern University, USA; Rice University, USA)
MapReduce is a programming model capable of processing large amounts of data in parallel across hundreds of compute nodes in a cluster. Many applications have leveraged the power of this model, including a number of compute-hungry applications in machine learning. MapReduce can meet the demands of massive data analysis in applications such as web search, digital media selection, and online shopping analytics. The Mahout recommendation system is one of the most popular open-source recommendation systems that employ machine learning techniques. Mahout provides a parallel computing infrastructure that can be applied to implement a range of applications that work with large datasets. A complementary trend in cluster computing has been the use of GPUs to increase the performance of data-intensive applications, and there have been several efforts to use GPUs to accelerate the MapReduce framework. HadoopCL is a framework that auto-generates OpenCL kernels from Hadoop tasks and can then run them across a heterogeneous cluster. In this paper, we analyze the performance of a Mahout recommendation system running on different cluster platforms, including CPU-only platforms and heterogeneous platforms, where both discrete GPUs and integrated APUs can be evaluated. We propose a cooperative HadoopCL model that improves both GPU/APU programming flexibility and performance.

Characterizing Dataset Dependence for Sparse Matrix-Vector Multiplication on GPUs
Naser Sedaghati, Arash Ashari, Louis-Noël Pouchet, Srinivasan Parthasarathy, and P. Sadayappan
(Ohio State University, USA)
Sparse matrix-vector multiplication (SpMV) is a widely used kernel in scientific applications as well as data analytics. Many GPU implementations of SpMV have been proposed, each using a different sparse matrix representation. However, no sparse matrix representation is consistently superior, and the best representation varies for sparse matrices with different sparsity patterns. In this paper we study four popular sparse representations implemented in the NVIDIA cuSPARSE library: CSR, ELL, COO, and a hybrid ELL-COO scheme. We analyze statistical features of a dataset of 27 matrices, covering a wide spectrum of sparsity features, and attempt to correlate SpMV performance under each representation with simple aggregate metrics of the matrices. We present some insights on the correlation between matrix features and the best choice of sparse matrix representation.
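To make the CSR representation named in this abstract concrete, the following is a minimal CPU-side sketch in plain Python; it does not use cuSPARSE and makes no claim about the paper's implementation. The helper names `dense_to_csr`, `spmv_csr`, and `row_length_stats` are hypothetical, and the row-length statistics are only an assumed example of a "simple aggregate metric", since the abstract does not list the metrics the authors used.

```python
def dense_to_csr(dense):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # row_ptr[i + 1] marks where row i ends in values/col_idx.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        # Only the stored nonzeros of row i are visited.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def row_length_stats(row_ptr):
    """Mean and maximum number of nonzeros per row (illustrative metrics)."""
    lengths = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    if not lengths:
        return 0.0, 0
    return sum(lengths) / len(lengths), max(lengths)


if __name__ == "__main__":
    a = [[1, 0, 2],
         [0, 0, 0],
         [0, 3, 0]]
    vals, cols, ptr = dense_to_csr(a)
    print(spmv_csr(vals, cols, ptr, [1, 2, 3]))
    print(row_length_stats(ptr))
```

A large gap between the mean and maximum row length is one intuitive reason a padded format such as ELL could waste memory on a given matrix, which is the kind of dataset dependence the abstract refers to.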

BLAS Extensions for Algebraic Pricing Methods
Paolo Regondi, Mohammad Zubair, and Claudio Albanese
(Global Valuation, UK; Old Dominion University, USA)
Partial differential equation (PDE) pricing methods such as backward and forward induction are typically implemented as unconditionally marginally stable algorithms in double precision for individual transactions. In this paper, we reconsider this strategy and argue that optimal GPU implementations should be based on a quite different strategy involving higher level BLAS routines. We argue that it is advantageous to use conditionally strongly stable algorithms in single precision and to price concurrently sub-portfolios of similar transactions. To support these operator algebraic methods, we propose some BLAS extensions. CUDA implementations of our extensions turn out to be significantly faster than implementations based on standard cuBLAS. The key to the performance gain of our implementation is in the efficient utilization of the memory system of the new GPU architecture.

A Portable Benchmark Suite for Highly Parallel Data Intensive Query Processing
Ifrah Saeed, Jeffrey Young, and Sudhakar Yalamanchili
(Georgia Tech, USA)
Traditionally, data warehousing workloads have been processed using CPU-focused clusters, such as those that make up the bulk of available machines in Amazon's EC2, and efforts to improve analytics performance have focused on homogeneous, multi-threaded CPU environments with algorithms optimized for this infrastructure. The increasing availability of highly parallel accelerators in these types of clusters, such as GPUs and Xeon Phi discrete accelerators, has provided an opportunity to further accelerate analytics operations, but at a high programming cost due to the optimizations required to fully utilize each of these new pieces of hardware. This work describes and analyzes highly parallel relational algebra primitives that target data warehousing queries and are built on a common OpenCL framework that can be executed both on standard multi-threaded processors and on emerging accelerator architectures. As part of this work, we propose a set of data-intensive benchmarks to help compare and differentiate the performance of accelerator hardware and to determine the key characteristics for efficiently running data warehousing queries on accelerators.
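For readers unfamiliar with the term, the sketch below shows what relational algebra primitives look like in their simplest sequential form. It is plain Python rather than OpenCL, and the function names `select`, `project`, and `hash_join`, as well as the dictionary-per-row layout, are assumptions made for this illustration; the benchmark suite's actual kernels and data layout are not described in the abstract.

```python
def select(rows, predicate):
    """SELECT: keep the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """PROJECT: keep only the named columns of each row."""
    return [{c: row[c] for c in columns} for row in rows]


def hash_join(left, right, key):
    """Inner equi-join of two tables on a shared column name."""
    # Build phase: index the right table by join key.
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    # Probe phase: each left row is looked up independently, which is
    # what makes this step a natural candidate for data parallelism.
    joined = []
    for l in left:
        for r in index.get(l[key], []):
            merged = dict(r)
            merged.update(l)
            joined.append(merged)
    return joined


if __name__ == "__main__":
    orders = [
        {"cust": 1, "amt": 10},
        {"cust": 2, "amt": 5},
        {"cust": 1, "amt": 7},
    ]
    customers = [
        {"cust": 1, "name": "a"},
        {"cust": 3, "name": "c"},
    ]
    big = select(orders, lambda row: row["amt"] > 6)
    print(project(hash_join(big, customers, "cust"), ["name", "amt"]))
```

A warehousing query is typically a composition of such primitives, so their individual throughput on a given device largely determines end-to-end query performance.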

Monitoring HPC Applications in the Production Environment
Hadi Sharifi, Omar Aaziz, and Jonathan Cook
(Intel, USA; New Mexico State University, USA)
The advancement of HPC systems brings with it a need for more introspection into the run-time environment and performance of long-running applications. Software and hardware fault tolerance, scaling performance issues, the effect of soft errors on computations, and even large-scale computational progress will require more capable run-time monitoring of applications during production runs. Current HPC toolsets, however, are geared towards heavyweight program introspection during the development, debugging, and optimization phases of software development, and are not suitable for production monitoring. To this end, this work has developed ProMon, a framework for the production monitoring of HPC applications. ProMon is part of a vision for automatic monitoring of HPC applications without requiring developers or users to do significant work to support it. This paper presents ProMon, motivates its purpose, and shows that production monitoring can be performed efficiently.

2nd Workshop on Parallel Programming for Analytics Applications (PPAA 2015)