Machine learning is an increasingly large fraction of datacenter workloads, making efficient execution of ML models a priority for industry. At the same time, the slow down of Moore's Law has created space for a plethora of innovative hardware designs to wring maximum performance from each transistor. To bridge the gap between software and hardware, we need compilers that understand both the characteristics of ML workloads and the nuances of the hardware. In this talk I will describe how Facebook's Glow compiler leverages LLVM infrastructure to build a high-performance software stack for machine learning, by combining high-level domain-specific optimizations with customized low-level code generation strategies.
Graphics Processing Units (GPUs) have been widely adopted to accelerate the execution of High Performance Computing (HPC) workloads due to their enormous computational throughput, ability to execute a large number of threads inside SIMD groups in parallel, and their use of multithreaded hardware to hide long pipelining and memory access latency. However, developing applications able to exploit the high performance of GPUs requires proper code tuning. As a consequence, computer scientists proposed different approaches to simplify GPU programming, including directive-based programming models such as OpenMP and OpenACC. Their intention is to solve the aforementioned programming challenges with a directive-based approach which allows the users to insert non-executable pragma constructs that guide the compiler to handle the low-level complexities of the system. Flang, a Fortran front end for the LLVM Compiler Infrastructure, has drawn attention from the HPC community. Although Flang supports OpenMP for multicore architectures, it has no capability of offloading parallel regions to accelerator devices. In this paper, we present OpenMP Offload support in Flang targeting NVIDIA GPUs. Our goal is to investigate possible implementation strategies of OpenMP GPU offloading into Flang. The experimental results show that our approach is able to achieve performance similar to existing compilers with OpenMP GPU offload support.
With the diversification of HPC architectures beyond traditional CPU-based clusters, a number of new frameworks for performance portability across architectures have arisen. One way of implementing such frameworks is to use C++ templates and lambda expressions to design loop-like functions. However, lower level programming APIs that these implementations must use are often designed with C in mind and do not specify how they interact with C++ features such as lambda expressions.
This paper discusses a change to the behavior of the OpenMP specification with respect to lambda expressions such that when functions generated by lambda expressions are called inside GPU regions, any pointers used in the lambda expression correctly refer to device pointers. This change has been implemented in a branch of the Clang C++ compiler and demonstrated with two representative codes. This change has also been accepted into the draft OpenMP specification for inclusion in OpenMP 5. Our results show that the implicit mapping of lambda expressions always exhibits identical performance to an explicit mapping but without breaking the abstraction provided by the high level frameworks.
OpenACC was launched in 2010 as a portable programming model for heterogeneous accelerators. Although various implementations already exist, no extensible, open-source, production-quality compiler support is available to the community. This deficiency poses a serious risk for HPC application developers targeting GPUs and other accelerators, and it limits experimentation and progress for the OpenACC specification. To address this deficiency, Clacc is a recent effort funded by the US Exascale Computing Project to develop production OpenACC compiler support for Clang and LLVM. A key feature of the Clacc design is to translate OpenACC to OpenMP to build on Clang's existing OpenMP compiler and runtime support. In this paper, we describe the Clacc goals and design. We also describe the challenges that we have encountered so far in our prototyping efforts, and we present some early performance results.
The vectorization of loops invoking math function is an important optimization that is available in most commercial compilers. This paper describes a new command line option, -fsimdmath, available in Arm Compiler for HPC, that enables auto-vectorization of math functions in C and C++ code, and that will also be applicable to Fortran code in a future versions.
The design of -fsimdath is based on open standards and public architectural specifications. The library that provides the vector implementation of the math routines, libsimdmath.so, is shipped with the compiler and based on the SLEEF library libsleefgnuabi.so. SLEEF is a project that aims to provide a vector implementation of all C99 math functions, for a wide variety of vector extensions and architectures, across multiple platforms.
This feature is very important for HPC programmers, because the vector units of new CPUs are getting wider. Whether you are targeting Intel architectures with the AVX512 vector extension, or Arm architectures with the Scalable Vector Extension, good quality auto-vectorization is of increasing importance.
Although -fsimdmath has been implemented in a commercial compiler, it has been designed with portability and compatibility in mind, so that its use is not limited only to the vector extensions of the Arm architectures, but can be easily introduced as a major optimization for all the vector extensions that LLVM supports.
If accepted upstream, this new feature will enlarge the set of loops that LLVM will be able to auto-vectorize.
Currently, there are three vectorizers in the LLVM trunk: Loop Vectorizer, SLP Vectorizer, and Load-Store Vectorizer. There is a need for vectorizing functions/kernels: 1) Function calls are an integral part of programming real world application code and we cannot always rely on fully inlining them. When a function call is made from a vectorized context such as vectorized loop or vectorized function, if there are no vectorized callees available, the call has to be made to a scalar callee, one vector element at a time. At the programming model level, OpenMP declare simd is a standardized syntax to address this problem. LLVM needs a vectorizer to properly vectorize OpenMP declare simd functions. 2) Also, in the GPGPU programming model, such as OpenCL, work-item (thread) parallelism is not expressed with a loop; it is implicit in the execution of the kernels. In order to exploit SIMD parallelism at this top-level (thread-level), we need to start from vectorizing the kernel.
One of the obvious ways to vectorize functions/kernels is to add a fourth vectorizer that specifically deals with function vectorization. In this paper, we argue that such a naive approach will lead us to sub-optimal performance and/or higher maintenance burden. Instead, we present a technique to take advantages of the current functionalities and future improvements of Loop Vectorizer in order to vectorize functions and kernels.
Directives for the compiler such as pragmas can help programmers to separate an algorithm's semantics from its optimization. This keeps the code understandable and easier to optimize for different platforms. Simple transformations such as loop unrolling are already implemented in most mainstream compilers. We recently submitted a proposal to add generalized loop transformations to the OpenMP standard. We are also working on an implementation in LLVM/Clang/Polly to show its feasibility and usefulness. The current prototype allows applying patterns common to matrix-matrix multiplication optimizations.
Domain Specific Languages or Active Library frameworks have recently emerged as an important method for gaining performance portability, where an application can be efficiently executed on a wide range of HPC architectures without significant manual modifications. Embedded DSLs such as OP2, provides an API embedded in general purpose languages such as C/C++/Fortran. They rely on source-to-source translation and code refactorization to translate the higher-level API calls to platform specific parallel implementations. OP2 targets the solution of unstructured-mesh computations, where it can generate a variety of parallel implementations for execution on architectures such as CPUs, GPUs, distributed memory clusters and heterogeneous processors making use of a wide range of platform specific optimizations. Compiler tool-chains supporting source-to-source translation of code written in mainstream languages currently lack the capabilities to carry out such wide-ranging code transformations. Clang/LLVM’s Tooling library (LibTooling) has long been touted as having such capabilities but have only demonstrated its use in simple source refactoring tasks.
In this paper we introduce OP2-Clang, a source-to-source translator based on LibTooling, for OP2’s C/C++ API, capable of generating target parallel code based on SIMD, OpenMP, CUDA and their combinations with MPI. OP2-Clang is designed to significantly reduce maintenance, particularly making it easy to be extended to generate new parallelizations and optimizations for hardware platforms. In this research, we demonstrate its capabilities including (1) the use of LibTooling’s AST matchers together with a simple strategy that use parallelization templates or skeletons to significantly reduce the complexity of generating radically different and transformed target code and (2) chart the challenges and solution to generating optimized parallelizations for OpenMP, SIMD and CUDA. Results indicate that OP2-Clang produces near-identical parallel code to that of OP2’s current source-to-source translator. We believe that the lessons learnt in OP2-Clang can be readily applied to developing other similar source-to-source translators, particularly for DSLs.
The relationship of application performance to its required development effort plays an important role in today’s budget-oriented HPC environment. This effort-performance relationship is especially affected by the structure and characterization of an HPC application. We aim at a classification of HPC applications using (design) patterns for parallel programming. For an efficient analysis of parallel patterns and applicable pattern definitions, we introduce our tool PInT that is based on source code instrumentation and Clang LibTooling. Furthermore, we propose metrics to examine occurrences and compositions of patterns that can be automatically evaluated by PInT. In two case studies, we show the applicability and functionality of PInT.
Measuring performance-critical characteristics of application workloads is important both for developers, who must understand and optimize the performance of codes, as well as designers and integrators of HPC systems, who must ensure that compute architectures are suitable for the intended workloads. However, if these workload characteristics are tied to architectural features that are specific to a particular system, they may not generalize well to alternative or future systems. An architecture-independent method ensures an accurate characterization of inherent program behaviour, without bias due to architecture-dependent features that vary widely between different types of accelerators. This work presents the first architecture-independent workload characterization framework for heterogeneous compute platforms, proposing a set of metrics determining the suitability and performance of an application on any parallel HPC architecture. The tool, AIWC, is a plugin for the open-source Oclgrind simulator. It supports parallel workloads and is capable of characterizing OpenCL codes currently in use in the supercomputing setting. AIWC simulates an OpenCL device by directly interpreting LLVM instructions, and the resulting metrics may be used for performance prediction and developer feedback to guide device-specific optimizations. An evaluation of the metrics collected over a subset of the Extended OpenDwarfs Benchmark Suite is also presented.
Heterogeneous platforms may include accelerators such as Digital Signal Processors (DSP’s) that employ SW-controlled scratch-pad memories instead of, or in addition to standard HW-cached memory. Controlling scratch-pads efficiently typically requires tiling and pipelining loops, thereby optimizing for memory locality rather than parallelism as a primary objective. On the other hand, achieving high performance on CPU’s and GPU’s typically requires optimizing for data-level parallelism as a primary objective, compromising locality. In this lightning talk we show how OpenCL and LLVM can be used to achieve both target-dependent locality and target-independent parallelism. Such an approach facilitates the development of optimized software for DSP accelerators while enabling its efficient execution on standard servers. Following the work of Tian et al., our approach leverages automatic compiler optimization and relies purely on OpenCL, including its device-side enqueue capability and SPIR-V format.
To leverage widely available accelerators, OpenMP has introduced device constructs. Device constructs simplify the development of heterogeneous parallel programs and improve the performance. Many compilers including Clang already have support for device constructs, but there exist few documentations about the implementation details of device constructs. Lacking implementation details makes it cumbersome to understand the root cause of concurrency bugs and performance issues encountered on accelerators. In this paper, we conduct a study on Clang to analyze the implementation of device constructs for GPUs. We manually analyze the generated Parallel Thread Execution (PTX) code for each OpenMP construct to determine the relationship between the construct and PTX instructions. Based on the analysis, we evaluate the correctness of these constructs and discuss potential concurrency bugs incurred by incorrect usage of device constructs, for instance, data races, stale data and atomicity violation. Furthermore, we also talk about three observed inconsistencies in Clang, which may misinform programmers while writing an OpenMP program. Our work can help programmers gain a better understanding of device offloading and avoid hidden pitfalls when using Clang and OpenMP.
The C++ Direction Group has set a future direction for C++ and includes a guidance towards Heterogeneous C++. The introduction of the executors TS means for the first time in C++ there will be a standard platform for writing applications which can execute across a wide range of architectures including multi-core and many-core CPUs, GPUs, DSPs, and FPGAs.
The SYCL standard from the Khronos Group is a strong candidate to implement this upcoming C++ standard as are many other C++ frameworks from DOE, and HPX for the distributed case. One of SYCL's main strength is the capability to support constraint accelerator systems as it only requires OpenCL 1.2. One of the core ideas of the standard is that everything must be standard C++, the only exception being that some feature of C++ cannot be used in places that can be executed on an OpenCL device, often due to hardware limitation.
This paper presents some of the challenges and solutions to implement a Heterogeneous C++ standard in clang based on our implementation of Khrono's SYCL language with Codeplay's ComputeCpp compiler, with the fast growth of C++ and clang being a platform of choice to prototype many of the new C++ features.
We describe the major issues with ABI for separate compilation tool chain that comes from non-standard layout type of lambdas, as well as the issues of data addressing that comes from non-flat and possibly non-coherent address space.
We also describe various papers which are being proposed to ISO C++ to move towards standardizing heterogeneous and distributed computing in C++. The introduction of a unified interface for execution across a wide range of different hardware, extensions to this to support concurrent exception handling and affinity queries, and an approach to improve the capability of the parallel algorithms through composability. All of this adds up to a future C++ which is much more aware of heterogeneity and capable of taking advantage of it to improve parallelism and performance.