The HAMR User’s Guide

HAMR is a library defining an accelerator technology agnostic memory model that bridges between accelerator technologies (CUDA, HIP, ROCm, OpenMP, Sycl, OpenCL, Kokos, etc) and traditional hosts in heterogeneous computing environments. HAMR is light weight and implemented in modern C++. HAMR can be used to manage memory with in a single code or as a data model for coupling codes in a technologically agnostic way. HAMR provides a Python module for coupling C++ and Python codes which implements zero-copy data transfers to and from Python using the Numpy array interface and Numba CUDA array interface protocols.

Design

Modern: Implemented in C++20 HAMR is efficient and easy to use. Declarative: Producers and consumers delcare where data will be accessed. If there is a missmatch HAMR automatically moves the data. Lazy: Data is left in place until it is accessed. Zero-copy constructors enable efficient code coupling.

Source Code

Source code can be obtained at the HAMR github repository.

Documentation

HAMR’s C++ sources are documented via Doxygen at the HAMR Doxygen site. The hamr::buffer is a container that has capabilities similar to std::vector and can provide access to data in different accelerator execution environments.

Build and Install

HAMR is configured with CMake. The following CMake variables influence the build.

CMake Variable

Description

CMAKE_BUILD_TYPE

Release or Debug. The default is Release.

CMAKE_CXX_FLAGS

HAMR will set the C++ compiler flags if not set.

CMAKE_CUDA_FLAGS

HAMR will set the CUDA compiler flags if not set.

HAMR_ENABLE_CUDA

If set to ON enables CUDA features. Default OFF

HAMR_ENABLE_HIP

If set to ON enables HIP features. Default OFF

HAMR_ENABLE_OPENMP

If set to ON enables OpenMP features. Default OFF

HAMR_ENABLE_PYTHON

If set to ON enables Python features. Default OFF

BUILD_TESTING

If set to ON enables regression tests. Default OFF

Note, use with HIP one must tell CMake to use clang and clang++ from the ROCm install via the CMAKE_C_COMPILER and CMAKE_CXX_COMPILER CMake options.

Introduction

HAMR deals only with memory models and serves as a bridge for moving data between various low and high level accelerator technologies and the host at run time. HAMR is designed as a data model for coupling codes such that developers need not code to a specific technology in order to share data. For this reason HAMR does not implement an execution environment. Developers write their codes for the technology of their choice. The technology’s native execution environment is used for computation. HAMR provides data structures that manage memory and can be used to share data and couple codes written in different accelerator technologies (including host based codes), by different developers, such that the receiving code need not have knowledge of technology used to generate that data, and the sending code need not have knowledge of the technology that will be used to consume the data. HAMR manages the necessary memory movements such that the codes have access to the data in the technology where they will use it.

When allocating or accessing data, codes declare the environment (CUDA, HIP, ROCm, OpenMP, Sycl, OpenCL, Kokos etc) in which data will be accessed. Direct access to device pointers in that environment is cheap. When the data is shared with other codes which may not necessarily be written in the same technology and the shared data is accessed, the consumer declares the environment in which the data will be processed. If the data is not already accessible in that environment, it is moved upon access.

Modern C++ design patterns alleviate the burden for explicit management of temporary buffers. Lazy movement of data means data can be left in place. This is an important feature for coupling codes, where it is not known in advance which computational technology the consumer of the data will make use of.

Technology Agnostic Memory Management

hamr::buffer

The hamr::buffer class is a container that has capabilities similar to std::vector and can provide access to data in different accelerator execution environments. During construction Producers of data declare in which environment (CUDA, ROCm, HIP, OpernMP, etc) the data will initially be accessible in. Access to the data in the declared environment is essentially free. When consumers of the data need to access the data, they declare in which environment access is needed. If the consumers are accessing in an environment in which the data is inaccessible, a temporary allocation is created and the data is moved. Reference counting is used to manage temporary allocations.

For instance a code that runs in CUDA would allocate a buffer for the results of a calculation as follows:

size_t n_vals = 10000;
hamr::buffer<float> data = buffer<float>(allocator::cuda, n_vals);

This memory is allocated on the active CUDA device. A device pointer may be obtained as follows:

std::shared_ptr<float> spdata = data->get_cuda_accessible();
float *pdata = spdata.get();

Because the buffer data was allocated for use in CUDA, spdata points to the buffer’s contents, no data was moved. pdata is a device pointer that can be passed to a CUDA kernel.

The contents of this buffer can then be passed to codes written for other technologies. For instance a code written for the host could access the data as follows:

std::shared_ptr<float> spdata = data->get_host_accessible();
float *pdata = spdata.get();

Because the buffer data was allocated for use in CUDA, here spdata points to a temporary buffer that has been moved to the host. pdata is a pointer that can be used to access the buffers contents on the host. Modern C++ std::shared_ptr is used to manage the temporary. pdata is valid as long as spdata is in scope. In this way the consumer of the data need not know if the data was moved or accessed in place.

Python Integration

HAMR provides Python bindings that enable interoperability and data sharing between C++ and Python codes. This is accomplished via the Numpy array interface protocol, and the Numba CUDA array interface protocol. HAMR manages the C++ and Python data structures such that they will persist while in use in the other language.

C++ to Python

In Python one constructs hamr::buffer objects using the wrapped C++ API. The passed allocator flag indicates where the memory will be allocated. For instance an buffer of 16 values initialized to 3.1415 allocated on the GPU would be created with

buf = buffer_float(buffer_allocator_cuda, 16, 3.1415)

When sharing data with Numpy or Cupy a hamr::buffer_handle must be used. The hamr::buffer_handle implements both the Numpy array interface and the Numba CUDA array interface protocols enabling Numpy and Cupy to directly make use of the shared data.

When sharing with Numpy the hamr::buffer_handle is obtained by calling hamr::buffer::get_host_accessible.

arr = numpy.array(buf.get_host_accessible())

As explained above, HAMR ensures that the returned data is accessible on the host by either moving it to a temporary if necessary, or returning a pointer to the buffer contents.

Similarly, when sharing with Cupy the hamr::buffer_handle instance is obtained by calling hamr::buffer::get_cuda_accessible ensuring that the returned data is accessible in CUDA.

arr = cupy.array(buf.get_cuda_accessible())

The hamr::bufer_handle returned holds a reference to the shared data, or the temporary, if data was automatically moved. In turn the Numpy or Cupy array holds a reference to the hamr::buffer_handle, which keeps the data alive while Numpy or Cupy uses it.

While the data is transferred to Numpy or Cupy via the array interface protocol which is a zero-copy transfer mechanism, modifications made to the data will only be visible in the other language if a temporary was not needed, that is, if the data was already accessible in the location it was requested.

Python to C++

hamr::buffer instances may be zero-copy constructed directly from both Numpy and Cupy arrays. For example

arr = cupy.full(16, 3.1415, dtype='float32')
buf = buffer(arr)

The hamr::buffer zero-copy constructor obtains pointers to the Numpy or Cupy array’s contents via the appropriate array interface protocol. The new hamr::buffer instance holds a reference to object sharing the data ensuring that it is kept alive while in use.

This is always a zero-copy transfer and modifications in one language will be visible in the other.

Examples

The source code for the following examples is located in the doc/rtd/source folder. The C++ examples include a simple Makefile that can be edited to point to a build.

Hello World! w/ C++ and OpenMP

This example illustrates coupling two codes, in this case functions, using HAMR so that they can process data produced either on the host or GPU without knowing specifically where the data passed to them resides. C++ smart pointers are used to manage temporary buffers if the passed data needed to be moved to the device where it was accessed. See hamr::buffer for more information. See Hello World! w/ C++ and CUDA for a CUDA implementation of this example. See Hello World! w/ C++ and HIP for a HIP implementation of this example.See Hello World! w/ Python and cupy for a Python implementation of this example.

Listing 1 Code that uses HAMR to access array based data in OpenMP. Calling get_openmp_accessible makes the array’s available in OpenMP if they are not. Then OpenMP device oofloading may be applied as usual.
 1template <typename T, typename U>
 2hamr::buffer<T> add(const hamr::buffer<T> &a1, const hamr::buffer<U> &a2)
 3{
 4    size_t n_vals = a1.size();
 5
 6    // get pointers to the input arrays that are safe to use on the GPU
 7    auto [spa1, pa1] = hamr::get_openmp_accessible(a1);
 8    auto [spa2, pa2] = hamr::get_openmp_accessible(a2);
 9
10    // allocate the memory for the result on the GPU, and get a pointer to it
11    hamr::buffer<T> ao(hamr::buffer_allocator::openmp, n_vals, T(0));
12    T *pao = ao.data();
13
14    // launch the kernel to add the arrays
15    #pragma omp target teams distribute parallel for is_device_ptr(pao, pa1, pa2)
16    for (size_t i = 0; i < n_vals; ++i)
17    {
18        pao[i] = pa1[i] + pa2[i];
19    }
20
21    return ao;
22}
Listing 2 Code that uses HAMR to access array based data on the host. Calling get_host_accessible makes the array available on the host if they are not.
 1template <typename T>
 2void write(std::ostream &os, const hamr::buffer<T> &ai)
 3{
 4    // get pointer to the input array that is safe to use on the host
 5    auto [spai, pai] = hamr::get_host_accessible(ai);
 6
 7    // write the elements of the array to the stream
 8    for (size_t i = 0; i < ai.size(); ++i)
 9    {
10        os << pai[i] << " ";
11    }
12
13    os << std::endl;
14}
Listing 3 This simple Hello world! program allocates an array on the GPU and an array on the host, both are initialized to 1. Then dispatch code use HAMR API’s to make sure that the data is accessible in OpenMP before launching a simple kernel that adds the two arrays. HMAR is used to make the data accessible on the host and print the result.
 1#include <hamr_buffer.h>
 2
 3#include <iostream>
 4#include <memory>
 5
 6#include "add.h"
 7#include "write.h"
 8
 9int main(int, char **)
10{
11    size_t n_vals = 400;
12
13    // allocate and initialize to 1 on the GPU
14    hamr::buffer<float> a0(hamr::buffer_allocator::openmp, n_vals, 1.0f);
15
16    // allocate and initialize to 1 on the host
17    hamr::buffer<float> a1(hamr::buffer_allocator::malloc, n_vals, 1.0f);
18
19    // add the two arrays
20    hamr::buffer<float> a2 = add(a0, a1);
21
22    // write the result
23    write(std::cerr, a2);
24
25    return 0;
26}

Hello World! w/ C++ and CUDA

This example illustrates coupling two codes, in this case functions, using HAMR so that they can process data produced either on the host or GPU without knowing specifically where the data passed to them resides. C++ smart pointers are used to manage temporary buffers if the passed data needed to be moved to the device where it was accessed. See hamr::buffer for more information. See Hello World! w/ C++ and HIP for a HIP implementation of this example. See Hello World! w/ Python and cupy for a Python implementation of this example.

Listing 4 A simple CUDA kernel that adds two arrays.
 1template<typename T, typename U>
 2__global__
 3void add(T *result, const T *array_1, const U *array_2, size_t n_vals)
 4{
 5    unsigned long i = blockIdx.x*blockDim.x + threadIdx.x;
 6
 7    if (i >= n_vals)
 8        return;
 9
10    result[i] = array_1[i] + array_2[i];
11}
Listing 5 Code that uses HAMR to access array based data in CUDA. Calling get_cuda_accessible makes the array’s available in CUDA if they are not. Then CUDA kernels may be applied as usual.
 1#include "add.cuh"
 2
 3template <typename T, typename U>
 4hamr::buffer<T> add(const hamr::buffer<T> &a1, const hamr::buffer<U> &a2)
 5{
 6    size_t n_vals = a1.size();
 7
 8    // get pointers to the input arrays that are safe to use on the GPU
 9    auto [spa1, pa1] = hamr::get_cuda_accessible(a1);
10    auto [spa2, pa2] = hamr::get_cuda_accessible(a2);
11
12    // allocate the memory for the result on the GPU, and get a pointer to it
13    hamr::buffer<T> ao(hamr::buffer_allocator::cuda, n_vals, T(0));
14    T *pao = ao.data();
15
16    // launch the kernel to add the arrays
17    dim3 thread_grid(128);
18    dim3 block_grid(n_vals/128 + (n_vals % 128 ? 1 : 0));
19    add<<<block_grid, thread_grid>>>(pao, pa1, pa2, n_vals);
20
21    return ao;
22}
Listing 6 Code that uses HAMR to access array based data on the host. Calling get_host_accessible makes the array available on the host if they are not.
 1template <typename T>
 2void write(std::ostream &os, const hamr::buffer<T> &ai)
 3{
 4    // get pointer to the input array that is safe to use on the host
 5    auto spai = ai.get_host_accessible();
 6    const T *pai = spai.get();
 7
 8    // write the elements of the array to the stream
 9    for (int i = 0; i < ai.size(); ++i)
10    {
11        os << pai[i] << " ";
12    }
13
14    os << std::endl;
15}
Listing 7 This simple Hello world! program allocates an array on the GPU and an array on the host, both are initialized to 1. Then dispatch code use HAMR API’s to make sure that the data is accessible in CUDA before launching a simple kernel that adds the two arrays. HMAR is used to make the data accessible on the host and print the result.
 1#include <hamr_buffer.h>
 2#include <hamr_buffer_util.h>
 3#include <cuda.h>
 4#include <cuda_runtime.h>
 5#include <iostream>
 6#include <memory>
 7
 8#include "add.h"
 9#include "write.h"
10
11int main(int, char **)
12{
13    size_t n_vals = 400;
14
15    // allocate and initialize to 1 on the GPU
16    hamr::buffer<float> a0(hamr::buffer_allocator::cuda, n_vals, 1.0f);
17
18    // allocate and initialize to 1 on the host
19    hamr::buffer<float> a1(hamr::buffer_allocator::malloc, n_vals, 1.0f);
20
21    // add the two arrays
22    hamr::buffer<float> a2 = add(a0, a1);
23
24    // write the result
25    write(std::cerr, a2);
26
27    return 0;
28}

Hello World! w/ C++ and HIP

This example illustrates coupling two codes, in this case functions, using HAMR so that they can process data produced either on the host or GPU without knowing specifically where the data passed to them resides. C++ smart pointers are used to manage temporary buffers if the passed data needed to be moved to the device where it was accessed. See hamr::buffer for more information. See Hello World! w/ C++ and CUDA for a CUDA implementation of this example.

Listing 8 A simple HIP kernel that adds two arrays.
 1template<typename T, typename U>
 2__global__
 3void add(T *result, const T *array_1, const U *array_2, size_t n_vals)
 4{
 5    unsigned long i = blockIdx.x*blockDim.x + threadIdx.x;
 6
 7    if (i >= n_vals)
 8        return;
 9
10    result[i] = array_1[i] + array_2[i];
11}
Listing 9 Code that uses HAMR to access array based data in HIP. Calling get_hip_accessible makes the array’s available in HIP if they are not. Then HIP kernels may be applied as usual.
 1#include "add_kernel.h"
 2
 3template <typename T, typename U>
 4hamr::buffer<T> add(const hamr::buffer<T> &a1, const hamr::buffer<U> &a2)
 5{
 6    size_t n_vals = a1.size();
 7
 8    // get pointers to the input arrays that are safe to use on the GPU
 9    auto [spa1, pa1] = hamr::get_hip_accessible(a1);
10    auto [spa2, pa2] = hamr::get_hip_accessible(a2);
11
12    // allocate the memory for the result on the GPU, and get a pointer to it
13    hamr::buffer<T> ao(hamr::buffer_allocator::hip, n_vals, T(0));
14    T *pao = ao.data();
15
16    // launch the kernel to add the arrays
17    dim3 thread_grid(128);
18    dim3 block_grid(n_vals/128 + (n_vals % 128 ? 1 : 0));
19    add<<<block_grid, thread_grid>>>(pao, pa1, pa2, n_vals);
20
21    return ao;
22}
Listing 10 Code that uses HAMR to access array based data on the host. Calling get_host_accessible makes the array available on the host if they are not.
 1template <typename T>
 2void write(std::ostream &os, const hamr::buffer<T> &ai)
 3{
 4    // get pointer to the input array that is safe to use on the host
 5    auto [spai, pai] = hamr::get_host_accessible(ai);
 6
 7    // write the elements of the array to the stream
 8    for (int i = 0; i < ai.size(); ++i)
 9    {
10        os << pai[i] << " ";
11    }
12
13    os << std::endl;
14}
Listing 11 This simple Hello world! program allocates an array on the GPU and an array on the host, both are initialized to 1. Then dispatch code use HAMR API’s to make sure that the data is accessible in HIP before launching a simple kernel that adds the two arrays. HMAR is used to make the data accessible on the host and print the result.
 1#include <hamr_buffer.h>
 2#include <hip/hip_runtime.h>
 3
 4#include <iostream>
 5#include <memory>
 6
 7#include "add.h"
 8#include "write.h"
 9
10int main(int, char **)
11{
12    size_t n_vals = 400;
13
14    // allocate and initialize to 1 on the GPU
15    hamr::buffer<float> a0(hamr::buffer_allocator::hip, n_vals, 1.0f);
16
17    // allocate and initialize to 1 on the host
18    hamr::buffer<float> a1(hamr::buffer_allocator::malloc, n_vals, 1.0f);
19
20    // add the two arrays
21    hamr::buffer<float> a2 = add(a0, a1);
22
23    // write the result
24    write(std::cerr, a2);
25
26    return 0;
27}

Hello World! w/ Python and cupy

This example illustrates the coupling of two codes using hamr so that they can process data produced either on the host or GPU without knowing specifically where the data passed to them resides. HAMR’s Python integration handle data sharing between C++ and Python objects. See Python Integration for more information. See Hello World! w/ C++ and CUDA for a C++ implementation of this example.

Listing 12 This simple Hello world! program allocates an array on the GPU and an array on the host, both are initialized to 1. Then dispatch code use HAMR API’s to make sure that the data is accessible in CUDA before launching a simple kernel that adds the two arrays. HMAR is used to make the data accessible on the host and print the result.
 1from hamr import *
 2import cupy as cp
 3import numpy as np
 4import sys
 5
 6
 7def add(buf_0, buf_1):
 8    """ add 2 arrays on the GPU """
 9    arr_0 = cp.array(buf_0.get_cuda_accessible()) # share data w/ cupy on GPU
10    arr_1 = cp.array(buf_1.get_cuda_accessible()) # share data w/ cupy on GPU
11    arr_2 = arr_0 + arr_1                         # add on the GPU
12    buf_2 = buffer(arr_2)                         # zero-copy from cupy on GPU
13    return buf_2
14
15def write(fh, buf):
16    """ print the array on the host """
17    arr = np.array(buf.get_host_accessible())      # share data w/ numpy on host
18    fh.write('%s\n'%(str(arr)))                   # write to the file on host
19
20
21n_vals = 400
22buf_0 = buffer_float(buffer_allocator_cuda, n_vals, 1.0)   # allocate on the host
23buf_1 = buffer_float(buffer_allocator_malloc, n_vals, 1.0) # allocate on the GPU
24
25buf_2 = add(buf_0, buf_1)                                  # add the arrays
26
27write(sys.stdout, buf_2)                                   # write the arrays