The HAMR User’s Guide

HAMR is a library defining an accelerator technology agnostic memory model that bridges between accelerator technologies (CUDA, HIP, ROCm, OpenMP, Sycl, OpenCL, Kokos, etc) and traditional hosts in heterogeneous computing environments. HAMR is light weight and implemented in modern C++. HAMR can be used to manage memory with in a single code or as a data model for coupling codes in a technologically agnostic way. HAMR provides a Python module for coupling C++ and Python codes which implements zero-copy data transfers to and from Python using the Numpy array interface and Numba CUDA array interface protocols.

Design

Modern: Implemented in C++20 HAMR is efficient and easy to use. Declarative: Producers and consumers delcare where data will be accessed. If there is a missmatch HAMR automatically moves the data. Lazy: Data is left in place until it is accessed. Zero-copy constructors enable efficient code coupling.

Source Code

Source code can be obtained at the HAMR github repository.

Documentation

HAMR’s C++ sources are documented via Doxygen at the HAMR Doxygen site. The hamr::buffer is a container that has capabilities similar to std::vector and can provide access to data in different accelerator execution environments.

Build and Install

HAMR is configured with CMake. The following CMake variables influence the build.

CMake Variable	Description
CMAKE_BUILD_TYPE	Release or Debug. The default is Release.
CMAKE_CXX_FLAGS	HAMR will set the C++ compiler flags if not set.
CMAKE_CUDA_FLAGS	HAMR will set the CUDA compiler flags if not set.
HAMR_ENABLE_CUDA	If set to ON enables CUDA features. Default OFF
HAMR_ENABLE_HIP	If set to ON enables HIP features. Default OFF
HAMR_ENABLE_OPENMP	If set to ON enables OpenMP features. Default OFF
HAMR_ENABLE_PYTHON	If set to ON enables Python features. Default OFF
BUILD_TESTING	If set to ON enables regression tests. Default OFF

Note, use with HIP one must tell CMake to use clang and clang++ from the ROCm install via the CMAKE_C_COMPILER and CMAKE_CXX_COMPILER CMake options.

Introduction

HAMR deals only with memory models and serves as a bridge for moving data between various low and high level accelerator technologies and the host at run time. HAMR is designed as a data model for coupling codes such that developers need not code to a specific technology in order to share data. For this reason HAMR does not implement an execution environment. Developers write their codes for the technology of their choice. The technology’s native execution environment is used for computation. HAMR provides data structures that manage memory and can be used to share data and couple codes written in different accelerator technologies (including host based codes), by different developers, such that the receiving code need not have knowledge of technology used to generate that data, and the sending code need not have knowledge of the technology that will be used to consume the data. HAMR manages the necessary memory movements such that the codes have access to the data in the technology where they will use it.

When allocating or accessing data, codes declare the environment (CUDA, HIP, ROCm, OpenMP, Sycl, OpenCL, Kokos etc) in which data will be accessed. Direct access to device pointers in that environment is cheap. When the data is shared with other codes which may not necessarily be written in the same technology and the shared data is accessed, the consumer declares the environment in which the data will be processed. If the data is not already accessible in that environment, it is moved upon access.

Modern C++ design patterns alleviate the burden for explicit management of temporary buffers. Lazy movement of data means data can be left in place. This is an important feature for coupling codes, where it is not known in advance which computational technology the consumer of the data will make use of.

Technology Agnostic Memory Management

hamr::buffer

The hamr::buffer class is a container that has capabilities similar to std::vector and can provide access to data in different accelerator execution environments. During construction Producers of data declare in which environment (CUDA, ROCm, HIP, OpernMP, etc) the data will initially be accessible in. Access to the data in the declared environment is essentially free. When consumers of the data need to access the data, they declare in which environment access is needed. If the consumers are accessing in an environment in which the data is inaccessible, a temporary allocation is created and the data is moved. Reference counting is used to manage temporary allocations.

For instance a code that runs in CUDA would allocate a buffer for the results of a calculation as follows:

size_t n_vals = 10000;
hamr::buffer<float> data = buffer<float>(allocator::cuda, n_vals);

This memory is allocated on the active CUDA device. A device pointer may be obtained as follows:

std::shared_ptr<float> spdata = data->get_cuda_accessible();
float *pdata = spdata.get();

Because the buffer data was allocated for use in CUDA, spdata points to the buffer’s contents, no data was moved. pdata is a device pointer that can be passed to a CUDA kernel.

The contents of this buffer can then be passed to codes written for other technologies. For instance a code written for the host could access the data as follows:

std::shared_ptr<float> spdata = data->get_host_accessible();
float *pdata = spdata.get();

Because the buffer data was allocated for use in CUDA, here spdata points to a temporary buffer that has been moved to the host. pdata is a pointer that can be used to access the buffers contents on the host. Modern C++ std::shared_ptr is used to manage the temporary. pdata is valid as long as spdata is in scope. In this way the consumer of the data need not know if the data was moved or accessed in place.

Python Integration

HAMR provides Python bindings that enable interoperability and data sharing between C++ and Python codes. This is accomplished via the Numpy array interface protocol, and the Numba CUDA array interface protocol. HAMR manages the C++ and Python data structures such that they will persist while in use in the other language.

C++ to Python

In Python one constructs hamr::buffer objects using the wrapped C++ API. The passed allocator flag indicates where the memory will be allocated. For instance an buffer of 16 values initialized to 3.1415 allocated on the GPU would be created with

buf = buffer_float(buffer_allocator_cuda, 16, 3.1415)

When sharing data with Numpy or Cupy a hamr::buffer_handle must be used. The hamr::buffer_handle implements both the Numpy array interface and the Numba CUDA array interface protocols enabling Numpy and Cupy to directly make use of the shared data.

When sharing with Numpy the hamr::buffer_handle is obtained by calling hamr::buffer::get_host_accessible.

arr = numpy.array(buf.get_host_accessible())

As explained above, HAMR ensures that the returned data is accessible on the host by either moving it to a temporary if necessary, or returning a pointer to the buffer contents.

Similarly, when sharing with Cupy the hamr::buffer_handle instance is obtained by calling hamr::buffer::get_cuda_accessible ensuring that the returned data is accessible in CUDA.

arr = cupy.array(buf.get_cuda_accessible())

The hamr::bufer_handle returned holds a reference to the shared data, or the temporary, if data was automatically moved. In turn the Numpy or Cupy array holds a reference to the hamr::buffer_handle, which keeps the data alive while Numpy or Cupy uses it.

While the data is transferred to Numpy or Cupy via the array interface protocol which is a zero-copy transfer mechanism, modifications made to the data will only be visible in the other language if a temporary was not needed, that is, if the data was already accessible in the location it was requested.

Python to C++

hamr::buffer instances may be zero-copy constructed directly from both Numpy and Cupy arrays. For example

arr = cupy.full(16, 3.1415, dtype='float32')
buf = buffer(arr)

The hamr::buffer zero-copy constructor obtains pointers to the Numpy or Cupy array’s contents via the appropriate array interface protocol. The new hamr::buffer instance holds a reference to object sharing the data ensuring that it is kept alive while in use.

This is always a zero-copy transfer and modifications in one language will be visible in the other.

Examples

The source code for the following examples is located in the doc/rtd/source folder. The C++ examples include a simple Makefile that can be edited to point to a build.

Hello World! w/ C++ and OpenMP

This example illustrates coupling two codes, in this case functions, using HAMR so that they can process data produced either on the host or GPU without knowing specifically where the data passed to them resides. C++ smart pointers are used to manage temporary buffers if the passed data needed to be moved to the device where it was accessed. See hamr::buffer for more information. See Hello World! w/ C++ and CUDA for a CUDA implementation of this example. See Hello World! w/ C++ and HIP for a HIP implementation of this example.See Hello World! w/ Python and cupy for a Python implementation of this example.

Listing 1 Code that uses HAMR to access array based data in OpenMP. Calling get_openmp_accessible makes the array’s available in OpenMP if they are not. Then OpenMP device oofloading may be applied as usual.

template <typename T, typename U>
hamr::buffer<T> add(const hamr::buffer<T> &a1, const hamr::buffer<U> &a2)
{
    size_t n_vals = a1.size();

    // get pointers to the input arrays that are safe to use on the GPU
    auto [spa1, pa1] = hamr::get_openmp_accessible(a1);
    auto [spa2, pa2] = hamr::get_openmp_accessible(a2);

    // allocate the memory for the result on the GPU, and get a pointer to it
    hamr::buffer<T> ao(hamr::buffer_allocator::openmp, n_vals, T(0));
    T *pao = ao.data();

    // launch the kernel to add the arrays
    #pragma omp target teams distribute parallel for is_device_ptr(pao, pa1, pa2)
    for (size_t i = 0; i < n_vals; ++i)
    {
        pao[i] = pa1[i] + pa2[i];
    }

    return ao;
}

Listing 2 Code that uses HAMR to access array based data on the host. Calling get_host_accessible makes the array available on the host if they are not.

template <typename T>
void write(std::ostream &os, const hamr::buffer<T> &ai)
{
    // get pointer to the input array that is safe to use on the host
    auto [spai, pai] = hamr::get_host_accessible(ai);

    // write the elements of the array to the stream
    for (size_t i = 0; i < ai.size(); ++i)
    {
        os << pai[i] << " ";
    }

    os << std::endl;
}

Listing 3 This simple Hello world! program allocates an array on the GPU and an array on the host, both are initialized to 1. Then dispatch code use HAMR API’s to make sure that the data is accessible in OpenMP before launching a simple kernel that adds the two arrays. HMAR is used to make the data accessible on the host and print the result.

#include <hamr_buffer.h>

#include <iostream>
#include <memory>

#include "add.h"
#include "write.h"

int main(int, char **)
{
    size_t n_vals = 400;

    // allocate and initialize to 1 on the GPU
    hamr::buffer<float> a0(hamr::buffer_allocator::openmp, n_vals, 1.0f);

    // allocate and initialize to 1 on the host
    hamr::buffer<float> a1(hamr::buffer_allocator::malloc, n_vals, 1.0f);

    // add the two arrays
    hamr::buffer<float> a2 = add(a0, a1);

    // write the result
    write(std::cerr, a2);

    return 0;
}

Hello World! w/ C++ and CUDA

This example illustrates coupling two codes, in this case functions, using HAMR so that they can process data produced either on the host or GPU without knowing specifically where the data passed to them resides. C++ smart pointers are used to manage temporary buffers if the passed data needed to be moved to the device where it was accessed. See hamr::buffer for more information. See Hello World! w/ C++ and HIP for a HIP implementation of this example. See Hello World! w/ Python and cupy for a Python implementation of this example.

Listing 4 A simple CUDA kernel that adds two arrays.

template<typename T, typename U>
__global__
void add(T *result, const T *array_1, const U *array_2, size_t n_vals)
{
    unsigned long i = blockIdx.x*blockDim.x + threadIdx.x;

    if (i >= n_vals)
        return;

    result[i] = array_1[i] + array_2[i];
}

Listing 5 Code that uses HAMR to access array based data in CUDA. Calling get_cuda_accessible makes the array’s available in CUDA if they are not. Then CUDA kernels may be applied as usual.

#include "add.cuh"

template <typename T, typename U>
hamr::buffer<T> add(const hamr::buffer<T> &a1, const hamr::buffer<U> &a2)
{
    size_t n_vals = a1.size();

    // get pointers to the input arrays that are safe to use on the GPU
    auto [spa1, pa1] = hamr::get_cuda_accessible(a1);
    auto [spa2, pa2] = hamr::get_cuda_accessible(a2);

    // allocate the memory for the result on the GPU, and get a pointer to it
    hamr::buffer<T> ao(hamr::buffer_allocator::cuda, n_vals, T(0));
    T *pao = ao.data();

    // launch the kernel to add the arrays
    dim3 thread_grid(128);
    dim3 block_grid(n_vals/128 + (n_vals % 128 ? 1 : 0));
    add<<<block_grid, thread_grid>>>(pao, pa1, pa2, n_vals);

    return ao;
}

Listing 6 Code that uses HAMR to access array based data on the host. Calling get_host_accessible makes the array available on the host if they are not.

template <typename T>
void write(std::ostream &os, const hamr::buffer<T> &ai)
{
    // get pointer to the input array that is safe to use on the host
    auto spai = ai.get_host_accessible();
    const T *pai = spai.get();

    // write the elements of the array to the stream
    for (int i = 0; i < ai.size(); ++i)
    {
        os << pai[i] << " ";
    }

    os << std::endl;
}

Listing 7 This simple Hello world! program allocates an array on the GPU and an array on the host, both are initialized to 1. Then dispatch code use HAMR API’s to make sure that the data is accessible in CUDA before launching a simple kernel that adds the two arrays. HMAR is used to make the data accessible on the host and print the result.

#include <hamr_buffer.h>
#include <hamr_buffer_util.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>
#include <memory>

#include "add.h"
#include "write.h"

int main(int, char **)
{
    size_t n_vals = 400;

    // allocate and initialize to 1 on the GPU
    hamr::buffer<float> a0(hamr::buffer_allocator::cuda, n_vals, 1.0f);

    // allocate and initialize to 1 on the host
    hamr::buffer<float> a1(hamr::buffer_allocator::malloc, n_vals, 1.0f);

    // add the two arrays
    hamr::buffer<float> a2 = add(a0, a1);

    // write the result
    write(std::cerr, a2);

    return 0;
}

Hello World! w/ C++ and HIP

This example illustrates coupling two codes, in this case functions, using HAMR so that they can process data produced either on the host or GPU without knowing specifically where the data passed to them resides. C++ smart pointers are used to manage temporary buffers if the passed data needed to be moved to the device where it was accessed. See hamr::buffer for more information. See Hello World! w/ C++ and CUDA for a CUDA implementation of this example.

Listing 8 A simple HIP kernel that adds two arrays.

template<typename T, typename U>
__global__
void add(T *result, const T *array_1, const U *array_2, size_t n_vals)
{
    unsigned long i = blockIdx.x*blockDim.x + threadIdx.x;

    if (i >= n_vals)
        return;

    result[i] = array_1[i] + array_2[i];
}

Listing 9 Code that uses HAMR to access array based data in HIP. Calling get_hip_accessible makes the array’s available in HIP if they are not. Then HIP kernels may be applied as usual.

#include "add_kernel.h"

template <typename T, typename U>
hamr::buffer<T> add(const hamr::buffer<T> &a1, const hamr::buffer<U> &a2)
{
    size_t n_vals = a1.size();

    // get pointers to the input arrays that are safe to use on the GPU
    auto [spa1, pa1] = hamr::get_hip_accessible(a1);
    auto [spa2, pa2] = hamr::get_hip_accessible(a2);

    // allocate the memory for the result on the GPU, and get a pointer to it
    hamr::buffer<T> ao(hamr::buffer_allocator::hip, n_vals, T(0));
    T *pao = ao.data();

    // launch the kernel to add the arrays
    dim3 thread_grid(128);
    dim3 block_grid(n_vals/128 + (n_vals % 128 ? 1 : 0));
    add<<<block_grid, thread_grid>>>(pao, pa1, pa2, n_vals);

    return ao;
}

Listing 10 Code that uses HAMR to access array based data on the host. Calling get_host_accessible makes the array available on the host if they are not.

template <typename T>
void write(std::ostream &os, const hamr::buffer<T> &ai)
{
    // get pointer to the input array that is safe to use on the host
    auto [spai, pai] = hamr::get_host_accessible(ai);

    // write the elements of the array to the stream
    for (int i = 0; i < ai.size(); ++i)
    {
        os << pai[i] << " ";
    }

    os << std::endl;
}

Listing 11 This simple Hello world! program allocates an array on the GPU and an array on the host, both are initialized to 1. Then dispatch code use HAMR API’s to make sure that the data is accessible in HIP before launching a simple kernel that adds the two arrays. HMAR is used to make the data accessible on the host and print the result.

#include <hamr_buffer.h>
#include <hip/hip_runtime.h>

#include <iostream>
#include <memory>

#include "add.h"
#include "write.h"

int main(int, char **)
{
    size_t n_vals = 400;

    // allocate and initialize to 1 on the GPU
    hamr::buffer<float> a0(hamr::buffer_allocator::hip, n_vals, 1.0f);

    // allocate and initialize to 1 on the host
    hamr::buffer<float> a1(hamr::buffer_allocator::malloc, n_vals, 1.0f);

    // add the two arrays
    hamr::buffer<float> a2 = add(a0, a1);

    // write the result
    write(std::cerr, a2);

    return 0;
}

Hello World! w/ Python and cupy

This example illustrates the coupling of two codes using hamr so that they can process data produced either on the host or GPU without knowing specifically where the data passed to them resides. HAMR’s Python integration handle data sharing between C++ and Python objects. See Python Integration for more information. See Hello World! w/ C++ and CUDA for a C++ implementation of this example.

Listing 12 This simple Hello world! program allocates an array on the GPU and an array on the host, both are initialized to 1. Then dispatch code use HAMR API’s to make sure that the data is accessible in CUDA before launching a simple kernel that adds the two arrays. HMAR is used to make the data accessible on the host and print the result.

from hamr import *
import cupy as cp
import numpy as np
import sys


def add(buf_0, buf_1):
    """ add 2 arrays on the GPU """
    arr_0 = cp.array(buf_0.get_cuda_accessible()) # share data w/ cupy on GPU
    arr_1 = cp.array(buf_1.get_cuda_accessible()) # share data w/ cupy on GPU
    arr_2 = arr_0 + arr_1                         # add on the GPU
    buf_2 = buffer(arr_2)                         # zero-copy from cupy on GPU
    return buf_2

def write(fh, buf):
    """ print the array on the host """
    arr = np.array(buf.get_host_accessible())      # share data w/ numpy on host
    fh.write('%s\n'%(str(arr)))                   # write to the file on host


n_vals = 400
buf_0 = buffer_float(buffer_allocator_cuda, n_vals, 1.0)   # allocate on the host
buf_1 = buffer_float(buffer_allocator_malloc, n_vals, 1.0) # allocate on the GPU

buf_2 = add(buf_0, buf_1)                                  # add the arrays

write(sys.stdout, buf_2)                                   # write the arrays