The HAMR User’s Guide
HAMR is a library defining an accelerator technology agnostic memory model that bridges between accelerator technologies (CUDA, HIP, ROCm, OpenMP, Sycl, OpenCL, Kokos, etc) and traditional hosts in heterogeneous computing environments. HAMR is light weight and implemented in modern C++. HAMR can be used to manage memory with in a single code or as a data model for coupling codes in a technologically agnostic way. HAMR provides a Python module for coupling C++ and Python codes which implements zero-copy data transfers to and from Python using the Numpy array interface and Numba CUDA array interface protocols.
Design
Modern: Implemented in C++20 HAMR is efficient and easy to use. Declarative: Producers and consumers delcare where data will be accessed. If there is a missmatch HAMR automatically moves the data. Lazy: Data is left in place until it is accessed. Zero-copy constructors enable efficient code coupling.
Source Code
Source code can be obtained at the HAMR github repository.
Documentation
HAMR’s C++ sources are documented via Doxygen at the HAMR Doxygen site. The hamr::buffer is a container that has capabilities similar to std::vector and can provide access to data in different accelerator execution environments.
Build and Install
HAMR is configured with CMake. The following CMake variables influence the build.
CMake Variable |
Description |
CMAKE_BUILD_TYPE |
Release or Debug. The default is Release. |
CMAKE_CXX_FLAGS |
HAMR will set the C++ compiler flags if not set. |
CMAKE_CUDA_FLAGS |
HAMR will set the CUDA compiler flags if not set. |
HAMR_ENABLE_CUDA |
If set to ON enables CUDA features. Default OFF |
HAMR_ENABLE_HIP |
If set to ON enables HIP features. Default OFF |
HAMR_ENABLE_OPENMP |
If set to ON enables OpenMP features. Default OFF |
HAMR_ENABLE_PYTHON |
If set to ON enables Python features. Default OFF |
BUILD_TESTING |
If set to ON enables regression tests. Default OFF |
Note, use with HIP one must tell CMake to use clang and clang++ from the ROCm install via the CMAKE_C_COMPILER and CMAKE_CXX_COMPILER CMake options.
Introduction
HAMR deals only with memory models and serves as a bridge for moving data between various low and high level accelerator technologies and the host at run time. HAMR is designed as a data model for coupling codes such that developers need not code to a specific technology in order to share data. For this reason HAMR does not implement an execution environment. Developers write their codes for the technology of their choice. The technology’s native execution environment is used for computation. HAMR provides data structures that manage memory and can be used to share data and couple codes written in different accelerator technologies (including host based codes), by different developers, such that the receiving code need not have knowledge of technology used to generate that data, and the sending code need not have knowledge of the technology that will be used to consume the data. HAMR manages the necessary memory movements such that the codes have access to the data in the technology where they will use it.
When allocating or accessing data, codes declare the environment (CUDA, HIP, ROCm, OpenMP, Sycl, OpenCL, Kokos etc) in which data will be accessed. Direct access to device pointers in that environment is cheap. When the data is shared with other codes which may not necessarily be written in the same technology and the shared data is accessed, the consumer declares the environment in which the data will be processed. If the data is not already accessible in that environment, it is moved upon access.
Modern C++ design patterns alleviate the burden for explicit management of temporary buffers. Lazy movement of data means data can be left in place. This is an important feature for coupling codes, where it is not known in advance which computational technology the consumer of the data will make use of.
Technology Agnostic Memory Management
hamr::buffer
The hamr::buffer class is a container that has capabilities similar to std::vector and can provide access to data in different accelerator execution environments. During construction Producers of data declare in which environment (CUDA, ROCm, HIP, OpernMP, etc) the data will initially be accessible in. Access to the data in the declared environment is essentially free. When consumers of the data need to access the data, they declare in which environment access is needed. If the consumers are accessing in an environment in which the data is inaccessible, a temporary allocation is created and the data is moved. Reference counting is used to manage temporary allocations.
For instance a code that runs in CUDA would allocate a buffer for the results of a calculation as follows:
size_t n_vals = 10000;
hamr::buffer<float> data = buffer<float>(allocator::cuda, n_vals);
This memory is allocated on the active CUDA device. A device pointer may be obtained as follows:
std::shared_ptr<float> spdata = data->get_cuda_accessible();
float *pdata = spdata.get();
Because the buffer data was allocated for use in CUDA, spdata points to the buffer’s contents, no data was moved. pdata is a device pointer that can be passed to a CUDA kernel.
The contents of this buffer can then be passed to codes written for other technologies. For instance a code written for the host could access the data as follows:
std::shared_ptr<float> spdata = data->get_host_accessible();
float *pdata = spdata.get();
Because the buffer data was allocated for use in CUDA, here spdata points to a temporary buffer that has been moved to the host. pdata is a pointer that can be used to access the buffers contents on the host. Modern C++ std::shared_ptr is used to manage the temporary. pdata is valid as long as spdata is in scope. In this way the consumer of the data need not know if the data was moved or accessed in place.
Python Integration
HAMR provides Python bindings that enable interoperability and data sharing between C++ and Python codes. This is accomplished via the Numpy array interface protocol, and the Numba CUDA array interface protocol. HAMR manages the C++ and Python data structures such that they will persist while in use in the other language.
C++ to Python
In Python one constructs hamr::buffer objects using the wrapped C++ API. The passed allocator flag indicates where the memory will be allocated. For instance an buffer of 16 values initialized to 3.1415 allocated on the GPU would be created with
buf = buffer_float(buffer_allocator_cuda, 16, 3.1415)
When sharing data with Numpy or Cupy a hamr::buffer_handle must be used. The hamr::buffer_handle implements both the Numpy array interface and the Numba CUDA array interface protocols enabling Numpy and Cupy to directly make use of the shared data.
When sharing with Numpy the hamr::buffer_handle is obtained by calling hamr::buffer::get_host_accessible.
arr = numpy.array(buf.get_host_accessible())
As explained above, HAMR ensures that the returned data is accessible on the host by either moving it to a temporary if necessary, or returning a pointer to the buffer contents.
Similarly, when sharing with Cupy the hamr::buffer_handle instance is obtained by calling hamr::buffer::get_cuda_accessible ensuring that the returned data is accessible in CUDA.
arr = cupy.array(buf.get_cuda_accessible())
The hamr::bufer_handle returned holds a reference to the shared data, or the temporary, if data was automatically moved. In turn the Numpy or Cupy array holds a reference to the hamr::buffer_handle, which keeps the data alive while Numpy or Cupy uses it.
While the data is transferred to Numpy or Cupy via the array interface protocol which is a zero-copy transfer mechanism, modifications made to the data will only be visible in the other language if a temporary was not needed, that is, if the data was already accessible in the location it was requested.
Python to C++
hamr::buffer instances may be zero-copy constructed directly from both Numpy and Cupy arrays. For example
arr = cupy.full(16, 3.1415, dtype='float32')
buf = buffer(arr)
The hamr::buffer zero-copy constructor obtains pointers to the Numpy or Cupy array’s contents via the appropriate array interface protocol. The new hamr::buffer instance holds a reference to object sharing the data ensuring that it is kept alive while in use.
This is always a zero-copy transfer and modifications in one language will be visible in the other.
Examples
The source code for the following examples is located in the doc/rtd/source folder. The C++ examples include a simple Makefile that can be edited to point to a build.
Hello World! w/ C++ and OpenMP
This example illustrates coupling two codes, in this case functions, using HAMR so that they can process data produced either on the host or GPU without knowing specifically where the data passed to them resides. C++ smart pointers are used to manage temporary buffers if the passed data needed to be moved to the device where it was accessed. See hamr::buffer for more information. See Hello World! w/ C++ and CUDA for a CUDA implementation of this example. See Hello World! w/ C++ and HIP for a HIP implementation of this example.See Hello World! w/ Python and cupy for a Python implementation of this example.
1template <typename T, typename U>
2hamr::buffer<T> add(const hamr::buffer<T> &a1, const hamr::buffer<U> &a2)
3{
4 size_t n_vals = a1.size();
5
6 // get pointers to the input arrays that are safe to use on the GPU
7 auto [spa1, pa1] = hamr::get_openmp_accessible(a1);
8 auto [spa2, pa2] = hamr::get_openmp_accessible(a2);
9
10 // allocate the memory for the result on the GPU, and get a pointer to it
11 hamr::buffer<T> ao(hamr::buffer_allocator::openmp, n_vals, T(0));
12 T *pao = ao.data();
13
14 // launch the kernel to add the arrays
15 #pragma omp target teams distribute parallel for is_device_ptr(pao, pa1, pa2)
16 for (size_t i = 0; i < n_vals; ++i)
17 {
18 pao[i] = pa1[i] + pa2[i];
19 }
20
21 return ao;
22}
1template <typename T>
2void write(std::ostream &os, const hamr::buffer<T> &ai)
3{
4 // get pointer to the input array that is safe to use on the host
5 auto [spai, pai] = hamr::get_host_accessible(ai);
6
7 // write the elements of the array to the stream
8 for (size_t i = 0; i < ai.size(); ++i)
9 {
10 os << pai[i] << " ";
11 }
12
13 os << std::endl;
14}
1#include <hamr_buffer.h>
2
3#include <iostream>
4#include <memory>
5
6#include "add.h"
7#include "write.h"
8
9int main(int, char **)
10{
11 size_t n_vals = 400;
12
13 // allocate and initialize to 1 on the GPU
14 hamr::buffer<float> a0(hamr::buffer_allocator::openmp, n_vals, 1.0f);
15
16 // allocate and initialize to 1 on the host
17 hamr::buffer<float> a1(hamr::buffer_allocator::malloc, n_vals, 1.0f);
18
19 // add the two arrays
20 hamr::buffer<float> a2 = add(a0, a1);
21
22 // write the result
23 write(std::cerr, a2);
24
25 return 0;
26}
Hello World! w/ C++ and CUDA
This example illustrates coupling two codes, in this case functions, using HAMR so that they can process data produced either on the host or GPU without knowing specifically where the data passed to them resides. C++ smart pointers are used to manage temporary buffers if the passed data needed to be moved to the device where it was accessed. See hamr::buffer for more information. See Hello World! w/ C++ and HIP for a HIP implementation of this example. See Hello World! w/ Python and cupy for a Python implementation of this example.
1template<typename T, typename U>
2__global__
3void add(T *result, const T *array_1, const U *array_2, size_t n_vals)
4{
5 unsigned long i = blockIdx.x*blockDim.x + threadIdx.x;
6
7 if (i >= n_vals)
8 return;
9
10 result[i] = array_1[i] + array_2[i];
11}
1#include "add.cuh"
2
3template <typename T, typename U>
4hamr::buffer<T> add(const hamr::buffer<T> &a1, const hamr::buffer<U> &a2)
5{
6 size_t n_vals = a1.size();
7
8 // get pointers to the input arrays that are safe to use on the GPU
9 auto [spa1, pa1] = hamr::get_cuda_accessible(a1);
10 auto [spa2, pa2] = hamr::get_cuda_accessible(a2);
11
12 // allocate the memory for the result on the GPU, and get a pointer to it
13 hamr::buffer<T> ao(hamr::buffer_allocator::cuda, n_vals, T(0));
14 T *pao = ao.data();
15
16 // launch the kernel to add the arrays
17 dim3 thread_grid(128);
18 dim3 block_grid(n_vals/128 + (n_vals % 128 ? 1 : 0));
19 add<<<block_grid, thread_grid>>>(pao, pa1, pa2, n_vals);
20
21 return ao;
22}
1template <typename T>
2void write(std::ostream &os, const hamr::buffer<T> &ai)
3{
4 // get pointer to the input array that is safe to use on the host
5 auto spai = ai.get_host_accessible();
6 const T *pai = spai.get();
7
8 // write the elements of the array to the stream
9 for (int i = 0; i < ai.size(); ++i)
10 {
11 os << pai[i] << " ";
12 }
13
14 os << std::endl;
15}
1#include <hamr_buffer.h>
2#include <hamr_buffer_util.h>
3#include <cuda.h>
4#include <cuda_runtime.h>
5#include <iostream>
6#include <memory>
7
8#include "add.h"
9#include "write.h"
10
11int main(int, char **)
12{
13 size_t n_vals = 400;
14
15 // allocate and initialize to 1 on the GPU
16 hamr::buffer<float> a0(hamr::buffer_allocator::cuda, n_vals, 1.0f);
17
18 // allocate and initialize to 1 on the host
19 hamr::buffer<float> a1(hamr::buffer_allocator::malloc, n_vals, 1.0f);
20
21 // add the two arrays
22 hamr::buffer<float> a2 = add(a0, a1);
23
24 // write the result
25 write(std::cerr, a2);
26
27 return 0;
28}
Hello World! w/ C++ and HIP
This example illustrates coupling two codes, in this case functions, using HAMR so that they can process data produced either on the host or GPU without knowing specifically where the data passed to them resides. C++ smart pointers are used to manage temporary buffers if the passed data needed to be moved to the device where it was accessed. See hamr::buffer for more information. See Hello World! w/ C++ and CUDA for a CUDA implementation of this example.
1template<typename T, typename U>
2__global__
3void add(T *result, const T *array_1, const U *array_2, size_t n_vals)
4{
5 unsigned long i = blockIdx.x*blockDim.x + threadIdx.x;
6
7 if (i >= n_vals)
8 return;
9
10 result[i] = array_1[i] + array_2[i];
11}
1#include "add_kernel.h"
2
3template <typename T, typename U>
4hamr::buffer<T> add(const hamr::buffer<T> &a1, const hamr::buffer<U> &a2)
5{
6 size_t n_vals = a1.size();
7
8 // get pointers to the input arrays that are safe to use on the GPU
9 auto [spa1, pa1] = hamr::get_hip_accessible(a1);
10 auto [spa2, pa2] = hamr::get_hip_accessible(a2);
11
12 // allocate the memory for the result on the GPU, and get a pointer to it
13 hamr::buffer<T> ao(hamr::buffer_allocator::hip, n_vals, T(0));
14 T *pao = ao.data();
15
16 // launch the kernel to add the arrays
17 dim3 thread_grid(128);
18 dim3 block_grid(n_vals/128 + (n_vals % 128 ? 1 : 0));
19 add<<<block_grid, thread_grid>>>(pao, pa1, pa2, n_vals);
20
21 return ao;
22}
1template <typename T>
2void write(std::ostream &os, const hamr::buffer<T> &ai)
3{
4 // get pointer to the input array that is safe to use on the host
5 auto [spai, pai] = hamr::get_host_accessible(ai);
6
7 // write the elements of the array to the stream
8 for (int i = 0; i < ai.size(); ++i)
9 {
10 os << pai[i] << " ";
11 }
12
13 os << std::endl;
14}
1#include <hamr_buffer.h>
2#include <hip/hip_runtime.h>
3
4#include <iostream>
5#include <memory>
6
7#include "add.h"
8#include "write.h"
9
10int main(int, char **)
11{
12 size_t n_vals = 400;
13
14 // allocate and initialize to 1 on the GPU
15 hamr::buffer<float> a0(hamr::buffer_allocator::hip, n_vals, 1.0f);
16
17 // allocate and initialize to 1 on the host
18 hamr::buffer<float> a1(hamr::buffer_allocator::malloc, n_vals, 1.0f);
19
20 // add the two arrays
21 hamr::buffer<float> a2 = add(a0, a1);
22
23 // write the result
24 write(std::cerr, a2);
25
26 return 0;
27}
Hello World! w/ Python and cupy
This example illustrates the coupling of two codes using hamr so that they can process data produced either on the host or GPU without knowing specifically where the data passed to them resides. HAMR’s Python integration handle data sharing between C++ and Python objects. See Python Integration for more information. See Hello World! w/ C++ and CUDA for a C++ implementation of this example.
1from hamr import *
2import cupy as cp
3import numpy as np
4import sys
5
6
7def add(buf_0, buf_1):
8 """ add 2 arrays on the GPU """
9 arr_0 = cp.array(buf_0.get_cuda_accessible()) # share data w/ cupy on GPU
10 arr_1 = cp.array(buf_1.get_cuda_accessible()) # share data w/ cupy on GPU
11 arr_2 = arr_0 + arr_1 # add on the GPU
12 buf_2 = buffer(arr_2) # zero-copy from cupy on GPU
13 return buf_2
14
15def write(fh, buf):
16 """ print the array on the host """
17 arr = np.array(buf.get_host_accessible()) # share data w/ numpy on host
18 fh.write('%s\n'%(str(arr))) # write to the file on host
19
20
21n_vals = 400
22buf_0 = buffer_float(buffer_allocator_cuda, n_vals, 1.0) # allocate on the host
23buf_1 = buffer_float(buffer_allocator_malloc, n_vals, 1.0) # allocate on the GPU
24
25buf_2 = add(buf_0, buf_1) # add the arrays
26
27write(sys.stdout, buf_2) # write the arrays