The HAMR User’s Guide

HAMR is a library defining an accelerator technology agnostic memory model that bridges between accelerator technologies (CUDA, HIP, ROCm, OpenMP, Kokos, etc) and traditional CPUs in heterogeneous computing environments. HAMR is light weight and implemented in modern C++.

Unlike other platform portability libraries HAMR deals only with the memory model and serves as a bridge for moving data between technologies at run time. HAMR is designed to make data easily accessible when coupling codes written for use in different technologies. For this reason HAMR does not implemnent an execution environment. Instead the technology’s native execution environment is used.

When allocating or accessing data, codes declare the envirnment in which data will be accessed. Access to the data in that environment is essentially free. The data can then be passed to other codes which may not neccessarily be written in the same technology. Those codes declare the environment in which the data will be accessed, if the data is not accessibile in that environment, it is moved.

Technology Agnostic Memory Management

hamr::buffer

The hamr::buffer class is a container that has capabilities similar to std::vector and can provide access to data in different accelerator execution environments. Durinng construction Producers of data declare in which environment (CUDA,ROCm, HIP, OpernMP, etc) the data will initially be accessible in. Access to the data in the declared environment is essentially free. When consumers of the data need to access the data, they declare in which environment access is needed. If the consumers are accessing in an environment in which the data is in accessibl, a temporary allocation is created and the data is moved. Reference counting is used to manage temporary allocations.

Online Source Code Documentation

HAMR’s C++ sources are documented via Doxygen at the HAMR Doxygen site.

Examples

CUDA

This example illustrates the use of hamr moving data to and from the GPU and CPU for use with CUDA.

Listing 1 A simple CUDA kernel that adds two arrays.
 1#ifndef add_cuda_h
 2#define add_cuda_h
 3
 4#include "hamr_cuda_launch.h"
 5
 6#include <cuda.h>
 7#include <cuda_runtime.h>
 8
 9// **************************************************************************
10template<typename T, typename U>
11__global__
12void add_cuda(T *result, const T *array_1, const U *array_2, size_t n_vals)
13{
14    unsigned long i = hamr::thread_id_to_array_index();
15
16    if (i >= n_vals)
17        return;
18
19    result[i] = array_1[i] + array_2[i];
20}
21
22#endif
Listing 2 Code that uses HAMR to access array based data in CUDA. Calling get_cuda_accessible makes the array’s available in CUDA if they are not. Then CUDA kernels may be applied as usual.
 1#ifndef add_cuda_dispatch_h
 2#define add_cuda_dispatch_h
 3
 4#include "add_cuda.h"
 5
 6#include <hamr_buffer.h>
 7#include <hamr_cuda_launch.h>
 8
 9#include <cuda.h>
10#include <cuda_runtime.h>
11
12#include <iostream>
13
14using hamr::buffer;
15using hamr::p_buffer;
16using allocator = hamr::buffer_allocator;
17
18// **************************************************************************
19template <typename T, typename U>
20p_buffer<T> add_cuda(const p_buffer<T> &a1, const p_buffer<U> &a2)
21{
22    // get the inputs
23    auto spa1 = a1->get_cuda_accessible();
24    const T *pa1 = spa1.get();
25
26    auto spa2 = a2->get_cuda_accessible();
27    const U *pa2 = spa2.get();
28
29    // allocate the memory
30    size_t n_vals = a1->size();
31    p_buffer<T> ao = std::make_shared<buffer<T>>(allocator::cuda);
32    ao->resize(n_vals, T(0));
33
34    auto spao = ao->get_cuda_accessible();
35    T *pao = spao.get();
36
37    // determine kernel launch parameters
38    int n_blocks = 0;
39    dim3 block_grid;
40    dim3 thread_grid;
41    if (hamr::partition_thread_blocks(0, n_vals,
42        8, block_grid, n_blocks, thread_grid))
43    {
44        std::cerr << "ERROR: Failed to determine launch parameters" << std::endl;
45        return nullptr;
46    }
47
48    // initialize the data
49    cudaError_t ierr = cudaSuccess;
50    add_cuda<<<block_grid, thread_grid>>>(pao, pa1, pa2, n_vals);
51    if ((ierr = cudaGetLastError()) != cudaSuccess)
52    {
53        std::cerr << "ERROR: Failed to launch the add_cuda kernel. "
54            << cudaGetErrorString(ierr) << std::endl;
55        return nullptr;
56    }
57
58    return ao;
59}
60
61#endif
Listing 3 This simple hello world style program allocates an array on the GPU and an array on the CPU, both are initialized to 1. Then dispatch code use HAMR API’s to make sure that the data is accessible in CUDA before launching a simple kernel that adds the two arrays. HMAR is used to make the data accessible on the CPU and print the resulkt.
 1#include "add_cuda_dispatch.h"
 2
 3#include <hamr_buffer.h>
 4#include <iostream>
 5#include <memory>
 6
 7int main(int, char **)
 8{
 9    size_t n_vals = 400;
10
11    // allocate an array initialized to 1 on the GPU
12    auto a0 = std::make_shared<buffer<float>>(allocator::cuda, n_vals, 1.0f);
13
14    // allocate an array initialized to 1 on the CPU
15    auto a1 = std::make_shared<buffer<float>>(allocator::malloc, n_vals, 1.0f);
16
17    // add the two arrays on the GPU
18    auto a2 = add_cuda(a0, a1);
19
20    // access the result on the CPU
21    auto spa2 = a2->get_cpu_accessible();
22    float *pa2 = spa2.get();
23
24    // print the result on the CPU
25    std::cerr << "a2 = ";
26    for (int i = 0; i < a2->size(); ++i)
27        std::cerr << pa2[i] << " ";
28    std::cerr << std::endl;
29
30    return 0;
31}