The HAMR User’s Guide

HAMR is a library that defines an accelerator-technology-agnostic memory model, bridging accelerator technologies (CUDA, HIP, ROCm, OpenMP, Kokkos, etc.) and traditional CPUs in heterogeneous computing environments. HAMR is lightweight and implemented in modern C++.

Unlike other platform portability libraries, HAMR deals only with the memory model and serves as a bridge for moving data between technologies at run time. HAMR is designed to make data easily accessible when coupling codes written for different technologies. For this reason, HAMR does not implement an execution environment. Instead, each technology’s native execution environment is used.

When allocating or accessing data, codes declare the environment in which the data will be accessed. Access to the data in that environment is essentially free. The data can then be passed to other codes, which are not necessarily written in the same technology. Those codes in turn declare the environment in which they will access the data; if the data is not accessible in that environment, it is moved.

Technology Agnostic Memory Management

hamr::buffer

The hamr::buffer class is a container that has capabilities similar to std::vector and can provide access to data in different accelerator execution environments. During construction, producers of data declare the environment (CUDA, ROCm, HIP, OpenMP, etc.) in which the data will initially be accessible. Access to the data in the declared environment is essentially free. When consumers need to access the data, they declare the environment in which access is needed. If a consumer requests access in an environment in which the data is inaccessible, a temporary allocation is created and the data is moved. Reference counting is used to manage temporary allocations.
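The access pattern described above can be illustrated with a small stand-alone model. This is a minimal sketch, not HAMR's implementation: the names toy_buffer, env, and get_accessible are hypothetical, and both environments are modeled with ordinary host memory so that the code runs without an accelerator. It shows only the two behaviors named in the text: access in the declared environment shares the existing allocation, while access in any other environment produces a temporary copy whose lifetime is managed by the reference count of a std::shared_ptr.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical environment tags; these are not HAMR identifiers.
enum class env { cpu, device };

// A toy buffer that records the environment in which its data lives.
// Both environments are backed by host memory in this sketch.
template <typename T>
class toy_buffer
{
public:
    toy_buffer(env where, std::size_t n, const T &val)
        : m_where(where), m_data(std::make_shared<std::vector<T>>(n, val)) {}

    // Return a reference counted pointer usable in the requested environment.
    std::shared_ptr<const T> get_accessible(env where) const
    {
        if (where == m_where)
        {
            // Already accessible: share the existing allocation, no copy.
            return std::shared_ptr<const T>(m_data, m_data->data());
        }

        // Not accessible: make a temporary copy, which stands in for a
        // transfer between environments. The temporary is released when
        // the last shared_ptr that refers to it is destroyed.
        auto tmp = std::make_shared<std::vector<T>>(*m_data);
        return std::shared_ptr<const T>(tmp, tmp->data());
    }

    std::size_t size() const { return m_data->size(); }

private:
    env m_where;
    std::shared_ptr<std::vector<T>> m_data;
};

// Example use: the producer declares the device, then two consumers read.
inline void example()
{
    toy_buffer<float> buf(env::device, 4, 1.0f);

    auto dev = buf.get_accessible(env::device); // shares the allocation
    auto cpu = buf.get_accessible(env::cpu);    // temporary copy

    assert(dev.get() != cpu.get());
    assert(cpu.get()[0] == 1.0f);
}
```

In this model the consumer holds the returned std::shared_ptr for as long as it uses the raw pointer, which mirrors how the listings in the Examples section keep the smart pointer returned by get_cuda_accessible or get_cpu_accessible alive next to the raw pointer obtained from get().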

Online Source Code Documentation

HAMR’s C++ sources are documented via Doxygen at the HAMR Doxygen site.

Examples

CUDA

This example illustrates the use of HAMR to move data between the CPU and the GPU for use with CUDA.

Listing 1 A simple CUDA kernel that adds two arrays.

#ifndef add_cuda_h
#define add_cuda_h

#include "hamr_cuda_launch.h"

#include <cuda.h>
#include <cuda_runtime.h>

// **************************************************************************
template<typename T, typename U>
__global__
void add_cuda(T *result, const T *array_1, const U *array_2, size_t n_vals)
{
    unsigned long i = hamr::thread_id_to_array_index();

    if (i >= n_vals)
        return;

    result[i] = array_1[i] + array_2[i];
}

#endif

Listing 2 Code that uses HAMR to access array-based data in CUDA. Calling get_cuda_accessible makes the arrays available in CUDA if they are not already. CUDA kernels may then be applied as usual.

#ifndef add_cuda_dispatch_h
#define add_cuda_dispatch_h

#include "add_cuda.h"

#include <hamr_buffer.h>
#include <hamr_cuda_launch.h>

#include <cuda.h>
#include <cuda_runtime.h>

#include <iostream>
#include <memory>

using hamr::buffer;
using hamr::p_buffer;
using allocator = hamr::buffer_allocator;

// **************************************************************************
template <typename T, typename U>
p_buffer<T> add_cuda(const p_buffer<T> &a1, const p_buffer<U> &a2)
{
    // get the inputs
    auto spa1 = a1->get_cuda_accessible();
    const T *pa1 = spa1.get();

    auto spa2 = a2->get_cuda_accessible();
    const U *pa2 = spa2.get();

    // allocate the memory
    size_t n_vals = a1->size();
    p_buffer<T> ao = std::make_shared<buffer<T>>(allocator::cuda);
    ao->resize(n_vals, T(0));

    auto spao = ao->get_cuda_accessible();
    T *pao = spao.get();

    // determine kernel launch parameters
    int n_blocks = 0;
    dim3 block_grid;
    dim3 thread_grid;
    if (hamr::partition_thread_blocks(0, n_vals,
        8, block_grid, n_blocks, thread_grid))
    {
        std::cerr << "ERROR: Failed to determine launch parameters" << std::endl;
        return nullptr;
    }

    // launch the kernel that adds the two arrays
    cudaError_t ierr = cudaSuccess;
    add_cuda<<<block_grid, thread_grid>>>(pao, pa1, pa2, n_vals);
    if ((ierr = cudaGetLastError()) != cudaSuccess)
    {
        std::cerr << "ERROR: Failed to launch the add_cuda kernel. "
            << cudaGetErrorString(ierr) << std::endl;
        return nullptr;
    }

    return ao;
}

#endif
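The launch sizing in Listing 2 and the bounds check in Listing 1 work together, and the arithmetic can be sketched on the host. This is an illustration under stated assumptions, not the implementation of hamr::partition_thread_blocks or hamr::thread_id_to_array_index: it assumes the argument 8 is the number of warps per block, a warp size of 32, and a one-dimensional grid. The helper names blocks_needed, flat_index, and idle_threads are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed warp size; 32 on current NVIDIA GPUs.
constexpr std::size_t warp_size = 32;

// Number of 1D thread blocks needed to cover n_vals elements, computed
// with a ceiling division by the number of threads per block.
constexpr std::size_t blocks_needed(std::size_t n_vals,
    std::size_t warps_per_block)
{
    return (n_vals + warps_per_block * warp_size - 1)
        / (warps_per_block * warp_size);
}

// Flat array index of a thread in a 1D launch.
constexpr std::size_t flat_index(std::size_t block_idx,
    std::size_t block_dim, std::size_t thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

// Count the launched threads whose index falls past the end of the array.
// These are the threads that take the early return in the kernel.
inline std::size_t idle_threads(std::size_t n_vals,
    std::size_t warps_per_block)
{
    assert(warps_per_block > 0);

    const std::size_t block_dim = warps_per_block * warp_size;
    const std::size_t n_blocks = blocks_needed(n_vals, warps_per_block);

    std::size_t idle = 0;
    for (std::size_t b = 0; b < n_blocks; ++b)
    {
        for (std::size_t t = 0; t < block_dim; ++t)
        {
            if (flat_index(b, block_dim, t) >= n_vals)
                ++idle;
        }
    }
    return idle;
}
```

Under these assumptions, an array of 400 values with 8 warps per block gives 256 threads per block and 2 blocks, so 512 threads are launched and 112 of them have no element to process. Because the block count is rounded up, the kernel needs a guard such as the `if (i >= n_vals) return;` test to keep those threads from accessing memory past the end of the arrays.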

Listing 3 This simple hello-world-style program allocates one array on the GPU and one on the CPU, both initialized to 1. The dispatch code then uses HAMR APIs to make sure that the data is accessible in CUDA before launching a simple kernel that adds the two arrays. Finally, HAMR is used to make the result accessible on the CPU, where it is printed.

#include "add_cuda_dispatch.h"

#include <hamr_buffer.h>
#include <iostream>
#include <memory>

int main(int, char **)
{
    size_t n_vals = 400;

    // allocate an array initialized to 1 on the GPU
    auto a0 = std::make_shared<buffer<float>>(allocator::cuda, n_vals, 1.0f);

    // allocate an array initialized to 1 on the CPU
    auto a1 = std::make_shared<buffer<float>>(allocator::malloc, n_vals, 1.0f);

    // add the two arrays on the GPU
    auto a2 = add_cuda(a0, a1);

    // add_cuda returns nullptr if the kernel could not be launched
    if (!a2)
        return -1;

    // access the result on the CPU
    auto spa2 = a2->get_cpu_accessible();
    float *pa2 = spa2.get();

    // print the result on the CPU
    std::cerr << "a2 = ";
    for (size_t i = 0; i < a2->size(); ++i)
        std::cerr << pa2[i] << " ";
    std::cerr << std::endl;

    return 0;
}