ACL

In this article, we assume that the target device is equipped with a mobile GPU and uses the OpenCL kernels of the ARM Compute Library (ACL).

Memory allocation

Input / parameters - zero copy

After the graph is configured, the graph builder allocates the memory required by each tensor, based on the tensor's given shape (size). The OpenCL implementation uses CL_MEM_ALLOC_HOST_PTR when allocating GPU memory, so the user application can map the buffer and put data directly into GPU memory. The application therefore does not need to keep a separate buffer of its own and can avoid a data copy.
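The difference between the copy path and the zero-copy path can be illustrated on the host alone. The sketch below is not OpenCL or ACL code: DeviceBuffer, upload_copy, map, unmap, fill_with_copy and fill_zero_copy are hypothetical names that stand in for a buffer created with CL_MEM_ALLOC_HOST_PTR and for the map and unmap calls (in OpenCL, clEnqueueMapBuffer and clEnqueueUnmapMemObject). It only counts how many bytes each strategy copies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-only stand-in for a GPU buffer. It is not OpenCL or ACL
// code; it only counts the bytes that each fill strategy copies.
class DeviceBuffer
{
public:
    explicit DeviceBuffer(std::size_t count) : _storage(count, 0.0f) {}

    // Copy path: data prepared elsewhere is copied into the buffer.
    void upload_copy(const float *src, std::size_t count)
    {
        assert(count <= _storage.size());
        std::memcpy(_storage.data(), src, count * sizeof(float));
        _bytes_copied += count * sizeof(float);
    }

    // Zero-copy path: expose the buffer's own storage to the application.
    float *map()
    {
        _mapped = true;
        return _storage.data();
    }

    // Return the buffer to the "device" once the host has finished writing.
    void unmap() { _mapped = false; }

    bool        is_mapped() const { return _mapped; }
    std::size_t bytes_copied() const { return _bytes_copied; }
    std::size_t size() const { return _storage.size(); }
    float       at(std::size_t i) const { return _storage.at(i); }

private:
    std::vector<float> _storage;
    std::size_t        _bytes_copied = 0;
    bool               _mapped       = false;
};

// Fill the buffer from a separate application-side staging buffer.
std::size_t fill_with_copy(DeviceBuffer &buf)
{
    std::vector<float> staging(buf.size());
    for(std::size_t i = 0; i < staging.size(); ++i)
    {
        staging[i] = static_cast<float>(i);
    }
    buf.upload_copy(staging.data(), staging.size());
    return buf.bytes_copied();
}

// Fill the buffer in place through the mapped pointer; no staging buffer.
std::size_t fill_zero_copy(DeviceBuffer &buf)
{
    float *ptr = buf.map();
    for(std::size_t i = 0; i < buf.size(); ++i)
    {
        ptr[i] = static_cast<float>(i);
    }
    buf.unmap();
    return buf.bytes_copied();
}
```

For a buffer of 1024 floats, fill_with_copy reports 1024 * sizeof(float) bytes copied, while fill_zero_copy reports 0 because the application writes straight into the buffer's storage.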

No given data: generating dummy tensors

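If no data is given, the framework generates a dummy tensor and a dummy accessor. For comparison, NumPyBinLoader, an accessor that does load data, fills the tensor from a NumPy file: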

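bool NumPyBinLoader::access_tensor(ITensor &tensor)
{
    // Load the file contents into the tensor unless they are marked as loaded.
    if(!_already_loaded)
    {
        utils::NPYLoader loader;
        loader.open(_filename, _file_layout);
        loader.fill_tensor(tensor);
    }

    // Toggle the flag: the call that loads returns true, the next call returns false.
    _already_loaded = !_already_loaded;
    return _already_loaded;
}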

Unlike NumPyBinLoader, ACL does not initialize or copy any data for a dummy tensor; it only allocates the memory.

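bool DummyAccessor::access_tensor(ITensor &tensor)
{
    // The tensor is never read or written.
    ARM_COMPUTE_UNUSED(tensor);
    // True while the call count is below _maximum; always true when _maximum is 0.
    bool ret = _maximum == 0 || _iterator < _maximum;
    // Restart the counter once _maximum is reached.
    if(_iterator == _maximum)
    {
        _iterator = 0;
    }
    else
    {
        _iterator++;
    }
    return ret;
}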
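The return values of the two accessors over successive calls are easier to see in isolation. The following is a minimal sketch that copies only the counter and flag logic from the snippets above. DummyCounter, LoadToggle and call_n are hypothetical names, the tensor argument is dropped, the counters are assumed to be unsigned integers starting at 0, and the actual file load is replaced by a load counter.

```cpp
#include <cassert>
#include <vector>

// Standalone copy of the counter logic in DummyAccessor::access_tensor.
// The tensor argument is dropped because the original never touches it.
struct DummyCounter
{
    explicit DummyCounter(unsigned int maximum) : _maximum(maximum) {}

    bool access()
    {
        bool ret = _maximum == 0 || _iterator < _maximum;
        if(_iterator == _maximum)
        {
            _iterator = 0;
        }
        else
        {
            _iterator++;
        }
        return ret;
    }

    unsigned int _iterator = 0;
    unsigned int _maximum;
};

// Standalone copy of the flag logic in NumPyBinLoader::access_tensor.
struct LoadToggle
{
    bool access()
    {
        if(!_already_loaded)
        {
            ++_loads; // stands in for loader.fill_tensor(tensor)
        }
        _already_loaded = !_already_loaded;
        return _already_loaded;
    }

    bool         _already_loaded = false;
    unsigned int _loads          = 0;
};

// Collect the results of n consecutive access() calls.
template <typename Accessor>
std::vector<bool> call_n(Accessor &acc, unsigned int n)
{
    std::vector<bool> results;
    for(unsigned int i = 0; i < n; ++i)
    {
        results.push_back(acc.access());
    }
    assert(results.size() == n);
    return results;
}
```

With a maximum of 2, five calls to DummyCounter::access yield true, true, false, true, true; with a maximum of 0 every call yields true. LoadToggle alternates true, false, true, false and performs a load on every call that returns true.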

Running sequence

1. Construct NN: add layers to the graph together with their input/output tensor sizes.

2. Graph optimization: fusion, sort, etc.

3. Memory allocation: allocate backend memory (CPU or GPU) for the tensors used in the NN.

4. Copy data: if data is given, copy it to the allocated memory. NB: if no data is given, nothing is copied (the dummy tensor case described above).

5. Kernel compilation
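The order of the steps above, and the point at which the copy is skipped, can be modelled with a toy graph. This is only a sketch of the sequence, not the ACL implementation: ToyTensor, ToyGraph, element_count, add_tensor and finalize are hypothetical names, optimization and kernel compilation are reduced to log entries, and a std::vector stands in for backend memory.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// All names here are hypothetical; the model shows the order of the steps only.
struct ToyTensor
{
    std::vector<std::size_t>  shape;
    std::vector<float>        storage; // stands in for backend (CPU or GPU) memory
    const std::vector<float> *source;  // nullptr means "no data given"
};

// Number of elements implied by a shape (product of its dimensions).
std::size_t element_count(const std::vector<std::size_t> &shape)
{
    return std::accumulate(shape.begin(), shape.end(), std::size_t{ 1 },
                           std::multiplies<std::size_t>());
}

struct ToyGraph
{
    std::vector<ToyTensor>   tensors;
    std::vector<std::string> log;
    std::size_t              elements_copied = 0;

    // Step 1: register a tensor with its shape and, optionally, its data.
    void add_tensor(std::vector<std::size_t> shape, const std::vector<float> *source = nullptr)
    {
        tensors.push_back(ToyTensor{ std::move(shape), {}, source });
    }

    // Steps 2 to 5, in the order listed above.
    void finalize()
    {
        log.push_back("optimize"); // placeholder for fusion, sort, etc.

        log.push_back("allocate"); // size every tensor from its shape
        for(auto &t : tensors)
        {
            t.storage.resize(element_count(t.shape));
        }

        log.push_back("copy");
        for(auto &t : tensors)
        {
            if(t.source == nullptr)
            {
                continue; // dummy tensor: nothing is copied in
            }
            assert(t.source->size() == t.storage.size());
            t.storage = *t.source;
            elements_copied += t.source->size();
        }

        log.push_back("compile"); // placeholder for kernel compilation
    }
};
```

Adding one tensor of shape {2, 3} with data and one of shape {1, 4} without data, then calling finalize, allocates 6 and 4 elements respectively but copies only the 6 elements of the first tensor.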

Summary

Zero copy is used (a mapped host pointer is used to access GPU memory).
No data is copied to fill the GPU buffers when a dummy tensor is used.
However, memory synchronization still runs before computation starts (e.g. input synchronization).