ACL
In this article, we assume that the target device is equipped with a mobile GPU and uses the OpenCL kernels of the ARM Compute Library (ACL).
Memory allocation
Input / parameters - zero copy
After the graph is configured, the graph builder allocates the memory required for each tensor based on its given shape (size). The OpenCL implementation allocates GPU memory with the CL_MEM_ALLOC_HOST_PTR flag, which lets the user application put data directly into the GPU memory. The application therefore does not need to keep a separate buffer and avoids a data copy.
No given data: generating dummy tensors
If no data is given, the framework generates a dummy tensor and accessor.
bool NumPyBinLoader::access_tensor(ITensor &tensor)
{
    if(!_already_loaded)
    {
        utils::NPYLoader loader;
        loader.open(_filename, _file_layout);
        loader.fill_tensor(tensor);
    }
    _already_loaded = !_already_loaded;
    return _already_loaded;
}
Unlike NumPyBinLoader, the ACL does not initialize or copy data here but just allocates memory for the dummy tensor.
bool DummyAccessor::access_tensor(ITensor &tensor)
{
    ARM_COMPUTE_UNUSED(tensor);
    bool ret = _maximum == 0 || _iterator < _maximum;
    if(_iterator == _maximum)
    {
        _iterator = 0;
    }
    else
    {
        _iterator++;
    }
    return ret;
}
Running sequence
1. Construct NN: add layers into the graph together with their input/output tensor sizes.
2. Graph optimization: fusion, sort, etc.
3. Memory allocation: allocate backend memory (CPU or GPU) for the tensors used in the NN.
4. Copy data: if data is given, copy it into the allocated memory. NB: if no data is given, the dummy accessor skips this copy.
5. Kernel compilation.
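This sequence corresponds roughly to ACL's graph frontend API. A minimal sketch (shapes, layers, and accessor choices are illustrative only, and exact signatures vary across ACL versions):

```cpp
#include "arm_compute/graph.h"
#include "utils/GraphUtils.h"
#include <memory>

using namespace arm_compute;
using namespace arm_compute::graph::frontend;

int main()
{
    // 1. Construct the NN: layers are appended with their tensor shapes.
    Stream graph(0, "zero_copy_example");
    graph << Target::CL // OpenCL (mobile GPU) backend
          << InputLayer(TensorDescriptor(TensorShape(224U, 224U, 3U, 1U), DataType::F32),
                        std::make_unique<graph_utils::DummyAccessor>())
          << ConvolutionLayer(3U, 3U, 16U,
                              std::make_unique<graph_utils::DummyAccessor>(), // weights
                              std::make_unique<graph_utils::DummyAccessor>(), // biases
                              PadStrideInfo(1, 1, 1, 1))
          << OutputLayer(std::make_unique<graph_utils::DummyAccessor>());

    // 2-5. finalize() covers graph optimization, backend memory allocation,
    // and kernel compilation; accessors fill (or skip filling) the tensors.
    GraphConfig config;
    graph.finalize(Target::CL, config);
    graph.run();
    return 0;
}
```

Because DummyAccessor is used for every tensor here, step 4 degenerates to "allocate only", which is what the summary below refers to.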
Summary
- Zero copy is used (a mapped host pointer gives direct access to GPU memory).
- No data copy is needed to fill the GPU buffers when a dummy tensor is used.
- However, memory synchronization still runs before computation starts (e.g., input synchronization).