Jetson_nano

Spec

Tegra X1 SoC (TM660M)

CPU: 4 A53 + 4 A57
GPU: Maxwell-based (128 cores)

Note:

It seems to gather multiple commands into a job, but execution is serialized at job granularity
No OpenCL support, only CUDA
Cannot find specific

Heejin's writeup:

https://bakhi.github.io/Jetson-nano/

Terms

gk20a -- the GPU gen of jetson tk1 (kepler)

gr3d - the GPU engine. ref

cde: color decompression engine

cdma: command dma

command stream: a sequence of GPU reg IO written to host1x

channel: a FIFO to push commands to a client (i.e. a hw unit)

Grate - prior effort for reverse engineering

Record and replay syscalls (ioctl) at the user/kernel interface
Myriad cmd submissions (https://github.com/grate-driver/grate/blob/master/tests/nvhost/gr3d.c): seem to be dumped from the cmd stream

EXP Environment

Use a simple test application (vector addition)
Dump trace
printk
ftrace

Ftrace

Channel worker messes up the trace

It seems a separate channel worker thread puts work items into, or gets them from, the queue. Even when no GPU app is running, ftrace continuously prints both gk20a_channel_put: channel 511 caller gk20a_channel_poll_timeouts and gk20a_channel_get: channel 511 caller gk20a_channel_poll_timeouts, over and over.

Achieve clean trace

The channel worker makes the entire trace messy. To get a clean trace, apply the following tips:

Add the following functions to set_ftrace_notrace
__nvgpu_log_dbg
_gk20a_channel_put
nvgpu_thread_should_stop
nvgpu_cond_broadcast
nvgpu_timeout_peek_expired
nvgpu_platform_is_silicon
Disable the following events from gk20a
gk20a_channel_put
gk20a_channel_get
gk20a_channel_set_timeout
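
The two tips above can be scripted. Below is a minimal sketch in C, assuming tracefs is mounted (commonly at /sys/kernel/debug/tracing or /sys/kernel/tracing) and that the events live under events/gk20a/. The helper names append_names and disable_event are hypothetical and not part of any driver.

```c
#include <stdio.h>

/* Functions to hide from the function tracer (names from the list above). */
const char *const notrace_funcs[] = {
    "__nvgpu_log_dbg",
    "_gk20a_channel_put",
    "nvgpu_thread_should_stop",
    "nvgpu_cond_broadcast",
    "nvgpu_timeout_peek_expired",
    "nvgpu_platform_is_silicon",
};

/* Append each name to `path`, one per line, flushing after every name so
 * each one reaches the file as its own write.
 * Returns 0 on success, -1 on error. */
int append_names(const char *path, const char *const *names, size_t n)
{
    FILE *f = fopen(path, "a");
    if (!f)
        return -1;
    for (size_t i = 0; i < n; i++) {
        if (fprintf(f, "%s\n", names[i]) < 0 || fflush(f) != 0) {
            fclose(f);
            return -1;
        }
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Write "0" to <tracefs_root>/events/gk20a/<event>/enable.
 * The "gk20a" event group is an assumption based on the notes above. */
int disable_event(const char *tracefs_root, const char *event)
{
    char path[256];
    int len = snprintf(path, sizeof(path), "%s/events/gk20a/%s/enable",
                       tracefs_root, event);
    if (len < 0 || (size_t)len >= sizeof(path))
        return -1;
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = fputs("0", f) >= 0;
    return (fclose(f) == 0 && ok) ? 0 : -1;
}
```

Typical use, as root: call append_names() on <tracefs_root>/set_ftrace_notrace with notrace_funcs, then disable_event() once for each of gk20a_channel_put, gk20a_channel_get and gk20a_channel_set_timeout.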

What to validate?

One-to-one matching (job:IRQ) - IRQs and synchronization via the wait command and fences

Does the GPU use page table? - yes

GPU state updates - unknown

Interrupt

Inter-channel IRQ Coalescing

IRQ coalescing: IRQ after job (cmd) submission, but not 1:1 matching

IRQ cascade: user-space keeps submitting jobs (writing command to FIFO buffer and hence the ringbuffer)

Inter-channel IRQ coalescing: possible for all the channels.

static irqreturn_t syncpt_thresh_cascade_isr(int irq, void *dev_id)
...
for_each_set_bit(id, &reg, 32) {
...
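
The loop in the excerpt above explains why IRQs and jobs do not match 1:1: one interrupt services every bit that is set in the status word, so several syncpoint thresholds, possibly belonging to different channels, can be handled in a single IRQ. Below is a user-space sketch of that iteration. It does not touch hardware. The function name and the `base` parameter (for a status that spans several 32-bit registers) are assumptions for illustration.

```c
#include <stdint.h>

/*
 * Walk the set bits of a 32-bit threshold-status word, lowest bit first,
 * and record the corresponding syncpoint ids in `ids`.
 * Returns how many ids were found (0..32).
 * User-space stand-in for the kernel's for_each_set_bit() loop.
 */
int collect_pending_syncpts(uint32_t status, unsigned int base,
                            unsigned int ids[32])
{
    int n = 0;
    for (unsigned int bit = 0; bit < 32; bit++) {
        if (status & (UINT32_C(1) << bit))
            ids[n++] = base + bit;
    }
    return n;
}
```

For example, a status word of 0x105 has bits 0, 2 and 8 set, so a single invocation reports three syncpoints.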

irq semantics

syncpt_thresh_cascade_isr

ioctl

cf: kernel include/uapi/linux/nvhost_ioctl.h

NVHOST_IOCTL_CHANNEL_SUBMIT_EXT (7) --- no longer exists in the newest driver?

h1x

clients?

can be inferred from:

// include/linux/host1x.h
enum host1x_class {
    HOST1X_CLASS_HOST1X = 0x1,
    HOST1X_CLASS_GR2D = 0x51,
    HOST1X_CLASS_GR2D_SB = 0x52,
    HOST1X_CLASS_VIC = 0x5D,
    HOST1X_CLASS_GR3D = 0x60,
};

poc1 poc2

enum host1x_class {
  HOST1X_CLASS_HOST1X = 0x1,
  HOST1X_CLASS_NVENC = 0x21,
  HOST1X_CLASS_VI = 0x30,
  HOST1X_CLASS_ISPA = 0x32,
  HOST1X_CLASS_ISPB = 0x34,
  HOST1X_CLASS_GR2D = 0x51,
  HOST1X_CLASS_GR2D_SB = 0x52,
  HOST1X_CLASS_VIC = 0x5D,
  HOST1X_CLASS_GR3D = 0x60,
  HOST1X_CLASS_NVJPG = 0xC0,
  HOST1X_CLASS_NVDEC = 0xF0,
};
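
When reading dumped command streams, it helps to turn a class ID back into a client name. Below is a small lookup sketch built only from the values in the enum above. The table and function names are hypothetical.

```c
#include <stddef.h>

struct host1x_class_name {
    unsigned int id;
    const char *name;
};

/* Class IDs copied from the longer enum host1x_class listed above. */
static const struct host1x_class_name class_names[] = {
    { 0x01, "HOST1X" },  { 0x21, "NVENC" }, { 0x30, "VI" },
    { 0x32, "ISPA" },    { 0x34, "ISPB" },  { 0x51, "GR2D" },
    { 0x52, "GR2D_SB" }, { 0x5D, "VIC" },   { 0x60, "GR3D" },
    { 0xC0, "NVJPG" },   { 0xF0, "NVDEC" },
};

/* Return the client name for a host1x class ID, or "UNKNOWN". */
const char *host1x_class_to_name(unsigned int id)
{
    for (size_t i = 0; i < sizeof(class_names) / sizeof(class_names[0]); i++) {
        if (class_names[i].id == id)
            return class_names[i].name;
    }
    return "UNKNOWN";
}
```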

job synchronization

good refs:

(these may be outdated. should refer to "grate" code)

http://http.download.nvidia.com/tegra-public-appnotes/host1x.html "host1x hardware description"

https://lists.freedesktop.org/archives/dri-devel/2012-December/031410.html "First version of host1x intro"

kernel doc: https://www.kernel.org/doc/html/latest/gpu/tegra.html

fence is for sync. two types: semaphores (hw?) and syncpoints (??)

syncpt has id and value ("incr_id, incr_value"?). also called "increment"?

"pre-fence" and "post fence". key function: nvgpu_submit_prepare_syncs
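
To make the id/value pair concrete, here is a sketch under the usual host1x model, which these notes do not confirm for this driver: a syncpoint is a 32-bit counter that only increments, a fence is an (id, threshold) pair, and a wait completes once the counter has reached the threshold. The comparison must survive counter wraparound. All names below are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* A fence: "syncpoint `id` has reached `thresh`". */
struct syncpt_fence {
    uint32_t id;
    uint32_t thresh;
};

/* Wrap-safe test: true when `value` is at or past `thresh`, treating the
 * unsigned difference as a signed distance (top bit set = not yet). */
bool syncpt_fence_expired(uint32_t value, uint32_t thresh)
{
    return ((value - thresh) & UINT32_C(0x80000000)) == 0;
}

/* Reserve `incrs` future increments for a job and return the threshold
 * that its post-fence should wait on. `max_value` tracks the highest
 * value already promised to earlier jobs. */
uint32_t syncpt_reserve_post_fence(uint32_t *max_value, uint32_t incrs)
{
    *max_value += incrs;
    return *max_value;
}
```

In this model a pre-fence is a threshold the job waits on before it starts, and a post-fence is the threshold that the job's own increments will eventually satisfy.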

Channels and Job Submission

number of jobs

Observation

Few job submissions come from kernel execution
~300 jobs for init and term
Commands are more fine-grained than the atom structure in Mali
Even a printf in the CUDA kernel generates jobs

# of jobs is nondeterministic

Default # of jobs: init + term = 303
cuMemAlloc does not affect # of submitted jobs
May affect the count when combined with MemCpy

Impact of input size

The number of submitted jobs increases with input size
Guess
Red box: MemAlloc/Cpy channel - same input size, same # of jobs
Blue box: GPU state transitions - may depend on execution time, as Mali's power register I/O does

Summary

Job execution

Context init and term generate ~300 jobs (deterministic)
Kernel execution submits few jobs, but the command stream is more fine-grained than the Mali atom structure

Memory

MemAlloc does not add jobs, but MemCpy does
A large MemCpy, together with MemAlloc, generates more jobs

Channels

Userspace creates a channel using:

#define NVGPU_GPU_IOCTL_OPEN_CHANNEL \
    _IOWR(NVGPU_GPU_IOCTL_MAGIC, 11, struct nvgpu_gpu_open_channel_args)

The code suggests the kernel allocates the channel ID:

struct nvgpu_gpu_open_channel_args {
    union {
        __s32 channel_fd; /* deprecated: use out.channel_fd instead */
        struct {
             /* runlist_id is the runlist for the
              * channel. Basically, the runlist specifies the target
              * engine(s) for which the channel is
              * opened. Runlist_id -1 is synonym for the primary
              * graphics runlist. */
            __s32 runlist_id;
        } in;
        struct {
            __s32 channel_fd;
        } out;
    };
};
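
Below is a sketch of how userspace might issue this ioctl. To stay self-contained it mirrors the argument layout locally (int32_t in place of __s32) and takes the request number as a parameter, so the caller passes NVGPU_GPU_IOCTL_OPEN_CHANNEL from the driver's uapi header. The names open_channel_args and open_gpu_channel are hypothetical.

```c
#include <stdint.h>
#include <sys/ioctl.h>

/* Local mirror of struct nvgpu_gpu_open_channel_args shown above. */
struct open_channel_args {
    union {
        int32_t channel_fd;            /* deprecated alias of out.channel_fd */
        struct { int32_t runlist_id; } in;
        struct { int32_t channel_fd; } out;
    };
};

/*
 * Ask the driver for a new channel on `runlist_id` (-1 selects the primary
 * graphics runlist, per the header comment). `ctrl_fd` is an open fd on the
 * GPU control node; `request` is the ioctl request number.
 * Returns the new channel fd, or -1 on failure (errno set by ioctl).
 */
int open_gpu_channel(int ctrl_fd, unsigned long request, int32_t runlist_id)
{
    struct open_channel_args args;

    args.in.runlist_id = runlist_id;   /* input and output share storage */
    if (ioctl(ctrl_fd, request, &args) < 0)
        return -1;
    return args.out.channel_fd;        /* the kernel overwrites the union */
}
```

Because in and out overlay each other, the caller never chooses a channel ID: it only names a runlist and receives an fd back, which matches the observation that the kernel does the allocation.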

When a channel is opened via ioctl, the kernel grabs an unused fd and creates a userspace file (?) waiting to be opened by the user. nvhost-dev-fdX...

static long nvhost_channelctl(struct file *filp,
    unsigned int cmd, unsigned long arg) {
...
    switch (cmd) {
    case NVHOST_IOCTL_CHANNEL_OPEN:
    {
    ...
    err = __nvhost_channelopen(NULL, priv->pdev, file);

Observation from log

Multiple channels are allocated for a single application running
Each channel has its own ringbuffer
ch 507 - for init? always 293 jobs
ch 503 - related to memory allocation. # of submissions grows with allocation size
ch 506 - the source of nondeterminism. # of submissions varies across runs - could be GPU state transitions (PM?), etc.
ch 504/505?

Note: channel ID is nondeterministic
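
Per-channel counts like the ones above can be tallied from the trace text. Below is a sketch that assumes each relevant line carries "channel <id>" in the same form as the ftrace output quoted in the Ftrace section. Which event marks a submission is not stated in these notes, so the event name is a parameter. MAX_CHANNELS and the function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define MAX_CHANNELS 512   /* assumed upper bound on channel ids */

/* Extract <id> from a line containing "channel <id>"; -1 if absent. */
int parse_channel_id(const char *line)
{
    const char *p = strstr(line, "channel ");
    int id;

    if (!p || sscanf(p, "channel %d", &id) != 1 || id < 0)
        return -1;
    return id;
}

/* If `line` mentions `event`, bump that channel's counter.
 * Returns the channel id, or -1 when the line was not counted. */
int tally_line(const char *line, const char *event,
               unsigned int counts[MAX_CHANNELS])
{
    int id;

    if (!strstr(line, event))
        return -1;
    id = parse_channel_id(line);
    if (id < 0 || id >= MAX_CHANNELS)
        return -1;
    counts[id]++;
    return id;
}
```

Since channel IDs change between runs, compare runs by the pattern of counts (for example, one channel that always shows the same count) rather than by ID.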

Delays

May make execution deterministic
A 50 us delay gives one-to-one job execution, but not always

Related project

https://github.com/grate-driver/grate/wiki/Grate-driver

Helpful code and easy to read. Be aware that it targets ancient NV devices, though; many of its IOCTLs are obsolete.
