Notes on Arm Mali

1. Why this page

Arm's Mali GPUs have rich docs and code, making them fun to hack/mod. This page documents what we learnt about Mali during our research. Hopefully it will help others.

Disclaimer: Many contents are from Arm's official documents and blogs. We do our best to attribute their origins. Much info is based on our own guess work & code reading.

2. Mali generations

Mali-400 (Allwinner A20/A64 boards)

Mali-450 (Utgard; HiKey)

Mali-T6x0 (Midgard; Juno, Firefly, Chromebook, Odroid XU3; Odroid XU4)

Mali-T7x0 (Midgard; Firefly, Tinkerboard, Chromebook)

Mali-T8x0 (Midgard; Firefly 2, Chromebook)

Mali-G71 (Bifrost Gen1; Hikey960)

Mali-G72 (Bifrost Gen2; Hikey970)

Mali-G31 (Bifrost Gen1; Odroid HC4; Odroid C4)

Mali-G52 (Bifrost Gen2; Odroid N2)

"MPx" -- there are X GPU cores

The roadmaps from Arm

3. COTS boards

boards	buy?	CPU	GPU	OPTEE?	OS Support?
Hikey	N arrow	Aarch64	Utgard	Y	Debian/AOSP
Hikey960	N arrow	Aarch64	Bifrost G71 MP8	Y	Debian; AOSP (as refboard)
Hikey970	N	Aarch64	Bifrost G72 MP12	N link	AOSP
Odroid Xu4	Y	A15	Midgard Mali T628 MP6	N	Debian Good
Firefly 2 (rk3288)	Y	A17	Midgard	Y	Debian? Unclear
Odroid C4/HC4	Y	Aarch64	Bifrost G31 MP2 (wikipedia) ~20Gflops/core?)
Odroid N2	Y	Aarch64	Bifrost G52 MP4 (~30Gflops/core?)

ARM's official stack:

Linux/Debian -- OpenGL ES; OpenCL; No OpenGL.

AOSP -- OpenGL (?), Vulkan, OpenCL(?)

ODROID C4

ODROID-HC4 (same CPU) Mali G31 (Bifrost Gen1) at 650Mhz

"Ultra efficient", low cost https://developer.arm.com/ip-products/graphics-and-multimedia/mali-gpus/mali-g31-gpu

ODROID N2

Mali G52 6EE?? (Bifrost Gen2), "Mainstream"

ODROID XU4

http://www.hardkernel.com/main/products/prdt_info.php?g_code=G143452239825&tab_idx=1 Mali T628 MP6 (OpenGL ES 3.0/2.0/1.1 and OpenCL 1.1 Full profile seems to have pretty good Linux desktop support http://ameridroid.com/t/xu4

odroidc4GPUperformance (credits: Odroid) Note: XU4 GPU performance still far exceeds Rpi4/G31. Woohoo!

Hikey970

ARM Mali-G72 MP12 GPU. MP12 -- 12 cores

AOSP support. Debian supports seems pretty sparse.

[Dec 2020] out of stock everywhere… . does not look promising

Hikey960

SoC: Kirin 960 [Mar 2020] Seems out of stock everywhere. Bad.

Mali G71 MP8, Bifrost arch. MP8 -- 8 cores

Build GPU kernel driver

Kernel tree 4.19.5 (hikey960-upstream-rebase) - custom 96boards hikey kernel on Debian

Video does not work; no fbdev and hdmi support, known issue

Vulkan only on AOSP https://www.96boards.org/blog/vulkan-hikey960/

Debian

Heejin's build instructions: https://docs.google.com/document/d/1nP2pqNe4sJMY_mg0hksEFs6OayGto4XJeKht09zTmRU/edit

4.19 kernel with a prebuilt image https://wiki.debian.org/InstallingDebianOn/96Boards/HiKey960 [status] Mali driver downloaded as source from Arm's official website. Built from source and insmod. Works for compute (OCL); no graphics support.

AOSP

Prebuilt AOSP image. Prebuilt 4.19 kernel. GFX working. OCL: unknown but presumed to be ok. Android kernel -- the Bifrost driver seems older than the official one? https://source.android.com/setup/build/devices

Hikey620

T450mp4. (Utgard) kernel support for mali integrated for AOSP (but not for debian -- probably not a primary goal of linaro)

most recent debian kernel does NOT integrate mali driver (devices/gpu/mali) https://github.com/Linaro/documentation/blob/master/Reference-Platform/Releases/RPB_16.06/ConsumerEdition/HiKey/BFSDebianRPB.md

Android’s support for GPU is pretty good https://source.android.com/source/devices#620hikey

4. Hardware internals

The building block is a "quad". A quad execution engine -- Arm's fundamental building block

G71: 4-wide SIMD per quad; 3x quads per core. 12 FMAs at the same time. These 3 quads are managed by a core's quad manager.

G76: 8-wide SIMD per quad, 3 quads per core.

Used to be a "narrow wavefront" (before G76)

Midgard: VLIW -- extracting 4-way ILP from each thread. Bifrost -- SIMD grouping 4 threads together

Anandtech's writeup is particularly useful:

In both the Mali-G71 and G72, a quad is just that: a 4-wide SIMD unit, with each lane possessing separate FMA and ADD/SF pipes. Fittingly, the width of a wavefront at the ISA-level for these parts has also been just 4 instructions, meaning all of the threads within a wavefront are issued in a single cycle. Overall, Bifrost’s use of a 4-wide design was a notably narrow choice relative to most other graphics architectures.

This is a very interesting change because, simply put, the size of a wavefront is typically a defining feature of an architecture. For long-lived architectures, especially in the PC space, wavefront sizes haven’t changed for years.

Bifrost

Arm's slides:

From Arm's blog (italic fonts added later)

Mali GPUs use an architecture in which instructions operate on multiple data elements simultaneously. The peak throughput depends on the hardware implementation of the Mali GPU type and configuration. Mali GPUs can contain many identical shader cores. Each shader core supports hundreds of concurrently executing threads.

Each shader core contains:

• One to three arithmetic pipelines or execution engines.

• One load-store pipeline.

• One texture pipeline.

OpenCL typically only uses the arithmetic pipelines or execution engines and the load-store pipelines. The texture pipeline is only used for reading image data types. In the execution engines in Mali Bifrost GPUs, scalar instructions are executed in parallel so the GPU operates on multiple data elements simultaneously. You are not required to vectorize your code to do this.

G77 (Valhalla)

No longer "Quad engine". much wider.

2x processing units per core

5. Resources

5.1 Projects

PAtrace -- useful tools for Mali GLES record & replay https://github.com/ARM-software/patrace

Panfrost https://xdc2018.x.org/slides/Panfrost-XDC_2018.pdf

5.2 Articles

good blog series. esp on Bifrost internals https://community.arm.com/developer/tools-software/graphics/b/blog/posts/the-mali-gpu-an-abstract-machine-part-4---the-bifrost-shader-core
The lima wiki provides lots of good info about Mali (may be outdated) http://limadriver.org/T6xx+ISA/

Mali hardware (including 400) https://limadriver.org/Hardware/

Lima driver arch (reverse enginerring, hw/sw interaction ... can be useful) https://people.freedesktop.org/~libv/FOSDEM2012_lima.pdf

T880 internals (good -- the GPU was used in mediatek x20) https://www.hotchips.org/wp-content/uploads/hc_archives/hc27/HC27.25-Tuesday-Epub/HC27.25.50-GPU-Epub/HC27.25.531-Mali-T880-Bratt-ARM-2015_08_23.pdf
G71, Bifrost. Good https://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.22-Monday-Epub/HC28.22.10-GPU-HPC-Epub/HC28.22.110-Bifrost-JemDavies-ARM-v04-9.pdf
“ARM Mali GPU Midgard Architecture”. Good slides on Midgard (2016. good) http://fileadmin.cs.lth.se/cs/Education/EDAN35/guestLectures/ARM-Mali.pdf https://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.22-Monday-Epub/HC28.22.10-GPU-HPC-Epub/HC28.22.110-Bifrost-JemDavies-ARM-v04-9.pdf
Mali’s shader core: unified, for vertex/fragment/compute, varying 1--16. shared L2: 32--64KB http://www.anandtech.com/show/8234/arms-mali-midgard-architecture-explored/7
Midgard GPU arch (overview) http://malideveloper.arm.com/downloads/ARM_Game_Developer_Days/PDFs/2-Mali-GPU-architecture-overview-and-tile-local-storage.pdf
ARM mali officially supported boards (old?) https://developer.arm.com/products/software/mali-drivers/user-space inc. T760MP4 http://en.t-firefly.com/en/firenow/firefly_rk3288/
Linux’s support for Mali https://wiki.debian.org/MaliGraphics
Blogs on Mali driver internals (in Chinese, verbose. somewhat useful) https://jizhuoran.gitbook.io/mali-gpu/mali-gpu-driver/she-bei-zhu-ce-gpu-register

6. The kernel driver

6.1 Acronyms

Kbdev -- kbase device. This corresponds to a GPU

Kctx -- a GPU context (?)

GP -- geometry processor

PP -- pixel processor

group -- A render group, i.e. all core sharing the same Mali MMU. see struct mali_group

kbase -- the kernel driver instance for midgard

TLstream -- timeline stream (for trace record)

js - job slot. As exposed by GPUs

jc - job chain.

Jd - job dispatcher (in the driver)

AS - address space (for GPU)

LPU -- Logical Processing Unit. For timeline display only (?)

6.2 Overview

imgs/archoverview.png

6.3 Job chains

A job chain: a binary blob of GPU executable and its metadata. Wrapped in atom structure. Bet u/k, "struct base_jd_atom_v2" is used, see mali_base_kernel.h. bet k/hw "struct kbase_jd_atom" is used, check this in "mali_kbase_defs.h".

kbase_jd_atom appears to be kernel's internal d/s. not shared with hw. Just to separate from the u/k interface (atom_v2), easy to change.

(base_jd_atom_v2.jc / < job-chain GPU address /)

The former one passed by user and the latter is the driver's internal abstraction. Both have jc in their structure which points GPU kernel instructions I guess.

One can start from kbase_api_job_submit() in "mali_kbase_core_linux.c" which invoked when the job submitted by user space.

As both structure jc, the gpu instructions might be not different from each other. Both are baremetal. When compiling OCL kernel, we need to specify GPU model or the CL runtime checks the available GPU in the system.

6.4 The register interface

6.4.1 Overview

Primarily, there are 3 types of registers (CPU Control, Job Control, MMU Management). GPU_CTRL: GPU Control JOB_CTRL: JOB Control MMU_MGMT: MMU Control

Note each type may have multiple instances. e.g. we may have a one GPU command register, 3 JS command registers, 8 AS command registers. 3 JS (job slot) means you have three types of jobs e.g. shader, tiler, ... 8 AS are for multiple address spaces used by GPU, basically # of page tables the GPU can hold. Normally we only use a single page table since we do not generate multiple contexts.

Three types of commands can be written to their corresponding reg types: GPU_COMAMND: GPU-related (e.g. soft reset, performance counter sample, etc.) JS_COMMAND: Job-related (e.g. start or stop processing a job chain, etc.) AS_COMMAND: MMU-related (e.g. MMU lock, broadcast, etc.)

jsandtrace

From Arm's blog on command execution:

"The workload in each queue is broken into smaller pieces and dynamically distributed across all of the shader cores in the GPU, or in the case of tiling workloads to a fixed function tiling unit. Workloads from both queues can be processed by a shader core at the same time; for example, vertex processing and fragment processing for different render targets can be running in parallel"

6.4.2 Code

Mali_kbas_device_hw_.c

kbase_reg_write()
kbase_reg_read()

Reg definition: Mali_kbase_gpu_regmap.h

6.4.3 Reg map

regmap

6.5 Job slots

This is the CPU/GPU interface.

t880jobmgr

A job slot pertains to a job type. e.g. SLOT1 is for Tiling/Vertex/Compute. Slot 1 is default one.

~~Even the test application seems not to have any tiling/vertex, the atom comes from user runtime is not marked BASE_JD_REQ_ONLY_COMPUTE.~~

Kernel code below:

UPDATE: if job requires compute shader, there seems two types of request types

BASE_JD_REQ_CS - guess used by OpenGL compute?
BASE_JD_REQ_ONLY_COMPUTE - this is what we've got from OpenCL compute kernel

In the test application (OpenCL), the atom comes from user run time indeed has BASE_JD_REQ_ONLY_COMPUTE req type. However, the device_nr and num_core_groups are not matched with the const numbers in the condition and thus return SLOT1.

BASE_JD_REQ_ONLY_COMPUTE device_nr: 0 num_core_groups: 1

6.5.1 The ringbuffer inside a job slot

GPU is aware of the "current" and "next" jobs. A ringbuffer (rb) seems managed by driver. The ringbuffer seems to track the "current" and "next" jobs on GPU. A ringbuffer can hold 2 jobs max. The driver puts jobs it intends to submit next, or already submitted, in the ringbuffer.

Device driver manages jobs (atoms) by putting them into ring buffer (atoms and ring buffers managed by device driver).

Basically, device driver pulls runnable atom from queue and writes it to hardware job slot register, then kick GPU.

So ringbuffer is managed by device driver and a next atom to run is determined by device driver. It seems GPU only cares about atom written to job slots (check the next job slot and command, then do the things based on command).

The driver checks out atom state first and then put it into job_slot register.

Code snippet

rbstate

A ring buffer: two items. For current and next jobs.

gpuinspect

regs for "next" job are R/W (==> the driver can modify them since the job is not kicked yet)
regs for "current" job are R/O (==> the job is being executed on GPU. the driver cannot do anything about the job)

jobslotdef

6.5.2 Ringbuffer slot state

This is well commented

slotstate

More evidence from the NoMali project (see jobslot.hh)

6.6 Memory management

6.6.1 Memory group manager

On hikey960, this is in mali_kbase_native_mgm.[c|h]

* kbase_native_mgm_dev - Native memory group manager device
 * An implementation of the memory group manager interface that is intended for
 * internal use when no platform-specific memory group manager is available.
 * It ignores the specified group ID and delegates to the kernel's physical
 * memory allocation and freeing functions.

Note this code is generic (drivers/base), not quite Mali specific. In odroid-xu4 v4.15, there's not such a component (??) Absent in Hikey960 android kernel (4.19)

Seems the "root" of mm allocation.

memory_group_manager.h struct memory_group_manager_device - Device structure for a memory group Since it is a "device structure"? Caan there be such hardware? It defines a bunch of memory_group_manager_ops to be implemented… plat specific?

Kbdev->mgm_dev

6.6.2 Memory pool vs. memory group manager

One pool may contain pages from a specific memory group

Group id?

A physical memory group ID. The meaning of this is defined by the systems integrator (???)

A memory group ID to be passed to a platform-specific memory group manager

Pool -- a frontend of memory allocation
Memory group manager -- the backend; platform specific. May or may not present (if not, just a sw impl?)

cf: drivers/memory_group_manager/XXX

"An example mem group manager" This seems a sw-only implementation.

From kernel configuration: "This option enables an example implementation of a memory group manager for allocation and release of pages for memory pools managed by Mali GPU device drivers."

The example instance will be init'd during "device probe time". Because it emulates a hw device (??)

There's also "physical mem group manager"? See the devicetree binding memory_group_manager.txt

6.7 Tracing

mali_kbase_tracepoints.[ch] useful. note they are auto-gen by some python code, missing in the kernel tree.

There’s also debugfs support (MALI_MIDGARD_ENABLE_TRACE)

6.7.1 Mali_kbase_tracepoints.c

These seem all streamline events. Where do they go? via u/k buffer to Streamline?

/* Message ids of trace events that are recorded in the timeline stream. */
enum tl_msg_id_obj {
    KBASE_TL_NEW_CTX,
    KBASE_TL_NEW_GPU,


#define OBJ_TL_LIST \
    TP_DESC(KBASE_TL_NEW_CTX, \
        "object ctx is created", \
        "@pII", \
        "ctx,ctx_nr,tgid") \

6.7.2 Timeline stream abstraction

See very well documented

mali_kbase_tlstream.h

Who will consume the stream?

6.7.3 Timeline infrastructure

Mali_kbase_timeline.c

mali_kbase_timeline.h

kbase_timeline_streams_flush(). To be invoked by user? It appears timeline shares a buffer with userspace? Flush to userspace? How is this done?

KBASE_AUX_PAGESALLOC

Invoked upon phys page alloc/free…. Indicating changes of pages

6.7.4 Streams

Two types of "streams". AUX and OBJ. Seems: _obj is for obj operation, like creating context, MMU. _aux is for other events, like PM, page alloc, etc.

streamsflush

6.7.5 Sample traces dumped from the driver

sampletrace

6.8 GPU virtual memory

6.8.1 Top level funcs:

Mali_base_mmu.h // well written

6.8.2 MMU

GPU MMU is programmed by CPU.

Looks like a 4 level pgtable?? (this is for Bifrost.. They didn’t change the macro names) #define MIDGARD_MMU_TOPLEVEL MIDGARD_MMU_LEVEL(0) #define MIDGARD_MMU_BOTTOMLEVEL MIDGARD_MMU_LEVEL(3) See kbase_mmu_teardown_pages()

For Midgard, 2 levels.

Update hw mmu entries??

kbase_mmu_hw_configure, called by mmu_update(), by kbase_mmu_update

Reg defs

see mali_kbase_gpu_regmap.hMax 16 addr spaces

D/S

struct kbase_mmu_table - object representing a set of GPU page tables. @pgd -- pgtable root (hw address)

struct kbase_mmu_mode - object containing pointer to methods invoked for programming the MMU, as per the MMU mode supported by Hw.

struct kbase_as - object representing an address space of GPU.

One kbase_device has many (up to 16) kbase_as struct kbase_as as[BASE_MAX_NR_AS];

Each @kbase_context has a @as_nr, which seems to point to the as.

Code

drivers\gpu\arm\midgard\mmu

High level MMU function kbase_mmu_flush_invalidate_noretain()

(..goes into…) The actual hw function operating MMU: Mali_kbase_mmu_hw_direct.c --> kbase_mmu_hw_do_operation(), etc.

Pgtable

Allocate page table for GPU kbase_mmu_alloc_pgd()

Kernel threads

Workqueue mali_mmu page_fault_worker bus_fault_worker

Address space

This is like "segment". One addr space seems a tree of pgtables. The actual count of addr spaces…. Supported by MMU hardware

 * @nr_hw_address_spaces:  Number of address spaces actually available in the
                          GPU, remains constant after driver initialisation.
 * @nr_user_address_spaces: Number of address spaces available to user contexts

// 8 for G71

See kbase_device_as_init()

// desc for one addr space… 
struct kbase_mmu_setup {
    u64 transtab; // translation table? See AS_TRANSTAB_BASE_MASK
    u64 memattr;
    u64 transcfg; // translation cfg? See AS_TRANSCFG_ADRMODE_AARCH64_4K etc.
};

Configuration kbase_mmu_get_as_setup()

mmap path

Right, at that time that only allocate page for GPU pgtable only.

Think mapping happen when mmap() is called, which invokes kbase_mmap() and t thus kbase_context_mmap(). There is a structure called "kbase_va_region" which contains some mapping related information.

Basically, pages are allocated in GPU side and then CPU calls kbase_mmap to get the allocated pages from GPU side. Look at kbase_reg_mmap() and kbase_gpu_mmap(). It tries to let GPU use same VA of CPU and maps the VA to PA updating MMU.

7. Address space

In this secsion, we are tyring to understand GPU address spaces used by Mali Bifrost. To this end, we dumped all the address spaces while running OpenCL vector addition.

7.1 AS information from device driver

The address space are allocated by device driver but controlled by user-space runtime. For instance, the device driver gets the FOPS and IOCTL command from user-space runtime as follows.

KBASE_FOPS_GET_UNMAPPED_AREA: gets the CPU address space esed by both user-space runtime and device driver
KBASE_FOPS_MMAP: maps AS to GPU and setup the VMOPS (e.g. open, close, and fault)
KBASE_IOCTL_MEM_ALLOC: allocates the actual physical memory (pages)

[AS:0 type: 0] start: ffff9428e000, end: ffff942ae000, flags: 606a, valid: 0, nr_pages: 32, nents: 32
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_RD | GPU_NX | GPU_CACHED | 
[AS:1 type: 0] start: ffff9c03c000, end: ffff9c03d000, flags: 606e, valid: 1, nr_pages: 1, nents: 1
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_WR | GPU_RD | GPU_NX | GPU_CACHED | 
[AS:2 type: 0] start: ffff9c03b000, end: ffff9c03c000, flags: 606e, valid: 1, nr_pages: 1, nents: 1
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_WR | GPU_RD | GPU_NX | GPU_CACHED | 
[AS:3 type: 0] start: ffff9c03a000, end: ffff9c03b000, flags: 606e, valid: 1, nr_pages: 1, nents: 1
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_WR | GPU_RD | GPU_NX | GPU_CACHED | 
[AS:4 type: 0] start: ffff9c039000, end: ffff9c03a000, flags: 206a, valid: 1, nr_pages: 1, nents: 1
ZONE_SAME_VA | CPU_WR | GPU_RD | GPU_NX | GPU_CACHED | 
[AS:5 type: 2] start: ffff93800000, end: ffff93a01000, flags: 627e, valid: 1, nr_pages: 513, nents: 513
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_WR | GPU_RD | GPU_NX | CPU_CACHED | GPU_CACHED | 
[AS:6 type: 2] start: ffff93a4d000, end: ffff93a8d000, flags: 627e, valid: 1, nr_pages: 64, nents: 64
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_WR | GPU_RD | GPU_NX | CPU_CACHED | GPU_CACHED | 
[AS:7 type: 2] start: ffff93a0d000, end: ffff93a4d000, flags: 627e, valid: 1, nr_pages: 64, nents: 64
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_WR | GPU_RD | GPU_NX | CPU_CACHED | GPU_CACHED | 
[AS:8 type: 1] start: ffff92000000, end: ffff93000000, flags: 6062, valid: 1, nr_pages: 4096, nents: 1
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_RD | GPU_CACHED | 
[AS:9 type: 4] start: ffff9c038000, end: ffff9c039000, flags: 606e, valid: 1, nr_pages: 1, nents: 1
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_WR | GPU_RD | GPU_NX | GPU_CACHED |

AS spaces used by vector addition (using Hikey960)

7.2 How many address spaces?

The # of AS depends on the GPU application (e.g. how many GPU kernels are loaded or size of in/output and intermediate data). However, all the application has at least 5 ASs from AS0 to AS5 by default (pre-reserved). The purpose of them are unknown. If the in/output sizes are quite small, AS5 sometimes stores them.

7.3 Types of GPU address space

Besides the unknown, we could guess some types of ASs.

AS for storing data: AS6 and 7 are used for in/output data
AS for storing GPU binary: AS8 is used for GPU binary where GPU_NX (non-executable) flag is not set. Although 4096 pages are reserved (not committed), the device driver incrementally allocates memory when a new GPU kernel is compiled and thus loaded. See KBASE_IOCTL_MEM_COMMIT
AS for jobchain: AS9 is used for jobchain. While # of jobchain ASs is dynamic, its size is always limited to a single page. To be specific, when you have more GPU kernels and jobs, the device driver allocates more ASs of single page to store jobchains.

7.4 OpenCL

AS of which flag is 606e and nr_pages = 1 indicate it is used for job chain.

7.5 OpenGL

OpenGL has different AS layout, seems reserving some pages for job chain.

AS of which flag is 607eand nr_pages = 8 indicate it is used for job chain.

[AS:0 type: 0] start: ffff81bd7000, end: ffff81bf7000, flags: 606a, valid: 0, nr_pages: 32, nents: 32
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_RD | GPU_NX | GPU_CACHED | 
[AS:1 type: 0] start: ffff82cfc000, end: ffff82cfd000, flags: 606e, valid: 1, nr_pages: 1, nents: 1
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_WR | GPU_RD | GPU_NX | GPU_CACHED | 
[AS:2 type: 0] start: ffff82cdf000, end: ffff82ce0000, flags: 606e, valid: 1, nr_pages: 1, nents: 1
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_WR | GPU_RD | GPU_NX | GPU_CACHED | 
[AS:3 type: 0] start: ffff82cde000, end: ffff82cdf000, flags: 606e, valid: 1, nr_pages: 1, nents: 1
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_WR | GPU_RD | GPU_NX | GPU_CACHED | 
[AS:4 type: 0] start: ffff82cdd000, end: ffff82cde000, flags: 206a, valid: 1, nr_pages: 1, nents: 1
ZONE_SAME_VA | CPU_WR | GPU_RD | GPU_NX | GPU_CACHED | 
[AS:5 type: 2] start: ffff81b97000, end: ffff81bd7000, flags: 607e, valid: 0, nr_pages: 64, nents: 64
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_WR | GPU_RD | GPU_NX | CPU_CACHED | GPU_CACHED | 
[AS:6 type: 0] start: ffff81b57000, end: ffff81b97000, flags: 606e, valid: 1, nr_pages: 64, nents: 64
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_WR | GPU_RD | GPU_NX | GPU_CACHED | 
[AS:7 type: 0] start: ffff81b4a000, end: ffff81b57000, flags: 206c, valid: 0, nr_pages: 13, nents: 13
ZONE_SAME_VA | GPU_WR | GPU_RD | GPU_NX | GPU_CACHED | 
[AS:8 type: 1] start: ffff77000000, end: ffff78000000, flags: 6062, valid: 1, nr_pages: 4096, nents: 1
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_RD | GPU_CACHED | 
[AS:9 type: 4] start: ffff82cdc000, end: ffff82cdd000, flags: 606a, valid: 1, nr_pages: 1, nents: 1
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_RD | GPU_NX | GPU_CACHED | 
[AS:10 type: 0] start: ffff81b0a000, end: ffff81b4a000, flags: 626e, valid: 1, nr_pages: 64, nents: 64
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_WR | GPU_RD | GPU_NX | GPU_CACHED | 
[AS:11 type: 0] start: ffff81aca000, end: ffff81b0a000, flags: 606e, valid: 1, nr_pages: 64, nents: 64
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_WR | GPU_RD | GPU_NX | GPU_CACHED | 
[AS:12 type: 2] start: ffff82cd4000, end: ffff82cdc000, flags: 607e, valid: 1, nr_pages: 8, nents: 8
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_WR | GPU_RD | GPU_NX | CPU_CACHED | GPU_CACHED | 
[AS:13 type: 2] start: ffff81ac2000, end: ffff81aca000, flags: 627e, valid: 1, nr_pages: 8, nents: 8
ZONE_SAME_VA | CPU_WR | CPU_RD | GPU_WR | GPU_RD | GPU_NX | CPU_CACHED | GPU_CACHED | 
[tgx_dump_sync_as] total count of sync_as: 2

The dumped one is from when running gc_absdiff and both AS12 and 13 seem to be used for job chains. Here is the content of job chain:

ffff82cd4000 |  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                09 00 02 00 01 00 00 00  00 00 00 00 00 00 00 00
                B8 38 00 00 00 00 86 23  00 00 00 1C 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                02 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  C0 40 CD 82 FF FF 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 20 AC 81 FF FF 00 00  C0 CF CD 82 FF FF 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                40 42 CD 82 FF FF 00 00  00 00 00 00 00 00 00 00
                07 D0 40 CD 82 FF FF 00  05 00 A0 AC 81 FF FF 00
                80 34 B2 81 FF FF 00 00  8A 32 00 00 00 00 00 00
                40 6A B1 81 FF FF 00 00  8A 32 00 00 00 00 00 00
                00 A0 B0 81 FF FF 00 00  8A 32 00 00 00 00 00 00
                00 20 AC 81 FF FF 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                09 00 01 00 00 00 00 00  00 40 CD 82 FF FF 00 00
                00 00 00 00 00 00 40 00  00 00 00 14 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                02 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  C0 40 CD 82 FF FF 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 20 AC 81 FF FF 00 00  80 CF CD 82 FF FF 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                40 42 CD 82 FF FF 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 1F 00 00 00  00 A0 B4 81 FF FF 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00

AS 12

ffff81ac2000 |  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 20 AC 81 FF FF 00 00  40 6A B1 81 FF FF 00 00
                00 A0 B0 81 FF FF 00 00  80 34 B2 81 FF FF 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00

AS 13

8. Misc

8.1 # of shader cores (regarding MP)

You can check # of share cores via clinfo, or check out the raw GPU stat from device driver.

void kbase_gpuprops_set(struct kbase_device *kbdev)
{
    struct kbase_gpu_props *gpu_props;
    struct gpu_raw_gpu_props *raw;

    KBASE_DEBUG_ASSERT(NULL != kbdev);
    gpu_props = &kbdev->gpu_props;
    raw = &gpu_props->props.raw_props;

    /* Initialize the base_gpu_props structure from the hardware */
    kbase_gpuprops_get_props(&gpu_props->props, kbdev);

    /* Populate the derived properties */
    kbase_gpuprops_calculate_props(&gpu_props->props, kbdev);

    /* Populate kbase-only fields */
    gpu_props->l2_props.associativity = KBASE_UBFX32(raw->l2_features, 8U, 8);
    gpu_props->l2_props.external_bus_width = KBASE_UBFX32(raw->l2_features, 24U, 8);

    gpu_props->mem.core_group = KBASE_UBFX32(raw->mem_features, 0U, 1);

    gpu_props->mmu.va_bits = KBASE_UBFX32(raw->mmu_features, 0U, 8);
    gpu_props->mmu.pa_bits = KBASE_UBFX32(raw->mmu_features, 8U, 8);

    gpu_props->num_cores = hweight64(raw->shader_present);
    gpu_props->num_core_groups = hweight64(raw->l2_present);
    gpu_props->num_address_spaces = hweight32(raw->as_present);
    gpu_props->num_job_slots = hweight32(raw->js_present);
}

NP means multiple processing and hence represents # of cores the GPU is quipped with (e.g. Bifrost G71-mp8 has 8 shader cores)
However, there are some odd products (e.g. G31). See the answer from Arm