nvc0 compute support is now fixed

If you try to monitor MP performance counters through the HUD on nvc0, you should get the following error message:

gallium_hud: all queries are busy after 8 frames, can’t add another query.

This message appears when the compute kernel is never synchronized, i.e. when it does not run correctly.

Now, if you take a quick look at the kernel error messages, you should see more details:

DATA_ERROR [INVALID_VALUE] ch 4 [0x000027f839 glxgears[11550]] subc 1 class 0xa0c0 mthd 0x02e8 data 0x0040cccc

In fact, according to rnndb, the data must be aligned to 0x8000 on nvc0.
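
To make the alignment problem concrete, here is a minimal sketch that pulls the fields out of the error line above and checks the value against 0x8000. It assumes that the alignment requirement applies to the `data` field of the message. The helper names (`parse_data_error`, `is_aligned`, `align_up`) are made up for this illustration, and rounding up is only one possible way to get an aligned value, not necessarily what the patch does.

```python
import re

# Kernel error line copied from the post.
LOG_LINE = ("DATA_ERROR [INVALID_VALUE] ch 4 [0x000027f839 glxgears[11550]] "
            "subc 1 class 0xa0c0 mthd 0x02e8 data 0x0040cccc")

ALIGNMENT = 0x8000  # required alignment on nvc0, as stated in the post


def parse_data_error(line):
    """Return (class, method, data) as integers (hypothetical helper)."""
    match = re.search(
        r"class (0x[0-9a-f]+) mthd (0x[0-9a-f]+) data (0x[0-9a-f]+)", line)
    if match is None:
        raise ValueError("line does not look like the DATA_ERROR message")
    return tuple(int(field, 16) for field in match.groups())


def is_aligned(value, alignment=ALIGNMENT):
    """True when value is a multiple of alignment."""
    return value % alignment == 0


def align_up(value, alignment=ALIGNMENT):
    """Round value up to the next multiple of alignment (assumed fix)."""
    return (value + alignment - 1) // alignment * alignment


klass, method, data = parse_data_error(LOG_LINE)
print(f"mthd {method:#06x} data {data:#010x} aligned: {is_aligned(data)}")
print(f"next aligned value: {align_up(data):#010x}")
```

With the line from the post, the check reports that 0x0040cccc is not a multiple of 0x8000, which is consistent with the INVALID_VALUE error.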

A three-line patch fixes compute support on nvc0.

How to decode the pushbuffer using valgrind-mmt and dedma?

In some cases, information is not exposed through MMIO registers and the blob uses FIFO methods instead. In particular, the blob uses FIFO methods to enable MP counters. Let me explain how to decode them.

In this example, I use the NVC1 chipset, and I want to decode the pushbuffer used by the NVC0_COMPUTE class (0x000090c0).

First, trace a signal using cupti_trace:

$ cupti_trace --trace NVC1 --event active_cycles

Next, grep the trace for the FIFO object class id 0x000090c0:

$ grep 0x000090c0 active_cycles.trace
--6903-- out2 0x00000004 0x00000002 0x00000003 0x0000003d 0x0000003e 0x0000003f 0x00000040 0x00009197 0x000090b8 0x00000073 0x00005080 0x00009072 0x00009074 0x0000844c 0x000090dd 0x000090b2 0x000090b1 0x00008570 0x0000857a 0x0000857b 0x0000857c 0x0000857d 0x0000857e 0x0000007d 0x00009068 0x0000907f 0x0000906f 0x0000902d 0x00009097 0x000090c0 0x00009039 0x000090e0 0x000090e6 0x000090e2 0x000090e3 0x000050a0 0x00009096 0x000090e1 0x000090b3 0x000090b5 0x0000208a 0x000085b6 0x00009067 0x000090f1 0x0000503b 0x0000503c 0x00000075 
--6903-- out2 0x00000004 0x00000002 0x00000003 0x0000003d 0x0000003e 0x0000003f 0x00000040 0x00009197 0x000090b8 0x00000073 0x00005080 0x00009072 0x00009074 0x0000844c 0x000090dd 0x000090b2 0x000090b1 0x00008570 0x0000857a 0x0000857b 0x0000857c 0x0000857d 0x0000857e 0x0000007d 0x00009068 0x0000907f 0x0000906f 0x0000902d 0x00009097 0x000090c0 0x00009039 0x000090e0 0x000090e6 0x000090e2 0x000090e3 0x000050a0 0x00009096 0x000090e1 0x000090b3 0x000090b5 0x0000208a 0x000085b6 0x00009067 0x000090f1 0x0000503b 0x0000503c 0x00000075 
--6903-- pre_ioctl: fd:3, id:0x2b (full:0xc020462b), data: 0xc1d00511 0x5c0000c9 0x5c0000ca 0x000090c0 0x00000000 0x00000000 0x00000000 0x00000000 
--6903-- post_ioctl: fd:3, id:0x2b (full:0xc020462b), data: 0xc1d00511 0x5c0000c9 0x5c0000ca 0x000090c0 0x00000000 0x00000000 0x00000000 0x00000000 
--6903-- out 0x5c0000ca 0x000090c0 0x000090c0 0x00000001 
--6903-- w 2:0x2004, 0x000090c0 
--6903-- w 11:0x24300, 0x000090c3,0x000090c2,0x000090c1,0x000090c0 
--6903-- w 9:0x24300, 0x000090c3,0x000090c2,0x000090c1,0x000090c0 
--6903-- r 10:0x12180, 0x000090c6,0x000090c4,0x000090c2,0x000090c0 
--6903-- pre_ioctl: fd:3, id:0x2b (full:0xc020462b), data: 0xc1d00511 0x5c0000ec 0x5c0000ed 0x000090c0 0x00000000 0x00000000 0x00000000 0x00000000 
--6903-- post_ioctl: fd:3, id:0x2b (full:0xc020462b), data: 0xc1d00511 0x5c0000ec 0x5c0000ed 0x000090c0 0x00000000 0x00000000 0x00000000 0x00000000 
--6903-- out 0x5c0000ed 0x000090c0 0x000090c0 0x00000001 
--6903-- w 15:0x2004, 0x000090c0

The following line contains the map id, which is 2 in this example:

--6903-- w 2:0x2004, 0x000090c0
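
If you prefer not to read the grep output by eye, the search can be scripted. The sketch below is only a heuristic of mine, not part of valgrind-mmt or dedma: it assumes that the interesting lines are writes of the form `w <map id>:<offset>, <value>` whose single value is the class id. The regular expression and the `find_map_ids` helper are hypothetical names for this illustration.

```python
import re

COMPUTE_CLASS = 0x000090c0  # NVC0_COMPUTE class id from the post

# A few lines copied from the grep output above.
TRACE_LINES = [
    "--6903-- out 0x5c0000ca 0x000090c0 0x000090c0 0x00000001 ",
    "--6903-- w 2:0x2004, 0x000090c0 ",
    "--6903-- w 11:0x24300, 0x000090c3,0x000090c2,0x000090c1,0x000090c0 ",
    "--6903-- r 10:0x12180, 0x000090c6,0x000090c4,0x000090c2,0x000090c0 ",
    "--6903-- w 15:0x2004, 0x000090c0",
]

# "w <map id>:<offset>, <value>[,<value>...]" as it appears in the trace.
WRITE_RE = re.compile(r"^--\d+-- w (\d+):(0x[0-9a-f]+), ([0-9a-fx,]+)\s*$")


def find_map_ids(lines, class_id=COMPUTE_CLASS):
    """Return the map ids of writes whose only value is class_id."""
    ids = []
    for line in lines:
        match = WRITE_RE.match(line)
        if match is None:
            continue  # not a write line
        values = [int(value, 16) for value in match.group(3).split(",")]
        if values == [class_id]:
            ids.append(int(match.group(1)))
    return ids


print(find_map_ids(TRACE_LINES))
```

Note that the grep output contains a similar write for map 15 as well, so the script reports both candidates. Picking the right one (2 in this example) is still up to you.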

Now use dedma, which decodes the pushbuffer using rnndb (the output is truncated here):

$ dedma -m c0 -v 2 active_cycles.trace > active_cycles.dedma
20014000  size 1, subchannel 2 (0x0), offset 0x0000, increment
000090c0    NVC0_COMPUTE mapped to subchannel 2
20014040  size 1, subchannel 2 (0x90c0), offset 0x0100, increment
00000000    NVC0_COMPUTE.GRAPH.NOP = 0
200141d6  size 1, subchannel 2 (0x90c0), offset 0x0758, increment
00000002    NVC0_COMPUTE.MP_LIMIT = 0x2
200141e4  size 1, subchannel 2 (0x90c0), offset 0x0790, increment
00000000    NVC0_COMPUTE.TEMP_ADDRESS_HIGH = 0
200141e5  size 1, subchannel 2 (0x90c0), offset 0x0794, increment
10000000    NVC0_COMPUTE.TEMP_ADDRESS_LOW = 0x10000000
200141e6  size 1, subchannel 2 (0x90c0), offset 0x0798, increment
00000000    NVC0_COMPUTE.TEMP_SIZE_HIGH = 0
200141e7  size 1, subchannel 2 (0x90c0), offset 0x079c, increment
00700000    NVC0_COMPUTE.TEMP_SIZE_LOW = 0x700000
200141e8  size 1, subchannel 2 (0x90c0), offset 0x07a0, increment
00012600    NVC0_COMPUTE.WARP_TEMP_ALLOC = 0x12600
200141df  size 1, subchannel 2 (0x90c0), offset 0x077c, increment
03000000    NVC0_COMPUTE.LOCAL_BASE = 0x3000000
20014081  size 1, subchannel 2 (0x90c0), offset 0x0204, increment
000000f0    NVC0_COMPUTE.LOCAL_POS_ALLOC = 0xf0
20014082  size 1, subchannel 2 (0x90c0), offset 0x0208, increment
000007c0    NVC0_COMPUTE.LOCAL_NEG_ALLOC = 0x7c0
20014083  size 1, subchannel 2 (0x90c0), offset 0x020c, increment
00001000    NVC0_COMPUTE.WARP_CSTACK_SIZE = 0x1000
20014359  size 1, subchannel 2 (0x90c0), offset 0x0d64, increment
0000000f    NVC0_COMPUTE.CALL_LIMIT_LOG = 0xf
200140c2  size 1, subchannel 2 (0x90c0), offset 0x0308, increment
00000003    NVC0_COMPUTE.CACHE_SPLIT = 48K_SHARED_16K_L1
20014085  size 1, subchannel 2 (0x90c0), offset 0x0214, increment
01000000    NVC0_COMPUTE.SHARED_BASE = 0x1000000
20014093  size 1, subchannel 2 (0x90c0), offset 0x024c, increment
00000000    NVC0_COMPUTE.SHARED_SIZE = 0
200140a8  size 1, subchannel 2 (0x90c0), offset 0x02a0, increment
00008000    NVC0_COMPUTE.UNK02A0 = 0x8000
2001408e  size 1, subchannel 2 (0x90c0), offset 0x0238, increment
00010001    NVC0_COMPUTE.GRIDDIM_YX = { X = 1 | Y = 1 }
2001408f  size 1, subchannel 2 (0x90c0), offset 0x023c, increment
00000001    NVC0_COMPUTE.GRIDDIM_Z = 1
200140eb  size 1, subchannel 2 (0x90c0), offset 0x03ac, increment
00010001    NVC0_COMPUTE.BLOCKDIM_YX = { X = 1 | Y = 1 }
200140ec  size 1, subchannel 2 (0x90c0), offset 0x03b0, increment
00000001    NVC0_COMPUTE.BLOCKDIM_Z = 1
200140b1  size 1, subchannel 2 (0x90c0), offset 0x02c4, increment
00000000    NVC0_COMPUTE.UNK02C4 = FALSE
...
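
To get a feeling for what dedma does with each header word, here is a small sketch that splits a header into its fields. The bit layout is inferred from the dedma output above (it reproduces the size, subchannel and offset printed for each header), not taken from a specification, so treat it as an assumption. `decode_header` is a hypothetical helper, not a dedma function.

```python
def decode_header(word):
    """Split a 32-bit pushbuffer header into its fields.

    Assumed layout, inferred from the dedma output:
      bits 31-29  mode (1 for the "increment" headers shown)
      bits 28-16  size, the number of data words that follow
      bits 15-13  subchannel
      bits 12-0   method offset divided by 4
    """
    return {
        "mode": (word >> 29) & 0x7,
        "size": (word >> 16) & 0x1fff,
        "subchannel": (word >> 13) & 0x7,
        "offset": (word & 0x1fff) << 2,
    }


# Headers copied from the dedma output above.
for header in (0x20014040, 0x200141d6, 0x200140c2):
    fields = decode_header(header)
    print(f"{header:08x}  size {fields['size']}, "
          f"subchannel {fields['subchannel']}, "
          f"offset {fields['offset']:#06x}")
```

For these three headers the sketch gives offsets 0x0100, 0x0758 and 0x0308 on subchannel 2, which matches the NOP, MP_LIMIT and CACHE_SPLIT lines printed by dedma.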

However, dedma fails to parse the pushbuffer when the blob takes method data from a different buffer, so you have to decode that part by hand, which is pretty easy. You just have to find the data that follows 0x20014cef. In this example, I found 0xaaaa0, which is the value of MP_PM_OP (https://github.com/pathscale/envytools/blob/master/rnndb/nvc0_compute.xml#L252).
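
The manual search can be sketched as follows, assuming you already have the buffer contents as a list of 32-bit words (how you extract them from the trace is not covered here). The `data_after` helper and the sample buffer are made up for this illustration: only 0x20014cef and 0xaaaa0 come from the example above, and the other words are borrowed from the dedma output as filler.

```python
HEADER = 0x20014cef  # the word to look for, from the post


def data_after(words, header=HEADER):
    """Yield every word that immediately follows `header` (hypothetical helper).

    This assumes a single data word per header and may report a false
    positive if a data word happens to have the same value as the header.
    """
    for index, word in enumerate(words[:-1]):
        if word == header:
            yield words[index + 1]


# Illustrative buffer, not a real trace.
buffer_words = [
    0x20014040, 0x00000000,
    0x20014cef, 0x000aaaa0,
    0x2001408f, 0x00000001,
]

print([hex(value) for value in data_after(buffer_words)])

# If this word uses the same layout as the headers printed by dedma,
# its low 13 bits give the method offset divided by 4.
print(hex((HEADER & 0x1fff) << 2))
```

On this sample buffer the search returns 0xaaaa0, the value found in the example.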

See you! 😉

MP performance counters are now implemented on nvc0:nvc8

After two weeks of hard work, I managed to add support for MP performance counters on nvc0:nvc8. I have only tested my implementation on nvc1, but it should work on the other chipsets. nvc8 is not supported yet, but I'll add it in the next few weeks. In order to add this support, I had to implement compute support for nvc0, which is the ability to launch a kernel. My work is based on the compute support implementation by Christoph Bumiller (alias calim): http://people.freedesktop.org/~chrisbmr/90c0.c

http://lists.freedesktop.org/archives/mesa-dev/2013-July/041448.html

http://lists.freedesktop.org/archives/mesa-dev/2013-July/041449.html