libpciaccess now has official support for Windows/Cygwin

Hey,

During my Google Summer of Code 2013, part of my project was to reverse engineer GPU graphics counters on NVIDIA Tesla. However, these counters are only exposed on Windows through the NVIDIA NVPerfKit performance tools.

Usually, the Nouveau community uses envytools, a collection of tools that helps developers understand how NVIDIA GPUs work. Envytools depends on libpciaccess, which is only available on POSIX platforms. That’s why I decided to port libpciaccess to Windows/Cygwin, so that I could use these tools there.

This port depends on WinIo, which allows direct I/O port and physical memory access under Windows NT/2000/XP/2003/Vista/7 and 2008.

This port has been accepted into libpciaccess/master and was merged today. It has only been tested on 32-bit Windows 7 and still needs to be checked and fixed on 64-bit.

To use it, please follow the instructions found in README.cygwin.

This support helped me understand how GPU graphics counters work on NVIDIA Tesla. I have started writing documentation for these counters here.

See you later!

NV50 graphics counters are now almost fully documented

Hello everyone,

The second part of my GSoC project was to understand how NVIDIA graphics counters work on the Tesla family. As described in my previous post, I used my own implementation of libpciaccess on Windows 7 in order to read the PCOUNTER configuration of these signals through NVPerfKit and GDebugger.

After a few weeks of hard work, I have succeeded in documenting most of these signals. However, I still do not understand some of them (vertex_shader_busy, for example), but I’ll try to figure them out as soon as possible.

The results of my research are available on my GitHub.

The next step is to complete the documentation. After that, it could be interesting to provide an implementation like the NVPerfSDK for Linux.

Have a good day. ;)

libpciaccess now has Windows support through WinIo and Cygwin

libpciaccess is the best-known generic library for accessing PCI devices under Linux and BSD systems.

As you may know, libpciaccess is not supported under Windows. I don’t really know all the reasons, but the most important one is probably that almost all developers of open source drivers use Linux only.

However, as part of my Google Summer of Code, I need to use Windows in order to reverse engineer GPU graphics counters, which are only available through NVPerfKit.

These counters are programmed using PCOUNTER, the hardware unit that contains performance monitoring counters, and they are exposed through MMIO. So, I need full access to PCI devices in order to map the physical memory used by the blob into the virtual address space.

So, I added Windows support to libpciaccess, which now allows me to use the NVA tools (nvapeek, nvapoke…). That support is for Cygwin only, mainly because I didn’t test my implementation under MinGW, but I believe it should be really easy to port.

See you soon!

nvc0 compute support is now fixed

If you try to monitor MP performance counters through the HUD on nvc0, you should get the following error message:

gallium_hud: all queries are busy after 8 frames, can’t add another query.

This message occurs when the kernel is not synchronized, i.e. when it doesn’t run correctly.

Now, if you take a quick look at the kernel error messages, you should find more detail:

DATA_ERROR [INVALID_VALUE] ch 4 [0x000027f839 glxgears[11550]] subc 1 class 0xa0c0 mthd 0x02e8 data 0x0040cccc

In fact, according to rnndb, the data must be aligned to 0x8000 on nvc0.
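To make the alignment rule concrete, here is a minimal sketch of how such a check could look. The helper names (is_aligned, align_up) and the NVC0_COMPUTE_ALIGN macro are hypothetical and are not taken from the actual patch; only the 0x8000 alignment and the 0x0040cccc value come from the messages above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Alignment quoted above for this method on nvc0 (macro name is hypothetical). */
#define NVC0_COMPUTE_ALIGN 0x8000u

/* Returns true when value is a multiple of align (align must be a power of two). */
static bool is_aligned(uint32_t value, uint32_t align)
{
	return (value & (align - 1)) == 0;
}

/* Rounds value up to the next multiple of align (align must be a power of two). */
static uint32_t align_up(uint32_t value, uint32_t align)
{
	return (value + align - 1) & ~(align - 1);
}
```

With these helpers, is_aligned(0x0040cccc, NVC0_COMPUTE_ALIGN) is false, which is consistent with the INVALID_VALUE error reported for that data word.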

A three-line patch fixes the compute support on nvc0.

How to decode the pushbuffer using valgrind-mmt and dedma?

In some cases, information is not exposed through MMIO registers and the blob uses FIFO methods instead. For instance, the blob uses FIFO methods to enable MP counters. Let’s see how to decode them.

 
In this example, I use the NVC1 chipset, and I want to decode the pushbuffer used by the NVC0_COMPUTE class (0x000090c0).

First, you have to trace a signal using cupti_trace:

$ cupti_trace --trace NVC1 --event active_cycles

Now, you have to grep the trace for the FIFO object class id 0x000090c0:

$ grep 0x000090c0 active_cycles.trace
--6903-- out2 0x00000004 0x00000002 0x00000003 0x0000003d 0x0000003e 0x0000003f 0x00000040 0x00009197 0x000090b8 0x00000073 0x00005080 0x00009072 0x00009074 0x0000844c 0x000090dd 0x000090b2 0x000090b1 0x00008570 0x0000857a 0x0000857b 0x0000857c 0x0000857d 0x0000857e 0x0000007d 0x00009068 0x0000907f 0x0000906f 0x0000902d 0x00009097 0x000090c0 0x00009039 0x000090e0 0x000090e6 0x000090e2 0x000090e3 0x000050a0 0x00009096 0x000090e1 0x000090b3 0x000090b5 0x0000208a 0x000085b6 0x00009067 0x000090f1 0x0000503b 0x0000503c 0x00000075 
--6903-- out2 0x00000004 0x00000002 0x00000003 0x0000003d 0x0000003e 0x0000003f 0x00000040 0x00009197 0x000090b8 0x00000073 0x00005080 0x00009072 0x00009074 0x0000844c 0x000090dd 0x000090b2 0x000090b1 0x00008570 0x0000857a 0x0000857b 0x0000857c 0x0000857d 0x0000857e 0x0000007d 0x00009068 0x0000907f 0x0000906f 0x0000902d 0x00009097 0x000090c0 0x00009039 0x000090e0 0x000090e6 0x000090e2 0x000090e3 0x000050a0 0x00009096 0x000090e1 0x000090b3 0x000090b5 0x0000208a 0x000085b6 0x00009067 0x000090f1 0x0000503b 0x0000503c 0x00000075 
--6903-- pre_ioctl: fd:3, id:0x2b (full:0xc020462b), data: 0xc1d00511 0x5c0000c9 0x5c0000ca 0x000090c0 0x00000000 0x00000000 0x00000000 0x00000000 
--6903-- post_ioctl: fd:3, id:0x2b (full:0xc020462b), data: 0xc1d00511 0x5c0000c9 0x5c0000ca 0x000090c0 0x00000000 0x00000000 0x00000000 0x00000000 
--6903-- out 0x5c0000ca 0x000090c0 0x000090c0 0x00000001 
--6903-- w 2:0x2004, 0x000090c0 
--6903-- w 11:0x24300, 0x000090c3,0x000090c2,0x000090c1,0x000090c0 
--6903-- w 9:0x24300, 0x000090c3,0x000090c2,0x000090c1,0x000090c0 
--6903-- r 10:0x12180, 0x000090c6,0x000090c4,0x000090c2,0x000090c0 
--6903-- pre_ioctl: fd:3, id:0x2b (full:0xc020462b), data: 0xc1d00511 0x5c0000ec 0x5c0000ed 0x000090c0 0x00000000 0x00000000 0x00000000 0x00000000 
--6903-- post_ioctl: fd:3, id:0x2b (full:0xc020462b), data: 0xc1d00511 0x5c0000ec 0x5c0000ed 0x000090c0 0x00000000 0x00000000 0x00000000 0x00000000 
--6903-- out 0x5c0000ed 0x000090c0 0x000090c0 0x00000001 
--6903-- w 15:0x2004, 0x000090c0

The following line contains the map id, which is 2 in this example:

--6903-- w 2:0x2004, 0x000090c0
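If you prefer not to pick the map id out by eye, the lookup can be sketched in a few lines of C. This is only an illustration: the function names (parse_mmt_write, find_map_id) are hypothetical, and the line format is assumed from the trace excerpt above rather than from any valgrind-mmt documentation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parses one write line of the form
 *   --<pid>-- w <map id>:0x<offset>, 0x<value>
 * (format assumed from the trace above). Only the first data word is read.
 * Returns 1 and fills the outputs on success, 0 otherwise.
 */
static int parse_mmt_write(const char *line, unsigned *map_id,
			   uint32_t *offset, uint32_t *value)
{
	int pid;
	unsigned id, off, val;

	if (sscanf(line, "--%d-- w %u:0x%x, 0x%x", &pid, &id, &off, &val) != 4)
		return 0;
	*map_id = id;
	*offset = off;
	*value = val;
	return 1;
}

/*
 * Returns the map id of the first write line whose first data word equals
 * class_id, or -1 when no line matches.
 */
static int find_map_id(const char *const *lines, size_t count, uint32_t class_id)
{
	size_t i;

	for (i = 0; i < count; i++) {
		unsigned map_id;
		uint32_t offset, value;

		if (parse_mmt_write(lines[i], &map_id, &offset, &value) &&
		    value == class_id)
			return (int)map_id;
	}
	return -1;
}
```

Fed with the grep output above and class id 0x000090c0, find_map_id stops at the "w 2:0x2004" line and returns 2. The later "w 15:0x2004" line also matches, so taking the first hit is a choice made for this example.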

Now, you have to use dedma, which decodes the pushbuffer using rnndb (the output is truncated here).

$ dedma -m c0 -v 2 active_cycles.trace > active_cycles.dedma
20014000  size 1, subchannel 2 (0x0), offset 0x0000, increment
000090c0    NVC0_COMPUTE mapped to subchannel 2
20014040  size 1, subchannel 2 (0x90c0), offset 0x0100, increment
00000000    NVC0_COMPUTE.GRAPH.NOP = 0
200141d6  size 1, subchannel 2 (0x90c0), offset 0x0758, increment
00000002    NVC0_COMPUTE.MP_LIMIT = 0x2
200141e4  size 1, subchannel 2 (0x90c0), offset 0x0790, increment
00000000    NVC0_COMPUTE.TEMP_ADDRESS_HIGH = 0
200141e5  size 1, subchannel 2 (0x90c0), offset 0x0794, increment
10000000    NVC0_COMPUTE.TEMP_ADDRESS_LOW = 0x10000000
200141e6  size 1, subchannel 2 (0x90c0), offset 0x0798, increment
00000000    NVC0_COMPUTE.TEMP_SIZE_HIGH = 0
200141e7  size 1, subchannel 2 (0x90c0), offset 0x079c, increment
00700000    NVC0_COMPUTE.TEMP_SIZE_LOW = 0x700000
200141e8  size 1, subchannel 2 (0x90c0), offset 0x07a0, increment
00012600    NVC0_COMPUTE.WARP_TEMP_ALLOC = 0x12600
200141df  size 1, subchannel 2 (0x90c0), offset 0x077c, increment
03000000    NVC0_COMPUTE.LOCAL_BASE = 0x3000000
20014081  size 1, subchannel 2 (0x90c0), offset 0x0204, increment
000000f0    NVC0_COMPUTE.LOCAL_POS_ALLOC = 0xf0
20014082  size 1, subchannel 2 (0x90c0), offset 0x0208, increment
000007c0    NVC0_COMPUTE.LOCAL_NEG_ALLOC = 0x7c0
20014083  size 1, subchannel 2 (0x90c0), offset 0x020c, increment
00001000    NVC0_COMPUTE.WARP_CSTACK_SIZE = 0x1000
20014359  size 1, subchannel 2 (0x90c0), offset 0x0d64, increment
0000000f    NVC0_COMPUTE.CALL_LIMIT_LOG = 0xf
200140c2  size 1, subchannel 2 (0x90c0), offset 0x0308, increment
00000003    NVC0_COMPUTE.CACHE_SPLIT = 48K_SHARED_16K_L1
20014085  size 1, subchannel 2 (0x90c0), offset 0x0214, increment
01000000    NVC0_COMPUTE.SHARED_BASE = 0x1000000
20014093  size 1, subchannel 2 (0x90c0), offset 0x024c, increment
00000000    NVC0_COMPUTE.SHARED_SIZE = 0
200140a8  size 1, subchannel 2 (0x90c0), offset 0x02a0, increment
00008000    NVC0_COMPUTE.UNK02A0 = 0x8000
2001408e  size 1, subchannel 2 (0x90c0), offset 0x0238, increment
00010001    NVC0_COMPUTE.GRIDDIM_YX = { X = 1 | Y = 1 }
2001408f  size 1, subchannel 2 (0x90c0), offset 0x023c, increment
00000001    NVC0_COMPUTE.GRIDDIM_Z = 1
200140eb  size 1, subchannel 2 (0x90c0), offset 0x03ac, increment
00010001    NVC0_COMPUTE.BLOCKDIM_YX = { X = 1 | Y = 1 }
200140ec  size 1, subchannel 2 (0x90c0), offset 0x03b0, increment
00000001    NVC0_COMPUTE.BLOCKDIM_Z = 1
200140b1  size 1, subchannel 2 (0x90c0), offset 0x02c4, increment
00000000    NVC0_COMPUTE.UNK02C4 = FALSE
...

However, dedma fails to parse the stream when the blob takes method data from a different buffer, so you have to do that part by hand, but it’s pretty easy. You just have to find the data word that follows the 0x20014cef header. In this example, I find 0xaaaa0, which is the value of MP_PM_OP: https://github.com/pathscale/envytools/blob/master/rnndb/nvc0_compute.xml#L252
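When decoding by hand, it helps to split a header word such as 0x20014cef into its fields. The sketch below does that in C. The bit layout is inferred from the dedma output above (size, subchannel and offset all match it), and the struct and function names are hypothetical, so treat it as an aid for reading traces rather than as a reference decoder.

```c
#include <stdint.h>

/* Fields of one pushbuffer header word, as printed by dedma above. */
struct pb_header {
	unsigned mode;       /* bits 31:29, 1 for every "increment" header above */
	unsigned size;       /* bits 28:16, number of data words that follow */
	unsigned subchannel; /* bits 15:13 */
	unsigned method;     /* bits 12:0 times 4, the "offset" printed by dedma */
};

/* Splits a header word into its fields (layout inferred from the dedma output). */
static struct pb_header decode_header(uint32_t word)
{
	struct pb_header h;

	h.mode = word >> 29;
	h.size = (word >> 16) & 0x1fff;
	h.subchannel = (word >> 13) & 0x7;
	h.method = (word & 0x1fff) << 2;
	return h;
}
```

For instance, decode_header(0x200141d6) gives size 1, subchannel 2 and method 0x0758, which is the MP_LIMIT entry in the dedma output. Applied to 0x20014cef, it gives size 1, subchannel 2 and method 0x33bc, so exactly one data word follows that header.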

See you! ;)

MP performance counters are now implemented on nvc0:nvc8

After two weeks of hard work, I managed to add support for MP performance counters on nvc0:nvc8. I have only tested my implementation on nvc1, but it should work on the other chipsets. The exception is nvc8, which I’ll add in the next few weeks. In order to add this support, I had to implement compute support for nvc0, which is the ability to launch a kernel. My work is based on the compute support implementation by Christoph Bumiller (alias calim): http://people.freedesktop.org/~chrisbmr/90c0.c

http://lists.freedesktop.org/archives/mesa-dev/2013-July/041448.html

http://lists.freedesktop.org/archives/mesa-dev/2013-July/041449.html

Reading a single NVIDIA performance counter through nv_perfmon

nv_perfmon is a tool developed by Ben Skeggs that lets users read some NVIDIA performance counters. Currently, only the NVE0 chipset is supported. nv_perfmon provides an ncurses interface to be more user-friendly, and it displays the performance counters continuously. In order to read a counter only once, I wrote a little tool based on Ben Skeggs’ original code. That tool takes a single command-line argument, which is the name of the signal.

#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <core/device.h>
#include <core/class.h>

static struct nouveau_object *client;
static struct nouveau_object *device;
static char **signals;
static int nr_signals;

/* Set up a counter for the named signal, then sample and read it until a value is available. */
static void
trace_event(char *name, u32 *value)
{
        struct nv_perfctr_sample sample;
        struct nouveau_object *object;
        struct nv_perfctr_read read;
        int ret;

        ret = nouveau_object_new(client, 0x00000000, 0xc0000000,
                NV_PERFCTR_CLASS,
                &(struct nv_perfctr_class) {
                    .logic_op = 0xaaaa,
                    .signal[0].name = name,
                    .signal[0].size = strlen(name)
                }, sizeof(struct nv_perfctr_class),
                &object);
        assert(ret == 0);

        do {
                ret = nv_exec(object, NV_PERFCTR_SAMPLE, &sample, sizeof(sample));
                assert(ret == 0);

                ret = nv_exec(object, NV_PERFCTR_READ, &read, sizeof(read));
                assert(ret == 0 || ret == -EAGAIN);
                if (ret == 0) {
                        *value = read.ctr;
                }
        } while (ret == -EAGAIN);
}

int
main(int argc, char **argv)
{
	struct nv_perfctr_query args = {};
	struct nouveau_object *object;
	char *signal_name;
	u32 value;
	int ret;

	if (argc < 2) {
		fprintf(stderr, "Usage: %s <signal_name>\n", argv[0]);
		return 1;
	}
	signal_name = argv[1];

	ret = os_client_new(NULL, "error", argc, argv, &client);
	if (ret)
		return ret;

	ret = nouveau_object_new(client, 0xffffffff, 0x00000000,
				 NV_DEVICE_CLASS, &(struct nv_device_class) {
					.device = ~0ULL,
					.disable = ~(NV_DEVICE_DISABLE_MMIO |
						     NV_DEVICE_DISABLE_VBIOS |
						     NV_DEVICE_DISABLE_CORE |
						     NV_DEVICE_DISABLE_IDENTIFY),
					.debug0 = ~((1ULL << NVDEV_SUBDEV_TIMER) |
						    (1ULL << NVDEV_ENGINE_PERFMON)),
				}, sizeof(struct nv_device_class), &device);
	if (ret)
		return ret;

	ret = nouveau_object_new(client, 0x00000000, 0xdeadbeef,
				 NV_PERFCTR_CLASS, &(struct nv_perfctr_class) {
				 }, sizeof(struct nv_perfctr_class), &object);
	assert(ret == 0);
	do {
		u32 prev_iter = args.iter;

		args.name = NULL;
		ret = nv_exec(object, NV_PERFCTR_QUERY, &args, sizeof(args));
		assert(ret == 0);

		if (prev_iter) {
			nr_signals++;
			signals = realloc(signals, nr_signals * sizeof(char*));
			signals[nr_signals - 1] = malloc(args.size);

			args.iter = prev_iter;
			args.name = signals[nr_signals - 1];

			ret = nv_exec(object, NV_PERFCTR_QUERY,
				      &args, sizeof(args));
			assert(ret == 0);
		}
	} while (args.iter != 0xffffffff);

	nouveau_object_del(client, 0x00000000, 0xdeadbeef);

	trace_event(signal_name, &value);
	printf("Name  = %s\n", signal_name);
	printf("Value = %10u\n", value);

	while (nr_signals--)
		free(signals[nr_signals]);
	free(signals);
	return 0;
}