CUPTI: Understand the event collection modes

The event collection mode determines the period over which the events in the enabled event groups are collected. There are two main modes:

  • Continuous mode: Events are collected for the entire duration between the cuptiEventGroupEnable and cuptiEventGroupDisable calls. This is the default mode.
  • Kernel mode: Events are collected only during the kernel executions that occur between the cuptiEventGroupEnable and cuptiEventGroupDisable calls. Collection begins when a kernel starts executing and stops when it completes. If several kernels execute between the cuptiEventGroupEnable and cuptiEventGroupDisable calls, the event values must be read after each kernel launch if they need to be associated with that specific launch.
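
To make the difference between the two modes concrete, here is a toy model in Python. It is only a sketch: it does not talk to CUPTI or to the hardware, and the names (count_events, continuous_mode, kernel_mode) and the timeline values are made up for the illustration. Events are represented by sorted timestamps, and kernels are assumed to run entirely inside the enable/disable window.

```python
from bisect import bisect_left

def count_events(event_times, start, stop):
    """Count events with start <= t < stop (event_times must be sorted)."""
    return bisect_left(event_times, stop) - bisect_left(event_times, start)

def continuous_mode(event_times, enable, disable):
    """Everything between enable and disable is counted."""
    return count_events(event_times, enable, disable)

def kernel_mode(event_times, kernels):
    """One count per kernel, as if the counter were read after each launch."""
    return [count_events(event_times, k_start, k_stop)
            for k_start, k_stop in kernels]

# Hypothetical timeline: six events, two kernels inside the window [0, 10).
events = [1, 3, 4, 6, 8, 9]
kernels = [(2, 5), (7, 10)]

total = continuous_mode(events, 0, 10)
per_kernel = kernel_mode(events, kernels)
print(total, per_kernel)
```

In this model, the events that fall outside any kernel (timestamps 1 and 6) are counted in continuous mode only, and the per-kernel attribution is only possible because a value is read after each kernel.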

1. Program the continuous mode (default mode)
Before configuring the sources selection, the blob initializes the following registers according to the number of sources.

One source:

(w) register: 504660, value: 0000aaaa, mask: ffffffff  { 0 = 0xaaaa | 1 = 0 }
(w) register: 504664, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }

Two sources:

(w) register: 504660, value: aaaaaaaa, mask: ffffffff  { 0 = 0xaaaa | 1 = 0xaaaa }
(w) register: 504664, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }

Three sources:

(w) register: 504660, value: aaaaaaaa, mask: ffffffff  { 0 = 0xaaaa | 1 = 0xaaaa }
(w) register: 504664, value: 0000aaaa, mask: ffffffff  { 0 = 0xaaaa | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }

And so on: each additional source appears to set the next 16-bit half of these registers to 0xaaaa.
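
The pattern in the three cases above can be written down as a small Python helper. This is a sketch of what the traces suggest, not a description of what the blob actually does: the function name continuous_mode_init is invented, and the extrapolation beyond three sources (up to eight, two 16-bit halves per register) is an assumption that has not been checked against a trace.

```python
PM_FUNC_BASE = 0x504660   # first of the four registers seen in the traces
NUM_REGS = 4
HALF_VALUE = 0xaaaa       # value written for each enabled source

def continuous_mode_init(n_sources):
    """Return the (register, value) writes expected for n_sources sources.

    Assumption: source k uses half (k % 2) of register (k // 2), which
    matches the traces for one, two and three sources.
    """
    if not 0 <= n_sources <= 2 * NUM_REGS:
        raise ValueError("n_sources must be between 0 and 8")
    values = [0] * NUM_REGS
    for k in range(n_sources):
        values[k // 2] |= HALF_VALUE << (16 * (k % 2))
    return [(PM_FUNC_BASE + 4 * i, v) for i, v in enumerate(values)]

for reg, value in continuous_mode_init(3):
    print("(w) register: %x, value: %08x" % (reg, value))
```

With n_sources set to 1, 2 or 3, the helper gives the same values as the three listings above.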

In the trace below, there is only one source.

(Configure signals selection)
(Configure mode)
(w) register: 504660, value: 0000aaaa, mask: ffffffff  { 0 = 0xaaaa | 1 = 0 }
(w) register: 504664, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff  { 0 = 0 | 1 = 0 }
(Configure sources selection)
(Read counters)
(w) register: 504660, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0]   => { 0 = 0 | 1 = 0 }
(w) register: 504664, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x1] => { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x2] => { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x3] => { 0 = 0 | 1 = 0 }
(w) register: 504660, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0]   => { 0 = 0 | 1 = 0 }
(w) register: 504664, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x1] => { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x2] => { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x3] => { 0 = 0 | 1 = 0 }

After reading the counters, the blob resets these registers to 0, apparently twice (see the two groups of four writes at the end of the trace above).
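
For anyone who wants to process such traces automatically, the write lines can be decoded with a few lines of Python. This is a minimal sketch that only handles the line format shown in this post. The name parse_write is made up, and the split into two 16-bit halves simply mirrors the { 0 = … | 1 = … } annotations (half 0 being the low 16 bits).

```python
import re

# Matches lines such as:
# (w) register: 504660, value: 0000aaaa, mask: ffffffff
WRITE_RE = re.compile(
    r"\(w\) register: ([0-9a-f]+), value: ([0-9a-f]+), mask: ([0-9a-f]+)")

def parse_write(line):
    """Decode one write line, or return None for any other line."""
    m = WRITE_RE.search(line)
    if m is None:
        return None
    reg, value, mask = (int(g, 16) for g in m.groups())
    halves = (value & 0xffff, (value >> 16) & 0xffff)  # (half 0, half 1)
    return {"reg": reg, "value": value, "mask": mask, "halves": halves}

sample = "(w) register: 504660, value: 0000aaaa, mask: ffffffff  { 0 = 0xaaaa | 1 = 0 }"
decoded = parse_write(sample)
print(hex(decoded["reg"]), [hex(h) for h in decoded["halves"]])
```

Marker lines such as (Configure mode) do not match the pattern and yield None, so they can be used to split a trace into steps.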

2. Program the kernel mode
Before configuring the sources selection, the blob initializes the following registers to 0.

(Configure signals selection)
(Configure mode)
(w) register: 504660, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0]   => { 0 = 0 | 1 = 0 }
(w) register: 504664, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x1] => { 0 = 0 | 1 = 0 }
(w) register: 504668, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x2] => { 0 = 0 | 1 = 0 }
(w) register: 50466c, value: 00000000, mask: ffffffff <== PGRAPH.GPC[0].TP[0].MP.PM_FUNC[0x3] => { 0 = 0 | 1 = 0 }
(Configure sources selection)
(Read counters)
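
Comparing both traces, the only visible difference is the value written to these four registers during the (Configure mode) step. A rough Python heuristic for telling the two modes apart could look like this. It is an assumption drawn from the traces in this post (NVC1 only), and guess_collection_mode is a hypothetical name.

```python
PM_FUNC_REGS = (0x504660, 0x504664, 0x504668, 0x50466c)

def guess_collection_mode(mode_writes):
    """Guess the mode from the (register, value) writes of the mode step.

    Heuristic: any non-zero value in the four registers means continuous
    mode; all zeros means kernel mode.
    """
    values = [val for reg, val in mode_writes if reg in PM_FUNC_REGS]
    if not values:
        raise ValueError("no write to the expected registers")
    return "continuous" if any(values) else "kernel"

continuous_trace = [(0x504660, 0x0000aaaa), (0x504664, 0),
                    (0x504668, 0), (0x50466c, 0)]
kernel_trace = [(reg, 0) for reg in PM_FUNC_REGS]

print(guess_collection_mode(continuous_trace))
print(guess_collection_mode(kernel_trace))
```

The heuristic cannot tell kernel mode from continuous mode with zero sources, since both would write only zeros.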

Tested with the following signals (domains c and d) on the NVC1 chipset:

active_cycles
active_warps
atom_count
branch
divergent_branch
gld_inst_128bit
gld_inst_16bit
gld_inst_32bit
gld_inst_64bit
gld_inst_8bit
gld_request
gred_count
gst_inst_128bit
gst_inst_16bit
gst_inst_32bit
gst_inst_64bit
gst_inst_8bit
gst_request
inst_executed
inst_issued1_0
inst_issued1_1
inst_issued2_0
inst_issued2_1
local_load
local_store
prof_trigger_00
prof_trigger_01
prof_trigger_02
prof_trigger_03
prof_trigger_04
prof_trigger_05
prof_trigger_06
prof_trigger_07
shared_load
shared_store
thread_inst_executed_0
thread_inst_executed_1
thread_inst_executed_2
thread_inst_executed_3
threads_launched
warps_launched
