Profiling tools help developers evaluate how their code interacts with the undelying processor. Details on cache access, memory reads/writes and processor clock cycles can be used to provide summarized analysis of a program.

ARM processor provides a Performance Monitoring Unit (PMU) as part of its architecture to enable gathering of processor execution info. Using the PMU we can track events occuring on the core via counters. We will need to configure the PMU to select events which will increment the counter. I will briefly describe the PMU for the ARMv7 architecture but a detailed decription can be found in the ARMv7 architecture reference manual.

Privilege please

But using the PMU from userspace isn’t straight forward. We need to enable userspace access through privileged execution i.e. kernel space. We need to write a kernel module with ARMv7 specific assembly instructions to enable the PMU userspace enable register (PMUSERENR).

PMUSERENR

Bit 0 -> Set to 1 to enable user mode access

The access to this register is provided via the co-processor cp15 of the ARMv7 architecture with the following assembly instruction. p15 is the coprocessor name and c9, c14 are its registers.

mcr p15, 0, <Rt>, c9, c14, 0 ; Write Rt into PMUSERENR

We will need to invoke this from our kernel module. We can execute assembly within C code using GCC’s asm directive. The final call will look like:

static inline void asm_user_enable(void)
{
	asm volatile("mcr p15, 0, %0, c9, c14, 0": : "r"(1));
}
....
asm_user_enable();

Once our kernel modules completes executing this instruction, we have userspace access.

But wait!

Modern computers have support for SMP (Symmetric Multiprocessing) and what this essentially means is that there are multiple cores for executing threads independently.

So for example when we run the above instruction in a dual core processor, we are enabling userspace access to PMU from only core, say CORE0. Now if the userspace application is running on CORE1 and tries to access the PMU, the OS will cry with illegal instruction error (SIGILL).

In order to overcome this, the kernel API provides on_each_cpu to invoke our instruction on all cores.

on_each_cpu(asm_user_enable,NULL,1)

Now irrespective of which core our userspace program runs on, we will have access to the PMU.

Events

The ARM manual provides a list of supported events and their respective event numbers which can be tracked by the PMU. Here we will consider the event Store Instruction architecturally executed, also called as event number 0x7.

The counter increments for every executed memory-writing instruction, including SWP.

Developing the userspace program

Enable counters

In order to enable the counter we need to set Bit 0 of Performance Monitors Control Register (PMCR) to 1.

Bit 0 -> 0 -> All counters disabled

Bit 0 -> 1 -> All counter enabled

static inline void asm_enable_counters(void)
{
	asm volatile("mcr p15, 0, %0, c9, c12, 0": :"r"(1));
}

Bits[15-11] is read only and indicates the number of counters. On my board it indicates the presence of six counters.

Select the counter

The Performance Monitors Event Counter Selection Register (PMSELR) is to select the counter to record the event.

BIT[4:0] is used to indicate the counter number. Counters are numbered beginning with 0.

static inline void asm_select_counter(unsigned long val)
{
	asm volatile("mcr p15, 0, %0, c9, c12, 5": : "r"(val));
}

Select the event

The Performance Monitors Event Type Select Register (PMXEVTYPER) contains Bits[7-0] which is used to indicate the event to be counted by our selected counter from the previous step. For our example it is event number 0x7.

static inline void asm_event_select(unsigned long val)
{
	asm volatile("mcr p15, 0, %0, c9, c13, 1": : "r"(val));
}

Reset counters

Before starting the counter, reset it using the PMCR register.

static inline void asm_reset_counters(void)
{
	unsigned long val;
	asm volatile("mrc p15, 0, %0, c9, c12, 0": "=r"(val):);
	val=val|0x2;
	asm volatile("mcr p15, 0, %0, c9, c12, 0": :"r"(val));
}

Enable counting

We can now start counting the event using the Performance Monitors Count Enable Set register (PMCNTENSET). Depending on the counter to be enabled, the respective bit number needs to be set. For example in order to start Counter0, the Bit 0 needs to be set.

static inline void asm_start_counter(unsigned long val)
{
	asm volatile("mcr p15, 0, %0, c9, c12, 1": : "r"(val));
}

The code which is under investigation must be encapsulated within the start and end of the counting.

Disable counting

Once completed the counter can be disabled using the PMCR register. We do this by setting the Bit 0 to 0.

static inline void asm_disable_counters(void)
{
	asm volatile("mcr p15, 0, %0, c9, c12, 0": :"r"(0));
}

Read the counter

We can now read our selected counter. This is facilitated by the Performance Monitors Event Count Register (PMXEVCNTR). Before accessing this register we need to select the counter which have already done before.

static inline unsigned long asm_read_counter(void)
{
	unsigned long val;
	asm volatile("mrc p15, 0, %0, c9, c13, 2": "=r"(val):);
	return val;
}

Demo code

I considered the following simple for loop to verify the accuracy of the counters. SIZE is defined to be 100.

int a[SIZE],i=0;

for(;i<SIZE;i++)
{
	a[i]=i;
}

Upon disassembly of the loop we find the following the instructions.

[ ... ]
    8634:	ea00000a 	b	8664 <main+0xb8>
    8638:	e51b2008 	ldr	r2, [fp, #-8]
    863c:	e59f3060 	ldr	r3, [pc, #96]	; 86a4 <main+0xf8>
    8640:	e1a02102 	lsl	r2, r2, #2
    8644:	e24b1004 	sub	r1, fp, #4
    8648:	e0812002 	add	r2, r1, r2
    864c:	e0823003 	add	r3, r2, r3
    8650:	e51b2008 	ldr	r2, [fp, #-8]
    8654:	e5832000 	str	r2, [r3]		; STORE instruction
    8658:	e51b3008 	ldr	r3, [fp, #-8]
    865c:	e2833001 	add	r3, r3, #1
    8660:	e50b3008 	str	r3, [fp, #-8]	; STORE instruction
    8664:	e51b3008 	ldr	r3, [fp, #-8]
    8668:	e3530063 	cmp	r3, #99	; 0x63
[ ... ]

We can see that there are two STR(store) instructions. So theoretically, a loop executing 100 times would increment the counter to 200.

When I ran the test I got a reading of 206. Some of these could be attributed to function calls to start and end the counter but others are probably within the kernel code for memory access.

I have uploaded my test code here for both the kernel module and userspace program here.

Feel free let me know your opinion.