L1-dcache-stores, LLC-stores, cache-references and memory counters don't add up in Linux perf?

Posted by Anonymous » 01 Aug 2025, 03:29

I'm trying to measure the memory-bus-related performance of a simple test program on an Intel N150 (Twin Lake, which has four Gracemont cores, like the E-cores of Alder Lake). The L1-dcache and uncore counters make sense, cache-references somewhat less so, and LLC-[loads|stores] are just weird. I assumed that LLC-[load|store]-misses should be directly related to transactions on the memory bus: an LLC miss should result in a DRAM access. But the counters don't show that at all. I also can't find the LLC events in /sys/, so I don't know which raw PMU events they map to:
$ ls /sys/bus/event_source/devices/*/events/ | grep -i "llc"
$
The program simply initializes a large data array (1 GB) and runs a trivial computation over it 32 or 64 times:
constexpr int N = 256'000'000;
unsigned int AData[N];

template <typename T>
T procItem(T item) {
    return item & 0b11011101011;
}

int main() {
...
for (unsigned long long i=0; i
The cache-* and L1-dcache-* counters increase by a factor of 2, as expected. But the LLC-* counters do not. LLC-stores in particular are strange: they don't change significantly at all. mem-loads always shows 0 counts, and the mem-stores counts are the same as L1-dcache-stores. (I can't find the L1-dcache events under /sys/bus/event_source/devices/*/events/, so I can't compare the raw events and umasks.)
$ cat /sys/bus/event_source/devices/cpu/events/mem-loads
event=0xd0,umask=0x5,ldlat=3
$ cat /sys/bus/event_source/devices/cpu/events/mem-stores
event=0xd0,umask=0x6
$ uname -r
6.11.0-29-generic
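The sysfs strings above can be turned into a raw event code by hand, which makes it easier to compare named events against each other. The following is only a minimal sketch, assuming the usual Intel core-PMU layout (event select in config bits 0-7, unit mask in config bits 8-15, as typically listed under /sys/bus/event_source/devices/cpu/format/). `rawConfig` and the constant names are hypothetical helpers, not part of perf, and the ldlat field is not encoded here.

```cpp
#include <cstdint>

// Pack an event select and a unit mask into a raw config value.
// Assumption: event occupies config bits 0-7 and umask occupies bits 8-15.
constexpr std::uint64_t rawConfig(std::uint64_t event, std::uint64_t umask) {
    return (event & 0xffu) | ((umask & 0xffu) << 8);
}

// Values copied from the sysfs output above (ldlat is a separate field, ignored here).
constexpr std::uint64_t kMemLoads  = rawConfig(0xd0, 0x5);  // 0x05d0
constexpr std::uint64_t kMemStores = rawConfig(0xd0, 0x6);  // 0x06d0
```

Under that assumption, the values could be passed to perf as raw events (for example `-e r06d0`) and counted next to the named events to see whether they match.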
Then, if I compile it with -Og, I get this:
n_proc=32
Performance counter stats for './test':

973 674 502 cache-references (50,01%)
671 750 305 cache-misses # 68,99% of all cache refs (62,51%)
16 764 160 104 L1-dcache-loads (62,50%)
58 190 499 LLC-loads (62,50%)
560 785 LLC-load-misses # 0,96% of all LL-cache accesses (62,50%)
16 958 572 799 L1-dcache-stores (62,49%)
11 632 355 LLC-stores (37,50%)
6 481 939 LLC-store-misses (37,50%)

11,281720398 seconds time elapsed

9,968157000 seconds user
0,311973000 seconds sys

n_proc=64
Performance counter stats for './test':

1 915 396 715 cache-references (50,00%)
1 313 091 424 cache-misses # 68,55% of all cache refs (62,50%)
33 175 225 108 L1-dcache-loads (62,50%)
115 378 508 LLC-loads (62,51%)
1 089 098 LLC-load-misses # 0,94% of all LL-cache accesses (62,50%)
33 354 560 864 L1-dcache-stores (62,50%)
12 073 424 LLC-stores (37,49%)
6 552 391 LLC-store-misses (37,50%)

21,374890682 seconds time elapsed

20,049017000 seconds user
0,318920000 seconds sys
The L1-dcache counts decreased, as expected from more efficient code. But LLC-loads increased compared with the runs without -Og. LLC-loads do increase by a factor of 2 from n_proc=32 to n_proc=64, which makes sense. But LLC-stores did not change. To include the uncore counters, perf stat has to run system-wide with -a; otherwise the uncore events cannot be measured.
perf stat -e cache-references,cache-misses,L1-dcache-loads,LLC-loads,LLC-load-misses,L1-dcache-stores,LLC-stores,LLC-store-misses \
-e unc_m_cas_count_rd,unc_m_cas_count_wr -e uncore_imc_free_running/data_read/ \
-a -- ./test
With -Og compilation:
n_proc=32
Performance counter stats for 'system wide':

998 009 503 cache-references (49,99%)
675 324 843 cache-misses # 67,67% of all cache refs (62,50%)
16 831 117 958 L1-dcache-loads (62,50%)
61 496 976 LLC-loads (62,51%)
556 737 LLC-load-misses # 0,91% of all LL-cache accesses (62,51%)
16 999 289 798 L1-dcache-stores (62,51%)
12 125 538 LLC-stores (37,49%)
6 425 956 LLC-store-misses (37,49%)
547 439 524 unc_m_cas_count_rd
528 958 625 unc_m_cas_count_wr
33 413,03 MiB uncore_imc_free_running/data_read/

11,515426638 seconds time elapsed

n_proc=64
Performance counter stats for 'system wide':

1 964 026 474 cache-references (50,01%)
1 322 080 946 cache-misses # 67,31% of all cache refs (62,51%)
33 291 196 083 L1-dcache-loads (62,50%)
122 590 187 LLC-loads (62,50%)
1 083 470 LLC-load-misses # 0,88% of all LL-cache accesses (62,50%)
33 430 279 894 L1-dcache-stores (62,50%)
13 117 422 LLC-stores (37,50%)
6 436 536 LLC-store-misses (37,50%)
1 077 224 939 unc_m_cas_count_rd
1 041 069 003 unc_m_cas_count_wr
65 748,53 MiB uncore_imc_free_running/data_read/

21,641199259 seconds time elapsed
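As a rough cross-check of the uncore numbers above, the measured uncore_imc_free_running/data_read/ volume can be compared with the traffic the test program should generate. This is a sketch under stated assumptions: sizeof(unsigned int) is 4, each of the n_proc passes streams the whole array from DRAM exactly once, and all other system-wide traffic (including the initialization pass) is ignored. The function and constant names are illustrative only.

```cpp
#include <cstdint>

// Sizes taken from the test program: N elements of unsigned int.
constexpr std::uint64_t kN = 256'000'000;
constexpr std::uint64_t kElemBytes = 4;  // assumption: sizeof(unsigned int) == 4

// Estimated DRAM read volume in MiB if each of nProc passes reads the array once.
constexpr double expectedReadMiB(std::uint64_t nProc) {
    return static_cast<double>(kN * kElemBytes * nProc) / (1024.0 * 1024.0);
}

// Ratio of the measured data_read value (MiB) to the estimate.
constexpr double measuredOverExpected(double measuredMiB, std::uint64_t nProc) {
    return measuredMiB / expectedReadMiB(nProc);
}

// Measured values copied from the system-wide -Og runs above.
constexpr double kRatio32 = measuredOverExpected(33413.03, 32);  // n_proc=32
constexpr double kRatio64 = measuredOverExpected(65748.53, 64);  // n_proc=64
```

With these inputs both ratios land roughly 5 to 7 percent above 1, which is consistent with the array passes dominating the read traffic and the remainder coming from everything else that ran during the measurement.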
So, uncore CAS events also make sense. It looks like 1 CAS command corresponds to a transaction of 32 bytes: 1G read + 1G write commands = 64 GB of uncore_imc_free_running/data_read/. Is this correct? Does it depend on the register size in the instructions, or is it always counted per byte?

perf list says in one line that cache-references is a Hardware event, and in another that it is a Kernel PMU event. If it is a kernel PMU event, could these counters be somewhat unreliable? I.e., should cache-misses be equal to unc_m_cas_count_rd + unc_m_cas_count_wr? Or can one cache miss trigger two memory transactions, a write and a read of the DRAM together?

It looks like LLC-loads means something; it is just unclear how it relates to the other metrics. But LLC-stores are weird. I can't find these events under /sys/bus/event_source/devices/, but they are listed at the start of perf list:
$ perf list
branch-instructions OR branches [Hardware event]
...

tool:
...

cache:
L1-dcache-loads OR cpu/L1-dcache-loads/
L1-dcache-stores OR cpu/L1-dcache-stores/
L1-icache-loads OR cpu/L1-icache-loads/
L1-icache-load-misses OR cpu/L1-icache-load-misses/
LLC-loads OR cpu/LLC-loads/
LLC-load-misses OR cpu/LLC-load-misses/
LLC-stores OR cpu/LLC-stores/
LLC-store-misses OR cpu/LLC-store-misses/
...
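Returning to the CAS question, the bytes-per-CAS guess can be checked arithmetically against the n_proc=32 system-wide run. The sketch below computes the average under two readings of uncore_imc_free_running/data_read/: that it covers only read CAS commands, or that it covers reads and writes together. Which reading is right is an assumption either way; the names `bytesPerCas`, `kPerReadCas` and `kPerAnyCas` are illustrative.

```cpp
#include <cstdint>

constexpr double kMiB = 1024.0 * 1024.0;

// Average bytes moved per CAS command, given a volume in MiB and a CAS count.
constexpr double bytesPerCas(double volumeMiB, std::uint64_t casCount) {
    return volumeMiB * kMiB / static_cast<double>(casCount);
}

// n_proc=32, -Og, system-wide run (numbers copied from the output above).
constexpr std::uint64_t kCasRd = 547'439'524;   // unc_m_cas_count_rd
constexpr std::uint64_t kCasWr = 528'958'625;   // unc_m_cas_count_wr
constexpr double kDataReadMiB = 33413.03;       // uncore_imc_free_running/data_read/

// Reading A: data_read covers only the read CAS commands.
constexpr double kPerReadCas = bytesPerCas(kDataReadMiB, kCasRd);
// Reading B: data_read covers read and write CAS commands together.
constexpr double kPerAnyCas = bytesPerCas(kDataReadMiB, kCasRd + kCasWr);
```

With these inputs, reading A comes out very close to 64 bytes per read CAS and reading B close to 32.5 bytes per CAS, so the 32-byte figure only follows if data_read really includes the write commands.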
I also ran this program in VTune Memory Access analysis. The analysis shows the CAS counters for the memory bus bandwidth on the platform. It looks like VTune uses the mem_uops_retired.all_[loads|stores] counters as loads and stores, and the L1-dcache-* events map to exactly the same counters.
perf stat -e L1-dcache-loads,L1-dcache-stores \
-e mem_uops_retired.all_loads,mem_uops_retired.all_stores \
-- ./test

Performance counter stats for './test':

33 190 440 300 L1-dcache-loads
33 364 392 208 L1-dcache-stores
33 190 440 300 mem_uops_retired.all_loads
33 364 392 208 mem_uops_retired.all_stores

31,416008016 seconds time elapsed

29,959729000 seconds user
0,444892000 seconds sys

More details here: https://stackoverflow.com/questions/797 ... ter-dont-a
