- 1 Questions
- 2 A load of articles/blogs/pages to read
- 2.1 Wikipedia articles
- 2.2 More posts
- 2.3 Notes
- 2.4 My own RAM
- 2.5 Continued notes
- 2.6 Address mapping notation
- 2.7 Intel spec
- 2.8 Rank interleaving
- 3 reMap: using Performance counters
- 4 Sudoku
- 5 Sudoku, now with only 1 DIMM
- 6 Final results
- 7 decode-dimms
- 7.1 Bank groups
- 7.2 Refresh
- 7.3 Random access throughput
- 8 CPU benchmarks
- 9 Remaining questions
These are chronological (and thus only lightly organized) notes on my attempt to understand how DDR4 and DDR5 memory works.
See the final results for the conclusion on how virtual addresses map to physical addresses.
1 Questions Link to heading
Specific questions to be answered:
- How is data distributed among the chips?
- How does dual channel DDR work?
- And does the CPU become a bottleneck here?
- How does RAM work?
- Are there internal cache sizes?
- What is burst?
- Can we disable it?
- Can we read only a single word?
- And prefetching? (8n, 16n)
- Why? (reduce downtime / communication overhead)
- Stored how?
- Why does DDR5 have two 32-bit channels instead of DDR4’s single 64-bit channel?
- How large are rows and columns?
- What about LPDDR (= Low-Power DDR)?
- What’s a transfer anyway?
- banks? bank groups?
- transferring to different bank groups is faster?
- DDR5: 2/4/8 bank groups
- How does virtual memory map to physical memory?
2 A load of articles/blogs/pages to read Link to heading
2.1 Wikipedia articles Link to heading
Wikipedia articles:
- DDR: https://en.wikipedia.org/wiki/DDR_SDRAM
- DDR4: https://en.wikipedia.org/wiki/DDR4_SDRAM
- DDR5: https://en.wikipedia.org/wiki/DDR5_SDRAM
- https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory#PREFETCH
- https://en.wikipedia.org/wiki/Interleaved_memory
- https://en.wikipedia.org/wiki/Memory_refresh
2.2 More posts Link to heading
- https://www.synopsys.com/blogs/chip-design/ddr-generations-memory-density-speed.html
- https://www.synopsys.com/articles/ddr4-bank-groups.html
- (Mahling, Weisgut, and Rabl 2025)
- https://api.drum.lib.umd.edu/server/api/core/bitstreams/5ec3f878-7a0d-4df8-a98f-1600b5c35e2b/content
- https://www.synopsys.com/blogs/chip-design/ddr5-technology-advancements-performance.html
- https://superuser.com/questions/1132788/what-is-a-ram-bank-how-is-it-defined
- book: DRAM Circuit Design: Fundamental and High-Speed Topics (IEEE Press Series on Microelectronic Systems)
- https://www.igorslab.de/en/ddr5-deep-dive-kingston-in-interview-about-new-memory-standard-and-examples-from-the-practice/2/
- long post explaining all timings: https://www.reddit.com/r/overclocking/comments/ahs5a2/demystifying_memory_overclocking_on_ryzen_oc/
- https://www.systemverilog.io/design/understanding-ddr4-timing-parameters/
- https://blog.cloudflare.com/ddr4-memory-organization-and-how-it-affects-memory-bandwidth/#:~:text=Multi%2Drank%20memory%20modules%20use,rank%20can%20start%20its%20transmission.
- https://medium.com/@mitali.soni04/decoding-dram-timings-part-i-4f97a5b90e82
- https://medium.com/@mitali.soni04/decoding-dram-timings-part-ii-9541e0f72aca
- https://www.forrestthewoods.com/blog/memory-bandwidth-napkin-math/
- https://www.youtube.com/watch?v=7J7X7aZvMXQ
- https://ieeexplore.ieee.org/document/6932587
- nice intro
2.3 Notes Link to heading
https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory#PREFETCH:
[DDR3] Row accesses might take 50 ns, depending on the speed of the DRAM, whereas column accesses off an open row are less than 10 ns.
DDR4 did not double the internal prefetch width again, but uses the same 8n prefetch as DDR3.[29] Thus, it will be necessary to interleave reads from several banks to keep the data bus busy.
https://api.drum.lib.umd.edu/server/api/core/bitstreams/5ec3f878-7a0d-4df8-a98f-1600b5c35e2b/content
- dual channel has 2 chips doing the exact same work, interleaving things at
the level of words: 64 bits to one and 64 bits to the other.
- Does this mean an effective RAM cache size of 1024 bits, given burst mode???
https://www.synopsys.com/articles/ddr4-bank-groups.html
- prefetch matches cacheline size
- DDR4: bank groups, so multiple reads can happen ‘in parallel’
- a prefetch happens within a bank group, but takes time
- 16n prefetch would be larger than a cacheline, not useful
- instead: 2 (or more) bank groups that work independently in parallel
- alternating groups is faster than consecutive reads in a group
- 4 cycles across groups; more within
- 1600 MHz gives up to 3200 MT/s when alternating, or 2133 MT/s within the same group (sanity check below)
- alternating between banks: can fully saturate the bandwidth with 4 internal cycles (8 transfers) between bursts
- on the same group, only as low as ~half the throughput
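A quick sanity check of those numbers (my own back-of-the-envelope; it assumes a burst of 8 transfers occupies 4 clock cycles, and a tCCD_L of 6 clock cycles, which the article does not state):

\[
\text{alternating groups: } \frac{8\ \text{transfers}}{4\ \text{cycles}} \cdot 1600\,\text{MHz} = 3200\,\text{MT/s},
\qquad
\text{same group: } \frac{8\ \text{transfers}}{6\ \text{cycles}} \cdot 1600\,\text{MHz} \approx 2133\,\text{MT/s}.
\]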
Dual-Channel Symmetric mode, also known as interleaved mode, provides maximum performance on real world applications. Addresses are ping-ponged between the channels after each cache line (64-byte boundary).
- out of all pending requests, efficient ones to be answered are chosen
https://en.wikipedia.org/wiki/Interleaved_memory
- row buffer typically has same size as OS memory page
https://ieeexplore.ieee.org/document/898056 (Zhang, Zhu, and Zhang 2000)
channel interleave and rank interleave can be changed in the BIOS?
https://ee.usc.edu/~redekopp/ee457/slides/EE457Unit7b_Interleaving.pdf
- bank: byte-size memory
- nice figure
https://www.systemverilog.io/design/ddr4-basics/
- good figures
- each bank has multiple (4 or 8) memory arrays
- page size 512/1024/2048 bytes per row (for x4/x8/x16 devices)
- 1024 columns always
- variable number of rows
- width cascading: 128bit words by having half from each chip
https://www.systemverilog.io/design/lpddr5-tutorial-physical-structure/
synchronous vs async ram:
- sync: common clock frequency
2.4 My own RAM Link to heading
laptop model:
- HMAA4GS6AJR8N-XN
- https://www.compuram.biz/memory_module/sk-hynix/hmaa4gs6ajr8n-xn.htm
- 3200 MT/s
- 1600 MHz
- 32GB (2Rx8): 2 ranks of x8 chips (8-bit data width per chip)
- 260pin
- 2 ranks
- number of dram: 16
- 2Gx8: 2GB per chip, within each rank?
- construction type: FBGA(78ball)
- unbuffered: no register between the memory controller and the DRAM chips (registered/buffered DIMMs add one for reliability)
- organization: 4Gx64
- number of DRAM: 16
- DRAM organization: 2Gx8
2.5 Continued notes Link to heading
jedec standard: https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/196/JESD79_2D00_4.pdf
page 15: 2Gx8 (16Gb chip) addressing:
- 4 bank groups (2bits)
- 4 banks inside a group (2bits)
- 17bit row ID => 128ki rows
- 10bit col ID => 1024 cols
- 1KB page size (#cols * 8-bit chip width)
=> total page size for 8 chips: 8KB.
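As a sanity check (my own arithmetic), these numbers multiply out to the capacity of my module:

\[
\underbrace{2^2}_{\text{bank groups}} \cdot \underbrace{2^2}_{\text{banks}} \cdot \underbrace{2^{17}}_{\text{rows}} \cdot \underbrace{2^{10}}_{\text{cols}} \cdot 8\,\text{bit} = 2^{34}\,\text{bit} = 2\,\text{GiB per chip},
\]

so 8 chips per rank give 16 GiB, and 2 ranks give the 32 GiB of the module.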
Possible experiment:
- read a[i], where i iterates over the integers with only a fixed subset of bits set, like a mask of 111101001110000. Trying all patterns, we can try to hit consecutive rows on the same bank. (A sketch of this is below.)
- multithreaded setting; always have a certain bit of the index at 0, the rest 1
- try to always hit the same bank
- try to always hit the same rank
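A minimal sketch of the subset-iteration idea (buffer size, mask, and timing are placeholders, not a tuned benchmark):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main() {
    // Read a[i] for every index i that is a subset of a fixed bit mask, so that
    // only the chosen address bits vary between accesses.
    constexpr size_t N = 1ULL << 30;          // 1 GiB buffer (placeholder size)
    uint8_t* a = static_cast<uint8_t*>(std::malloc(N));
    for (size_t i = 0; i < N; i++) a[i] = 1;  // touch to force page allocation

    const uint64_t mask = 0b111101001110000;  // example mask; pick the bits to test
    const uint64_t reads = 1ULL << __builtin_popcountll(mask);

    // NOTE: with a small mask like this all addresses fit in cache; to really
    // exercise DRAM rows the mask must include high (row) bits, or the lines
    // must be flushed (_mm_clflush) between iterations.
    uint64_t sum = 0;
    auto start = std::chrono::steady_clock::now();
    // Standard trick: enumerate all subsets of `mask`, in decreasing order.
    for (uint64_t i = mask;; i = (i - 1) & mask) {
        sum += *(volatile uint8_t*)(a + i);   // volatile: keep the load
        if (i == 0) break;
    }
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();

    std::printf("sum=%llu  %.1f ns/read\n", (unsigned long long)sum,
                double(ns) / double(reads));
    std::free(a);
}
```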
Side note:
- L1 prefetch needs fill buffers, but the L2 hardware prefetcher does not!
gemini: cache lines are alternated between dimms/channels.
https://arxiv.org/html/2512.18300v1 (Jattke et al. 2024)
For example, modern memory mappings, such as AMD-Zen [11], split a 4KB page such that it gets distributed across 32 banks (with only two lines of a page co-resident in the same bank). Such mappings exploit the available bank-level parallelism for reads, even at the expense of the row-buffer hit-rate.
- interesting indeed: reading from different banks is faster than consecutive reads from a single row then
- more generally, I guess cachelines are round-robin between banks then
https://os.itec.kit.edu/3617_3389.php (Hillenbrand 2017)
https://docs.amd.com/r/en-US/pg313-network-on-chip/DDR4-Component-Choice
For a 64-bit DDR4 interface, this means the controller can support a single x64 channel, or dual x32 channels.
i.e., there are cases where the two DIMMs do exactly the same work but on half the bits. Then additional banks (and bank groups) are available for parallelism.
with two memory controllers (same as channels or not?):
- memory can be interleaved at eg 1KB block size, or
- each controller is half the address space
2.6 Address mapping notation Link to heading
This page is really nice:
Example: 16R-2B-2BG-10C:
- ignore the low 3 bits, because we get 64b = 8B = 2^3 B back.
- 10 bits for the column, i.e., a linear piece of memory first fills a row.
- Each column is 64b = 8B across all chips, 1024 columns per row = 8KiB.
- 2 bits for the bank group, i.e., spread 4 consecutive rows over 4 distinct bank groups.
- 2 bits for the bank, i.e., then take a row in each bank of each group.
- 16 bits for the row, i.e., then use the other rows in each bank.
Consecutive reads within the same bank group (e.g. from the same row) cannot fully saturate the bus, though. So
instead, we can alternate bank groups after every 64B cache line (aka bank group optimization):
16R-2B-1BG-7C-1BG-3C: this moves one BG bit down, so that consecutive cache lines
interleave between two bank groups and we can fully saturate the bandwidth. This has 95%
efficiency, compared to 54% for the simple row-bank-column approach.
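To make the notation concrete, here is a small decode sketch (my own reading of the notation above, not code from the AMD docs):

```cpp
#include <cstdint>
#include <cstdio>

// Decoded DRAM coordinates; field order in the notation is MSB -> LSB.
struct Decoded {
    uint32_t row, bank, bank_group, column;
};

// 16R-2B-2BG-10C: row | bank | bank group | column, above 3 byte-offset bits.
Decoded decode_simple(uint64_t addr) {
    return {
        uint32_t(addr >> 17) & 0xFFFFu,  // 16 row bits
        uint32_t(addr >> 15) & 0x3u,     //  2 bank bits
        uint32_t(addr >> 13) & 0x3u,     //  2 bank-group bits
        uint32_t(addr >> 3) & 0x3FFu,    // 10 column bits (low 3 bits: byte in the 64-bit word)
    };
}

// 16R-2B-1BG-7C-1BG-3C: one bank-group bit moves down, so that consecutive
// 64-byte cache lines alternate between two bank groups.
Decoded decode_bg_optimized(uint64_t addr) {
    uint32_t col_lo = uint32_t(addr >> 3) & 0x7u;    // 3 low column bits
    uint32_t bg_lo  = uint32_t(addr >> 6) & 0x1u;    // interleaved bank-group bit
    uint32_t col_hi = uint32_t(addr >> 7) & 0x7Fu;   // 7 high column bits
    uint32_t bg_hi  = uint32_t(addr >> 14) & 0x1u;
    return {
        uint32_t(addr >> 17) & 0xFFFFu,
        uint32_t(addr >> 15) & 0x3u,
        (bg_hi << 1) | bg_lo,
        (col_hi << 3) | col_lo,
    };
}

int main() {
    for (uint64_t addr = 0; addr < 4 * 64; addr += 64) {  // 4 consecutive cache lines
        Decoded s = decode_simple(addr);
        Decoded o = decode_bg_optimized(addr);
        std::printf("addr %4llu  simple: bg %u col %4u   optimized: bg %u col %4u\n",
                    (unsigned long long)addr, s.bank_group, s.column, o.bank_group, o.column);
    }
}
```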
If addressing is random and transaction size is equal to the DRAM basic access unit, then no address mapping has an advantage. The DRAM controller will reorder transactions to hide page access overhead as much as possible. Expected efficiency is around 40%.
2.7 Intel spec Link to heading
Addresses are ping-ponged between the channels after each cache line (64-byte boundary).
5.1.5:
- just-in-time: most efficient request scheduled first, e.g., one that hits a different bank than the last instruction.
- command overlap: send stuff to other banks while one is doing slow activate/pre-charge/read/write
- out-of-order: batch requests to the same open page
2.8 Rank interleaving Link to heading
Multi-rank memory modules use a process called rank interleaving, where the ranks that are not accessed go through their refresh cycles in parallel
On the other hand, there is some I/O latency penalty with multi-rank memory modules, since memory controllers need additional clock cycles to move from one rank to another.
one row can remain active per bank
3 reMap: using Performance counters Link to heading
https://github.com/helchr/reMap
Helm, Akiyama, and Taura (2020) uses performance counters to read directly how many times each rank/bank group/bank is accessed.
It uses stuff from /sys/bus/event_source/devices/uncore_imc/
type is 14.
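A quick way to check this on another machine is to just read that sysfs file (minimal sketch; the path is the one mentioned above):

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Print the perf event "type" id of the uncore integrated-memory-controller
    // PMU, if it is exposed under this name. (14 on my laptop.)
    std::ifstream f("/sys/bus/event_source/devices/uncore_imc/type");
    std::string type;
    if (f >> type)
        std::cout << "uncore_imc perf type: " << type << "\n";
    else
        std::cout << "no uncore_imc PMU found at this path\n";
}
```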
Here https://stackoverflow.com/questions/64923795/using-linux-perf-tool-to-measure-the-amount-of-times-the-cpu-has-to-acccess-the I read that only Xeon/server CPUs have these CAS counters.
Performance counter monitor: https://github.com/intel/pcm
pcm-memory with per-bank stats unfortunately only works on Xeon.
4 Sudoku Link to heading
Really cool paper and software: “Sudoku: Decomposing Dram Address Mapping into Component Functions” (Wi et al. 2025). https://github.com/scale-snu/Sudoku
hugepages setup:
| |
speed:
| |
memory config:
| |
I also incremented max_bits_ by 2 in the Sudoku constructor. Not exactly sure why it’s needed.
We also pin the cpu frequency to 3.0GHz (slightly above the native 2.6GHz):
| |
4.1 Step 1: DRAM addressing functions Link to heading
i.e., the functions that decide which addresses map to the same bank
command:
| |
functions:
| |
4.2 Step 2: row/column bits Link to heading
| |
Result with masking: row_bits = 0xffff80000, column_bits = 0xdc0.
Note that some column bits are still missing, as 0xdc0 = 1101 1100 0000 in binary only has 5
instead of 7 bits set. (We are looking for 7 bits, because each row has \(2^{10}\)
64-bit words and thus \(2^7\) cache lines.)
4.3 Step 3: validation Link to heading
| |
| |
So real column should be 0x1FC0, exactly the same as in Table 2 in the paper for another intel DDR4 machine. (That only has 7 bits, because the last 3 column bits indicate 8 64bit words that always share a cacheline [I think].)
We get:
| |
4.4 Step 4: which function is what? Link to heading
Let’s first have a manual look at these functions:
| |
They use 36 bits, corresponding to \(2^{36}\) B = 64 GiB of data.
Also note that the low 6 bits are 0, because we have \(2^6=64\) byte cache lines.
Some [wrong] guesses. We already know the row and column bits, but the rest is speculation for now. Looking at the Sudoku paper, we need 4 functions for banks and bank groups, so I would speculate that’s the four very similar ones. Then we need 2 to select the DIMM and the rank within a channel (because each channel could have two 2-rank memory sticks), so I’d guess those are the two similar patterns that remain. Specifically, that would mean the highest bit determines the DIMM, of which I only have 1 in each channel, so it would be unused.
That leaves one ‘weird’/‘chaotic’ pattern for the channel, which also seems to match the Sudoku table (Wi et al. 2025):

4.5 Refreshes Link to heading
Now let’s have a look at the output of decompose_functions, which does
additional experiments to distinguish DIMMs/ranks/bank groups/banks.
| |
Specifically, it chooses two addresses that only differ by a given function and
then alternatingly reads them. DRAM needs regular ‘refreshing’, every 64ms or so, which for my
hardware takes tRFC1 = 350ns. This can be done either one bank at a time,
or for all banks simultaneously.
If we now choose 2 addresses and read them alternatingly, we will see one of two patterns. If they are in the same refresh group, we will see a single refresh delay every refresh cycle. If they are in different refresh groups, we will see two refresh patterns interleaved.
You might need to play a bit with the REGULAR_REFRESH_INTERVAL_THRESHOLD parameter.
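A rough sketch of this measurement idea (my reconstruction, not Sudoku’s actual code; the addresses and iteration count are placeholders):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>
#include <x86intrin.h>  // _mm_clflush, _mm_mfence, __rdtscp

// Alternately read two addresses a and b, flushing them each time so every
// access goes to DRAM, and record the latency of each access. Accesses that
// collide with a refresh show up as latency spikes; the spacing of the spikes
// tells us whether a and b share a refresh group.
std::vector<uint64_t> measure(volatile uint8_t* a, volatile uint8_t* b, int iters) {
    std::vector<uint64_t> lat;
    lat.reserve(2 * size_t(iters));
    unsigned aux;
    for (int i = 0; i < iters; i++) {
        for (volatile uint8_t* p : {a, b}) {
            _mm_clflush((const void*)p);
            _mm_mfence();
            uint64_t t0 = __rdtscp(&aux);
            (void)*p;  // the DRAM read
            uint64_t t1 = __rdtscp(&aux);
            lat.push_back(t1 - t0);
        }
    }
    return lat;  // post-process: find the spikes and the distance between them
}

int main() {
    // Toy usage: two addresses 64 KiB apart in an ordinary allocation.
    // (Sudoku instead picks physical addresses differing in exactly one function,
    //  which requires 1 GB hugepages so that the physical bits are known.)
    static uint8_t buf[1 << 17];
    auto lat = measure(buf, buf + (1 << 16), 100000);
    uint64_t mx = *std::max_element(lat.begin(), lat.end());
    std::printf("max latency: %llu cycles over %zu accesses\n",
                (unsigned long long)mx, lat.size());
}
```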
We get:
| |
These numbers mean the following: for 5 of the 6 functions, we consistently
observe a tREFI delay, which is the refresh interval. (I still have not found
a way to extract the actual value of this number from my hardware.)
The 0x110000 function on the other hand consistently shows a smaller interval
between refreshes, meaning that this function switches the data to a different refresh group.
Thus, I speculate that 0x110000 determines the DIMM, and all banks on each
DIMM (across both ranks) refresh at the same time.
See also Auto-refresh (AR) on wikipedia.
4.6 Consecutive Accesses Link to heading
Now we look at the time between consecutive accesses, tRDRD (time ReaD ReaD)
and specifically tCCD_S and tCCD_L, the time between consecutive reads in
different (S, short) and the same (L, long) bank groups, assuming row-buffer hits.
For this, the tool schedules 4 consecutive reads from each of two rows and lets
the memory controller execute them as fast as possible.
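Again, a rough sketch of the idea (my reconstruction, not Sudoku’s code; in the real experiment the two rows are chosen to differ by exactly the function under test):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>  // _mm_clflush, _mm_mfence, __rdtscp

// Flush 4 cache lines in each of two rows, then issue all 8 loads back-to-back
// (no data dependencies), so the memory controller can overlap them as it likes,
// and time the whole batch.
uint64_t time_batch(volatile uint8_t* row_a, volatile uint8_t* row_b) {
    for (int i = 0; i < 4; i++) {
        _mm_clflush((const void*)(row_a + 64 * i));
        _mm_clflush((const void*)(row_b + 64 * i));
    }
    _mm_mfence();
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    uint64_t sum = 0;
    for (int i = 0; i < 4; i++) {
        sum += row_a[64 * i];  // independent loads, can all be in flight at once
        sum += row_b[64 * i];
    }
    uint64_t t1 = __rdtscp(&aux);
    (void)sum;  // loads are volatile, so they are not optimized away anyway
    return t1 - t0;
}

int main() {
    // Toy usage with two offsets in a static buffer; repeat and take the minimum.
    static uint8_t buf[1 << 17];
    uint64_t best = ~0ULL;
    for (int rep = 0; rep < 1000; rep++)
        best = std::min(best, time_batch(buf, buf + (1 << 16)));
    std::printf("fastest batch of 8 reads: %llu cycles\n", (unsigned long long)best);
}
```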
We get this:
| |
Latency here represents the number of clock cycles to read a total of 8 values.
Based on the paper, alternating between channels should be fastest, followed by bank groups. Alternating between banks and ranks is slower.
My guesses are based on correlating these results with the plot from the Sudoku paper:

| function | refresh group | read-read | what |
|---|---|---|---|
| 000000000000000000000001111111000000 | | | col |
| 000000000000000000000100000010000000 | same | 347 | bank group / channel |
| 000000000000000001001011001100000000 | same | 346 | bank group / channel |
| 000000000000000010001000000000000000 | same | 345 | bank group / channel |
| 000000000000000100010000000000000000 | different | 400 | rank/channel? |
| 000000000000001000100000000000000000 | same | 396 | bank |
| 000000000000010001000000000000000000 | same | 397 | bank |
| 111111111111111110000000000000000000 | | | row |
Correlating with the row “Intel-A, 2 channels, 1 DIMM per channel” from the table above:
| function | what |
|---|---|
| 000000000000000000000001111111000000 | col |
| 000000000000000010000010011000000000 | channel |
| 000000000000000000000101010000000000 | group |
| 001001001000000010001000000000000000 | group |
| 000000000000000100010000000000000000 | rank |
| 010010010011001000100000000000000000 | bank |
| 100100100100110001000000000000000000 | bank |
| 111111111111111110000000000000000000 | row |
Things look quite similar and so my guess is that the order of things is indeed the same:
| function | refresh group | read-read | what |
|---|---|---|---|
| 000000000000000000000001111111000000 | | | col |
| 000000000000000000000100000010000000 | same?? | 347 | channel |
| 000000000000000001001011001100000000 | same | 346 | group |
| 000000000000000010001000000000000000 | same | 345 | group |
| 000000000000000100010000000000000000 | different | 400 | rank |
| 000000000000001000100000000000000000 | same | 396 | bank |
| 000000000000010001000000000000000000 | same | 397 | bank |
| 111111111111111110000000000000000000 | | | row |
5 Sudoku, now with only 1 DIMM Link to heading
5.1 setup Link to heading
- Allocate 20 1GB hugepages
- Add an entry to the table with 32 GB.
{DDRType::DDR4, 32ULL * GB, 8, 2, 2, 17, 10, 8 * 1024, 8},
5.2 1. reverse functions Link to heading
- run the same command as before, but now for 1 DIMM and 20 pages:
sudo ./reverse_functions -o out -p 17 -t ddr4 -n 1 -s 32 -r 2 -w 8 -d -v -l
Changes to make it work: Change timing to
| |
and use the median instead of average time.
This outputs:
| |
5.3 2. identify bits Link to heading
| |
with original max_bits or +1:
| |
The only issue here is that this only finds 16 row bits. Likely it doesn’t detect
the last one because I only allocated slightly over half of the memory. Either
way, we can assume that the real row bits should be 0x7fffc0000.
5.4 3. validate mapping Link to heading
| |
5.5 4. decompose functions Link to heading
| |
The situation is clearer now: the interleaved refresh intervals are across the two ranks. And then the 2 fast RDRD latencies are across groups, and the two slow ones across banks.
| |
New table:
| function | refresh group | read-read | what |
|---|---|---|---|
| 00000000000000000000001111111000000 | | | col |
| 00000000000000000000010000001000000 | same | 346 | group |
| 00000000000000001000100000000000000 | same | 342 | group |
| 00000000000000010001000000000000000 | different | 403 | rank |
| 00000000000000100010000000000000000 | same | 390 | bank |
| 00000000000001000100000000000000000 | same | 393 | bank |
| 11111111111111111000000000000000000 | | | row |
6 Final results Link to heading
| 1 DIMM (32 GB) | what | 2 DIMM (64 GB) | what |
|---|---|---|---|
| 00000000000000000000001111111 | col | 000000000000000000000001111111 | col |
| 00000000000000000000010000001 | group | 000000000000000000000100000010 | channel |
| - | - | 000000000000000001001011001100 | group |
| 00000000000000001000100000000 | group | 000000000000000010001000000000 | group |
| 00000000000000010001000000000 | rank | 000000000000000100010000000000 | rank |
| 00000000000000100010000000000 | bank | 000000000000001000100000000000 | bank |
| 00000000000001000100000000000 | bank | 000000000000010001000000000000 | bank |
| 11111111111111111000000000000 | row | 111111111111111110000000000000 | row |
Summary on how to interpret this:
- col indicates that the low 7 bits are used to select the column to read from.
- row indicates that the high 16 or 17 bits determine the row to read from.
- All other masks indicate that the xor of the indicated address bits is taken, and the parity determines the channel/rank/group/bank. Each rank has 4 groups containing 4 banks each, so these get 2 bits assigned to them. (See the sketch below.)
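A small sketch of evaluating such an XOR function on a physical address (the pairing of the two group masks into a 2-bit index is my own arbitrary choice):

```cpp
#include <cstdint>
#include <cstdio>

// Each non-row, non-column mask selects a set of physical address bits;
// the XOR (parity) of those bits gives one channel/rank/group/bank bit.
inline unsigned xor_bit(uint64_t phys_addr, uint64_t mask) {
    return __builtin_popcountll(phys_addr & mask) & 1u;
}

int main() {
    // Two of the group masks found above for the 2-DIMM configuration,
    // written with the low six cache-line-offset bits included.
    const uint64_t bg_a = 0b000000000000000001001011001100000000;
    const uint64_t bg_b = 0b000000000000000010001000000000000000;

    // Hypothetical physical address. (Inside a 1 GB hugepage the low 30 bits of
    // the virtual address equal the physical ones, which is how tools like
    // Sudoku can know them.)
    uint64_t addr = 0x2abc40;

    unsigned group = (xor_bit(addr, bg_b) << 1) | xor_bit(addr, bg_a);
    std::printf("address 0x%llx -> bank group %u\n", (unsigned long long)addr, group);
}
```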
In general these look very similar. Differences:
- With 2 DIMMs, pairs of cachelines go next to each other in the same row. With 1, they are scattered across groups.
- With 2 DIMMs, the lowest level scattering is changed to channels. The first group-scatter gets a somewhat random mask.
7 decode-dimms Link to heading
Here I’m logging the output of running some analysis tools.
decode-dimms, part of the i2c-tools package on Arch, is a nice tool to read
information from the memory module EEPROM.
But note that this is static information, not the current operating mode.
For me it prints e.g.:
| |
This confirms some information we already knew by now:
- 2 ranks per DIMM
- 10 column bits, 17 row bits, and 16 banks.
The question is how relevant the timing parameters are if these are not necessarily the current values.
7.1 Bank groups Link to heading
reading from different bank groups takes a minimum of 4 cycles (during which the 8 words of a burst are transmitted)
7.2 Refresh Link to heading
Every row must be refreshed once per 64ms; this is split into 8192 refresh commands, so there is on average 64ms/8192 = 7.8us between refresh commands.
Each command takes around 350ns = 4.5% of the time, blocking the entire rank.
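Written out:

\[
t_{\mathrm{REFI}} = \frac{64\,\text{ms}}{8192} \approx 7.8\,\mu\text{s}, \qquad
\frac{t_{\mathrm{RFC}}}{t_{\mathrm{REFI}}} = \frac{350\,\text{ns}}{7812.5\,\text{ns}} \approx 4.5\%.
\]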
7.3 Random access throughput Link to heading
Limitations:
- opening a single row in a bank
- tRCD = 13.750 ns to move row into amplifiers
- tRAS = 32.000 ns to move it back
- between reads in a group: 4.9ns
- between reads across groups: 2.5ns
- but either way: at most 4 reads on the rank every 21ns => 5.25 ns/read limit per rank (or DIMM?)
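If the 21 ns window is the four-activate window tFAW (my guess; the name does not appear in my notes above), the per-rank limit works out to

\[
\frac{t_{\mathrm{FAW}}}{4} = \frac{21\,\text{ns}}{4} = 5.25\,\text{ns per read},
\]

i.e., roughly 190 M random reads/s, or about 12 GB/s of 64-byte cache lines per rank.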
8 CPU benchmarks Link to heading
Lastly, some benchmarks, run with both 1 and 2 DIMMs.
8.1 cpu-benchmarks Link to heading
See /posts/cpu-benchmarks/ and github:RagnarGrootKoerkamp/cpu-benchmarks.
8.1.1 random access throughput 1 DIMM Link to heading
| |
| |
8.1.2 random access throughput 2 DIMM Link to heading
| |
| |
8.2 memory-read-experiment Link to heading
See github:feldroop/memory-read-experiment
8.2.1 strided reading 1 DIMM Link to heading
| |
8.2.2 strided reading 2 DIMM Link to heading
| |
9 Remaining questions Link to heading
- Why do we not see interleaved refresh intervals between reads from different DIMMs?
- Speculative answer: they auto-synchronize. As soon as 1 starts a refresh, the other becomes idle and will also refresh.
- Why is alternating reads between channels not faster than alternating between
groups?
- I guess because both are maximally fast?
- Why is alternating reads between ranks quite a bit slower?
- No ideas here; the rank-to-rank switching time should be only a single cycle or so.