68kMLA Classic Interface
This is a version of the 68kMLA forums for viewing on your favorite old mac. Visitors on modern platforms may prefer the main site.
Testing a 6200 and comparison with 6100
| Posted by: zigzagjoe on 2026-02-22 11:59:26
You might want to peek at the assembly coming out of your PPC compiler. Perhaps it's unoptimized or unusually bad. Hand-written assembly probably would make more sense to sidestep compiler shenanigans and improve accuracy.
| Posted by: David Cook on 2026-02-22 12:03:14
What speed does the 040 L1 run at? 33 or 66MHz? (I assume 33, but sometimes worth asking the dumb questions).
According to me, 33 MHz. According to Apple and Motorola marketing department 66 MHz.
| Posted by: David Cook on 2026-02-22 12:27:59
Perhaps it's unoptimized or unusually bad.
Agreed. That wouldn't surprise me. That's my big caveat to all of this. I'm using a period-correct compiler (Metrowerks CodeWarrior 11 Gold) with pure C code that is not specifically tailored for a PowerPC processor. I am positive that if I wrote this code differently and chose 603 instruction ordering it could do better.
As you know, the cache tester is really simple. Its purpose is just to detect the existence of a cache at various steps. It doesn't exercise the cache with writes or random accesses. And, it is focused on data, not code.
| Posted by: Phipli on 2026-02-22 12:32:35
According to me, 33 MHz. According to Apple and Motorola marketing department 66 MHz.
Fair, just double checking given the performance difference.
| Posted by: David Cook on 2026-02-22 16:57:04
The performance portion of the code is an unrolled loop that copies 32 bytes per loop. So, basically, nothing else is as impactful on the result as this portion of code.
The addition operation is used to verify that memory is valid. The buffer has been preloaded with an incrementing value where the end sum is known. (This is a cache checker program.)
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
sum += *((unsigned long*)currentBufferPtr)++;
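For anyone who wants to try the same check on a modern compiler, here is a minimal sketch of the idea described above: preload a buffer with an incrementing value, sum it eight 32-bit words (32 bytes) at a time, and compare against the known total. The function names and the use of uint32_t are assumptions for illustration, not the original program. The cast-then-increment form in the lines above relies on a compiler extension, so this version uses a typed pointer instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill the buffer with 0, 1, 2, ... and return the
   sum that a correct read-back must produce. Unsigned arithmetic wraps
   modulo 2^32, so the expected value is well defined for any size. */
uint32_t fill_incrementing(uint32_t *buf, size_t words)
{
    uint32_t expected = 0;
    size_t i;

    for (i = 0; i < words; i++) {
        buf[i] = (uint32_t)i;
        expected += (uint32_t)i;
    }
    return expected;
}

/* Sum the buffer with the loop body unrolled eight times, so each
   iteration reads 32 bytes. The word count must be a multiple of 8. */
uint32_t sum_unrolled(const uint32_t *buf, size_t words)
{
    const uint32_t *p = buf;
    const uint32_t *end = buf + words;
    uint32_t sum = 0;

    assert(words % 8 == 0);
    while (p != end) {
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
    }
    return sum;
}
```

Calling fill_incrementing once and then comparing sum_unrolled against its return value gives the "is memory valid" check; timing only the sum_unrolled call would correspond to the performance portion.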
I've checked the PowerPC disassembly and it looks fine. There are two operations per C line on PPC, as opposed to a single 040 operation, which is to be expected on RISC vs CISC. However, someone with more expertise in PPC assembly might know of an optimization.
The disassembly is interesting in that the PPC code switches between loading one register and then another. I assume using multiple registers allows a performance gain where it can execute two or more operations (a read to one register using the load/store unit and an add to another register using the integer unit) in parallel. Cool.
00000098: 807C0000 lwz r3,0(r28)
0000009C: 841C0004 lwzu r0,4(r28)
000000A0: 7CC61A14 add r6,r6,r3
000000A4: 849C0004 lwzu r4,4(r28)
000000A8: 7CC60214 add r6,r6,r0
000000AC: 847C0004 lwzu r3,4(r28)
000000B0: 7CC62214 add r6,r6,r4
000000B4: 841C0004 lwzu r0,4(r28)
000000B8: 7CC61A14 add r6,r6,r3
000000BC: 849C0004 lwzu r4,4(r28)
000000C0: 7CC60214 add r6,r6,r0
000000C4: 847C0004 lwzu r3,4(r28)
000000C8: 841C0004 lwzu r0,4(r28)
000000CC: 7CC62214 add r6,r6,r4
000000D0: 7CC61A14 add r6,r6,r3
[two operations to prepare to loop and finally]
000000DC: 7CC60214 add r6,r6,r0
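One thing visible in the listing above is that every add targets r6, so the adds form a single dependency chain even though the loads rotate through r0, r3 and r4. A variation someone could try is splitting the total across two accumulators and combining them at the end. This is only a sketch: the function name is made up, and whether it actually helps on a 603 or changes what CodeWarrior emits is an untested assumption. The result is the same either way because unsigned addition wraps modulo 2^32 and is order-independent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of the unrolled sum: even-indexed words go to one
   accumulator and odd-indexed words to another, giving two independent
   add chains. The word count must be a multiple of 8. */
uint32_t sum_two_accumulators(const uint32_t *buf, size_t words)
{
    const uint32_t *p = buf;
    const uint32_t *end = buf + words;
    uint32_t sum_a = 0;
    uint32_t sum_b = 0;

    assert(words % 8 == 0);
    while (p != end) {
        sum_a += p[0];
        sum_b += p[1];
        sum_a += p[2];
        sum_b += p[3];
        sum_a += p[4];
        sum_b += p[5];
        sum_a += p[6];
        sum_b += p[7];
        p += 8;   /* advance 32 bytes per iteration */
    }
    return sum_a + sum_b;
}
```

Checking the disassembly again after a change like this would show whether the compiler really keeps two separate add chains.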
Here's the 040:
0000007E: D69A ADD.L (A2)+,D3
00000080: D69A ADD.L (A2)+,D3
00000082: D69A ADD.L (A2)+,D3
00000084: D69A ADD.L (A2)+,D3
00000086: D69A ADD.L (A2)+,D3
00000088: D69A ADD.L (A2)+,D3
0000008A: D69A ADD.L (A2)+,D3
0000008C: D69A ADD.L (A2)+,D3
- David
| Posted by: croissantking on 2026-02-25 04:28:08
What speed does the 040 L1 run at? 33 or 66MHz? (I assume 33, but sometimes worth asking the dumb questions).
Ask @Melkhior