
COMPLANC:


 Message 242,604 of 243,097 
 BGB to All 
 Re: srand(0) 
 25 Dec 25 15:29:59 
 
From: cr88192@gmail.com

On 12/25/2025 1:31 PM, Lawrence D’Oliveiro wrote:
> On Thu, 25 Dec 2025 03:07:03 -0600, BGB wrote:
>
>> One entropy-mining process is to use "clock()" or similar and then
>> spin in a loop for a certain amount of time effectively building a
>> hash of the values returned by clock. The exact timing when the
>> values change will tend to carry a certain amount of entropy.
>
> The turbulence of the air/gas inside disk drives is apparently a good
> source of randomness.

Yeah, but one doesn't easily have access to this information.
Likewise for the low-order bits of CPU temperature sensors or
similar, etc.

For some of my targets, there is also no HDD (typically, everything runs
off of SD cards).


FWIW, in my own CPU design, there is actually a hardware RNG where
internal signals are basically gathered up and fed around the bus in a
special noise channel and used to continuously feed into a hardware RNG
for which a value can be read with a special CPU instruction.

But, alas, mainline CPUs lack such a feature.

On x86, it is also possible to get some level of entropy from mining
RDTSC, but this is non-portable.




But, yeah, tested out a few more RNG designs, and ATM:
	seed1 ^= ~(seed2>>47);	seed2 ^= ~(seed1>>43);  // 4 cycles
	seed1 ^=  (seed1<<13);	seed2 ^=  (seed2>>11);  // 4 cycles
	seed1 ^=  (seed1>>19);	seed2 ^=  (seed2<<17);  // 4 cycles
	val = ((seed1 ^ seed2) >> 32) & 0x7FFF;         // 6 cycles
Seems to be working pretty OK (decent randomness), and is moderately fast.

Add the cost of +4 cycles for LD (2c penalty) and +2 for ST.
Est cost: Around 24 clock cycles.


Though, breaking up the shifts and xors using temporaries could be used
to micro-optimize it a little more (vs trying to rely on compile-time
instruction shuffling).

Downside is that this particular approach (XOR'ing values with
themselves and modifying the original variable before the next step)
creates a lot of dependencies, which limits the potential ILP (can't
get ILP over 2 in this case).

Where the interleaved "seed1 = (seed1<<SH1) ^ (seed1>>SH2);" pattern
allows for slightly higher ILP, but seemingly gets less randomness per
shift (so the total dependency length ends up similar).

In the CPU in question, one would effectively need 4 seed values to get
better ILP, say:
	seed1 ^= ~(seed2>>47);	seed2 ^= ~(seed3>>43);
	seed3 ^= ~(seed4>>53);	seed4 ^= ~(seed1>>41);
           // ~ 4 cycles
	seed1 ^=  (seed1<<13);	seed2 ^=  (seed2>>11);
	seed3 ^=  (seed3<< 7);	seed4 ^=  (seed4>> 5);
	seed1 ^=  (seed1>>19);	seed2 ^=  (seed2<<17);
	seed3 ^=  (seed3>>23);	seed4 ^=  (seed4<<29);
           // ~ 6 cycles
	val = ((seed1 ^ seed2) >> 32) & 0x7FFF;  // 6 cycles
         //+ 8 cycles (Global LD/ST)
Est Cost: Around 24 clock cycles.

(24 cycle estimate assuming compiler shuffles instructions into a
roughly optimal order).


Though, with more operations, this is still likely to be slower despite
the higher ILP potential (the higher ILP would merely reduce the
slowdown from having more operations).

In this case, one ideally wants groups of 6 non-dependent ALU
operations on a 3-wide CPU with 2-cycle ALU latency. Achieving peak
ILP would actually require ~ 6 or 8 seed registers, which would
require using XG3. RISC-V would also suffer here, as RV64 only has 32
GPRs rather than 64, so trying to use 8 seeds in parallel would run
out of GPRs (and one doesn't want to spill, as this hurts a lot worse
than the lost ILP).

Where, if ILP is maximized, one can get to around 0.3 clock cycles per
operation.



But, on the target in question, 2 or 4 seeds is still likely to be
considerably faster than, say:
	seed = seed * 0x0000FC4BA2F7ACABULL + 1;

Which suffers from the "great evil" known as "slow 64-bit multiply"
(needs 68 cycles to do a 64-bit multiply, similar to performing a divide).

Where, the cost of the 64-bit integer multiply will stomp all over the
cycle cost of the former (well, and not like "rand()" tends to be used
all that heavily in any case).


Well, contrast to the naive lookup-table approach:
   index=(index+1)&255; //latency = 8 cycles (3 LD, 2 ADD, 2 AND, 1 ST)
   val=table[index];    //latency = 5 cycles (2 ADD, 3 LD).
So: ~ 13 cycles of latency.

So: it looks simple, but isn't great in this case (it is both weak,
and its latency isn't good enough to justify that weakness).


Then again:
   seed = seed * 0x0000FC4BA2F7ACABULL + 1;
     //75 cycles: 3 LD, 1 const, 68 MUL, 2 ADD, 1 ST
   ret = (seed >> 48) & 0x7FFF;  //4 cycles: 2 SHR, 2 AND
Cost: Around 80 cycles.


...

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)



(c) 1994,  bbs@darkrealms.ca