
Just a sample of the Echomail archive

COMPLANC:


 Message 242,206 of 243,097 
 David Brown to Michael S 
 Re: _BitInt(N) (1/2) 
 28 Nov 25 12:45:58 
 
From: david.brown@hesbynett.no

On 28/11/2025 12:12, Michael S wrote:
> On Fri, 28 Nov 2025 09:46:56 +0100
> David Brown  wrote:
>
>> On 27/11/2025 23:15, Michael S wrote:
>>> On Thu, 27 Nov 2025 21:15:53 +0100
>>> David Brown  wrote:
>>>
>>>> On 27/11/2025 15:02, Michael S wrote:
>>>>> On Thu, 27 Nov 2025 14:02:38 +0100
>>>>> David Brown  wrote:
>>>>>
>>>>
>>>>>
>>>>> MSVC compilers compile your code and produce a correct result,
>>>>> but the code looks less nice:
>>>>> 0000000000000000 :
>>>>>       0:   f2 0f 11 44 24 08       movsd  %xmm0,0x8(%rsp)
>>>>>       6:   48 8b 44 24 08          mov    0x8(%rsp),%rax
>>>>>       b:   48 c1 e8 34             shr    $0x34,%rax
>>>>>       f:   25 ff 07 00 00          and    $0x7ff,%eax
>>>>>      14:   c3                      ret
>>>>>
>>>>> Although on old AMD processors it is likely faster than the nicer
>>>>> code generated by gcc and clang. On newer processors the gcc code
>>>>> is likely a bit better, but the difference is unlikely to be
>>>>> detected by simple measurements.
>>>>
>>>> I think it is unlikely that this version - moving from xmm0 to rax
>>>> via memory instead of directly - is faster on any processor.  But I
>>>> fully agree that it is unlikely to be a measurable difference in
>>>> practice.
>>>
>>> I wonder, how do you have the nerve "to think" about things that
>>> you have absolutely no idea about?
>>
>> I think about many things - and these are things I /do/ know about.
>> But I don't know all the details, and am happy to be corrected and
>> learn more.
>>
>>>
>>> Instead of "thinking" you could just as well open Optimization
>>> Reference manuals of AMD Bulldozer family or of Bobcat. Or to read
>>> Agner Fog's instruction tables. Move from XMM to GPR on these
>>> processors is very slow: 8 clocks on BD, 7 on BbC.
>>>
>>
>> Okay.  But storing data to memory from xmm0 is also going to be slow,
>> and loading it to rax from memory is going to be slow.  I am not an
>> expert at the x86 world or reading Fog's tables, but it looks to me
>> that on a Bulldozer, storing from xmm0 to memory has a latency of 6
>> cycles and reading the memory into rax has a latency of 4 cycles.
>> That adds up to more than the 8 cycles for the direct register
>> transfer, and I expect (but do not claim to know for sure!) that the
>> dependency limits the scope for pipeline overlap - decode and address
>> calculations can be done, but the data can't be fetched until the
>> previous store is complete.
>>
>> So all in all, my estimate was, I think, quite reasonable.  There may
>> be unusual circumstances on particular cores if the instruction
>> scheduling and pipelining, combined with the stack engine, make that
>> sequence faster than the single register move.
>>
>
> It seems you are correct in this particular case.
> Latency tables, especially those that are measured by software rather
> than supplied by the designer, are problematic in the case of moves
> between registers of different types, memory stores of all types, and
> even memory loads, with the exception of memory loads into a GPR.
> Agner explains why they are problematic in the preface to his tables.
> In short, there is no direct way to measure these things in isolation,
> so one has to measure the latency of a sequence of instructions and
> then apply either guesswork or the manufacturer's docs to somehow
> divide the combined latency into individual parts.
>

Well, if even Agner thinks it is difficult, then I don't feel bad for
having trouble!
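The measurement problem Michael describes can be sketched in C. The idea is a dependency chain: each iteration does GPR -> XMM -> GPR, so the timed loop yields only the *sum* of the two move latencies, never either one alone. This is an illustrative sketch, not code from the thread; it assumes a GCC/Clang toolchain on x86-64 (`__rdtsc` and an inline-asm barrier are compiler extensions):

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc and SSE2 intrinsics (GCC/Clang, x86-64) */

/* Approximate cycles per GPR->XMM->GPR round trip.  Because each
   iteration depends on the previous one, total time / iterations gives
   the combined latency of both moves -- there is no direct way to split
   it into the two individual latencies, which is Agner's point. */
static uint64_t roundtrip_cycles(int iters)
{
    uint64_t x = 1;
    uint64_t start = __rdtsc();
    for (int i = 0; i < iters; i++) {
        __m128i v = _mm_cvtsi64_si128((long long)x);  /* GPR -> XMM */
        x = (uint64_t)_mm_cvtsi128_si64(v);           /* XMM -> GPR */
        __asm__ volatile("" : "+r"(x));  /* barrier: keep the chain real */
    }
    uint64_t end = __rdtsc();
    return (end - start) / (uint64_t)iters;
}
```

Attributing the measured total to one move or the other then requires exactly the guesswork, or vendor documentation, described above.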

> So, the best way is to go by the recommendations of the vendor in the
> Opt. Reference Manual.
> There are no relevant recommendations for K8, unfortunately. I suspect
> that all methods are slow there.
> For Bobcat, there should be recommendations, but I don't have them and
> am too lazy to look for them.
>

Fair enough.  It is not information that is likely to be useful to
anyone here, so it's all for fun and interest.  I certainly wouldn't
want you to spend effort finding out the details just for me.

> For Family 10h (Barcelona and derivatives):
> "When moving data from a GPR to an MMX or XMM register, use separate
> store and load instructions to move the data first from the source
> register to a temporary location in memory and then from memory into
> the destination register, taking the memory latency into account when
> scheduling both stages of the load-store sequence.
>
> When moving data from an MMX or XMM register to a general-purpose
> register, use the MOVD instruction.
>
> Whenever possible, use loads and stores of the same data length. (See
> 5.3, "Store-to-Load Forwarding Restrictions" on page 74 for more
> information.)"

How much does advice like this take into account surrounding code?
That's what makes generating optimal code /really/ hard.  And it means
micro-optimising a short instruction sequence can be ineffective for
real-world code.  After all, no one is actually interested in minimising
the number of nanoseconds it takes to extract the exponent of a
floating-point number - the speed only matters if you are doing lots of
these, probably in a big loop with data moving into and out of memory
all the time.
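For reference, the function the thread's disassembly appears to correspond to can be sketched in portable C (a reconstruction from the shown instructions - shift by 52, mask with 0x7FF - not the original source; it assumes IEEE 754 binary64 doubles):

```c
#include <stdint.h>
#include <string.h>

/* Extract the raw 11-bit exponent field of an IEEE 754 binary64 value.
   memcpy is the well-defined way to reinterpret the bits; compilers
   lower it either to a direct XMM->GPR register move (gcc, clang) or
   to the store/load pair seen in the MSVC output quoted earlier. */
static unsigned get_exponent(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (unsigned)((bits >> 52) & 0x7FF);
}
```

With the usual bias of 1023, `get_exponent(1.0)` gives 1023 and `get_exponent(2.0)` gives 1024; the whole discussion is about how the compiler moves those bits from xmm0 to a GPR.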

This stuff was all /so/ much easier when we used PICs and AVRs...

>
> For Family 15h (Bulldozer and derivatives):
> "When moving data from a GPR to an XMM register, use separate store and
> load instructions to move the data first from the source register to a
> temporary location in memory and then from memory into the destination
> register, taking the memory latency into account when scheduling both
> stages of the load-store sequence.
>
> When moving data from an XMM register to a general-purpose register,
> use the VMOVD instruction.
>
> Whenever possible, use loads and stores of the same data length. (See
> 6.3, "Store-to-Load Forwarding Restrictions" on page 98 for more
> information.)"
>
> So, for both families, the vendor recommends a register move in the
> direction from SIMD to GPR and a store/load sequence in the direction
> from GPR to SIMD.
> The suspect point here is the specific mention of the VEX-encoded form
> (VMOVD) in the case of BD. It can mean that the "legacy" (SSE-encoded)
> form is slower, or it can mean nothing. I suspect the latter.
>
>> I've now had a short look at the relevant table from Fog's site.  My
>> conclusion from that is that the register move - though surprisingly
>> slow - is probably marginally faster than passing it through memory.
>> Perhaps if I spend enough time studying the details, I might find out

[continued in next message]

--- SoupGate-Win32 v1.05
 * Origin: you cannot sedate... all the things you hate (1:229/2)



(c) 1994,  bbs@darkrealms.ca