Did any compiler fully use 80-bit floating point?
There is a paradox about floating point that I'm trying to understand.
Floating point is an eternal struggle with the problem that real numbers are both essential and incomputable. It's the best solution we have for most calculations involving physical quantities, but it has the perennial problems of limited precision and range; many volumes have been written about how to deal with these problems, even down to hardware engineers getting headaches implementing support for subnormal numbers, only for programmers to promptly turn that support off because it kills performance on workloads where many numbers iterate to zero. (The usual reference here is the introductory document What every computer scientist should know about floating-point arithmetic, but for a more in-depth discussion it's worth reading the writings of William Kahan, one of the world's foremost experts on the topic and a very clear writer.)
The usual standard for floating point where substantial precision is required is IEEE-754 double precision, 64 bits. It's the best most hardware provides; doing even slightly better typically requires switching to a software solution for a dramatic slowdown.
The x87 went one better and provided extended precision, 80 bits. A Google search finds many articles about this, and almost all of them lament the problem that when compilers spill temporaries from registers to memory, they round to 64 bits, so the exact results vary quasi-randomly depending on the behavior of the optimizer, which admittedly is a real problem.
The obvious solution is for the in-memory format to also be 80 bits, so that you get both extended precision and consistency. But I have never encountered any mention of this being done. It's moot now that one uses SSE2, which doesn't provide extended precision, but I would expect it to have been used in the days when x87 was the only available floating-point instruction set.
The paradox is this: on the one hand, there is much discussion of limited precision being a big problem. On the other hand, Intel provided a solution with an extra eleven bits of significand precision and four bits of exponent, which would cost very little performance to use (since the hardware implemented it whether you used it or not), and yet everyone seemed to behave as though this had no value, and to positively celebrate the move to SSE2, where extended precision is no longer available.
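(For concreteness, here is a minimal C sketch, assuming an x87 target, that prints those format parameters from <float.h>: 53 significand bits and max exponent 1024 for double, versus 64 and 16384 for the 80-bit extended type.)

```c
#include <stdio.h>
#include <float.h>

int main(void)
{
    /* On an x87 target, long double is the 80-bit extended format:
       64 significand bits vs. 53, and a 15-bit exponent vs. 11 bits. */
    printf("double:      %d significand bits, max exponent %d\n",
           DBL_MANT_DIG, DBL_MAX_EXP);
    printf("long double: %d significand bits, max exponent %d\n",
           LDBL_MANT_DIG, LDBL_MAX_EXP);
    return 0;
}
```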
So my question is:
Did any compilers ever make full use of extended precision (i.e. 80 bits in memory as well as in registers)? If not, why not?
history compilers floating-point
edited Apr 19 at 9:59 by unautre
asked Apr 19 at 5:19 by rwallace
Comments are not for extended discussion; this conversation has been moved to chat.
– Chenmunka♦
yesterday
Can the title be edited to make it clear we're talking about a particular 80-bit implementation? Would 'x87' or 'Intel' be the best word to add?
– another-dave
10 hours ago
5 Answers
Yes. For example, the C math library has had full support for long double, which on x87 was 80 bits wide, since C99. Previous versions of the standard library supported only the double type. Conforming C and C++ compilers also perform long double math if you give the operations a long double argument. (Recall that, in C, 1.0/3.0 divides a double by another double, producing a double-precision result; to get long double precision, you would write 1.0L/3.0L.)
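As a minimal sketch of that distinction, assuming an x87 target where long double is the 80-bit type:

```c
#include <stdio.h>

int main(void)
{
    double d       = 1.0 / 3.0;     /* evaluated and stored as 64-bit double  */
    long double ld = 1.0L / 3.0L;   /* L suffix: evaluated as 80-bit extended */

    /* The extra 11 significand bits show up as extra correct digits. */
    printf("%.20f\n",  d);
    printf("%.20Lf\n", ld);
    return 0;
}
```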
GCC, in particular, even has options such as -ffloat-store to turn off computing intermediate results to a higher precision than a double is supposed to have. That is, on some architectures, the fastest way to perform some operations on double arguments is to use extra precision, but that might produce a non-portable result, so GCC has an option to always round double intermediate values off.
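A hedged sketch of the kind of inconsistency that option exists to suppress; the outcome depends on the compiler, the optimization level, and whether code-gen uses x87 or SSE2:

```c
#include <stdio.h>

volatile double three = 3.0;   /* volatile defeats constant folding */

int main(void)
{
    double d = 1.0 / three;    /* if d is stored to memory, it is rounded to 64-bit double */
    if (d != 1.0 / three)      /* the recomputed quotient may still be an 80-bit temporary */
        puts("excess x87 precision is visible");
    else
        puts("both sides rounded to double (e.g. with -ffloat-store or SSE2)");
    return 0;
}
```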
Testing with godbolt.org, GCC, Clang and ICC in x87 mode all perform 80-bit computations and memory stores with long double variables, except that they will optimize constants such as 0.5L to double when that will save memory at no loss of precision. MSVC 2017, however, only supports 64-bit long double.
Although you asked specifically about x87, the 68K architecture also had 80-bit FP hardware, and GCC made long double 80 bits wide on that target.
Fortran 95 finally provided a reasonably-portable way to specify a type with at least the precision of an 80-bit float (with kind and SELECTED_REAL_KIND()). These might give you double-double or 128-bit math on other implementations. Even before then, some Fortran compilers provided extensions such as REAL*10. Ada was another language that allowed the programmer to specify a minimum number of DIGITS of precision. There were other compilers that supported 80-bit math to some degree as well. For example, Turbo Pascal had an extended type, although its math library supported only real arguments.
Another possible example is Haskell, which provided both exact Rational types and arbitrary-precision floating point through Data.Number.CReal. So far as I know, no implementation used the x87 80-bit hardware, but it might still be an answer to your question.
Comments are not for extended discussion; this conversation has been moved to chat.
– Chenmunka♦
yesterday
TL:DR: no, none of the major C compilers had an option to force promoting double locals/temporaries to 80-bit even across spill/reload; they only kept them as 80-bit when it was convenient to keep them in registers anyway.
Bruce Dawson's Intermediate Floating-Point Precision article is essential reading if you're wondering about whether extra precision for temporaries is helpful or harmful. He has examples that demonstrate both, and links to articles that conclude one way and the other.
Also very importantly, he has lots of specific details about what Visual Studio / MSVC actually does, and what gcc actually does, with x87 and with SSE/SSE2. Fun fact: MSVC before VS2012 used double for float * float even when using SSE/SSE2 instructions! (Presumably to match the numerical behaviour of x87 with its precision set to 53-bit significand, which is what MSVC does without SSE/SSE2.)
His whole series of FP articles is excellent; index in this one.
that would cost very little performance to use (since the hardware implemented it whether you used it or not)
This is an overstatement. Working with 80-bit long double in x87 registers has zero extra cost, but 80-bit memory operands are definitely second-class citizens in both ISA design and performance. Most x87 code involves a significant amount of loading and storing; something like Mandelbrot iterations is a rare exception at the upper end of computational intensity. Some round constants can be stored as float without precision loss, but runtime variables usually can't make any assumptions.
Compilers that always promoted temporaries / local variables to 80-bit, even when they needed to be spilled/reloaded, would create slower code (as @Davislor's answer seems to suggest would have been an option for gcc to implement). See below about when compilers actually round C double temporaries and locals to IEEE binary64: any time they store/reload.
- 80-bit REAL10 / long double can't be a memory operand for fadd / fsub / fmul / fdiv / etc. Those only support 32- or 64-bit float/double memory operands.
- So to work with an 80-bit value from memory, you need an extra fld instruction. (Unless you want it in a register separately anyway, then the separate fld isn't "extra".) On P5 Pentium, memory operands for instructions like fadd have no extra cost, so if you already had to spill a value earlier, adding it from memory is efficient for float / double.
- And you need an extra x87 stack register to load it into. fadd st5, qword [mem] isn't available (the only memory-source form uses the top of the register stack st0 as an implicit destination), so memory operands didn't help much to avoid fxch; but if you were close to filling up all 8 st0..7 stack slots, then having to load might require you to spill something else.
- fst, to store st0 to memory without popping the x87 stack, is only available for m32 / m64 operands (IEEE binary32 float / IEEE binary64 double). fstp m32/m64/m80, to store and pop, is used more often, but there are some use-cases where you want to store and keep using a value, like in a computation where one result is also part of a later expression, or an array calc where x[i] depends on x[i-1].
- If you want to store an 80-bit long double, fstp is your only option. You might need to use fld st0 to duplicate it, then fstp to pop that copy off. (You can fld / fstp with a register operand instead of memory, as well as fxch to swap a register to the top of the stack.)
80-bit FP load/store is significantly slower than 32-bit or 64-bit, and not (just) because of larger cache footprint. On original Pentium, it's close to what you might expect from 32/64-bit load/store being a single cache access, vs. 80-bit taking 2 accesses (presumably 64 + 16 bit), but on later CPUs it's even worse.
Some perf numbers from Agner Fog's instruction tables for some 32-bit-only CPUs that were relevant in the era before SSE2 and x86-64. I don't have 486 numbers; Agner Fog only covers Pentium and earlier, and http://instlatx64.atw.hu/ only has CPUID from a 486, not instruction latencies. And its PPro / PIII latency/throughput numbers don't cover fld/fstp. It does show fsqrt and fdiv performance being slower for full 80-bit precision, though.
P5 Pentium (in-order pipelined dual issue superscalar):
- fld m32/m64 (load float/double into 80-bit x87 ST0): 1 cycle, pairable with fxch.
- fld m80: 3 cycles, not pairable, and (unlike fadd / fmul, which are pipelined) not overlapable with later FP or integer instructions.
- fst(p) m32/m64 (round 80-bit ST0 to float/double and store): 2 cycles, not pairable or overlapable.
- fstp m80 (note: only available in the pop version that frees the x87 register): 3 cycles, not pairable.
P6 Pentium Pro / Pentium II / Pentium III. (out-of-order 3-wide superscalar, decodes to 1 or more RISC-like micro-ops that can be scheduled independently)
(Agner Fog doesn't have useful latency numbers for FP load/store on this uarch.)
- fld m32/m64 is 1 uop for the load port.
- fld m80: 4 uops total: 2 ALU p0, 2 load port.
- fst(p) m32/m64: 2 uops (store-address + store-data, not micro-fused because that only existed on P-M and later).
- fstp m80: 6 uops total: 2 ALU p0, 2x store-address, 2x store-data. I guess the ALU uops extract 64-bit and 16-bit chunks as inputs for the 2 stores.
Multi-uop instructions can only be decoded by the "complex" decoder on Intel CPUs (while simple instructions can decode in parallel, in patterns like 1-1-1 up to 4-1-1), so a 4-uop fld m80 can lead to the previous cycle only producing 1 uop in the worst case. 6 uops for fstp m80 is more than 4, so decoding it requires the microcode sequencer. These decode bottlenecks could lead to bubbles in the front-end, as well as / instead of possible back-end bottlenecks. (P6-family CPUs, especially later ones with better back-end throughput, can bottleneck on instruction fetch/decode in the front-end if you aren't careful; see Agner Fog's microarch pdf. Keeping the issue/rename stage fed with 3 uops / clock can be hard, or 4 on Core2 and later.)
Agner doesn't have latencies or throughputs for FP loads/stores on original P6 (the "1 cycle" latency in a couple of columns appears bogus). But it's probably similar to later CPUs, where m80 has worse throughput than you'd expect from the uop counts / ports.
- Pentium-M: 1 per 3 cycles throughput for fstp m80 (6 uops), vs. 1 uop at 1 per clock for fst(p) m32/m64, with micro-fusion of the store-address and store-data uops into a single fused-domain uop that can decode in any slot on the simple decoders.
- Core 2 (Merom) / Nehalem: fld m80: 1 per 3 cycles (4 uops). fstp m80: 1 per 5 cycles (7 uops: 3 ALU + 2x each store-address and store-data). Agner's latency numbers show 1 extra cycle for both load and store.
- Pentium 4 (pre-Prescott): fld m80: 3+4 uops, 1 per 6 cycles, vs. 1 uop pipelined. fstp m80: 3+8 uops, 1 per 8 cycles, vs. 2+0 uops with 2 to 3c throughput. Prescott is similar.
- Skylake: fld m80: 1 per 2 cycles (4 uops), vs. 1 per 0.5 cycles for m32/m64. fstp m80: still 7 uops, 1 per 5 cycles, vs. 1 per clock for normal stores.
AMD K7/K8:
- fld m80: 7 m-ops, 1 per 4-cycle throughput (vs. 1 per 0.5c for the 1 m-op fld m32/m64).
- fstp m80: 10 m-ops, 1 per 5-cycle throughput (vs. 1 m-op, fully pipelined, for fst(p) m32/m64). The latency penalty on these is much higher than on Intel, e.g. 16-cycle m80 loads vs. 4-cycle m32/m64.

AMD Bulldozer:
- fld m80: 8 ops / 14c latency / 4c throughput (vs. 1 op / 8c latency / 1c throughput for m32/m64). Interesting that even regular float / double x87 loads have half the throughput of SSE2 / AVX loads.
- fstp m80: 13 ops / 9c latency / 20c throughput (vs. 1 op / 8c latency / 1c throughput). Piledriver/Steamroller are similar; that catastrophic store throughput of one per 20 or 19 cycles is real.
(Bulldozer-family's high load/store latencies for regular m32/m64 operands are related to having a "cluster" of 2 weak integer cores sharing a single FPU/SIMD unit. Ryzen abandoned this in favour of SMT in the style of Intel's Hyperthreading.)
There's definitely a chicken/egg effect here; if compilers did make code that regularly used stored/reloaded 80-bit temporaries in memory, CPU designers would spend some more transistors to make it more efficient at least on later CPUs. Maybe doing a single 16-byte unaligned cache access when possible, and grabbing the required 10 bytes from that.
Fun fact: fld m32/m64 can raise / flag an FP exception (#IA) if the source operand is SNaN, but Intel's manual says this can't happen if the source operand is in double extended-precision floating-point format. So fld m80 can just stuff the bits into an x87 register without looking at them, unlike fld m32/m64, which has to expand the significand/exponent fields.
So ironically, on recent CPUs where the main use-case for x87 is for 80-bit, 80-bit float support is relatively even worse than on older CPUs. Obviously CPU designers don't put much weight on that and assume it's mostly used by old 32-bit binaries.
x87 and MMX are de-prioritized, though. E.g. Haswell made fxch a 2-uop instruction, up from 1 in previous uarches. (Still 0 latency using register renaming, though. See Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? for some thoughts on that and fxch.) And fmul / fadd throughputs are only 1 per clock on Skylake, vs. 2 per clock for SSE/AVX vector or scalar add/mul/fma. On Skylake even some MMX integer SIMD instructions run on fewer execution ports than their XMM equivalents.
(If you're looking at the tables yourself, fbld and fbstp m80bcd are insanely slow because they convert from/to BCD, thus requiring conversion from binary to decimal with division by 10. Never mind those; they're always microcoded.)
yet everyone seemed to behave as though this had no value, and to positively celebrate the move to SSE2 where extended precision is no longer available.
No, what people celebrated was that FP became more deterministic. When and where you got 80-bit temporaries depended on compiler optimization decisions. You still can't compile most code on different platforms and get bitwise-identical results, but 80-bit x87 was one major source of difference between x86 and some other platforms.
Some people (e.g. writing unit tests) would rather have the same numbers everywhere than have more accurate results on x86. Often double is more than enough, and/or the benefit was relatively small. In other cases, not so much, and extra temporary precision might help significantly.
Deterministic FP is a hard problem, but sought after by people for various reasons. e.g. trying to make multi-player games that don't need to send the whole state of the world over the network every simulation step, but instead can have everyone's simulation run in lockstep without drifting out of sync.
- https://stackoverflow.com/questions/328622/how-deterministic-is-floating-point-inaccuracy
- https://stackoverflow.com/questions/27149894/does-any-floating-point-intensive-code-produce-bit-exact-results-in-any-x86-base
x87 (thus C FLT_EVAL_METHOD == 2) isn't the only thing that was / is problematic. C compilers that can contract x*y + z into fma(x,y,z) also avoid that intermediate rounding step.
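For example, a small sketch; whether the first function actually gets contracted depends on the compiler, the target's FMA support, and flags such as -ffp-contract:

```c
#include <math.h>

/* May be contracted into a single fused multiply-add, skipping the
   intermediate rounding of x*y, if the compiler is allowed to do so. */
double maybe_contracted(double x, double y, double z)
{
    return x * y + z;
}

/* Requests the fused operation explicitly: one rounding, by definition. */
double always_fused(double x, double y, double z)
{
    return fma(x, y, z);
}
```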
For algorithms that didn't try to account for rounding at all, increased temporary precision usually only helped. But numerical techniques like Kahan summation that compensate for FP rounding errors can be defeated by extra temporary precision. So yes, there are definitely people that are happy that extra temporary precision went away, so their code works the way they designed it on more compilers.
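For reference, a minimal Kahan summation sketch; the compensation step only works as designed if each intermediate really is rounded to double, which excess temporary precision (or FMA contraction) can silently change:

```c
/* Compensated (Kahan) summation: c accumulates the low-order bits lost by
   each addition, assuming every intermediate is rounded to double.  If the
   compiler keeps y, t, or c at 80-bit precision, the computed correction no
   longer matches the rounding error that actually occurred. */
double kahan_sum(const double *x, int n)
{
    double sum = 0.0, c = 0.0;
    for (int i = 0; i < n; i++) {
        double y = x[i] - c;
        double t = sum + y;
        c = (t - sum) - y;   /* the rounding error of sum + y, if rounded to double */
        sum = t;
    }
    return sum;
}
```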
When do compilers round:
Any time they need to pass a double to a non-inline function, obviously they store it in memory as a double. (32-bit calling conventions pass FP args on the stack, not in x87 registers, unfortunately. They do return FP values in st0. I think some more recent 32-bit conventions on Windows use XMM registers for FP pass/return like in 64-bit mode. Other OSes care less about 32-bit code and still just use the inefficient i386 System V ABI, which is stack args all the way, even for integer.)
So you can use sinl(x) instead of sin(x) to call the long double version of the library function. But all your other variables and internal temporaries get rounded to their declared precision (normally double or float) around that function call, because the whole x87 stack is call-clobbered.
When compilers spill/reload variables and optimization-created temporaries, they do so with the precision of the C variable. So unless you actually declared long double a,b,c, your double a,b,c all get rounded to double when you do x = sinl(y). That's somewhat predictable.
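A small sketch of that rule (hypothetical function; the point is only that the declared type governs any spill):

```c
#include <math.h>

double example(double y)
{
    double a      = sinl(y);   /* assignment converts the 80-bit result to double  */
    long double b = sinl(y);   /* declared long double, so a spill/reload of b
                                  around another call keeps all 80 bits            */
    return (double)(a + b);    /* mixed arithmetic is done in long double, then
                                  rounded once at the final conversion             */
}
```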
But even less predictable is when the compiler decides to spill something because it's running out of registers, or when you compile with/without optimization. gcc -ffloat-store does this store/reload of variables to their declared precision between statements even when optimization is enabled (not for temporaries within the evaluation of one expression). So for FP variables it's kind of like debug-mode code-gen, where vars are treated similarly to volatile.
But of course this is crippling for performance unless your code is bottlenecked on something like cache misses for an array.
Extended precision long double is still an option
(Obviously long double will prevent auto-vectorization, so only use it if you need it when writing modern code.)
Nobody was celebrating removing the possibility of extended precision, because that didn't happen (except with MSVC which didn't give access to it even for 32-bit code where SSE wasn't part of the standard calling convention).
Extended precision is rarely used, and not supported by MSVC, but on other compilers targeting x86 and x86-64, long double is the 80-bit x87 type. Apparently even when compiling for Windows, gcc and clang use 80-bit long double.
Beware that long double is an ABI difference between MSVC and other x86 compilers. Usually gcc and clang are careful to match the calling convention, type widths, and struct layout rules of the platform, but they chose to make long double a 10-byte type despite MSVC making it the same as 8-byte double.
GCC has a -mlong-double-64/80/128 x86 option to set the width of long double, and the docs warn that it changes the ABI.
ICC has a /Qlong-double option that makes long double an 80-bit type even on Windows.
So functions that interact with any kind of long double are not ABI-compatible between MSVC and other compilers (except GCC or ICC with special options); they're expecting a different-sized object, so not even a single long double works, except as a function return value in st0, where it's in a register already.
If you need more precision than IEEE binary64 double, your options include so-called double-double (using a pair of double values to get twice the significand width but the same exponent range) or taking advantage of x87 80-bit hardware. If 80 bits are enough, it's a useful option: it gives you extra range as well as significand precision, and it only requires one instruction per computation.
(On CPUs with AVX, especially with AVX2 + FMA, for some loops double-double might outperform x87, being able to compute 4x double in parallel. E.g. https://stackoverflow.com/questions/30573443/optimize-for-fast-multiplication-but-slow-addition-fma-and-doubledouble shows that double * double => double_double (53x53 => 106-bit significand) multiplication can be as simple as high = a * b; low = fma(a, b, -high); and Haswell/Skylake can do that for 4 elements at once in 2 instructions (with 2-per-clock throughput for FP mul/FMA). But with double_double inputs, it's obviously less cheap.)
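As a sketch of that double-double building block (illustrative names, not from any particular library; the fma() trick is exact given real FMA hardware or a correct software fma):

```c
#include <math.h>

typedef struct { double hi, lo; } ddouble;   /* value is hi + lo, with lo holding the low-order part */

/* Exact product of two doubles as a double-double: one multiply plus one FMA.
   hi is the correctly rounded product; lo is the exact rounding error. */
static ddouble two_prod(double a, double b)
{
    ddouble r;
    r.hi = a * b;
    r.lo = fma(a, b, -r.hi);
    return r;
}
```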
Further fun facts:
The x87 FPU has precision-control bits that let you set how results in registers are rounded after any/every computation and load:
- to 80-bit long double: 64-bit significand precision. The finit default, and the normal setting except with MSVC.
- to 64-bit double: 53-bit significand precision. 32-bit MSVC sets this.
- to 24-bit float: 24-bit significand precision.
Apparently Visual C++'s CRT startup code (that calls main) reduces x87 precision from 64-bit significand down to 53-bit (64-bit double). Apparently x86 (32-bit) VS2012 and later still does this, if I'm reading Bruce Dawson's article correctly.
So as well as not having an 80-bit FP type, 32-bit MSVC changes the FPU setting so that even if you used hand-written asm, you'd still only have 53-bit significand precision, with only the wider range from having more exponent bits. (fstp m80 would still store in the same format, but the low 11 bits of the significand would always be zero. And I guess loading would have to round to nearest. Supporting this stuff might be why fld decodes to multiple ALU uops on modern CPUs.)
I don't know if the motivation was to speed up fdiv and fsqrt (which it does for inputs that don't have a lot of trailing zeros in the significand), or if it's to avoid extra temporary precision. But it has the huge downside that it makes using extended precision impossible (or useless). It's interesting that GNU/Linux and MSVC made opposite decisions here.
Apparently the D3D9 library init function sets x87 precision to 24-bit significand single-precision float, making everything less precise for a speed gain on fdiv/fsqrt (and maybe fcos / fsin and other slow microcoded instructions, too). But x87 precision settings are per-thread, so it matters which thread you call the init function from! (The x87 control word is part of the architectural state that context switches save/restore.)
Of course you can set it back to 64-bit significand with _controlfp_s, so you could usefully use asm, or call a function using long double compiled by GCC, clang, or ICC. But beware the ABI differences: you can only pass such a function inputs as float, double, or integer, because MSVC won't ever create objects in memory in the 80-bit x87 format.
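A hedged sketch of doing that, using MSVC's CRT _controlfp_s API mentioned above; remember the setting is per-thread:

```c
/* MSVC only: raise the x87 precision-control field from the CRT's 53-bit
   default back to the full 64-bit significand for the current thread. */
#include <float.h>

void enable_full_x87_precision(void)
{
    unsigned int prev;
    _controlfp_s(&prev, _PC_64, _MCW_PC);
    /* ... now call asm, or long double code built with GCC/clang/ICC ... */
}
```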
In most of the situations where 80-bit values should be used, it would be possible to keep them in registers. Spilling an 80-bit value and later reloading it would cost more than doing likewise with a 64-bit value, but if 80-bit register spills would be a meaningful factor in performance, 64-bit register spills would have an adverse effect anyway.
– supercat
Apr 19 at 21:30
@supercat: right, but if you want deterministic 80-bit, you have to use it at least for some local vars, so when they have to be spilled across other function calls, like for sin(x) + sin(y) + sin(x*y) or whatever, the spill/reload will be in 80-bit precision. And this is why compilers don't promote locals and temporaries to 80-bit by default. (An option to do that would be possible, but GCC (still) doesn't have one. See the discussion on Davislor's answer about a quote from the g77 2.95 manual about the possibility.)
– Peter Cordes
Apr 19 at 21:53
If an ABI has a good blend of caller-saved and callee-saved registers, most calls to leaf functions shouldn't end up requiring register spills. Entering and returning from a non-leaf function will likely require a spill/restore for each local variable, but the total execution time for most non-leaf functions will be long enough that even if 80-bit loads and stores cost twice as much as 64-bit ones, that wouldn't meaningfully affect overall performance.
– supercat
Apr 19 at 22:07
@supercat: As I mentioned in this answer, the x87 stack is always call-clobbered in all calling conventions. (And it must be empty on call, and empty or holding a return value on ret.) Due to its nature, there's no sane way to make any of it call-preserved; it's a stack, and pushing a new value when the slot is already in use gives you a NaN-indefinite. You could in theory make a calling convention that stored the status word and figured out how many regs were in use, so it knew how many slow 80-bit fstps to use on entry, and how many slow 80-bit flds to do before returning, but yuck.
– Peter Cordes
Apr 19 at 22:12
The option in GCC is -mlong-double-64/80/128. There's also a warning under them saying that if you override the default value for your target ABI, this changes the size of structures and arrays containing long double variables, as well as modifying the function calling convention for functions taking long double. Hence they are not binary-compatible with code compiled without that switch.
– phuclv
Apr 20 at 4:19
I worked for Borland back in the days of the 8086/8087. Back then, both Turbo C and Microsoft C defined long double as an 80-bit type, matching the layout of Intel's 80-bit floating-point type. Some years later, when Microsoft got cross-hardware religion (maybe at the same time as they released Windows NT?), they changed their compiler to make long double a 64-bit type. To the best of my recollection, Borland continued to use 80 bits.
Microsoft x86 16-bit tool sets support 80-bit long doubles. This was dropped in their x86 32-bit and x86 64-bit tool sets. Win32s for Windows 3.1 was released about the same time as NT 3.1. I'm not sure if Windows 3.1 winmem32 was released before or after Win32s.
– rcgldr
Apr 19 at 13:46
C89 botched "long double" by violating the fundamental principle that all floating-point values passed to non-prototyped functions get converted to a common type. If it had specified that values oflong double
get converted todouble
except when wrapped using a special macro which would pass them in astruct __wrapped_long_double
, that would have avoided the need for something likeprintf("%10.4f", x*y)
; to care about the types of bothx
andy
[since the value isn't wrapped, the value would get passed todouble
regardless of the types ofx
andy
].
– supercat
Apr 19 at 21:37
IIRC Delphi 5 (and probably also 3,4,6, and 7) had the "Extended" type which used all 80 bits of the FPU registers. The generic "Real" type could be made an alias of that, of the 64-bit Double, or of a legacy Borland soft float format.
– cyco130
Apr 20 at 11:04
@MichaelKarcher: The introduction of long int came fairly late in the development of C, and caused considerable problems. Nonetheless, I think a fundamental difference between the relationship of long int and int, vs. long double and double, is that every value within the range of int can be represented just as accurately by that type as by any larger type. Thus, if scale_factor will never exceed the range of int, there would generally be no reason for it to be declared as a larger type. On the other hand, if one writes double one_tenth=0.1;, ...
– supercat
Apr 20 at 16:23
...and then computes double x=one_tenth*y;, the calculation may be less precise than if one had written double x=y/10.0; or used long double scale_factor=0.1L. If neither wholeQuantity1 nor wholeQuantity2 would need to accommodate values outside the range of int, the expression wholeQuantity1+wholeQuantity2 will likely be of type int or unsigned. But in many cases involving floating-point, there would be some advantage to using longer-precision scale factors.
– supercat
Apr 20 at 16:29
The GNU Ada compiler ("GNAT") has supported 80-bit floating point as a fully-fledged built-in type with its Long_Long_Float type since at least 1998.
Here's a Usenet argument in February of 1999 between Ada compiler vendors and users about whether not supporting 80-bit floats is an Ada LRM violation. This was a huge deal for compiler vendors, as many government contracts can't use your compiler then, and the rest of the Ada userbase at that time viewed the Ada LRM as the next best thing to holy writ.*
To take a simple example, an x86 compiler that does not support 80-bit
IEEE extended arithmetic is clearly violates B.2(10):
10 Floating point types corresponding to each floating
point format fully supported by the hardware.
and is thus non-conformant. It will still be fully validatable, since
this is not the sort of thing the validation can test with automated
tests.
...
P.S. Just to ensure that people do not regard the above as special
pleading for non-conformances in GNAT, please be sure to realize that
GNAT does support 80-bit float on the ia32 (x86).
Since this is a GCC-based compiler, it's debatable whether this is a revelation over the current top-rated answer, but I didn't see it mentioned.
* - It may look silly, but this user attitude kept Ada source code extremely portable. The only other languages that really can compare are ones that are effectively defined by the behavior of a single reference implementation, or under the control of a single developer.
Did any compilers ever make full use of extended precision (i.e. 80 bits in memory as well as in registers)? If not, why not?
Since any calculations inside the x87 FPU have 80-bit precision by default, any compiler that's able to generate x87 FPU code is already using extended precision.
I also remember using long double even in 16-bit compilers for real mode.
A very similar situation existed in the 68k world, with FPUs like the 68881 and 68882 supporting 80-bit precision by default; any FPU code without special precautions would keep all register values at that precision. There was also a long double datatype.
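One way to observe that default from portable C (a minimal sketch; the commented values are what a typical x87-only build reports):

```c
#include <stdio.h>
#include <float.h>

int main(void)
{
    /* FLT_EVAL_METHOD == 2 means float and double expressions are evaluated
       with the range and precision of long double, i.e. in 80-bit registers. */
    printf("FLT_EVAL_METHOD     = %d\n", (int)FLT_EVAL_METHOD);           /* 2 with classic x87 code-gen */
    printf("sizeof(long double) = %u\n", (unsigned)sizeof(long double));  /* 10, 12, or 16 with padding  */
    return 0;
}
```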
On the other hand, Intel provided a solution with an extra eleven bits of significand precision and four bits of exponent, which would cost very little performance to use (since the hardware implemented it whether you used it or not), and yet everyone seemed to behave as though this had no value
The usage of long double would prevent contemporary compilers from ever making calculations using SSE/whatever registers and instructions. And SSE is actually a very fast engine, able to fetch data in large chunks and do several computations in parallel every clock. The x87 FPU is now just a legacy, and not very fast. So the deliberate usage of 80-bit precision now would certainly be a huge performance hit.
Right, I was talking about the historical context in which x87 was the only FPU on x86, so no performance hit from using it. Good point about 68881 being a very similar architecture.
– rwallace
Apr 19 at 8:12
5 Answers
5
active
oldest
votes
5 Answers
5
active
oldest
votes
active
oldest
votes
active
oldest
votes
Yes. For example, the C math library has had full support for long double
, which on x87 was 80 bits wide, since C99. Previous versions of the standard library supported only the double
type. Conforming C and C++ compilers also perform long double
math if you give the operations a long double
argument. (Recall that, in C, 1.0/3.0
divides a double
by another double
, producing a double
-precision result, and to get long double
precision, you would write 1.0L/3.0L
.)
GCC, in particular, even has options such as -ffloat-store
to turn off computing intermediate results to a higher precision than a double
is supposed to have. That is, on some architectures, the fastest way to perform some operations on double
arguments is to use extra precision, but that might produce a non-portable result, so GCC has an option to always round double
intermediate values off.
Testing with godbolt.org, GCC, Clang and ICC in x87 mode all perform 80-bit computations and memory stores with long double
variables—except that they will optimize constants such as 0.5L
to double
when that will save memory at no loss of precision. MSVC 2017, however, only supports 64-bit long double
.
Although you asked specifically about x87, the 68K architecture also had 80-bit FP hardware, and GCC made long double
80 bits wide on that target.
Fortran 95 finally provided a reasonably-portable way to specify a type with at least the precision of an 80-bit float (with kind
and SELECTED_REAL_KIND()
). These might give you double-double or 128-bit math on other implementations. Even before then, some Fortran compilers provided extensions such as REAL*10
. Ada was another language that allowed the programmer to specify a minimum number of DIGITS
of precision. There were other compilers that supported 80-bit math to some degree as well. For example, Turbo Pascal had an extended
type, although its math library supported only real
arguments.
Another possible example is Haskell, which provided both exact Rational
types and arbitrary-precision floating-point through Data.Number.CReal
. So far as I know, no implementation used x87 80-bit hardware, but it might still be an answer to your question.
Comments are not for extended discussion; this conversation has been moved to chat.
– Chenmunka♦
yesterday
add a comment |
Yes. For example, the C math library has had full support for long double
, which on x87 was 80 bits wide, since C99. Previous versions of the standard library supported only the double
type. Conforming C and C++ compilers also perform long double
math if you give the operations a long double
argument. (Recall that, in C, 1.0/3.0
divides a double
by another double
, producing a double
-precision result, and to get long double
precision, you would write 1.0L/3.0L
.)
GCC, in particular, even has options such as -ffloat-store
to turn off computing intermediate results to a higher precision than a double
is supposed to have. That is, on some architectures, the fastest way to perform some operations on double
arguments is to use extra precision, but that might produce a non-portable result, so GCC has an option to always round double
intermediate values off.
Testing with godbolt.org, GCC, Clang and ICC in x87 mode all perform 80-bit computations and memory stores with long double
variables—except that they will optimize constants such as 0.5L
to double
when that will save memory at no loss of precision. MSVC 2017, however, only supports 64-bit long double
.
Although you asked specifically about x87, the 68K architecture also had 80-bit FP hardware, and GCC made long double
80 bits wide on that target.
Fortran 95 finally provided a reasonably-portable way to specify a type with at least the precision of an 80-bit float (with kind
and SELECTED_REAL_KIND()
). These might give you double-double or 128-bit math on other implementations. Even before then, some Fortran compilers provided extensions such as REAL*10
. Ada was another language that allowed the programmer to specify a minimum number of DIGITS
of precision. There were other compilers that supported 80-bit math to some degree as well. For example, Turbo Pascal had an extended
type, although its math library supported only real
arguments.
Another possible example is Haskell, which provided both exact Rational
types and arbitrary-precision floating-point through Data.Number.CReal
. So far as I know, no implementation used x87 80-bit hardware, but it might still be an answer to your question.
Comments are not for extended discussion; this conversation has been moved to chat.
– Chenmunka♦
yesterday
add a comment |
Yes. For example, the C math library has had full support for long double
, which on x87 was 80 bits wide, since C99. Previous versions of the standard library supported only the double
type. Conforming C and C++ compilers also perform long double
math if you give the operations a long double
argument. (Recall that, in C, 1.0/3.0
divides a double
by another double
, producing a double
-precision result, and to get long double
precision, you would write 1.0L/3.0L
.)
GCC, in particular, even has options such as -ffloat-store
to turn off computing intermediate results to a higher precision than a double
is supposed to have. That is, on some architectures, the fastest way to perform some operations on double
arguments is to use extra precision, but that might produce a non-portable result, so GCC has an option to always round double
intermediate values off.
Testing with godbolt.org, GCC, Clang and ICC in x87 mode all perform 80-bit computations and memory stores with long double
variables—except that they will optimize constants such as 0.5L
to double
when that will save memory at no loss of precision. MSVC 2017, however, only supports 64-bit long double
.
Although you asked specifically about x87, the 68K architecture also had 80-bit FP hardware, and GCC made long double
80 bits wide on that target.
Fortran 95 finally provided a reasonably-portable way to specify a type with at least the precision of an 80-bit float (with kind
and SELECTED_REAL_KIND()
). These might give you double-double or 128-bit math on other implementations. Even before then, some Fortran compilers provided extensions such as REAL*10
. Ada was another language that allowed the programmer to specify a minimum number of DIGITS
of precision. There were other compilers that supported 80-bit math to some degree as well. For example, Turbo Pascal had an extended
type, although its math library supported only real
arguments.
Another possible example is Haskell, which provided both exact Rational
types and arbitrary-precision floating-point through Data.Number.CReal
. So far as I know, no implementation used x87 80-bit hardware, but it might still be an answer to your question.
Yes. For example, the C math library has had full support for long double
, which on x87 was 80 bits wide, since C99. Previous versions of the standard library supported only the double
type. Conforming C and C++ compilers also perform long double
math if you give the operations a long double
argument. (Recall that, in C, 1.0/3.0
divides a double
by another double
, producing a double
-precision result, and to get long double
precision, you would write 1.0L/3.0L
.)
GCC, in particular, even has options such as -ffloat-store
to turn off computing intermediate results to a higher precision than a double
is supposed to have. That is, on some architectures, the fastest way to perform some operations on double
arguments is to use extra precision, but that might produce a non-portable result, so GCC has an option to always round double
intermediate values off.
Testing with godbolt.org, GCC, Clang and ICC in x87 mode all perform 80-bit computations and memory stores with long double
variables—except that they will optimize constants such as 0.5L
to double
when that will save memory at no loss of precision. MSVC 2017, however, only supports 64-bit long double
.
Although you asked specifically about x87, the 68K architecture also had 80-bit FP hardware, and GCC made long double
80 bits wide on that target.
Fortran 95 finally provided a reasonably-portable way to specify a type with at least the precision of an 80-bit float (with kind
and SELECTED_REAL_KIND()
). These might give you double-double or 128-bit math on other implementations. Even before then, some Fortran compilers provided extensions such as REAL*10
. Ada was another language that allowed the programmer to specify a minimum number of DIGITS
of precision. There were other compilers that supported 80-bit math to some degree as well. For example, Turbo Pascal had an extended
type, although its math library supported only real
arguments.
Another possible example is Haskell, which provided both exact Rational
types and arbitrary-precision floating-point through Data.Number.CReal
. So far as I know, no implementation used x87 80-bit hardware, but it might still be an answer to your question.
edited Apr 19 at 20:34
answered Apr 19 at 6:57
DavislorDavislor
1,495411
1,495411
Comments are not for extended discussion; this conversation has been moved to chat.
– Chenmunka♦
yesterday
add a comment |
Comments are not for extended discussion; this conversation has been moved to chat.
– Chenmunka♦
yesterday
Comments are not for extended discussion; this conversation has been moved to chat.
– Chenmunka♦
yesterday
Comments are not for extended discussion; this conversation has been moved to chat.
– Chenmunka♦
yesterday
add a comment |
TL:DR: no, none of the major C compilers had an option to force promoting double
locals/temporaries to 80-bit even across spill/reload, only keeping them as 80-bit when it was convenient to keep them in registers anway.
Bruce Dawson's Intermediate Floating-Point Precision article is essential reading if you're wondering about whether extra precision for temporaries is helpful or harmful. He has examples that demonstrate both, and links to articles that conclude one way and the other.
Also very importantly, he has lots of specific details about what Visual Studio / MSVC actually does, and what gcc actually does, with x87 and with SSE/SSE2. Fun fact: MSVC before VS2012 used double
for float * float
even when using SSE/SSE2 instructions! (Presumably to match the numerical behaviour of x87 with its precision set to 53-bit significand; which is what MSVC does without SSE/SSE2.)
His whole series of FP articles is excellent; index in this one.
that would cost very little performance to use (since the hardware implemented it whether you used it or not)
This is an overstatement. Working with 80-bit long double
in x87 registers has zero extra cost, but as memory operands they are definitely 2nd-class citizens in both ISA design and performance. Most x87 code involves a significant amount of loading and storing, something like Mandelbrot iterations being a rare exception at the upper end of computational intensity. Some round constants can be stored as float
without precision loss, but runtime variables usually can't make any assumptions.
Compilers that always promoted temporaries / local variables to 80-bit even when they needed to be spilled/reloaded would create slower code (as @Davislor's answer seems to suggest would have been an option for gcc to implement). See below about when compilers actually round C double
temporaries and locals to IEEE binary64: any time they store/reload.
- 80-bit REAL10 / long double can't be a memory operand for fadd / fsub / fmul / fdiv / etc. Those only support 32- or 64-bit float/double memory operands. So to work with an 80-bit value from memory, you need an extra fld instruction. (Unless you want it in a register separately anyway, then the separate fld isn't "extra".) On P5 Pentium, memory operands for instructions like fadd have no extra cost, so if you already had to spill a value earlier, adding it from memory is efficient for float / double.
- And you need an extra x87 stack register to load it into. fadd st5, qword [mem] isn't available (only a memory source with the top of the register stack st0 as an implicit destination), so memory operands didn't help much to avoid fxch, but if you were close to filling up all 8 st0..7 stack slots then having to load might require you to spill something else.
- fst to store st0 to memory without popping the x87 stack is only available for m32 / m64 operands (IEEE binary32 float / IEEE binary64 double). fstp m32/m64/m80 to store-and-pop is used more often, but there are some use-cases where you want to store and keep using a value, like in a computation where one result is also part of a later expression, or an array calc where x[i] depends on x[i-1].
- If you want to store 80-bit long double, fstp is your only option. You might need to use fld st0 to duplicate it, then fstp to pop that copy off. (You can fld / fstp with a register operand instead of memory, as well as fxch to swap a register to the top of the stack.) See the C sketch after this list for the resulting store/reload pattern.
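To make that concrete, here's a minimal C sketch (mine, not from the original answer; the function name roundtrip is made up). With gcc or clang generating x87 code, the volatile store and the reloads typically compile to exactly the slow fstp TBYTE PTR / fld TBYTE PTR forms discussed above:

    /* Hypothetical example: force an 80-bit temporary through memory. */
    long double roundtrip(long double x)
    {
        volatile long double spill = x;   /* typically an fstp m80 (store-and-pop) */
        return spill * spill;             /* each read of spill is an fld m80 reload */
    }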
80-bit FP load/store is significantly slower than 32-bit or 64-bit, and not (just) because of larger cache footprint. On original Pentium, it's close to what you might expect from 32/64-bit load/store being a single cache access, vs. 80-bit taking 2 accesses (presumably 64 + 16 bit), but on later CPUs it's even worse.
Some perf numbers from Agner Fog's instruction tables for some 32-bit-only CPUs that were relevant in the era before SSE2 and x86-64. I don't have 486 numbers; Agner Fog only covers Pentium and earlier, and http://instlatx64.atw.hu/ only has CPUID from a 486, not instruction latencies. And its ppro / PIII latency/throughput numbers don't cover fld/fstp. It does show fsqrt
and fdiv
performance being slower for full 80-bit precision, though.
P5 Pentium (in-order pipelined dual issue superscalar):
- fld m32/m64 (load float/double into 80-bit x87 ST0): 1 cycle, pairable with fxch.
- fld m80: 3 cycles, not pairable, and (unlike fadd / fmul, which are pipelined) not overlappable with later FP or integer instructions.
- fst(p) m32/m64 (round 80-bit ST0 to float/double and store): 2 cycles, not pairable or overlappable.
- fstp m80 (note: only available in the pop version that frees the x87 register): 3 cycles, not pairable.
P6 Pentium Pro / Pentium II / Pentium III (out-of-order 3-wide superscalar; decodes to 1 or more RISC-like micro-ops that can be scheduled independently):
(Agner Fog doesn't have useful latency numbers for FP load/store on this uarch.)
- fld m32/m64 is 1 uop for the load port.
- fld m80: 4 uops total: 2 ALU p0, 2 load port.
- fst(p) m32/m64: 2 uops (store-address + store-data, not micro-fused because that only existed on P-M and later).
- fstp m80: 6 uops total: 2 ALU p0, 2x store-address, 2x store-data. I guess the ALU uops extract the value into 64-bit and 16-bit chunks, as inputs for the 2 stores.
Multi-uop instructions can only be decoded by the "complex" decoder on Intel CPUs (while simple instructions can decode in parallel, in patterns like 1-1-1 up to 4-1-1), so 4-uop fld m80
can lead to the previous cycle only producing 1 uop in the worst case. 6 uops for fstp m80
is more than 4, so decoding it requires the microcode sequencer. These decode bottlenecks could lead to bubbles in the front-end, as well as / instead of possible back-end bottlenecks. (P6-family CPUs, especially later ones with better back-end throughput, can bottleneck on instruction fetch/decode in the front-end if you aren't careful; see Agner Fog's microarch pdf. Keeping the issue/rename stage fed with 3 uops / clock can be hard, or 4 on Core2 and later.)
Agner doesn't have latencies or throughputs for FP loads/stores on original P6 (the "1 cycle" latency in a couple columns appears bogus). But it's probably similar to later CPUs, where m80
has worse throughput than you'd expect from the uop counts / ports.
- Pentium-M: fstp m80 (6 uops): 1 per 3 cycle throughput, vs. 1 uop / 1-per-clock for fst(p) m32/m64, with micro-fusion of the store-address and store-data uops into a single fused-domain uop that can decode in any slot on the simple decoders.
- Core 2 (Merom) / Nehalem: fld m80: 1 per 3 cycles (4 uops). fstp m80: 1 per 5 cycles (7 uops: 3 ALU + 2x each store-address and store-data). Agner's latency numbers show 1 extra cycle for both load and store.
- Pentium 4 (pre-Prescott): fld m80: 3+4 uops, 1 per 6 cycles, vs. 1-uop pipelined. fstp m80: 3+8 uops, 1 per 8 cycles, vs. 2+0 uops with 2 to 3c throughput. Prescott is similar.
- Skylake: fld m80: 1 per 2 cycles (4 uops) vs. 1 per 0.5 cycles for m32/m64. fstp m80: still 7 uops, 1 per 5 cycles vs. 1 per clock for normal stores.
- AMD K7/K8: fld m80: 7 m-ops, 1 per 4-cycle throughput (vs. 1 per 0.5c for 1 m-op fld m32/m64). fstp m80: 10 m-ops, 1 per 5-cycle throughput (vs. 1 m-op fully pipelined fst(p) m32/m64). The latency penalty on these is much higher than on Intel, e.g. 16-cycle m80 loads vs. 4-cycle m32/m64.
- AMD Bulldozer: fld m80: 8 ops / 14c latency / 4c throughput (vs. 1 op / 8c lat / 1c tput for m32/m64). Interesting that even regular float / double x87 loads have half the throughput of SSE2 / AVX loads. fstp m80: 13 ops / 9c lat / 20c tput (vs. 1 op / 8c lat / 1c tput). Piledriver/Steamroller are similar; that catastrophic store throughput of one per 20 or 19 cycles is real.
(Bulldozer-family's high load/store latencies for regular m32/m64 operands is related to having a "cluster" of 2 weak integer cores sharing a single FPU/SIMD unit. Ryzen abandoned this in favour of SMT in the style of Intel's Hyperthreading.)
There's definitely a chicken/egg effect here; if compilers did make code that regularly used stored/reloaded 80-bit temporaries in memory, CPU designers would spend some more transistors to make it more efficient at least on later CPUs. Maybe doing a single 16-byte unaligned cache access when possible, and grabbing the required 10 bytes from that.
Fun fact: fld m32/m64
can raise / flag an FP exception (#IA
) if the source operand is SNaN, but Intel's manual says this can't happen if the source operand is in double extended-precision floating-point format. So it can just stuff the bits into an x87 register without looking at them, unlike fld m32 / m64 where it has to expand the significand/exponent fields.
So ironically, on recent CPUs where the main use-case for x87 is for 80-bit, 80-bit float support is relatively even worse than on older CPUs. Obviously CPU designers don't put much weight on that and assume it's mostly used by old 32-bit binaries.
x87 and MMX are de-prioritized, though, e.g. Haswell made fxch
a 2-uop instruction, up from 1 in previous uarches. (Still 0 latency using register renaming, though. See Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? for some thoughts on that and fxch
.) And fmul
/ fadd
throughputs are only 1 per clock on Skylake, vs. 2 per clock for SSE/AVX vector or scalar add/mul/fma. On Skylake even some MMX integer SIMD instructions run on fewer execution ports than their XMM equivalents.
(If you're looking at the tables yourself, fbld and fbstp m80bcd are insanely slow because they convert from/to BCD, requiring binary/decimal conversion with division by 10. Never mind those; they're always microcoded.)
yet everyone seemed to behave as though this had no value, and to positively celebrate the move to SSE2 where extended precision is no longer available.
No, what people celebrated was that FP became more deterministic. When and where you got 80-bit temporaries depended on compiler optimization decisions. You still can't compile most code on different platforms and get bitwise-identical results, but 80-bit x87 was one major source of difference between x86 and some other platforms.
Some people (e.g. those writing unit tests) would rather have the same numbers everywhere than have more accurate results on x86. Often double is more than enough, and/or the benefit was relatively small. In other cases, not so much, and extra temporary precision might help significantly.
Deterministic FP is a hard problem, but sought after by people for various reasons. e.g. trying to make multi-player games that don't need to send the whole state of the world over the network every simulation step, but instead can have everyone's simulation run in lockstep without drifting out of sync.
- https://stackoverflow.com/questions/328622/how-deterministic-is-floating-point-inaccuracy
- https://stackoverflow.com/questions/27149894/does-any-floating-point-intensive-code-produce-bit-exact-results-in-any-x86-base
x87 (thus C FLT_EVAL_METHOD == 2
) isn't the only thing that was / is problematic. C compilers that can contract x*y + z
into fma(x,y,z)
also avoid that intermediate rounding step.
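As a hedged illustration (my own example, not from the answer): FLT_EVAL_METHOD from <float.h> reports the evaluation width (2 means long double intermediates, i.e. classic x87 codegen; 0 means each operation at its own type, typical for SSE2), while whether a compiler contracts a * b + c into an FMA is a separate knob (e.g. gcc's -ffp-contract). The contracted and non-contracted results below genuinely differ:

    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = 1.0 + 0x1p-52, b = 1.0 - 0x1p-52, c = -1.0;

        printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
        /* Without contraction, a*b rounds to 1.0 and the sum is 0.
         * Contracted to fma(a, b, c), the exact -0x1p-104 survives. */
        printf("a*b + c      = %.17g\n", a * b + c);
        printf("fma(a, b, c) = %.17g\n", fma(a, b, c));
        return 0;
    }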
For algorithms that didn't try to account for rounding at all, increased temporary precision usually only helped. But numerical techniques like Kahan summation that compensate for FP rounding errors can be defeated by extra temporary precision. So yes, there are definitely people that are happy that extra temporary precision went away, so their code works the way they designed it on more compilers.
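For reference, a standard Kahan-summation sketch in C (my own, not from the answer). The compensation line (t - sum) - y is exactly the step that wider temporaries, or value-changing optimizations, can break:

    #include <stddef.h>

    double kahan_sum(const double *x, size_t n)
    {
        double sum = 0.0, c = 0.0;      /* c carries the lost low-order bits */
        for (size_t i = 0; i < n; i++) {
            double y = x[i] - c;
            double t = sum + y;
            c = (t - sum) - y;          /* algebraically zero; captures the rounding error,
                                           unless kept in an 80-bit temporary or optimized away */
            sum = t;
        }
        return sum;
    }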
When do compilers round:
Any time they need to pass a double
to a non-inline function, obviously they store it in memory as a double
. (32-bit calling conventions pass FP args on the stack, not in x87 registers unfortunately. They do return FP values in st0
. I think some more recent 32-bit conventions on Windows use XMM registers for FP pass/return like in 64-bit mode. Other OSes care less about 32-bit code and still just use the inefficient i386 System V ABI which is stack args all the way even for integer.)
So you can use sinl(x)
instead of sin(x)
to call the long double
version of the library function. But all your other variables and internal temporaries get rounded to their declared precision (normally double
or float
) around that function call, because the whole x87 stack is call-clobbered.
When compilers spill/reload variables and optimization-created temporaries, they do so with the precision of the C variable. So unless you actually declared long double a,b,c
, your double a,b,c
all get rounded to double
when you do x = sinl(y)
. That's somewhat predictable.
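A hedged sketch of that behaviour (the function f and the constants are made up):

    #include <math.h>

    double f(double x, double y)
    {
        double a = x / 3.0;                        /* may sit in an x87 register at 64-bit precision */
        double b = y / 7.0;
        long double s = sinl((long double)x * y);  /* the call clobbers the whole x87 stack... */
        return a + b + (double)s;                  /* ...so a and b were spilled/reloaded as binary64 */
    }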
But even less predictable is when the compiler decides to spill something because it's running out of registers, or when you compile with/without optimization. gcc -ffloat-store does this store/reload of variables to the declared precision between statements even when optimization is enabled (not for temporaries within the evaluation of one expression). So for FP variables it's kind of like debug-mode code-gen, where vars are treated similarly to volatile.
But of course this is crippling for performance unless your code is bottlenecked on something like cache misses for an array.
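For example (a sketch assuming gcc targeting 32-bit x87; foo.c and accumulate are made-up names), the difference is just a compiler flag:

    /*   gcc -O2 -m32 -mfpmath=387 foo.c                  temporaries may stay 80-bit in registers
     *   gcc -O2 -m32 -mfpmath=387 -ffloat-store foo.c    each named variable rounded to double
     *                                                    at every statement boundary            */
    double accumulate(const double *x, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i];        /* with -ffloat-store, s is stored and reloaded every iteration */
        return s;
    }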
Extended precision long double
is still an option
(Obviously long double
will prevent auto-vectorization, so only use it if you need it when writing modern code.)
Nobody was celebrating removing the possibility of extended precision, because that didn't happen (except with MSVC which didn't give access to it even for 32-bit code where SSE wasn't part of the standard calling convention).
Extended precision is rarely used, and not supported by MSVC, but on other compilers targeting x86 and x86-64, long double
is the 80-bit x87 type. Apparently even when compiling for Windows, gcc and clang use 80-bit long double
.
Beware long double
is an ABI difference between MSVC and other x86 compilers. Usually gcc and clang are careful to match the calling convention, type widths, and struct layout rules of the platform. But they chose to make long double
a 10-byte type despite MSVC making it the same as 8-byte double
.
GCC has a -mlong-double-64/80/128
x86 option to set the width of long double
, and the docs warn that it changes the ABI.
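A quick way to check what you actually got for a given target/flag combination (my own snippet; typical x86-64 gcc/clang print 16 bytes and 64 bits here, while MSVC prints 8 and 53):

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        printf("sizeof(long double) = %zu bytes, %d significand bits\n",
               sizeof(long double), LDBL_MANT_DIG);
        return 0;
    }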
ICC has a /Qlong-double
option that makes long double
an 80-bit type even on Windows.
So functions that interact with any kind of long double
are not ABI compatible between MSVC and other compilers (except GCC or ICC with special options); they're expecting a different sized object, so not even a single long double
works, except as a function return value in st0
where it's in a register already.
If you need more precision than IEEE binary64 double, your options include so-called double-double (using a pair of double values to get twice the significand width but the same exponent range), or taking advantage of x87 80-bit hardware. If 80-bit is enough, it's a useful option: it gives you extra range as well as significand precision, and only requires 1 instruction per computation.
(On CPUs with AVX, especially with AVX2 + FMA, for some loops double-double might outperform x87, being able to compute 4x double
in parallel. e.g. https://stackoverflow.com/questions/30573443/optimize-for-fast-multiplication-but-slow-addition-fma-and-doubledouble shows that double * double => double_double
(53x53 => 106-bit significand) multiplication can be as simple as high = a * b;
low = fma(a, b, -high);
and Haswell/Skylake can do that for 4 elements at once in 2 instructions (with 2-per-clock throughput for FP mul/FMA). But with double_double
inputs, it's obviously less cheap.)
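Here's that FMA trick as a self-contained C sketch (the names dd_t and two_prod are mine; this assumes hardware FMA, e.g. compiling with -mfma, otherwise libm's software fma() fallback is slow):

    #include <math.h>

    typedef struct { double hi, lo; } dd_t;   /* "double-double": hi + lo, non-overlapping */

    static dd_t two_prod(double a, double b)
    {
        dd_t r;
        r.hi = a * b;                /* rounded 53-bit product */
        r.lo = fma(a, b, -r.hi);     /* exact low-order error of that rounding */
        return r;
    }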
Further fun facts:
The x87 FPU has precision-control bits that let you set how results in registers are rounded after any/every computation and load:
- to 80-bit long double: 64-bit significand precision. The finit default, and the normal setting except with MSVC.
- to 64-bit double: 53-bit significand precision. 32-bit MSVC sets this.
- to 24-bit float: 24-bit significand precision.
Apparently Visual C++'s CRT startup code (that calls main
) reduces x87 precision from 64-bit significand down to 53-bit (64-bit double). Apparently x86 (32-bit) VS2012 and later still does this, if I'm reading Bruce Dawson's article correctly.
So as well as not having an 80-bit FP type, 32-bit MSVC changes the FPU setting so even if you used hand-written asm, you'd still only have 53-bit significand precision, with only the wider range from having more exponent bits. (fstp m80
would still store in the same format, but the low 11 bits of the significand would always be zero. And I guess loading would have to round to nearest. Supporting this stuff might be why fld
decodes to multiple ALU uops on modern CPUs.)
I don't know if the motivation was to speed up fdiv
and fsqrt
(which it does for inputs that don't have a lot of trailing zeros in the significand), or if it's to avoid extra temporary precision. But it has the huge downside that it makes using extended precision impossible (or useless). It's interesting that GNU/Linux and MSVC made opposite decisions here.
Apparently the D3D9 library init function sets x87 precision to 24-bit significand single-precision float, making everything less precise for a speed gain on fdiv/fsqrt (and maybe fcos
/fsin
and other slow microcoded instructions, too.) But x87 precision settings are per-thread, so it matters which thread you call the init function from! (The x87 control word is part of the architectural state that context switches save/restore.)
Of course you can set it back to 64-bit significand with _controlfp_s, so you could usefully use asm, or call a function using long double compiled by GCC, clang, or ICC. But beware the ABI differences: you can only pass it inputs as float, double, or integer, because MSVC won't ever create objects in memory in the 80-bit x87 format.
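A hedged sketch of that last point for 32-bit MSVC (the function name is made up; _controlfp_s, _PC_64 and _MCW_PC are the documented names in <float.h>, and the precision-control mask isn't supported on x64):

    #include <float.h>

    void use_extended_precision(void)           /* call on the thread that needs it */
    {
        unsigned int prev;
        _controlfp_s(&prev, _PC_64, _MCW_PC);   /* restore 64-bit significand for x87 code */
    }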
1
In most of the situations where 80-bit values should be used, it would be possible to keep them in registers. Spilling an 80-bit value and later reloading it would cost more than doing likewise with a 64-bit value, but if 80-bit register spills would be a meaningful factor in performance, 64-bit register spills would have an adverse effect anyway.
– supercat
Apr 19 at 21:30
1
@supercat: right, but if you want deterministic 80-bit, you have to use it at least for some local vars, so when they have to be spilled across other function calls, like for sin(x) + sin(y) + sin(x*y) or whatever, the spill/reload will be in 80-bit precision. And this is why compilers don't default to promoting locals and temporaries to 80-bit. (An option to do that would be possible, but GCC (still) doesn't have one. See discussion on Davislor's answer about a quote from the g77 2.95 manual about the possibility.)
– Peter Cordes
Apr 19 at 21:53
1
If an ABI has a good blend of caller-saved and callee-saved registers, most calls to leaf functions shouldn't end up requiring register spills. Entering and returning from a non-leaf function will likely require a spill/restore for each local variable, but the total execution time for most non-leaf functions will be long enough that even if 80-bit loads and stores cost twice as much as 64-bit ones, that wouldn't meaningfully affect overall performance.
– supercat
Apr 19 at 22:07
2
@supercat: As I mentioned in this answer, the x87 stack is always call-clobbered in all calling conventions. (And must be empty on call, and empty or holding the return value on ret.) Due to its nature, there's no sane way to make any of it call-preserved; it's a stack, and pushing a new value when the slot is already in use gives you a NaN-indefinite. You could in theory make a calling convention that stored the status word and figured out how many regs were in use, so it knew how many slow 80-bit fstps to use on entry, and how many slow 80-bit flds to do before returning, but yuck.
– Peter Cordes
Apr 19 at 22:12
3
the option in GCC is -mlong-double-64/80/128. There's also a warning under it saying that if you override the default value for your target ABI, this changes the size of structures and arrays containing long double variables, as well as modifying the function calling convention for functions taking long double. Hence they are not binary-compatible with code compiled without that switch.
– phuclv
Apr 20 at 4:19
TL:DR: no, none of the major C compilers had an option to force promoting double
locals/temporaries to 80-bit even across spill/reload, only keeping them as 80-bit when it was convenient to keep them in registers anway.
Bruce Dawson's Intermediate Floating-Point Precision article is essential reading if you're wondering about whether extra precision for temporaries is helpful or harmful. He has examples that demonstrate both, and links to articles that conclude one way and the other.
Also very importantly, he has lots of specific details about what Visual Studio / MSVC actually does, and what gcc actually does, with x87 and with SSE/SSE2. Fun fact: MSVC before VS2012 used double
for float * float
even when using SSE/SSE2 instructions! (Presumably to match the numerical behaviour of x87 with its precision set to 53-bit significand; which is what MSVC does without SSE/SSE2.)
His whole series of FP articles is excellent; index in this one.
that would cost very little performance to use (since the hardware implemented it whether you used it or not)
This is an overstatement. Working with 80-bit long double
in x87 registers has zero extra cost, but as memory operands they are definitely 2nd-class citizens in both ISA design and performance. Most x87 code involves a significant amount of loading and storing, something like Mandelbrot iterations being a rare exception at the upper end of computational intensity. Some round constants can be stored as float
without precision loss, but runtime variables usually can't make any assumptions.
Compilers that always promoted temporaries / local variables to 80-bit even when they needed to be spilled/reloaded would create slower code (as @Davislor's answer seems to suggest would have been an option for gcc to implement). See below about when compilers actually round C double
temporaries and locals to IEEE binary64: any time they store/reload.
80-bit REAL10 /
long double
can't be a memory operand forfadd
/fsub
/fmul
/fdiv
/ etc. Those only support using 32 or 64-bit float/double memory operands.So to work with an 80-bit value from memory, you need an extra
fld
instruction. (Unless you want it in a register separately anyway, then the separatefld
isn't "extra"). On P5 Pentium, memory operands for instructions likefadd
have no extra cost, so if you already had to spill a value earlier, adding it from memory is efficient forfloat
/double
.And you need an extra x87 stack register to load it into.
fadd st5, qword [mem]
isn't available (only memory source with the top of the register stackst0
as an implicit destination), so memory operands didn't help much to avoidfxch
, but if you were close to filling up all 8st0..7
stack slots then having to load might require you to spill something else.fst
to storest0
to memory without popping the x87 stack is only available form32
/m64
operands (IEEE binary32float
/ IEEE binary64double
).fstp m32/m64/m80
to store-and-pop is used more often, but there are some use-cases where you want to store and keep using a value. Like in a computation where one result is also part of a later expression, or an array calc wherex[i]
depends onx[i-1]
.If you want to store 80-bit
long double
,fstp
is your only option. You might need usefld st0
to duplicate it, thenfstp
to pop that copy off. (You canfld
/fstp
with a register operand instead of memory, as well asfxch
to swap a register to the top of the stack.)
80-bit FP load/store is significantly slower than 32-bit or 64-bit, and not (just) because of larger cache footprint. On original Pentium, it's close to what you might expect from 32/64-bit load/store being a single cache access, vs. 80-bit taking 2 accesses (presumably 64 + 16 bit), but on later CPUs it's even worse.
Some perf numbers from Agner Fog's instruction tables for some 32-bit-only CPUs that were relevant in the era before SSE2 and x86-64. I don't have 486 numbers; Agner Fog only covers Pentium and earlier, and http://instlatx64.atw.hu/ only has CPUID from a 486, not instruction latencies. And its ppro / PIII latency/throughput numbers don't cover fld/fstp. It does show fsqrt
and fdiv
performance being slower for full 80-bit precision, though.
P5 Pentium (in-order pipelined dual issue superscalar):
fld m32/m64
(load float/double into 80-bit x87 ST0): 1 cycle, pairable with fxchg.fld m80
: 3 cycles, not pairable, and (unlikefadd
/fmul
which are pipelined), not overlapable with later FP or integer instructions.fst(p) m32/m64
(round 80-bit ST0 to float/double and store): 2 cycles, not pairable or overlapablefstp m80
: (note only available inpop
version that frees the x87 register): 3 cycles, not pairable
P6 Pentium Pro / Pentium II / Pentium III. (out-of-order 3-wide superscalar, decodes to 1 or more RISC-like micro-ops that can be scheduled independently)
(Agner Fog doesn't have useful latency numbers for FP load/store on this uarch)fld m32/m64
is 1 uop for the load port.fld m80
: 4 uops total: 2 ALU p0, 2 load portfst(p) m32/m64
2 uops (store-address + store-data, not micro-fused because that only existed on P-M and later)fstp m80
: 6 uops total: 2 ALU p0, 2x store-address, 2x store-data. I guess ALU extract into 64-bit and 16-bit chunks, as inputs for 2 stores.
Multi-uop instructions can only be decoded by the "complex" decoder on Intel CPUs (while simple instructions can decode in parallel, in patterns like 1-1-1 up to 4-1-1), so 4-uop fld m80
can lead to the previous cycle only producing 1 uop in the worst case. 6 uops for fstp m80
is more than 4, so decoding it requires the microcode sequencer. These decode bottlenecks could lead to bubbles in the front-end, as well as / instead of possible back-end bottlenecks. (P6-family CPUs, especially later ones with better back-end throughput, can bottleneck on instruction fetch/decode in the front-end if you aren't careful; see Agner Fog's microarch pdf. Keeping the issue/rename stage fed with 3 uops / clock can be hard, or 4 on Core2 and later.)
Agner doesn't have latencies or throughputs for FP loads/stores on original P6 (the "1 cycle" latency in a couple columns appears bogus). But it's probably similar to later CPUs, where m80
has worse throughput than you'd expect from the uop counts / ports.
- Pentium-M: 1 per 3 cycle throughput for
fstp m80
6 uops. vs. 1 uop / 1-per-clock forfst(p) m32/m64
, with micro-fusion of the store-address and store-data uops into a single fused-domain uop that can decode in any slot on the simple decoders. - Core 2 (Merom) / Nehalem:
fld m80
: 1 per 3 cycles (4 uops)fstp m80
1 per 5 cycles (7 uops: 3 ALU + 2x each store-address and store-data). Agner's latency numbers show 1 extra cycle for both load and store. - Pentium 4 (pre-Prescott):
fld m80
3+4 uops, 1 per 6 cycles vs. 1-uop pipelined.fstp m80
: 3+8 uops, 1 per 8 cycles vs. 2+0 uops with 2 to 3c throughput. Prescott is similar - Skylake:
fld m80
: 1 per 2 cycles (4 uops) vs. 1 per 0.5 cycles for m32/m64.fstp m80
: Still 7 uops, 1 per 5 cycles vs. 1 per clock for normal stores.
AMD K7/K8:
fld m80
: 7 m-ops, 1 per 4-cycle throughput (vs. 1 per 0.5c for 1 m-opfld m32/m64
).fstp m80
: 10 m-ops, 1 per 5-cycle throughput. (vs. 1 m-op fully pipelinedfst(p) m32/m64
). The latency penalty on these is much higher than on Intel, e.g. 16 cycle m80 loads vs. 4-cycle m32/m64.AMD Bulldozer:
fld m80
: 8 ops/14c lat/4c tput. (vs. 1 op/8c lat/1c tput for m32/m64). Interesting that even regularfloat
/double
x87 loads have half throughput of SSE2 / AVX loads.fstp m80
: 13 ops/9c lat/20c tput. (vs. 1 op/8c lat/1c tput). Piledriver/Steamroller are similar, that catastrophic store throughput of one per 20 or 19 cycles is real.
(Bulldozer-family's high load/store latencies for regular m32/m64 operands is related to having a "cluster" of 2 weak integer cores sharing a single FPU/SIMD unit. Ryzen abandoned this in favour of SMT in the style of Intel's Hyperthreading.)
There's definitely a chicken/egg effect here; if compilers did make code that regularly used stored/reloaded 80-bit temporaries in memory, CPU designers would spend some more transistors to make it more efficient at least on later CPUs. Maybe doing a single 16-byte unaligned cache access when possible, and grabbing the required 10 bytes from that.
Fun fact: fld m32/m64
can raise / flag an FP exception (#IA
) if the source operand is SNaN, but Intel's manual says this can't happen if the source operand is in double extended-precision floating-point format. So it can just stuff the bits into an x87 register without looking at them, unlike fld m32 / m64 where it has to expand the significand/exponent fields.
So ironically, on recent CPUs where the main use-case for x87 is for 80-bit, 80-bit float support is relatively even worse than on older CPUs. Obviously CPU designers don't put much weight on that and assume it's mostly used by old 32-bit binaries.
x87 and MMX are de-prioritized, though, e.g. Haswell made fxch
a 2-uop instruction, up from 1 in previous uarches. (Still 0 latency using register renaming, though. See Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? for some thoughts on that and fxch
.) And fmul
/ fadd
throughputs are only 1 per clock on Skylake, vs. 2 per clock for SSE/AVX vector or scalar add/mul/fma. On Skylake even some MMX integer SIMD instructions run on fewer execution ports than their XMM equivalents.
(If you're looking at the tables yourself, fbld
and fbstp m80bcd
are insanely slow because they convert from/to BCD, thus requiring conversion from binary to decimal with division by 10. Nevermind those, they're always microcoded).
yet everyone seemed to behave as though this had no value, and to positively celebrate the move to SSE2 where extended precision is no longer available.
No, what people celebrated was that FP became more deterministic. When and where you got 80-bit temporaries depended on compiler optimization decisions. You still can't compile most code on different platforms and get bitwise-identical results, but 80-bit x87 was one major source of difference between x86 and some other platforms.
Some people (e.g. writing unit tests) would rather have the same numbers everywhere than have more accurate results on x86. Often double
is more than enough, and/or the benefit was relatively small. In other cases, not so much, and extra temporary precision might help significantly.
Deterministic FP is a hard problem, but sought after by people for various reasons. e.g. trying to make multi-player games that don't need to send the whole state of the world over the network every simulation step, but instead can have everyone's simulation run in lockstep without drifting out of sync.
- https://stackoverflow.com/questions/328622/how-deterministic-is-floating-point-inaccuracy
- https://stackoverflow.com/questions/27149894/does-any-floating-point-intensive-code-produce-bit-exact-results-in-any-x86-base
x87 (thus C FLT_EVAL_METHOD == 2
) isn't the only thing that was / is problematic. C compilers that can contract x*y + z
into fma(x,y,z)
also avoid that intermediate rounding step.
For algorithms that didn't try to account for rounding at all, increased temporary precision usually only helped. But numerical techniques like Kahan summation that compensate for FP rounding errors can be defeated by extra temporary precision. So yes, there are definitely people that are happy that extra temporary precision went away, so their code works the way they designed it on more compilers.
When do compilers round:
Any time they need to pass a double
to a non-inline function, obviously they store it in memory as a double
. (32-bit calling conventions pass FP args on the stack, not in x87 registers unfortunately. They do return FP values in st0
. I think some more recent 32-bit conventions on Windows use XMM registers for FP pass/return like in 64-bit mode. Other OSes care less about 32-bit code and still just use the inefficient i386 System V ABI which is stack args all the way even for integer.)
So you can use sinl(x)
instead of sin(x)
to call the long double
version of the library function. But all your other variables and internal temporaries get rounded to their declared precision (normally double
or float
) around that function call, because the whole x87 stack is call-clobbered.
When compilers spill/reload variables and optimization-created temporaries, they do so with the precision of the C variable. So unless you actually declared long double a,b,c
, your double a,b,c
all get rounded to double
when you do x = sinl(y)
. That's somewhat predictable.
But even less predictable is when the compiler decides to spill something because it's running out of registers. Or when you compile with/without optimization. gcc -ffloat-store
does this store/reload variables to the declared precision between statements even when optimization is enabled. (Not temporaries within the evaluation of one expression.) So for FP variables, kind of like debug-mode code-gen where vars are treated similar to volatile
.
But of course this is crippling for performance unless your code is bottlenecked on something like cache misses for an array.
Extended precision long double
is still an option
(Obviously long double
will prevent auto-vectorization, so only use it if you need it when writing modern code.)
Nobody was celebrating removing the possibility of extended precision, because that didn't happen (except with MSVC which didn't give access to it even for 32-bit code where SSE wasn't part of the standard calling convention).
Extended precision is rarely used, and not supported by MSVC, but on other compilers targeting x86 and x86-64, long double
is the 80-bit x87 type. Apparently even when compiling for Windows, gcc and clang use 80-bit long double
.
Beware long double
is an ABI difference between MSVC and other x86 compilers. Usually gcc and clang are careful to match the calling convention, type widths, and struct layout rules of the platform. But they chose to make long double
a 10-byte type despite MSVC making it the same as 8-byte double
.
GCC has a -mlong-double-64/80/128
x86 option to set the width of long double
, and the docs warn that it changes the ABI.
ICC has a /Qlong-double
option that makes long double
an 80-bit type even on Windows.
So functions that interact with any kind of long double
are not ABI compatible between MSVC and other compilers (except GCC or ICC with special options); they're expecting a different sized object, so not even a single long double
works, except as a function return value in st0
where it's in a register already.
If you need more precision than IEEE binary64 double
, your options include so-called double-double (using a pair of double
values to get twice the significand width but the same exponent range), or taking advantage of x87 80-bit hardware. If 80-bit is enough, it's a useful option, and gives you extra range as well as significand precision, and only requires 1 instruction per computation).
(On CPUs with AVX, especially with AVX2 + FMA, for some loops double-double might outperform x87, being able to compute 4x double
in parallel. e.g. https://stackoverflow.com/questions/30573443/optimize-for-fast-multiplication-but-slow-addition-fma-and-doubledouble shows that double * double => double_double
(53x53 => 106-bit significand) multiplication can be as simple as high = a * b;
low = fma(a, b, -high);
and Haswell/Skylake can do that for 4 elements at once in 2 instructions (with 2-per-clock throughput for FP mul/FMA). But with double_double
inputs, it's obviously less cheap.)
Further fun facts:
The x87 FPU has precision-control bits that let you set how results in registers are rounded after any/every computation and load:
- to 80-bit long double: 64-bit significand precision. The
finit
default, and normal setting except with MSVC. - to 64-bit double: 53-bit significand precision. 32-bit MSVC sets this.
- to 24-bit float: 24-bit significand precision.
Apparently Visual C++'s CRT startup code (that calls main
) reduces x87 precision from 64-bit significand down to 53-bit (64-bit double). Apparently x86 (32-bit) VS2012 and later still does this, if I'm reading Bruce Dawson's article correctly.
So as well as not having an 80-bit FP type, 32-bit MSVC changes the FPU setting so even if you used hand-written asm, you'd still only have 53-bit significand precision, with only the wider range from having more exponent bits. (fstp m80
would still store in the same format, but the low 11 bits of the significand would always be zero. And I guess loading would have to round to nearest. Supporting this stuff might be why fld
decodes to multiple ALU uops on modern CPUs.)
I don't know if the motivation was to speed up fdiv
and fsqrt
(which it does for inputs that don't have a lot of trailing zeros in the significand), or if it's to avoid extra temporary precision. But it has the huge downside that it makes using extended precision impossible (or useless). It's interesting that GNU/Linux and MSVC made opposite decisions here.
Apparently the D3D9 library init function sets x87 precision to 24-bit significand single-precision float, making everything less precise for a speed gain on fdiv/fsqrt (and maybe fcos
/fsin
and other slow microcoded instructions, too.) But x87 precision settings are per-thread, so it matters which thread you call the init function from! (The x87 control word is part of the architectural state that context switches save/restore.)
Of course you can set it back to 64-bit significand with _controlfp_s
, so you could useful use asm, or call a function using long double
compiled by GCC, clang, or ICC. But beware the ABI differences: you can only pass it inputs as float
, double
, or integer, because MSVC won't ever create objects in memory in the 80-bit x87 format.
1
In most of the situations where 80-bit values should be used, it would be possible to keep them in registers. Spilling an 80-bit value and later reloading it would cost more than doing likewise with a 64-bit value, but if 80-bit register spills would be a meaningful factor in performance, 64-bit register spills would have an adverse effect anyway.
– supercat
Apr 19 at 21:30
1
@supercat: right, but if you want deterministic 80-bit, you have to use it at least for some local vars, so when they have to be spilled across other function calls, like forsin(x) + sin(y) + sin(x*y)
or whatever, the spill/reload will be in 80-bit precision. And this is why compilers don't default to promoting locals and temporaries to 80-bit by default. (An option to do that would be possible, but GCC (still) doesn't have one. See discussion on Davislor's answer about a quote from the g77 2.95 manual about the possibility.
– Peter Cordes
Apr 19 at 21:53
1
If an ABI has a good blend of caller-saved and callee-saved registers, most calls to leaf functions shouldn't end up requiring register spills. Entering and returning from a non-leaf function will likely require a spill/restore for each local variable, but the total execution time for most non-leaf functions will be long enough that even if 80-bit loads and stores cost twice as much as 64-bit ones, that wouldn't meaningfully affect overall performance.
– supercat
Apr 19 at 22:07
2
@supercat: As I mentioned in this answer, the x87 stack is always call-clobbered in all calling conventions. (And must be empty oncall
, and empty or holding return value onret
). Due to its nature, there's no sane way to make any of it call-preserved; it's a stack, and pushing a new value when the slot is already in use give you a NaN-indefinite. You could in theory make a calling convention that stored the status word and figured out how many regs were in use, so it knew how many slow 80-bitfstp
s to use on entry, and how many slow 80-bitfld
s to do before returning, but yuck.
– Peter Cordes
Apr 19 at 22:12
3
the option in GCC is-mlong-double-64/80/128
. There's also a warning under them saying if you override the default value for your target ABI, this changes the size of structures and arrays containing long double variables, as well as modifying the function calling convention for functions taking long double. Hence they are not binary-compatible with code compiled without that switch.
– phuclv
Apr 20 at 4:19
|
show 9 more comments
TL:DR: no, none of the major C compilers had an option to force promoting double
locals/temporaries to 80-bit even across spill/reload, only keeping them as 80-bit when it was convenient to keep them in registers anway.
Bruce Dawson's Intermediate Floating-Point Precision article is essential reading if you're wondering about whether extra precision for temporaries is helpful or harmful. He has examples that demonstrate both, and links to articles that conclude one way and the other.
Also very importantly, he has lots of specific details about what Visual Studio / MSVC actually does, and what gcc actually does, with x87 and with SSE/SSE2. Fun fact: MSVC before VS2012 used double
for float * float
even when using SSE/SSE2 instructions! (Presumably to match the numerical behaviour of x87 with its precision set to 53-bit significand; which is what MSVC does without SSE/SSE2.)
His whole series of FP articles is excellent; index in this one.
that would cost very little performance to use (since the hardware implemented it whether you used it or not)
This is an overstatement. Working with 80-bit long double
in x87 registers has zero extra cost, but as memory operands they are definitely 2nd-class citizens in both ISA design and performance. Most x87 code involves a significant amount of loading and storing, something like Mandelbrot iterations being a rare exception at the upper end of computational intensity. Some round constants can be stored as float
without precision loss, but runtime variables usually can't make any assumptions.
Compilers that always promoted temporaries / local variables to 80-bit even when they needed to be spilled/reloaded would create slower code (as @Davislor's answer seems to suggest would have been an option for gcc to implement). See below about when compilers actually round C double
temporaries and locals to IEEE binary64: any time they store/reload.
80-bit REAL10 /
long double
can't be a memory operand forfadd
/fsub
/fmul
/fdiv
/ etc. Those only support using 32 or 64-bit float/double memory operands.So to work with an 80-bit value from memory, you need an extra
fld
instruction. (Unless you want it in a register separately anyway, then the separatefld
isn't "extra"). On P5 Pentium, memory operands for instructions likefadd
have no extra cost, so if you already had to spill a value earlier, adding it from memory is efficient forfloat
/double
.And you need an extra x87 stack register to load it into.
fadd st5, qword [mem]
isn't available (only memory source with the top of the register stackst0
as an implicit destination), so memory operands didn't help much to avoidfxch
, but if you were close to filling up all 8st0..7
stack slots then having to load might require you to spill something else.fst
to storest0
to memory without popping the x87 stack is only available form32
/m64
operands (IEEE binary32float
/ IEEE binary64double
).fstp m32/m64/m80
to store-and-pop is used more often, but there are some use-cases where you want to store and keep using a value. Like in a computation where one result is also part of a later expression, or an array calc wherex[i]
depends onx[i-1]
.If you want to store 80-bit
long double
,fstp
is your only option. You might need usefld st0
to duplicate it, thenfstp
to pop that copy off. (You canfld
/fstp
with a register operand instead of memory, as well asfxch
to swap a register to the top of the stack.)
80-bit FP load/store is significantly slower than 32-bit or 64-bit, and not (just) because of larger cache footprint. On original Pentium, it's close to what you might expect from 32/64-bit load/store being a single cache access, vs. 80-bit taking 2 accesses (presumably 64 + 16 bit), but on later CPUs it's even worse.
Some perf numbers from Agner Fog's instruction tables for some 32-bit-only CPUs that were relevant in the era before SSE2 and x86-64. I don't have 486 numbers; Agner Fog only covers Pentium and earlier, and http://instlatx64.atw.hu/ only has CPUID from a 486, not instruction latencies. And its ppro / PIII latency/throughput numbers don't cover fld/fstp. It does show fsqrt
and fdiv
performance being slower for full 80-bit precision, though.
P5 Pentium (in-order pipelined dual issue superscalar):
fld m32/m64
(load float/double into 80-bit x87 ST0): 1 cycle, pairable with fxchg.fld m80
: 3 cycles, not pairable, and (unlikefadd
/fmul
which are pipelined), not overlapable with later FP or integer instructions.fst(p) m32/m64
(round 80-bit ST0 to float/double and store): 2 cycles, not pairable or overlapablefstp m80
: (note only available inpop
version that frees the x87 register): 3 cycles, not pairable
P6 Pentium Pro / Pentium II / Pentium III. (out-of-order 3-wide superscalar, decodes to 1 or more RISC-like micro-ops that can be scheduled independently)
(Agner Fog doesn't have useful latency numbers for FP load/store on this uarch)fld m32/m64
is 1 uop for the load port.fld m80
: 4 uops total: 2 ALU p0, 2 load portfst(p) m32/m64
2 uops (store-address + store-data, not micro-fused because that only existed on P-M and later)fstp m80
: 6 uops total: 2 ALU p0, 2x store-address, 2x store-data. I guess ALU extract into 64-bit and 16-bit chunks, as inputs for 2 stores.
Multi-uop instructions can only be decoded by the "complex" decoder on Intel CPUs (while simple instructions can decode in parallel, in patterns like 1-1-1 up to 4-1-1), so 4-uop fld m80
can lead to the previous cycle only producing 1 uop in the worst case. 6 uops for fstp m80
is more than 4, so decoding it requires the microcode sequencer. These decode bottlenecks could lead to bubbles in the front-end, as well as / instead of possible back-end bottlenecks. (P6-family CPUs, especially later ones with better back-end throughput, can bottleneck on instruction fetch/decode in the front-end if you aren't careful; see Agner Fog's microarch pdf. Keeping the issue/rename stage fed with 3 uops / clock can be hard, or 4 on Core2 and later.)
Agner doesn't have latencies or throughputs for FP loads/stores on original P6 (the "1 cycle" latency in a couple columns appears bogus). But it's probably similar to later CPUs, where m80
has worse throughput than you'd expect from the uop counts / ports.
- Pentium-M: 1 per 3 cycle throughput for
fstp m80
6 uops. vs. 1 uop / 1-per-clock forfst(p) m32/m64
, with micro-fusion of the store-address and store-data uops into a single fused-domain uop that can decode in any slot on the simple decoders. - Core 2 (Merom) / Nehalem:
fld m80
: 1 per 3 cycles (4 uops)fstp m80
1 per 5 cycles (7 uops: 3 ALU + 2x each store-address and store-data). Agner's latency numbers show 1 extra cycle for both load and store. - Pentium 4 (pre-Prescott):
fld m80
3+4 uops, 1 per 6 cycles vs. 1-uop pipelined.fstp m80
: 3+8 uops, 1 per 8 cycles vs. 2+0 uops with 2 to 3c throughput. Prescott is similar - Skylake:
fld m80
: 1 per 2 cycles (4 uops) vs. 1 per 0.5 cycles for m32/m64.fstp m80
: Still 7 uops, 1 per 5 cycles vs. 1 per clock for normal stores.
AMD K7/K8:
fld m80
: 7 m-ops, 1 per 4-cycle throughput (vs. 1 per 0.5c for 1 m-opfld m32/m64
).fstp m80
: 10 m-ops, 1 per 5-cycle throughput. (vs. 1 m-op fully pipelinedfst(p) m32/m64
). The latency penalty on these is much higher than on Intel, e.g. 16 cycle m80 loads vs. 4-cycle m32/m64.AMD Bulldozer:
fld m80
: 8 ops/14c lat/4c tput. (vs. 1 op/8c lat/1c tput for m32/m64). Interesting that even regularfloat
/double
x87 loads have half throughput of SSE2 / AVX loads.fstp m80
: 13 ops/9c lat/20c tput. (vs. 1 op/8c lat/1c tput). Piledriver/Steamroller are similar, that catastrophic store throughput of one per 20 or 19 cycles is real.
(Bulldozer-family's high load/store latencies for regular m32/m64 operands is related to having a "cluster" of 2 weak integer cores sharing a single FPU/SIMD unit. Ryzen abandoned this in favour of SMT in the style of Intel's Hyperthreading.)
There's definitely a chicken/egg effect here; if compilers did make code that regularly used stored/reloaded 80-bit temporaries in memory, CPU designers would spend some more transistors to make it more efficient at least on later CPUs. Maybe doing a single 16-byte unaligned cache access when possible, and grabbing the required 10 bytes from that.
Fun fact: fld m32/m64
can raise / flag an FP exception (#IA
) if the source operand is SNaN, but Intel's manual says this can't happen if the source operand is in double extended-precision floating-point format. So it can just stuff the bits into an x87 register without looking at them, unlike fld m32 / m64 where it has to expand the significand/exponent fields.
So ironically, on recent CPUs where the main use-case for x87 is for 80-bit, 80-bit float support is relatively even worse than on older CPUs. Obviously CPU designers don't put much weight on that and assume it's mostly used by old 32-bit binaries.
x87 and MMX are de-prioritized, though, e.g. Haswell made fxch
a 2-uop instruction, up from 1 in previous uarches. (Still 0 latency using register renaming, though. See Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? for some thoughts on that and fxch
.) And fmul
/ fadd
throughputs are only 1 per clock on Skylake, vs. 2 per clock for SSE/AVX vector or scalar add/mul/fma. On Skylake even some MMX integer SIMD instructions run on fewer execution ports than their XMM equivalents.
(If you're looking at the tables yourself, fbld
and fbstp m80bcd
are insanely slow because they convert from/to BCD, thus requiring conversion from binary to decimal with division by 10. Nevermind those, they're always microcoded).
yet everyone seemed to behave as though this had no value, and to positively celebrate the move to SSE2 where extended precision is no longer available.
No, what people celebrated was that FP became more deterministic. When and where you got 80-bit temporaries depended on compiler optimization decisions. You still can't compile most code on different platforms and get bitwise-identical results, but 80-bit x87 was one major source of difference between x86 and some other platforms.
Some people (e.g. writing unit tests) would rather have the same numbers everywhere than have more accurate results on x86. Often double
is more than enough, and/or the benefit was relatively small. In other cases, not so much, and extra temporary precision might help significantly.
Deterministic FP is a hard problem, but sought after by people for various reasons. e.g. trying to make multi-player games that don't need to send the whole state of the world over the network every simulation step, but instead can have everyone's simulation run in lockstep without drifting out of sync.
- https://stackoverflow.com/questions/328622/how-deterministic-is-floating-point-inaccuracy
- https://stackoverflow.com/questions/27149894/does-any-floating-point-intensive-code-produce-bit-exact-results-in-any-x86-base
x87 (thus C FLT_EVAL_METHOD == 2
) isn't the only thing that was / is problematic. C compilers that can contract x*y + z
into fma(x,y,z)
also avoid that intermediate rounding step.
For algorithms that didn't try to account for rounding at all, increased temporary precision usually only helped. But numerical techniques like Kahan summation that compensate for FP rounding errors can be defeated by extra temporary precision. So yes, there are definitely people that are happy that extra temporary precision went away, so their code works the way they designed it on more compilers.
When do compilers round:
Any time they need to pass a double
to a non-inline function, obviously they store it in memory as a double
. (32-bit calling conventions pass FP args on the stack, not in x87 registers unfortunately. They do return FP values in st0
. I think some more recent 32-bit conventions on Windows use XMM registers for FP pass/return like in 64-bit mode. Other OSes care less about 32-bit code and still just use the inefficient i386 System V ABI which is stack args all the way even for integer.)
So you can use sinl(x)
instead of sin(x)
to call the long double
version of the library function. But all your other variables and internal temporaries get rounded to their declared precision (normally double
or float
) around that function call, because the whole x87 stack is call-clobbered.
When compilers spill/reload variables and optimization-created temporaries, they do so with the precision of the C variable. So unless you actually declared long double a,b,c
, your double a,b,c
all get rounded to double
when you do x = sinl(y)
. That's somewhat predictable.
But even less predictable is when the compiler decides to spill something because it's running out of registers. Or when you compile with/without optimization. gcc -ffloat-store
does this store/reload variables to the declared precision between statements even when optimization is enabled. (Not temporaries within the evaluation of one expression.) So for FP variables, kind of like debug-mode code-gen where vars are treated similar to volatile
.
But of course this is crippling for performance unless your code is bottlenecked on something like cache misses for an array.
Extended precision long double
is still an option
(Obviously long double
will prevent auto-vectorization, so only use it if you need it when writing modern code.)
Nobody was celebrating removing the possibility of extended precision, because that didn't happen (except with MSVC which didn't give access to it even for 32-bit code where SSE wasn't part of the standard calling convention).
Extended precision is rarely used, and not supported by MSVC, but on other compilers targeting x86 and x86-64, long double
is the 80-bit x87 type. Apparently even when compiling for Windows, gcc and clang use 80-bit long double
.
Beware long double
is an ABI difference between MSVC and other x86 compilers. Usually gcc and clang are careful to match the calling convention, type widths, and struct layout rules of the platform. But they chose to make long double
a 10-byte type despite MSVC making it the same as 8-byte double
.
GCC has a -mlong-double-64/80/128
x86 option to set the width of long double
, and the docs warn that it changes the ABI.
ICC has a /Qlong-double
option that makes long double
an 80-bit type even on Windows.
So functions that interact with any kind of long double
are not ABI compatible between MSVC and other compilers (except GCC or ICC with special options); they're expecting a different sized object, so not even a single long double
works, except as a function return value in st0
where it's in a register already.
If you need more precision than IEEE binary64 double
, your options include so-called double-double (using a pair of double
values to get twice the significand width but the same exponent range), or taking advantage of x87 80-bit hardware. If 80-bit is enough, it's a useful option, and gives you extra range as well as significand precision, and only requires 1 instruction per computation).
(On CPUs with AVX, especially with AVX2 + FMA, for some loops double-double might outperform x87, being able to compute 4x double
in parallel. e.g. https://stackoverflow.com/questions/30573443/optimize-for-fast-multiplication-but-slow-addition-fma-and-doubledouble shows that double * double => double_double
(53x53 => 106-bit significand) multiplication can be as simple as high = a * b;
low = fma(a, b, -high);
and Haswell/Skylake can do that for 4 elements at once in 2 instructions (with 2-per-clock throughput for FP mul/FMA). But with double_double
inputs, it's obviously less cheap.)
Further fun facts:
The x87 FPU has precision-control bits that let you set how results in registers are rounded after any/every computation and load:
- to 80-bit long double: 64-bit significand precision. The
finit
default, and normal setting except with MSVC. - to 64-bit double: 53-bit significand precision. 32-bit MSVC sets this.
- to 24-bit float: 24-bit significand precision.
Apparently Visual C++'s CRT startup code (that calls main
) reduces x87 precision from 64-bit significand down to 53-bit (64-bit double). Apparently x86 (32-bit) VS2012 and later still does this, if I'm reading Bruce Dawson's article correctly.
So as well as not having an 80-bit FP type, 32-bit MSVC changes the FPU setting so even if you used hand-written asm, you'd still only have 53-bit significand precision, with only the wider range from having more exponent bits. (fstp m80
would still store in the same format, but the low 11 bits of the significand would always be zero. And I guess loading would have to round to nearest. Supporting this stuff might be why fld
decodes to multiple ALU uops on modern CPUs.)
I don't know if the motivation was to speed up fdiv
and fsqrt
(which it does for inputs that don't have a lot of trailing zeros in the significand), or if it's to avoid extra temporary precision. But it has the huge downside that it makes using extended precision impossible (or useless). It's interesting that GNU/Linux and MSVC made opposite decisions here.
Apparently the D3D9 library init function sets x87 precision to 24-bit significand single-precision float, making everything less precise for a speed gain on fdiv/fsqrt (and maybe fcos
/fsin
and other slow microcoded instructions, too.) But x87 precision settings are per-thread, so it matters which thread you call the init function from! (The x87 control word is part of the architectural state that context switches save/restore.)
Of course you can set it back to 64-bit significand with _controlfp_s
, so you could useful use asm, or call a function using long double
compiled by GCC, clang, or ICC. But beware the ABI differences: you can only pass it inputs as float
, double
, or integer, because MSVC won't ever create objects in memory in the 80-bit x87 format.
TL:DR: no, none of the major C compilers had an option to force promoting double
locals/temporaries to 80-bit even across spill/reload, only keeping them as 80-bit when it was convenient to keep them in registers anway.
Bruce Dawson's Intermediate Floating-Point Precision article is essential reading if you're wondering about whether extra precision for temporaries is helpful or harmful. He has examples that demonstrate both, and links to articles that conclude one way and the other.
Also very importantly, he has lots of specific details about what Visual Studio / MSVC actually does, and what gcc actually does, with x87 and with SSE/SSE2. Fun fact: MSVC before VS2012 used double
for float * float
even when using SSE/SSE2 instructions! (Presumably to match the numerical behaviour of x87 with its precision set to 53-bit significand; which is what MSVC does without SSE/SSE2.)
His whole series of FP articles is excellent; index in this one.
that would cost very little performance to use (since the hardware implemented it whether you used it or not)
This is an overstatement. Working with 80-bit long double in x87 registers has zero extra cost, but as memory operands 80-bit values are definitely 2nd-class citizens in both ISA design and performance. Most x87 code involves a significant amount of loading and storing, something like Mandelbrot iterations being a rare exception at the upper end of computational intensity. Some round constants can be stored as float without precision loss, but for runtime variables you usually can't assume that.
Compilers that always promoted temporaries / local variables to 80-bit even when they needed to be spilled/reloaded would create slower code (as @Davislor's answer seems to suggest would have been an option for gcc to implement). See below about when compilers actually round C double
temporaries and locals to IEEE binary64: any time they store/reload.
- 80-bit REAL10 / long double can't be a memory operand for fadd/fsub/fmul/fdiv/etc.; those only support 32- or 64-bit float/double memory operands. So to work with an 80-bit value from memory, you need an extra fld instruction. (Unless you want it in a register separately anyway, then the separate fld isn't "extra".) On P5 Pentium, memory operands for instructions like fadd have no extra cost, so if you already had to spill a value earlier, adding it from memory is efficient for float/double. And you need an extra x87 stack register to load it into. fadd st5, qword [mem] isn't available (a memory source only works with the top of the register stack st0 as an implicit destination), so memory operands didn't help much to avoid fxch; but if you were close to filling up all 8 st0..7 stack slots, having to load might require you to spill something else.
- fst to store st0 to memory without popping the x87 stack is only available for m32/m64 operands (IEEE binary32 float / IEEE binary64 double). fstp m32/m64/m80 to store-and-pop is used more often, but there are some use-cases where you want to store and keep using a value, like a computation where one result is also part of a later expression, or an array calc where x[i] depends on x[i-1].
- If you want to store an 80-bit long double, fstp is your only option. You might need to use fld st0 to duplicate it, then fstp to pop that copy off. (You can fld/fstp with a register operand instead of memory, as well as fxch to swap a register to the top of the stack.)
80-bit FP load/store is significantly slower than 32-bit or 64-bit, and not (just) because of larger cache footprint. On original Pentium, it's close to what you might expect from 32/64-bit load/store being a single cache access, vs. 80-bit taking 2 accesses (presumably 64 + 16 bit), but on later CPUs it's even worse.
Some perf numbers from Agner Fog's instruction tables for some 32-bit-only CPUs that were relevant in the era before SSE2 and x86-64. I don't have 486 numbers; Agner Fog's tables only cover Pentium and later, and http://instlatx64.atw.hu/ only has CPUID from a 486, not instruction latencies. And his PPro / PIII latency/throughput numbers don't cover fld/fstp; they do show fsqrt and fdiv performance being slower at full 80-bit precision, though.
P5 Pentium (in-order pipelined dual issue superscalar):
- fld m32/m64 (load float/double into 80-bit x87 ST0): 1 cycle, pairable with fxch.
- fld m80: 3 cycles, not pairable, and (unlike fadd/fmul, which are pipelined) not overlappable with later FP or integer instructions.
- fst(p) m32/m64 (round 80-bit ST0 to float/double and store): 2 cycles, not pairable or overlappable.
- fstp m80 (note: only available in the pop version that frees the x87 register): 3 cycles, not pairable.
P6 Pentium Pro / Pentium II / Pentium III (out-of-order 3-wide superscalar, decodes to 1 or more RISC-like micro-ops that can be scheduled independently):
(Agner Fog doesn't have useful latency numbers for FP load/store on this uarch.)
- fld m32/m64: 1 uop for the load port.
- fld m80: 4 uops total: 2 ALU p0, 2 load port.
- fst(p) m32/m64: 2 uops (store-address + store-data, not micro-fused because micro-fusion only existed on Pentium M and later).
- fstp m80: 6 uops total: 2 ALU p0, 2x store-address, 2x store-data. I guess ALU uops extract into 64-bit and 16-bit chunks, as inputs for the 2 stores.
Multi-uop instructions can only be decoded by the "complex" decoder on Intel CPUs (while simple instructions can decode in parallel, in patterns like 1-1-1 up to 4-1-1), so 4-uop fld m80
can lead to the previous cycle only producing 1 uop in the worst case. 6 uops for fstp m80
is more than 4, so decoding it requires the microcode sequencer. These decode bottlenecks could lead to bubbles in the front-end, as well as / instead of possible back-end bottlenecks. (P6-family CPUs, especially later ones with better back-end throughput, can bottleneck on instruction fetch/decode in the front-end if you aren't careful; see Agner Fog's microarch pdf. Keeping the issue/rename stage fed with 3 uops / clock can be hard, or 4 on Core2 and later.)
Agner doesn't have latencies or throughputs for FP loads/stores on original P6 (the "1 cycle" latency in a couple columns appears bogus). But it's probably similar to later CPUs, where m80
has worse throughput than you'd expect from the uop counts / ports.
- Pentium M: fstp m80: 6 uops, 1 per 3 cycle throughput, vs. 1 uop / 1 per clock for fst(p) m32/m64, with micro-fusion of the store-address and store-data uops into a single fused-domain uop that can decode in any slot on the simple decoders.
- Core 2 (Merom) / Nehalem: fld m80: 1 per 3 cycles (4 uops). fstp m80: 1 per 5 cycles (7 uops: 3 ALU + 2x each store-address and store-data). Agner's latency numbers show 1 extra cycle for both load and store.
- Pentium 4 (pre-Prescott): fld m80: 3+4 uops, 1 per 6 cycles, vs. 1-uop pipelined. fstp m80: 3+8 uops, 1 per 8 cycles, vs. 2+0 uops with 2 to 3c throughput. Prescott is similar.
- Skylake: fld m80: 1 per 2 cycles (4 uops) vs. 1 per 0.5 cycles for m32/m64. fstp m80: still 7 uops, 1 per 5 cycles, vs. 1 per clock for normal stores.
- AMD K7/K8: fld m80: 7 m-ops, 1 per 4-cycle throughput (vs. 1 per 0.5c for the 1 m-op fld m32/m64). fstp m80: 10 m-ops, 1 per 5-cycle throughput (vs. 1 m-op, fully pipelined, for fst(p) m32/m64). The latency penalty on these is much higher than on Intel, e.g. 16-cycle m80 loads vs. 4-cycle m32/m64.
- AMD Bulldozer: fld m80: 8 ops / 14c latency / 4c throughput (vs. 1 op / 8c lat / 1c tput for m32/m64). Interesting that even regular float/double x87 loads have half the throughput of SSE2 / AVX loads. fstp m80: 13 ops / 9c lat / 20c tput (vs. 1 op / 8c lat / 1c tput). Piledriver/Steamroller are similar; that catastrophic store throughput of one per 20 or 19 cycles is real.
(Bulldozer-family's high load/store latencies for regular m32/m64 operands is related to having a "cluster" of 2 weak integer cores sharing a single FPU/SIMD unit. Ryzen abandoned this in favour of SMT in the style of Intel's Hyperthreading.)
There's definitely a chicken/egg effect here; if compilers did make code that regularly used stored/reloaded 80-bit temporaries in memory, CPU designers would spend some more transistors to make it more efficient at least on later CPUs. Maybe doing a single 16-byte unaligned cache access when possible, and grabbing the required 10 bytes from that.
Fun fact: fld m32/m64 can raise / flag an FP exception (#IA) if the source operand is SNaN, but Intel's manual says this can't happen if the source operand is in double extended-precision floating-point format. So fld m80 can just stuff the bits into an x87 register without looking at them, unlike fld m32/m64, which has to expand the significand/exponent fields.
So ironically, on recent CPUs where the main use-case for x87 is for 80-bit, 80-bit float support is relatively even worse than on older CPUs. Obviously CPU designers don't put much weight on that and assume it's mostly used by old 32-bit binaries.
x87 and MMX are de-prioritized, though, e.g. Haswell made fxch
a 2-uop instruction, up from 1 in previous uarches. (Still 0 latency using register renaming, though. See Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? for some thoughts on that and fxch
.) And fmul
/ fadd
throughputs are only 1 per clock on Skylake, vs. 2 per clock for SSE/AVX vector or scalar add/mul/fma. On Skylake even some MMX integer SIMD instructions run on fewer execution ports than their XMM equivalents.
(If you're looking at the tables yourself, fbld
and fbstp m80bcd
are insanely slow because they convert from/to BCD, thus requiring conversion from binary to decimal with division by 10. Nevermind those, they're always microcoded).
yet everyone seemed to behave as though this had no value, and to positively celebrate the move to SSE2 where extended precision is no longer available.
No, what people celebrated was that FP became more deterministic. When and where you got 80-bit temporaries depended on compiler optimization decisions. You still can't compile most code on different platforms and get bitwise-identical results, but 80-bit x87 was one major source of difference between x86 and some other platforms.
Some people (e.g. writing unit tests) would rather have the same numbers everywhere than have more accurate results on x86. Often double
is more than enough, and/or the benefit was relatively small. In other cases, not so much, and extra temporary precision might help significantly.
Deterministic FP is a hard problem, but sought after by people for various reasons. e.g. trying to make multi-player games that don't need to send the whole state of the world over the network every simulation step, but instead can have everyone's simulation run in lockstep without drifting out of sync.
- https://stackoverflow.com/questions/328622/how-deterministic-is-floating-point-inaccuracy
- https://stackoverflow.com/questions/27149894/does-any-floating-point-intensive-code-produce-bit-exact-results-in-any-x86-base
x87 (thus C FLT_EVAL_METHOD == 2
) isn't the only thing that was / is problematic. C compilers that can contract x*y + z
into fma(x,y,z)
also avoid that intermediate rounding step.
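For instance (a small illustrative sketch, not from the original answer), whether x*y + z is contracted visibly changes the result; note that the first expression below may itself get contracted into an FMA depending on the compiler and flags such as gcc/clang's -ffp-contract:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 1.0 + 0x1p-27, y = 1.0 - 0x1p-27, z = -1.0;
        double separate = x * y + z;     /* product rounded to double first (if not contracted) */
        double fused    = fma(x, y, z);  /* single rounding: the exact -0x1p-54 term survives */
        printf("separate: %a\nfused:    %a\n", separate, fused);
        return 0;
    }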
For algorithms that didn't try to account for rounding at all, increased temporary precision usually only helped. But numerical techniques like Kahan summation that compensate for FP rounding errors can be defeated by extra temporary precision. So yes, there are definitely people that are happy that extra temporary precision went away, so their code works the way they designed it on more compilers.
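To make that concrete, here is a minimal sketch of Kahan summation in C (the standard algorithm, not code from the answer); the comments point out where extra temporary precision or contraction interferes:

    #include <stddef.h>

    /* Minimal Kahan (compensated) summation sketch.  The correction c is derived
       from the rounding error of the double additions; if the compiler keeps
       t, sum and y at 80-bit temporary precision, or contracts/reassociates the
       expressions, the correction no longer matches the error the stored double
       results actually incur, which is one way extra precision can defeat the trick. */
    double kahan_sum(const double *x, size_t n)
    {
        double sum = 0.0, c = 0.0;      /* c carries the running compensation */
        for (size_t i = 0; i < n; i++) {
            double y = x[i] - c;        /* apply the previous compensation */
            double t = sum + y;         /* low-order bits of y are lost in this add */
            c = (t - sum) - y;          /* algebraically zero; numerically the lost bits */
            sum = t;
        }
        return sum;
    }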
When do compilers round:
Any time they need to pass a double
to a non-inline function, obviously they store it in memory as a double
. (32-bit calling conventions pass FP args on the stack, not in x87 registers unfortunately. They do return FP values in st0
. I think some more recent 32-bit conventions on Windows use XMM registers for FP pass/return like in 64-bit mode. Other OSes care less about 32-bit code and still just use the inefficient i386 System V ABI which is stack args all the way even for integer.)
So you can use sinl(x)
instead of sin(x)
to call the long double
version of the library function. But all your other variables and internal temporaries get rounded to their declared precision (normally double
or float
) around that function call, because the whole x87 stack is call-clobbered.
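A hedged sketch of that usage (assuming a target/compiler where long double is the 80-bit x87 type, e.g. GCC or Clang on x86; the values are just illustrative):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = 0.1, b = 0.2;             /* declared double: spilled/reloaded as binary64 */
        long double x = (long double)a + b;  /* promoted: computed and kept as long double */
        long double s = sinl(x);             /* long double version of the library function */
        double d = sin(a + b);               /* double version; the a+b temporary may still be
                                                80-bit under FLT_EVAL_METHOD == 2, but everything
                                                around the call gets rounded back to double */
        printf("sinl: %.21Lg\nsin:  %.17g\n", s, d);
        return 0;
    }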
When compilers spill/reload variables and optimization-created temporaries, they do so with the precision of the C variable. So unless you actually declared long double a,b,c
, your double a,b,c
all get rounded to double
when you do x = sinl(y)
. That's somewhat predictable.
But even less predictable is when the compiler decides to spill something because it's running out of registers, or when you compile with/without optimization. gcc -ffloat-store does this deliberately: it stores/reloads variables at their declared precision between statements even when optimization is enabled (but not temporaries within the evaluation of one expression). So for FP variables it's kind of like debug-mode code-gen, where variables are treated similarly to volatile.
But of course this is crippling for performance unless your code is bottlenecked on something like cache misses for an array.
Extended precision long double
is still an option
(Obviously long double
will prevent auto-vectorization, so only use it if you need it when writing modern code.)
Nobody was celebrating removing the possibility of extended precision, because that didn't happen (except with MSVC which didn't give access to it even for 32-bit code where SSE wasn't part of the standard calling convention).
Extended precision is rarely used, and not supported by MSVC, but on other compilers targeting x86 and x86-64, long double
is the 80-bit x87 type. Apparently even when compiling for Windows, gcc and clang use 80-bit long double
.
Beware long double
is an ABI difference between MSVC and other x86 compilers. Usually gcc and clang are careful to match the calling convention, type widths, and struct layout rules of the platform. But they chose to make long double
a 10-byte type despite MSVC making it the same as 8-byte double
.
GCC has a -mlong-double-64/80/128
x86 option to set the width of long double
, and the docs warn that it changes the ABI.
ICC has a /Qlong-double
option that makes long double
an 80-bit type even on Windows.
So functions that interact with any kind of long double
are not ABI compatible between MSVC and other compilers (except GCC or ICC with special options); they're expecting a different sized object, so not even a single long double
works, except as a function return value in st0
where it's in a register already.
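If you want to check what a given compiler actually gives you for long double, the standard <float.h> macros are enough; a small sketch:

    #include <float.h>
    #include <stdio.h>

    /* LDBL_MANT_DIG is 64 for the x87 80-bit format, 53 if long double is just
       double (MSVC), and 113 / 106 for IEEE binary128 / double-double targets. */
    int main(void)
    {
        printf("sizeof(long double) = %zu\n", sizeof(long double));
        printf("LDBL_MANT_DIG       = %d\n", LDBL_MANT_DIG);
        printf("LDBL_MAX_EXP        = %d\n", LDBL_MAX_EXP);
        return 0;
    }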
If you need more precision than IEEE binary64 double, your options include so-called double-double (using a pair of double values to get twice the significand width but the same exponent range), or taking advantage of x87 80-bit hardware. If 80-bit is enough, it's a useful option: it gives you extra range as well as significand precision, and only requires one instruction per computation.
(On CPUs with AVX, especially with AVX2 + FMA, for some loops double-double might outperform x87, being able to compute 4x double in parallel. e.g. https://stackoverflow.com/questions/30573443/optimize-for-fast-multiplication-but-slow-addition-fma-and-doubledouble shows that double * double => double_double (53x53 => 106-bit significand) multiplication can be as simple as high = a * b; low = fma(a, b, -high); and Haswell/Skylake can do that for 4 elements at once in 2 instructions (with 2-per-clock throughput for FP mul/FMA). But with double_double inputs, it's obviously less cheap.)
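Spelled out as scalar C, a sketch of that exact-product step (the name two_prod is illustrative, not from a particular library; it needs a real hardware FMA to be exact):

    #include <math.h>

    /* Exact product of two doubles as an unevaluated sum hi + lo.
       Needs fma() to be a true fused multiply-add (e.g. build with -mfma or
       equivalent on x86); a non-fused fallback does not compute lo exactly. */
    void two_prod(double a, double b, double *hi, double *lo)
    {
        *hi = a * b;              /* correctly-rounded product */
        *lo = fma(a, b, -*hi);    /* exact rounding error of that product */
    }

With AVX2 + FMA the same two operations apply to whole vectors of 4 doubles at a time, which is where the 2-instruction figure above comes from.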
Further fun facts:
The x87 FPU has precision-control bits that let you set how results in registers are rounded after any/every computation and load:
- to 80-bit long double: 64-bit significand precision. The finit default, and the normal setting except with MSVC.
- to 64-bit double: 53-bit significand precision. 32-bit MSVC sets this.
- to 24-bit float: 24-bit significand precision.
Apparently Visual C++'s CRT startup code (that calls main
) reduces x87 precision from 64-bit significand down to 53-bit (64-bit double). Apparently x86 (32-bit) VS2012 and later still does this, if I'm reading Bruce Dawson's article correctly.
So as well as not having an 80-bit FP type, 32-bit MSVC changes the FPU setting so even if you used hand-written asm, you'd still only have 53-bit significand precision, with only the wider range from having more exponent bits. (fstp m80
would still store in the same format, but the low 11 bits of the significand would always be zero. And I guess loading would have to round to nearest. Supporting this stuff might be why fld
decodes to multiple ALU uops on modern CPUs.)
I don't know if the motivation was to speed up fdiv
and fsqrt
(which it does for inputs that don't have a lot of trailing zeros in the significand), or if it's to avoid extra temporary precision. But it has the huge downside that it makes using extended precision impossible (or useless). It's interesting that GNU/Linux and MSVC made opposite decisions here.
Apparently the D3D9 library init function sets x87 precision to 24-bit significand single-precision float, making everything less precise for a speed gain on fdiv/fsqrt (and maybe fcos
/fsin
and other slow microcoded instructions, too.) But x87 precision settings are per-thread, so it matters which thread you call the init function from! (The x87 control word is part of the architectural state that context switches save/restore.)
Of course you can set it back to 64-bit significand with _controlfp_s, so you could usefully use asm, or call a function using long double compiled by GCC, clang, or ICC. But beware the ABI differences: you can only pass it inputs as float, double, or integer, because MSVC won't ever create objects in memory in the 80-bit x87 format.
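A hedged MSVC-specific sketch of doing that (this assumes the documented _controlfp_s / _PC_64 / _MCW_PC interface from the MSVC CRT's <float.h>; the setting only affects x87 code and, as noted above, is per-thread):

    #include <float.h>

    void with_full_x87_precision(void)
    {
        unsigned int prev;
        _controlfp_s(&prev, _PC_64, _MCW_PC);  /* 64-bit significand for x87 temporaries */
        /* ... call asm, or GCC/Clang/ICC-compiled long double code, here ... */
        _controlfp_s(&prev, _PC_53, _MCW_PC);  /* restore the MSVC CRT default (53-bit) */
    }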
answered Apr 19 at 12:06, edited 9 hours ago – Peter Cordes
In most of the situations where 80-bit values should be used, it would be possible to keep them in registers. Spilling an 80-bit value and later reloading it would cost more than doing likewise with a 64-bit value, but if 80-bit register spills would be a meaningful factor in performance, 64-bit register spills would have an adverse effect anyway.
– supercat
Apr 19 at 21:30
@supercat: right, but if you want deterministic 80-bit, you have to use it at least for some local vars, so when they have to be spilled across other function calls, like for sin(x) + sin(y) + sin(x*y) or whatever, the spill/reload will be in 80-bit precision. And this is why compilers don't default to promoting locals and temporaries to 80-bit. (An option to do that would be possible, but GCC (still) doesn't have one. See the discussion on Davislor's answer about a quote from the g77 2.95 manual about the possibility.)
– Peter Cordes
Apr 19 at 21:53
If an ABI has a good blend of caller-saved and callee-saved registers, most calls to leaf functions shouldn't end up requiring register spills. Entering and returning from a non-leaf function will likely require a spill/restore for each local variable, but the total execution time for most non-leaf functions will be long enough that even if 80-bit loads and stores cost twice as much as 64-bit ones, that wouldn't meaningfully affect overall performance.
– supercat
Apr 19 at 22:07
@supercat: As I mentioned in this answer, the x87 stack is always call-clobbered in all calling conventions. (And it must be empty on call, and empty or holding a return value on ret.) Due to its nature, there's no sane way to make any of it call-preserved; it's a stack, and pushing a new value when the slot is already in use gives you a NaN-indefinite. You could in theory make a calling convention that stored the status word and figured out how many regs were in use, so it knew how many slow 80-bit fstps to use on entry, and how many slow 80-bit flds to do before returning, but yuck.
– Peter Cordes
Apr 19 at 22:12
the option in GCC is -mlong-double-64/80/128. There's also a warning under them saying that if you override the default value for your target ABI, this changes the size of structures and arrays containing long double variables, as well as modifying the function calling convention for functions taking long double. Hence they are not binary-compatible with code compiled without that switch.
– phuclv
Apr 20 at 4:19
I worked for Borland back in the days of the 8086/8087. Back then, both Turbo C and Microsoft C defined long double
as an 80-bit type, matching the layout of Intel's 80-bit floating-point type. Some years later, when Microsoft got cross-hardware religion (maybe at the same time as they released Windows NT?) they changed their compiler to make long double
a 64-bit type. To the best of my recollection, Borland continued to use 80 bits.
answered Apr 19 at 11:47, edited Apr 19 at 11:52 – Pete Becker (new contributor)
Microsoft x86 16-bit tool sets supported 80-bit long doubles. This was dropped in their x86 32-bit and x86 64-bit tool sets. Win32s for Windows 3.1 was released about the same time as NT 3.1. I'm not sure if Windows 3.1 winmem32 was released before or after win32s.
– rcgldr
Apr 19 at 13:46
C89 botched "long double" by violating the fundamental principle that all floating-point values passed to non-prototyped functions get converted to a common type. If it had specified that values of long double get converted to double except when wrapped using a special macro which would pass them in a struct __wrapped_long_double, that would have avoided the need for something like printf("%10.4f", x*y); to care about the types of both x and y [since the value isn't wrapped, the value would get passed to double regardless of the types of x and y].
– supercat
Apr 19 at 21:37
IIRC Delphi 5 (and probably also 3,4,6, and 7) had the "Extended" type which used all 80 bits of the FPU registers. The generic "Real" type could be made an alias of that, of the 64-bit Double, or of a legacy Borland soft float format.
– cyco130
Apr 20 at 11:04
@MichaelKarcher: The introduction of long int came fairly late in the development of C, and caused considerable problems. Nonetheless, I think a fundamental difference between the relationship of long int and int, vs. long double and double, is that every value within the range of int can be represented just as accurately by that type as by any larger type. Thus, if scale_factor will never exceed the range of int, there would generally be no reason for it to be declared as a larger type. On the other hand, if one writes double one_tenth=0.1;, ...
– supercat
Apr 20 at 16:23
...and then computes double x=one_tenth*y;, the calculation may be less precise than if one had written double x=y/10.0; or used long double scale_factor=0.1L. If neither wholeQuantity1 nor wholeQuantity2 would need to accommodate values outside the range of int, the expression wholeQuantity1+wholeQuantity2 will likely be of type int or unsigned. But in many cases involving floating-point, there would be some advantage to using longer-precision scale factors.
– supercat
Apr 20 at 16:29
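(A hedged sketch illustrating supercat's point, not part of the original comments:)

    #include <stdio.h>

    int main(void)
    {
        double y = 39.0;
        double one_tenth = 0.1;                 /* 0.1 already rounded to 53 bits */
        long double one_tenth_l = 0.1L;         /* 64-bit-significand approximation of 0.1 */

        double a = one_tenth * y;               /* two roundings: the constant, then the product */
        double b = y / 10.0;                    /* one rounding of the exact quotient */
        double c = (double)(one_tenth_l * y);   /* wider scale factor, rounded once at the end */

        /* On IEEE-754 doubles, a typically differs from b and c here. */
        printf("a=%.17g\nb=%.17g\nc=%.17g\n", a, b, c);
        return 0;
    }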
The Gnu Ada compiler ("Gnat") has supported 80-bit floating point as a fully-fledged built-in type with its Long_Long_Float
type since at least 1998.
Here's a Usenet argument from February 1999 between Ada compiler vendors and users about whether not supporting 80-bit floats is an Ada LRM violation. This was a huge deal for compiler vendors, since many government contracts couldn't use a non-conformant compiler, and the rest of the Ada userbase at that time viewed the Ada LRM as the next best thing to holy writ.*
To take a simple example, an x86 compiler that does not support 80-bit
IEEE extended arithmetic clearly violates B.2(10):
10 Floating point types corresponding to each floating
point format fully supported by the hardware.
and is thus non-conformant. It will still be fully validatable, since
this is not the sort of thing the validation can test with automated
tests.
...
P.S. Just to ensure that people do not regard the above as special
pleading for non-conformances in GNAT, please be sure to realize that
GNAT does support 80-bit float on the ia32 (x86).
Since this is a GCC-based compiler, it's debatable whether this is a revelation over the current top-rated answer, but I didn't see it mentioned.
* - It may look silly, but this user attitude kept Ada source code extremely portable. The only other languages that really can compare are ones that are effectively defined by the behavior of a single reference implementation, or under the control of a single developer.
answered Apr 19 at 13:35, edited Apr 19 at 14:10 – T.E.D.
Did any compilers ever make full use of extended precision (i.e. 80
bits in memory as well as in registers)? If not, why not?
Since any calculation inside the x87 FPU has 80-bit precision by default, any compiler that's able to generate x87 FPU code is already using extended precision.
I also remember using long double
even in 16-bit compilers for real mode.
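To illustrate (a hedged sketch, not from the original answer): with classic x87 code generation, C reports FLT_EVAL_METHOD == 2 and expression temporaries are wider than their declared type, which you can observe directly:

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        /* b is too small to survive rounding to double next to 1.0, but fits in an
           80-bit temporary.  With x87 code-gen keeping (a + b) in a register, r can
           come out as 0x1p-60; with strict double evaluation (e.g. SSE2 code-gen)
           it is 0.  The exact outcome depends on compiler, flags and optimization. */
        volatile double a = 1.0, b = 0x1p-60;
        double r = (a + b) - a;
        printf("FLT_EVAL_METHOD = %d, r = %a\n", (int)FLT_EVAL_METHOD, r);
        return 0;
    }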
A very similar situation existed in the 68k world, with FPUs like the 68881 and 68882 supporting 80-bit precision by default: any FPU code without special precautions would keep all register values at that precision. There was also a long double datatype.
On the other hand, Intel provided a solution with an extra eleven bits
of precision and five bits of exponent, that would cost very little
performance to use (since the hardware implemented it whether you used
it or not), and yet everyone seemed to behave as though this had no
value
Using long double would prevent contemporary compilers from ever doing the calculations in SSE (or similar) registers and instructions. And SSE is actually a very fast engine, able to fetch data in large chunks and make several computations in parallel every clock, while the x87 FPU is now just a legacy unit and not very fast. So deliberately using 80-bit precision today would certainly be a huge performance hit.
Right, I was talking about the historical context in which x87 was the only FPU on x86, so no performance hit from using it. Good point about 68881 being a very similar architecture.
– rwallace
Apr 19 at 8:12
answered Apr 19 at 7:23 – lvd
Comments are not for extended discussion; this conversation has been moved to chat.
– Chenmunka♦
yesterday
Can the title be edited to make it clear we're talking about a particular 80-bit implementation? Would 'x87' or 'Intel' be the best word to add?
– another-dave
10 hours ago