IEEE 754

The 'IEEE Standard for Binary Floating-Point Arithmetic' ('IEEE 754') is the most widely used standard for floating-point computation, and is followed by many CPU and FPU implementations. The standard defines formats for representing floating-point numbers (including negative zero and denormal numbers) and special values (infinities and NaNs), together with a set of ''floating-point operations'' that operate on these values. It also specifies four rounding modes and five exceptions (including when the exceptions occur and what happens when they do).
IEEE 754 specifies four formats for representing floating-point values: single-precision (32-bit), double-precision (64-bit), single-extended precision (≥ 43-bit, not commonly used) and double-extended precision (≥ 79-bit, usually implemented with 80 bits). Only 32-bit values are required by the standard; the others are optional. Many languages specify that IEEE formats and arithmetic be implemented, although sometimes it is optional. For example, the C programming language, which pre-dated IEEE 754, now allows but does not require IEEE arithmetic (the C float typically is used for IEEE single-precision and double uses IEEE double-precision).
The full title of the standard is 'IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985)', and it is also known as 'IEC 60559:1989, Binary floating-point arithmetic for microprocessor systems' (originally the reference number was IEC 559:1989).[1]
A later standard, 'IEEE 854-1987', covers "radix independent floating point", provided the radix is 2 or 10.

Contents
Anatomy of a floating-point number
Bit conventions used in this article
General layout
Exponent biasing
Cases
Single-precision 32 bit
A more complex example
Double-precision 64 bit
Comparing floating-point numbers
Rounding floating-point numbers
Extending the real numbers
Recommended functions and predicates
References
Revision of the standard
See also
External links

Anatomy of a floating-point number


The following is a description of the standard's format for floating-point numbers.
Bit conventions used in this article

Bits within a word of width W are indexed by integers in the range 0 to W−1 inclusive. The bit with index 0 is drawn on the right. The lowest indexed bit is usually the lsb (Least Significant Bit, the one that if changed would cause the smallest variation of the represented value).
General layout

The three fields in an IEEE 754 float

Binary floating-point numbers are stored in a sign-magnitude form in which the most significant bit is the sign bit, ''exponent'' is the biased exponent, and ''fraction'' is the significand without its most significant bit.
Exponent biasing

The exponent is biased by 2^{e-1} - 1, where ''e'' is the number of bits in the exponent field. See also Excess-''N''. Biasing is done because exponents have to be signed values in order to be able to represent both tiny and huge values, but two's complement, the usual representation for signed values, would make comparison harder. To solve this the exponent is biased before being stored, by adjusting its value to put it within an unsigned range suitable for comparison.
For example, to represent a number which has an exponent of 17, ''exponent'' is 17 + 2^{e-1} - 1. Assuming ''e'' = 8, ''exponent'' is equal to 17 + 128 − 1 = 144.
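The bias arithmetic above can be sketched in a few lines of Python. The function names exponent_bias and biased_exponent are illustrative only and do not come from the standard.

```python
def exponent_bias(e: int) -> int:
    """Bias for an exponent field that is e bits wide: 2^(e-1) - 1."""
    return 2 ** (e - 1) - 1


def biased_exponent(true_exponent: int, e: int) -> int:
    """Value stored in the exponent field for a given true exponent."""
    return true_exponent + exponent_bias(e)


print(exponent_bias(8))        # 127, the single-precision bias
print(biased_exponent(17, 8))  # 144, the worked example above
print(exponent_bias(11))       # 1023, the double-precision bias
```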
Cases

The most significant bit of the significand (not stored) is determined by the value of ''exponent''. If 0 < ''exponent'' < 2^{e} - 1, the most significant bit of the ''significand'' is 1, and the number is said to be ''normalized''. If ''exponent'' is 0, the most significant bit of the ''significand'' is 0 and the number is said to be ''de-normalized''. Three special cases arise:
# if ''exponent'' is 0 and ''fraction'' is 0, the number is ±0 (depending on the sign bit)
# if ''exponent'' = 2^{e} - 1 and ''fraction'' is 0, the number is ±infinity (again depending on the sign bit), and
# if ''exponent'' = 2^{e} - 1 and ''fraction'' is not 0, the number being represented is not a number (NaN).
This can be summarized as:
Type | Exponent | Fraction
Zeroes | 0 | 0
Denormalized numbers | 0 | non zero
Normalized numbers | 1 to 2^{e} - 2 | any
Infinities | 2^{e} - 1 | 0
NaNs | 2^{e} - 1 | non zero
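As a rough illustration, the cases summarized above can be written as a small Python function that takes the raw ''exponent'' and ''fraction'' fields as unsigned integers. The name classify and its parameters are assumptions made for this sketch.

```python
def classify(exponent: int, fraction: int, e: int = 8) -> str:
    """Classify a value from its raw exponent and fraction fields.

    e is the width of the exponent field in bits (8 for single precision).
    """
    all_ones = 2 ** e - 1
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == all_ones:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"


print(classify(0, 0))        # zero
print(classify(0, 1))        # denormalized
print(classify(127, 0))      # normalized
print(classify(255, 0))      # infinity
print(classify(255, 1))      # NaN
```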

Single-precision 32 bit

A single-precision binary floating-point number is stored in 32 bits.
Bit values for the IEEE 754 32-bit float 0.15625

The exponent is biased by 2^{8-1} - 1 = 127 in this case (Exponents in the range −126 to +127 are representable. See the above explanation to understand why biasing is done). An exponent of −127 would be biased to the value 0 but this is reserved to encode that the value is a denormalized number or zero. An exponent of 128 would be biased to the value 255 but this is reserved to encode an infinity or not a number (NaN). See the chart above.
For normalized numbers, the most common, ''exponent'' is the biased exponent and ''fraction'' is the significand minus the most significant bit.
The number has value v:
v = s × 2^{e} × m
Where
s = +1 (positive numbers) when the sign bit is 0
s = −1 (negative numbers) when the sign bit is 1
e = Exp − 127 (in other words the exponent is stored with 127 added to it, also called "biased with 127")
m = 1.fraction in binary (that is, the significand is the binary number 1 followed by the radix point followed by the binary bits of the fraction). Therefore, 1 ≤ m < 2.
In the example shown above, the sign is zero, the exponent is −3, and the significand is 1.01 (in binary, which is 1.25 in decimal). The represented number is therefore +1.25 × 2^{−3}, which is +0.15625.
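The formula for v can be checked with a short Python sketch. It handles normalized numbers only; decode_single is a hypothetical helper name, and 0x3E200000 is the bit pattern of the 0.15625 example (sign 0, Exp 0111 1100, fraction 010 followed by zeros). The result is compared against the standard-library struct module, assuming the platform's float format is IEEE 754.

```python
import struct


def decode_single(bits: int) -> float:
    """Decode a normalized single-precision bit pattern as v = s * 2^e * m."""
    sign_bit = bits >> 31
    exp_field = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if not 0 < exp_field < 255:
        raise ValueError("only normalized numbers are handled in this sketch")
    s = -1 if sign_bit else 1
    e = exp_field - 127            # remove the bias
    m = 1 + fraction / 2 ** 23     # 1.fraction in binary
    return s * 2.0 ** e * m


bits = 0x3E200000
print(decode_single(bits))  # 0.15625
# Cross-check against the platform's own interpretation of the same 32 bits.
print(struct.unpack(">f", bits.to_bytes(4, "big"))[0])
```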
Notes:
# Denormalized numbers are the same except that e = −126 and m is 0.fraction. (e is NOT −127 : The fraction has to be shifted to the right by one more bit, in order to include the leading bit, which is not always 1 in this case. This is balanced by incrementing the exponent to −126 for the calculation.)
# −126 is the smallest exponent for a normalized number
# There are two Zeroes, +0 (s is 0) and −0 (s is 1)
# There are two Infinities +∞ (s is 0) and −∞ (s is 1)
# NaNs may have a sign and a fraction, but these have no meaning other than for diagnostics; the first bit of the fraction is often used to distinguish ''signaling NaNs'' from ''quiet NaNs''
# NaNs and Infinities have all 1s in the Exp field.
# The positive and negative numbers closest to zero (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are
#: ±2^{−149} ≈ ±1.4012985 × 10^{−45}
# The positive and negative normalized numbers closest to zero (represented with the binary value 1 in the Exp field and 0 in the fraction field) are
#: ±2^{−126} ≈ ±1.175494351 × 10^{−38}
# The finite positive and finite negative numbers furthest from zero (represented by the value with 254 in the Exp field and all 1s in the fraction field) are
#: ±(1 − (1/2)^{24}) × 2^{128} [1] ≈ ±3.4028235 × 10^{38}
Here is the summary table from the previous section with some 32-bit single-precision examples:
Type | Exponent | Fraction | Value
Zero | 0000 0000 | 000 0000 0000 0000 0000 0000 | 0.0
One | 0111 1111 | 000 0000 0000 0000 0000 0000 | 1.0
Denormalized number | 0000 0000 | 100 0000 0000 0000 0000 0000 | 5.9 × 10^{−39}
Large normalized number | 1111 1110 | 111 1111 1111 1111 1111 1111 | 3.4 × 10^{38}
Small normalized number | 0000 0001 | 000 0000 0000 0000 0000 0000 | 1.18 × 10^{−38}
Infinity | 1111 1111 | 000 0000 0000 0000 0000 0000 | Infinity
NaN | 1111 1111 | non zero | NaN
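The bit patterns in the table can be reproduced with Python's struct module, assuming the platform's 32-bit float is IEEE 754 single precision. The helper single_from_fields is a name invented for this sketch; it joins the three fields and lets struct interpret the resulting 32 bits.

```python
import struct


def single_from_fields(exponent: str, fraction: str, sign: str = "0") -> float:
    """Build a single-precision value from field bit strings (spaces are ignored)."""
    bit_string = (sign + exponent + fraction).replace(" ", "")
    if len(bit_string) != 32:
        raise ValueError("expected 1 + 8 + 23 bits")
    return struct.unpack(">f", int(bit_string, 2).to_bytes(4, "big"))[0]


rows = {
    "zero": ("0000 0000", "000 0000 0000 0000 0000 0000"),
    "one": ("0111 1111", "000 0000 0000 0000 0000 0000"),
    "denormalized": ("0000 0000", "100 0000 0000 0000 0000 0000"),
    "large normalized": ("1111 1110", "111 1111 1111 1111 1111 1111"),
    "small normalized": ("0000 0001", "000 0000 0000 0000 0000 0000"),
    "infinity": ("1111 1111", "000 0000 0000 0000 0000 0000"),
}
for name, (exponent, fraction) in rows.items():
    print(f"{name:17s} {single_from_fields(exponent, fraction)!r}")
```

Python prints each value at double precision, so more digits appear than the rounded figures in the table.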

A more complex example

Bit values for the IEEE 754 32bit float -118.625

Let us encode the decimal number −118.625 using the IEEE 754 system.
# First we need to get the sign, the exponent and the fraction. Because it is a negative number, the sign is "1".
# Now, we write the number (without the sign; i.e. unsigned, no two's complement) using binary notation. The result is 1110110.101. We get the 101 after the radix point like this:
## 0.625 × 2 = 1.25, which means we write 1 after the radix point and move on with the remaining 0.25
## 0.25 × 2 = 0.5, which means we write 0 after the radix point and move on
## 0.5 × 2 = 1.00, which means we write 1 after the radix point, and we are also finished since there is no remainder left to work with
# Next, let's move the radix point left, leaving only a 1 at its left: 1110110.101 = 1.110110101 × 2^{6}. This is a normalized floating point number. The first 1 binary digit is dropped. The fraction is the part at the right of the radix point, filled with 0 on the right until we get all 23 bits. That is 11011010100000000000000.
# The exponent is 6, but we need to convert it to binary and bias it (so the most negative exponent is 0, and all exponents are non-negative binary numbers). For the 32-bit IEEE 754 format, the bias is 127 and so 6 + 127 = 133. In binary, this is written as 10000101.
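The steps above can be sketched in Python. The function encode_single is a hypothetical helper that follows the same procedure: take the sign, normalize so that a single 1 remains left of the radix point, generate fraction bits by repeated doubling, and bias the exponent. It truncates rather than rounds and does not handle zero, denormalized numbers, infinities, NaNs or out-of-range exponents, so it is exact only for values such as −118.625 that fit in 23 fraction bits.

```python
import struct


def encode_single(value: float) -> str:
    """Return the sign, exponent and fraction fields of a normalized value."""
    if value == 0 or value != value or abs(value) == float("inf"):
        raise ValueError("only non-zero finite values are handled in this sketch")
    sign = "1" if value < 0 else "0"
    magnitude = abs(value)
    # Move the radix point until exactly one 1 remains to its left.
    exponent = 0
    while magnitude >= 2:
        magnitude /= 2
        exponent += 1
    while magnitude < 1:
        magnitude *= 2
        exponent -= 1
    # Drop the leading 1, then produce 23 fraction bits by repeated doubling.
    remainder = magnitude - 1
    fraction_bits = []
    for _ in range(23):
        remainder *= 2
        bit = int(remainder)
        fraction_bits.append(str(bit))
        remainder -= bit
    return f"{sign} {exponent + 127:08b} {''.join(fraction_bits)}"


print(encode_single(-118.625))  # 1 10000101 11011010100000000000000
# Cross-check against the bits produced by the platform itself.
print(format(struct.unpack(">I", struct.pack(">f", -118.625))[0], "032b"))
```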
Double-precision 64 bit

The three fields in a 64bit IEEE 754 float

Double precision is essentially the same except that the fields are wider:
The fraction part is much larger, while the exponent is only slightly larger. The standard creators believed precision is more important than range.
NaNs and Infinities are represented with Exp being all 1s (2047).
For normalized numbers the exponent bias is +1023 (so e is Exp − 1023). For denormalized numbers the exponent is −1022 (the minimum exponent for a normalized number; it is not −1023 because normalized numbers have a leading 1 digit before the binary point and denormalized numbers do not). As before, both infinity and zero are signed.
Notes:
# The positive and negative numbers closest to zero (represented by the denormalized value with all 0s in the Exp field and the binary value 1 in the Fraction field) are
#: ±2^{−1074} ≈ ±5 × 10^{−324}
# The positive and negative normalized numbers closest to zero (represented with the binary value 1 in the Exp field and 0 in the fraction field) are
#: ±2^{−1022} ≈ ±2.2250738585072014 × 10^{−308}
# The finite positive and finite negative numbers furthest from zero (represented by the value with 2046 in the Exp field and all 1s in the fraction field) are
#: ±(1 − (1/2)^{53}) × 2^{1024} [1] ≈ ±1.7976931348623157 × 10^{308}
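On platforms where Python's float is an IEEE 754 double (an assumption of this sketch, though true of most current systems), the three limits above can be checked directly from their bit patterns. The helper double_from_bits is an illustrative name.

```python
import math
import struct
import sys


def double_from_bits(bits: int) -> float:
    """Interpret a 64-bit unsigned integer as a double-precision value."""
    return struct.unpack(">d", bits.to_bytes(8, "big"))[0]


# Exp field all 0s, fraction 1: the denormalized number closest to zero.
smallest_denormal = double_from_bits(0x0000000000000001)
# Exp field 1, fraction 0: the normalized number closest to zero.
smallest_normal = double_from_bits(0x0010000000000000)
# Exp field 2046, fraction all 1s: the finite number furthest from zero.
largest_finite = double_from_bits(0x7FEFFFFFFFFFFFFF)

print(smallest_denormal == math.ldexp(1.0, -1074))  # True
print(smallest_normal == sys.float_info.min)        # True
print(largest_finite == sys.float_info.max)         # True
```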

Comparing floating-point numbers


IEEE floating-point numbers are lexicographically ordered: the fields are laid out (sign, then biased exponent, then fraction) so that, if NaNs are excluded, values can be compared (>, <, or ==) as sign-and-magnitude integers.
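A minimal Python sketch of this property follows, assuming Python's float is an IEEE 754 double. The hypothetical function sign_magnitude_key reads the 64 bits as a sign bit plus a 63-bit magnitude and converts them to an ordinary signed integer; for non-NaN inputs the integers then order the same way as the floats, with +0 and −0 both mapping to 0.

```python
import struct


def sign_magnitude_key(x: float) -> int:
    """Map a non-NaN double to an integer with the same ordering."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    magnitude = bits & 0x7FFFFFFFFFFFFFFF   # biased exponent and fraction
    return -magnitude if bits >> 63 else magnitude


inf = float("inf")
values = [-inf, -1.5, -1e-320, -0.0, 0.0, 1e-320, 1.0, 1.5, inf]
for a in values:
    for b in values:
        assert (a < b) == (sign_magnitude_key(a) < sign_magnitude_key(b))
        assert (a == b) == (sign_magnitude_key(a) == sign_magnitude_key(b))
print("integer ordering matches floating-point ordering")
```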

Rounding floating-point numbers


The IEEE standard has four different rounding modes; the first is the default; the others are called ''directed roundings''.

★ 'Round to Nearest' – rounds to the nearest value; if the number falls midway it is rounded to the nearest value with an even (zero) least significant bit, which occurs 50% of the time (in IEEE 754r this mode is called ''roundTiesToEven'' to distinguish it from another round-to-nearest mode)

★ 'Round toward 0' – directed rounding towards zero

★ 'Round toward +∞' – directed rounding towards positive infinity

★ 'Round toward −∞' – directed rounding towards negative infinity.
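The tie-breaking rule of the default mode can be observed from Python, assuming its float is an IEEE 754 double and the hardware is left in round-to-nearest. Above 2^53 consecutive doubles are 2 apart, so adding 1 lands exactly midway between two representable values. This sketch exercises only the default mode; the Python standard library does not offer a way to select the directed roundings for binary floats.

```python
base = 2.0 ** 53  # from here on, consecutive doubles are 2 apart

# base + 1 is midway between base and base + 2; base has the even significand.
print(base + 1.0 == base)                # True: the tie went down to base
# base + 3 is midway between base + 2 and base + 4; base + 4 has the even one.
print((base + 2.0) + 1.0 == base + 4.0)  # True: the tie went up to base + 4
```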

Extending the real numbers


The IEEE standard employs (and extends) the affinely extended real number system, with separate positive and negative infinities. During drafting, there was a proposal for the standard to incorporate the projectively extended real number system, with a single unsigned infinity, by providing programmers with a mode selection option. In the interest of reducing the complexity of the final standard, however, the projective mode was dropped. The Intel 8087 and Intel 80287 floating point co-processors both support this projective mode.[2][3][4]

Recommended functions and predicates



★ Under some C compilers, copysign(x,y) returns x with the sign of y, so abs(x) equals copysign(x,1.0). This is one of the few operations which operates on a NaN in a way resembling arithmetic. The function copysign is new in the C99 standard.

★ −x returns x with the sign reversed. This is different from 0−x in some cases, notably when x is 0. So −(0) is −0, but the sign of 0−0 depends on the rounding mode.

★ scalb (y, N)

★ logb (x)

★ finite (x) a predicate for "x is a finite value", equivalent to −Inf < x < Inf

★ isnan (x) a predicate for "x is a nan", equivalent to "x ≠ x"

★ x <> y which turns out to have different exception behavior than NOT(x = y).

★ unordered (x, y) is true when "x is unordered with y", i.e., either x or y is a NaN.

★ class (x)

★ nextafter(x,y) returns the next representable value from x in the direction towards y
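Several of these functions have close counterparts in Python's math module, which makes their behaviour easy to try out. The sketch below assumes Python 3.9 or later (for math.nextafter) and the default rounding mode. Treating math.ldexp as the counterpart of scalb is this example's assumption, and unordered is a helper defined here rather than a library function.

```python
import math

nan = float("nan")


def unordered(x: float, y: float) -> bool:
    """True when either argument is a NaN (helper written for this sketch)."""
    return math.isnan(x) or math.isnan(y)


print(math.copysign(3.0, -0.0))        # -3.0: x with the sign of y
print(math.copysign(1.0, -(0.0)))      # -1.0: negation flips the sign of zero
print(math.copysign(1.0, 0.0 - 0.0))   # 1.0: 0 - 0 is +0 in round to nearest
print(math.isfinite(1e308), math.isfinite(math.inf))  # True False
print(math.isnan(nan), nan != nan)     # True True
print(unordered(1.0, nan), unordered(1.0, 2.0))       # True False
print(math.ldexp(1.5, 4))              # 24.0, that is 1.5 * 2**4
print(math.nextafter(1.0, 2.0) > 1.0)  # True: the next double above 1.0
```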

References


1.
2. John R. Hauser, Handling Floating-Point Exceptions in Numeric Programs, ACM Transactions on Programming Languages and Systems.
3. David Stevenson, IEEE Task P754: A proposed standard for binary floating-point arithmetic, Computer.
4. W. Kahan and J. Palmer, On a proposed floating-point standard, SIGNUM Newsletter.


Floating Point Unit by Jidan Al-Eryani

Revision of the standard


Note that the IEEE 754 standard is currently under revision. See: IEEE 754r

See also



minifloat for simple examples of properties of IEEE 754 floating point numbers

−0 (negative zero)

IEEE 754r working group to revise IEEE 754-1985.

Intel 8087 (early implementation effort)

Q (number format) For constant resolution

External links



IEEE 754 references

Let's Get To The (Floating) Point by Chris Hecker

What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg - a good introduction and explanation.

A compendium of non-intuitive behaviours of floating-point on popular architectures, with implications for program verification and testing

IEEE 854-1987 History and minutes

Web Based Converter

Another Web Based Converter

Java Applet Converter

Converter as MS-Windows program

An Interview with the Old Man of Floating-Point

Coprocessor.info : x87 FPU pictures, development and manufacturer information

Understanding IEEE 754 - "Try it yourself"

This article is provided by Wikipedia.
