1. Introduction

Computer memory has a finite capacity. Real numbers, in general, do not have a finite representation. For example, consider representing the rational number \frac{4}{3} as a decimal value: it is (1.333\overline{3})_{10} – it is impossible to write it down exactly! Since only a finite number of digits can be stored in computer memory, the reals have to be approximated in some fashion (rounded or chopped) when represented on a computer.

In this tutorial, we’ll go over the basic ideas of floating-point representation and learn the limits of floating-point accuracy in practical numerical computing.

2. Rounding and Chopping

There are two distinct ways of rounding off a real number x to a given number t of decimals.

In chopping, we simply leave off all decimals to the right of the tth digit. For example, 56.555 truncated to t=1 decimal place yields 56.5.

In rounding to nearest, we choose the number with t decimals that is closest to x. For example, consider rounding off 56.555 to t=1 decimal place. There are two candidates: 56.5 and 56.6. On the real number line, 56.5 is at a distance of |56.555 - 56.5|=0.55 \times 10^{-1} from 56.555, whilst 56.6 is at a distance of |56.6 - 56.555|=0.45 \times 10^{-1}. So, 56.6 is the nearest, and we round to 56.6.

Intuitively, we are comparing the part of the number to the right of the tth digit, which is 0.055 = 0.55 \times 10^{-1}, with 0.5 \times 10^{-1}. As 0.55 \times 10^{-1} > 0.5 \times 10^{-1}, we increment the tth digit, which is 5, by 1.

What if we wished to round off 56.549 to t=1 decimal place? Observe that the part to the right of the tth digit is 0.049 = 0.49 \times 10^{-1}. As 0.49 \times 10^{-1} < 0.5 \times 10^{-1}, we leave the tth digit unchanged. So, the result is 56.5, which is indeed nearest to 56.549.

In general, let p be the part of the number to the right of the tth digit (after the decimal point). If p>0.5 \times 10^{-t}, we increment the tth digit. If p<0.5 \times 10^{-t}, we leave the tth digit unchanged.

In the case of a tie, when x is equidistant to two decimal t-digit numbers, we raise the tth decimal if it is odd or leave it unchanged if it is even. In this way, the error in rounding off a decimal number is positive or negative equally often.

Let’s see some more examples of rounding and chopping to make sure this sinks in. Assume that we are interested in rounding off all quantities to t=3 decimal places:

    \begin{align*} \begin{array}{rcc} x & \text{Chopped to } 3 \text{ decimals} & \text{Rounded to } 3 \text{ decimals} \\ \hline 0.2397 & 0.239 & 0.240 \\ -0.2397 & -0.239 & -0.240 \\ 0.23750 & 0.237 & 0.238 \\ 0.23650 & 0.236 & 0.236 \end{array} \end{align*}
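We can reproduce both rules in Python with the decimal module, which lets us choose the rounding mode explicitly. This is a minimal sketch; the helper names chop and round_nearest are ours. We pass the inputs as strings so they enter the Decimal world exactly, without first being rounded to binary:

    from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

    def chop(x: str, t: int) -> Decimal:
        # Drop all digits to the right of the t-th decimal place.
        return Decimal(x).quantize(Decimal(10) ** -t, rounding=ROUND_DOWN)

    def round_nearest(x: str, t: int) -> Decimal:
        # Round to t decimals; ties go to the even digit.
        return Decimal(x).quantize(Decimal(10) ** -t, rounding=ROUND_HALF_EVEN)

    print(chop("56.555", 1))            # 56.5
    print(round_nearest("56.555", 1))   # 56.6
    print(round_nearest("0.23750", 3))  # 0.238 (tie: 7 is odd, so it is raised)
    print(round_nearest("0.23650", 3))  # 0.236 (tie: 6 is even, so it stays)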

The difference between chopping and rounding has real-world implications! The Vancouver Stock Exchange index began trading in 1982 with a base value of 1000.00. Although the underlying stocks were performing decently, the index had fallen to the low 500s by the end of 1983. A computer program recalculated the value of the index thousands of times each day, and the program used chopping instead of rounding to nearest. A rounded calculation gave a value of 1082.00.

3. Computer Number Systems

In our daily life, we represent numbers using the decimal number system with base 10. In the decimal system, if a digit d_{-k} stands k places to the right of the decimal point, the value it contributes is d_{-k} \cdot 10^{-k}. For example, the sequence of digits 4711.303 means:

    \begin{align*} 4 \cdot 10^3 + 7 \cdot  10^2 + 1 \cdot 10^1 + 1\cdot 10^0 + 3 \cdot 10^{-1} + 0 \cdot 10^{-2} + 3 \cdot 10^{-3} \end{align*}

In fact, any integer \beta \geq 2 (or \beta \leq -2) can be used as a base. Analogously, every real number x \in \mathbf{R} has a unique representation of the form:

    \begin{align*} x = d_n \beta^n + d_{n-1}\beta^{n-1} + \ldots + d_1 \beta^{1} + d_0 \beta^0 + d_{-1}\beta^{-1} + d_{-2} \beta^{-2} + \ldots \end{align*}

or compactly d_{n}\ldots d_1 d_0.d_{-1} d_{-2} \ldots, where the coefficients d_i, the digits in system \beta, are nonnegative integers such that 0 \leq d_i < \beta.

3.1. Conversion Algorithm Between Two Number Systems

Consider the problem of conversion between two number systems with different bases. For the sake of concreteness, let’s try to convert (250)_{10} to the binary format. Since 2^7 \leq 250 < 2^8, eight binary digits suffice, and we may write:

    \begin{align*} 250 = d_7 \cdot 2^7 + d_6\cdot 2^6 + d_5 \cdot 2^5 + d_4\cdot 2^4 + d_3\cdot 2^3 + d_2\cdot 2^2 + d_1\cdot 2^1 + d_0 \end{align*}

We can pull out a factor of 2 from each term, except the last, and equivalently write:

    \begin{align*} 250 = 2 \times (d_7 \cdot 2^6 + d_6\cdot 2^5 + d_5 \cdot 2^4 + d_4\cdot 2^3 + d_3\cdot 2^2 + d_2\cdot 2^1 + d_1) + d_0 \end{align*}

Intuitively, therefore, if we were to divide q_0=250 by 2, the expression in brackets, let’s call it q_1, is the quotient and d_0 is the remainder of the division. Similarly, since q_1 = 2 \times (d_7 \cdot 2^5 + d_6\cdot 2^4 + d_5 \cdot 2^3 + d_4\cdot 2^2 + d_3\cdot 2^1 + d_2) + d_1, dividing q_1 by 2 returns the expression in the brackets as the quotient, call it q_2, and d_1 as the remainder.

In general, if a is an integer given in base \alpha and we want to determine its representation in a number system with base \beta, we perform successive divisions of a by \beta: set q_0 = a and

    \begin{align*} q_k = \beta \times q_{k+1} + d_k, \quad k = 0,1,2,\ldots \end{align*}

where q_{k+1} is the quotient and d_k is the remainder in the kth division. The process stops when the quotient reaches 0.

Let’s look at the result of applying the algorithm to (250)_{10}:

    \begin{align*} \begin{array}{rclcl} q_0 = 250 &=& 2 \times 125 + 0 & \Rightarrow & d_0 = 0 \\ q_1 = 125 &=& 2 \times 62 + 1 & \Rightarrow & d_1 = 1 \\ q_2 = 62 &=& 2 \times 31 + 0 & \Rightarrow & d_2 = 0 \\ q_3 = 31 &=& 2 \times 15 + 1 & \Rightarrow & d_3 = 1 \\ q_4 = 15 &=& 2 \times 7 + 1 & \Rightarrow & d_4 = 1 \\ q_5 = 7 &=& 2 \times 3 + 1 & \Rightarrow & d_5 = 1 \\ q_6 = 3 &=& 2 \times 1 + 1 & \Rightarrow & d_6 = 1 \\ q_7 = 1 &=& 2 \times 0 + 1 & \Rightarrow & d_7 = 1 \end{array} \end{align*}

Therefore, (250)_{10} = (d_7 d_6 d_5 d_4 d_3 d_2 d_1 d_0)_2 = (1111\,1010)_2.
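Here’s a minimal Python sketch of the successive-division algorithm; the function name to_base is our own:

    def to_base(a: int, beta: int) -> str:
        # Convert a nonnegative integer a to base beta (2 <= beta <= 16).
        if a == 0:
            return "0"
        digits = []
        q = a
        while q > 0:
            q, d = divmod(q, beta)  # quotient q_{k+1}, remainder d_k
            digits.append("0123456789abcdef"[d])
        return "".join(reversed(digits))  # digits come out least significant first

    print(to_base(250, 2))  # 11111010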

3.2. Converting Fractions to Another Number System

If the real number a is not an integer, we write it as a = b + c, where b is the integer part and

    \begin{align*} c = c_{-1}\beta^{-1} + c_{-2}\beta^{-2} + c_{-3}\beta^{-3} + \ldots \end{align*}

is the fractional part, where c_{-1},c_{-2},c_{-3},\ldots are to be determined.

Observe that multiplying both sides by the base \beta yields

    \begin{align*} c \cdot \beta = \underbrace{c_{-1}}_{\text{Integer}} + \underbrace{c_{-2}\beta^{-1} + c_{-3}\beta^{-2} + \ldots}_{\text{Fractional part}} \end{align*}

an integer and a fractional portion. The integer portion is precisely c_{-1} – the first digit of c represented in base \beta. Successive digits are obtained as the integer parts when the remaining fractional part is repeatedly multiplied by \beta.

In general, if a fraction c must be converted to another number system with base \beta, we perform successive multiplications of c with \beta: set p_0 = c and

    \begin{align*} p_{k} \cdot \beta = c_{k-1} + p_{k-1}, \quad k = 0,-1,-2,\ldots \end{align*}

where c_{k-1} is the integer part and p_{k-1} the fractional part of the product p_k \cdot \beta. The process stops when the fractional part reaches 0 – though it may never do so, in which case the digits repeat forever.

Let’s look at an example of converting (0.15625)_{10} to the binary number system:

    \begin{align*} \begin{array}{rclcl} 0.15625 \times 2 &=& 0.3125 & \Rightarrow & c_{-1} = 0 \\ 0.3125 \times 2 &=& 0.625 & \Rightarrow & c_{-2} = 0 \\ 0.625 \times 2 &=& 1.25 & \Rightarrow & c_{-3} = 1 \\ 0.25 \times 2 &=& 0.5 & \Rightarrow & c_{-4} = 0 \\ 0.5 \times 2 &=& 1.0 & \Rightarrow & c_{-5} = 1 \end{array} \end{align*}

Therefore, (0.15625)_{10} = (0.0010\,1000)_2 in the binary system.
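The successive-multiplication algorithm is just as compact in Python. This is a sketch under the assumption that we stop after max_digits digits; the name frac_to_base is ours:

    def frac_to_base(c: float, beta: int, max_digits: int = 20) -> str:
        # Convert a fraction 0 <= c < 1 to base beta by successive multiplication.
        digits = []
        while c != 0 and len(digits) < max_digits:
            c *= beta
            d = int(c)           # integer part of the product is the next digit
            digits.append(str(d))
            c -= d               # carry the fractional part into the next step
        return "0." + "".join(digits)

    print(frac_to_base(0.15625, 2))  # 0.00101
    print(frac_to_base(0.1, 2))      # 0.00011001100110011001 (the pattern 0011 repeats forever)

The second call illustrates a fraction with no finite binary representation – exactly the situation behind the quirks we’ll meet in Section 4.1.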

3.3. Fixed and Floating-Point Representation

Computers are equipped to handle pieces of information of a fixed size called a word. Typical word lengths are 32 bits or 64 bits.

Early computers performed numerical calculations in a fixed-point number system. That is, real numbers were represented using a fixed number t of binary digits in the fractional part. If the word length of the computer is s+1 bits (including the sign bit), then only numbers in the bounded interval I=[-2^{s-t},2^{s-t}] are representable. This limitation is problematic: for example, even when x \in I and y \in I, the result x - y may fall outside I.

In a floating-point number system, a real number a is represented as:

    \begin{align*} a = \pm m \cdot \beta^e, \quad m = (0.d_1 d_2 d_3 \ldots)_\beta, \quad e \text{ an integer } \end{align*}

The fractional part m of the number is called the mantissa or significand, e is called the exponent, and \beta is the base. We may assume that d_1 \neq 0 (the representation is normalized), because if d_1 = 0, we can always shift the decimal point one place to the right and decrease the exponent by 1.

The mantissa m and the exponent e are limited by the fixed word length in computers: m is rounded off to a number \overline{m} with t digits, and the exponent e must lie in a certain range.

Thus, we can only represent floating-point numbers of the form:

    \begin{align*} a = \pm \overline{m} \cdot \beta^e, \quad \overline{m} = (0.d_1 d_2 d_3 \ldots d_t)_\beta, \quad e_{min} \leq e \leq e_{max}, \quad e \in \mathbf{Z} \end{align*}

A floating-point number system F is completely characterized by the base \beta, the precision t, and the numbers e_{min} and e_{max}. Since d_1 \neq 0, the set F contains, including the number 0,

    \begin{align*} 2 (\beta - 1) \beta^{t-1}(e_{max} - e_{min} + 1) + 1 \end{align*}

numbers. Intuitively, there are 2 choices for the sign. The digit d_1 can be chosen from the set \{1,2,\ldots,\beta - 1\}. Each of the following t-1 digits can be chosen from \{0,1,2,\ldots,\beta - 1\}. The exponent can be chosen in e_{max} - e_{min} + 1 ways. By the multiplication rule, there are 2 (\beta - 1) \beta^{t-1}(e_{max} - e_{min} + 1) distinguishable numbers. Including the number 0 gives us the expression above.

3.4. Why Are Floating-Point Numbers Inaccurate?

Consider a toy floating-point number system F(\beta = 2, t = 3, e_{min}=-1, e_{max}=2). The set F contains exactly 2 \cdot 16 + 1 = 33 numbers. The positive numbers in the set F are shown below:

    \begin{align*} \begin{array}{c|cccc} e = -1: & 0.25 & 0.3125 & 0.375 & 0.4375 \\ e = 0: & 0.5 & 0.625 & 0.75 & 0.875 \\ e = 1: & 1 & 1.25 & 1.5 & 1.75 \\ e = 2: & 2 & 2.5 & 3 & 3.5 \end{array} \end{align*}

It is apparent that not all real numbers, for instance those in \mathbf{[1,2]}, are present in \mathbf{F}. Moreover, the floating-point numbers are not equally spaced: the spacing jumps by a factor of \beta at each power of \beta.
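To make this concrete, here is a quick Python sketch that enumerates the positive members of the toy system exactly, using Fraction to avoid any rounding; the variable names are ours:

    from fractions import Fraction

    beta, t, e_min, e_max = 2, 3, -1, 2

    # Normalized mantissas 0.(d1 d2 d3)_2 with d1 != 0 correspond to
    # d / beta**t for d in [beta**(t-1), beta**t).
    positives = sorted(
        Fraction(d, beta**t) * Fraction(beta)**e
        for e in range(e_min, e_max + 1)
        for d in range(beta**(t - 1), beta**t)
    )
    print(len(positives))                     # 16 -> 2*16 + 1 = 33 with signs and 0
    print([float(p) for p in positives][:8])  # [0.25, 0.3125, 0.375, 0.4375, 0.5, 0.625, 0.75, 0.875]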

The spacing of the floating-point numbers is characterized by the machine epsilon \mathbf{\epsilon_M}, which is the distance from 1.0 to the next larger floating-point number.

If a real number x is in the range of the floating-point system, the obvious thing to do is to round off x to \overline{x}, where \overline{x}=fl(x) is the floating-point number in F that is closest to x. It should already be clear that representing \mathbf{x} by \mathbf{fl(x)} generally introduces an error.

One interesting question is: how large is this error? It would be great if we could guarantee that the round-off error at no point exceeds a certain amount. Everybody loves a guarantee! In other words, we seek an upper bound for the relative error:

    \begin{align*} \frac{|fl(x) - x|}{|x|} \end{align*}

Recall from the earlier discussion that when rounding m to t decimals, we leave the tth decimal unchanged if the part p to the right of the tth decimal is smaller than \frac{1}{2} \times 10^{-t}, and raise the tth decimal by 1 otherwise. Here, we are working with a generic base \beta instead of the decimal base 10. Consequently, the round-off error in the mantissa is bounded by:

    \begin{align*} |\overline{m} - m| \leq \frac{1}{2}\cdot \beta^{-t} \end{align*}

The relative round-off error in x is thus bounded by:

    \begin{align*} \frac{|fl(x) - x|}{|x|} &= \frac{|\overline{m} - m| \cdot \beta^e}{m \cdot \beta^e}\\ &\leq \frac{\frac{1}{2}\cdot \beta^{-t} \cdot \beta^e}{(0.1)_\beta \beta^e} \quad \quad \{ m \geq (0.1)_\beta \} \\ &= \frac{1}{2} \cdot \beta^{-t + 1} \end{align*}

Modern floating-point standards such as IEEE 754 guarantee this upper bound. It is called the rounding unit (or unit roundoff), denoted by \mathbf{u}.
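For IEEE double precision, there are effectively 53 significant binary digits (as we’ll see in the next section, a leading 1 bit is implicit), so the rounding unit works out to u = \frac{1}{2} \cdot 2^{1-53} = 2^{-53}, half the machine epsilon. A quick check in Python:

    import sys

    eps = sys.float_info.epsilon  # machine epsilon: gap from 1.0 to the next float, 2**-52
    u = eps / 2                   # rounding unit for double precision, 2**-53
    print(eps, u)                 # 2.220446049250313e-16 1.1102230246251565e-16
    print(1.0 + u == 1.0)         # True: 1.0 + u rounds back to 1.0 (tie goes to even)
    print(1.0 + eps == 1.0)       # False: eps is the smallest gap that survives addition to 1.0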

4. IEEE Floating-point Standard in a Nutshell

Actual computer hardware implementations of floating-point systems differ from the toy system we just designed. Most current computers conform to the IEEE 754 standard for binary floating-point arithmetic. There are two basic formats – single and double precision, requiring 32-bit and 64-bit storage, respectively.

In single precision, a floating-point number a is stored as the sign s (1 bit), the exponent e (8 bits), and the mantissa m (23 bits).

In double precision, 11 bits are allocated to the exponent e, whereas 52 bits are allocated to the mantissa m. The exponent is stored in biased form as b=e+1023.

    \begin{align*} \underbrace{s}_{1 \text{ bit}} \quad \underbrace{b = e + 1023}_{11 \text{ bits}} \quad \underbrace{m}_{52 \text{ bits}} \end{align*}

The value v of the floating-point number a in the normal case is:

    \begin{align*} v = (-1)^s (1.m)_2 \cdot 2^e, \quad \quad e_{min} \leq e \leq e_{max} \end{align*}

Note that the digit before the binary point is always 1, similar to the scientific notation we studied in high school. Since this leading 1 need not be stored, one bit is gained for the mantissa.
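We can peek at the three fields of any double from Python by reinterpreting its bits with the struct module. This is a minimal sketch; the helper name fields is ours:

    import struct

    def fields(x: float):
        # Split a double into sign, unbiased exponent and the 52 stored mantissa bits.
        b = struct.unpack(">Q", struct.pack(">d", x))[0]  # raw 64-bit pattern
        s = b >> 63
        e = ((b >> 52) & 0x7FF) - 1023  # remove the bias
        m = b & ((1 << 52) - 1)
        return s, e, m

    s, e, m = fields(6.25)
    print(s, e)           # 0 2
    print(1 + m / 2**52)  # 1.5625, i.e. 6.25 = (1.1001)_2 * 2**2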

4.1. Quirks in Floating-Point Arithmetic

Consider the following comparison:

(0.10 + 0.20) == 0.30

The result of this logical comparison is false. This behavior is often taken as evidence that floating-point arithmetic is broken, but it is in fact expected: none of 0.10, 0.20 and 0.30 can be represented exactly in binary. Let’s take a deeper look at what’s going on.

Let’s put 0.10 in the double-precision format. Because 0.10 is positive, the sign bit s=0.

0.10 in base-2 scientific notation can be written as (-1)^0 (1 + m) 2^{e}. This means we must factor 0.10 into a number (1 + m) in the range 1 \leq 1 + m < 2 and a power of 2. If we divide 0.10 by different powers of 2, we get:

    \begin{align*} 0.10 / 2^{-1} &= 0.20\\ 0.10 / 2^{-2} &= 0.40\\ 0.10 / 2^{-3} &= 0.80\\ 0.10 / 2^{-4} &= 1.60 \end{align*}

Therefore, 0.10 = 1.60 \times 2^{-4}. The exponent is stored as e + 1023=-4 + 1023 = 1019, so the bit pattern in the exponent part is (011\, 1111\, 1011)_2.

The mantissa m is the fractional part 0.60 in binary form. Successive multiplication by 2 quickly yields m = (0.1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001 \ldots)_2. However, the double-precision format allocates t=52 bits to the mantissa, so we must round off m to 52 digits. The part p after the 52nd digit, (0.1001 \ldots)_2 \times 2^{-52}, exceeds \frac{1}{2} \beta^{-t} = (0.1)_2 \times 2^{-52}. So, we raise the 52nd digit by 1. Thus, the rounded mantissa is \overline{m}= (0.1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1010)_2.

Finally, we put the binary strings in the correct order. So, 0.10 in the IEEE double-precision format is:

    \begin{align*} \underbrace{0}_{s}\,\underbrace{011\, 1111\, 1011}_{e+1023}\,\underbrace{1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1001\, 1010}_{\overline{m}} \end{align*}
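We can verify this 64-bit pattern directly in Python, again by reinterpreting the bits:

    import struct

    bits = struct.unpack(">Q", struct.pack(">d", 0.1))[0]
    print(f"{bits:064b}")
    # 0011111110111001100110011001100110011001100110011001100110011010
    # = 0 | 011 1111 1011 | 1001 ... 1001 1010   (sign | e + 1023 | rounded mantissa)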

In base 10, this machine number is approximately 0.10000000000000000555. In a similar fashion, the closest machine number to 0.2 in the floating-point system is approximately 0.2000000000000000111. Therefore, fl(0.1) + fl(0.2) > 0.30. On the right-hand side, the closest machine number to 0.3 is approximately 0.29999999999999998889. So, fl(0.30) < 0.30.
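Python’s decimal module can display the exact values of these machine numbers, and math.isclose shows the standard remedy: compare floating-point results with a tolerance, not for equality:

    from decimal import Decimal
    import math

    print(0.1 + 0.2 == 0.3)  # False
    print(Decimal(0.1))      # 0.1000000000000000055511151231257827021181583404541015625
    print(Decimal(0.3))      # 0.299999999999999988897769753748434595763683319091796875
    print(math.isclose(0.1 + 0.2, 0.3))  # True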

To summarize, algebraically equivalent statements are not necessarily numerically equivalent.

5. Conclusion

In this article, we’ve learned how the rules of chopping and rounding to nearest work. We developed an algorithm for conversion between number systems. We also learnt that any floating-point system is characterized by a quadruple F(\beta,t,e_{min},e_{max}). For example, the IEEE double-precision format is given as F(\beta=2,t=52,e_{min}=-1022,e_{max}=1023). Because computers have finite memory capacity, the set of machine numbers \mathbf{M} is only a subset of the real numbers \mathbf{R}. The spacing of the machine numbers near 1.0 is called the machine epsilon.

Further, any real number x within range is rounded off to the nearest machine number on a computer. The IEEE standard guarantees that the relative round-off error is never more than the rounding unit \mathbf{u}. As programmers, we need to take proper care when doing floating-point operations and comparisons.
