## 1. Introduction

Computer memory has a finite capacity. Real numbers in , do not, in general, however, have finite uniform representation. For example, consider representing the rational number as a decimal value: it is – it is impossible to do so exactly! Since only a finite number of digits can be stored in computer memory, the reals have to be approximated in some fashion (rounded or truncated), when represented on a computer.

In this tutorial, we’ll go over the basic ideas of floating-point representation and learn the limits of floating-point accuracy, when doing practical numerical computing.

## 2. Rounding and Chopping

There are two distinguishable ways of rounding off a real number to a given number of decimals.

**In chopping, we simply leave off all decimals to the right of the th digit**. For example, truncated to decimal place yields .

**In rounding to nearest, we choose a number with decimals, which is the closest to **. For example, consider rounding off to decimal place. There are two possible candidates: and . On the real number line, is at a distance of from , whilst is at a distance of . So, is the nearest and we round to .

Intuitively, we are comparing the fractional part of the number to the right of the th digit, which is with . As , we incremented the th digit, which is by .

What if, we wished to round off to decimal places? Observe that, the fractional part is . As , we leave the th digit unchanged. So, the result is , which is indeed nearest to .

In general, let be the fractional part of the number to the right of th digit (after the decimal point). If , we increment the th digit. If , we leave the th digit unchanged.

In the case of a tie, when is equidistant to two decimal -digit numbers, we raise the th decimal if it is odd or leave it unchanged if it is even. In this way, the error in rounding off a decimal number is positive or negative equally often.

Let’s see some more examples of rounding and chopping, to make sure, this sinks in. Assume that, we are interested in rounding off all quantities to decimal places:

x | Rounding | Chopping |
---|---|---|

0.2397 | 0.240 | 0.239 |

0.23750 | 0.238 | 0.237 |

0.23650 | 0.236 | 0.236 |

0.23652 | 0.237 | 0.236 |

3.142 | 3.141 |

The difference between chopping and rounding has real-world implications! The Vancouver stock exchange index began trading in 1982, with a base value of . Although the underlying stocks were performing decent, the index began hitting the low 500s at the end of 1983. A computer program re-calculated the value of the index thousands of times each day, and the program used chopping instead of rounding to the nearest. A rounded calculation gave a value of .

## 3. Computer Number Systems

In our daily life, we represent numbers using the decimal number system with base . In the decimal system, we’re aware, that if a digit stands digits to the right of the decimal point, the value it contributes is . For example the sequence of digits means:

In fact any integer (or can be used as a base. Analogously, every real number has a unique representation of the form:

or compactly , where the coefficients , the digits in system , are positive integers such that .

### 3.1. Conversion Algorithm Between Two Number Systems

Consider the problem of conversion between two number systems with different bases. For the sake of concreteness, let’s try to convert to the binary format. We may write:

We can pull out a factor of from each term, except the last, and equivalently write:

Intuitively, therefore, if we were to divide by , the expression in brackets, let’s call it , is the quotient and is the remainder of the division. Similarly, since , division by would return the expression in the brackets as the quotient; call it , and as the remainder.

In general, if is an integer with base and we want to determine it’s representation in a number system with base , we perform successive divisions of with : set and

is the quotient and is the remainder in the division.

Let’s look at the result of applying the algorithm to :

0 | 250 | 250/2 = 125 | 0 |

1 | 125 | 125/2 = 62 | 1 |

2 | 62 | 62/2 = 31 | 0 |

3 | 31 | 31/2 = 15 | 1 |

4 | 15 | 15/2 = 7 | 1 |

5 | 7 | 7/2 = 3 | 1 |

6 | 3 | 3/2 = 1 | 1 |

7 | 1 | 1/2 = 0 | 1 |

Therefore, .

### 3.2. Converting Fractions to Another Number System

If the real number is not an integer, we write it as , where is the integer part and

is the fractional part, where are to be determined.

Observe that, multiplying both sides by the base yields,

an integer and a fractional portion. The integer portion is precisely – the first digit of represented in base . Consecutive digits are obtained as the integer parts, when successively multiplying by .

In general, if a fraction must be converted to another number system with base , we perform successive multiplications of with : set and

Let’s look at an example of converting to binary number system:

0 | 0.15625 | 0 | 0.3125 | |

-1 | 0.3125 | 0 | 0.625 | |

-2 | 0.625 | 1 | 0.25 | |

-3 | 0.25 | 0 | 0.50 | |

-4 | 0.50 | 1 | 0.00 |

Therefore, in the binary system.

### 3.3. Fixed and Floating-Point Representation

**Computers are equipped to handle pieces of information of a fixed size called a word**. Typical word lengths are 32-bits or 64-bits.

Early computers made numerical calculations in a fixed-point number system. That is, real numbers were represented using a fixed number of binary digits in the fractional part. If the word-length of the computer is bits, (including the sign bit), then only numbers in the bounded interval are permitted. This limitation is problematic, since, for example, even when and , it is possible that is outside the bounds of the interval .

In a floating-point number system, a real number is represented as:

The fractional part of the number is called the mantissa or significand. is called the exponent and is the base. It’s clear that , because if , then we can always decrease the exponent by and shift the decimal point one place to the right.

The mantissa and the exponent are limited by the fixed word length in computers. is rounded off to a number with digits and the exponent lies in a certain range.

Thus, we can only represent floating-point numbers of the form:

**A floating-point number system is completely characterized by the base , precision , the numbers and **. Since , the set , contains, including the number ,

numbers. Intuitively, there are choices for the sign. The digit can be chosen from the set . Each of the successive digits can be chosen from . The exponent can be chosen from numbers. By the multiplication rule, there are distinguishable numbers. Including the number gives us the expression above.

### 3.4. Why Are Floating-Point Numbers Inaccurate?

Consider a toy floating-point number system . The set contains exactly numbers. The positive numbers in the set are shown below:

**It is apparent that not all real numbers, for instance in, are present in **. Moreover, not all floating-point numbers are equally spaced; the spacing jumps by a factor of at each power of .

**The spacing of the floating-point numbers is characterized by the machine epsilon **, which is the distance from to the next largest floating-point number.

If a real number is in the range of the floating-point system, the obvious thing to do is to round off to , where is the floating-point number in , that is closest to . **It should already be clear that representing by introduces an error**.

One interesting question is, how large is this error? It would be great if we could guarantee, that the round-off error at no point exceeds a certain amount. Everybody loves a guarantee! In other words, we seek an upper bound for the relative error:

Recall, from the earlier discussion, that when rounding to decimals, we leave the th decimal unchanged, if the part , of the number to right of the th decimal, is smaller than , else raise the th decimal by . Here, we are working with the generic base instead of the decimal base . Consequently, the round-off error in the mantissa is bounded by:

The relative round-off error in is thus bounded by:

Modern floating-point standards such as the IEEE guarantee this upper bound. **This upper bound is called the rounding unit, denoted by **.

## 4. IEEE Floating-point Standard in a Nutshell

Actual computer hardware implementations of floating-point systems differ from the toy system, we just designed. Most current computers conform to the IEEE 754 standard for binary floating-point arithmetic. There is two main basic formats – single and double precision, requiring 32-bit and 64-bit storage.

In single-precision, a floating-point number is stored as the sign ( bit), the exponent ( bits), the mantissa ( bits).

In double-precision, bits are allocated to the exponent , whereas bits are allocated to the mantissa . The exponent is stored as .

The value of the floating-point number in the normal case is:

Note, that the digit before the binary point is always , similar to the scientific notation we studied in high school. In that way, bit is gained for the mantissa.

### 4.1. Quirks in Floating-Point Arithmetic

Consider the following comparison:

`(0.10 + 0.20) == 0.30`

The result of this logical comparison is false. This abrupt behavior is expected because the floating-point system is broken. However, let’s take a deeper look at what’s going on.

Let’s put in the double-precision format. Because is positive, the sign bit .

in base- scientific notation can be written as . This means we must factor into a number in the range and a power of . If we divide by different powers of , we get:

Therefore, . The exponent is stored as , so the bit pattern in the exponent part is .

The mantissa is the fractional part in the binary form. Successive multiplication by quickly yields . However, the double-precision format allocates bits to the mantissa, so we must round off to digits. The fractional part , after the nd digit, exceeds . So, we raise the th decimal by . Thus, the rounded mantissa is .

Finally, we put the binary strings in the correct order. So, in the IEEE double-precision format is:

This machine number is approximately in base . In a similar fashion, the closest machine number to in the floating-point system is approximately . Therefore, . On the right hand side, the closest machine number to is . So, .

To summarize, **algebraically equivalent statements are not necessarily numerically equivalent**.

## 5. Conclusion

In this article, we’ve learned how the rules of chopping and rounding to the nearest work. We developed an algorithm for conversion between number systems. We also learnt that, any floating-point system is characterized by a quadruple ). For example, the IEEE double-precision format is given as . **Because computers have finite memory capacity, the set of machine numbers, are only a subset of the real numbers **. The spacing between machine numbers in technical speak is called the machine epsilon.

Further, any real number is always rounded off to the nearest machine number on a computer. The IEEE standards guarantee that the relative roundoff error is no more than a certain amount. Moreover, as programmers, we need to take proper care when doing floating-point operations.