Floating-Point Arithmetic

Floating-point arithmetic is a way of representing real numbers in a computer. The representation is not exact, but it is fast and efficient, and it is a fundamental concept in numerical computing.

Almost all computation in a computer is carried out on vectors and matrices, which are stored as arrays of numbers, and those numbers are usually real numbers. To represent real numbers in a computer, we need a representation that trades exactness for speed and efficiency. This is called floating-point arithmetic.

This concept is so fundamental to numerical computing that it is worth spending some time to understand it. There is a famous article on floating-point arithmetic by David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg, 1991). It is an excellent read and I highly recommend it, but it is long and not easy to digest.

In this article, we will discuss the basics of floating-point arithmetic, and how it is used in Python.

Floating-Point 101

Computers cannot represent arbitrary real numbers. They can only handle bounded sequences of binary bits that can be interpreted as bounded integers. Therefore, numerical analysis could be defined as: the study of algorithms for mathematical analysis that rely on calculations using bounded integers only.

In a computer, integers are stored using a system called two's complement. In this system, $b$ bits are used to represent numbers between $-2^{b-1}$ and $2^{b-1}-1$.
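
To see these bounds concretely, we can query them with NumPy's np.iinfo (a quick sketch; any Python environment with NumPy will do):

import numpy as np

# 8-bit signed integers: -2^7 ... 2^7 - 1
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)
# -128 127

# 32-bit signed integers: -2^31 ... 2^31 - 1
print(np.iinfo(np.int32).min, np.iinfo(np.int32).max)
# -2147483648 2147483647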

Floating-point numbers. Floating-point numbers with base $2$ have the form

$$\pm\left(1+\sum_{i=1}^{p-1}d_i 2^{-i}\right)2^e$$

where $e$ is called the exponent, $p$ is the precision, and $1+\sum_{i=1}^{p-1}d_i 2^{-i}$ is the significand (or mantissa).

We can also write the floating-point number as

$$1.M \times 2^E$$

To get a feel for this, let's see an example: the number $3.140625$. First, it is positive, so the sign is $+$. Second, since it is between $2$ and $2^2$, the exponent $e$ should be $1$. So in order for

$$3.140625 = (\text{significand}) \times 2^1$$

We should have

$$\text{significand} = 1.5703125$$

You can check that

$$1.5703125 = 1+\sum_{i=1}^{p-1}d_i 2^{-i} = 1+\frac{1}{2}+\frac{1}{16}+\frac{1}{128} = 1+2^{-1}+2^{-4}+2^{-7}$$
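
We can verify this decomposition with plain Python arithmetic (a quick sanity check, no libraries needed):

# rebuild the significand from the bits d_1, d_4 and d_7
significand = 1 + 2**-1 + 2**-4 + 2**-7
significand
# 1.5703125
significand * 2**1   # multiply by 2^e with e = 1
# 3.140625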

import struct

def binary(num):
    # pack num as a 32-bit float (big-endian) and print each byte as 8 bits
    return ''.join('{:0>8b}'.format(c) for c in struct.pack('!f', num))
binary(3.140625)
# '01000000010010010000000000000000'
Figure 1. The binary representation of the floating-point number $3.140625$.
# more examples
# function link
# https://gist.github.com/oceanumeric/23d741404e01fb57f720d31a3a8e8348
binary_decompostion(10)

#  Decimal: 1.25 x 2^3
#   Binary: 1.01 x 2^11
#     Sign: 0 (+)
# Mantissa: 01 (0.25)
# Exponent: 11 (3)

binary_decompostion(2.998e8)

#  Decimal: 1.11684203147888184 x 2^28
#   Binary: 1.0001110111101001010111 x 2^11100

#     Sign: 0 (+)
# Mantissa: 0001110111101001010111 (0.11684203147888184)
# Exponent: 11100 (28)

For float32, the exponent $e$ can be any number between $-126$ and $127$ (the 8-bit exponent field uses a bias of 127), and the precision $p$ is $24$ (23 stored mantissa bits). For float64, the exponent $e$ can be any number between $-1022$ and $1023$ (the 11-bit exponent field uses a bias of 1023), and the precision $p$ is $53$ (52 stored mantissa bits). Figure 2 illustrates the binary representation of a floating-point number.
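
We can confirm these parameters with np.finfo, which reports the number of stored mantissa bits (nmant) and exponent bits (nexp) for each type:

print(np.finfo(np.float32).nmant, np.finfo(np.float32).nexp)
# 23 8
print(np.finfo(np.float64).nmant, np.finfo(np.float64).nexp)
# 52 11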

Figure 2. 32-bit and 64-bit floating-point number representations based on the IEEE 754 standard.

One notable feature of floating-point numbers is that they are not distributed uniformly on the real axis. The diagram below shows the location of the finite numbers representable by a hypothetical 8-bit floating-point format that has a 4-bit mantissa (with an implied fifth leading 1 bit) and a 3-bit exponent but otherwise follows IEEE conventions. You will understand this distribution well after calculating different numbers with the code in this post.

Figure 3. The distribution of floating-point numbers.
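
To make Figure 3 concrete, the sketch below enumerates the positive normal numbers of that hypothetical 8-bit format (assuming, per IEEE conventions with a 3-bit exponent, a bias of 3 and normal exponents from $-2$ to $3$). The gap between neighbours doubles every time the exponent goes up by one:

# 1 sign bit, 3 exponent bits (assumed bias 3), 4 mantissa bits
values = []
for e in range(-2, 4):          # assumed normal exponent range
    for m in range(16):         # 4 mantissa bits -> 16 significands per exponent
        values.append((1 + m / 16) * 2 ** e)

print(values[:4])
# [0.25, 0.265625, 0.28125, 0.296875]   -> spacing 2^-6 near 0.25
print(values[-4:])
# [14.0, 14.5, 15.0, 15.5]              -> spacing 2^-1 between 8 and 16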

To be super clear, the following gives the floating-point format written as powers of $2$ for all 52 mantissa bits:

$$2^{0}.2^{-1}2^{-2}2^{-3}\cdots 2^{-50}2^{-51}2^{-52} \times 2^e$$

Therefore, $2^{-52}$ is the smallest increment of the significand. This smallest significand digit multiplied by the exponent factor $2^e$ gives the spacing between a number and the next number we can represent with this format.

import numpy as np

# mantissa times exponent
(2**-52) * (2**0)
# 2.220446049250313e-16
np.nextafter(1.0, 2)
# 1.0000000000000002
space = np.nextafter(1.0, 2)-1
space
# 2.220446049250313e-16

Notice that the above result shows the spacing between $1$ and $1.0000000000000002$ is $2.220446049250313 \times 10^{-16}$, which is the same as $2^{-52}$.

# spacing between 2 and 4
np.nextafter(2, 3)
# 2.0000000000000004
space = np.nextafter(2, 3)-2
space
# 4.440892098500626e-16

Remark: there are gaps between floating-point numbers, as shown in Figure 3. Therefore, results will be rounded to the nearest representable number.

Roundoff Errors

If you do a calculation that puts you somewhere between the space of two numbers, the computer will automatically round to the nearest one. I like to think of this as a mountain range in the computer. The valleys are representable numbers. If a calculation puts us on a mountain side, we’ll roll down the mountain to the closest valley. For example, spacing for the number 1 is $2^{-52}$, so if we don’t add at least half that, we won’t reach the next representable number (therefore, this might give us roundoff errors):

# minimal space 
spacing = (2 ** -52) * (2 ** 0)
1 + 0.4 * spacing == 1  # add less than half the spacing
# True
1 + 0.6 * spacing == 1  # add more than half the spacing
# False
Figure 4. Illustration of the spacing between floating-point numbers.

We’re working with pretty small numbers here so you might not be too shocked. But remember, the bigger our exponent, the bigger our rounding errors will be!

# a very large number = 1e25
large_number = 1e25
binary_decompostion(large_number)

#  Decimal: 1.03397576569128469 x 2^83
#   Binary: 1.0000100010110010101000101100001010000000001010010001 x 2^1010011

#     Sign: 0 (+)
# Mantissa: 0000100010110010101000101100001010000000001010010001 (0.03397576569128469)
# Exponent: 1010011 (83)
# the exponent is 83, which is very large
spacing = (2 ** -52) * (2 ** 83)
print(spacing, f"{spacing:0.5e}")
# 2147483648.0 2.14748e+09

That’s just over 2 billion! 1 billion is less than half that spacing, so if we add 1 billion to a large number, we won’t cross the mountain peak, and we’ll slide back down into the same valley:

one_billion = 1e9
1e25 + one_billion == 1e25
# True

The above test shows that the computer rounds the result back to the nearest representable number: even after adding one billion, the sum is still $10^{25}$, which is mathematically wrong.

2 billion is more than half that spacing, so if we add it to large_number, we’ll slide into the next valley (representable number):

two_billion = 2e9
1e25 + two_billion == 1e25
# False

Another thing you should know is that, just as with integer data types, there is a biggest and a smallest number you can represent with floating point. You could determine this by hand (by making the mantissa and the exponent as big/small as possible), but we’ll use NumPy:

np.finfo("float64").max  # maximum representable number
# 1.7976931348623157e+308
np.finfo("float64").min  # minimum representable number
# -1.7976931348623157e+308
np.finfo("float64").tiny  # closest to 0
# 2.2250738585072014e-308

The spacing between floating-point numbers in the range $[1, 2)$ is $2.2 \times 10^{-16}$, while it is $4.4 \times 10^{-16}$ in the range $[2, 4)$.
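
NumPy exposes this gap directly through np.spacing, which returns the distance from a value to the next representable number:

np.spacing(1.0)
# 2.220446049250313e-16
np.spacing(2.0)
# 4.440892098500626e-16
np.spacing(1e25)
# 2147483648.0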

We will denote

$$u = \frac{1}{2} \times (\text{distance between } 1 \text{ and the next floating-point number})$$

This is called the unit roundoff. In double precision, $u \approx 10^{-16}$.
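
In code, the unit roundoff is simply half of the machine epsilon reported by np.finfo:

eps = np.finfo(np.float64).eps  # distance between 1 and the next float
u = eps / 2                     # unit roundoff
u
# 1.1102230246251565e-16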

Using floating-point arithmetic, $x+(y+z)$ is not necessarily equal to $(x+y)+z$, because the order of rounding is different.

2.0 ** (-53)
# 1.1102230246251565e-16
(-1.0 + 1.0) + 2.0**(-53)  # get a very small value 2 ** (-53)
# 1.1102230246251565e-16
(-1.0) + (1.0 + 2.0**(-53))  # 2**-53 is only half the spacing at 1, so 1.0 + 2**-53 rounds back to 1.0
# 0.0

In numerical analysis, we often need to know the relative error of a calculation, and we always want a stable algorithm, which means catastrophic cancellation should be avoided.

Catastrophic cancellation is the phenomenon that subtracting good approximations of two nearby numbers (approximations that already carry roundoff error because of the gaps we illustrated) may yield a very bad approximation of their difference.

Catastrophic cancellation is not affected by how large the inputs are; it applies just as much to large and small inputs. It depends only on how large the difference is and on the error of the inputs.
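
Here is a small illustration (the inputs are made up for demonstration): we compute $(1 + x) - 1$ for a tiny $x$. Each intermediate result is accurate to about $10^{-16}$, yet when the two leading 1s cancel, the remaining difference is only correct to roughly half its digits.

x = 1e-8
naive = (1.0 + x) - 1.0   # the 1s cancel, exposing the rounding error of 1.0 + x
naive
# ~9.9999999392e-09: the leading digits are right, the rest is roundoff noise
abs(naive - x) / x
# ~6e-09: roughly half of the 16 significant digits are lost

A common remedy is to rearrange a formula so that the subtraction happens between exactly known quantities, for example computing $1-\cos x$ as $2\sin^2(x/2)$ for small $x$.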

Now, I hope you understand the concept of floating-point numbers and the spacing between them, and you can avoid catastrophic cancellation in your numerical analysis.

  1. Goldberg, D. (1991). What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys (CSUR), 23(1), 5–48.