Floating-Point Arithmetic
Floating-point arithmetic is the standard way of representing real numbers in a computer. The representation is not exact, but it is fast and efficient, and it underlies essentially all of numerical computing.
Almost all numerical computation is carried out on vectors and matrices, which are stored as arrays of numbers. Those numbers are usually real numbers, and since a computer can only store finitely many bits, real numbers must be approximated. The approximation scheme used in practice is floating-point arithmetic.
This concept is so fundamental to numerical computing that it is worth spending some time to understand it. There is a very famous article on floating-point arithmetic by David Goldberg, called What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg, 1991). It is an excellent read and I highly recommend it, but it is long and not easy to digest.
In this article, we will discuss the basics of floating-point arithmetic, and how it is used in Python.
Floating-Point 101
Computers cannot represent arbitrary real numbers. They can only handle bounded sequences of binary bits, which can be interpreted as bounded integers. Therefore, numerical analysis could be defined as: the study of algorithms for problems of mathematical analysis that rely on computing with bounded integers only.
In a computer, integers are stored using a system called two's complement. In this system, $n$ bits are used to represent numbers between $-2^{n-1}$ and $2^{n-1}-1$.
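As a quick illustration (a sketch that reads these ranges back from NumPy's integer type limits, which follow two's complement):
import numpy as np

# with 8 bits, two's complement covers [-2**7, 2**7 - 1]
np.iinfo(np.int8).min, np.iinfo(np.int8).max
# (-128, 127)
# with 32 bits, it covers [-2**31, 2**31 - 1]
np.iinfo(np.int32).min, np.iinfo(np.int32).max
# (-2147483648, 2147483647)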
Floating-point numbers. Floating-point numbers with base $\beta$ have the form

$$y = \pm m \times \beta^{e-t},$$

where $e$ is called the exponent, $t$ is the precision, and $m$ is the significand (or mantissa), with $0 \le m \le \beta^t - 1$.
In the binary case ($\beta = 2$), we can also write a normalized floating-point number as

$$x = (-1)^s \times 1.f \times 2^e,$$

with sign bit $s$ and
- $1.f$ = "mantissa" (or significand)
- $e$ = "exponent"
To get a feel for this, let's look at an example: the number $3.140625$. First, it is positive, so the sign bit is $s = 0$. Second, since it lies between $2$ and $4$, the exponent should be $e = 1$. So in order for

$$3.140625 = (-1)^0 \times 1.f \times 2^1,$$

we should have

$$1.f = 1.5703125 = (1.1001001)_2.$$

You can check this with the following code:
import struct

def binary(num):
    # pack num as a 32-bit float and show all 32 bits, byte by byte
    return ''.join('{:0>8b}'.format(c) for c in struct.pack('!f', num))

binary(3.140625)
# '01000000010010010000000000000000'

# more examples
# function link
# https://gist.github.com/oceanumeric/23d741404e01fb57f720d31a3a8e8348
binary_decompostion(10)
# Decimal: 1.25 x 2^3
# Binary: 1.01 x 2^11
# Sign: 0 (+)
# Mantissa: 01 (0.25)
# Exponent: 11 (3)
binary_decompostion(2.998e8)
# Decimal: 1.11684203147888184 x 2^28
# Binary: 1.0001110111101001010111 x 2^11100
# Sign: 0 (+)
# Mantissa: 0001110111101001010111 (0.11684203147888184)
# Exponent: 11100 (28)
For float32, the exponent can be any integer between $-126$ and $127$, and the precision is $t = 24$ (23 stored mantissa bits). For float64, the exponent can be any integer between $-1022$ and $1023$, and the precision is $t = 53$ (52 stored mantissa bits). Figure 2 gives an illustration of the binary representation of a floating-point number.
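You can read these parameters back with np.finfo as a quick sanity check; nmant is the number of stored mantissa bits, and maxexp is the first power of two that overflows, so the largest usable exponent is maxexp - 1:
import numpy as np

np.finfo(np.float32).nmant, np.finfo(np.float32).minexp, np.finfo(np.float32).maxexp
# (23, -126, 128)
np.finfo(np.float64).nmant, np.finfo(np.float64).minexp, np.finfo(np.float64).maxexp
# (52, -1022, 1024)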

One notable feature of floating-point numbers is that they are not distributed uniformly on the real axis. The diagram below shows the location of the finite numbers representable by a hypothetical 8-bit floating-point format that has a 4-bit mantissa (with an implied fifth leading 1 bit) and a 3-bit exponent but otherwise follows IEEE conventions. You will get a good feel for this distribution after working through the calculations in this post.
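To see the non-uniform spacing concretely, here is a small sketch that enumerates the positive normalized values of such a toy format. I assume an exponent bias of 3 and ignore subnormals, infinities, and NaNs; the exact conventions of the hypothetical format in the diagram may differ.
# value = (1 + m / 2**4) * 2**(e - bias) for a 4-bit mantissa and 3-bit exponent
bias = 3  # assumed bias, chosen only for illustration
values = sorted((1 + m / 16) * 2 ** (e - bias) for e in range(8) for m in range(16))
values[:4]   # tightly packed near the smallest magnitudes
# [0.125, 0.1328125, 0.140625, 0.1484375]
values[-4:]  # far apart near the largest magnitudes
# [28.0, 29.0, 30.0, 31.0]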

To be super clear, the following gives the floating-point format written as powers of $2$ for all 52 mantissa bits:

$$x = \pm \left(1 + b_1 2^{-1} + b_2 2^{-2} + \cdots + b_{52} 2^{-52}\right) \times 2^e.$$

Therefore, $2^{-52}$ is the smallest increment we can make to the significand. This smallest significant digit multiplied by the exponent factor $2^e$ gives us the spacing between a number and the next number we can represent with our format.
import numpy as np

# mantissa increment times exponent factor
(2**-52) * (2**0)
# 2.220446049250313e-16

# the next representable float after 1.0, in the direction of 2
np.nextafter(1.0, 2)
# 1.0000000000000002

space = np.nextafter(1.0, 2) - 1
space
# 2.220446049250313e-16
Notice that the above result shows that the spacing between $1$ and $2$ is $2^{-52} \approx 2.22 \times 10^{-16}$, which is exactly the value of (2**-52) * (2**0) computed above.
# spacing between 2 and 4
np.nextafter(2, 3)
# 2.0000000000000004
space = np.nextafter(2, 3)-2
space
# 4.440892098500626e-16
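NumPy also exposes this gap directly through np.spacing, which we can use to cross-check the two results above:
import numpy as np

# the gap at 1.0 and at 2.0, read off directly
np.spacing(1.0)
# 2.220446049250313e-16
np.spacing(2.0)
# 4.440892098500626e-16, twice as large because the exponent has grown by one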
Remark: there are gaps between floating-point numbers, as shown in Figure 3. Therefore, the result of a calculation is rounded to the nearest representable number.
Roundoff errors
If a calculation puts you somewhere in the gap between two representable numbers, the computer will automatically round to the nearest one. I like to think of this as a mountain range in the computer: the valleys are representable numbers, and if a calculation puts us on a mountainside, we roll down the mountain to the closest valley. For example, the spacing at the number 1 is $2^{-52}$, so if we don't add at least half of that, we won't reach the next representable number (and this gives us roundoff errors):
# minimal space
spacing = (2 ** -52) * (2 ** 0)
1 + 0.4 * spacing == 1 # add less than half the spacing
# True
1 + 0.6 * spacing == 1 # add more than half the spacing
# False
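This rounding-to-the-nearest-valley behaviour is also behind the well-known 0.1 + 0.2 != 0.3 surprise: none of those decimals is exactly representable in binary, and each gets rounded. A tolerance-based comparison such as math.isclose is the safer check (a standard example, not from the original post):
import math

0.1 + 0.2
# 0.30000000000000004
0.1 + 0.2 == 0.3
# False
math.isclose(0.1 + 0.2, 0.3)
# True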

We’re working with pretty small numbers here so you might not be too shocked. But remember, the bigger our exponent, the bigger our rounding errors will be!
# a very large number = 1e25
large_number = 1e25
binary_decompostion(large_number)
# Decimal: 1.03397576569128469 x 2^83
# Binary: 1.0000100010110010101000101100001010000000001010010001 x 2^1010011
# Sign: 0 (+)
# Mantissa: 0000100010110010101000101100001010000000001010010001 (0.03397576569128469)
# Exponent: 1010011 (83)
# the exponent is 83, which is very large
spacing = (2 ** -52) * (2 ** 83)
print(spacing, f"{spacing:0.5e}")
# 2147483648.0 2.14748e+09
That's just over 2 billion! One billion is less than half that spacing, so if we add 1 billion to this large number, we won't cross the mountain peak, and we'll slide back down into the same valley:
one_billion = 1e9
1e25 + one_billion == 1e25
# True
The above test shows that adding one billion to $10^{25}$ has no effect at all: the sum is rounded back to the nearest representable number, so the addition is silently lost and the comparison evaluates to True, even though the two quantities are mathematically different.
2 billion is more than half that spacing, so if we add it to large_number, we’ll slide into the next valley (representable number):
two_billion = 2e9
1e25 + two_billion == 1e25
# False
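We can confirm the spacing we computed by hand using np.spacing at this magnitude:
import numpy as np

np.spacing(1e25)
# 2147483648.0, i.e. 2**-52 * 2**83
1e25 + 1e9 > 1e25
# False: one billion rounds back down into the same valley
1e25 + 2e9 > 1e25
# True: two billion is enough to reach the next valley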
Another thing you should know is that, just as with integer data types, there are a largest and a smallest number you can represent with floating point. You could determine these by hand (by making the mantissa and the exponent as large/small as possible), but we'll use NumPy:
np.finfo("float64").max # maximum representable number
# 1.7976931348623157e+308
np.finfo("float64").min # minimum representable number
# -1.7976931348623157e+308
np.finfo("float64").tiny # closest to 0
# 2.2250738585072014e-308
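Going past these limits is not an error in IEEE arithmetic: multiplying beyond max overflows to inf, and values below tiny lose precision gradually as subnormal numbers. A quick sketch (NumPy may emit an overflow warning for the first expression):
import numpy as np

np.finfo("float64").max * 2
# inf (overflow)
np.finfo("float64").tiny / 2
# 1.1125369292536007e-308, a subnormal number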
The spacing between floating-point numbers in the range $[1, 2)$ is $2^{-52}$, while it is $2^{-51}$ in the range $[2, 4)$.
We will denote

$$u = \tfrac{1}{2}\beta^{1-t}.$$

This is called the unit roundoff. In double precision, $u = 2^{-53} \approx 1.11 \times 10^{-16}$.
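The unit roundoff is half of the machine epsilon that NumPy reports, since eps is the gap from 1 to the next float while u is the largest relative error of a single rounding:
import numpy as np

np.finfo("float64").eps
# 2.220446049250313e-16, machine epsilon = 2**-52
np.finfo("float64").eps / 2
# 1.1102230246251565e-16, the unit roundoff u = 2**-53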
Using floating-point arithmetic, $(a + b) + c$ is not necessarily equal to $a + (b + c)$, because the order of rounding is different. For example, take $a = -1$, $b = 1$, and $c = 2^{-53}$:
2.0 ** (-53)
# 1.1102230246251565e-16
(-1.0 + 1.0) + 2.0**(-53) # get a very small value 2 ** (-53)
# 1.1102230246251565e-16
(-1.0) + (1.0 + 2.0**(-53)) # 1.0 + 2**(-53) rounds back to 1.0, since 2**(-53) is only half the spacing at 1
# 0.0
In numerical analysis we often need to know the relative error of a calculation, and we always want a stable algorithm, which means that catastrophic cancellation should be avoided.
Catastrophic cancellation is the phenomenon in which subtracting good approximations of two nearby numbers may yield a very bad approximation of their difference: the leading digits cancel, and what remains is dominated by the roundoff errors from the gaps we illustrated above.
Catastrophic cancellation is not affected by how large the inputs are; it applies just as much to large inputs as to small ones. It depends only on how large the difference is and on the errors of the inputs.
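As a concrete illustration (a standard textbook example, not from the original article): evaluating $(1 - \cos x)/x^2$ for small $x$ subtracts two nearly equal numbers and loses every significant digit, while the mathematically equivalent form $2\sin^2(x/2)/x^2$ avoids the subtraction entirely:
import math

x = 1.0e-8
(1 - math.cos(x)) / x**2
# 0.0: cos(x) rounds to exactly 1.0, so the numerator cancels to zero
2 * math.sin(x / 2)**2 / x**2
# 0.5 (up to rounding): the correct value, computed without cancellation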
Now, I hope you understand the concept of floating-point numbers and the spacing between them, and you can avoid catastrophic cancellation in your numerical analysis.
- Goldberg, D. (1991). What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys (CSUR), 23(1), 5–48.