Floating point numbers#

In the following, we explore how computers represent numbers. Integers are the simplest to understand. A computer can only represent a finite range of integers, for example, some implementations can store integers between \(-2^{63} = -9223372036854775808\) and \(2^{63}-1 = 9223372036854775807\).
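
For illustration, these integer bounds can be queried in Python with NumPy's np.iinfo (NumPy is introduced properly in the Python skills section below); a minimal sketch:

import numpy as np

int_info = np.iinfo(np.int64)  # Machine limits of the 64-bit signed integer type
print(int_info.min)            # -9223372036854775808 == -2**63
print(int_info.max)            #  9223372036854775807 ==  2**63 - 1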

Next, consider rational numbers such as \(1/3 = 0.33333\ldots\). One might attempt to work directly with fractional expressions like \(1/3\) and keep track of numerators and denominators. However, the number of digits in both the numerator and denominator tends to grow significantly as computations progress, making this approach impractical in most cases. Irrational numbers such as \(\pi\) or \(e\) pose an even greater challenge because they cannot be represented as fractions of integers.
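
To see how quickly exact fractions grow in practice, here is a small sketch using Python's built-in fractions module:

from fractions import Fraction

# Exact rational arithmetic: partial sums of 1/1 + 1/2 + ... + 1/n
print(sum(Fraction(1, k) for k in range(1, 11)))              # 7381/2520 after only 10 terms
print(sum(Fraction(1, k) for k in range(1, 31)).denominator)  # 2329089562800: 13 digits after 30 terms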

The solution for \(\mathbb{Q}\) and \(\mathbb{R}\) is the same: we must truncate their representation after a finite number of digits, thereby introducing a small error. Complex numbers, in terms of their storage in computer memory, are treated like two real numbers.

Definition of floating-point numbers#

The basic idea behind floating-point numbers is to define a finite subset of \(\mathbb{Q}\) that has a finer spacing for small numbers, allowing computations to account for small quantities, while employing a wider spacing for large numbers, thus covering a large range between the smallest and largest represented numbers.

Definition 2

We call \(\mathcal{F}\) a set of floating-point numbers if

\[ \mathcal{F} = \{ 0 \} \cup \left\{(-1)^s \cdot b^e \cdot \frac{m}{b^{p-1}} : s \in \{ 0,1\};\; e_{\min} \leq e \leq e_{\max};\; b^{p-1} \leq m \leq b^{p}-1\right\} \]

for some \(b, p \in \{2, 3, 4, \ldots\}\) and \(e_\min, e_\max \in \mathbb{Z}\) with \(e_\min < e_\max\).

The above rigorous definition of floating-point numbers allows us to prove theorems that take into account that computers only support finite-precision arithmetic.

The definition already indicates how the numbers of \(\mathcal{F}\) are best represented in the base-\(b\) numeral system. Most commonly, we work with binary numbers, i.e., base \(b = 2\), and sometimes with decimal numbers, i.e., base \(b = 10\). The factor \((-1)^s\) determines the sign of the number. The factor \(b^e\) determines the position of the leading digit. The final factor \(\frac{m}{b^{p-1}}\) takes values between \(1 = b^{p-1} / b^{p-1}\) and \((b^{p}-1) / b^{p-1} = b - b^{1-p}\). For large \(p\), we therefore find \((b^{p}-1) / b^{p-1} \approx b\). Thus, varying the mantissa \(m\) allows us to approximate different numbers whose leading digit is in the same position:

\[ \frac{m}{b^{p-1}} = 1, 1+b^{1-p}, 1+2b^{1-p}, \dots, b-b^{1-p} \]

Example 7

Let \(b = 2\), \(p = 2\), \(e_\min = -1\) and \(e_\max = 1\). It follows that \(b^{p-1} = 2\) and \(b^p - 1 = 3\). Then \(\mathcal{F}\) consists of \(0\) and the numbers enumerated in the following table:

Table 1 Non-zero Floating Point Numbers#

| \(s\) | \(e\) | \(m\) | Value |
|-------|-------|-------|-------|
| 0 | -1 | 2 | \(2^{-1} \cdot \frac{2}{2^1} = 0.5\) |
| 0 | -1 | 3 | \(2^{-1} \cdot \frac{3}{2^1} = 0.75\) |
| 0 | 0 | 2 | \(2^0 \cdot \frac{2}{2^1} = 1\) |
| 0 | 0 | 3 | \(2^0 \cdot \frac{3}{2^1} = 1.5\) |
| 0 | 1 | 2 | \(2^1 \cdot \frac{2}{2^1} = 2\) |
| 0 | 1 | 3 | \(2^1 \cdot \frac{3}{2^1} = 3\) |
| 1 | -1 | 2 | \(-2^{-1} \cdot \frac{2}{2^1} = -0.5\) |
| 1 | -1 | 3 | \(-2^{-1} \cdot \frac{3}{2^1} = -0.75\) |
| 1 | 0 | 2 | \(-2^0 \cdot \frac{2}{2^1} = -1\) |
| 1 | 0 | 3 | \(-2^0 \cdot \frac{3}{2^1} = -1.5\) |
| 1 | 1 | 2 | \(-2^1 \cdot \frac{2}{2^1} = -2\) |
| 1 | 1 | 3 | \(-2^1 \cdot \frac{3}{2^1} = -3\) |
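
The table can also be reproduced programmatically; the following sketch simply loops over all admissible values of \(s\), \(e\) and \(m\) from Definition 2 with the parameters of Example 7:

b, p = 2, 2           # base and precision from Example 7
e_min, e_max = -1, 1

floats = sorted(
    (-1) ** s * b ** e * m / b ** (p - 1)
    for s in (0, 1)
    for e in range(e_min, e_max + 1)
    for m in range(b ** (p - 1), b ** p)  # b**(p-1) <= m <= b**p - 1
)
print(floats)  # [-3.0, -2.0, -1.5, -1.0, -0.75, -0.5, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0]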

A graphical representation of a floating-point set (source: Wikipedia):

[Figure: Floating point precision]

The term ‘floats’ is often used as a shorthand for ‘floating-point numbers’.

IEEE 754#

The previous example is not practical because its \(\mathcal{F}\) is far too small. Instead, the most widely used floating-point sets are defined in the IEEE 754 standard.

The two most common types of floating-point numbers are IEEE double precision and IEEE single precision. The former provides around 15 digits of accuracy, while the latter offers around 7 digits of accuracy.

  • IEEE double precision: \(e_{min} = -1022, e_{max} = 1023, p = 53\), \(b = 2\)

  • IEEE single precision: \(e_{min} = -126, e_{max} = 127, p = 24\), \(b = 2\)
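
As a quick check of the digit counts quoted above, the relative spacing \(b^{1-p}\) implied by these parameters can be computed directly; a small sketch in plain Python:

# Relative spacing b**(1-p) between neighbouring floats near 1 for both IEEE formats
for name, p in [("double precision", 53), ("single precision", 24)]:
    print(f"{name}: 2**(1-p) = {2.0 ** (1 - p):.3e}")
# double precision: 2**(1-p) = 2.220e-16  -> about 15-16 significant digits
# single precision: 2**(1-p) = 1.192e-07  -> about 7 significant digits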

For almost all purposes, this is more than sufficient. Take, for example, the gravitational constant \(G\). It is known to around 4 digits of accuracy, only a fraction of the number of digits available in modern computers.

For most numerical calculations, double precision is preferred because, in some computations, errors might accumulate, leading to a loss of accuracy. Double precision provides much more headroom for this than single precision.

However, not all applications require such precision. For example, many machine learning applications use the half-precision type, which is even less accurate than single precision but can be evaluated extremely efficiently on dedicated hardware, e.g., tensor cores on modern machine learning accelerators.
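
NumPy exposes the IEEE half-precision type as np.float16; a short sketch of its key parameters, queried with np.finfo (introduced below):

import numpy as np

half = np.finfo(np.float16)  # IEEE half precision: b = 2, p = 11
print(half.eps)              # 0.000977 (= 2**-10)
print(half.precision)        # 3: roughly three decimal digits of accuracy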

Floating-point arithmetic#

The distance between neighbouring floats with the same leading-digit position, i.e. with the same exponent \(e\), is \(b^e\, b^{1-p}\). Let \(x\) be a real number between the smallest and largest positive float, i.e. \(b^{e_{min}} \leq x \leq (b - b^{1-p})\, b^{e_{max}}\). Define \(\epsilon_{mach} := \frac{1}{2} b^{1-p}\). Then there exists \(x' \in \mathcal{F}\) such that \(|x - x'| \leq \epsilon_{mach} |x|\). In words, \(\epsilon_{mach}\) bounds the relative distance from \(x\) to the nearest floating-point number in \(\mathcal{F}\).

Define the projection

\[ fl: x \mapsto fl(x), \]

where \(fl(x)\) is the closest floating-point number in \(\mathcal{F}\). If there are two floating-point numbers of equal distance, choose the one with smaller absolute value. It follows that \(fl(x) = x \cdot (1 + \epsilon)\) for some \(|\epsilon| \leq \epsilon_{mach}\).
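
As a concrete illustration, the following sketch uses Python's exact Fraction type to play the role of the real number \(x\) and checks the bound for \(x = 1/3\) rounded to the nearest IEEE double:

from fractions import Fraction

x = Fraction(1, 3)               # the real number x = 1/3
fl_x = Fraction(1 / 3)           # its closest double; Fraction(float) is exact
rel_err = abs(x - fl_x) / x      # relative rounding error, computed exactly
eps_mach = Fraction(1, 2 ** 53)  # (1/2) * b**(1-p) for b = 2, p = 53
print(rel_err <= eps_mach)       # True
print(float(rel_err))            # ~5.55e-17, about half of eps_mach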

Theorem 2 (Fundamental Theorem of Floating Point Arithmetic)

Define \(x\odot y = fl(x \cdot y)\), where \(\odot\) is one of \(+,-,\times,\div\). Then for all \(x,y\in\mathcal{F}\) there exists \(\epsilon\) with \(|\epsilon| \leq \epsilon_{mach}\) such that

\[ x\odot y = (x \cdot y)(1+\epsilon). \]

Most modern computer architectures satisfy these assumptions about floating-point sets and therefore guarantee the Fundamental Theorem. For additional details, we refer to the recommended textbook [Trefethen and Bau, 1997].
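
The theorem is easy to check numerically; in the sketch below, the exact sum of two doubles is obtained with Python's Fraction type and compared with the floating-point sum:

from fractions import Fraction

x, y = 0.1, 0.7                    # both are elements of F (IEEE doubles)
computed = x + y                   # x (+) y, evaluated in double precision
print(computed)                    # 0.7999999999999999
exact = Fraction(x) + Fraction(y)  # x + y, evaluated exactly
rel_err = abs(Fraction(computed) - exact) / exact
print(float(rel_err))              # ~3.5e-17, below eps_mach = 2**-53 ~ 1.1e-16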

Python skills#

The NumPy module provides convenient tools to query the properties of floating-point numbers. First, import NumPy as follows:

import numpy as np  # Import the numpy extension module and alias it as np

Floating point data types in NumPy#

NumPy defines the following data types for floating-point numbers:

  • IEEE double precision: np.float64, np.double (the former alias np.float has been removed from recent NumPy versions)

  • IEEE single precision: np.single, np.float32
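
Both types store the same decimal literal with different accuracy; a short sketch:

# The decimal number 0.1 is not exactly representable in base 2;
# single and double precision round it to different nearby values.
print(f"{np.float32(0.1):.20f}")  # 0.10000000149011611938
print(f"{np.float64(0.1):.20f}")  # 0.10000000000000000555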

Querying floating-point properties#

Use np.finfo to query floating-point properties. For example:

double_precision_info = np.finfo(np.float64)  # Machine parameters of the IEEE double precision type

Key properties of double precision floating-point numbers:

  • Maximum value: double_precision_info.max

  • Smallest (absolute) normalised value: double_precision_info.tiny

  • Machine epsilon, i.e. the gap between 1 and the next larger float: double_precision_info.eps

  • Approximate number of accurate decimal digits: double_precision_info.precision
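
Printing these properties for double precision gives, for example (the exact output may differ slightly between NumPy versions):

print(double_precision_info.max)        # 1.7976931348623157e+308
print(double_precision_info.tiny)       # 2.2250738585072014e-308
print(double_precision_info.eps)        # 2.220446049250313e-16
print(double_precision_info.precision)  # 15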

Illustrative examples:

print(1 + double_precision_info.eps)         # 1.0000000000000002, the closest float larger than 1
print(1 + 0.25 * double_precision_info.eps)  # 1.0, because the exact result is closer to 1 than to the next float

Special floating-point values#

The floating-point standard also defines:

  • NaN: Not a number

  • inf: Infinity

Examples:

a = np.inf  # Positive infinity
b = np.float64(0) / np.float64(0)  # Produces NaN (and emits a RuntimeWarning)
print(b)  # Output: nan

Note: NumPy adheres to the IEEE 754 standard, so operations like 0/0 return NaN (together with a runtime warning) rather than raising an error.
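
A few further operations with special values behave as the standard prescribes; a short sketch using the NumPy predicates np.isinf and np.isnan:

c = np.float64(1) / np.float64(0)  # A non-zero number divided by zero gives inf (with a RuntimeWarning)
print(c, np.isinf(c))              # inf True
d = np.inf - np.inf                # inf - inf is undefined and yields NaN
print(d, np.isnan(d))              # nan True
print(np.nan == np.nan)            # False: NaN compares unequal to everything, including itself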

Python itself is not fully IEEE 754 compliant. Consider:

a = np.inf
b = 0. / 0.  # Raises ZeroDivisionError in Python

This behaviour is one reason why NumPy is critical for numerical computations in Python, as it adheres more closely to the IEEE 754 standard.

Self-check questions#

Question

Explain the meaning of \(\epsilon_{mach}\).

Answer

\(\epsilon_{mach}\), known as machine epsilon, is defined as:

\[ \epsilon_{mach} = \frac{1}{2}b^{1-p}, \]

where \(b\) is the base and \(p\) is the precision. It represents half the distance from 1 to the next floating-point number and quantifies the maximum relative error between a real number \(x\) and its closest floating-point representation \(x'\). Mathematically:

\[ |x - x'| \leq |x| \epsilon_{mach}. \]

Question

Why do double precision numbers give you around 15 digits of accuracy?

Answer

In double precision, we have:

\[ \epsilon_{mach} = 2^{-53} \approx 1.11 \times 10^{-16}. \]

This means that the relative error incurred when mapping a real number to its floating-point representation is at most about \(10^{-16}\), so roughly 15 to 16 significant digits are preserved.

Question

Consider a floating-point system with base \(b=2\), precision \(p=5\), \(e_{\min}=-6\), and \(e_{\max}=6\).

  1. Determine the machine epsilon \(\epsilon_{mach}\) for this system.

  2. Show that any real number \(x\) (within the representable range) can be approximated by a floating-point number \(fl(x)\) such that:

    \[ |x - fl(x)| \leq \epsilon_{mach} |x|. \]
Answer
  1. By definition, \(\epsilon_{mach} = \tfrac{1}{2} b^{1 - p}\). Since \(b=2\) and \(p=5\), we have

    \[ \epsilon_{mach} = \tfrac{1}{2} \cdot 2^{1-5} = \tfrac{1}{2} \cdot 2^{-4} = 2^{-5} = \frac{1}{32}. \]
  2. For any real number \(x\) in the representable range, the floating-point number \(fl(x)\) is chosen as a closest number in the given floating-point set. By definition of the rounding rule,

    \[ |x - fl(x)| \leq \tfrac{1}{2}\text{(spacing at scale of }x). \]

    Since the spacing of floating-point numbers around \(x\) is at most \(|x|\, b^{1-p}\), we have

    \[ |x - fl(x)| \leq \epsilon_{mach} |x|. \]

    This shows that the relative error in representing \(x\) as \(fl(x)\) is bounded by \(\epsilon_{mach}\).
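
Part 2 can also be checked numerically. The sketch below enumerates the positive members of this floating-point set, rounds random real numbers to the nearest member, and confirms the relative-error bound (np.random.default_rng is used only to draw reproducible samples):

import numpy as np

b, p, e_min, e_max = 2, 5, -6, 6
eps_mach = 0.5 * b ** (1 - p)  # = 1/32

# All positive members of the floating-point set
F = [b ** e * m / b ** (p - 1)
     for e in range(e_min, e_max + 1)
     for m in range(b ** (p - 1), b ** p)]

rng = np.random.default_rng(0)
xs = rng.uniform(b ** e_min, max(F), size=1000)  # random reals in the representable range
for x in xs:
    fl_x = min(F, key=lambda f: abs(f - x))      # round x to the nearest element of the set
    assert abs(x - fl_x) <= eps_mach * abs(x)    # the bound from part 2 holds
print("Bound verified for all", len(xs), "samples")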

Question

Consider floating-point arithmetic in base \(b\) with precision \(p\). Assume \(x\) and \(y\) are floating-point numbers that do not differ too drastically in magnitude. Show that the computed sum \(fl(x \oplus y)\) satisfies

\[ \frac{|(x+y)-fl(x \oplus y)|}{|x+y|} \leq \epsilon_{mach} \]

and discuss the limitations of this error bound when \(x\) and \(y\) differ by orders of magnitude (catastrophic cancellation).

Answer

By the Fundamental Theorem of Floating Point Arithmetic, any floating-point operation on representable numbers \(x\) and \(y\) can be written as

\[ x \oplus y = (x+y)(1+\epsilon), \]

with \(|\epsilon| \leq \epsilon_{mach}\). Thus, for addition,

\[ |fl(x \oplus y) - (x+y)| = |(x+y)\epsilon| \leq \epsilon_{mach}|x+y|. \]

Hence,

\[ \frac{|(x+y)-fl(x \oplus y)|}{|x+y|}\leq \epsilon_{mach}. \]

This ensures that the addition of two numbers of similar magnitude is highly accurate. However, when one number is much larger than the other, the smaller one is partially (or entirely) absorbed and contributes little to the result. Worse, when \(x \approx -y\), the sum nearly cancels and the result is tiny compared with \(x\) and \(y\): the bound above still holds for the values of \(x\) and \(y\) actually stored in memory, but any rounding errors already carried by \(x\) and \(y\) from earlier computations are not reduced by the subtraction and can be enormous relative to the small result. This loss of significant digits is known as catastrophic cancellation, and in such situations the relative error bound, although formally valid, is of little practical use for assessing accuracy.
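
The loss of digits through cancellation is easy to provoke; the following sketch evaluates \(1 - \cos x\) for a small \(x\) in two mathematically equivalent ways:

import numpy as np

x = np.float64(1e-8)
naive = 1.0 - np.cos(x)              # both terms are ~1, so the tiny true result is lost
stable = 2.0 * np.sin(x / 2.0) ** 2  # mathematically identical, but avoids the cancellation
print(naive)                         # 0.0 on most systems: every significant digit has cancelled
print(stable)                        # ~5e-17, the correct value (1 - cos x ~ x**2 / 2)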

Question

Consider a floating-point system with base \(b=10\), precision \(p=3\), and exponent range \(e_{\min}=-2\), \(e_{\max}=2\). Given a real number \(x=1.23456\), determine its floating-point representation \(fl(x)\) and estimate the relative error \(\frac{|x-fl(x)|}{|x|}\). Compare your computed relative error with \(\epsilon_{mach}\).

Answer
  1. Floating Point Representation: With \(p=3\), we keep three significant digits. Normalise \(x\) to the form \(d_1.d_2d_3 \times 10^{e}\): Since \(x=1.23456\), we look at the first three digits and round if needed. The fourth digit is a 4, which does not round the third digit up. Thus,

    \[ fl(x) = 1.23 \times 10^{0}. \]
  2. Machine Epsilon: By definition,

    \[ \epsilon_{mach} = \frac{1}{2}b^{1-p} = \frac{1}{2}\times 10^{1-3} = \frac{1}{2}\times 10^{-2} = 0.005. \]
  3. Relative Error: The actual difference is

    \[ |x - fl(x)| = |1.23456 - 1.23| = 0.00456. \]

    Therefore,

    \[ \frac{|x - fl(x)|}{|x|} = \frac{0.00456}{1.23456} \approx 0.00369. \]

    This is less than \(\epsilon_{mach} = 0.005\), confirming that the relative rounding error stays within the worst-case bound given by machine epsilon.
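
The numbers in this answer can be reproduced with a few lines of Python; a small sketch:

x = 1.23456
fl_x = 1.23                     # x rounded to three significant decimal digits
eps_mach = 0.5 * 10 ** (1 - 3)  # = 0.005
rel_err = abs(x - fl_x) / abs(x)
print(rel_err)                  # ~0.003694
print(rel_err <= eps_mach)      # True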