Numeric Representation

Computers must use approximations when storing certain numbers because their binary system can’t precisely represent all decimal values.

Representation Matters
Whole Numbers Represented by Computers
Real Numbers Represented By Computers
Scientific Notation
IEEE 754 Floating Point Format
Why a Standard
IEEE 754 Binary Format
IEEE 754 32-bit Float Example: The Number One (Almost!)
IEEE 754 32-bit Float Example: The Number One (Actual!)
Another 32-bit Float Example: -0.09375
The “Next” Number After 1.0
Float Density
Special Floats
Subnormal or Denormalized Numbers
Extremes - 32-bit Examples
Further Reading

Representation Matters

It’s important to remember that all numbers we work with in a game development context are stored in memory and processed by the CPU in binary.

Further, it’s important to be aware of:

How precisely those numbers are being represented.
The limits imposed by their binary representation.
The potential errors that result from their binary representation.

Whole Numbers Represented by Computers

How precisely are whole numbers stored by a computer?

Computers can represent whole numbers with 100% accuracy. As we’ve seen above, there’s a direct one-to-one mapping between decimal and binary whole numbers.

Are there limits imposed on the computer representation of whole numbers?

Yes. As we discussed above, the integers extend infinitely in the negative and positive directions. So unless our computer has infinite memory there’s a limit on the maximum positive/negative numbers we can store. Many programming languages will store a whole number in binary with 32 to 64 bits. If we reserve one bit to specify the sign, that’s 31 or 63 bits for storage.

Number of Bits	Whole Number Decimal Range
32 bits	-2,147,483,648 to +2,147,483,647
52 bits*	-9,007,199,254,740,991 to +9,007,199,254,740,991
64 bits	-9,223,372,036,854,775,808 to +9,223,372,036,854,775,807

*Javascript stores whole numbers using a floating point data type, which makes only 52 bits accessible to integers.

Are there potential errors inherent in a computer’s representation of whole numbers?

Due to the range limits, whole numbers will often “rollover” when they exceed either side of their maximum range.

Here’s an example in a C++ context:

int main() {
  int bigNumber = 2'147'483'647; // Max Positive 32 bit integer.
  bigNumber++; // Add one and it's negative! -2147483648
}

For this course we’re using Javascript, which has a BigInt primitive that allows for integers with a maximum size only limited by the amount of memory available.

Real Numbers Represented By Computers

How precisely are real numbers stored by a computer?

Between any two whole numbers there are an infinite number of real numbers. This makes the real numbers uncountable and hence most real numbers cannot be precisely represented by a computer. Instead we need to make do with either fixed-point or floating-point approximations. When we go into the details later we’ll see that for floating-point numbers that precision decreases as the numbers increase in size. And actually, much of the precision lies in the 0.0 to 0.1 range!

Are there limits imposed on the computer representation of real numbers?

Like integers, each real number will be represented by either 32 or 64 bits. By sacrificing precision we can cram in a much larger range than whole numbers.

Number of Bits	Approximate Real Number Decimal Range
32 bits	±1.17 x 10^-38 to ±3.40 x 10³⁸
64 bits	±2.22 x 10^-308 to ±1.79 x 10³⁰⁸

Are there potential errors inherent in a computer’s representation of whole numbers?

Because computer’s store real numbers as approximations, all real numbers have a built-in relative error. Take the real number pi, for example:

pi	value
Actual Value	3.141592653589793238462643383279502884197169399375105…
32 bit Floating Point Approximation	3.141592741012573242187500000000000000000000000000000…
64 bit Floating Point Approximation	3.141592653589793115997963468544185161590576171875000…

You can determine the relative error in this approximation using: (actualPi - floatPi) / actualPi

This problem extends beyond irrational numbers like pi. Take this example in Javascript:

let zeroPointOne = 0.1;
let zeroPointTwo = 0.2;
console.log(zeroPointOne + zeroPointTwo); // 0.30000000000000004

Scientific Notation

Scientific notation is a nice and concise way of representing very large and very small numbers.

With scientific notation we write all numbers in the form m * 10ⁿ:

number	scientific notation	e-form
698,000,000,000	6.98 * 10¹¹	6.98e11
350	3.5 * 10²	3.5e2
0.051	5.1 * 10^-2	5.1e-2

🎵 Note:

We restrict the number of digits to the left of the decimal point to one.

IEEE 754 Floating Point Format

So how exactly are real numbers represented by a computer?

They use a form of binary scientific notation, that’s how!

Most programming languages use the IEEE 754 standard, which has 32-bit single-precision “floats” and 64-bit double-precision “doubles”.

Why a Standard

Before float standards existed, different hardware manufacturers and languages all used their own quirky formats.

IEEE 754 not only specifies the format, but also the algorithms for arithmetic, the rounding rules, and the tests implementations must pass, like:

x + y == y + x
x + 0 == x
if (x == y) then x - y == 0

IEEE 754 Binary Format

In IEEE 754 a number is represented in binary scientific notation. So instead of m * 10ⁿ it’s:

number = m * 2ⁿ

Or more specifically:

num = -1^sign * mantissa * 2^exponent

Where:

The mantissa is a binary number of the form: 1.######...
The exponent is also a binary number.
The sign bit allows for positive (0) and negative numbers (1).

Bit Size	Common Name	Sign Bit	Exponent	Mantissa
32-bit (4 bytes)	float	1 sign bit	8 exponent bits	23 mantissa bits
64-bit (8 bytes)	double	1 sign bit	11 exponent bits	52 mantissa bits

🎵 Note:

There’s a trade-off between precision (large mantissa) and range (large exponent).

IEEE 754 32-bit Float Example: The Number One (Almost!)

Let’s try to represent 1.0 using the IEEE 754 binary format:

num = -1^sign * mantissa * 2^exponent
1.0 = -1⁰ * 1.0 * 2⁰

[0][00000000][00000000000000000000000]
 ▲         ▲                        ▲
 └──Sign   └──Exponent              └──Mantissa right of decimal point

🎵 Note:

Because the mantissa is normalized to 1.# we don’t need to store the leading zero.

⚡ Warning:

But wait! Intuitively, shouldn’t an all zero bitmask mean decimal zero?

IEEE 754 32-bit Float Example: The Number One (Actual!)

Let’s add 127 to the exponent when we store it. This allows for storage of exponent values from -126 to 127 using 00000001 to 11111110. (For 64-bit numbers, the offset used is 1023.)

So using 32-bit IEEE 754, the decimal number 1.0 is actually stored as:

[0][01111111][00000000000000000000000]
 ▲       ▲                          ▲
 └──Sign └──Exponent shifted by 127 └──Mantissa right of the decimal point

⏳ Wait For It:

Later we’ll talk about the special cases when the exponent is all zeros or all ones.

Another 32-bit Float Example: -0.09375

Step 1: Determine Sign Bit

Sign Bit = 1 (Negative)

Step 2: Convert decimal number to binary:

0.09375 [decimal] = 0.00011 [binary]

Step 3: Normalize the mantissa

Mantissa = 0.00011 = 1.1 * 2^-4

Step 4: Bias the exponent by adding 127 and convert to binary:

Exponent = -4 + 127 = 123 [decimal] = 01111011 [binary]

Step 5: Remove the leading one from the mantissa:

Mantissa = 1.10000000000000000000000 = 10000000000000000000000

Step 6: The full IEEE 754 binary representation of -0.09375

[1][01111011][10000000000000000000000]
 ▲         ▲                        ▲
 └──Sign   └──Exponent              └──Mantissa right of the decimal point

The “Next” Number After 1.0

So which number comes after 1.0 in the available 32-bit floating point numbers?

Let’s add a one to the smallest fractional bit of mantissa:

[0][01111111][00000000000000000000001]
 ▲         ▲                        ▲
 └──Sign   └──Exponent              └──Mantissa right of the decimal point

nextNum = -1⁰ * 1.00000000000000000000001 * 2⁰
nextNum = 1.00000000000000000000001 [binary]
nextNum = 1.00000011920928955078125 [decimal] Divide 1 in half 23 times for the delta.

⚡ Warning:

This doesn’t mean there’s a constant delta of 0.0000001192… between all floats!

Float Density

Because float precision is based on its mantissa, and because the mantissa is the same number of bits regardless of the size of the number being stored, there’s an uneven distribution of floats inside the range of possible real numbers!

Approximately half of all possible floats can be found in the -1.0 to +1.0 range!

Range	Fixed Step Between Decimal Numbers
1 to 2	2^-23
2 to 4	2^-22
4 to 8	2^-21
8 to 16	2^-20
2ⁿ to 2ⁿ⁺¹	2^n-23

Special Floats

Exponent	Mantissa	Represents
11111111	All Zeros	Infinity
11111111	Not All Zeros	NAN
00000000	All Zeros	Zero
00000000	Not All Zeros	Subnormal (very small)

Which means that with the sign bit we can store +Infinity, -Infinity, and strangely +Zero, and -Zero.

Subnormal or Denormalized Numbers

With a mantissa of the form 1.#... we get a gap between 0.0 and the first representable number. This gap is present on the positive and negative side of 0, and is called “the hole at zero”.

To remove this gap, the standard represents the smallest non-zero numbers in a denormalized form denoted by an all zero exponent. For these denormalized numbers the exponent is still considered to be it’s smallest value of (-126 for 32-bit and -1022 for 64-bit) but the mantissa is now assumed to be of the form 0.#...

You will sometime hear this denormalization strategy referred to as gradual underflow. In other words as floats in this denormalized range get closer and closer to zero they lose precision.

Extremes - 32-bit Examples

Assuming just positive values:

IEEE 754 32-Bits	Explanation	Value
0 00000000 00000000000000000000001	Smallest Subnormal	2⁻¹²⁶ * 2^-23 = 2^-149 ≈ 1.4012984643 * 10^-45
0 00000000 11111111111111111111111	Largest Subnormal	2⁻¹²⁶ * (1 - 2^-23) = ≈ 1.1754942107 * 10^-38
0 00000001 00000000000000000000000	Smallest Normal	2^-126 ≈ 1.1754943508 * 10^-38
0 11111110 11111111111111111111111	Largest Normal	2¹²⁷ * (2 − 2^-23) ≈ 3.4028234664 * 10³⁸

Numeric Representation

Table of Contents

Representation Matters

Whole Numbers Represented by Computers

Real Numbers Represented By Computers

Scientific Notation

IEEE 754 Floating Point Format

Why a Standard

IEEE 754 Binary Format

IEEE 754 32-bit Float Example: The Number One (Almost!)

IEEE 754 32-bit Float Example: The Number One (Actual!)

Another 32-bit Float Example: -0.09375

The “Next” Number After 1.0

Float Density

Special Floats

Subnormal or Denormalized Numbers

Extremes - 32-bit Examples

Further Reading