Numeric Representation

Computers must use approximations when storing certain numbers because their binary system can’t precisely represent all decimal values.

Table of Contents

  1. Representation Matters
  2. Whole Numbers Represented by Computers
  3. Real Numbers Represented By Computers
  4. Scientific Notation
  5. IEEE 754 Floating Point Format
  6. Why a Standard
  7. IEEE 754 Binary Format
  8. IEEE 754 32-bit Float Example: The Number One (Almost!)
  9. IEEE 754 32-bit Float Example: The Number One (Actual!)
  10. Another 32-bit Float Example: -0.09375
  11. The “Next” Number After 1.0
  12. Float Density
  13. Special Floats
  14. Subnormal or Denormalized Numbers
  15. Extremes - 32-bit Examples
  16. Further Reading

Representation Matters

It’s important to remember that all numbers we work with in a game development context are stored in memory and processed by the CPU in binary.

Further, it’s important to be aware of:

  • How precisely those numbers are being represented.
  • The limits imposed by their binary representation.
  • The potential errors that result from their binary representation.

Whole Numbers Represented by Computers

How precisely are whole numbers stored by a computer?

Computers can represent whole numbers with 100% accuracy. As we’ve seen above, there’s a direct one-to-one mapping between decimal and binary whole numbers.

Are there limits imposed on the computer representation of whole numbers?

Yes. As we discussed above, the integers extend infinitely in the negative and positive directions. So unless our computer has infinite memory there’s a limit on the maximum positive/negative numbers we can store. Many programming languages will store a whole number in binary with 32 to 64 bits. If we reserve one bit to specify the sign, that’s 31 or 63 bits for storage.

Number of Bits   Whole Number Decimal Range
32 bits          -2,147,483,648 to +2,147,483,647
53 bits*         -9,007,199,254,740,991 to +9,007,199,254,740,991
64 bits          -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807

*Javascript stores whole numbers using a floating point data type, whose 52 stored mantissa bits plus an implicit leading bit give 53 bits of integer precision.
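In Javascript this limit is exposed as Number.MAX_SAFE_INTEGER. Here's a quick sketch of what happens at the boundary:

```javascript
// Doubles have 52 stored mantissa bits plus an implicit leading bit,
// so Javascript can only represent integers exactly up to 2^53 - 1.
const maxSafe = Number.MAX_SAFE_INTEGER;
console.log(maxSafe); // 9007199254740991

console.log(Number.isSafeInteger(maxSafe));     // true
console.log(Number.isSafeInteger(maxSafe + 1)); // false

// Beyond this limit, distinct integers collapse into the same double:
console.log(maxSafe + 1 === maxSafe + 2); // true
```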

Are there potential errors inherent in a computer’s representation of whole numbers?

Due to these range limits, whole numbers will often “roll over” when they exceed either end of their range.

Here’s an example in a C++ context:

#include <iostream>

int main() {
  int bigNumber = 2'147'483'647; // Max positive 32-bit integer.
  bigNumber++; // Rolls over to -2147483648! (Strictly speaking, signed
               // overflow is undefined behaviour in C++; in practice it wraps.)
  std::cout << bigNumber << "\n";
}

For this course we’re using Javascript, which has a BigInt primitive that allows for integers with a maximum size only limited by the amount of memory available.
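A small sketch of BigInt in action:

```javascript
// BigInt literals use an `n` suffix and never overflow or lose precision.
const big = 9_223_372_036_854_775_807n; // Max 64-bit signed integer...
console.log(big + 1n); // 9223372036854775808n -- ...and beyond!

// BigInts and regular numbers can't be mixed directly:
// big + 1; // TypeError: Cannot mix BigInt and other types
```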

Real Numbers Represented By Computers

How precisely are real numbers stored by a computer?

Between any two whole numbers there are infinitely many real numbers. In fact the real numbers are uncountable, so most of them cannot be precisely represented by a computer. Instead we make do with either fixed-point or floating-point approximations. When we go into the details later we’ll see that floating-point precision decreases as the numbers increase in size. And actually, much of the precision lies in the 0.0 to 0.1 range!

Are there limits imposed on the computer representation of real numbers?

Like integers, each real number will be represented by either 32 or 64 bits. By sacrificing precision we can cram a much larger range into those bits than we can with whole numbers.

Number of Bits   Approximate Real Number Decimal Range
32 bits          ±1.17 x 10^-38 to ±3.40 x 10^38
64 bits          ±2.22 x 10^-308 to ±1.79 x 10^308

Are there potential errors inherent in a computer’s representation of real numbers?

Because computers store real numbers as approximations, all real numbers have a built-in relative error. Take the real number pi, for example:

                                      pi value
Actual Value                          3.141592653589793238462643383279502884197169399375105…
32-bit Floating Point Approximation   3.141592741012573242187500000000000000000000000000000
64-bit Floating Point Approximation   3.141592653589793115997963468544185161590576171875000

You can determine the relative error in this approximation using: (actualPi - floatPi) / actualPi
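In Javascript we can compute this relative error directly: Math.fround rounds a double to the nearest 32-bit float value, giving us both approximations to compare.

```javascript
// Math.PI is the 64-bit approximation; Math.fround rounds it down to
// the nearest representable 32-bit float value.
const doublePi = Math.PI;
const floatPi = Math.fround(Math.PI);

console.log(floatPi); // 3.1415927410125732

const relativeError = (doublePi - floatPi) / doublePi;
console.log(relativeError); // Roughly -2.78e-8
```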

This problem extends beyond irrational numbers like pi. Take this example in Javascript:

let zeroPointOne = 0.1;
let zeroPointTwo = 0.2;
console.log(zeroPointOne + zeroPointTwo); // 0.30000000000000004
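Because of these built-in errors, floats should be compared using a small tolerance (an “epsilon”) rather than strict equality. A minimal sketch (the approximatelyEqual helper is our own name):

```javascript
// Compare two floats within a tolerance instead of using ===.
function approximatelyEqual(a, b, epsilon = Number.EPSILON) {
  return Math.abs(a - b) <= epsilon;
}

console.log(0.1 + 0.2 === 0.3);                  // false
console.log(approximatelyEqual(0.1 + 0.2, 0.3)); // true
```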

Scientific Notation

Scientific notation is a nice and concise way of representing very large and very small numbers.

With scientific notation we write all numbers in the form m * 10^n:

number           scientific notation   e-form
698,000,000,000  6.98 * 10^11          6.98e11
350              3.5 * 10^2            3.5e2
0.051            5.1 * 10^-2           5.1e-2

🎵 Note:

We restrict the mantissa to a single non-zero digit to the left of the decimal point.
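Javascript supports the e-form directly in number literals, and toExponential() converts a number to scientific notation:

```javascript
// Number literals can be written in e-form, and toExponential()
// produces a scientific notation string.
console.log(6.98e11 === 698_000_000_000); // true
console.log((350).toExponential());       // "3.5e+2"
console.log((0.051).toExponential());     // "5.1e-2"
```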

IEEE 754 Floating Point Format

So how exactly are real numbers represented by a computer?

They use a form of binary scientific notation, that’s how!

Most programming languages use the IEEE 754 standard, which has 32-bit single-precision “floats” and 64-bit double-precision “doubles”.

Why a Standard

Before float standards existed, different hardware manufacturers and languages all used their own quirky formats.

IEEE 754 not only specifies the format, but also the algorithms for arithmetic, the rounding rules, and the tests implementations must pass, like:

x + y == y + x
x + 0 == x
if (x == y) then x - y == 0
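We can check some of these properties in Javascript. Note that while commutativity is guaranteed, rounding means floating-point addition is not associative:

```javascript
const x = 0.1, y = 0.2, z = 0.3;

console.log(x + y === y + x); // true: commutativity holds.

// Rounding makes addition non-associative -- grouping changes the result:
console.log((x + y) + z === x + (y + z)); // false
console.log((x + y) + z); // 0.6000000000000001
console.log(x + (y + z)); // 0.6
```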

IEEE 754 Binary Format

In IEEE 754 a number is represented in binary scientific notation. So instead of m * 10^n it’s:

number = m * 2^n

Or more specifically:

num = -1^sign * mantissa * 2^exponent

Where:

  • The mantissa is a binary number of the form: 1.######...
  • The exponent is also a binary number.
  • The sign bit allows for positive (0) and negative numbers (1).

Bit Size          Common Name   Sign Bit     Exponent          Mantissa
32-bit (4 bytes)  float         1 sign bit   8 exponent bits   23 mantissa bits
64-bit (8 bytes)  double        1 sign bit   11 exponent bits  52 mantissa bits

🎵 Note:

There’s a trade-off between precision (large mantissa) and range (large exponent).

IEEE 754 32-bit Float Example: The Number One (Almost!)

Let’s try to represent 1.0 using the IEEE 754 binary format:

num = -1^sign * mantissa * 2^exponent
1.0 = -1^0 * 1.0 * 2^0

[0][00000000][00000000000000000000000]
 ▲         ▲                        ▲
 └──Sign   └──Exponent              └──Mantissa right of decimal point

🎵 Note:

Because the mantissa is normalized to 1.# we don’t need to store the leading one.

⚡ Warning:

But wait! Intuitively, shouldn’t an all zero bitmask mean decimal zero?

IEEE 754 32-bit Float Example: The Number One (Actual!)

Let’s add 127 to the exponent when we store it. This allows for storage of exponent values from -126 to 127 using 00000001 to 11111110. (For 64-bit numbers, the offset used is 1023.)

So using 32-bit IEEE 754, the decimal number 1.0 is actually stored as:

[0][01111111][00000000000000000000000]
 ▲       ▲                          ▲
 └──Sign └──Exponent shifted by 127 └──Mantissa right of the decimal point

⏳ Wait For It:

Later we’ll talk about the special cases when the exponent is all zeros or all ones.
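We can double-check this stored bit pattern in Javascript. This sketch (the floatToBits helper name is our own) writes a value into a 4-byte buffer as a 32-bit float and reads the raw bits back out with a DataView:

```javascript
// Write a value as a 32-bit float, then read its raw bits back out.
function floatToBits(value) {
  const view = new DataView(new ArrayBuffer(4));
  view.setFloat32(0, value);
  const bits = view.getUint32(0).toString(2).padStart(32, "0");
  // Split into sign (1 bit), exponent (8 bits), and mantissa (23 bits).
  return `${bits[0]} ${bits.slice(1, 9)} ${bits.slice(9)}`;
}

console.log(floatToBits(1.0)); // "0 01111111 00000000000000000000000"
```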

Another 32-bit Float Example: -0.09375

Step 1: Determine Sign Bit

Sign Bit = 1 (Negative)

Step 2: Convert decimal number to binary:

0.09375 [decimal] = 0.00011 [binary]

Step 3: Normalize the mantissa

Mantissa = 0.00011 = 1.1 * 2^-4

Step 4: Bias the exponent by adding 127 and convert to binary:

Exponent = -4 + 127 = 123 [decimal] = 01111011 [binary]

Step 5: Remove the leading one from the mantissa:

Mantissa = 1.10000000000000000000000 = 10000000000000000000000

Step 6: The full IEEE 754 binary representation of -0.09375

[1][01111011][10000000000000000000000]
 ▲         ▲                        ▲
 └──Sign   └──Exponent              └──Mantissa right of the decimal point
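We can sanity-check this result in Javascript by writing -0.09375 into a buffer as a 32-bit float and printing its raw bits (a quick sketch using DataView):

```javascript
// Store -0.09375 as a 32-bit float and read back the raw bit pattern.
const view = new DataView(new ArrayBuffer(4));
view.setFloat32(0, -0.09375);

const bits = view.getUint32(0).toString(2).padStart(32, "0");
console.log(bits); // 10111101110000000000000000000000
// sign: 1, exponent: 01111011, mantissa: 10000000000000000000000
```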

The “Next” Number After 1.0

So which number comes after 1.0 in the available 32-bit floating point numbers?

Let’s add a one to the smallest fractional bit of mantissa:

[0][01111111][00000000000000000000001]
 ▲         ▲                        ▲
 └──Sign   └──Exponent              └──Mantissa right of the decimal point

nextNum = -1^0 * 1.00000000000000000000001 * 2^0
nextNum = 1.00000000000000000000001 [binary]
nextNum = 1.00000011920928955078125 [decimal] (The delta is 2^-23: 1 divided in half 23 times.)
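We can find this “next” float programmatically by viewing a 32-bit float’s bits as an unsigned integer sharing the same buffer, then incrementing it by one:

```javascript
// A Float32Array and a Uint32Array sharing one buffer let us
// manipulate a float's raw bits directly.
const floatView = new Float32Array(1);
const bitView = new Uint32Array(floatView.buffer);

floatView[0] = 1.0;
bitView[0] += 1; // Increment the last bit of the mantissa.
console.log(floatView[0]); // 1.0000001192092896
```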

⚡ Warning:

This doesn’t mean there’s a constant delta of 0.0000001192… between all floats!

Float Density

Because float precision is based on its mantissa, and because the mantissa is the same number of bits regardless of the size of the number being stored, there’s an uneven distribution of floats inside the range of possible real numbers!

Approximately half of all possible floats can be found in the -1.0 to +1.0 range!

Range           Fixed Step Between Decimal Numbers
1 to 2          2^-23
2 to 4          2^-22
4 to 8          2^-21
8 to 16         2^-20
2^n to 2^(n+1)  2^(n-23)
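This doubling step size can be measured directly. Here’s a sketch (the floatStepAbove helper is our own) that uses a Float32Array and a Uint32Array sharing one buffer:

```javascript
// Measure the gap between a 32-bit float and the next float above it.
function floatStepAbove(value) {
  const f = new Float32Array(1);
  const u = new Uint32Array(f.buffer);
  f[0] = value;
  u[0] += 1;           // Step to the next representable float...
  return f[0] - value; // ...and subtract to get the local step size.
}

console.log(floatStepAbove(1)); // 1.1920928955078125e-7 (2^-23)
console.log(floatStepAbove(2)); // 2.384185791015625e-7  (2^-22)
console.log(floatStepAbove(8)); // 9.5367431640625e-7    (2^-20)
```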

Special Floats

Exponent   Mantissa        Represents
11111111   All Zeros       Infinity
11111111   Not All Zeros   NaN
00000000   All Zeros       Zero
00000000   Not All Zeros   Subnormal (very small)

Which means that with the sign bit we can store +Infinity, -Infinity, and, strangely, +Zero and -Zero.
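A few of these special values in action in Javascript:

```javascript
console.log(1 / 0);  // Infinity
console.log(-1 / 0); // -Infinity
console.log(0 / 0);  // NaN

console.log(NaN === NaN); // false: NaN compares unequal to everything!

console.log(0 === -0);         // true, but...
console.log(Object.is(0, -0)); // false: the sign bit tells them apart.
console.log(1 / -0);           // -Infinity
```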

Subnormal or Denormalized Numbers

With a mantissa of the form 1.#... we get a gap between 0.0 and the first representable number. This gap is present on the positive and negative side of 0, and is called “the hole at zero”.

To remove this gap, the standard represents the smallest non-zero numbers in a denormalized form denoted by an all-zero exponent. For these denormalized numbers the exponent is still considered to be its smallest value (-126 for 32-bit and -1022 for 64-bit), but the mantissa is now assumed to be of the form 0.#...

You will sometimes hear this denormalization strategy referred to as gradual underflow: as floats in this denormalized range get closer and closer to zero, they gradually lose precision.
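In Javascript (which uses 64-bit doubles), the smallest positive subnormal is exposed as Number.MIN_VALUE:

```javascript
// Number.MIN_VALUE is the smallest positive subnormal double: 2^-1074.
console.log(Number.MIN_VALUE);                // 5e-324
console.log(Number.MIN_VALUE === 2 ** -1074); // true

// Gradual underflow: halving it rounds away the last remaining bit.
console.log(Number.MIN_VALUE / 2); // 0
```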

Extremes - 32-bit Examples

Assuming just positive values:

IEEE 754 32-Bits                     Explanation         Value
0 00000000 00000000000000000000001   Smallest Subnormal  2^-126 * 2^-23 = 2^-149 ≈ 1.4012984643 * 10^-45
0 00000000 11111111111111111111111   Largest Subnormal   2^-126 * (1 - 2^-23) ≈ 1.1754942107 * 10^-38
0 00000001 00000000000000000000000   Smallest Normal     2^-126 ≈ 1.1754943508 * 10^-38
0 11111110 11111111111111111111111   Largest Normal      2^127 * (2 - 2^-23) ≈ 3.4028234664 * 10^38
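We can verify the 32-bit extremes in Javascript with Math.fround, which rounds a double to the nearest representable 32-bit float:

```javascript
// The largest normal 32-bit float value, built from the table above.
const largestNormal = 2 ** 127 * (2 - 2 ** -23);
console.log(largestNormal); // 3.4028234663852886e+38

// It survives rounding to 32 bits unchanged...
console.log(Math.fround(largestNormal) === largestNormal); // true
// ...but one power of two further overflows the 32-bit range:
console.log(Math.fround(2 ** 128)); // Infinity
```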

Further Reading