Engineering Blog

# Not Understanding Floating Points

In this post, we are going to look at how floating-point numbers work in Go.

## What Are Floating-Point Types in Go

In Go, there are two floating-point types (if we set aside the complex number types): float32 and float64. Floating point was invented to solve a major limitation of integers: their inability to represent fractional values. Put simply, floating-point types are data types that represent fractional values.

## Floating Point as Approximation

To avoid bad surprises, we need to know that floating-point arithmetic is an approximation of real arithmetic. Let’s examine the impact of working with approximations and how to increase accuracy. For that, we’ll look at a multiplication example:

```go
var n float32 = 1.0001
fmt.Println(n * n)
```
• Exact result: 1.0001 * 1.0001 = 1.00020001
• Result printed when running it on an x86 processor: 1.0002
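To make the gap visible, the same multiplication can be run in both precisions and compared against the hand-computed value. This is a minimal sketch; the helper names `squareFloat32`, `squareFloat64`, and `absError` are made up for illustration, and the exact digits printed may vary by platform.

```go
package main

import (
	"fmt"
	"math"
)

// exact is 1.0001 * 1.0001 worked out by hand.
const exact = 1.00020001

// squareFloat32 multiplies n by itself in single precision,
// then widens the result so it can be compared in float64.
func squareFloat32(n float32) float64 { return float64(n * n) }

// squareFloat64 multiplies n by itself in double precision.
func squareFloat64(n float64) float64 { return n * n }

// absError returns the absolute distance between got and the exact value.
func absError(got float64) float64 { return math.Abs(got - exact) }

func main() {
	r32 := squareFloat32(1.0001)
	r64 := squareFloat64(1.0001)
	fmt.Printf("float32: %.17g (error %.3g)\n", r32, absError(r32))
	fmt.Printf("float64: %.17g (error %.3g)\n", r64, absError(r64))
}
```

Switching to float64 does not make the result exact, but it should shrink the error by several orders of magnitude.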

How do we explain that? We need to understand the arithmetic of floating points first. Let’s take the float64 type as an example. Note that there’s an infinite number of real values between math.SmallestNonzeroFloat64 (the smallest positive nonzero float64) and math.MaxFloat64 (the largest finite float64). Conversely, the float64 type has a finite number of bits: 64. Because making infinite values fit into a finite space isn’t possible, we have to work with approximations. Hence, we may lose precision.

The same logic goes for the float32 type. Floating-point types in Go follow the IEEE-754 standard, with some bits representing a mantissa and other bits representing an exponent. A mantissa is a base value, whereas an exponent is a multiplier applied to the mantissa. In the single-precision type (float32), 8 bits represent the exponent and 23 bits represent the mantissa. In the double-precision type (float64), the exponent and the mantissa take 11 and 52 bits, respectively. The remaining bit is for the sign. To convert a floating-point number into a decimal, we use the following calculation:

``sign * 2^exponent * mantissa``
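The formula can be tried out on a float32 by pulling the three fields apart with `math.Float32bits`. The sketch below is illustrative only: `decompose` and `rebuild` are hypothetical helper names, and `rebuild` assumes a normal number, for which IEEE-754 stores the exponent with a bias of 127 and leaves the leading 1 of the mantissa implicit (details the formula above glosses over).

```go
package main

import (
	"fmt"
	"math"
)

// decompose splits a float32 into its raw IEEE-754 fields:
// 1 sign bit, 8 exponent bits, and 23 mantissa (fraction) bits.
func decompose(f float32) (sign, exponent, fraction uint32) {
	bits := math.Float32bits(f)
	sign = bits >> 31
	exponent = (bits >> 23) & 0xFF
	fraction = bits & 0x7FFFFF
	return sign, exponent, fraction
}

// rebuild applies sign * 2^exponent * mantissa to the raw fields.
// Assumption: the value is a normal number (not zero, subnormal, Inf, or NaN).
func rebuild(sign, exponent, fraction uint32) float64 {
	mantissa := 1 + float64(fraction)/float64(1<<23) // implicit leading 1
	value := math.Ldexp(mantissa, int(exponent)-127) // remove the bias of 127
	if sign == 1 {
		value = -value
	}
	return value
}

func main() {
	var n float32 = 1.0001
	s, e, f := decompose(n)
	fmt.Printf("sign=%d exponent=%d fraction=%#x\n", s, e, f)
	fmt.Printf("stored value: %.20f\n", rebuild(s, e, f))
}
```

Printing the rebuilt value with many digits shows the number that is actually stored, which is close to 1.0001 but not equal to it.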

## What are the implications for us as developers?

The first implication is related to comparisons. Using the == operator to compare two floating-point numbers can give unexpected results, because two values that are mathematically equal may differ in their last bits. Instead, we should compare their difference to see if it is less than some small error value. For example, the testify testing library (https://github.com/stretchr/testify) has an InDelta function to assert that two values are within a given delta of each other.

```go
func (a *Assertions) InDelta(expected interface{}, actual interface{}, delta float64, msgAndArgs ...interface{}) bool

// Example...
a.InDelta(math.Pi, 22/7.0, 0.01)
```
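The same idea can be sketched with the standard library alone. The helper `almostEqual` below is a hypothetical name, not part of testify, and the delta of 1e-9 is an arbitrary choice for this example; a suitable tolerance depends on the magnitude of the values involved.

```go
package main

import (
	"fmt"
	"math"
)

// almostEqual reports whether a and b differ by at most delta.
// Hypothetical helper; pick delta to suit the scale of your data.
func almostEqual(a, b, delta float64) bool {
	return math.Abs(a-b) <= delta
}

func main() {
	// Variables (not constants) so the addition happens in float64 at run time.
	x, y := 0.1, 0.2

	fmt.Println(x+y == 0.3)                  // false
	fmt.Println(almostEqual(x+y, 0.3, 1e-9)) // true
}
```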

Also bear in mind that the result of floating-point calculations depends on the actual processor. Most processors have a floating-point unit (FPU) to deal with such calculations. There is no guarantee that a calculation executed on one machine will produce the same result on another machine with a different FPU. Comparing two values using a delta can be a solution for implementing valid tests across different machines.

So far, we have seen that decimal-to-floating-point conversions can lead to a loss of accuracy. This is the error due to conversion. Also note that the error can accumulate in a sequence of floating-point operations.

Let’s look at an example with two functions that perform the same sequence of operations in a different order. In our example, f1 starts by initializing a float64 to 10,000 and then repeatedly adds 1.0001 to this result (n times). Conversely, f2 performs the same operations but in the opposite order (adding 10,000 in the end):

```go
func f1(n int) float64 {
	result := 10_000.
	for i := 0; i < n; i++ {
		result += 1.0001
	}
	return result
}

func f2(n int) float64 {
	result := 0.
	for i := 0; i < n; i++ {
		result += 1.0001
	}
	return result + 10_000.
}
```

Now, let’s run these functions on an x86 processor. This time, however, we’ll vary n.
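A minimal sketch for reproducing the comparison follows. The values of n (10, 1,000, and 1,000,000) are chosen here for illustration, and `exact` and `absErr` are hypothetical helpers; `exact` uses integer arithmetic scaled by 10,000 so that the reference value is rounded only once.

```go
package main

import (
	"fmt"
	"math"
)

// f1 starts at 10,000 and then adds 1.0001 n times.
func f1(n int) float64 {
	result := 10_000.
	for i := 0; i < n; i++ {
		result += 1.0001
	}
	return result
}

// f2 adds 1.0001 n times and only then adds 10,000.
func f2(n int) float64 {
	result := 0.
	for i := 0; i < n; i++ {
		result += 1.0001
	}
	return result + 10_000.
}

// exact returns 10,000 + n*1.0001, computed with integers scaled by 10,000
// and converted to float64 in a single division.
func exact(n int) float64 {
	return float64(100_000_000+int64(n)*10_001) / 10_000
}

// absErr returns the absolute difference between got and want.
func absErr(got, want float64) float64 { return math.Abs(got - want) }

func main() {
	for _, n := range []int{10, 1_000, 1_000_000} {
		want := exact(n)
		fmt.Printf("n=%d exact=%v\n", n, want)
		fmt.Printf("  f1=%.17g error=%.3g\n", f1(n), absErr(f1(n), want))
		fmt.Printf("  f2=%.17g error=%.3g\n", f2(n), absErr(f2(n), want))
	}
}
```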

Notice that the bigger n is, the greater the imprecision. However, we can also see that the f2 accuracy is better than f1. Keep in mind that the order of floating-point calculations can affect the accuracy of the result. When performing a chain of additions and subtractions, we should group the operations to add or subtract values with a similar order of magnitude before adding or subtracting those with magnitudes that aren’t close. Because f2 adds 10,000 at the end, it produces more accurate results than f1.

What about multiplications and divisions? Let’s imagine that we want to compute the following:

``a × (b + c)``

As we know, this calculation is equal to

``a × b + a × c``

Let’s run these two calculations with a having a different order of magnitude than b and c:

```go
a := 100000.001
b := 1.0001
c := 1.0002
fmt.Println(a * (b + c)) // 200030.00200030004
fmt.Println(a*b + a*c)   // 200030.0020003
```

The exact result is 200,030.0020003. Hence, the first calculation has the worse accuracy. Indeed, when performing floating-point calculations involving addition, subtraction, multiplication, or division, completing the multiplication and division operations first tends to give better accuracy. Sometimes, this may impact the execution time (in the previous example, it requires three operations instead of two). In that case, it’s a choice between accuracy and execution time.
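One way to check which form is closer is to compute the decimal result exactly with `math/big` and measure each float64 result against it. This is a sketch under assumptions: `mustRat`, `exactResult`, and `absError` are hypothetical helpers, and the printed errors may differ on architectures where the compiler fuses a multiplication and an addition into a single instruction.

```go
package main

import (
	"fmt"
	"math/big"
)

// mustRat parses a decimal literal into an exact rational number.
func mustRat(s string) *big.Rat {
	r, ok := new(big.Rat).SetString(s)
	if !ok {
		panic("invalid decimal literal: " + s)
	}
	return r
}

// exactResult computes a * (b + c) on the decimal inputs without rounding.
func exactResult() *big.Rat {
	sum := new(big.Rat).Add(mustRat("1.0001"), mustRat("1.0002"))
	return new(big.Rat).Mul(mustRat("100000.001"), sum)
}

// absError returns |got - exact|, rounded to a float64 only at the end.
func absError(got float64, exact *big.Rat) float64 {
	diff := new(big.Rat).SetFloat64(got) // exact value held by the float64
	diff.Sub(diff, exact)
	diff.Abs(diff)
	e, _ := diff.Float64()
	return e
}

func main() {
	a, b, c := 100000.001, 1.0001, 1.0002
	exact := exactResult()

	fmt.Println("exact:    ", exact.FloatString(10))
	fmt.Println("a*(b+c):  ", a*(b+c), "error:", absError(a*(b+c), exact))
	fmt.Println("a*b + a*c:", a*b+a*c, "error:", absError(a*b+a*c, exact))
}
```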

## Conclusion

Go’s float32 and float64 are approximations. Because of that, we have to bear a few rules in mind:

• When comparing two floating-point numbers, check that their difference is within an acceptable range.
• When performing additions or subtractions, group operations with a similar order of magnitude for better accuracy.
• To favor accuracy, if a sequence of operations requires addition, subtraction, multiplication, or division, perform the multiplication and division operations first.

References:

• 100 Go Mistakes and how to avoid them, Teiva Harsanyi, Manning Publications Co

### Bibek Pokhrel

Software Engineer
