Bit depth is such a widely discussed topic in audio engineering that probably anyone reading this is already aware of many of its details. Perhaps details like the dynamic range of an audio CD being close to ~96 dB, or the 6 dB-per-bit rule, are already clear to some extent. However, I've always noticed that in topics like this there's always room to dig deeper and achieve a better understanding of what bit depth/resolution/word length really is.
I'm pretty sure many of you have seen graphics like these, exemplifying what dynamic range is and how we can attempt to quantify a property describing the potential quality of the signals we work with.
Nonetheless, many of these are either misleading or obscure the reality behind an abstraction. While that is entirely valid for a basic understanding of the concept of bit depth, I'll try to unveil a different perspective on how to understand digital signals.
Bit Depth 101
At this point you should know that digital signals have a finite set of possible steps to represent signal amplitudes, known as quantization steps. Because of this, when digitizing a signal we encounter discrepancies between the true analog values and their digital representations. At each sample our ADC (analog-to-digital converter) determines that an analog value (say 1.0123) is close enough to a digital quantization step (1.01) to be represented by it. Accordingly, we are introducing an error (0.0023) into our signal; a deviation from its real value that represents some loss of information. This error is commonly known as quantization noise, although I prefer the term quantization distortion: noise and distortion are both aberrations in our signals, but distortion is characterized by being signal dependent while noise isn't. This is an important distinction because it implies that quantization error doesn't have a constant behavior but a signal-dependent one, as we'll see later.
As expected, this quantization error is directly affected by the number of available steps in our system, and because the number of available steps is defined by the bit depth, so is the quantization error. You've probably studied how this happens by analyzing the number of possible amplitudes expressed by different numbers of bits: with 1 bit we have two possible options (1, 0), with 2 bits we have 4 (00, 01, 10, 11), etc., so we conclude that for n bits we have 2^n different options. For 16 bits that's 65536 quantization steps and for 24 bits it's 16777216. However, it is important to understand that there are many, many ways to build quantization and coding schemes (it is even possible that different ADCs take different approaches to some details we'll check later). No matter what, we need a finite number of quantization steps to represent our signal, but we can make many decisions about how to assign these quantization steps to amplitudes in order to achieve specific behaviors. The most common approach in audio is a uniform mid-tread quantization scheme, but others exist, such as power-law companding, logarithmic companding and floating point quantization, which we use regularly.
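If you want to sanity-check those numbers yourself, here is a tiny sketch. It uses the plain 20*log10(2^n) figure and ignores the extra ~1.76 dB term that appears when a full-scale sine is assumed.

```python
# Sanity check: n bits give 2^n steps, and the ratio between full scale and
# one step is 20*log10(2^n), roughly n * 6.02 dB.
import math

for bits in (1, 2, 16, 24):
    steps = 2 ** bits
    range_db = 20 * math.log10(steps)
    print(f"{bits:2d} bits -> {steps:>10,d} steps, ~{range_db:6.2f} dB")
```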
In uniform quantization the available steps are evenly spaced across the amplitude range to be digitized. This means that for every possible amplitude the error is bounded by half of a quantization step, which is equivalent to saying that no sample will be off by more than half a step. Consider that requirement for a 2-bit quantizer.
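Here is a minimal sketch of that 2-bit case (my own toy implementation, with a full scale assumed to be ±1, not any specific converter's scheme), checking that no sample ends up more than half a step away from the level chosen to represent it:

```python
# A toy uniform 2-bit quantizer over ±1: four evenly spaced levels, and no
# sample is off by more than half a step from its assigned level.
import numpy as np

step = 2 / 2 ** 2                                   # 2 bits -> 4 levels, step = 0.5

def quantize_2bit(x):
    levels = (np.floor(x / step) + 0.5) * step      # -0.75, -0.25, 0.25, 0.75
    return np.clip(levels, -0.75, 0.75)

x = np.random.default_rng(0).uniform(-1, 1, 10_000)      # pretend "analog" values
print(np.unique(quantize_2bit(x)))                        # [-0.75 -0.25  0.25  0.75]
print(np.max(np.abs(x - quantize_2bit(x))) <= step / 2)   # True
```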
As you can see, there's no quantization step at 0, which is a bit inconvenient for audio signals since they always have positive and negative values and often swing around, or sit at, an amplitude of 0.
Mid-Tread and Mid-Rise Quantizers
Because of this we have another, less commonly discussed distinction: mid-tread versus mid-rise quantizers. A mid-tread quantizer places a level exactly at 0 (a mid-rise quantizer doesn't), and in practice it needs to "discard" a quantization step to end up with an odd number of steps so that one value sits precisely at 0. (In the Red Book this is handled instead by allowing one more quantization step on the negative side: [−32768, 32767].)
Although it seems like a very subtle distinction, with very little effect at higher bit depths, I think it brings a fascinating premise to the table: because we know some general properties of the signals we will be encoding, we can make decisions that optimize our coding scheme to achieve a better behavior of the quantization error. Although our systems are quite general (there are, for example, coding schemes aimed specifically at voice signals), we do recognize that an amplitude of 0 is so common that it is worth ensuring there's a quantization step at that amplitude. (Check Huffman coding as an example of a system that assigns shorter codes to more common values to optimize data storage.)
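As a concrete example of that decision about the 0 level, here is the Red Book-style mid-tread layout expressed numerically (a sketch using the usual -32768 to 32767 integer codes scaled to a ±1 full scale):

```python
# The mid-tread choice for the Red Book case: integer codes from -32768 to
# 32767 scaled by one step. Zero is exactly representable, at the cost of one
# extra level on the negative side and no level at exactly +full scale.
import numpy as np

bits = 16
step = 2 / 2 ** bits
codes = np.arange(-(2 ** (bits - 1)), 2 ** (bits - 1))   # -32768 ... 32767
levels = codes * step
print(0.0 in levels)               # True  -> a level exactly at 0
print(levels.min(), levels.max())  # -1.0  0.999969... -> asymmetric extremes
```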
A Different Perspective.
Now that we have a clear preamble, here's what I'd really like to present. Very often we study bit depth in relation to dynamic range and SNR (signal-to-noise ratio). They are clearly related, but it's easy to fall into misleading assumptions. For example, every single graph at the beginning of this document presents our signal and noise as separate elements, because through this abstraction we are able to visualize what SNR is. In the first graph we see some distance between the signal and the noise, although this is quite misleading because SNR is a ratio, not a distance. Graphs 2 and 3 are in decibels, so a ratio is transformed into a difference thanks to logarithmic properties. BUT! All of them make us believe that noise and signal are distinguishable, when clearly the problem with noise is that we cannot separate it from our signal! At every instant noise and signal ride the waveform together, because at every instant the waveform (both analog and digital) holds one and only one amplitude value, which is the combination of both signal and noise. It is a much later element, our brain, which discerns which parts of the signal are noise and which are not.
If you ever studied physics in a semi-formal way, you might recall that most measurements have some uncertainty associated with them. Take a measuring tape and measure something around you. 30.4 cm. Why? Why not 30.4562342 cm? It turns out the measuring tape has some limitations on what it can measure. It has quantization steps of 1 mm, so each measurement you make with it actually has an uncertainty of +/- 0.5 mm (assuming you always round to the correct step). This is a consequence of the measuring tape being a discrete system, BUT it is also a consequence of the length of objects being a continuous property. Because of this, quantization error is not really a property of discrete systems; it is a consequence of precision loss. In contrast, measuring how many pages a notebook has carries no uncertainty, as long as you have enough precision to count them all!
Again, this is a subtle distinction but an important one. When we digitize an analog signal we start with a continuous domain and discretize it into steps. The signal had an amplitude of 0.844543 but we decided to represent it with 0.84 because that was the quantization step that best represented that amplitude. Accordingly, we know that its amplitude is not 0.84 but 0.84 +/- 0.005. Every time we reduce the bit depth we increase the step size, and accordingly we increase the uncertainty of our measurement, thus reducing the SNR. BUT! SNR is clearly a function of both the signal level and the uncertainty associated with that level. Accordingly, when the signal reaches a large value it has a much bigger SNR than when it is near its zero crossings! It is said very often that bit depth only determines the "digital noise floor" of our file, but this leads people into believing that the noise is "down there" while the signal is "up here". However, if we understand that bit depth defines the uncertainty of our measurement, we realize that we can't know whether the analog signal had the exact value expressed by the digital quantization step; we can only have some confidence that it was in the neighborhood of the quantization step, and it is this neighborhood that grows when we reduce the bit depth. That is why it is called resolution! It refers to our confidence when resolving that value as a good representation of the analog value.
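To put some rough numbers on that, here is a small sketch assuming a uniform quantizer with a ±1 full scale; the sample amplitudes are arbitrary, chosen only to contrast a sample near a zero crossing with one near full scale.

```python
# The ±half-step uncertainty is fixed by the bit depth, so the same uncertainty
# looks enormous next to a sample near a zero crossing and negligible next to
# a sample near full scale.
import math

for bits in (16, 24):
    uncertainty = (2 / 2 ** bits) / 2          # half a quantization step
    for amplitude in (0.001, 0.1, 1.0):
        ratio_db = 20 * math.log10(amplitude / uncertainty)
        print(f"{bits} bits, |sample| = {amplitude:5.3f}: "
              f"±{uncertainty:.1e} -> {ratio_db:5.1f} dB above the uncertainty")
```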
Now that we understand that quantization error is a byproduct of precision loss, we might even say that some digital signals have infinite SNR! Wait, what? It turns out that by understanding SNR in relation to the uncertainty produced by digitization, we've been able to find a case where there's no noise. Suppose you have a digital signal that visits one quantization step at each sample. At each position in that signal there's no uncertainty, because the signal has exactly and unequivocally the desired amplitude. There's no noise, because there's absolute confidence that the digital amplitude is exactly what it is intended to be; thus the noise level is -inf and the SNR is infinite. Of course, this is hardly a practical situation: first, once that signal is converted to the analog world it will acquire some noise in the DAC, and second, most musical signals wouldn't have every sample at an intended amplitude that lands exactly on one of the quantization steps. But it does show how different discrete and continuous systems are, and it begs the question: can we blindly translate every analog concept into the digital realm without consequences?
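Here is a deliberately contrived sketch of that edge case: a 4-bit ramp whose samples land exactly on every quantization level, so re-quantizing it changes nothing and the quantization error is identically zero.

```python
# A digital ramp that visits every quantization level once: re-quantizing it
# reproduces it exactly, so the error is zero and the "noise" sits at -inf.
import numpy as np

bits = 4
step = 2 / 2 ** bits
ramp = np.arange(-(2 ** (bits - 1)), 2 ** (bits - 1)) * step   # every level once
requantized = np.round(ramp / step) * step
print(np.array_equal(ramp, requantized))    # True -> zero quantization error
```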
A much tougher case:
Floating point coding
Again, it seems like a common topic, but it is hardly ever explained in detail. Floating point representation is a quantization and coding scheme where the distance between quantization steps is proportional to the size of the signal, giving a relatively constant SNR (make sure that by the end of this you understand this definition!).
This is achieved by notating the amplitude in "binary scientific notation". Take some amplitude of an analog signal and write it in scientific notation with a 3-digit mantissa and a 1-digit exponent (let's allow signs in both for ease): 0.823456 = 8.23 x 10^-1. Notice that the error is 0.000456 and the uncertainty is +/- 0.0005. What if the signal had an amplitude of 823456 = 8.23 x 10^5? Now both the error of 456 and the uncertainty of +/- 500 are substantially bigger! But what if the signal was 0.0000823456 = 8.23 x 10^-5? The error is 0.0000000456 and the uncertainty is +/- 0.00000005, which is again substantially smaller! As you can see, the behavior is very different from that of a uniform quantizer, where the uncertainty was constant.
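If you want to play with this toy decimal scheme yourself, here is a small sketch (the helper name is mine, and this is still decimal scientific notation, not IEEE 754 yet): the absolute error grows with the magnitude of the value, while the relative error stays roughly constant.

```python
# A toy "decimal float" with a 3-digit mantissa: absolute error scales with
# the value's magnitude, relative error stays roughly constant.
def to_three_significant_digits(x):
    return float(f"{x:.2e}")            # keep d.dd x 10^e, i.e. 3 digits

for value in (0.823456, 823456.0, 0.0000823456):
    approx = to_three_significant_digits(value)
    error = value - approx
    print(f"{value:>12g} -> {approx:>10g}  error={error:.3g}  "
          f"relative={error / value:.3g}")
```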
In reality, because all of this is implemented in binary, the only digits available are 1 and 0, but the essence is the same. IEEE 754 defines the single precision floating point format, or 32-bit float, which uses 23 bits for the mantissa, 1 bit for the sign and 8 bits for the exponent. Because of this many people say that 32-bit floating point has the same resolution as 24-bit fixed point (as 23 bits of mantissa plus 1 sign bit should give 24 bits of resolution). However, the IEEE floating point notation has a clever trick, with some weird behaviors, that extends this to 25 bits of constant SNR.
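If you're curious, here is one way to peek at those three bit fields using Python's standard struct module (the helper is my own, purely for inspection):

```python
# Inspect the IEEE 754 single-precision layout: 1 sign bit, 8 exponent bits
# (biased by 127) and 23 explicitly stored mantissa bits (the leading 1 of
# normal numbers is implicit, as discussed below).
import struct

def float32_fields(x):
    (raw,) = struct.unpack(">I", struct.pack(">f", x))  # the 32 raw bits
    sign = raw >> 31
    exponent = (raw >> 23) & 0xFF
    mantissa = raw & 0x7FFFFF
    return sign, exponent, mantissa

# 0.15625 = 1.01b x 2^-3, stored exponent 124 (= -3 + 127):
print(float32_fields(0.15625))   # (0, 124, 2097152)
```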
Denormalized Numbers!
Remember that in our scientific notation we had three digits for the mantissa? (e.g. 8.23 x 10^5). This was supposed to give us a constant SNR, as the uncertainty was proportional to the amplitude of the signal, but there's a special case where that doesn't happen: VERY small numbers. Consider this one: 0.0000000000823456 = 8.23 x 10^-11. Because we imposed the requirement that the exponent can only be one digit long, we have to write it as 0.08 x 10^-9. The consequence is that we lost two orders of magnitude of precision (increased our uncertainty) because we had to drop the usual .23 of the other cases in order to keep the exponent at a single digit. At this point the SNR becomes variable again and dependent on the amplitude, just as in uniform quantization. When this happens, when the mantissa has leading zeros because the exponent cannot get any more negative, the number is still representable but it is called a denormal number, denormalized number or subnormal. Although this range is actually quite far from audio usage (something like -758 dB), some DSP algorithms approach 0 asymptotically (such as reverbs and IIR filters), and when they reach the subnormal range most processors incur performance issues where each operation can run up to 100 times slower! Because of this it is very important that DSP developers make sure this never happens, although that is not always the case.
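As a rough sketch (the decay factor is arbitrary, and the slowdown itself depends on the CPU and is not measured here), this is how a simple decaying feedback tail drifts into the float32 subnormal range:

```python
# A one-pole decaying tail, y[n] = 0.5 * y[n-1], eventually drops below the
# smallest normal float32 (~1.18e-38) and becomes subnormal.
import numpy as np

tiny = np.finfo(np.float32).tiny          # smallest *normal* float32
y = np.float32(1.0)
decay = np.float32(0.5)
for sample in range(200):
    y *= decay
    if 0 < y < tiny:
        print(f"sample {sample}: y = {y:.3e} is subnormal")
        break
```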
So what is the nice trick? It turns out that because in binary notation there are only two digits (1 or 0), and most numbers are normalized (those starting with a leading 1 before the "floating point"), the standard established that said 1 is codified implicitly. Because of this we save a bit, which allows for a total resolution equivalent to 25 bits.
Why does all this make a difference?
Notice that the common approach to SNR and dynamic range is to compare the highest possible value accepted by the system against the noise floor of the system, either by measuring it or by considering the smallest value the system can represent. HOWEVER! When dynamic range and SNR are understood from the perspective of the uncertainty of the measurement at every instant, you quickly notice that in floating point notation the uncertainty is bigger when the signal is big, that the SNR is constant across the normalized amplitude values, that the uncertainty might be non-existent when the signal was never digitized, and that analog concepts don't translate that well into the digital world, so a good understanding of digital systems is essential to avoid the common confusions this produces.
The Graph that shows it all
The following graphs show the SNR of different bit depths: fixed point formats using uniform mid-tread quantizers, plus IEEE 754 single precision float (which is a non-uniform mid-tread quantizer). Hopefully by this point they make some sense. The first ones (8, 16, 24 and 32 bits) are fixed point formats; the next ones follow a floating point scheme where the first number is the amount of bits assigned to the exponent and the second the amount assigned to the mantissa. Check how the left side of the floating point curves becomes tilted, just like the fixed point formats! I suppose you must know by now why this is happening :)
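If you'd like to reproduce the general shape of these curves yourself, here is a rough sketch (not the script used for the graphs above; the sine test signal and the chosen levels are my own): it measures the SNR of a quantized sine at several levels, for 16-bit fixed point versus float32.

```python
# Quantize a full-cycle sine at different levels and measure the SNR for
# 16-bit fixed point versus a float32 round trip.
import numpy as np

def snr_db(clean, quantized):
    noise = quantized - clean
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

n = 2 ** 16
sine = np.sin(2 * np.pi * np.arange(n) / n)              # float64 "reference"

for level_db in (0, -60, -120):
    x = sine * 10 ** (level_db / 20)
    fixed = np.round(x * 32767) / 32767                  # 16-bit fixed point
    single = x.astype(np.float32).astype(np.float64)     # round trip to float32
    print(f"{level_db:5d} dBFS: 16-bit SNR {snr_db(x, fixed):6.1f} dB, "
          f"float32 SNR {snr_db(x, single):6.1f} dB")
```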
Conclusions
Maybe the toughest thing to understand about all of this is letting go of the idea that digital systems have a noise floor that behaves like that of analog systems. As we saw, quantization error comes from loss of precision, not from the discrete nature of the system. This quantization error is bounded by the separation between quantization steps, but only in uniform quantization. In contrast, in floating point quantization the quantization error is proportional to the amplitude of the signal, which implies that the SNR is constant (except in the subnormal range!).
I really hope this gives you a different perspective on dynamic range in digital systems and on how quantization error really behaves. There's a ton of other things to talk about in relation to this topic: side effects, consequences, small details and not-so-small details. How does this precision affect DSP? Is it better to do processing in fixed point or in floating point? How could I write so much about this without even getting into the topic of dither? Does dither also apply in a floating point architecture?
An infinity of things that I would like to discuss, but for now this is the end of this article. As always, I'm open to any doubts or comments :D
Good thoughts. Audio Workstation makers talk about 32 bits, but (as you note) 32-bit float can only guarantee bit-perfect processing to the 24th bit. 32-bit fixed, or 64-bit float, is the right approach. As for digital-to-analog conversion, we will soon see a paradigm shift in signal-to-noise performance, using parallel processing paths. DACs will likely operate at 160-180 dB dynamic range, with broadband quiescent noise in the low nanovolt range. This means low-path processing with exceptionally low resistance.