## Real-Time Digital Signal Processing

## DSP fundamentals

## Number representation and word-length effects.

## Number Representation

$\square$ 8-bit Binary Data Format (Integer)

(b) Signed integer (2's complement)

$\square$ 8-bit Binary Data Format (Fractional)

(b) Signed fractional (4.4) format

## Integer Fixed-Point Representation

$\square$ N-bit fixed point, 2's complement integer representation
$X=-b_{N-1} 2^{N-1}+b_{N-2} 2^{N-2}+\ldots+b_{0} 2^{0}$
$\square$ Difficult to use due to possible overflow
$\square$ In a 16-bit processor, the dynamic range is $-32,768$ to 32,767.
$\square$ Example:
$200 \times 350=70000$, which is an overflow!
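As a rough illustration of the overflow above, the sketch below shows what a 16-bit register would hold if the product were stored without any protection. The helper name `wrap_int16` is an assumption made for this example, and wraparound is only one possible hardware behavior.

```python
def wrap_int16(value: int) -> int:
    """Keep the low 16 bits and reinterpret them as a 2's complement integer."""
    low = value & 0xFFFF
    return low - 0x10000 if low & 0x8000 else low


product = 200 * 350            # 70000, outside -32768..32767
stored = wrap_int16(product)   # value left in a 16-bit register after wraparound
print(product, stored)
```

The stored value bears no resemblance to the true product, which is why integer formats are hard to use safely.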

## Fractional Fixed-Point Representation

$\square$ Also called Q-format
$\square$ Fractional representation suitable for DSP algorithms.
$\square$ Fractional number range is between $-1$ and $1$
$\square$ Multiplying a fraction by a fraction always results in a fraction and will not produce an overflow (e.g., $0.99 \times 0.9999$ is less than 1). The only exception is $(-1) \times(-1)=+1$, which is not representable.
$\square$ Successive additions may cause overflow
$\square$ Represents numbers between $-1.0$ and $1-2^{-(N-1)}$, where $N$ is the total number of bits

Decimal equivalent for QN formats (here $N$ is the number of fractional bits and $b_{N}$ is the sign bit):

$$
x_{10}=-b_{N}+\sum_{m=0}^{N-1} b_{m} \cdot 2^{m-N}
$$

## General Fixed-Point Representation

$\square$ Qm.n notation
$\square$ $m$ bits for integer portion
$\square$ $n$ bits for fractional portion
$\square$ Total number of bits $N=m+n+1$, for signed numbers
$\square$ Example: 16-bit number ($N=16$) and Q2.13 format

- 2 bits for integer portion
- 13 bits for fractional portion
- 1 sign bit (MSB)

$\square$ Special cases:

- 16-bit integer number ($N=16$) => Q15.0 format
- 16-bit fractional number ($N=16$) => Q0.15 format; also known as Q.15 or Q15
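The Qm.n bookkeeping above can be sketched in a few lines. The function name `q_format_info` and the dictionary keys are hypothetical; the dynamic range is taken as the ratio of the largest code to the smallest nonzero code.

```python
import math


def q_format_info(m: int, n: int) -> dict:
    """Describe a signed Qm.n word: 1 sign bit, m integer bits, n fractional bits."""
    total_bits = m + n + 1
    resolution = 2.0 ** -n                    # weight of the LSB
    max_code = 2 ** (total_bits - 1) - 1      # e.g. 0x7FFF for a 16-bit word
    return {
        "total_bits": total_bits,
        "largest": max_code * resolution,
        "most_negative": -(2.0 ** m),
        "resolution": resolution,
        "dynamic_range_db": 20.0 * math.log10(max_code),
    }


for m, n in [(0, 15), (2, 13), (15, 0)]:
    print(f"Q{m}.{n}", q_format_info(m, n))
```

Note that all 16-bit formats share the same dynamic range in dB; only the position of the binary point changes.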



## Dynamic Ranges and Precision

| N (integer bits, incl. sign) | M (fractional bits) | Largest positive value (0x7FFF) | Most negative value (0x8000) | Precision (0x0001) | Precision ($2^{-M}$) | DR (dB) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 | 15 | 0.999969482421875 | -1 | 3.05176E-05 | $2^{-15}$ | 90.30873362 |
| 2 | 14 | 1.99993896484375 | -2 | 6.10352E-05 | $2^{-14}$ | 90.30873362 |
| 3 | 13 | 3.9998779296875 | -4 | 0.00012207 | $2^{-13}$ | 90.30873362 |
| 4 | 12 | 7.999755859375 | -8 | 0.000244141 | $2^{-12}$ | 90.30873362 |
| 5 | 11 | 15.99951171875 | -16 | 0.000488281 | $2^{-11}$ | 90.30873362 |
| 6 | 10 | 31.99902344 | -32 | 0.000976563 | $2^{-10}$ | 90.30873362 |
| 7 | 9 | 63.99804688 | -64 | 0.001953125 | $2^{-9}$ | 90.30873362 |
| 8 | 8 | 127.9960938 | -128 | 0.00390625 | $2^{-8}$ | 90.30873362 |
| 9 | 7 | 255.9921875 | -256 | 0.0078125 | $2^{-7}$ | 90.30873362 |
| 10 | 6 | 511.984375 | -512 | 0.015625 | $2^{-6}$ | 90.30873362 |
| 11 | 5 | 1023.96875 | -1024 | 0.03125 | $2^{-5}$ | 90.30873362 |
| 12 | 4 | 2047.9375 | -2048 | 0.0625 | $2^{-4}$ | 90.30873362 |
| 13 | 3 | 4095.875 | -4096 | 0.125 | $2^{-3}$ | 90.30873362 |
| 14 | 2 | 8191.75 | -8192 | 0.25 | $2^{-2}$ | 90.30873362 |
| 15 | 1 | 16383.5 | -16384 | 0.5 | $2^{-1}$ | 90.30873362 |
| 16 | 0 | 32767 | -32768 | 1 | $2^{0}$ | 90.30873362 |

## Scale Factors and Dynamic Range

| Format (N.M) | Scaling factor ($2^{M}$) | Range in hex (fractional value) |
| :---: | :---: | :---: |
| (1.15) | $2^{15}=32768$ | 0x7FFF (0.99) $\rightarrow$ 0x8000 (-1) |
| (2.14) | $2^{14}=16384$ | 0x7FFF (1.99) $\rightarrow$ 0x8000 (-2) |
| (3.13) | $2^{13}=8192$ | 0x7FFF (3.99) $\rightarrow$ 0x8000 (-4) |
| (4.12) | $2^{12}=4096$ | 0x7FFF (7.99) $\rightarrow$ 0x8000 (-8) |
| (5.11) | $2^{11}=2048$ | 0x7FFF (15.99) $\rightarrow$ 0x8000 (-16) |
| (6.10) | $2^{10}=1024$ | 0x7FFF (31.99) $\rightarrow$ 0x8000 (-32) |
| (7.9) | $2^{9}=512$ | 0x7FFF (63.99) $\rightarrow$ 0x8000 (-64) |
| (8.8) | $2^{8}=256$ | 0x7FFF (127.99) $\rightarrow$ 0x8000 (-128) |
| (9.7) | $2^{7}=128$ | 0x7FFF (255.99) $\rightarrow$ 0x8000 (-256) |
| (10.6) | $2^{6}=64$ | 0x7FFF (511.98) $\rightarrow$ 0x8000 (-512) |
| (11.5) | $2^{5}=32$ | 0x7FFF (1023.96) $\rightarrow$ 0x8000 (-1024) |
| (12.4) | $2^{4}=16$ | 0x7FFF (2047.93) $\rightarrow$ 0x8000 (-2048) |
| (13.3) | $2^{3}=8$ | 0x7FFF (4095.87) $\rightarrow$ 0x8000 (-4096) |
| (14.2) | $2^{2}=4$ | 0x7FFF (8191.75) $\rightarrow$ 0x8000 (-8192) |
| (15.1) | $2^{1}=2$ | 0x7FFF (16383.5) $\rightarrow$ 0x8000 (-16384) |
| (16.0) | $2^{0}=1$ (integer) | 0x7FFF (32767) $\rightarrow$ 0x8000 (-32768) |

## Examples

| Hex number | (16.0) format | (4.12) format | (1.15) format |
| :---: | :---: | :---: | :---: |
| 0x7FFF |  |  |  |
| 0x8000 |  |  |  |
| 0x1234 |  |  |  |
| 0x4BCD |  |  |  |
| 0x5566 |  |  |  |


| Number | $(1.15)$ format | $(2.14)$ format | $(8.8)$ format | (16.0) format |
| :---: | :--- | :--- | :--- | :--- |
| 0.5 |  |  |  |  |
| 1.55 |  |  |  |  |
| -1 |  |  |  |  |
| -2.0345 |  |  |  |  |

## How to convert fractional number into integer

$\square$ Conversion from fractional to integer value:

- Step 1: Normalize the decimal fractional number to the range determined by the desired Q format
- Step 2: Multiply the normalized fractional number by $2^{n}$
- Step 3: Round the product to the nearest integer
- Step 4: Write the decimal integer value in binary using $N$ bits

$\square$ Example: convert the value 3.5 into an integer value that can be recognized by a DSP assembler using the Q15 format:

- 1) Normalize: $3.5 / 4=0.875$
- 2) Scale: $0.875 \times 2^{15}=28672$
- 3) Round: $28672$
- 4) Binary: $28672=$ 0x7000 $=0111\,0000\,0000\,0000_{2}$
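The four conversion steps can be sketched as follows. The function names, the `full_scale` parameter used for the normalization step, and the choice of round-half-up are assumptions made for illustration.

```python
import math


def to_q_integer(value: float, n_frac: int, n_bits: int = 16,
                 full_scale: float = 1.0) -> int:
    """Convert a real value to the integer code of a signed Q format."""
    normalized = value / full_scale              # step 1: normalize
    scaled = normalized * (1 << n_frac)          # step 2: multiply by 2^n
    code = int(math.floor(scaled + 0.5))         # step 3: round to nearest
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    if not lo <= code <= hi:
        raise OverflowError(f"{value} does not fit the requested format")
    return code


def to_bits(code: int, n_bits: int = 16) -> str:
    """Step 4: write the code as an N-bit 2's complement bit string."""
    return format(code & ((1 << n_bits) - 1), f"0{n_bits}b")


code = to_q_integer(3.5, n_frac=15, full_scale=4.0)
print(code, hex(code), to_bits(code))
```

Raising an error on out-of-range input is a design choice of this sketch; a DSP assembler might saturate or wrap instead.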

## How to convert integer into fractional number

$\square$ Numbers and arithmetic results are stored in the DSP processor in integer form.
$\square$ They must be interpreted as fractional values according to the Q format in use.
$\square$ Conversion of an integer into a fractional number for Qm.n format:

- Divide the integer by the scaling factor of Qm.n, i.e., divide by $2^{n}$

$\square$ Example: which Q15 value does the integer number 2 represent? $2 / 2^{15}=2 \cdot 2^{-15}=2^{-14}$
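A minimal sketch of the reverse conversion, assuming the stored word is available as an unsigned 16-bit pattern. The helper names `hex_to_signed` and `from_q_integer` are hypothetical.

```python
def hex_to_signed(word: int, n_bits: int = 16) -> int:
    """Reinterpret an unsigned bit pattern as a 2's complement integer."""
    sign_bit = 1 << (n_bits - 1)
    return word - (1 << n_bits) if word & sign_bit else word


def from_q_integer(code: int, n_frac: int) -> float:
    """Interpret an integer code as a Q value by dividing by 2^n."""
    return code / (1 << n_frac)


print(from_q_integer(2, 15))                          # the integer 2 read as Q15
print(from_q_integer(hex_to_signed(0x7FFF), 15))      # largest Q15 value
print(from_q_integer(hex_to_signed(0x8000), 12))      # most negative (4.12) value
```

The same stored integer yields different real values depending only on where the binary point is assumed to be.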


## Two's complement system

| $\mathrm{B}=2$ |  |
| :---: | :---: |
| 011 | 3 |
| 010 | 2 |
| 001 | 1 |
| 000 | 0 |
| 111 | -1 |
| 110 | -2 |
| 101 | -3 |
| 100 | -4 |

Range: $-2^{b}$ to $\left(2^{b}-1\right)$ for $b+1$ data bits.

$DR_{dB}$: dynamic range in dB

$$
\begin{aligned}
DR_{dB} &= 20 \cdot \log_{10}\left(\frac{\text{largest possible word value}}{\text{smallest possible word value}}\right) \\
DR_{dB} &= 20 \cdot \log_{10}\left(\frac{2^{b}-1}{1}\right)=20 \cdot \log_{10}\left(2^{b}-1\right) \\
DR_{dB} &\approx 20 \cdot \log_{10}\left(2^{b}\right)=20 \cdot \log_{10}(2) \cdot b \\
DR_{dB} &\approx 6.02 \cdot b \ \mathrm{dB}
\end{aligned}
$$

| Negation mechanism |  |
| :--- | :--- |
| Step | Result |
| Original number | $011=3$ |
| One's complement (invert all bits) | 100 |
| Add 1 | 101 |
|  | $101=-3$ |

$\square$ One bit for the sign, $B$ bits for the number representation.
$\square$ Very popular system, widely used.
$\square$ The same logic performs both addition and subtraction.
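The negation mechanism (invert all bits, then add 1) can be sketched on bit strings. The function name is an assumption made for this example.

```python
def negate_twos_complement(bits: str) -> str:
    """Negate a 2's complement bit string: one's complement, then add 1."""
    n = len(bits)
    ones_complement = "".join("1" if b == "0" else "0" for b in bits)
    value = (int(ones_complement, 2) + 1) & ((1 << n) - 1)   # discard the carry out
    return format(value, f"0{n}b")


print(negate_twos_complement("011"))   # 3 -> -3
print(negate_twos_complement("101"))   # -3 -> 3
print(negate_twos_complement("100"))   # -4 has no positive counterpart in 3 bits
```

The last call shows the asymmetry of the range: negating the most negative value returns the same bit pattern.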

## Geometric Depiction of Twos Complement Integers


(a) 4-bit numbers

(b) n-bit numbers

## Sum in Two's complement

## Integer Format



Q3 Format


## Sum in Two's complement

## Different Formats



For $C=A+B$, where $A$ is in P.Q format and $B$ is in R.S format, the result $C$ is in $\max(P, R).\max(Q, S)$ format.
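A minimal sketch of adding two operands in different formats: the binary points are aligned by shifting the operand with fewer fractional bits, then the integer codes are added. The function name and the example values are assumptions; the sum can still overflow $\max(P, R)$ integer bits, so a real implementation would add a guard bit or saturate.

```python
def add_fixed(a_code: int, a_frac: int, b_code: int, b_frac: int):
    """Add two fixed-point codes; returns (sum_code, fractional_bits)."""
    frac = max(a_frac, b_frac)               # max(Q, S) fractional bits
    a_aligned = a_code << (frac - a_frac)    # align binary points
    b_aligned = b_code << (frac - b_frac)
    return a_aligned + b_aligned, frac


# A = 1.5 in 2.2 format (code 0110 = 6), B = -0.625 in 1.3 format (code 1011 = -5)
total, frac = add_fixed(6, 2, -5, 3)
print(total, frac, total / (1 << frac))
```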

## Multiplication in Two's complement

## Integer Format and Fractional Format Q3

|  | Integer format (4.0 x 4.0) | Value | Fractional format Q3 (1.3 x 1.3) | Value |
| :--- | :---: | :---: | :---: | :---: |
| Multiplicand | 1011. | $-5_{10}$ | 1.011 | $-0.625_{10}$ |
| Multiplier | 0110. | $6_{10}$ | 0.110 | $0.75_{10}$ |
| Partial product (bit 0) | 00000000 |  | 00.000000 |  |
| Partial product (bit 1, sign-extended) | 11110110 |  | 11.110110 |  |
| Partial product (bit 2, sign-extended) | 11101100 |  | 11.101100 |  |
| Partial product (bit 3) | 00000000 |  | 00.000000 |  |
| Product | 11100010. | $-30_{10}$ | 11.100010 | $-0.4688_{10}$ |

For $C=A \times B$, where $A$ and $B$ are $B$ bits wide, $C$ is $2B$ bits wide.

## Multiplication in Two's complement

## Different formats

|  | Example 1 | Value | Example 2 | Value |
| :--- | :---: | :---: | :---: | :---: |
| $A$ | 101.1 (3.1 format) | $-2.5_{10}$ | 1.011 (1.3 format) | $-0.625_{10}$ |
| $B$ | 01.10 (2.2 format) | $1.5_{10}$ | 01.10 (2.2 format) | $1.5_{10}$ |
| $C=A \times B$ | 11100.010 (5.3 format) | $-3.75_{10}$ | 111.00010 (3.5 format) | $-0.9375_{10}$ |

For $C=A \times B$, where $A$ is in P.Q format and $B$ is in R.S format, the result $C$ is in $(P+R).(Q+S)$ format.
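The format rule for products can be checked with a short sketch that multiplies the integer codes and tracks the format separately. The function name and the tuple representation of a format are assumptions made for this example.

```python
def multiply_fixed(a_code: int, a_fmt: tuple, b_code: int, b_fmt: tuple):
    """Multiply two fixed-point codes; formats are (integer_bits, fractional_bits)."""
    (p, q), (r, s) = a_fmt, b_fmt
    return a_code * b_code, (p + r, q + s)


# -2.5 in 3.1 format is code 1011 = -5; 1.5 in 2.2 format is code 0110 = 6
code, fmt = multiply_fixed(-5, (3, 1), 6, (2, 2))
print(code, fmt, code / (1 << fmt[1]))

# -0.625 in 1.3 format is also code 1011 = -5
code, fmt = multiply_fixed(-5, (1, 3), 6, (2, 2))
print(code, fmt, code / (1 << fmt[1]))
```

Both examples produce the same integer code; only the position of the binary point in the result differs.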

## Multiplication: Why is the MSB Redundant?

$\square$ A number represented by 4 bits ($N=4$) in 2's complement ranges from $-8$ to $+7$. The min/max results obtained from multiplication are $-56$ and $+64$, so 7 bits are enough to represent the result, except for the single case $(-8) \times(-8)=+64$.
$\square$ $N \times N \Rightarrow 2N-1$ bits
$\square$ The additional MSB is a "sign extension bit" and can be removed
$\square$ Another way to interpret it: in sign-magnitude form the product has $(N-1)+(N-1)$ magnitude bits plus 1 sign bit, giving $2N-1$ bits.

## Q format Multiplication

- The product of two Q15 numbers is in Q30 format.
- So we must remember that the 32-bit product has two bits in front of the binary point.
- An $N \times N$ multiplication yields a $(2N-1)$-bit result; the additional MSB is a sign extension bit.
- Typically, only the most significant 15 bits (plus the sign bit) are stored back into memory, so the write operation requires a left shift by one.
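A minimal sketch of the Q15 multiply described above, using Python integers to stand in for the 32-bit product register. The function name, the use of truncation when keeping the high word, and the final saturation of the $(-1) \times(-1)$ case are assumptions; actual DSPs differ in how they round and saturate.

```python
def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 codes and return a Q15 code."""
    product = a * b             # Q30 result with two sign bits
    shifted = product << 1      # left shift by one removes the redundant sign bit (Q31)
    result = shifted >> 16      # keep the most significant 16 bits (truncation)
    return max(-32768, min(32767, result))   # saturate the -1 x -1 corner case


half = 1 << 14                  # 0.5 in Q15
print(q15_mul(half, half))      # 0.5 x 0.5
print(q15_mul(half, -half))     # 0.5 x -0.5
print(q15_mul(-32768, -32768))  # -1 x -1 does not fit in Q15
```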



## Dynamic Range, Precision and Quantization errors



## ADC errors and solutions



## Analog Signal and Quantization

$\square$ The codec and the system's coefficients are the main generators of quantization noise.
$\square$ The codec's noise can be modeled as a random variable with a uniform PDF between $-\mathrm{LSB}/2$ and $+\mathrm{LSB}/2$.
$\square$ The SNR of an ADC increases with the word length and the loading factor.
Oversampling and Dithering


$$
\Delta=\frac{2 V p}{2^{B}} \quad \text{(quantization step)}
$$

$$
m_{e}=0, \qquad \sigma_{e}^{2}=\frac{\Delta^{2}}{12} \quad \text{(mean and variance of the quantization error)}
$$

## Analog Signal and Quantization

$S N R_{A / D}=10 \cdot \log _{10}\left(\frac{\text { input signal variance }}{\text { A/D quantization noise variance }}\right)$
$S N R_{A / D}=10 \cdot \log _{10}\left(\frac{\sigma_{\text {signal }}^{2}}{\sigma_{\text {A/D noise }}^{2}}\right)$
$\sigma_{A / D \text { noise }}^{2}=\frac{\Delta^{2}}{12}=\frac{V p^{2}}{3 \cdot 2^{2 b}} \quad$ where $\quad \Delta=\frac{2 V p}{2^{b}}$
$\sigma_{\text {signal }}^{2}=r m s^{2}=\left(\frac{V p}{\sqrt{2}}\right)^{2}$

$S N R_{A / D}=10 \cdot \log _{10}\left(\frac{V p^{2} / 2}{V p^{2} /\left(3 \cdot 2^{2 b}\right)}\right)=10 \cdot \log _{10}\left(1.5 \cdot 2^{2 b}\right)$
$S N R_{A / D}=1.76 d B+6.02 d B \cdot b$
$b=16$ bits $\Rightarrow S N R_{A / D}=98.08 d B$

Loading Factor : $L F=\frac{r m s}{V p}=\frac{\sigma_{\text {signal }}}{V p} ; \quad \sigma_{\text {signal }}^{2}=L F^{2} V p^{2} \Rightarrow S N R_{A / D}=4.77 d B+6.02 d B \cdot b+20 \cdot \log _{10}(L F)$
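The SNR expressions above can be evaluated numerically. This sketch computes the exact ratio $3 \cdot LF^{2} \cdot 2^{2b}$ rather than the rounded constants, so results may differ from the slide values in the second decimal; the function name is an assumption.

```python
import math


def adc_snr_db(bits: int, loading_factor: float) -> float:
    """Ideal A/D SNR in dB: 10*log10(3 * LF^2 * 2^(2b))."""
    return 10.0 * math.log10(3.0 * loading_factor ** 2 * 2.0 ** (2 * bits))


SINE_LF = 1.0 / math.sqrt(2.0)   # rms / Vp of a full-scale sine wave

print(round(adc_snr_db(16, SINE_LF), 2))   # full-scale sine, 16 bits
print(round(adc_snr_db(16, 0.25), 2))      # signal backed off to LF = 0.25
```

Each halving of the loading factor costs about 6 dB, the same as losing one bit.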

## Oversampling Method



The number of bits used for the lowpass filter's coefficients and registers must exceed the original number of ADC bits in order to benefit from the oversampling scheme.

$$
\text{Processing gain: } PG=10 \log_{10}\left(\frac{f_{s}}{2 BW}\right) ; \quad SNR_{A/D}=1.76\ \mathrm{dB}+6.02\ \mathrm{dB} \cdot b+10 \log_{10}\left(\frac{f_{s}}{2 BW}\right)
$$
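As a quick numeric check of the processing gain formula, the sketch below converts the gain into equivalent extra bits using the 6.02 dB per bit rule. The function names and the example frequencies are assumptions.

```python
import math


def processing_gain_db(fs: float, bandwidth: float) -> float:
    """Processing gain of oversampling: 10*log10(fs / (2*BW))."""
    return 10.0 * math.log10(fs / (2.0 * bandwidth))


def extra_bits(fs: float, bandwidth: float) -> float:
    """Equivalent resolution gain, at 6.02 dB per bit."""
    return processing_gain_db(fs, bandwidth) / 6.02


# Signal bandwidth 1 kHz sampled at 8 kHz: oversampling ratio 4 over Nyquist
print(processing_gain_db(8000.0, 1000.0), extra_bits(8000.0, 1000.0))
```

Under this model, every factor of 4 in oversampling ratio buys roughly one additional bit.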

## Oversampling Method

## Normal Averaging



It's ideally used in cases where the required output sample rate is low compared to the sampling rate of the ADC

## Rolling Average



It's ideally suited for applications requiring oversampling and higher sample rates
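The two averaging schemes can be contrasted with a short sketch. Since the original figures are not available, the interpretation here is an assumption: normal averaging outputs one value per block of `m` samples (reducing the output rate by `m`), while the rolling average outputs one value per input sample once the window is full.

```python
def normal_average(samples: list, m: int) -> list:
    """Average non-overlapping blocks of m samples (output rate = input rate / m)."""
    return [sum(samples[i:i + m]) / m
            for i in range(0, len(samples) - m + 1, m)]


def rolling_average(samples: list, m: int) -> list:
    """Average a sliding window of m samples (output rate = input rate)."""
    return [sum(samples[i:i + m]) / m
            for i in range(len(samples) - m + 1)]


data = [1, 2, 3, 4, 5, 6, 7, 8]
print(normal_average(data, 4))
print(rolling_average(data, 4))
```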

## Dithering Method



## Analog Signal and Quantization

$\square$ $SNR_{A/D} \geq SNR_{\text{signal}}$
$\square$ In practice, $SNR_{A/D}=SNR_{A/D\ \text{ideal}}-3$ to $6\ \mathrm{dB}$, because of:

- Aperture jitter error
- Missing output bit patterns
- Other nonlinearities

$\square$ It's imprudent to force an A/D converter's input to full scale. Use LF to determine the A/D's SNR.
$\square$ Effective Number of Bits (ENOB)

$$
b_{\text {eff }}=\frac{S N R-1.76}{6.02}
$$

$\square$ $SNR_{DSP} \geq SNR_{A/D}$

## Overflow errors and solutions



## Avoiding overflow

| b39-b32 | b31-b16 | b15-b0 |
| :---: | :---: | :---: |
| G | H | L |

$\square$ Always use the maximum capability (guard bits) of the accumulators during internal calculations.
$\square$ If possible, round (or truncate) only the final results to the final data size and format.
$\square$ There is (almost) no loss of precision when handling internal calculations with guard bits.

## Avoiding overflow




$\square$ Scaling down a signal is the most effective technique to prevent overflow.

- Scaling down always implies loss of precision.

$\square$ Both scaling down and guard bits must be used in order to avoid overflow.

- It is always more convenient to scale down the system's coefficients instead of the signals.


## Avoiding overflow

Effect of $\beta$ on SNR

$$
\mathrm{SNR}=10 \log_{10}\left(\frac{\beta^{2} \sigma_{x}^{2}}{\sigma_{e}^{2}}\right)=4.77+6.02 B+10 \log_{10} \sigma_{x}^{2}+20 \log_{10} \beta
$$

For example, adopting $\beta=0.5$ implies a 6.02 dB decrease in SNR. This is equivalent to dividing by 2, shifting right by one bit, or losing 1 bit of resolution.

$\square$ Scaling down always reduces SNR.
$\square$ It is possible to use an absolutely safe or a more relaxed criterion to choose the value of $\beta$.
$\square$ It is often preferable to use different Q fractional formats within an algorithm.
$\square$ As overflow is very likely to happen in fixed-point processors, special care should be taken when coding and debugging algorithms.


## Avoiding overflow



Bounds on the scale factor $G$:

$$
G<\frac{1}{x_{\max } \sum_{k=0}^{N-1}\left|h_{k}\right|} \quad (l_{1} \text{ norm})
$$

$$
G<\frac{1}{x_{\max }\left(\sum_{k=0}^{N-1} h_{k}^{2}\right)^{1 / 2}} \quad (l_{2} \text{ norm})
$$

$$
G<\frac{1}{x_{\max } \max_{k}\left|H\left(\omega_{k}\right)\right|} \quad (\text{Chebyshev norm})
$$

Example:

$$
H(z)=\frac{1+0.72 z^{-1}+z^{-2}}{1+0.052 z^{-1}+0.8 z^{-2}}
$$

$$
\mathrm{l1\_norm}=4.8839, \qquad \mathrm{l2\_norm}=1.5263, \qquad \mathrm{cheb\_norm}=3.4926
$$

## Minimizing overflow effects

Without saturation arithmetic


With saturation arithmetic

$\square$ Always use saturating arithmetic.
$\square$ If overflow does occur, saturation decreases the probability that an overflow oscillation occurs.
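The difference between the two behaviors can be sketched for a 16-bit addition. The function names are assumptions; on a real DSP, saturation is usually a hardware mode rather than explicit code.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def add_wrap(a: int, b: int) -> int:
    """16-bit addition without saturation: the result wraps around."""
    s = (a + b) & 0xFFFF
    return s - 0x10000 if s & 0x8000 else s


def add_sat(a: int, b: int) -> int:
    """16-bit addition with saturation: the result is clipped to the range limits."""
    return max(INT16_MIN, min(INT16_MAX, a + b))


print(add_wrap(30000, 10000))   # large positive sum turns negative
print(add_sat(30000, 10000))    # clipped to the largest positive value
```

Wraparound produces an error of nearly full scale with the wrong sign, while saturation keeps the error as small as the word length allows.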

## Truncation and Rounding



## Truncation/Rounding in Multiply-Accumulate




Using ceil.m



## Truncation and Rounding

A limit cycle, sometimes referred to as a multiplier roundoff limit cycle, is a low-level oscillation that can exist in an otherwise stable filter as a result of the nonlinearity associated with rounding (or truncating) internal filter calculations.

Limit cycles require recursion to exist and do not occur in nonrecursive FIR filters.
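A minimal sketch of a zero-input limit cycle, using a hypothetical first-order recursive filter $y(n)=Q[0.9 \cdot y(n-1)]$ that is not taken from the slides. The rounding $Q[\cdot]$ is round-half-up, done in integer arithmetic to avoid floating-point ambiguity.

```python
def zero_input_response(y0: int, steps: int) -> list:
    """Response of y(n) = round(0.9 * y(n-1)) with zero input, starting from y0."""
    y, out = y0, []
    for _ in range(steps):
        y = (9 * y + 5) // 10    # round-half-up of 0.9 * y, integers only
        out.append(y)
    return out


print(zero_input_response(10, 10))
```

With infinite precision the output would decay toward zero as $10 \cdot 0.9^{n}$; with rounding it gets stuck at a nonzero constant, which is the simplest form of limit cycle.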



## Coefficient Quantization Error



## Quantization word-length effects

Two-pole band-pass filter with complex-conjugate poles

$$
\begin{aligned}
\lambda &= r e^{\pm j \theta} \\
&= \lambda_{r} \pm j \lambda_{i} \\
&= r \cos (\theta) \pm j r \sin (\theta)
\end{aligned}
\qquad
H(z)=\frac{1}{1-2 r \cos (\theta) z^{-1}+r^{2} z^{-2}}
$$

And its difference equation

$$
y(n)=2 r \cos (\theta) y(n-1)-r^{2} y(n-2)+x(n)
$$


$\square$ When a system is defined in terms of its coefficients, their finite precision affects the behavior of the system itself.
$\square$ Thus there is a grid of possible locations where the system's poles can be placed.
$\square$ This grid depends first on the word length and second on the structure adopted to implement the system.

## Quantization word-length effects


$\square$ Some structures are less sensitive to coefficient quantization.
$\square$ There is a trade-off between efficiency and sensitivity to coefficient quantization.

## Finite Wordlength Effects (I)

$\square$ Discretization (quantization) of the filter coefficients has the effect of perturbing the location of the filter poles and zeroes. This deterministic frequency response error is referred to as coefficient quantization error.
$\square$ The use of finite precision arithmetic makes it necessary to quantize filter calculations by rounding or truncation. Roundoff noise is that error in the filter output that results from rounding or truncating calculations within the filter.

## Finite Wordlength Effects (II)

$\square$ Quantization of the filter calculations also renders the filter slightly nonlinear. For large signals this nonlinearity is negligible and roundoff noise is the main concern. However, for recursive filters with a zero or constant input, this nonlinearity can cause spurious oscillations called limit cycles.
$\square$ With fixed-point arithmetic it is possible for filter calculations to overflow. The term overflow oscillation refers to a high-level oscillation that can exist in an otherwise stable filter due to the nonlinearity associated with the overflow of internal filter calculations.

[^0]
## Floating point representation

$\square$ This form of representation overcomes the precision and dynamic range limitations of fixed point.
$\square$ This format segments data into sign, exponent and mantissa.
$\square$ The mantissa is represented as a fixed-point number.
$\square$ The exponent is represented in binary offset format.
$\square$ The greater $b_{e}$ (exponent bits), the larger the dynamic range.
$\square$ The greater $b_{m}$ (mantissa bits), the higher the precision.
$\square$ There is a trade-off between $b_{m}$ and $b_{e}$, and the best balance occurs at $b_{e} \approx b / 4$ and $b_{m} \approx 3 b / 4$.
$\square$ $DR \approx 6.02 \cdot 2^{b_{e}}\ \mathrm{dB}$

## Floating point representation (I)

IEEE Standard P754 Format

$$
\text{value}_{\text{ieee}}=(-1)^{s} \cdot 1.f \cdot 2^{e-127}
$$

$\square$ IEEE P754 is the most widely used floating-point format.
$\square$ As the point is floating, a process called normalization is performed in order to use the full precision of the $b_{m}$ bits, while the exponent is adjusted accordingly.

- Floating-point arithmetic usually requires many logical comparisons and branches, so software-emulated floating point achieves low performance.
- Floating-point DSPs implement all arithmetic handling in hardware, so these DSPs outperform their fixed-point counterparts in ease of use and performance (while also being more expensive).


## Floating point representation (II)

|  | Single Precision (32 bits) |  |  |  | Double Precision (64 bits) |  |  |  |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | Sign | Biased exponent | Fraction | Value | Sign | Biased exponent | Fraction | Value |
| Positive zero | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Negative zero | 1 | 0 | 0 | -0 | 1 | 0 | 0 | -0 |
| Plus infinity | 0 | 255 (all 1s) | 0 | $\infty$ | 0 | 2047 (all 1s) | 0 | $\infty$ |
| Minus infinity | 1 | 255 (all 1s) | 0 | $-\infty$ | 1 | 2047 (all 1s) | 0 | $-\infty$ |
| NaN | 0 or 1 | 255 (all 1s) | $\neq 0$ | NaN | 0 or 1 | 2047 (all 1s) | $\neq 0$ | NaN |
| Positive normalized | 0 | $0<e<255$ | f | $2^{e-127}(1.f)$ | 0 | $0<e<2047$ | f | $2^{e-1023}(1.f)$ |
| Negative normalized | 1 | $0<e<255$ | f | $-2^{e-127}(1.f)$ | 1 | $0<e<2047$ | f | $-2^{e-1023}(1.f)$ |
| Positive denormalized | 0 | 0 | $f \neq 0$ | $2^{-126}(0.f)$ | 0 | 0 | $f \neq 0$ | $2^{-1022}(0.f)$ |
| Negative denormalized | 1 | 0 | $f \neq 0$ | $-2^{-126}(0.f)$ | 1 | 0 | $f \neq 0$ | $-2^{-1022}(0.f)$ |
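The single-precision columns of the table can be explored with a short sketch built on Python's standard `struct` module. The function names are assumptions made for this example.

```python
import math
import struct


def decode_float32(x: float) -> tuple:
    """Split a number, stored as IEEE 754 single precision, into its three fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction


def value_from_fields(sign: int, exponent: int, fraction: int) -> float:
    """Rebuild the value from the fields, following the table row by row."""
    s = -1.0 if sign else 1.0
    if exponent == 255:                       # infinity or NaN
        return s * math.inf if fraction == 0 else math.nan
    if exponent == 0:                         # zero or denormalized: 0.f * 2^-126
        return s * (fraction / 2 ** 23) * 2.0 ** -126
    return s * (1.0 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)   # 1.f * 2^(e-127)


print(decode_float32(1.0))
print(decode_float32(-0.75))
print(value_from_fields(0, 1, 0))   # smallest positive normalized number
print(value_from_fields(0, 0, 1))   # smallest positive denormalized number
```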

## Normalized and Denormalized Numbers (32-bit format)

Normalized numbers: $1.f \times 2^{e-127}$

Min. positive normalized: $1 \times 2^{1-127}=2^{-126} \approx 1.175494350822287507968 \times 10^{-38}$

Min. positive denormalized (bit pattern 00000000000000000000000000000001): $\left(1 \times 2^{-23}\right) \times 2^{-126}=2^{-149} \approx 1.4012984643248170709 \times 10^{-45}$

## Multiply



## Division




## Fixed-Point Hardware or Floating-Point Hardware?

$\square$ There are both benefits and trade-offs to using fixed-point hardware (Fix-PHw) rather than floating-point hardware (FL-PHw). Many applications require low-power and cost-effective circuitry, which makes Fix-PHw a natural choice.
$\square$ Fix-PHw tends to be simpler and smaller. As a result, these units require less power and cost less to produce than floating-point circuitry.
$\square$ FL-PHw is usually larger because it provides more functionality and ease of development. FL-PHw can accurately represent real-world numbers, and its large dynamic range reduces the risk of overflow, quantization errors, and the need for scaling. In contrast, the smaller dynamic range of Fix-PHw, which is what allows for low-power, inexpensive units, brings the possibility of these problems.
$\square$ Therefore, fixed-point development must minimize the negative effects of these factors while exploiting the benefits of Fix-PHw: cost- and size-effective units, lower power and memory usage, and fast real-time processing.

## Recommended bibliography

$\square$ RG Lyons. Understanding Digital Signal Processing, 2nd ed. Prentice Hall, 2004.

- Ch 12: Digital Data Formats and Their Effects.

$\square$ SW Smith. The Scientist and Engineer's Guide to DSP. California Technical Publishing, 1997.

- Ch 4: DSP Software.

$\square$ VK Madisetti, DB Williams. Digital Signal Processing Handbook. CRC Press.

- Ch 3: Finite Wordlength Effects. Bruce W. Bomar.

$\square$ SM Kuo, BH Lee. Real-Time Digital Signal Processing, 2nd ed. John Wiley and Sons, 2006.

- Ch 3.4 to 3.6: DSP Fundamentals and Implementation Considerations.

$\square$ WS Gan, SM Kuo. Embedded Signal Processing with the MSA. John Wiley and Sons, 2007.

- Ch 6: Real-Time DSP Fundamentals and Implementation Considerations.


Thank you!


[^0]:    * Bruce W. Bomar - University of Tennessee Space Institute

