



## **ELEC 305**

# **Digital System Design Lab**

Fall 2024

### Lecture 5:

Number Formats and Arithmetics

## **Recap of Part 1**



- Up to now we've seen how to...
  - <u>describe</u> a circuit using ("DUT") VHDL based on a given set of specifications, use Vivado to automatically <u>synthesize</u> the DUT and <u>implement</u> it on the FPGA
  - <u>simulate</u> DUT <u>behavior</u> in VHDL, characterize its accuracy, <u>constrain</u> and <u>analyze</u> its <u>timing</u> characteristics, and consequently <u>optimize</u> it if it fails
- This pretty much sums up the tasks of a digital designer, but there's a catch, a recurring theme → Vivado keeps getting better at doing most of these automatically. We therefore need to climb higher up on the abstraction ladder.
- The designer should also be proficient in building / testing / optimizing on the <u>algorithms</u> level to advise Vivado properly.
- That's what we'll be covering in Part 2.

| 1.       | library IEEE;                                                |
|----------|--------------------------------------------------------------|
| 2.       | use IEEE.STD LOGIC 1164.ALL;                                 |
| 3.       |                                                              |
| 3.<br>4. |                                                              |
|          | entity coffeemaker is                                        |
| 5.       | <pre>Port ( clk : in STD_LOGIC;</pre>                        |
| 6.       | <pre>led : out STD_LOGIC;</pre>                              |
| 7.       | sw : in STD_LOGIC                                            |
| 8.       | );                                                           |
| 9.       | end coffeemaker;                                             |
| L0.      |                                                              |
| 11.      | architecture Behavioral of coffeemaker is                    |
| 12.      | <pre>signal pulse : std logic := '0';</pre>                  |
| L3.      | <pre>signal count : integer range 0 to 199999999 := 0;</pre> |
| 14.      | begin                                                        |
| L5.      | <pre>process(clk, sw)</pre>                                  |
| L6.      | begin                                                        |
| L7.      |                                                              |
| L8.      | end process;                                                 |
| 19.      | Failed Timing! -8.808 -8.8                                   |
| 20.      | <pre>led &lt;= pulse;</pre>                                  |
| 21.      | end Behavioral;                                              |
|          | cha Denaviolat,                                              |





- The family of circuits we've dealt with so far are typically called "boolean circuits": Inputs and outputs were binary, and even when we used vectors we considered each bit separately
- The only arithmetic component we saw was the simple counter, which incremented (or decremented, or reset) its unsigned integer "count value" one by one
- While such circuits are indeed a large application area for reconfigurable logic ICs, FPGAs are nowadays hailed for their abilities of parallel processing of custom arithmetics, mostly due to the AI hype and its endless hunger for more powerful compute
- In this part of the course, we'll start by defining arithmetics for FPGAs and discuss migrating the computational graphs we define on paper to the digital domain
- Then, we will dive into building useful applications with those



- Quantization
- Fixed point (FxP) vs. floating point
- FxP number formats and arithmetic operations



- Fixed point (FxP) vs. floating point
- FxP number formats and arithmetic operations

#### © 2024 Burak Son

- Remember this analog to digital conversion slide from lecture 2? →→→
- digitization = sampling + quantization

Quantization

- We mentioned that we would cover quantization in detail (for sampling, check, e.g., ELEC 303). We're at that point now.
- Basic definition of quantization: To get deterministic and repeatable outputs despite noise, we map sets of "real" values to discrete intervals. Larger intervals mean we need less resources but it also means more error w.r.t. to the "real" value, so we try to engineer these discrete intervals to minimize both resource usage and quantization error



img src: https://web.mit.edu/6.111/volume2/www/f2019/handouts/L01.pdf



#### © 2024 Burak Sonei

#### To migrate algorithms to the digital domain, we apply quantization to every number on the computational graph of that algorithm.

- Depending on the quantization scheme we choose, we will get different amounts of error on each number (compared to the "real value"), which will affect the accuracy of the algorithm
- The algorithms designer in a digital design team needs to take this into account while developing the computational graph (typically via tweaking some parameters)
- The error is non-deterministic (you never know what the real value is), so you design "around it" (e.g., if the max quantization error is too high for your specs, you increase the bitwidth etc.)



#### img src: https://github.com/dicearr/neuron-vhdl/tree/master



• For n = p = 1 (single unit, single parameter), the equation becomes:

 $\boldsymbol{y}_1 = \boldsymbol{\beta}_0 + \boldsymbol{x}_1^* \boldsymbol{\beta}_1$ 

the computational graph for this algorithm looks like this:

- $\begin{array}{c} x_1 \\ & \vdots \\ & \beta_1 \\ & \beta_0 \end{array}$
- Basically there are 6 numbers here : x<sub>1</sub>, y<sub>1</sub>, t<sub>1</sub>, t<sub>2</sub>, β<sub>0</sub>, β<sub>1</sub>
- They have "real values" (in ℝ<sup>6</sup>) which we need to quantize





- Recall how we tackled this in lecture 2 with a single number, what we did was just one way of doing this. We'll now see alternative methods and their pros/cons and do this on a computational graph level (to all 6 numbers, considering their application requirements).
  - Quantization → Let's chop this 0.3V-2.7V voltage range up to 32 pieces and represent it with a uniform fixed-point number representation (5-bits)



 This means we'll have the following values to work with (other values will be rounded to these somehow): {0.300, 0.375, 0.450, ..., 2.550, 2.625}

 $\rightarrow$  Note how we don't have 2.7 anymore, that would require a 33rd value

 We can now treat this set of 32 values as a 5-bit digital value set (i.e., {b00000, b00001, b00010, ..., b11110, b11111}) and design around it



- Fixed point (FxP) vs. floating point
- FxP number formats and arithmetic operations



- Fixed point (FxP) vs. floating point
- FxP number formats and arithmetic operations

- We make the following choices when quantizing our numbers:
  - uniform vs. non-uniform: regarding the distribution of the sizes of discretization intervals
  - scalar vs. vector: regarding whether numbers are quantized with a scheme individually, or are they grouped into vectors and quantized relatively.
     We will mostly cover scalar methods.
  - how is the "rounding" done: when we quantize the analog reading, which way do we snap the value? (ceil, floor, towards zero, nearest etc.)



img src: https://medium.com/@florian\_algo/model-quantization-2-uniform-and-non-uniform-g uantization-47ca5b5d3ec0







- The uniform vs. non-uniform choice is typically made looking at the dynamic range (max-min) and precision (step size) requirements of the quantities that you're interested in.
- If you need a relatively small range of numbers with the same precision requirements all throughout the range, you should go with uniform quantization, specifically with fixed point.
- If you need to cover a very large range, and you need high precision in the lower ranges but are OK with lower precision in the higher ranges, you can go with non-uniform quantization, specifically with floating point
- However don't let this difference trick you, you're still getting approximations of real numbers in both cases (in fractional form). Only difference is in how well that approximation is and "where" it is (on the number line)



- Fixed point (FxP) number representations (uniform and scalar) are called so because they "affix the decimal point" (i.e., the fractional part of the number). They are defined by three parameters: 1) a scale factor, 2) a zero point mapping (bias), and 3) a rounding method
- You might have run into these in an image processing task with the name "normalization", where you
  map a range of "real" numbers (actually they were floats), say x, in the range {0,1} to uint8 (8-bit, 256
  different values) numbers in {0,255} to run imshow() on the result.
- Here, scale factor is 255, the zero point is  $0.5 \rightarrow 127.5$ , and arbitrary rounding can be chosen.
- The mapping is thus from {0, 0.00390625, 0.0078125, ..., 0,99609375} to {0, 1, 2, ..., 255}. Mark how this mapping does not have exact representations of the real values 1 (out of range) or 0.5 (since 127.5 cannot exist in the uint8 space, the closest are 127 and 128).



http://www.cburch.com/books/float/





- To better picture this, we can think of FxP having an "internal representation" which the designer knows of, and a "represented value" which the circuit will operate on.
- The {0, 0.00390625, 0.0078125, ..., 0,99609375} range is the internal representation, and the {0, 1, 2, ..., 255} range is the values represented by them.
- One thing to note here before moving on is that every interval within those ranges have the same value (i.e., the number of "ticks" are the same for any given two intervals of the same size within the range). This is why FxP is **uniform quantization**.
- Furthermore, every value is quantized individually. This is why FxP is scalar quantization.
- If vector quantization was used, the scale factor and zero points would be determined based on groups within ranges (e.g., {0, 0.2} would have one factor, {0.2, 1.0} would have another)



- Floating point is non-uniform (but still scalar quantization). The number of ticks in two intervals of the same size are different when they are at different points
- The closer you are to 0, the more resolution you get, and the farther away, the less.
- However, thanks to this we can represent very large numbers that would require many many many more bits to represent with FxP!
- A good read: <u>"What Every Computer Scientist</u> <u>Should Know About Floating-Point Arithmetic"</u>



img src: https://blogs.mathworks.com/cleve/2014/07/07/floating-point-numbers/

- Which one should we use?
- Like we said earlier: This depends on the application requirements, but the trade-off looks like this →→
- There is one additional consideration for us since we're designing hardware: If we are going to use floating point numbers, we need to have the circuitry that can do arithmetic with them!
- The arithmetic operators we saw up to now (full adders etc.) worked on integers, they didn't consider decimal points and fractions.
- These components can be used with FxP, but floating point numbers need special circuitry

### Why use fixed-point?



 We will focus on FxP representations for the rest of this course



- Fixed point (FxP) vs. floating point
- FxP number formats and arithmetic operations



- Quantization
- Fixed point (FxP) vs. floating point
- FxP number formats and arithmetic operations

## **Fixed Point Number Formats and Arithmetic**



• Let's discuss this over Dr. Zhu's slides:





© 2024 Burak Soner



## $\textbf{next} \rightarrow \textbf{DSP} + \textbf{optimization}$

