Converting float32 to float16 and snorm/unorm 8/16: notes and implementation

1. Basics

Reference: "彻底搞懂float16与float32的计算方式" (Fully understanding how float16 and float32 are computed), CSDN blog

**Example 1:** float32 0x3fd00000 = 32'b0_0111_1111_101_0000_0000_0000_0000_0000

sign = 0

exp = 8'b0111_1111 = 'h7f = 'd127 => offset = 127 - 127 = 0

mantissa = 'b101_0000_0000_0000_0000_0000 (prepend the implicit 1: 1.1010000...)

(-1)^sign * 2^exp_offset * mantissa = 1.101b = 2^0 + 2^(-1) + 2^(-3) = 1.625d

**Example 2:** float32 0x3f200000 = 32'b0_0111_1110_010_0000_0000_0000_0000_0000

sign = 0

exp = 8'b0111_1110 = 'h7e = 'd126 => offset = 126 - 127 = -1

mantissa = 'b010_0000_0000_0000_0000_0000 (prepend the implicit 1: 1.0100000...)

(-1)^sign * 2^exp_offset * mantissa = 1.010b * 2^(-1) = 0.101b = 2^(-1) + 2^(-3) = 0.625d
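The two worked examples can be reproduced with a few lines of Python. This is a minimal sketch that handles normal numbers only; `decode_float32` is a helper name introduced here for illustration, and `struct` serves purely as a cross-check.

```python
import struct

def decode_float32(bits: int):
    """Split a 32-bit pattern into its fields and rebuild the value (normal numbers only)."""
    sign = (bits >> 31) & 0x1
    exp = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    exp_offset = exp - 127                  # remove the exponent bias
    significand = 1 + mantissa / 2**23      # prepend the implicit leading 1
    value = (-1) ** sign * 2.0 ** exp_offset * significand
    return sign, exp, mantissa, value

for pattern in (0x3FD00000, 0x3F200000):
    sign, exp, mantissa, value = decode_float32(pattern)
    # Cross-check against the interpretation performed by struct
    reference = struct.unpack(">f", pattern.to_bytes(4, "big"))[0]
    print(f"{pattern:#010x}: sign={sign} exp={exp} value={value} reference={reference}")
```

Both patterns should decode to the values derived by hand, 1.625 and 0.625.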

Special cases: an all-ones exp encodes NaN or Inf, and exp = 0 encodes a denormal or 0.
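The special cases depend only on the exponent and mantissa fields, so they can be told apart with two comparisons. The sketch below uses a hypothetical helper name, `classify_float32`, and returns a plain string for readability.

```python
def classify_float32(bits: int) -> str:
    """Classify a 32-bit pattern by its exponent and mantissa fields."""
    exp = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exp == 0xFF:                       # exponent all ones
        return "inf" if mantissa == 0 else "nan"
    if exp == 0:                          # exponent all zeros
        return "zero" if mantissa == 0 else "denorm"
    return "normal"

for pattern in (0x7F800000, 0x7FC00000, 0x00000000, 0x00000001, 0x3FD00000):
    print(f"{pattern:#010x} -> {classify_float32(pattern)}")
```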

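Reference: "DXGI FORMAT中的SNORM和 UNORM格式" (SNORM and UNORM formats in DXGI_FORMAT), CSDN blog

Based on the above, the formats to implement are:

snorm8 stores int(x * 127) for x in [-1.0, 1.0];

unorm8 stores int(x * 255) for x in [0, 1.0];

snorm16 stores int(x * (2^15 - 1)) for x in [-1.0, 1.0];

unorm16 stores int(x * (2^16 - 1)) for x in [0, 1.0].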
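As a floating-point reference for these four formats, the mapping can be sketched as below. The function names are illustrative. Clamping out-of-range inputs, mapping NaN to 0 and rounding half up on the magnitude are choices taken to match the implementation in section 2, not requirements stated by the formulas above.

```python
import math

def float_to_unorm(x: float, bits: int) -> int:
    """Map x in [0, 1] to an unsigned integer in [0, 2^bits - 1]."""
    scale = (1 << bits) - 1
    if math.isnan(x):
        return 0                          # NaN is treated as 0
    x = min(max(x, 0.0), 1.0)             # clamp, which also handles +/-Inf
    return int(x * scale + 0.5)           # round half up

def float_to_snorm(x: float, bits: int) -> int:
    """Map x in [-1, 1] to a signed integer in [-(2^(bits-1) - 1), 2^(bits-1) - 1]."""
    scale = (1 << (bits - 1)) - 1
    if math.isnan(x):
        return 0
    x = min(max(x, -1.0), 1.0)
    magnitude = int(abs(x) * scale + 0.5)  # round the magnitude, then apply the sign
    return -magnitude if x < 0 else magnitude

print(float_to_unorm(0.625, 8), float_to_snorm(-0.625, 8))
```

The signed result is returned as a Python integer; masking it with `(1 << bits) - 1` gives the two's complement bit pattern that hardware would store.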

2. Implementation (SystemVerilog)


// TRANSFORM_TYPE is assumed to be an enum declared elsewhere with the members
// UNORM8, SNORM8, FLOAT16, UNORM16 and SNORM16. The argument is named
// xform_type because "type" is a reserved word in SystemVerilog.
function bit [31:0] float32_to_newformat(bit [31:0] float32_in, TRANSFORM_TYPE xform_type);
    bit [31:0] rt_data_out;
    bit [7: 0] unorm8_out;
    bit [7: 0] snorm8_out;
    bit [15:0] float16_out;
    bit [15:0] unorm16_out;
    bit [15:0] snorm16_out;

    bit sign;
    bit [7: 0] exponent;
    bit [22:0] mantissa;
    bit [23:0] norm_mantissa;

    int exp_offset;
    bit [47:0] shifted_value;
    bit [7: 0] unorm8_value;
    bit [7: 0] snorm8_value;
    bit [15:0] unorm16_value;
    bit [15:0] snorm16_value;

    bit [4: 0] exp16;
    bit [9: 0] mant16;
    bit overflow, underflow, is_inf, is_nan;

    //note1
    sign     = float32_in[31];
    exponent = float32_in[30:23];
    mantissa = float32_in[22:0];
    norm_mantissa = {1'b1,mantissa};

    if (xform_type == UNORM8)begin
        if (exponent >=127)begin                        //note2.1
            if (exponent == 8'hff && mantissa != 0)     //note2.1.1
                unorm8_value = 8'h0;
            else                                        //note2.1.2
                unorm8_value = 8'hff;
        end else if (exponent >=118)begin               //note2.2
            shifted_value = norm_mantissa[23:0] * 255;
            if (shifted_value[23+127-exponent-1] == 1)  //note2.2.1
                unorm8_value = (shifted_value>>(23+127-exponent)) + 1;
            else
                unorm8_value = shifted_value>>(23+127-exponent);
        end else                                        //note2.3
            unorm8_value = 8'h0;
        unorm8_out = sign? 8'h0 : unorm8_value;         //note2.4
    end else if (xform_type == SNORM8)begin
        if (exponent >=127)begin
            if (exponent == 8'hff && mantissa != 0)
                snorm8_value = 8'h0;
            else
                snorm8_value = 8'h7f;
        end else if (exponent >=119)begin
            shifted_value = norm_mantissa[23:0] * 127;
            if (shifted_value[23+127-exponent-1] == 1)
                snorm8_value = (shifted_value>>(23+127-exponent)) + 1;
            else
                snorm8_value = shifted_value>>(23+127-exponent);
        end else
            snorm8_value = 8'h0;
        snorm8_out = sign? (-snorm8_value): snorm8_value;
    end else if (xform_type == FLOAT16)begin
        exp_offset = exponent - 127 + 15;
        overflow   = (exp_offset >= 31);
        underflow  = (exp_offset <= 0);
        is_inf     = (exponent == 8'hff && mantissa == 23'h0);
        is_nan     = (exponent == 8'hff && mantissa != 23'h0);
        //note3.1
        exp16      = (overflow && exponent != 8'hff) ? 5'h1e :
                                   (is_inf | is_nan) ? 5'h1f :
                                           underflow ? 5'h0  : exp_offset[4:0];
        //note3.2
        // On underflow the shift is applied unconditionally: an exact power of two
        // (mantissa == 0) in the float16 denormal range, such as 2^-15, must not be
        // flushed to 0, and float32 zero/denormal inputs shift out to 0 anyway.
        // The mantissa is truncated, not rounded.
        mant16     = (overflow && exponent != 8'hff) ? 10'h3ff :
                                             is_inf  ? 10'h0   :
                               is_nan  ? {1'b1,mantissa[21:13]}:
                   underflow ? (norm_mantissa[23:13] >> (1-exp_offset)) : mantissa[22:13];
        float16_out = {sign,exp16,mant16};

    end else if (xform_type == UNORM16)begin
        if (exponent >=127)begin
            if (exponent == 8'hff && mantissa != 0)
                unorm16_value = 16'h0;
            else
                unorm16_value = 16'hffff;
        end else if (exponent >=110)begin               // 2^-17 * 65535 is about 0.5, so exp 110 can still round to 1
            shifted_value = norm_mantissa[23:0] * 65535;
            if (shifted_value[23+127-exponent-1] == 1)
                unorm16_value = (shifted_value>>(23+127-exponent)) + 1;
            else
                unorm16_value = shifted_value>>(23+127-exponent);
        end else
            unorm16_value = 16'h0;
        unorm16_out = sign? 16'h0: unorm16_value;
    end else if (xform_type == SNORM16)begin
        if (exponent >=127)begin
            if (exponent == 8'hff && mantissa != 0)
                snorm16_value = 16'h0;
            else
                snorm16_value = 16'h7fff;
        end else if (exponent >=111)begin               // 2^-16 * 32767 is about 0.5, so exp 111 can still round to 1
            shifted_value = norm_mantissa[23:0] * 32767;
            if (shifted_value[23+127-exponent-1] == 1)
                snorm16_value = (shifted_value>>(23+127-exponent)) + 1;
            else
                snorm16_value = shifted_value>>(23+127-exponent);
        end else
            snorm16_value = 16'h0;
        snorm16_out = sign? (-snorm16_value): snorm16_value;
    end

    // The original listing ends without assigning the return value. Placing the
    // selected result in the low bits, zero-extended, is an assumption.
    case (xform_type)
        UNORM8:  rt_data_out = {24'h0, unorm8_out};
        SNORM8:  rt_data_out = {24'h0, snorm8_out};
        FLOAT16: rt_data_out = {16'h0, float16_out};
        UNORM16: rt_data_out = {16'h0, unorm16_out};
        SNORM16: rt_data_out = {16'h0, snorm16_out};
        default: rt_data_out = 32'h0;
    endcase
    return rt_data_out;

endfunction

3. Code notes

note1 Split the float32 into a 1-bit sign, an 8-bit exponent and a 23-bit mantissa, then prepend the implicit leading 1 that the encoding omits.

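note2.1 unorm/snorm represent values in 0~1.0 or -1.0~1.0. An exponent of 127 or more means the offset is at least 0, so the magnitude is at least 1.0 and the result can be set straight to the maximum.

note2.1.1 The exception is an all-ones exp with a nonzero mantissa, which is NaN (an invalid value); NaN is converted to 0.

note2.1.2 unorm has no sign, so its maximum is all ones; snorm reserves the top bit for the sign, so its maximum is a 0 followed by all ones.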

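note2.2 For unorm8 the value is multiplied by 255, roughly 2^8, so only inputs with exp >= 118 (magnitude of at least 2^-9) can round to a nonzero result; anything smaller converts to 0.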
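The cutoff in note2.2 can be checked numerically. The sketch below searches for the smallest biased exponent at which the largest possible significand, multiplied by the scale factor, reaches 0.5 and therefore rounds up to 1. `min_exponent_for_nonzero` is a name made up for this check, and the search assumes round-half-up as in note2.2.1.

```python
def min_exponent_for_nonzero(scale: int) -> int:
    """Smallest biased float32 exponent for which some mantissa rounds to at least 1."""
    max_significand = 2 - 2**-23          # 1.111...1b, the largest 24-bit significand
    for exp in range(1, 127):
        if max_significand * 2.0 ** (exp - 127) * scale >= 0.5:
            return exp
    return 127

for name, scale in (("unorm8", 255), ("snorm8", 127), ("unorm16", 65535), ("snorm16", 32767)):
    print(name, min_exponent_for_nonzero(scale))
```

Under these assumptions the 8-bit formats come out at 118 and 119, and the same reasoning gives 110 and 111 for the 16-bit formats.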

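note2.2.1 Rounding: the bit just below the shift point has weight 0.5, so when it is 1 the shifted result is incremented (round half up).

note2.3 See note2.2.

note2.4 For unorm, sign = 1 means a negative input, which is outside the unorm range, so the output is 0; for snorm the magnitude is negated instead.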
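Notes 2.1 to 2.4 can be mirrored in an integer-only Python model, which is convenient as a golden reference when checking the hardware function bit by bit. This is a sketch, not the author's testbench: the function names are illustrative, and the exponent cutoff of note2.2 is left out because Python integers have unlimited width, so small inputs simply shift out to 0.

```python
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of x after rounding to float32."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def float32_bits_to_norm(bits: int, width: int, signed: bool) -> int:
    """Integer-only float32 -> unorm/snorm conversion; returns the stored bit pattern."""
    sign = (bits >> 31) & 1
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    norm_mantissa = (1 << 23) | mantissa               # note1: prepend the implicit 1
    scale = (1 << (width - 1)) - 1 if signed else (1 << width) - 1

    if exponent >= 127:                                # note2.1: magnitude >= 1.0
        is_nan = exponent == 0xFF and mantissa != 0
        value = 0 if is_nan else scale                 # note2.1.1 / note2.1.2
    else:
        shift = 23 + 127 - exponent                    # note2.2: always >= 24 here
        product = norm_mantissa * scale
        round_bit = (product >> (shift - 1)) & 1       # note2.2.1: the 0.5 bit
        value = (product >> shift) + round_bit

    if signed:                                         # note2.4
        return (-value if sign else value) & ((1 << width) - 1)
    return 0 if sign else value

print(hex(float32_bits_to_norm(f32_bits(0.625), 8, signed=False)))
print(hex(float32_bits_to_norm(f32_bits(-0.625), 16, signed=True)))
```

For signed formats the return value is the two's complement pattern, so -127 in snorm8 appears as 0x81.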

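note3.1 Special cases when converting exp16:

on overflow with neither is_inf nor is_nan set, use the largest finite exponent (5'h1f is reserved for Inf/NaN, so this is 5'h1e);

when is_inf or is_nan is set, exp16 is all ones;

on underflow it is 0;

otherwise it is the computed exp_offset.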

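note3.2 Special cases when converting mant16:

on overflow with neither is_inf nor is_nan set, use the largest mantissa (10'h3ff), which together with exp16 = 5'h1e gives the largest finite float16;

for is_inf, mant16 must be 0 by definition;

for is_nan, taking the top 10 bits of mant32 would normally be enough, but those bits may all be 0 (which would turn the NaN into Inf), so the top bit is forced to 1 (the value is invalid anyway, so any bit could be forced);

on underflow the result is a float16 denormal or 0: the mantissa with its implicit 1 restored is shifted right by (1 - exp_offset) and the bits that fall off are dropped. Testing mant32 == 0 alone does not identify a zero input, because exact powers of two such as 2^-15 also have mant32 == 0 yet are nonzero float16 denormals;

in the normal case, take the top 10 bits of mant32.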
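The rules in note3.1 and note3.2 translate into the following Python model, which can be compared against the half-precision format (`"e"`) in the standard `struct` module. It is a sketch under stated assumptions: the function name is made up, finite overflow clamps to the largest finite float16 as described above, the mantissa is truncated, not rounded, and the underflow branch shifts every input without a separate zero test.

```python
import struct

def float32_bits_to_float16_bits(bits: int) -> int:
    """Convert a float32 bit pattern to a float16 bit pattern (truncating)."""
    sign = (bits >> 31) & 1
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    norm_mantissa = (1 << 23) | mantissa
    exp_offset = exponent - 127 + 15                   # re-bias from 127 to 15

    if exponent == 0xFF:                               # Inf or NaN
        exp16 = 0x1F
        # NaN: force the top mantissa bit so the result cannot collapse into Inf
        mant16 = 0 if mantissa == 0 else 0x200 | ((mantissa >> 13) & 0x1FF)
    elif exp_offset >= 31:                             # finite overflow: clamp
        exp16, mant16 = 0x1E, 0x3FF
    elif exp_offset <= 0:                              # underflow: denormal or 0
        exp16 = 0
        mant16 = (norm_mantissa >> 13) >> (1 - exp_offset)
    else:                                              # normal case
        exp16 = exp_offset
        mant16 = mantissa >> 13
    return (sign << 15) | (exp16 << 10) | mant16

# Values exactly representable in float16 should agree with struct's "e" format
for x in (1.625, 0.625, 2.0**-15, 65504.0):
    b32 = struct.unpack(">I", struct.pack(">f", x))[0]
    expected = struct.unpack(">H", struct.pack(">e", x))[0]
    print(x, hex(float32_bits_to_float16_bits(b32)), hex(expected))
```

Inputs that are not exactly representable in float16 may differ from `struct` by one unit in the last place, since `struct` rounds to nearest while this model truncates.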
