第3B部分：使用向量API实现BFP FIR滤波器

与第3A部分一样，第3B部分使用块浮点算术实现了FIR滤波器。

在第3A部分中，我们使用纯C实现了块浮点FIR滤波器。我们需要自己管理指数和头空间，以及计算内积的逻辑，这与第2A部分中的类似。

在第3B部分中，我们将使用与第3A部分相同的数据结构来表示BFP向量，但是我们将用lib_xcore_math的低级向量API函数调用来替换计算和管理头空间和指数的C代码的工作。我们还将用vect_s32_dot()替换计算内积的for循环，我们在第2C部分中遇到过。

这也意味着我们将使用VPU来完成大部分工作，而不是标量单元。

来自`lib_xcore_math`

本页面引用了lib_xcore_math中的以下操作：

实现

在第3B部分中，calc_headroom()只是vect_s32_headroom()的一个包装函数。这不仅更简单，而且速度更快得多。vect_s32_headroom()仍然需要遍历整个vec[]数组来确定头空间，但与我们的C实现不同，vect_s32_headroom()使用VPU的专用头空间硬件每3个指令处理8个元素。

快速检查第3A部分的calc_headroom()反汇编结果表明，它处理1个元素大约需要10个指令。

在第3B部分中，filter_frame()与第3A部分有所不同。第3B部分使用lib_xcore_math中的vect_s32_dot_prepare()来帮助确定输出指数和位移值。我们在第3部分中见过vect_s32_dot_prepare()。

简单来说，vect_s32_dot_prepare()接受两个输入向量的指数和头空间，以及向量的长度，然后使用这些值确定输出指数和需要传递给vect_s32_dot()的两个right_shift_t值。

这里有一个复杂性。vect_s32_dot()（以及filter_sample()）返回int64_t值。（这是lib_xcore_math中返回标量的32位向量函数的典型做法，比如vect_s32_dot()、vect_s32_energy()和vect_s32_sum()）。因为32位模式下的VPU累加器是40位的，我们知道为了安全起见，我们需要将结果额外向右移动8位。

因此，我们将vect_s32_dot_prepare()给出的指数加上8，并将filter_sample()的结果向右移动8位。

信息

我们可以在filter_sample()的输出上向右移动8位，而不是在b_shr和c_shr输出之间总共添加8位，但是因为这些位移是在乘法之前应用的，所以添加到它们可能会导致不必要的精度损失。

src/part3B/part3B.c
// 应用滤波器以生成单个输出样本。
int64_t filter_sample(
    const int32_t sample_history[TAP_COUNT],
    const right_shift_t b_shr,
    const right_shift_t c_shr)
{
  // 使用给定的位移参数计算内积的尾数。
  return vect_s32_dot(&sample_history[0], 
                      &filter_bfp.data[0], TAP_COUNT,
                      b_shr, c_shr);
}

filter_sample()接受样本历史和由vect_s32_dot_prepare()生成的位移参数，并调用vect_s32_dot()，类似于第2C部分中的方式，只是现在返回64位结果。

src/part3B/part3B.c
// 接受新音频数据的帧并将其合并到样本历史中
static inline 
void rx_and_merge_frame(
    int32_t sample_history[HISTORY_SIZE],
    exponent_t* sample_history_exp,
    headroom_t* sample_history_hr,
    const chanend_t c_audio)
{
  // 用于放置新帧的BFP向量。
  struct {
    int32_t data[FRAME_SIZE];   // 样本数据
    exponent_t exp;             // 指数
    headroom_t hr;              // 头空间
  } frame_in = {{0},0,0};

  // 接受新的输入帧
  rx_frame(frame_in.data, 
           &frame_in.exp, 
           &frame_in.hr, 
           c_audio);

  // 如果需要，重新调整BFP向量的尺度，以便它们可以合并
  const exponent_t min_frame_in_exp = frame_in.exp - frame_in.hr;
  const exponent_t min_history_exp = *sample_history_exp - *sample_history_hr;
  const exponent_t new_exp = MAX(min_frame_in_exp, min_history_exp);

  const right_shift_t hist_shr = new_exp - *sample_history_exp;
  const right_shift_t frame_in_shr = new_exp - frame_in.exp;

  if(hist_shr) {
    vect_s32_shr(&sample_history[0], 
                 &sample_history[0], 
                 HISTORY_SIZE,
                 hist_shr);
    *sample_history_exp = new_exp;
  }

  if(frame_in_shr){
    vect_s32_shr(&frame_in.data[0],
                 &frame_in.data[0],
                 FRAME_SIZE,
                 frame_in_shr);
  }
  
  // 现在我们可以合并新帧（逆序）
  for(int k = 0; k < FRAME_SIZE; k++)
    sample_history[FRAME_SIZE-k-1] = frame_in.data[k];

  // 确保头空间正确
  *sample_history_hr = calc_headroom(sample_history, HISTORY_SIZE);
}

rx_and_merge_frame()在第3A部分和第3B部分之间的唯一区别是样本如何进行位移。第3B部分不再逐个样本循环，而是调用vect_s32_shr()，该函数使用VPU来应用有符号、算术、饱和的右移操作。

结果

时间

时间类型	测量时间
每个滤波器系数	15.14 ns
每个输出样本	15498.91 ns
每帧	4054082.75 ns

输出波形

第3B部分的输出

来自lib_xcore_math​

实现​

结果​

时间​

输出波形​

来自`lib_xcore_math`

实现

结果

时间

输出波形