Data Science and SQL: How to Compute Permutation Entropy? | An SQL Implementation

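[0 Introduction](#0 Introduction)

[1 How permutation entropy is computed](#1 How permutation entropy is computed)

[2 Data preparation](#2 Data preparation)

[3 Problem analysis](#3 Problem analysis)

[4 Summary](#4 Summary)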

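0 Introduction

Applying "entropy" to information management in systems theory is known as the entropy method. The larger the entropy, the more disordered the system and the less information it carries; the smaller the entropy, the more ordered the system and the more information it carries. In sensor signal processing, entropy methods can describe the characteristics of a sensor signal and so support effective analysis of it.

Permutation entropy (PE) is an average-entropy measure of the complexity of a one-dimensional time series. It quantifies the uncertainty of a nonlinear signal, is simple to compute, and is robust to noise. It is therefore a reasonable choice for extracting the fault features contained in intrinsic mode functions (IMFs). Each IMF obtained by ensemble empirical mode decomposition (EEMD) captures the behavior of the sensor signal at a different time scale. Computing the permutation entropy of each IMF and assembling the values into a feature vector highlights sensor fault features across multiple scales.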

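1 How permutation entropy is computed

Consider a one-dimensional time series x of length n with elements x(1), x(2), ..., x(n).

① Reconstruct the phase space of x with the delay-coordinate method. Starting a row at each element x(i) gives the matrix

$$
X = \begin{bmatrix}
x(1) & x(1+\tau) & \cdots & x(1+(m-1)\tau) \\
\vdots & \vdots & & \vdots \\
x(i) & x(i+\tau) & \cdots & x(i+(m-1)\tau) \\
\vdots & \vdots & & \vdots \\
x(K) & x(K+\tau) & \cdots & x(K+(m-1)\tau)
\end{bmatrix}
$$

where i = 1, 2, ..., K, K = n - (m-1)τ is the number of reconstructed components, m is the embedding dimension, τ is the delay time, and the i-th row X(i) is the i-th reconstructed component.

② Sort the elements of each reconstructed vector X(i) in ascending order (equal values keep their original order):

$$
x(i+(j_1-1)\tau) \le x(i+(j_2-1)\tau) \le \cdots \le x(i+(j_m-1)\tau)
$$

This yields the index sequence j1, j2, ..., jm. An m-dimensional embedding admits at most m! distinct ordinal patterns, and P(l) denotes one of them:

$$
P(l) = (j_1, j_2, \ldots, j_m)
$$

where l = 1, 2, ..., k and k ≤ m!.

③ Count how often each pattern occurs among the K rows and compute its relative frequency:

$$
p_l = \frac{\text{number of rows whose pattern is } P(l)}{K}
$$

The resulting probabilities are p1, p2, ..., pk.

④ The entropy of the ordinal patterns of the signal is:

$$
H(m) = -\sum_{l=1}^{k} p_l \ln p_l
$$

⑤ When p_l = 1/m! for every l, that is, every pattern occurs and all patterns are equally likely, the time series is at its most complex and the permutation entropy reaches its maximum, ln(m!). For convenience, H(m) is usually divided by ln(m!) to normalize it:

$$
0 \le H_{\text{norm}}(m) = \frac{H(m)}{\ln(m!)} \le 1
$$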
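Steps ③ to ⑤ can be sketched in a few lines of Python. This is an illustrative sketch rather than part of the original article: the function names `entropy_from_counts` and `normalized_entropy` are hypothetical, and the input is assumed to be the raw occurrence count of each observed pattern.

```python
import math

def entropy_from_counts(counts, base=math.e):
    """Shannon entropy of the pattern distribution, given raw pattern counts."""
    total = sum(counts)
    # Patterns that never occur contribute nothing to the sum.
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in probs)

def normalized_entropy(counts, m, base=math.e):
    """Divide by log(m!) so the result lies in [0, 1]."""
    return entropy_from_counts(counts, base) / math.log(math.factorial(m), base)

# All 3! = 6 patterns equally frequent: the normalized value reaches its maximum.
print(normalized_entropy([4, 4, 4, 4, 4, 4], m=3))
# Only one pattern ever occurs: the entropy is zero.
print(entropy_from_counts([24]))
# The normalized value does not depend on the logarithm base.
print(normalized_entropy([2, 2, 1], m=3, base=2),
      normalized_entropy([2, 2, 1], m=3, base=math.e))
```

Because the same base appears in the numerator and the denominator, the normalized value is identical whether ln or log2 is used; only the unnormalized H(m) changes units.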

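Worked example:

The steps are easier to follow on a small example:

x = {2, 4, 5, 6, 3, 7, 1}, with length n = 7

Let the embedding dimension be m = 3 (a 3-neighborhood) and the time delay be τ = 1 (no skipping)

This gives K = n - (m-1)τ = 5 subsequences: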

(1) 2,4,5

(2) 4,5,6

(3) 5,6,3

(4) 6,3,7

(5) 3,7,1
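The five subsequences above can be reproduced with a short Python sketch. It is not part of the original article; the helper name `embed` is hypothetical and simply applies K = n - (m-1)τ with 0-based indexing.

```python
def embed(x, m, tau):
    """Return the K = n - (m - 1) * tau delay vectors of the series x."""
    k = len(x) - (m - 1) * tau
    # Row j takes every tau-th sample starting at x[j]; an empty list results if k <= 0.
    return [[x[j + i * tau] for i in range(m)] for j in range(k)]

x = [2, 4, 5, 6, 3, 7, 1]

for window in embed(x, m=3, tau=1):
    print(window)

# With tau = 2 each window skips one sample, so only K = 3 windows remain.
print(embed(x, m=3, tau=2))
```

A larger τ spreads each window over a longer stretch of the series and reduces the number of windows available for counting.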

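Each subsequence is then converted into a permutation that encodes the relative order of its values:

For each of the K subsequences, sort the values in ascending order (equal values are ordered by index) and return the corresponding indices.

Note that "indices" can be read in two ways here:

(1) The rank of each value

For example, for [5, 6, 3] the ranks are [2, 3, 1]

Explanation: 5 is the 2nd smallest value, 6 is the 3rd smallest, and 3 is the smallest, so the result is [2, 3, 1]

(2) The original index of the value that lands at each sorted position [1, 2, 3]

For example, for [5, 6, 3] the indices after sorting are [3, 1, 2]

Explanation: the element in 1st position after sorting has index 3, the element in 2nd position has index 1, and the element in 3rd position has index 2, so the result is [3, 1, 2]

The choice does not affect the final result, because the two encodings map one-to-one onto each other and the pattern frequencies are therefore the same. This article uses the first reading; the results are: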
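The two readings can be compared directly in Python. The sketch below is illustrative only; `argsort_pattern` and `rank_pattern` are hypothetical helper names, and ties are broken by index as described above.

```python
def argsort_pattern(window):
    """Reading (2): 1-based original index of the element at each sorted position."""
    order = sorted(range(len(window)), key=lambda i: (window[i], i))
    return [i + 1 for i in order]

def rank_pattern(window):
    """Reading (1): 1-based rank of each element, in original order."""
    ranks = [0] * len(window)
    for rank, idx in enumerate(argsort_pattern(window), start=1):
        ranks[idx - 1] = rank
    return ranks

window = [5, 6, 3]
print(rank_pattern(window))     # [2, 3, 1]
print(argsort_pattern(window))  # [3, 1, 2]
```

Each result is the inverse permutation of the other, so two windows share a pattern under one reading exactly when they share it under the other.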

(1) 1,2,3

(2) 1,2,3

(3) 2,3,1

(4) 2,1,3

(5) 2,3,1

Three distinct permutations occur: (1,2,3) twice, (2,3,1) twice, and (2,1,3) once. Their probabilities are:

(1) P(1,2,3) = 2/5

(2) P(2,3,1) = 2/5

(3) P(2,1,3) = 1/5

Computing the information entropy with base-2 logarithms (so the result is in bits) gives H(3) = 0.4*log2(2.5) + 0.4*log2(2.5) + 0.2*log2(5) = 1.5219
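To complete the example, the value can be normalized by the maximum entropy for m = 3. The short derivation below is an addition for illustration. It uses slightly more digits than the text to avoid rounding drift, and it shows that switching from log2 to ln leaves the normalized value unchanged:

$$
H_{\text{norm}}(3) = \frac{H(3)}{\log_2 3!} \approx \frac{1.52193}{2.58496} \approx 0.5888
$$

$$
H(3)\,\ln 2 \approx 1.52193 \times 0.693147 \approx 1.05492 \ \text{nats}, \qquad \frac{1.05492}{\ln 3!} \approx \frac{1.05492}{1.79176} \approx 0.5888
$$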

2 Data preparation


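-- id preserves the time order; data holds the sample value.
-- The values are stored as numbers so that later sorting is numeric, not lexicographic.
create table permutation_entropy as
    (select stack(
                    7,
                    1, 2,
                    2, 4,
                    3, 5,
                    4, 6,
                    5, 3,
                    6, 7,
                    7, 1
            ) as (id, data));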

3 Problem analysis

Step 1: split the data into blocks for m = 3, τ = 1


select id,
       data,
       data_block
from (select id,
             data,
             collect_list(data) over (order by id rows between current row and 2 following) data_block
      from permutation_entropy) t
where size(data_block) >= 3

Step 2: compute the indices of each block after sorting in ascending order


select id,
       data_block,
       pos + 1  pos,
       tmp.data data,
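       -- rank inside each block: numeric order, ties broken by position; partition by id keeps identical blocks apart
       row_number() over (partition by id order by cast(tmp.data as double), tmp.pos) rn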
from (select id,
             data,
             data_block
      from (select id,
                   data,
                   collect_list(data) over (order by id rows between current row and 2 following) data_block
            from permutation_entropy) t
      where size(data_block) >= 3) data_block
         lateral view posexplode(data_block) tmp as pos, data

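Here pos is the 1-based position of each element inside its block, and rn is the rank of that element within the block after ascending sort.

The SQL that returns the pattern of each block is: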


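-- Build one deterministic pattern key per block: "position:rank" pairs sorted by position.
-- collect_list alone does not guarantee element order; the string sort is only safe while m <= 9.
select id,
                    data_block,
                    concat_ws(',', sort_array(collect_list(concat(cast(pos as string), ':', cast(rn as string))))) pos_arr

             from (select id,
                          data_block,
                          pos + 1                                                       pos,
                          tmp.data                                                      data,
                          -- numeric order, ties broken by position; partition by id keeps identical blocks apart
                          row_number() over (partition by id order by cast(tmp.data as double), tmp.pos) rn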
                   from (select id,
                                data,
                                data_block
                         from (select id,
                                      data,
                                      collect_list(data)
                                                   over (order by id rows between current row and 2 following) data_block
                               from permutation_entropy) t
                         where size(data_block) >= 3) data_block
                            lateral view posexplode(data_block) tmp as pos, data) t
             group by id, data_block

Step 3: compute the probability of each permutation pattern across the blocks


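-- pos_arr is a deterministic pattern key: "position:rank" pairs sorted by position (safe while m <= 9)
with pos as (select id,
                    data_block,
                    concat_ws(',', sort_array(collect_list(concat(cast(pos as string), ':', cast(rn as string))))) pos_arr

             from (select id,
                          data_block,
                          pos + 1                                                       pos,
                          tmp.data                                                      data,
                          -- numeric order, ties broken by position; partition by id keeps identical blocks apart
                          row_number() over (partition by id order by cast(tmp.data as double), tmp.pos) rn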
                   from (
                   select id,
                                data,
                                data_block
                         from (select id,
                                      data,
                                      collect_list(data)
                                                   over (order by id rows between current row and 2 following) data_block
                               from permutation_entropy) t
                         where size(data_block) >= 3
                         ) data_block
                            lateral view posexplode(data_block) tmp as pos, data) t
             group by id, data_block
             )
select pos_arr
     , count(1) data_block_cnt
     , max(ttl_cnt) ttl_cnt
     , cast(count(1) / nullif(max(ttl_cnt),0) as decimal(18,4))  p
from
    (select id,
            data_block,
            pos_arr,
            count(1) over () ttl_cnt
     from pos
    ) t
group by  pos_arr

Step 4: compute the final result with the entropy formula


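-- pos_arr is a deterministic pattern key: "position:rank" pairs sorted by position (safe while m <= 9)
with pos as (select id,
                    data_block,
                    concat_ws(',', sort_array(collect_list(concat(cast(pos as string), ':', cast(rn as string))))) pos_arr

             from (select id,
                          data_block,
                          pos + 1                                                       pos,
                          tmp.data                                                      data,
                          -- numeric order, ties broken by position; partition by id keeps identical blocks apart
                          row_number() over (partition by id order by cast(tmp.data as double), tmp.pos) rn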
                   from (
                   select id,
                                data,
                                data_block
                         from (select id,
                                      data,
                                      collect_list(data)
                                                   over (order by id rows between current row and 2 following) data_block
                               from permutation_entropy) t
                         where size(data_block) >= 3
                         ) data_block
                            lateral view posexplode(data_block) tmp as pos, data) t
             group by id, data_block
             )
select
       cast(-sum(p*log2(p)) as decimal(18, 4)) permutation_entropy
from
    (select pos_arr
          , count(1)                                                   data_block_cnt
          , max(ttl_cnt)                                               ttl_cnt
          , cast(count(1) / nullif(max(ttl_cnt), 0) as decimal(18, 4)) p
     from (select id,
                  data_block,
                  pos_arr,
                  count(1) over () ttl_cnt
           from pos) t
     group by pos_arr) t

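Step 5: compute the normalized result.

To bring the entropy into the range 0 to 1, normalize it by dividing by log2(m!), which is log2(3!) = log2(6) here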


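-- pos_arr is a deterministic pattern key: "position:rank" pairs sorted by position (safe while m <= 9)
with pos as (select id,
                    data_block,
                    concat_ws(',', sort_array(collect_list(concat(cast(pos as string), ':', cast(rn as string))))) pos_arr

             from (select id,
                          data_block,
                          pos + 1                                                       pos,
                          tmp.data                                                      data,
                          -- numeric order, ties broken by position; partition by id keeps identical blocks apart
                          row_number() over (partition by id order by cast(tmp.data as double), tmp.pos) rn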
                   from (
                   select id,
                                data,
                                data_block
                         from (select id,
                                      data,
                                      collect_list(data)
                                                   over (order by id rows between current row and 2 following) data_block
                               from permutation_entropy) t
                         where size(data_block) >= 3
                         ) data_block
                            lateral view posexplode(data_block) tmp as pos, data) t
             group by id, data_block
             )
select permutation_entropy
     , cast( permutation_entropy / log2(3*2*1) as  decimal(18, 4)) normal_permutation_entropy
from
    (select cast(-sum(p * log2(p)) as decimal(18, 4)) permutation_entropy
     from (select pos_arr
                , count(1)                                                   data_block_cnt
                , max(ttl_cnt)                                               ttl_cnt
                , cast(count(1) / nullif(max(ttl_cnt), 0) as decimal(18, 4)) p
           from (select id,
                        data_block,
                        pos_arr,
                        count(1) over () ttl_cnt
                 from pos) t
           group by pos_arr) t) t
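The two values returned by the query can be cross-checked outside SQL. The following Python sketch is not part of the original article: the function name `permutation_entropy` and its defaults are assumptions chosen to mirror the query (m = 3, τ = 1, base-2 logarithms, rank patterns with ties broken by position).

```python
import math
from collections import Counter

def permutation_entropy(x, m=3, tau=1, base=2):
    """Return (entropy, normalized entropy) of the ordinal patterns of x."""
    k = len(x) - (m - 1) * tau
    if k <= 0:
        raise ValueError("series too short for the chosen m and tau")
    patterns = Counter()
    for j in range(k):
        window = [x[j + i * tau] for i in range(m)]
        # Sort positions by value, breaking ties by position.
        order = sorted(range(m), key=lambda i: (window[i], i))
        ranks = [0] * m
        for rank, idx in enumerate(order, start=1):
            ranks[idx] = rank
        patterns[tuple(ranks)] += 1
    h = -sum((c / k) * math.log(c / k, base) for c in patterns.values())
    return h, h / math.log(math.factorial(m), base)

h, h_norm = permutation_entropy([2, 4, 5, 6, 3, 7, 1])
print(round(h, 4), round(h_norm, 4))  # 1.5219 0.5888
```

Unlike the query, this sketch takes m as a parameter and derives log(m!) with `math.factorial`, so the normalization constant does not have to be written out by hand.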

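4 Summary

This article implemented permutation entropy, a feature commonly used in time series analysis, in SQL. Permutation entropy only reflects the complexity of the one-dimensional time series at hand. External factors such as ambient temperature and weather can cause abrupt changes and noise in the signal, so noise interference needs to be excluded.

As a measure of time series complexity, permutation entropy is smaller for more regular series and larger for more complex ones. This holds only when m is chosen appropriately: if m is very small, such as 1 or 2, the permutation space is tiny (1! or 2! patterns).

The computation shows that the value of permutation entropy depends on the embedding dimension m, the delay time τ, and the data length N. The literature reports that an embedding dimension m between 4 and 8 discriminates well between signals from different sensor states. With m < 4, permutation entropy cannot accurately detect dynamic changes in the sensor signal; with m > 8, the computation grows and the range of the entropy narrows, which makes it hard to measure signal complexity accurately. The delay time τ has little influence on permutation entropy, but with τ > 5 the entropy cannot accurately detect small changes in the sensor signal. The data length N also matters: if N is too large the signal is smoothed and its dynamic changes are not measured accurately, and if N is too small the result loses statistical significance.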
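One way to see why N cannot be too small relative to m is a simple counting bound: K windows can show at most min(K, m!) distinct patterns, so the normalized entropy cannot exceed log(min(K, m!)) / log(m!). The Python sketch below illustrates this bound. It is an added observation rather than a result taken from the cited literature, and the function name `pe_upper_bound` is hypothetical.

```python
import math

def pe_upper_bound(n, m, tau=1):
    """Upper bound on normalized permutation entropy for a series of length n."""
    if m < 2:
        raise ValueError("m must be at least 2, otherwise log(m!) is zero")
    k = n - (m - 1) * tau
    if k <= 0:
        raise ValueError("series too short for the chosen m and tau")
    # At most min(k, m!) distinct patterns can appear among k windows.
    distinct_max = min(k, math.factorial(m))
    return math.log(distinct_max) / math.log(math.factorial(m))

# The 7-point example has only 5 windows for 6 possible patterns.
print(round(pe_upper_bound(7, 3), 4))
# A short series with a large m is capped well below 1.
print(round(pe_upper_bound(100, 8), 4))
# With many more windows than patterns the cap is no longer binding.
print(pe_upper_bound(1000, 3))
```

For the example series in this article the bound is below 1, so even a maximally irregular 7-point series could not reach a normalized entropy of 1 with m = 3.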

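References:

Liu Yongbin. Research on condition monitoring and diagnosis of rolling bearings based on nonlinear signal analysis [D]. Hefei: University of Science and Technology of China, 2011.

Bandt C, Pompe B. Permutation entropy: a natural complexity measure for time series [J]. Physical Review Letters, 2002, 88(17): 174102.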
