Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T. Freeman, and Michael Rubinstein. 2018. Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation. ACM Trans. Graph. 37, 4, Article 112 (August 2018), 11 pages. https://doi.org/10.1145/3197517.3201357
In previous work [Wang and Chen 2017; Wang et al. 2014], masking-based approaches were observed to be more effective than alternatives such as direct prediction of spectrogram magnitudes or direct prediction of time-domain waveforms. Many masking-based training targets exist in the source separation literature [Wang and Chen 2017]; we experimented with two of them: the ratio mask (RM) and the complex ratio mask (cRM).
The ideal ratio mask (RM) is defined as the ratio between the magnitudes of the clean and noisy spectrograms, and is constrained to lie between 0 and 1.
When using the ratio mask, the predicted mask is multiplied pointwise with the magnitude of the noisy spectrogram, and the result is combined with the original noisy phase and passed through an inverse short-time Fourier transform (ISTFT) to obtain the denoised waveform [Wang and Chen 2017].
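A minimal NumPy/SciPy sketch of this mask application, assuming the STFT parameters reported elsewhere in the paper (25 ms Hann window, 10 ms hop, FFT size 512 at 16 kHz); the function name is ours:

```python
import numpy as np
from scipy.signal import istft

def apply_ratio_mask(noisy_stft, mask, fs=16000, win=400, hop=160, n_fft=512):
    """Apply a predicted ratio mask to a noisy STFT and reconstruct audio.

    The mask scales the noisy magnitude pointwise; the noisy phase is kept
    unchanged, and the masked spectrogram is inverted with an ISTFT.
    """
    magnitude = np.abs(noisy_stft)
    phase = np.angle(noisy_stft)
    enhanced = (mask * magnitude) * np.exp(1j * phase)  # pointwise multiply
    _, waveform = istft(enhanced, fs=fs, nperseg=win, noverlap=win - hop,
                        nfft=n_fft)
    return waveform
```

With an all-ones mask this reduces to a plain STFT round trip, which is a convenient sanity check of the window/hop settings.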
The complex ideal ratio mask is defined as the ratio between the complex clean and noisy spectrograms. The cRM has a real component and an imaginary component, which are estimated separately in the real domain. The real and imaginary parts of a complex mask typically lie between -1 and 1; however, we use sigmoid compression to bound these values to between 0 and 1 [Wang et al. 2016].
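The ideal cRM and its application can be sketched as follows; the `eps` stabilizer and the plain logistic sigmoid are illustrative assumptions (the exact compression used in [Wang et al. 2016] may differ):

```python
import numpy as np

def complex_ratio_mask(clean_stft, noisy_stft, eps=1e-8):
    """Ideal cRM: complex ratio of clean to noisy STFT, returned as
    separate real and imaginary parts (as the network estimates them)."""
    ratio = clean_stft / (noisy_stft + eps)
    return ratio.real, ratio.imag

def sigmoid_compress(m):
    """Sigmoid compression bounding mask values to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-m))

def apply_crm(noisy_stft, m_real, m_imag):
    """Apply a complex mask via complex multiplication with the noisy STFT."""
    return (m_real + 1j * m_imag) * noisy_stft
```

Applying the ideal cRM to the noisy spectrogram recovers the clean spectrogram (up to the `eps` stabilizer), including phase, which is what distinguishes it from the magnitude-only RM.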
All audio is resampled to 16 kHz, and stereo audio is converted to mono by taking only the left channel. The STFT is computed using a Hann window of length 25 ms, a hop length of 10 ms, and an FFT size of 512, resulting in an input audio feature of $257 \times 298 \times 2$ scalars. Power-law compression is applied with $p = 0.3$ ($A^{0.3}$, where $A$ is the input/output audio spectrogram).
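The feature computation above can be sketched as follows for a 3-second clip; applying the power-law compression sign-preservingly to the real and imaginary channels is our assumption, and the function name is ours:

```python
import numpy as np

def audio_features(waveform, sr=16000, p=0.3):
    """Compute the 257 x 298 x 2 STFT feature for a 3-second, 16 kHz clip.

    25 ms Hann window (400 samples), 10 ms hop (160 samples), FFT size 512.
    Real and imaginary parts are stacked as two channels, each compressed
    with a sign-preserving power law |a|^p.
    """
    win = int(0.025 * sr)                          # 400 samples
    hop = int(0.010 * sr)                          # 160 samples
    n_fft = 512
    window = np.hanning(win)
    n_frames = 1 + (len(waveform) - win) // hop    # 298 for 48000 samples
    spec = np.empty((n_fft // 2 + 1, n_frames), dtype=complex)
    for t in range(n_frames):
        frame = waveform[t * hop: t * hop + win] * window
        spec[:, t] = np.fft.rfft(frame, n=n_fft)   # 257 frequency bins

    def compress(a):
        return np.sign(a) * (np.abs(a) ** p)

    return np.stack([compress(spec.real), compress(spec.imag)], axis=-1)
```

Note that 3 s at 16 kHz gives 48000 samples, and $1 + \lfloor(48000 - 400)/160\rfloor = 298$ frames, matching the stated feature size.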
Here we consider the task of isolating the voice of one speaker from a mixture of two speakers and non-speech background noise. To our knowledge, this audio-visual task has not been addressed before. Training data is generated by mixing clean speech from two different speakers (as generated for the 2S clean task) with background noise from AudioSet:
$$Mix_i = AVS_j + AVS_k + 0.3 \cdot AudioSet_l$$
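As a sketch, the mixing above amounts to a weighted sum of waveforms; trimming all inputs to a common length is a simplification of ours:

```python
import numpy as np

def make_mixture(speech_j, speech_k, audioset_noise, noise_gain=0.3):
    """Mix two clean speech waveforms with attenuated AudioSet noise,
    per Mix_i = AVS_j + AVS_k + 0.3 * AudioSet_l. Inputs are assumed to
    be 16 kHz waveforms; they are trimmed to the shortest length here."""
    n = min(len(speech_j), len(speech_k), len(audioset_noise))
    return speech_j[:n] + speech_k[:n] + noise_gain * audioset_noise[:n]
```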
To quantify the difference between network outputs, we use SNR, treating the result with no occlusion as the "signal"^5^. That is, for each spatio-temporal occluder, we compute:
$$E = 10\cdot\log\left(\frac{S_{orig}^2}{(S_{occ}-S_{orig})^2}\right)\tag{1}$$
Repeating this process for all spatio-temporal occluders in the video yields a heat map for each frame. For visualization, we normalize the heat maps by the maximum SNR of the video:
$$\tilde{E} = E_{max} - E$$
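The two quantities above can be sketched as follows; the base-10 logarithm (implied by the dB convention) and the `eps` stabilizer are assumptions of ours:

```python
import numpy as np

def occlusion_snr(s_orig, s_occ, eps=1e-12):
    """Eq. (1): SNR in dB between the unoccluded output (the 'signal')
    and the change induced by one spatio-temporal occluder."""
    num = np.sum(s_orig ** 2)
    den = np.sum((s_occ - s_orig) ** 2) + eps
    return 10.0 * np.log10(num / den)

def normalize_heatmap(E):
    """Normalize by the video's maximum SNR: E~ = Emax - E, so larger
    values mark occluders that perturb the output more."""
    return np.max(E) - E
```

After normalization, the occluder with the largest effect on the output maps to the largest value in the heat map, and the least influential occluder maps to zero.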
REFERENCES
T. Afouras, J. S. Chung, and A. Zisserman. 2018. The Conversation: Deep Audio-Visual Speech Enhancement. In arXiv:1804.04121.
Anna Llagostera Casanovas, Gianluca Monaci, Pierre Vandergheynst, and Rémi Gribonval. 2010. Blind audiovisual source separation based on sparse redundant representations. IEEE Transactions on Multimedia 12, 5 (2010), 358--371.
E Colin Cherry. 1953. Some experiments on the recognition of speech, with one and with two ears. The Journal of the Acoustical Society of America 25, 5 (1953), 975--979.
Joon Son Chung, Andrew W. Senior, Oriol Vinyals, and Andrew Zisserman. 2016. Lip Reading Sentences in the Wild. CoRR abs/1611.05358 (2016).
Forrester Cole, David Belanger, Dilip Krishnan, Aaron Sarna, Inbar Mosseri, and William T Freeman. 2016. Synthesizing normalized faces from facial identity features. In CVPR'17.
Pierre Comon and Christian Jutten. 2010. Handbook of Blind Source Separation: Independent component analysis and applications. Academic press.
Masood Delfarah and DeLiang Wang. 2017. Features for Masking-Based Monaural Speech Separation in Reverberant Conditions. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (2017), 1085--1094.
Ariel Ephrat, Tavi Halperin, and Shmuel Peleg. 2017. Improved Speech Reconstruction from Silent Video. In ICCV 2017 Workshop on Computer Vision for Audio-Visual Media.
Hakan Erdogan, John R. Hershey, Shinji Watanabe, and Jonathan Le Roux. 2015. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015).
Weijiang Feng, Naiyang Guan, Yuan Li, Xiang Zhang, and Zhigang Luo. 2017. Audio-visual speech recognition with multimodal recurrent neural networks. In Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 681--688.
Aviv Gabbay, Ariel Ephrat, Tavi Halperin, and Shmuel Peleg. 2018. Seeing Through Noise: Speaker Separation and Enhancement using Visually-derived Speech. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2018).
Aviv Gabbay, Asaph Shamir, and Shmuel Peleg. 2017. Visual Speech Enhancement using Noise-Invariant Training. arXiv preprint arXiv:1711.08789 (2017).
R. Gao, R. Feris, and K. Grauman. 2018. Learning to Separate Object Sounds by Watching Unlabeled Video. arXiv preprint arXiv:1804.01665 (2018).
Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio Set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017.
Elana Zion Golumbic, Gregory B Cogan, Charles E. Schroeder, and David Poeppel. 2013. Visual input enhances selective speech envelope tracking in auditory cortex at a "cocktail party". The Journal of Neuroscience: the official journal of the Society for Neuroscience 33, 4 (2013), 1417--26.
Naomi Harte and Eoin Gillen. 2015. TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Transactions on Multimedia 17, 5 (2015), 603--615.
David F. Harwath, Antonio Torralba, and James R. Glass. 2016. Unsupervised Learning of Spoken Language with Visual Context. In NIPS.
John Hershey, Hagai Attias, Nebojsa Jojic, and Trausti Kristjansson. 2004. Audio-visual graphical models for speech processing. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
John R Hershey and Michael Casey. 2002. Audio-visual sound separation via hidden Markov models. In Advances in Neural Information Processing Systems. 1173--1180.
John R. Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe. 2016. Deep clustering: Discriminative embeddings for segmentation and separation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2016), 31--35.
Andrew Hines, Eoin Gillen, Damien Kelly, Jan Skoglund, Anil C. Kokaram, and Naomi Harte. 2015. ViSQOLAudio: An objective audio quality metric for low bitrate codecs. The Journal of the Acoustical Society of America 137, 6 (2015), EL449--55.
Andrew Hines and Naomi Harte. 2012. Speech Intelligibility Prediction Using a Neurogram Similarity Index Measure. Speech Commun. 54, 2 (Feb. 2012), 306--320. DOI: http://dx.doi.org/10.1016/j.specom.2011.09.004
Ken Hoover, Sourish Chaudhuri, Caroline Pantofaru, Malcolm Slaney, and Ian Sturdy. 2017. Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers. CoRR abs/1706.00079 (2017).
Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Jen-Chun Lin, Yu Tsao, Hsiu-Wen Chang, and Hsin-Min Wang. 2018. Audio-Visual Speech Enhancement Using Multi-modal Deep Convolutional Neural Networks. IEEE Transactions on Emerging Topics in Computational Intelligence 2, 2 (2018), 117--128.
Yongtao Hu, Jimmy SJ Ren, Jingwen Dai, Chang Yuan, Li Xu, and Wenping Wang. 2015. Deep multimodal speaker naming. In Proceedings of the 23rd ACM international conference on Multimedia. ACM, 1107--1110.
Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML.
Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, and John R Hershey. 2016. Single-Channel Multi-Speaker Separation Using Deep Clustering. Interspeech (2016), 545--549.
Faheem Khan. 2016. Audio-visual speaker separation. Ph.D. Dissertation. University of East Anglia.
Wei Ji Ma, Xiang Zhou, Lars A. Ross, John J. Foxe, and Lucas C. Parra. 2009. Lip-Reading Aids Word Recognition Most in Moderate Noise: A Bayesian Explanation Using High-Dimensional Feature Space. PLoS ONE 4 (2009), 233--252.
Josh H McDermott. 2009. The cocktail party problem. Current Biology 19, 22 (2009), R1024--R1027.
Gianluca Monaci. 2011. Towards real-time audiovisual speaker localization. In Signal Processing Conference, 2011 19th European. IEEE, 1055--1059.
Youssef Mroueh, Etienne Marcheret, and Vaibhava Goel. 2015. Deep multimodal learning for audio-visual speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2130--2134.
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. 2011. Multimodal Deep Learning. In ICML.
Andrew Owens and Alexei A Efros. 2018. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. (2018).
Eric K. Patterson, Sabri Gurbuz, Zekeriya Tufekci, and John N. Gowdy. 2002. Moving-Talker, Speaker-Independent Feature Study, and Baseline Results Using the CUAVE Multimodal Speech Corpus. EURASIP J. Adv. Sig. Proc. 2002 (2002), 1189--1201.
Jie Pu, Yannis Panagakis, Stavros Petridis, and Maja Pantic. 2017. Audio-visual object localization and separation using low-rank and sparsity. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2901--2905.
Bertrand Rivet, Wenwu Wang, Syed M. Naqvi, and Jonathon A. Chambers. 2014. Audio-visual Speech Source Separation: An overview of key methodologies. IEEE Signal Processing Magazine 31 (2014), 125--134.
Antony W Rix, John G Beerends, Michael P Hollier, and Andries P Hekstra. 2001. Perceptual evaluation of speech quality (PESQ): a new method for speech quality assessment of telephone networks and codecs. In Acoustics, Speech, and Signal Processing, 2001. Proceedings (ICASSP'01). 2001 IEEE International Conference on, Vol. 2. IEEE, 749--752.
Ethan M Rudd, Manuel Günther, and Terrance E Boult. 2016. Moon: A mixed objective optimization network for the recognition of facial attributes. In European Conference on Computer Vision. Springer, 19--35.
J S Garofolo, Lori Lamel, W M Fisher, Jonathan Fiscus, D S. Pallett, N L. Dahlgren, and V Zue. 1992. TIMIT Acoustic-phonetic Continuous Speech Corpus. (Nov. 1992).
Lei Sun, Jun Du, Li-Rong Dai, and Chin-Hui Lee. 2017. Multiple-target deep learning for LSTM-RNN based speech enhancement. In HSCMA.
Cees H Taal, Richard C Hendriks, Richard Heusdens, and Jesper Jensen. 2010. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 4214--4217.
Emmanuel Vincent, Jon Barker, Shinji Watanabe, Jonathan Le Roux, Francesco Nesta, and Marco Matassoni. 2013. The second 'chime' speech separation and recognition challenge: Datasets, tasks and baselines. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013), 126--130.
E. Vincent, R. Gribonval, and C. Fevotte. 2006. Performance Measurement in Blind Audio Source Separation. Trans. Audio, Speech and Lang. Proc. 14, 4 (2006), 1462--1469.
DeLiang Wang and Jitong Chen. 2017. Supervised Speech Separation Based on Deep Learning: An Overview. CoRR abs/1708.07524 (2017).
Yuxuan Wang, Arun Narayanan, and DeLiang Wang. 2014. On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 22, 12 (2014), 1849--1858.
Ziteng Wang, Xiaofei Wang, Xu Li, Qiang Fu, and Yonghong Yan. 2016. Oracle performance investigation of the ideal masks. In IWAENC.
Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R. Hershey, and Björn W. Schuller. 2015. Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR. In LVA/ICA.
Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen. 2017. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2017), 241--245.
Matthew D Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In European conference on computer vision. Springer, 818--833.
Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. 2018. The Sound of Pixels. (2018).
Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2014. Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856 (2014).
SDR is the most general score and is commonly used to report the performance of speech separation algorithms. It is measured in decibels (dB) and is defined as follows:
$$SDR := 10\cdot\log_{10}\left(\frac{\|s_{target}\|^{2}}{\|e_{interf}+e_{noise}+e_{artif}\|^{2}}\right)\tag{2}$$
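A simplified sketch of Eq. (2): the full BSS Eval decomposition [Vincent et al. 2006] separates the error into interference, noise, and artifact terms via allowed distortion filters; here we collapse them into a single error term by projecting the estimate onto the reference:

```python
import numpy as np

def sdr(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: project the estimate onto the reference to get
    s_target, and treat everything else as a single combined error term
    (e_interf + e_noise + e_artif in Eq. 2)."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    s_target = alpha * reference          # component aligned with the reference
    e_total = estimate - s_target         # everything else is distortion
    return 10.0 * np.log10(np.sum(s_target ** 2) / (np.sum(e_total ** 2) + eps))
```

A perfect (or merely rescaled) estimate yields a very large SDR, and adding noise to the estimate lowers it, matching the intuition that higher SDR means better separation.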
The Virtual Speech Quality Objective Listener (ViSQOL) is an objective speech quality model proposed by Hines et al. [2015]. The metric models human perception of speech quality using a spectro-temporal similarity measure between a reference (r) and a degraded (d) speech signal, and is based on the Neurogram Similarity Index Measure (NSIM) [Hines and Harte 2012]. NSIM is defined as follows:
$$NSIM(r,d)=\frac{2\mu_{r}\mu_{d}+C_{1}}{\mu_{r}^{2}+\mu_{d}^{2}+C_{1}}\cdot\frac{\sigma_{rd}+C_{2}}{\sigma_{r}\sigma_{d}+C_{2}}\tag{3}$$
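Eq. (3) can be sketched as follows; computing the statistics globally over whole patches is a simplification (ViSQOL evaluates NSIM over local windows of the neurogram), and the values of $C_1$ and $C_2$ here are illustrative:

```python
import numpy as np

def nsim(r, d, C1=1e-4, C2=1e-4):
    """NSIM (Eq. 3) between a reference patch r and a degraded patch d.

    mu_* are means, sigma_* are (population) standard deviations, and
    sigma_rd is the covariance; C1, C2 are small stabilizing constants."""
    mu_r, mu_d = r.mean(), d.mean()
    sig_r, sig_d = r.std(), d.std()
    sig_rd = ((r - mu_r) * (d - mu_d)).mean()
    luminance = (2 * mu_r * mu_d + C1) / (mu_r ** 2 + mu_d ** 2 + C1)
    structure = (sig_rd + C2) / (sig_r * sig_d + C2)
    return luminance * structure
```

By construction, NSIM of a signal with itself is 1 (both factors reduce to 1), and dissimilar patches score lower.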