统计研究, 2020, Vol. 37, Issue (5): 104-116. doi: 10.19343/j.cnki.11-1302/c.2020.05.009


稳健高效的高维成分数据近似零值插补方法及应用

熊巍, 潘晗, 刘立新

  • 出版日期:2020-05-25 发布日期:2020-05-12

Robust Efficient Imputation of Rounded Zeros in High-dimensional Compositional Data and Its Applications

Xiong Wei, Pan Han, Liu Lixin

  • Online: 2020-05-25  Published: 2020-05-12

摘要: 随着计算机技术的迅猛发展,高维成分数据不断涌现并伴有大量近似零值和缺失,数据的高维特性不仅给传统统计方法带来了巨大的挑战,其厚尾特征、复杂的协方差结构也使得理论分析难上加难。于是如何对高维成分数据的近似零值进行稳健的插补,挖掘潜在的内蕴结构成为当今学者研究的焦点。对此,本文结合修正的EM算法,提出基于R型聚类的Lasso-分位回归插补法(SubLQR)对高维成分数据的近似零值问题予以解决。与现有高维近似零值插补方法相比,本文所提出的SubLQR具有如下优势。①稳健全面性:利用Lasso-分位回归方法,不仅可以有效地探测到响应变量的整个条件分布,还能提供更加真实的高维稀疏模式;②有效准确性:采用基于R型聚类的思想进行插补,可以降低计算复杂度,极大提高插补的精度。模拟研究证实,本文提出的SubLQR高效灵活准确,特别在零值、异常值较多的情形更具优势。最后将SubLQR方法应用于罕见病代谢组学研究中,进一步表明本文所提出的方法具有广泛的适用性。

关键词: 高维成分数据, 近似零值, Lasso-分位回归, 修正EM算法, 稳健

Abstract: With the rapid development of computer technology, high-dimensional compositional data containing large numbers of rounded zeros and missing values are increasingly common. Their high dimensionality poses major challenges for traditional statistical methods, and their heavy tails and complicated covariance structure make theoretical analysis even more difficult. Robust imputation of rounded zeros in high-dimensional compositional data, together with recovery of the underlying structure, has therefore become a focus of current research. To this end, we propose SubLQR, a Lasso-quantile regression imputation method that is based on R-type clustering and combined with a modified EM algorithm. Compared with existing imputation methods for high-dimensional rounded zeros, SubLQR has two advantages: 1) Robustness: Lasso-quantile regression captures the entire conditional distribution of the response variable and yields a more realistic high-dimensional sparsity pattern; 2) Efficiency: imputing on the basis of R-type clustering reduces computational cost and improves imputation precision. Simulation results suggest that the proposed method outperforms existing methods, especially when the proportion of zeros and outliers is large. Finally, an application to a metabolomics study of a rare disease indicates the wide applicability of SubLQR.

Key words: High-dimensional Compositional Data, Rounded Zeros, Lasso-Quantile Regression, Modified EM Algorithm, Robustness
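
For readers unfamiliar with the Lasso-quantile regression named in the abstract, the following minimal sketch evaluates the standard L1-penalized quantile regression objective: the sum of check (pinball) losses plus a penalty on the absolute values of the coefficients. It is an illustration under stated assumptions, not the authors' SubLQR implementation. The function names, the toy data, and the omission of an intercept are all hypothetical choices made for brevity, and the abstract does not specify the exact objective used in the paper.

```python
# Illustrative sketch only: the textbook L1-penalized quantile regression
# objective. This is NOT the authors' SubLQR code; all names are hypothetical.


def check_loss(residual, tau):
    """Quantile (pinball) loss: rho_tau(u) = u * (tau - 1[u < 0])."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def lasso_quantile_objective(X, y, beta, tau, lam):
    """Sum of check losses plus lam times the L1 norm of beta.

    X is a list of rows, y a list of responses, beta a list of
    coefficients (no intercept, for brevity).
    """
    fit = 0.0
    for row, target in zip(X, y):
        prediction = sum(x_j * b_j for x_j, b_j in zip(row, beta))
        fit += check_loss(target - prediction, tau)
    penalty = lam * sum(abs(b_j) for b_j in beta)
    return fit + penalty


# Toy data (made up for illustration): three observations, two predictors.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 2.0]
sparse_beta = [0.0, 2.0]  # first coefficient is exactly zero
dense_beta = [1.0, 2.0]   # both coefficients are nonzero

# Both candidates have the same check-loss fit on this toy data, so the
# L1 penalty alone makes the sparse candidate score lower.
print(lasso_quantile_objective(X, y, sparse_beta, tau=0.5, lam=0.1))
print(lasso_quantile_objective(X, y, dense_beta, tau=0.5, lam=0.1))
```

With tau = 0.5 the check loss is half the absolute error, so minimizing the objective gives a penalized median regression. Other values of tau target other conditional quantiles, which is the sense in which quantile regression can describe the whole conditional distribution of the response. The L1 penalty favors coefficient vectors with exact zeros, which is the source of the sparsity pattern mentioned in the abstract.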