统计研究 (Statistical Research), 2021, Vol. 38, Issue (5): 136-146. doi: 10.19343/j.cnki.11-1302/c.2021.05.011


基于变系数模型的高维数据异同性识别方法研究

孙怡帆 王彩晶 罗梓烨   

  • 出版日期:2021-05-25 发布日期:2021-05-25

A Study on Identification of Commonality and Difference among High-dimensional Data Based on a Varying-coefficient Model

Sun Yifan Wang Caijing Luo Ziye   

  • Online:2021-05-25 Published:2021-05-25

摘要: 随着信息技术的发展,高维数据日益丰富。现实中,很多高维数据由多个主体各异的数据集融合而成。如何准确识别出高维数据集间的异同性成为大数据分析的目标之一。本文提出了变系数模型下的高维数据整合分析方法。该方法可以同时对多个数据集进行变量选择和系数估计,并且能够自动识别出变量系数在数据集间的异同性。模拟结果表明本文方法在异同性识别、变量选择、系数估计和预测等方面明显优于对比方法。在肺癌致病基因识别的应用研究中,本文方法能够识别出具有生物解释的致病基因并发现了两种亚型之间的异同性。

关键词: 高维数据, 异同性, 变系数模型, 整合分析

Abstract: With the development of information technology, high-dimensional data have become increasingly abundant. In practice, many high-dimensional datasets are formed by merging multiple datasets drawn from heterogeneous subjects. Accurately identifying the commonality and difference among such datasets has become one of the goals of big data analysis. This paper proposes an integrative analysis method for high-dimensional data under a varying-coefficient model. The method simultaneously performs variable selection and coefficient estimation across multiple datasets, and it automatically identifies which variable coefficients are shared and which differ among the datasets. Simulation results show that the proposed method clearly outperforms the comparison methods in identifying commonality and difference, variable selection, coefficient estimation, and prediction. In an application to identifying pathogenic genes for lung cancer, the method identifies biologically interpretable pathogenic genes and reveals the commonality and difference between two subtypes.

Key words: High-dimensional Data, Commonality and Difference, Varying-coefficient Model, Integrative Analysis