A Study on Identification of Commonality and Difference among High-dimensional Data Based on a Varying-coefficient Model (基于变系数模型的高维数据异同性识别方法研究)

Statistical Research (统计研究), 2021, Vol. 38, Issue 5: 136-146. doi: 10.19343/j.cnki.11-1302/c.2021.05.011

Sun Yifan (孙怡帆), Wang Caijing (王彩晶), Luo Ziye (罗梓烨)

Online: 2021-05-25 Published: 2021-05-25

Abstract

With the development of information technology, high-dimensional data have become increasingly abundant. In practice, many high-dimensional datasets are formed by merging multiple datasets drawn from heterogeneous sources or subjects. Accurately identifying the commonality and difference among such datasets has therefore become one of the goals of big data analysis. This paper proposes an integrative analysis method for high-dimensional data based on a varying-coefficient model. The method performs variable selection and coefficient estimation on multiple datasets simultaneously, and automatically identifies whether the coefficient of each variable is shared or differs across datasets. Simulations show that the proposed method clearly outperforms the comparison methods in identifying commonality and difference, variable selection, coefficient estimation, and prediction. In an application to identifying pathogenic genes for lung cancer, the method identifies pathogenic genes that have a biological interpretation and reveals the commonality and difference between two subtypes.

Key words: high-dimensional data, commonality and difference, varying-coefficient model, integrative analysis

Cite this article: Sun Yifan et al. A Study on Identification of Commonality and Difference among High-dimensional Data Based on a Varying-coefficient Model[J]. Statistical Research, 2021, 38(5): 136-146.

