High-dimensional Data Integrative Analysis and Subgroup Identification Incorporating Data Source Network Structure

Statistical Research (统计研究), 2022, Vol. 39, Issue 7: 125-136. doi: 10.19343/j.cnki.11-1302/c.2022.07.010

Fang Kuangnan, Zhang Qingwen, Lin Hongwei

Online: 2022-07-25  Published: 2022-07-25


Abstract

In the era of big data, collected data have ever higher dimensions and come from ever more sources. For multi-source high-dimensional data, this paper proposes an integrative analysis method that accounts for the network structure among data sources. The K-nearest neighbor method is used to construct the network between data sources. A Network MCP penalty is imposed on the model coefficients of datasets that are connected in the network, so that homogeneous and heterogeneous datasets are identified automatically, and a separate MCP penalty selects the important variables of each dataset. The method thus performs model estimation, variable selection, and subgroup identification of data sources simultaneously. Simulation experiments show that the proposed method performs well in variable selection, parameter estimation, and classification prediction accuracy under different simulation settings. Finally, the method is applied to real estate lease evaluation data, where latitude and longitude information is used to construct the network between data sources; it identifies real estate sub-markets well and yields a more interpretable model.

Key words: Multi-source High-dimensional Data, Integrative Analysis, Network Structure, Subgroup Identification
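
The abstract names the ingredients of the method (a K-nearest neighbor network over data sources, an MCP penalty on each dataset's coefficients, and a Network MCP penalty linking connected datasets) but does not state the objective function. The following is therefore only an illustrative sketch of how such a penalty term might be assembled, not the authors' formulation. It assumes that MCP is the standard minimax concave penalty, that neighbors are found by Euclidean distance on the coordinates, and that the network penalty acts on the L2 norm of the coefficient difference between connected datasets. All function names and the default `gamma=3.0` are hypothetical.

```python
import math


def mcp(t, lam, gamma=3.0):
    """Standard minimax concave penalty (MCP) evaluated at a scalar t."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return gamma * lam * lam / 2.0


def knn_edges(coords, k):
    """Undirected k-nearest-neighbor graph over data sources.

    coords holds one coordinate tuple per data source. Euclidean
    distance is an assumption of this sketch; i and j are linked
    when either one is among the other's k nearest neighbors.
    """
    edges = set()
    for i, ci in enumerate(coords):
        others = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.dist(ci, coords[j]),
        )
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


def total_penalty(betas, edges, lam1, lam2, gamma=3.0):
    """Sparsity penalty plus network fusion penalty (assumed form).

    betas holds one coefficient vector per data source.
    """
    # MCP on every coefficient: drives unimportant variables to zero.
    sparsity = sum(mcp(b, lam1, gamma) for beta in betas for b in beta)
    # MCP on the coefficient difference of each connected pair:
    # a zero difference places the two data sources in one subgroup.
    fusion = sum(
        mcp(math.dist(betas[i], betas[j]), lam2, gamma) for i, j in edges
    )
    return sparsity + fusion


# Four hypothetical data sources forming two spatial clusters.
coords = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
edges = knn_edges(coords, k=1)
betas = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
print(edges)
print(total_penalty(betas, edges, lam1=1.0, lam2=1.0))
```

In this toy setting the connected pairs share identical coefficients, so the fusion term contributes nothing and only the sparsity term remains. In an actual fit, this penalty would be added to the sum of the per-dataset loss functions and minimized over all coefficient vectors jointly.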

Fang Kuangnan et al. High-dimensional Data Integrative Analysis and Subgroup Identification Incorporating Data Source Network Structure[J]. Statistical Research, 2022, 39(7): 125-136.

