Statistical Research (统计研究) ›› 2022, Vol. 39 ›› Issue (7): 125-136. doi: 10.19343/j.cnki.11-1302/c.2022.07.010

High-dimensional Data Integrative Analysis and Subgroup Identification Incorporating Data Source Network Structure
(Original Chinese title: 考虑数据源网络结构的高维数据整合分析与子群识别研究)

Fang Kuangnan (方匡南), Zhang Qingwen (张晴雯), Lin Hongwei (林洪伟)

  • Online: 2022-07-25 Published: 2022-07-25

Abstract: In the era of big data, collected data are increasingly high-dimensional and come from a growing number of sources. For multi-source high-dimensional data, this paper proposes an integrative analysis method that takes the network structure among data sources into account. The k-nearest neighbor method is used to construct the network among data sources. A Network MCP penalty is imposed on the model coefficients of datasets that are connected in the network to automatically identify homogeneous and heterogeneous datasets, and a separate MCP penalty selects the important variables of each dataset. The method therefore performs model estimation, variable selection, and subgroup identification of data sources simultaneously. Simulation experiments show that the proposed method performs well in variable selection, parameter estimation, and classification prediction accuracy under different simulation settings. Finally, the method is applied to real estate rental evaluation data, where latitude and longitude information is used to construct the network among data sources. It identifies real estate sub-markets well and yields a more interpretable model.

Key words: Multi-source High-dimensional Data, Integrative Analysis, Network Structure, Subgroup Identification
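
The abstract names three ingredients: a k-nearest-neighbor network among data sources, a Network MCP penalty that links connected sources, and an MCP penalty for variable selection within each source. The Python sketch below is only an illustration of how these pieces could fit together, not the authors' implementation. The function names (`knn_edges`, `mcp`, `total_penalty`), the tuning parameters (`lam1`, `lam2`, `gamma`), the use of plain Euclidean distance on coordinates, and the choice to apply the network penalty to the Euclidean norm of the coefficient difference between connected sources are all assumptions made for this example; the abstract does not give the exact objective function.

```python
import numpy as np


def knn_edges(coords, k):
    """Undirected edges linking each data source to its k nearest neighbours.

    Assumption: plain Euclidean distance on the coordinates (a simplification
    for latitude/longitude data).
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a source is never its own neighbour
    edges = set()
    for i in range(len(coords)):
        for j in np.argsort(dist[i])[:k]:
            edges.add((min(i, int(j)), max(i, int(j))))
    return sorted(edges)


def mcp(t, lam, gamma=3.0):
    """Minimax concave penalty, evaluated elementwise at |t|."""
    t = np.abs(np.asarray(t, dtype=float))
    inner = lam * t - t ** 2 / (2.0 * gamma)  # used when |t| <= gamma * lam
    outer = 0.5 * gamma * lam ** 2            # constant beyond gamma * lam
    return np.where(t <= gamma * lam, inner, outer)


def total_penalty(betas, edges, lam1, lam2, gamma=3.0):
    """Sparsity MCP on every coefficient plus an MCP on the coefficient
    difference of each connected pair (hypothetical form of the network term).
    """
    betas = np.asarray(betas, dtype=float)  # shape: (n_sources, n_features)
    sparsity = mcp(betas, lam1, gamma).sum()
    fusion = sum(
        mcp(np.linalg.norm(betas[i] - betas[j]), lam2, gamma) for i, j in edges
    )
    return float(sparsity + fusion)


# Four hypothetical data sources located in two well-separated pairs.
coords = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
edges = knn_edges(coords, k=1)

# Sources that share coefficients incur no network penalty on their edge.
betas = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
penalty = total_penalty(betas, edges, lam1=1.0, lam2=1.0)
```

In this toy setting the network term is zero, because each connected pair has identical coefficients. That is the mechanism by which a fusion-type penalty would group homogeneous data sources into one subgroup. In a full estimator this penalty would be added to a loss function and minimized over `betas`, a step the sketch leaves out.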