Statistical Research (统计研究) ›› 2021, Vol. 38 ›› Issue (8): 132-145. doi: 10.19343/j.cnki.11-1302/c.2021.08.011

Multi-class Vertical Integrative Analysis of Multi-source High-dimensional Data and Its Application

Wu Mengyun, Jiang Haoyu, Feng Shiqian

  • Publication date: 2021-08-25 Release date: 2021-08-25

Vertical Integrative Analysis of Multi-dimensional Data for Multiple Categorical Response and Its Application

Wu Mengyun Jiang Haoyu Feng Shiqian   

  • Online: 2021-08-25 Published: 2021-08-25

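Abstract: Multi-class data analysis is important in empirical research. However, because of high dimensionality, small sample sizes, and low signal-to-noise ratios, existing multi-class methods still perform poorly for lack of information. To address this, researchers collect data from more information sources to characterize the problem at hand more fully. Rather than collecting samples from different sources on the same predictors, the currently popular form of multi-source data collects predictors from different sources on the same samples, and the independence and correlation among these sources pose new challenges for statistical modeling. This paper proposes a multi-class vertical integrative analysis method based on canonical variate regression. The method uses penalization to perform variable selection, explicitly accounts for the association structure across data sources, and is optimized with an efficient ADMM algorithm. Numerical simulations show that the method performs better in both variable selection and classification prediction. Using multi-source stock data on the SSE 50 in China, the method is applied to an empirical study of the factors influencing daily stock returns in 2019. The results show that the proposed multi-class integrative analysis selects interpretable variables while also achieving better prediction.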

Keywords: vertical integrative analysis, multi-class data, variable selection, canonical variate regression

Abstract: Multiple categorical data analysis plays an important role in empirical research. However, because of high dimensionality, small sample sizes, low signal-to-noise ratios, and other factors, most existing analysis methods suffer from a lack of information and yield unsatisfactory results. To tackle this problem, multi-source data have been collected extensively to give a more comprehensive understanding of the underlying processes. Unlike studies that collect multiple independent datasets on the same predictors, a prominent recent trend is to conduct multi-dimensional studies that collect different predictors from multiple dimensions on the same samples. Such multi-dimensional data usually contain both independent and overlapping information, which poses new challenges for existing statistical models. In this study, we propose a novel vertical integrative analysis approach for multiple categorical data that is based on canonical variate regression and analyzes the multi-dimensional data jointly. Penalization techniques are adopted for variable selection and, more importantly, for accommodating the overlapping information across dimensions. An efficient ADMM (alternating direction method of multipliers) algorithm is developed for optimization. Simulation results demonstrate the superior performance of the proposed approach in variable selection and categorical prediction. An empirical analysis is conducted on multi-dimensional data from the Shanghai Stock Exchange 50 index to explore the factors associated with daily stock returns in 2019. The proposed integrative analysis leads to sensible variable selection with satisfactory prediction accuracy.

Key words: Vertical Integrative Analysis, Multiple Categorical Data, Variable Selection, Canonical Variate Regression
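
The abstract pairs penalized variable selection with an ADMM solver but does not spell out the objective. As a rough illustration of how such a combination works in general, the sketch below applies ADMM to a plain lasso regression, which is a stand-in and not the canonical variate regression model proposed in the paper. The function names `soft_threshold` and `admm_lasso`, the penalty level `lam`, and the step parameter `rho` are all assumptions made for this example.

```python
import numpy as np


def soft_threshold(v, kappa):
    """Elementwise soft-thresholding: the proximal operator of kappa * |.|_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)


def admm_lasso(X, y, lam, rho=1.0, n_iter=500, tol=1e-8):
    """Minimize 0.5 * ||X b - y||^2 + lam * ||z||_1 subject to b = z.

    Generic lasso stand-in (NOT the paper's model). Returns z, the sparse
    copy of the coefficients, so unselected variables are exactly zero.
    """
    p = X.shape[1]
    A = X.T @ X + rho * np.eye(p)   # system matrix of the b-update
    Xty = X.T @ y
    z = np.zeros(p)                 # sparse copy of the coefficients
    u = np.zeros(p)                 # scaled dual variable
    for _ in range(n_iter):
        # b-update: ridge-like least squares pulled toward z - u
        b = np.linalg.solve(A, Xty + rho * (z - u))
        # z-update: soft-thresholding performs the variable selection
        z_old = z
        z = soft_threshold(b + u, lam / rho)
        # dual update: accumulate the constraint violation b - z
        u = u + b - z
        # stop when primal and dual residuals are both small
        if np.linalg.norm(b - z) < tol and rho * np.linalg.norm(z - z_old) < tol:
            break
    return z


# Illustrative usage on synthetic data (all values are made up for the demo).
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 10))
beta_true = np.zeros(10)
beta_true[:2] = [3.0, -2.0]         # only the first two predictors matter
y_demo = X_demo @ beta_true + 0.1 * rng.standard_normal(200)
z_hat = admm_lasso(X_demo, y_demo, lam=20.0, rho=200.0)
selected = np.flatnonzero(z_hat)    # indices of the selected predictors
```

Choosing `rho` on the same scale as the eigenvalues of `X.T @ X` (here roughly the sample size) tends to speed up convergence. A method like the one described in the abstract would replace the squared-error loss and the single l1 penalty with its own loss and cross-source penalties, but the split-then-threshold structure of the iteration would be similar.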