Statistical Research (统计研究) ›› 2020, Vol. 37 ›› Issue (10): 104-114. doi: 10.19343/j.cnki.11-1302/c.2020.10.009

异质性大数据的分布式估计

郭婧璇 徐慧超 祝婉晴 田茂再   

  • Publication date: 2020-10-25 Release date: 2020-10-21

Distributed Estimation for Heterogeneous Big Data

Guo Jingxuan Xu Huichao Zhu Wanqing Tian Maozai   

  • Online: 2020-10-25 Published: 2020-10-21

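摘要 (Abstract): With advances in Internet of Things technology, big data poses enormous challenges to network bandwidth and computer storage capacity, making traditional centralized data processing hard to carry out and thereby spurring the development of distributed statistical learning. In research on non-iterative algorithms, Zhang et al. (2013) proved that when the number of data subsets satisfies $s = O(\sqrt{N})$, the divide-and-conquer (DC) simple average estimator based on local empirical risk minimization attains a mean squared error convergence rate of $O(N^{-1})$; Huang and Huo (2019) further proposed a distributed one-step estimator in the M-estimation framework. Neither method, however, accounts for how the heterogeneity that may be present in massive data affects divide-and-conquer estimation. This paper proposes a divide-and-conquer one-step weighted estimator for massive heterogeneous data in the linear model framework, proves the asymptotic properties of the estimator, and considers the problem of testing for heterogeneity. Applying the proposed method to real medical insurance data from the United States shows that it fits the linear trend of the data better and significantly improves computational efficiency.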

关键词 (Keywords): divide-and-conquer strategy, one-step estimation, massive data, heterogeneity, medical insurance

Abstract: With the rapid development of IoT technology, big data poses great challenges to network bandwidth and computer storage capacity, making traditional centralized data processing difficult to carry out. Distributed computing emerged against this background, and its core idea, known in statistics as divide-and-conquer (DC), is attracting growing attention from statisticians. Zhang et al. (2013) showed that the simple average of local empirical risk minimizers attains a mean squared error rate of $O(N^{-1})$ when the number of data subsets satisfies $s = O(\sqrt{N})$. Building on this, Huang and Huo (2019) proposed a distributed one-step estimator for M-estimation based on a Newton-Raphson iteration. However, neither method considers how heterogeneity in big data affects the estimation results. This paper proposes a distributed one-step weighted estimator for heterogeneous big data in the linear model framework, proves its asymptotic properties, and uses them to test for heterogeneity in big data. The proposed method is then applied to real medical insurance data from the United States. The results show that, compared with simple average estimation, the proposed method better fits the linear trend of the data and significantly improves computational efficiency.

Key words: Divide-and-conquer, One-step Estimator, Big Data, Heterogeneity, Medical Insurance
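
The two baseline estimators named in the abstract, the DC simple average and the distributed one-step estimator, can be sketched for a simple linear regression as follows. This is only an illustration under stated assumptions and not the paper's weighted estimator: the simulated homogeneous data, the equal block sizes, the true coefficients (1, 2), and all function names (`simple_average`, `one_step`, and so on) are hypothetical choices made here.

```python
import random


def solve2(A, b):
    """Solve a 2x2 linear system A z = b by Cramer's rule."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return [(b[0] * a22 - a12 * b[1]) / det,
            (a11 * b[1] - a21 * b[0]) / det]


def gram_and_moment(xs, ys):
    """Return X'X/n and X'y/n for a design matrix with rows (1, x)."""
    n = len(xs)
    sx = sum(xs) / n
    sxx = sum(x * x for x in xs) / n
    sy = sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) / n
    return [[1.0, sx], [sx, sxx]], [sy, sxy]


def local_ols(xs, ys):
    """Least-squares (intercept, slope) on one block of data."""
    H, m = gram_and_moment(xs, ys)
    return solve2(H, m)


def simple_average(blocks):
    """DC simple average: mean of the local least-squares estimates."""
    fits = [local_ols(xs, ys) for xs, ys in blocks]
    return [sum(f[j] for f in fits) / len(fits) for j in range(2)]


def one_step(blocks):
    """One Newton-Raphson update from the simple average, using the
    averaged local gradients and Hessians of the squared-error loss."""
    theta = simple_average(blocks)
    s = len(blocks)
    H_bar = [[0.0, 0.0], [0.0, 0.0]]
    g_bar = [0.0, 0.0]
    for xs, ys in blocks:
        H, m = gram_and_moment(xs, ys)
        for i in range(2):
            # Local gradient of half the mean squared residual: H theta - m.
            g_bar[i] += (H[i][0] * theta[0] + H[i][1] * theta[1] - m[i]) / s
            for j in range(2):
                H_bar[i][j] += H[i][j] / s
    step = solve2(H_bar, g_bar)
    return [theta[j] - step[j] for j in range(2)]


# Assumed toy data: s = 20 equal blocks of n = 100 draws from y = 1 + 2x + noise.
rng = random.Random(0)
s, n = 20, 100
blocks = []
for _ in range(s):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    blocks.append((xs, ys))

all_x = [x for xs, _ in blocks for x in xs]
all_y = [y for _, ys in blocks for y in ys]

avg = simple_average(blocks)      # DC simple average
one = one_step(blocks)            # distributed one-step estimate
pooled = local_ols(all_x, all_y)  # centralized fit, for comparison only

print("simple average:", avg)
print("one-step:      ", one)
print("pooled OLS:    ", pooled)
```

Because the squared-error loss is quadratic and the blocks here have equal size, the averaged local Hessians equal the pooled Hessian, so the one-step update reproduces the pooled least-squares fit up to floating-point error. For general M-estimators the agreement is only asymptotic.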