统计研究 (Statistical Research), 2018, Vol. 35, Issue 7: 125-128. doi: 10.19343/j.cnki.11-1302/c.2018.07.011


Big Data Algorithms Incorporating Statistical Thinking

李扬 (Li Yang) et al.

  • Publication date: 2018-07-25   Release date: 2018-07-10

Statistical Algorithms for Big Data

Li Yang et al.   

  • Online: 2018-07-25   Published: 2018-07-10

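Abstract: Massive data volume, the first defining characteristic of big data, poses the primary computational challenge. A large sample cannot necessarily stand in for the whole population, so algorithm design for big data analysis must consider not only how to reduce computational cost but also how to characterize the uncertainty of the estimates. Taking the Bag of Little Bootstraps (分治自助算法) and the Subsampled Double Bootstrap (子集双重自助算法) as examples, this paper discusses the design of parallelizable statistical algorithms for big data that both improve computational efficiency and assess uncertainty, and uses a comparative analysis to explore the underlying design ideas and directions for future research.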

Key words: bootstrap, uncertainty, large-scale data, parallel computing

Abstract: Sheer data volume is a key feature of Big Data and poses the main computational challenge. A dataset with a large sample size cannot always stand in for the population, so algorithm design for Big Data should consider both how to reduce computing cost and how to characterize the uncertainty of the estimated results. In this paper, we study the design of parallelizable statistical algorithms for massive datasets that address both computing efficiency and uncertainty assessment. The Bag of Little Bootstraps (BLB) and the Subsampled Double Bootstrap (SDB) are discussed as illustrative examples. We then compare BLB and SDB and outline directions for future work.

Key words: Bootstrap, Uncertainty, Massive Data, Parallel Computing
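
Illustrative sketch: the abstract names BLB and SDB without giving their steps, so the following is a minimal Python sketch of how the two procedures are commonly described, not code taken from the paper. It assumes the estimator is the sample mean and the quantity of interest is its standard error. The function names and the parameters `gamma`, `s`, `r`, `iterations`, and `seed` are hypothetical choices made for this illustration.

```python
import math
import random
from collections import Counter


def multinomial_counts(rng, b, n):
    """Draw n items uniformly from b slots and return the count per slot."""
    return Counter(rng.choices(range(b), k=n))


def weighted_mean(values, counts):
    """Mean of a size-n resample stored as {index: multiplicity}."""
    total = sum(counts.values())
    return sum(values[i] * c for i, c in counts.items()) / total


def blb_standard_error(data, gamma=0.7, s=5, r=50, seed=0):
    """BLB-style estimate of the standard error of the sample mean.

    Assumed setup: s subsets of size b = n**gamma, r resamples of size n
    per subset, then an average of the per-subset standard errors.
    """
    rng = random.Random(seed)
    n = len(data)
    b = max(2, int(n ** gamma))
    per_subset_se = []
    for _ in range(s):  # subsets are independent, so this loop can run in parallel
        subset = rng.sample(data, b)  # small subset drawn without replacement
        estimates = [
            weighted_mean(subset, multinomial_counts(rng, b, n))
            for _ in range(r)
        ]
        center = sum(estimates) / r
        variance = sum((e - center) ** 2 for e in estimates) / (r - 1)
        per_subset_se.append(math.sqrt(variance))
    return sum(per_subset_se) / s


def sdb_standard_error(data, gamma=0.7, iterations=250, seed=0):
    """SDB-style estimate: many random subsets, one size-n resample each.

    The spread of (resample estimate - subset estimate) across subsets
    stands in for the sampling variability of the full-data estimate.
    """
    rng = random.Random(seed)
    n = len(data)
    b = max(2, int(n ** gamma))
    squared_diffs = []
    for _ in range(iterations):  # iterations are independent as well
        subset = rng.sample(data, b)
        subset_estimate = sum(subset) / b
        resample_estimate = weighted_mean(subset, multinomial_counts(rng, b, n))
        squared_diffs.append((resample_estimate - subset_estimate) ** 2)
    return math.sqrt(sum(squared_diffs) / iterations)
```

In both functions each resample is stored as multiplicities over only b distinct points, so no size-n copy of the data is ever materialized, and the outer loops have no shared state, which is what makes this style of algorithm suited to parallel computing. For independent data with standard deviation sigma, both estimates should be close to sigma / sqrt(n).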