Statistical Inference Problems of Non-probability Sampling under the Background of Big Data

Statistical Research (统计研究)

Jin Yongjin, Liu Zhan

Online: 2016-03-15  Published: 2016-03-21

Abstract

When samples are drawn from big data, a sampling frame is often difficult to construct, so the resulting sample is a non-probability sample and traditional sampling inference theory is hard to apply to it. How to carry out statistical inference from non-probability samples is therefore a serious challenge for survey sampling in the big data setting. This paper proposes basic approaches to the problem. First, sampling methods: sample selection based on sample matching, link-tracing sampling, and similar methods can make the non-probability sample approximate a probability sample, so that the inference theory for probability samples can be applied. Second, weight construction and adjustment: methods based on pseudo-design, models, and propensity scores can yield base weights similar to those of a probability sample. Third, estimation: hybrid probability estimation based on pseudo-design, model-based, and Bayesian approaches can be considered. Finally, sample selection based on sample matching is taken as an example to discuss a concrete solution.

Key words: Big Data, Non-probability Sampling, Statistical Inference
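
The sample-matching idea named in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure, which the abstract does not spell out. It assumes a small target sample with known design weights and auxiliary variables, plus a larger non-probability pool that also records the study variable. Each target unit is matched to its nearest unused pool unit, and the matched units inherit the target weights. The function names `match_sample` and `weighted_mean` and the fields `age` and `y` are hypothetical, chosen only for illustration.

```python
import math


def match_sample(target, pool, keys):
    """Greedy nearest-neighbour matching without replacement (illustrative only).

    target: list of dicts, units of a probability sample (auxiliary variables)
    pool:   list of dicts, units of the non-probability source
    keys:   names of the auxiliary variables used for matching
    Returns one pool index per target unit, in target order.
    """
    if len(pool) < len(target):
        raise ValueError("pool must be at least as large as the target sample")

    # Scale each variable by its standard deviation in the pool
    # (falls back to 1.0 when the variable is constant).
    scale = {}
    for k in keys:
        vals = [u[k] for u in pool]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        scale[k] = math.sqrt(var) or 1.0

    def dist(a, b):
        return math.sqrt(sum(((a[k] - b[k]) / scale[k]) ** 2 for k in keys))

    available = set(range(len(pool)))
    matches = []
    for t in target:
        # Ties are broken by the smaller pool index so the result is deterministic.
        best = min(available, key=lambda i: (dist(t, pool[i]), i))
        matches.append(best)
        available.remove(best)
    return matches


def weighted_mean(values, weights):
    """Weighted (Hajek-type) mean: sum(w * y) / sum(w)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical toy data: two target units with design weights,
# and four pool units that also carry the study variable y.
target = [{"age": 30, "w": 1.0}, {"age": 60, "w": 3.0}]
pool = [
    {"age": 20, "y": 1.0},
    {"age": 31, "y": 2.0},
    {"age": 58, "y": 3.0},
    {"age": 90, "y": 4.0},
]

idx = match_sample(target, pool, keys=["age"])
estimate = weighted_mean([pool[i]["y"] for i in idx], [t["w"] for t in target])
print(idx, estimate)
```

Greedy matching depends on the order of the target units, and a practical application would also need to assess how close the matches are. The sketch only shows how matched pool units can stand in for target units and inherit their weights.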

Jin Yongjin, Liu Zhan. Statistical Inference Problems of Non-probability Sampling under the Background of Big Data[J]. Statistical Research, 2016, 33(3): 11-17.
