Statistical Research (统计研究)

Business Register Database Revision: Internet Data Sources and Data Quality Assessment

Huang Hengjun et al.

  • Online: 2017-01-15 Published: 2017-02-09

Abstract: In the era of big data, making full use of Internet data resources can substantially strengthen official statistical capacity, but the quality of Internet data warrants scrutiny. Taking the business register database as the object of study, this paper discusses how to assess the quality of Internet data as a source for updating the register. It compares the data quality of Internet sources and traditional sources along multiple dimensions, and it proposes an accuracy-oriented quality assessment framework for Internet data sources, setting out the practices and key points of single-source quality assessment, multi-source integration assessment, and event-information-assisted assessment. The results show that Internet data sources can accomplish the task of updating the register in a timely manner and can help keep the update "true and accurate" and "free of duplicates and omissions", but they are not sufficient to generate a "unified and complete" register. The paper also presents a worked example of data quality assessment for updating a catering-industry register, integrating heterogeneous data from Dianping (大众点评网), Baidu Nuomi (百度糯米网), and a geographic information system.

Key words: Big Data, Business Register Database, Official Statistics, Data Quality