统计研究 (Statistical Research) ›› 2021, Vol. 38 ›› Issue (4): 131-144. doi: 10.19343/j.cnki.11-1302/c.2021.04.010

大数据背景下基于社交网络的聚类随机游走抽样算法研究

贺建风 李宏煜   

  • 出版日期:2021-04-25 发布日期:2021-04-25

The Study on Clustering Random Walk Sampling Algorithm Based on Social Network in the Context of Big Data

He Jianfeng Li Hongyu   

  • Online:2021-04-25 Published:2021-04-25

摘要: 数字经济时代,社交网络作为数字化平台经济的重要载体,受到了国内外学者的广泛关注。大数据背景下,社交网络的商业应用价值巨大,但由于其网络规模空前庞大,传统的网络分析方法因计算成本过高而不再适用。而通过网络抽样算法获取样本网络,再推断整体网络,可节约计算资源,因此抽样算法的好坏将直接影响社交网络分析结论的准确性。现有社交网络抽样算法存在忽略网络内部拓扑结构、容易陷入局部网络、抽样效率过低等缺陷。为了弥补现有社交网络抽样算法的缺陷,本文结合大数据社交网络的社区特征,提出了一种聚类随机游走抽样算法。该方法首先使用社区聚类算法将原始网络节点进行社区划分,得到多个社区网络,然后分别对每个社区进行随机游走抽样获取样本网络。数值模拟和案例应用的结果均表明,聚类随机游走抽样算法克服了传统网络抽样算法的缺点,能够在降低网络规模的同时较好地保留原始网络的结构特征。此外,该抽样算法还可以并行运算,有效提升抽样效率,对于大数据背景下大规模社交网络的抽样实践具有重大现实意义。

关键词: 大数据, 社交网络, 社区聚类, 随机游走抽样

Abstract: In the era of the digital economy, social networks, as an important carrier of the digital platform economy, have attracted wide attention from scholars at home and abroad. In the context of big data, social networks have great commercial value, but their unprecedented scale makes traditional network analysis methods too computationally expensive to apply. Drawing a sample network with a network sampling algorithm and then inferring the properties of the whole network saves computing resources, so the quality of the sampling algorithm directly affects the accuracy of the conclusions of social network analysis. Existing social network sampling algorithms, however, ignore the internal topology of the network, are easily trapped in a local part of the network, and have low sampling efficiency. To remedy these defects, we propose a clustering random walk sampling algorithm that exploits the community structure of large-scale social networks. The method first applies a community clustering algorithm to partition the nodes of the original network into communities, yielding multiple community networks, and then performs random walk sampling within each community to obtain the sample network. The results of both numerical simulation and a case application show that the proposed algorithm overcomes the shortcomings of traditional network sampling algorithms and retains the structural features of the original network well while reducing the network size. In addition, the algorithm can be run in parallel, which improves sampling efficiency and is of great practical significance for sampling large-scale social networks in the context of big data.
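The two-stage procedure described in the abstract (partition the network into communities, then random-walk within each community) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not name the community clustering algorithm, so asynchronous label propagation is used here purely as a stand-in. The proportional allocation of the sampling rate to each community, the restart rule when a walk gets stuck, and all function names (`label_propagation`, `random_walk_sample`, `clustering_random_walk_sample`, `two_cliques`) are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def label_propagation(adj, rng, max_rounds=100):
    """Stand-in community detection (assumption): asynchronous label propagation."""
    labels = {v: v for v in adj}
    nodes = list(adj)
    for _ in range(max_rounds):
        rng.shuffle(nodes)
        changed = False
        for v in nodes:
            if not adj[v]:
                continue  # isolated node keeps its own label
            counts = Counter(labels[u] for u in adj[v])
            top = max(counts.values())
            best = sorted(l for l, c in counts.items() if c == top)
            if labels[v] not in best:
                labels[v] = rng.choice(best)
                changed = True
        if not changed:
            break
    communities = defaultdict(set)
    for v, l in labels.items():
        communities[l].add(v)
    return list(communities.values())


def random_walk_sample(adj, community, n_target, rng, max_steps=10_000):
    """Simple random walk restricted to one community; returns the visited nodes."""
    members = sorted(community)
    current = rng.choice(members)
    visited = {current}
    for _ in range(max_steps):
        if len(visited) >= n_target:
            break
        nbrs = sorted(u for u in adj[current] if u in community)
        # Assumption: jump to a random member if the walk has no way forward.
        current = rng.choice(nbrs) if nbrs else rng.choice(members)
        visited.add(current)
    return visited


def clustering_random_walk_sample(adj, rate, seed=0):
    """Sample each community at roughly `rate`, then return the induced subgraph."""
    rng = random.Random(seed)
    sampled = set()
    for community in label_propagation(adj, rng):
        # Assumption: proportional allocation, at least one node per community.
        n_target = max(1, round(rate * len(community)))
        sampled |= random_walk_sample(adj, community, n_target, rng)
    return {v: adj[v] & sampled for v in sampled}


def two_cliques():
    """Toy network: two 5-node cliques joined by a single bridge edge (4, 5)."""
    adj = defaultdict(set)
    for block in (range(0, 5), range(5, 10)):
        for a in block:
            for b in block:
                if a != b:
                    adj[a].add(b)
    adj[4].add(5)
    adj[5].add(4)
    return dict(adj)


g = two_cliques()
sample = clustering_random_walk_sample(g, rate=0.6)
print(sorted(sample))
```

Because each community is walked independently, the loop over communities in `clustering_random_walk_sample` is the natural place to parallelize, which is the efficiency gain the abstract refers to. The sketch runs the walks sequentially to stay short and reproducible.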

Key words: Big Data, Social Network, Community Clustering, Random Walk Sampling