Statistical Research (统计研究), 2021, Vol. 38, Issue (8): 146-160. doi: 10.19343/j.cnki.11-1302/c.2021.08.012


大数据背景下CPI预测问题的文本挖掘技术设计与应用

唐晓彬 董曼茹 徐荣   

  • 出版日期:2021-08-25 发布日期:2021-08-25

Design and Application of Text Mining Technology for CPI Prediction Based on Big Data

Tang Xiaobin, Dong Manru, Xu Rong

  • Online:2021-08-25 Published:2021-08-25

摘要: 本文创新地将半监督交互式关键词提取算法词频-逆向文件频率(Term Frequency-Inverse Document Frequency,TF-IDF)与基于Transformer的双向编码表征(Bidirectional Encoder Representation from Transformers,BERT)模型相结合,设计出一种扩展CPI预测种子关键词的文本挖掘技术。采用交互式TF-IDF算法,对原始CPI预测种子关键词汇广度上进行扩展,在此基础上通过BERT“两段式”检索过滤模型深入挖掘文本信息并匹配关键词,实现CPI预测关键词深度上的扩展,从而构建了CPI预测的关键词库。在此基础上,本文进一步对文本挖掘技术特征扩展前后的关键词建立预测模型进行对比分析。研究表明,相比于传统的关键词提取算法,交互式TF-IDF算法不仅无需借助语料库,而且还允许种子词的输入。同时,BERT模型通过迁移学习的方式对基础模型进行微调,学习特定领域知识,在CPI预测问题中很好地实现了语言表征、语义拓展与人机交互。相对于传统文本挖掘技术,本文设计的文本挖掘技术具有较强的泛化表征能力,在84个CPI预测关键种子词的基础上,扩充后的关键词对CPI具有更高的预测准确度和更充分的解释性。本文针对CPI预测问题设计的文本挖掘技术,也为建立其他宏观经济指标关键词词库提供新的研究思路与参考价值。

关键词: 关键词提取, CPI 预测, 文本挖掘技术, 交互式 TF-IDF 算法, BERT 模型

Abstract: This paper combines a semi-supervised, interactive form of the Term Frequency-Inverse Document Frequency (TF-IDF) keyword extraction algorithm with the Bidirectional Encoder Representations from Transformers (BERT) model to design a text mining technique that expands the seed keywords used for CPI prediction. The interactive TF-IDF algorithm first expands the original seed keywords in breadth. A BERT "two-stage" search-and-filter model then mines the text more deeply and matches keywords, expanding the keyword set in depth and yielding a keyword database for CPI prediction. Prediction models are then built on the keywords before and after this feature expansion and compared. The results show that, unlike traditional keyword extraction algorithms, the interactive TF-IDF algorithm does not require a corpus and accepts seed words as input. The BERT model, in turn, is fine-tuned from a base model through transfer learning to acquire domain-specific knowledge, and it supports language representation, semantic expansion and human-computer interaction in the CPI prediction task. Compared with traditional text mining techniques, the proposed technique shows stronger generalization and representation ability: starting from 84 seed keywords for CPI prediction, the expanded keyword set predicts CPI more accurately and offers fuller interpretability. The technique also provides new research ideas and a reference for building keyword databases for other macroeconomic indicators.

Key words: Keyword Extraction, CPI Prediction, Text Mining Technology, Interactive TF-IDF Algorithm, BERT Model
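
The breadth-expansion idea in the abstract, starting from seed words and ranking nearby terms by a TF-IDF weight, can be illustrated with a minimal sketch. This is not the authors' interactive TF-IDF algorithm, whose details the abstract does not give. It is plain TF-IDF over a small set of documents, restricted to documents that mention a seed word. The function names `tfidf` and `expand_seeds`, the toy documents and the `top_k` parameter are all assumptions made for illustration, and the documents are assumed to be tokenized already (Chinese text would first need word segmentation).

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


def expand_seeds(docs, seeds, top_k=3):
    """Rank non-seed terms by summed TF-IDF over documents mentioning a seed.

    Hypothetical helper: a stand-in for seed-guided keyword expansion,
    not the method described in the paper.
    """
    seeds = set(seeds)
    totals = Counter()
    for doc, doc_weights in zip(docs, tfidf(docs)):
        if seeds & set(doc):  # keep only documents that contain a seed word
            for term, weight in doc_weights.items():
                if term not in seeds:
                    totals[term] += weight
    return [term for term, _ in totals.most_common(top_k)]


# Toy, made-up documents; "price" plays the role of a CPI seed keyword.
docs = [
    "pork price rise pushes food price".split(),
    "pork supply falls as price climbs".split(),
    "football match ends in draw".split(),
]
candidates = expand_seeds(docs, seeds=["price"], top_k=3)
print(candidates)
```

Terms from the third document never become candidates because it contains no seed word. In an interactive setting, a person would review the returned candidates and feed the accepted ones back in as new seeds.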