统计研究 (Statistical Research)


Logistic模型对非平衡数据的敏感性:测度、修正与比较

魏瑾瑞 吕晓云   

  • 出版日期:2016-02-15 发布日期:2016-03-02

The Sensitivity of the Logistic Model to Unbalanced Data: Measurement, Correction and Comparison

Wei Jinrui & Lv Xiaoyun   

  • Online: 2016-02-15 Published: 2016-03-02

摘要: 以UCI数据库为研究样本,分析logistic模型对不同程度非平衡数据的敏感性。研究表明,(1)数据非平衡程度越高,logistic回归对稀有类的识别能力越差。(2)相对于其他修正方法,OSS方法的改进效果不显著且不稳定;相对于复杂抽样,简单抽样修正结果更优。(3)AUC值不适宜于非平衡数据条件下的模型选择,因为在非平衡数据条件下,它既不能有效地区分四种修正方法之优劣,而且修正前后的差异亦不能辨。

关键词: Logistic模型, 非平衡数据, ROC曲线, AUC值, 平衡化的五折交叉验证

Abstract: Using data sets from the UCI database, this paper analyzes the sensitivity of the logistic model to data with different degrees of imbalance. The results show that: (1) the more unbalanced the data, the poorer the ability of logistic regression to identify the rare class; (2) compared with the other correction methods, the improvement from the OSS method is neither significant nor stable, and simple sampling gives better corrections than complex sampling; (3) AUC is not suitable for model selection on unbalanced data, because under such conditions it can neither distinguish effectively among the four correction methods nor reveal the difference between results before and after correction.

Key words: Logistic Model, Unbalanced Data, ROC Curve, AUC, Balanced 5-fold Cross-Validation
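
The two quantities the abstract contrasts, rare-class identification and AUC, can be illustrated with a minimal Python sketch. It is not the authors' code: the helper names `recall` and `auc`, the toy labels and scores, and the 0.5 and 0.25 cut-offs are all hypothetical choices made for illustration. The sketch only shows that AUC depends on the ranking of the scores, so it can stay unchanged while recall on the rare class varies with the classification threshold. This is one possible reading of why AUC may fail to separate corrected from uncorrected models; the abstract itself does not state the mechanism.

```python
def recall(y_true, y_pred, positive=1):
    """Share of true rare-class cases that were predicted as rare."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else float("nan")


def auc(y_true, scores, positive=1):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (rank-based definition)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 2 rare cases (label 1) among 10 observations.
# The scores stand in for predicted probabilities from some fitted model.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.40, 0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.02, 0.01]

# Default 0.5 cut-off: no observation is classified as rare.
pred_default = [1 if s >= 0.5 else 0 for s in scores]
# Lower, illustrative cut-off: both rare cases are recovered.
pred_lowered = [1 if s >= 0.25 else 0 for s in scores]

recall_default = recall(y_true, pred_default)
recall_lowered = recall(y_true, pred_lowered)
area = auc(y_true, scores)  # unaffected by the choice of cut-off

print(recall_default, recall_lowered, area)
```

In this toy case the ranking is perfect, so the AUC is the same under both cut-offs, while rare-class recall moves from none of the rare cases to all of them. A threshold-dependent measure such as recall therefore registers a change that AUC does not.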