Research on Bi-level Variable Selection for Logistic Regression

Statistical Research


Research on Bi-level Variable Selection for Logistic Regression

Wang Xiaoyan et al.

Online: 2014-09-15 Published: 2014-10-14

Abstract

Abstract: Variable selection is an important step in statistical modeling: choosing suitable variables yields a robust model that is structurally simple and predicts accurately. Under the logistic regression framework, we propose a new penalized bi-level variable selection method, the adaptive Sparse Group Lasso (adSGL). Its distinguishing feature is that it screens variables according to the grouping structure of the predictors, so that selection takes place both between groups and within groups. Its advantage is that it applies different amounts of shrinkage to individual coefficients and to group coefficients, which avoids over-shrinking large coefficients and thereby improves estimation and prediction accuracy. The difficulty in solving the model is that the penalized likelihood function is not strictly convex, so we solve it by group (block) coordinate descent and establish a criterion for selecting the tuning parameters. Simulation studies show that, compared with three representative existing methods (Sparse Group Lasso, Group Lasso, and Lasso), adSGL not only improves bi-level selection accuracy but also reduces model error. Finally, we apply adSGL to a credit card credit scoring study, where it achieves higher classification accuracy and better robustness than logistic regression.

Key words: Variable Selection, Grouped Variables, Penalized Likelihood, Credit Scoring
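
The abstract describes a penalized logistic likelihood with separate, weighted shrinkage on individual coefficients and on groups, but this page does not give the exact formula. As a rough illustration only, the sketch below evaluates a generic sparse-group-lasso-style objective with adaptive-lasso-style weights. The function names, the weight formula `1 / |beta_init|^gamma`, and the omission of any group-size factor are all assumptions made for illustration, not the paper's definition of adSGL.

```python
import numpy as np

def logistic_nll(beta, X, y):
    """Average negative log-likelihood of logistic regression, with y in {0, 1}."""
    eta = X @ beta
    # log(1 + exp(eta)) evaluated stably with logaddexp
    return np.mean(np.logaddexp(0.0, eta) - y * eta)

def adaptive_weights(beta_init, groups, gamma=1.0, eps=1e-8):
    """Hypothetical adaptive weights: larger initial estimates get smaller penalties."""
    w_ind = 1.0 / (np.abs(beta_init) + eps) ** gamma
    w_grp = np.array(
        [1.0 / (np.linalg.norm(beta_init[idx]) + eps) ** gamma for idx in groups]
    )
    return w_ind, w_grp

def bilevel_penalty(beta, groups, lam_ind, lam_grp, w_ind, w_grp):
    """Weighted L1 term (within-group selection) plus weighted group L2 term
    (between-group selection). Assumed form, not taken from the paper."""
    l1_term = lam_ind * np.sum(w_ind * np.abs(beta))
    grp_term = lam_grp * sum(
        w_grp[g] * np.linalg.norm(beta[idx]) for g, idx in enumerate(groups)
    )
    return l1_term + grp_term

def penalized_objective(beta, X, y, groups, lam_ind, lam_grp, w_ind, w_grp):
    """Objective that a group (block) coordinate descent solver would minimize."""
    return logistic_nll(beta, X, y) + bilevel_penalty(
        beta, groups, lam_ind, lam_grp, w_ind, w_grp
    )
```

With all weights set to one, the penalty reduces to the ordinary Sparse Group Lasso penalty without group-size scaling. Setting `lam_ind = 0` leaves a Group Lasso penalty, and setting `lam_grp = 0` leaves a Lasso penalty. These are the three comparison methods named in the abstract.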

Wang Xiaoyan et al. Research on Bi-level Variable Selection for Logistic Regression[J]. Statistical Research, 2014, 31(9): 107-112.

