Statistical Research (统计研究), 2020, Vol. 37, Issue (9): 95-105. doi: 10.19343/j.cnki.11-1302/c.2020.09.009


Robust Binary Classification of High-dimensional Data

Shi Xingjie (史兴杰), Wang Saini (王赛旎), Li Yang (李扬)

  • Publication date: 2020-09-25  Release date: 2020-09-18

Abstract: Binary classification problems in empirical research often involve both high-dimensional covariates and outliers, so robust high-dimensional classification methods are especially important. This paper proposes a binary classification method based on a smooth 0-1 loss function with a Lasso penalty, and uses the Fabs algorithm to carry out variable selection and parameter estimation efficiently. Simulation results show that the method remains robust across different proportions of outliers. Using the CHIP 2013 data, the method is applied to study the factors that influence the high school enrollment decisions of migrant workers' children. The analysis finds that the parents' education level, the interaction between education level and family economic status, the child's gender, and the interaction between gender and ethnicity all have an important effect on the enrollment decision.

Key words: 0-1 Loss, Fabs Algorithm, Variable Selection, Robust Binary Classification
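
To make the idea of a Lasso-penalized smooth 0-1 loss concrete, here is a minimal sketch. It rests on assumptions that the abstract does not confirm: labels are coded as -1/+1, the 0-1 loss is smoothed with a sigmoid of the margin (the sharpness `k` is a hypothetical parameter), and the objective is minimized with plain proximal gradient descent rather than the Fabs algorithm used in the paper. All function names are illustrative.

```python
import numpy as np

def smooth01_loss(beta, X, y, k=1.0):
    """Mean sigmoid-smoothed 0-1 loss; y must be coded as -1/+1 (assumption)."""
    margins = y * (X @ beta)
    # sigmoid(-k*m) written with tanh for numerical stability; bounded in (0, 1),
    # so a single badly misclassified outlier contributes at most 1.
    return np.mean(0.5 * (1.0 - np.tanh(0.5 * k * margins)))

def smooth01_grad(beta, X, y, k=1.0):
    """Gradient of smooth01_loss with respect to beta."""
    margins = y * (X @ beta)
    s = 0.5 * (1.0 - np.tanh(0.5 * k * margins))
    weights = -k * s * (1.0 - s) * y
    return X.T @ weights / len(y)

def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1 (the Lasso penalty)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_proximal_gradient(X, y, lam=0.02, k=1.0, step=1.0, n_iter=500):
    """Proximal gradient descent on smooth01_loss + lam * ||beta||_1.

    This is a stand-in solver, not Fabs. The objective is non-convex,
    so the result is a local solution that depends on the zero start.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = smooth01_grad(beta, X, y, k)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```

Because the smoothed loss is bounded, mislabeled or extreme observations cannot dominate the fit the way they can under unbounded losses such as the logistic or hinge loss, which is the usual motivation for this kind of robustness. The L1 penalty sets small coefficients exactly to zero, which gives variable selection.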