Statistical Research (统计研究), 2019, Vol. 36, Issue (3): 124-128. doi: 10.19343/j.cnki.11-1302/c.2019.03.011

Community Detection in High-Dimensional Big-Data Gene Networks: The NC Method as an Example

Sun Yifan (孙怡帆) et al.

  • Publication date: 2019-03-25    Release date: 2019-03-27

Community Detection in Genetic Network Big Data: Taking NC Method as an Example

Sun Yifan et al.   

  • Online: 2019-03-25    Published: 2019-03-27

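Abstract: Identifying disease-causing genes among a large number of genes is an important high-dimensional statistical problem in the big-data setting. Because genes are linked by network structure, the identification of disease genes has expanded from single genes to gene modules. Mining gene modules from a gene network is the community detection (or node clustering) problem. The vast majority of community detection methods use only the network structure and ignore information about the nodes themselves. In 2016, Newman and Clauset proposed a community detection method based on statistical inference that combines the two (the NC method for short). Taking the NC method as a case study, this paper presents the application of statistical methods to real gene networks and the results obtained, and proposes improvements from a statistical perspective. The analysis of the NC method shows that, for unstructured data such as gene networks, statistical ideas and principles remain central to data analysis, while the corresponding statistical methods need to be adjusted and optimized to fit the characteristics of the data and the questions of interest.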

Key words: Gene network, Community detection, Metadata

Abstract: The identification of disease genes is an important high-dimensional statistical problem. The network structure among genes has led researchers to shift their attention from single-gene identification to genetic module identification. Detecting genetic modules in a genetic network is the so-called community detection (or node clustering) problem. Most research in this area uses only the topological structure and neglects the metadata on the nodes. In 2016, Newman and Clauset proposed a community detection method based on statistical inference (the NC method) that combines the metadata with the topological structure. In this paper, we take the NC method as an example to demonstrate the applications and achievements of statistical methods in genetic networks, and we discuss potential improvements from a statistical point of view. The analysis of the NC method indicates that statistical thinking and principles still play a central role in the analysis of unstructured data such as genetic networks, while the statistical methods themselves need to be adjusted and optimized according to the characteristics of the data and the questions of interest.

Key words: Genetic network, Community detection, Metadata