Random Forest Algorithm

How to grow a Decision Tree

source : [3](3.html)

LearnUnprunedTree(X, Y)

Input: X, a matrix of R rows and M columns, where X_ij is the value of the j'th attribute in the i'th input datapoint. Each column consists of either all real values or all categorical values.

Input: Y, a vector of R elements, where Y_i is the output class of the i'th datapoint. The Y_i values are categorical.

Output: an unpruned decision tree.

    If all records in X have identical values in all their attributes (this includes the case where R < 2)
        Return a Leaf Node predicting the majority output, breaking ties randomly
    Else if all values in Y are the same
        Return a Leaf Node predicting this value as the output
    Else
        Select m variables at random out of the M variables
        For each of the m selected variables j
            If the j'th attribute is categorical
                IG_j = IG(Y|X_j) (see Information gain)
            Else (the j'th attribute is real-valued)
                IG_j = IG*(Y|X_j) (see Information gain)
        Let j* = argmax_j IG_j (this is the splitting attribute we'll use)
        If j* is categorical then
            For each value v of the j*'th attribute
                Let X^v = subset of rows of X in which X_i,j* = v
                Let Y^v = corresponding subset of Y
                Let Child^v = LearnUnprunedTree(X^v, Y^v)
            Return a decision tree node, splitting on the j*'th attribute. The number of children equals the number of values of the j*'th attribute, and the v'th child is Child^v
        Else (j* is real-valued; let t be the best split threshold)
            Let X^LO = subset of rows of X in which X_i,j* < t
            Let Y^LO = corresponding subset of Y
            Let Child^LO = LearnUnprunedTree(X^LO, Y^LO)
            Let X^HI = subset of rows of X in which X_i,j* >= t
            Let Y^HI = corresponding subset of Y
            Let Child^HI = LearnUnprunedTree(X^HI, Y^HI)
            Return a decision tree node, splitting on the j*'th attribute. It has two children, corresponding to whether the j*'th attribute is below the threshold (< t) or not (>= t).

Note: Information gain is not the only criterion for splitting nodes; alternatives such as Gini impurity exist.
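The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Mahout implementation: nodes are plain dicts, real-valued splits store their children under the hypothetical keys "LO" (< t) and "HI" (>= t), candidate thresholds are midpoints between adjacent observed values, and X is assumed to be non-empty. The guard that returns a leaf when the sampled attributes cannot separate the rows is also an assumption, added so the recursion always terminates.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_rows(X, j, t=None):
    """Group row indices by the value of attribute j, or by LO/HI if t is given."""
    groups = {}
    for i, row in enumerate(X):
        key = row[j] if t is None else ("LO" if row[j] < t else "HI")
        groups.setdefault(key, []).append(i)
    return groups

def info_gain(X, Y, j, t=None):
    """IG(Y|X_j) for a categorical column, or IG(Y|X_j:t) for a threshold t."""
    groups = split_rows(X, j, t)
    conditional = sum(len(idx) / len(Y) * entropy([Y[i] for i in idx])
                      for idx in groups.values())
    return entropy(Y) - conditional

def best_threshold(X, Y, j):
    """Return (IG*, t); candidate thresholds are midpoints (an assumption)."""
    values = sorted({row[j] for row in X})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(((info_gain(X, Y, j, t), t) for t in candidates),
               default=(0.0, None))

def majority_leaf(Y, rng):
    """Leaf predicting the most common label, breaking ties randomly."""
    counts = Counter(Y)
    top = max(counts.values())
    return {"leaf": rng.choice([lab for lab, c in counts.items() if c == top])}

def learn_unpruned_tree(X, Y, m, is_categorical, rng):
    """Grow an unpruned tree from rows X (non-empty) and labels Y."""
    if len(set(Y)) == 1 or len({tuple(row) for row in X}) == 1:
        return majority_leaf(Y, rng)
    best_ig, best_j, best_t = -1.0, None, None
    for j in rng.sample(range(len(X[0])), m):   # m attributes chosen at random
        if is_categorical[j]:
            ig, t = info_gain(X, Y, j), None
        else:
            ig, t = best_threshold(X, Y, j)
        if ig > best_ig:
            best_ig, best_j, best_t = ig, j, t
    groups = split_rows(X, best_j, best_t)
    if len(groups) < 2:
        # Assumption: the sampled attributes cannot separate the rows, so stop.
        return majority_leaf(Y, rng)
    children = {
        key: learn_unpruned_tree([X[i] for i in idx], [Y[i] for i in idx],
                                 m, is_categorical, rng)
        for key, idx in groups.items()
    }
    return {"attribute": best_j, "threshold": best_t, "children": children}

# Tiny example: column 0 is categorical, column 1 is real-valued.
X = [["a", 1.0], ["a", 2.0], ["b", 3.0], ["b", 4.0]]
Y = ["n", "n", "y", "y"]
tree = learn_unpruned_tree(X, Y, m=2, is_categorical=[True, False],
                           rng=random.Random(0))
```

In the example both attributes separate the two classes perfectly, so whichever one is examined first becomes the root and each child is a leaf.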

Information gain

source : [3](3.html)

Nominal attributes

Suppose X can take one of m values $V_1, V_2, \ldots, V_m$, with $P(X=V_1)=p_1,\ P(X=V_2)=p_2,\ \ldots,\ P(X=V_m)=p_m$.

The entropy of X is

$$H(X) = -\sum_{j=1}^{m} p_j \log_2 p_j$$

$H(Y \mid X=v)$ is the entropy of Y among only those records in which X has value v. The conditional entropy and the information gain are then

$$H(Y \mid X) = \sum_{j=1}^{m} p_j \, H(Y \mid X=V_j)$$

$$IG(Y \mid X) = H(Y) - H(Y \mid X)$$
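As a small worked example of these definitions (the data are made up for illustration), take four records in which X takes the values a, a, b, b and Y takes the values yes, yes, yes, no:

$$H(Y) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.811$$

$$H(Y \mid X=a) = 0, \qquad H(Y \mid X=b) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$$

$$H(Y \mid X) = \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot 1 = 0.5$$

$$IG(Y \mid X) = H(Y) - H(Y \mid X) \approx 0.811 - 0.5 = 0.311$$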

Real-valued attributes

Suppose X is real-valued. For a threshold t, define

$$IG(Y \mid X:t) = H(Y) - H(Y \mid X:t)$$

$$H(Y \mid X:t) = H(Y \mid X<t)\,P(X<t) + H(Y \mid X \ge t)\,P(X \ge t)$$

$$IG^*(Y \mid X) = \max_t IG(Y \mid X:t)$$
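The search over t can be sketched in Python as below. The text does not say which thresholds to try; this sketch assumes the midpoints between adjacent distinct values of X, which is a common choice but not the only one. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def ig_at_threshold(xs, ys, t):
    """IG(Y|X:t) = H(Y) - [H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)]."""
    lo = [y for x, y in zip(xs, ys) if x < t]
    hi = [y for x, y in zip(xs, ys) if x >= t]
    n = len(ys)
    # An empty side has probability 0 and contributes nothing.
    conditional = sum(len(part) / n * entropy(part) for part in (lo, hi) if part)
    return entropy(ys) - conditional

def ig_star(xs, ys):
    """Return (IG*(Y|X), best t), or (0.0, None) if X has one distinct value."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    if not candidates:
        return 0.0, None
    return max((ig_at_threshold(xs, ys, t), t) for t in candidates)

xs = [1.0, 2.0, 3.0, 4.0]
ys = ["n", "n", "y", "y"]
gain, threshold = ig_star(xs, ys)   # the split at t = 2.5 separates the classes
```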

How to grow a Random Forest

source : [1](1.html)

Each tree is grown as follows:

1. If the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.
3. Each tree is grown to the largest extent possible. There is no pruning.
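The two sources of randomness in these steps can be illustrated with a short Python sketch. The function names are hypothetical, and cases and attributes are represented only by their indices.

```python
import random

def bootstrap_sample(n_cases, rng):
    """Draw n_cases indices with replacement; also return the indices never drawn."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    out_of_bag = sorted(set(range(n_cases)) - set(in_bag))
    return in_bag, out_of_bag

def sample_attributes(n_attributes, m, rng):
    """Pick m distinct attribute indices out of n_attributes for one node."""
    return rng.sample(range(n_attributes), m)

rng = random.Random(42)
in_bag, out_of_bag = bootstrap_sample(1000, rng)
oob_fraction = len(out_of_bag) / 1000      # tends towards 1/e (about 0.368) as N grows
node_attributes = sample_attributes(16, 4, rng)
```

Because sampling is done with replacement, some cases appear several times in a tree's training set and others not at all; the latter are the cases that tree never sees during training.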

Random Forest parameters

source : [2](2.html)

Random forests are easy to use: the only two parameters a user has to choose are the number of trees and the number of variables (m) to be randomly selected from the available set of variables. Breiman's recommendations are to pick a large number of trees and to set m to the square root of the number of variables.
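As a small illustration of the square-root rule, the helper below computes a default m from the number of variables. The function name is hypothetical, and rounding down (with a minimum of 1) is an assumption, since the recommendation does not say how to round.

```python
import math

def default_m(n_variables):
    """Square-root rule for m; rounding down is an assumption of this sketch."""
    return max(1, math.isqrt(n_variables))

m_for_16 = default_m(16)
m_for_10 = default_m(10)
```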

How to predict the label of a case

Classify(node, V)

Input: node, a node of the decision tree; if node.attribute = j, the split at this node is done on the j'th attribute.

Input: V, a vector of M elements, where V_j is the value of the j'th attribute.

Output: the label of V.

    If node is a Leaf then
        Return the value predicted by node
    Else
        Let j = node.attribute
        If j is categorical then
            Let v = V_j
            Let child^v = child node corresponding to the attribute's value v
            Return Classify(child^v, V)
        Else (j is real-valued)
            Let t = node.threshold (split threshold)
            If V_j < t then
                Let child^LO = child node corresponding to (< t)
                Return Classify(child^LO, V)
            Else
                Let child^HI = child node corresponding to (>= t)
                Return Classify(child^HI, V)
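A minimal Python sketch of Classify follows. The node layout is an assumption made for illustration: a leaf is a dict with a "leaf" key, and an internal node holds "attribute", "threshold" (None for a categorical split) and "children", with real-valued children stored under the hypothetical keys "LO" and "HI". A categorical value that has no matching child raises a KeyError in this sketch.

```python
def classify(node, v):
    """Return the label predicted for attribute vector v by the tree rooted at node."""
    if "leaf" in node:
        return node["leaf"]
    j = node["attribute"]
    if node["threshold"] is None:
        # Categorical split: one child per attribute value.
        child = node["children"][v[j]]
    else:
        # Real-valued split: LO if v[j] < t, HI otherwise.
        child = node["children"]["LO" if v[j] < node["threshold"] else "HI"]
    return classify(child, v)

# Hand-built tree: split on categorical attribute 0, then on real attribute 1.
tree = {
    "attribute": 0, "threshold": None,
    "children": {
        "a": {"leaf": "n"},
        "b": {"attribute": 1, "threshold": 2.5,
              "children": {"LO": {"leaf": "n"}, "HI": {"leaf": "y"}}},
    },
}
label = classify(tree, ["b", 3.0])
```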

The out of bag (oob) error estimation

source : [1](1.html)

In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows:

* Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree.
* Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of n, averaged over all cases, is the oob error estimate. This has proven to be unbiased in many tests.
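The estimate described above can be sketched in Python as follows. To keep the example self-contained, each tree is represented by a plain function from a row to a label, and the set of in-bag indices for each tree is given directly; both are assumptions of this sketch. Cases that were in every bootstrap sample receive no oob votes and are skipped, and ties are resolved in favour of the class seen first, which the text does not specify.

```python
from collections import Counter

def oob_error(trees, in_bag_sets, X, Y):
    """Out-of-bag error: trees[k] maps a row to a label, in_bag_sets[k] holds
    the indices of the cases used to grow tree k."""
    errors, counted = 0, 0
    for n, (row, truth) in enumerate(zip(X, Y)):
        # Only trees that did not train on case n may vote on it.
        votes = Counter(tree(row)
                        for tree, in_bag in zip(trees, in_bag_sets)
                        if n not in in_bag)
        if not votes:
            continue  # case n was never out of bag (skipping is an assumption)
        j = votes.most_common(1)[0][0]
        errors += (j != truth)
        counted += 1
    return errors / counted if counted else float("nan")

X = [[0], [1], [2], [3]]
Y = ["a", "a", "b", "b"]
trees = [lambda row: "a" if row[0] < 2 else "b",   # separates the classes
         lambda row: "a"]                          # always predicts "a"
in_bag_sets = [{0, 1, 2}, {0, 3}]
error = oob_error(trees, in_bag_sets, X, Y)
```

Here case 0 is in both bootstrap samples and is skipped; of the remaining three cases, only case 2 is misclassified by its single oob tree, so the estimate is 1/3.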

Other RF uses

source : [1](1.html)

* variable importance
* gini importance
* proximities
* scaling
* prototypes
* missing values replacement for the training set
* missing values replacement for the test set
* detecting mislabeled cases
* detecting outliers
* detecting novelties
* unsupervised learning
* balancing prediction error

Please refer to [1](1.html) for a detailed description.

References

[1](1.html) Random Forests - Classification Description. [Random forests - classification description](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm "Random forests - classification description")

[2](2.html) B. Larivière & D. Van Den Poel, 2004. "Predicting Customer Retention and Profitability by Using Random Forests and Regression Forests Techniques," Working Papers of Faculty of Economics and Business Administration, Ghent University, Belgium 04/282, Ghent University, Faculty of Economics and Business Administration. Available online: [Predicting Customer Retention and Profitability by Using Random Forests and Regression Forests Techniques](http://ideas.repec.org/p/rug/rugwps/04-282.html "Predicting Customer Retention and Profitability by Using Random Forests and Regression Forests Techniques")

[3](3.html) Decision Trees - Andrew W. Moore. [http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials "http://www.cs.cmu.edu/~awm/tutorials")

[4](4.html) Information Gain - Andrew W. Moore. [http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials "http://www.cs.cmu.edu/~awm/tutorials")

Copyright © 2014-2024 [The Apache Software Foundation](http://www.apache.org/ "The Apache Software Foundation"), Licensed under the Apache License, Version 2.0.