
How to grow a Decision Tree

LearnUnprunedTree(X, Y)

Input: X, a matrix of R rows and M columns, where X_ij = the value of the j'th attribute in the i'th input datapoint. Each column consists of either all real values or all categorical values.
Input: Y, a vector of R elements, where Y_i = the output class of the i'th datapoint. The Y_i values are categorical.
Output: an unpruned decision tree.

If all records in X have identical values in all their attributes (this includes the case where R < 2)
    Return a Leaf Node predicting the majority output, breaking ties randomly
Else if all values in Y are the same
    Return a Leaf Node predicting this value as the output
Else
    Select m variables at random out of the M variables
    For each selected attribute j (j = 1 .. m)
        If the j'th attribute is categorical
            IG_j = IG(Y|X_j) (see Information Gain)
        Else (the j'th attribute is real-valued)
            IG_j = IG*(Y|X_j) (see Information Gain)
    Let j* = argmax_j IG_j (this is the splitting attribute we'll use)
    If j* is categorical then
        For each value v of the j*'th attribute
            Let X^v = subset of rows of X in which X_ij* = v
            Let Y^v = corresponding subset of Y
            Let Child^v = LearnUnprunedTree(X^v, Y^v)
        Return a decision tree node, splitting on the j*'th attribute. The number of children equals the number of values of the j*'th attribute, and the v'th child is Child^v
    Else j* is real-valued; let t be the best split threshold (the t that attains IG*(Y|X_j*))
        Let X^LO = subset of rows of X in which X_ij* < t
        Let Y^LO = corresponding subset of Y
        Let Child^LO = LearnUnprunedTree(X^LO, Y^LO)
        Let X^HI = subset of rows of X in which X_ij* >= t
        Let Y^HI = corresponding subset of Y
        Let Child^HI = LearnUnprunedTree(X^HI, Y^HI)
        Return a decision tree node, splitting on the j*'th attribute. It has two children, corresponding to whether the j*'th attribute is below the threshold (< t) or not (>= t).
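The procedure above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation, and it rests on several assumptions that are not in the pseudocode: categorical values are Python strings, real values are numbers, the tree is stored as nested dicts, R >= 1, and the m candidate attributes are drawn only from attributes that still vary in the current subset (a guard that guarantees every split makes progress). The names `learn_unpruned_tree`, `best_split`, `entropy` and `cond_entropy` are hypothetical.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """H(Y) in bits for a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def cond_entropy(groups):
    """Weighted average entropy of several label lists (one per branch)."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * entropy(g) for g in groups)

def best_split(X, Y, j):
    """Return (gain, threshold) for attribute j; threshold is None if categorical."""
    col = [row[j] for row in X]
    if isinstance(col[0], str):  # assumption: categorical values are strings
        groups = {}
        for v, y in zip(col, Y):
            groups.setdefault(v, []).append(y)
        return entropy(Y) - cond_entropy(list(groups.values())), None
    best_gain, best_t = -1.0, None
    # Skipping the smallest value keeps both sides of (< t, >= t) non-empty.
    for t in sorted(set(col))[1:]:
        lo = [y for v, y in zip(col, Y) if v < t]
        hi = [y for v, y in zip(col, Y) if v >= t]
        gain = entropy(Y) - cond_entropy([lo, hi])
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

def learn_unpruned_tree(X, Y, m, rng=random):
    """Grow a tree as nested dicts; a leaf is {"leaf": label}. Assumes R >= 1."""
    varying = [j for j in range(len(X[0])) if len({row[j] for row in X}) > 1]
    if not varying or len(set(Y)) == 1:
        # Identical attribute values (includes R < 2) or a pure node:
        # predict the majority output, breaking ties randomly.
        counts = Counter(Y)
        top = max(counts.values())
        return {"leaf": rng.choice(sorted(c for c, k in counts.items() if k == top))}
    # Assumption: sample the m candidates from the attributes that still vary.
    candidates = rng.sample(varying, min(m, len(varying)))
    j = max(candidates, key=lambda a: best_split(X, Y, a)[0])
    _, t = best_split(X, Y, j)
    if t is None:
        keys = [row[j] for row in X]  # one child per observed value
    else:
        keys = ["LO" if row[j] < t else "HI" for row in X]
    node = {"attribute": j, "threshold": t, "children": {}}
    for key in set(keys):
        rows = [i for i, k in enumerate(keys) if k == key]
        node["children"][key] = learn_unpruned_tree(
            [X[i] for i in rows], [Y[i] for i in rows], m, rng)
    return node
```

Calling `learn_unpruned_tree(X, Y, m=len(X[0]))` considers every attribute at every node, which corresponds to growing a single ordinary decision tree; a smaller m gives the randomized trees used in a forest.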

Note: there are alternatives to Information Gain for splitting nodes, such as the Gini impurity.

Information gain

Nominal attributes

Suppose X can take one of m values $V_1, V_2, \ldots, V_m$, with $P(X=V_1)=p_1$, $P(X=V_2)=p_2$, ..., $P(X=V_m)=p_m$.

The entropy of X is

$$H(X) = -\sum_{j=1}^{m} p_j \log_2 p_j$$

$H(Y \mid X=v)$ is the entropy of Y among only those records in which X has value v. Then

$$H(Y \mid X) = \sum_{j=1}^{m} p_j \, H(Y \mid X=V_j)$$

$$IG(Y \mid X) = H(Y) - H(Y \mid X)$$
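As a small worked example (the four records are made up for illustration), take the (X, Y) pairs (a, +), (a, +), (b, +), (b, -). Applying the definitions above term by term:

$$H(Y) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.811$$

$$H(Y \mid X=a) = 0, \qquad H(Y \mid X=b) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$$

$$H(Y \mid X) = \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot 1 = 0.5$$

$$IG(Y \mid X) = H(Y) - H(Y \mid X) \approx 0.811 - 0.5 = 0.311$$

Knowing X removes about 0.31 bits of uncertainty about Y in this toy dataset.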

Real-valued attributes

Suppose X is real-valued. For a threshold t, define

$$H(Y \mid X{:}t) = H(Y \mid X < t)\,P(X < t) + H(Y \mid X \ge t)\,P(X \ge t)$$

$$IG(Y \mid X{:}t) = H(Y) - H(Y \mid X{:}t)$$

$$IG^*(Y \mid X) = \max_t \, IG(Y \mid X{:}t)$$
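The two information gain definitions can be sketched in Python as follows. The function names are hypothetical, and the threshold search rests on one assumption: since IG(Y|X:t) only changes when t crosses a data point, it is enough to try the observed values of X (except the smallest) as candidate thresholds.

```python
import math
from collections import Counter

def entropy(ys):
    """H(Y) in bits."""
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

def info_gain_nominal(xs, ys):
    """IG(Y|X) = H(Y) - sum_j p_j * H(Y|X=V_j) for a categorical X."""
    n = len(ys)
    h_cond = 0.0
    for v in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == v]
        h_cond += len(subset) / n * entropy(subset)
    return entropy(ys) - h_cond

def info_gain_real(xs, ys):
    """IG*(Y|X) = max_t IG(Y|X:t); returns (gain, t), with t None if X is constant."""
    n = len(ys)
    best_gain, best_t = 0.0, None
    # Assumption: candidate thresholds are the observed values except the
    # smallest, so both X < t and X >= t are non-empty.
    for t in sorted(set(xs))[1:]:
        lo = [y for x, y in zip(xs, ys) if x < t]
        hi = [y for x, y in zip(xs, ys) if x >= t]
        h_cond = len(lo) / n * entropy(lo) + len(hi) / n * entropy(hi)
        gain = entropy(ys) - h_cond
        if best_t is None or gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t
```

For instance, `info_gain_real([1.0, 2.0, 3.0, 4.0], ["-", "-", "+", "+"])` picks the threshold 3.0, which separates the two classes perfectly and therefore recovers the full one bit of H(Y).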

How to grow a Random Forest

source : [1](1.html)

Each tree is grown as follows:

  1. If the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree.
  2. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.
  3. Each tree is grown to the largest extent possible. There is no pruning.
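The three steps above can be sketched in Python as follows. This is only an outline under stated assumptions: `grow_tree` is a hypothetical callback that grows one unpruned tree from a training sample (and is expected to call something like `candidate_attributes` at each node), and all function names are illustrative.

```python
import random

def bootstrap_sample(n_cases, rng):
    """Step 1: draw N case indices with replacement; also return the left-out ones."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    chosen = set(in_bag)
    left_out = [i for i in range(n_cases) if i not in chosen]
    return in_bag, left_out

def candidate_attributes(n_attributes, m, rng):
    """Step 2: m distinct attribute indices out of M, to be redrawn at every node."""
    return rng.sample(range(n_attributes), m)

def grow_forest(X, Y, n_trees, m, grow_tree, seed=0):
    """Grow n_trees trees, each on its own bootstrap sample, with m held constant.

    grow_tree(X_sample, Y_sample, m, rng) is a hypothetical callback that
    returns one unpruned tree (step 3: no pruning is applied here either).
    """
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        in_bag, left_out = bootstrap_sample(len(X), rng)
        tree = grow_tree([X[i] for i in in_bag], [Y[i] for i in in_bag], m, rng)
        forest.append((tree, left_out))  # keep the left-out cases per tree
    return forest
```

Keeping the left-out indices next to each tree is a design choice of this sketch; it makes it easy to evaluate each tree later on the cases it never saw.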

Random Forest parameters

source : [2](2.html)

Random Forests are easy to use: the only two parameters a user of the technique has to determine are the number of trees to be used and the number of variables (m) to be randomly selected from the available set of variables at each node. Breiman's recommendations are to pick a large number of trees and to set m to the square root of the number of variables.
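As a small illustration of the square-root rule, the helper below computes m from the number of variables. The name `default_m` is hypothetical, and rounding down (with a floor of 1) is an assumption, since the text does not say how to round.

```python
import math

def default_m(n_variables):
    """Square-root rule for m: floor(sqrt(M)), but never less than 1."""
    return max(1, math.isqrt(n_variables))
```

With 100 input variables this gives m = 10, so each node examines only a tenth of the variables.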

How to predict the label of a case

Classify(node, V)

Input: node, a node of the decision tree; if node.attribute = j, then the split at this node is done on the j'th attribute.
Input: V, a vector of M columns, where V_j = the value of the j'th attribute.
Output: the label of V.

If node is a Leaf then
    Return the value predicted by node
Else
    Let j = node.attribute
    If j is categorical then
        Let v = V_j
        Let child^v = child node corresponding to the attribute's value v
        Return Classify(child^v, V)
    Else j is real-valued
        Let t = node.threshold (split threshold)
        If V_j < t then
            Let child^LO = child node corresponding to (< t)
            Return Classify(child^LO, V)
        Else
            Let child^HI = child node corresponding to (>= t)
            Return Classify(child^HI, V)
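A possible Python rendering of Classify is shown below. The nested-dict node layout (`"leaf"`, `"attribute"`, `"threshold"`, `"children"`) is an assumption made for this sketch, with `threshold` set to None for categorical splits. The `forest_vote` helper, which labels a case by majority vote over the trees, is also illustrative; ties go to the label seen first, which is a simplification.

```python
from collections import Counter

def classify(node, v):
    """Walk from node down to a leaf and return the predicted label of v."""
    if "leaf" in node:
        return node["leaf"]
    j = node["attribute"]
    t = node["threshold"]
    if t is None:
        # Categorical split: follow the child for this value.
        # A value never seen during training raises KeyError in this sketch.
        child = node["children"][v[j]]
    else:
        # Real-valued split: (< t) goes to LO, (>= t) goes to HI.
        child = node["children"]["LO" if v[j] < t else "HI"]
    return classify(child, v)

def forest_vote(trees, v):
    """Majority vote of the trees' labels for v (ties: first label counted)."""
    votes = Counter(classify(tree, v) for tree in trees)
    return votes.most_common(1)[0][0]

# A hand-built example tree: split on attribute 1 at 20.0, then on attribute 0.
example_tree = {
    "attribute": 1, "threshold": 20.0, "children": {
        "LO": {"leaf": "no"},
        "HI": {"attribute": 0, "threshold": None, "children": {
            "sunny": {"leaf": "yes"},
            "rain": {"leaf": "no"},
        }},
    },
}
```

For the example tree, `classify(example_tree, ["sunny", 25.0])` follows the HI branch (25.0 >= 20.0) and then the "sunny" child, returning "yes".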

The out of bag (oob) error estimation

source : [1](1.html)

In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows:

  • Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and are not used in the construction of the kth tree.
  • Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of n, averaged over all cases, is the oob error estimate. This has proven to be unbiased in many tests.
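The oob estimate described above can be sketched in Python as follows. The sketch assumes the forest is a list of (tree, oob_indices) pairs and that `predict(tree, x)` is a hypothetical callback returning one tree's label for a case; both are illustrative choices. Cases that were never oob are skipped, and vote ties go to the label counted first. The helper `oob_fraction` shows where "about one-third" comes from: a given case is missed by all N draws with probability (1 - 1/N)^N, which approaches 1/e ≈ 0.368 as N grows.

```python
from collections import Counter

def oob_fraction(n_cases):
    """Probability that a given case is absent from a bootstrap sample of size N."""
    return (1 - 1 / n_cases) ** n_cases

def oob_error(X, Y, forest, predict):
    """Out-of-bag error estimate.

    forest  : list of (tree, oob_indices) pairs (assumed layout)
    predict : hypothetical callback, predict(tree, x) -> label
    """
    votes = [Counter() for _ in Y]
    for tree, oob_indices in forest:
        for n in oob_indices:
            # Only trees that did not train on case n get to vote on it.
            votes[n][predict(tree, X[n])] += 1
    scored = [n for n in range(len(Y)) if votes[n]]  # cases oob at least once
    if not scored:
        raise ValueError("no case was out of bag for any tree")
    wrong = sum(votes[n].most_common(1)[0][0] != Y[n] for n in scored)
    return wrong / len(scored)
```

With enough trees, almost every case is oob for some of them, so the average runs over nearly the whole training set.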

Other RF uses

source : [1](1.html)

  • variable importance
  • gini importance
  • proximities
  • scaling
  • prototypes
  • missing values replacement for the training set
  • missing values replacement for the test set
  • detecting mislabeled cases
  • detecting outliers
  • detecting novelties
  • unsupervised learning
  • balancing prediction error

Please refer to [1](1.html) for a detailed description.


[1](1.html) Random Forests - Classification Description.
[2](2.html) B. Larivière & D. Van Den Poel, 2004. "Predicting Customer Retention and Profitability by Using Random Forests and Regression Forests Techniques," Working Papers of Faculty of Economics and Business Administration, Ghent University, Belgium 04/282, Ghent University, Faculty of Economics and Business Administration.
[3](3.html) Decision Trees - Andrew W. Moore, http://www.cs.cmu.edu/~awm/tutorials
[4](4.html) Information Gain - Andrew W. Moore, http://www.cs.cmu.edu/~awm/tutorials


