Random Forest Algorithm

How to grow a Decision Tree

source : [3](3.html)

LearnUnprunedTree(X, Y)

Input: X, a matrix of R rows and M columns, where X_ij = the value of the j'th attribute in the i'th input datapoint. Each column consists of either all real values or all categorical values.
Input: Y, a vector of R elements, where Y_i = the output class of the i'th datapoint. The Y_i values are categorical.
Output: an unpruned decision tree

If all records in X have identical values in all their attributes (this includes the case where R < 2), return a Leaf Node predicting the majority output, breaking ties randomly. Likewise, if all values in Y are the same, return a Leaf Node predicting this value as the output.
Else
  Select m variables at random out of the M variables
  For j = 1 .. m
    If the j'th attribute is categorical
      IG_j = IG(Y|X_j) (see Information Gain)
    Else (the j'th attribute is real-valued)
      IG_j = IG*(Y|X_j) (see Information Gain)
  Let j* = argmax_j IG_j (this is the splitting attribute we'll use)
  If the j*'th attribute is categorical then
    For each value v of the j*'th attribute
      Let X^v = the subset of rows of X in which X_ij* = v, and let Y^v = the corresponding subset of Y
      Let Child^v = LearnUnprunedTree(X^v, Y^v)
    Return a decision tree node splitting on the j*'th attribute. The number of children equals the number of values of the j*'th attribute, and the v'th child is Child^v
  Else (the j*'th attribute is real-valued)
    Let t be the best split threshold
    Let X^LO = the subset of rows of X in which X_ij* <= t, and let Y^LO = the corresponding subset of Y
    Let Child^LO = LearnUnprunedTree(X^LO, Y^LO)
    Let X^HI = the subset of rows of X in which X_ij* > t, and let Y^HI = the corresponding subset of Y
    Let Child^HI = LearnUnprunedTree(X^HI, Y^HI)
    Return a decision tree node splitting on the j*'th attribute. It has two children corresponding to whether the j*'th attribute is below or above the given threshold.
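
To make the recursion concrete, here is a minimal Python sketch of the procedure above. It is not the Mahout implementation: the function name `learn_unpruned_tree`, the dict-based node layout, and the restriction to categorical attributes are all assumptions made for illustration (a real-valued attribute would additionally require the threshold search described under Information Gain).

```python
import math
import random
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_j p_j log2 p_j over the class distribution in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def learn_unpruned_tree(X, Y, m):
    """Grow one unpruned tree on rows X with class labels Y, sampling m
    attributes at each node. Categorical attributes only in this sketch."""
    # Leaf cases: all labels identical, or all rows identical in every attribute.
    if len(set(Y)) == 1 or all(row == X[0] for row in X):
        return {"leaf": True, "label": Counter(Y).most_common(1)[0][0]}

    # Select m attributes at random; keep the one with the highest IG(Y|X_j).
    def info_gain(j):
        h_cond = 0.0
        for v in set(row[j] for row in X):
            ys = [y for row, y in zip(X, Y) if row[j] == v]
            h_cond += len(ys) / len(Y) * entropy(ys)
        return entropy(Y) - h_cond

    candidates = random.sample(range(len(X[0])), min(m, len(X[0])))
    j_star = max(candidates, key=info_gain)

    values = set(row[j_star] for row in X)
    if len(values) == 1:  # the chosen split would not separate anything; stop.
        return {"leaf": True, "label": Counter(Y).most_common(1)[0][0]}

    # One child per observed value of the splitting attribute.
    children = {}
    for v in values:
        keep = [(row, y) for row, y in zip(X, Y) if row[j_star] == v]
        children[v] = learn_unpruned_tree([r for r, _ in keep], [y for _, y in keep], m)
    return {"leaf": False, "attribute": j_star, "children": children}
```

For example, `learn_unpruned_tree([["sunny", "hot"], ["rain", "cool"], ["rain", "mild"]], ["no", "yes", "yes"], m=1)` returns a nested dict describing one tree.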

Note: There are alternatives to Information Gain for splitting nodes

Information gain

source : [3](3.html)

Nominal attributes

Suppose X can have one of m values V_1, V_2, ..., V_m, with P(X=V_1)=p_1, P(X=V_2)=p_2, ..., P(X=V_m)=p_m.

H(X) = -sum_{j=1..m} p_j log2 p_j (the entropy of X)
H(Y|X=v) = the entropy of Y among only those records in which X has value v
H(Y|X) = sum_j p_j H(Y|X=v_j)
IG(Y|X) = H(Y) - H(Y|X)
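
As a quick check of these definitions, the following sketch computes H(Y), H(Y|X) and IG(Y|X) for a small made-up nominal attribute; the data values are arbitrary and only serve to exercise the formulas.

```python
import math
from collections import Counter

def entropy(values):
    """H = -sum_j p_j log2 p_j over the empirical distribution in `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# Toy nominal attribute X and class label Y (made-up values).
X = ["sunny", "sunny", "rain", "rain", "rain", "overcast"]
Y = ["no",    "no",    "yes",  "yes",  "no",   "yes"]

h_y = entropy(Y)
# H(Y|X) = sum_v P(X=v) * H(Y|X=v)
h_y_given_x = sum(
    (sum(1 for x in X if x == v) / len(X))
    * entropy([y for x, y in zip(X, Y) if x == v])
    for v in set(X)
)
print("H(Y) =", round(h_y, 3),
      " H(Y|X) =", round(h_y_given_x, 3),
      " IG(Y|X) =", round(h_y - h_y_given_x, 3))
```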

Real-valued attributes

Suppose X is real-valued.

Define IG(Y|X:t) as H(Y) - H(Y|X:t)
Define H(Y|X:t) = H(Y|X < t) P(X < t) + H(Y|X >= t) P(X >= t)
Define IG*(Y|X) = max_t IG(Y|X:t)
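
The threshold search implied by IG*(Y|X) can be sketched as follows. Candidate thresholds are taken at midpoints between consecutive sorted values, which is one common choice assumed here rather than something prescribed by the source; the function name `best_real_split` is likewise illustrative.

```python
import math
from collections import Counter

def entropy(values):
    """H = -sum_j p_j log2 p_j over the empirical distribution in `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def best_real_split(xs, ys):
    """Return (IG*(Y|X), t*) = max over thresholds t of H(Y) - H(Y|X:t)."""
    h_y = entropy(ys)
    pairs = sorted(zip(xs, ys))
    best_ig, best_t = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can fall strictly between two equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2.0  # midpoint candidate threshold
        lo = [y for x, y in pairs if x < t]
        hi = [y for x, y in pairs if x >= t]
        # H(Y|X:t) = H(Y|X<t) P(X<t) + H(Y|X>=t) P(X>=t)
        h_cond = (len(lo) * entropy(lo) + len(hi) * entropy(hi)) / len(pairs)
        if h_y - h_cond > best_ig:
            best_ig, best_t = h_y - h_cond, t
    return best_ig, best_t

print(best_real_split([2.0, 4.5, 5.0, 7.5, 9.0], ["no", "no", "yes", "yes", "yes"]))
```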

How to grow a Random Forest

source : [1](1.html)

Each tree is grown as follows:

  1. if the number of cases in the training set is N, sample N cases at random - but with replacement - from the original data. This sample will be the training set for growing the tree.
  2. if there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.
  3. each tree is grown to the largest extent possible. There is no pruning. (A minimal sketch of these three steps follows this list.)
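
The sketch below follows the three steps above, assuming a node-level tree learner such as the `learn_unpruned_tree` sketch earlier is supplied as the `learn_tree` argument; the names `grow_forest` and `learn_tree` are illustrative and do not correspond to the Mahout API.

```python
import math
import random

def grow_forest(X, Y, num_trees, learn_tree, m=None):
    """Grow `num_trees` trees, each on a bootstrap sample of the N cases.

    `learn_tree(X_boot, Y_boot, m)` is any learner that samples m attributes
    at each node, e.g. the learn_unpruned_tree sketch above.
    """
    n_cases = len(X)
    n_attrs = len(X[0])
    # Breiman's recommended choice for m: sqrt(M), held constant for the whole forest.
    m = m or max(1, int(math.sqrt(n_attrs)))
    forest = []
    for _ in range(num_trees):
        # Step 1: sample N cases at random *with replacement* (the bootstrap sample).
        idx = [random.randrange(n_cases) for _ in range(n_cases)]
        X_boot = [X[i] for i in idx]
        Y_boot = [Y[i] for i in idx]
        # Steps 2-3: grow an unpruned tree, sampling m attributes at each node.
        forest.append(learn_tree(X_boot, Y_boot, m))
    return forest
```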

Random Forest parameters

source : [2](2.html)

Random Forests are easy to use; the only two parameters a user of the technique has to determine are the number of trees to be used and the number of variables (m) to be randomly selected from the available set of variables. Breiman's recommendations are to pick a large number of trees, and to use the square root of the number of variables for m.

How to predict the label of a case

Classify(node, V)

Input: node, a node of the decision tree; if node.attribute = j then the split at node is done on the j'th attribute
Input: V, a vector of M elements, where V_j = the value of the j'th attribute
Output: the label of V

If node is a Leaf then
  Return the value predicted by node
Else
  Let j = node.attribute
  If the j'th attribute is categorical then
    Let v = V_j
    Let child^v = the child node corresponding to the attribute value v
    Return Classify(child^v, V)
  Else (the j'th attribute is real-valued)
    Let t = node.threshold (the split threshold)
    If V_j < t then
      Let child^LO = the child node corresponding to (< t)
      Return Classify(child^LO, V)
    Else
      Let child^HI = the child node corresponding to (>= t)
      Return Classify(child^HI, V)
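
A matching Python sketch, using the same dict-based node layout assumed in the `learn_unpruned_tree` sketch above (real-valued splits are represented here with an assumed "threshold" key and "LO"/"HI" children). The forest-level `predict` helper simply takes the majority vote over the trees, as described in [1](1.html).

```python
from collections import Counter

def classify(node, v):
    """Predict the label of case v (a list of M attribute values) with one tree.

    Leaves are dicts {"leaf": True, "label": ...}; internal nodes carry
    {"leaf": False, "attribute": j, "children": ...} and, for real-valued
    splits, an assumed "threshold" key with children keyed "LO" / "HI".
    """
    if node["leaf"]:
        return node["label"]
    j = node["attribute"]
    if "threshold" in node:                       # real-valued split
        branch = "LO" if v[j] < node["threshold"] else "HI"
        return classify(node["children"][branch], v)
    return classify(node["children"][v[j]], v)    # categorical split

def predict(forest, v):
    """Forest-level prediction: every tree votes and the majority class wins."""
    votes = Counter(classify(tree, v) for tree in forest)
    return votes.most_common(1)[0][0]
```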

The out-of-bag (oob) error estimation

source : [1](1.html)

In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows:

  • each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree.
  • put each case left out of the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of n, averaged over all cases, is the oob error estimate. This has proven to be unbiased in many tests. (A sketch of this bookkeeping follows this list.)
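
A hedged sketch of that bookkeeping: the forest is grown tree by tree, the bootstrap indices are remembered so the out-of-bag cases of each tree can be identified, and only oob votes are counted. The function name `oob_error` and the `learn_tree`/`classify` arguments are illustrative assumptions, not the Mahout API.

```python
import random
from collections import Counter, defaultdict

def oob_error(X, Y, num_trees, learn_tree, classify, m):
    """Estimate the generalization error from out-of-bag (oob) cases only."""
    n = len(X)
    votes = defaultdict(Counter)   # case index -> votes from trees where it was oob
    for _ in range(num_trees):
        idx = [random.randrange(n) for _ in range(n)]       # bootstrap sample
        tree = learn_tree([X[i] for i in idx], [Y[i] for i in idx], m)
        oob = set(range(n)) - set(idx)                       # about 1/3 of the cases
        for i in oob:
            votes[i][classify(tree, X[i])] += 1              # run the oob case down this tree
    # For each case n, j = class with the most oob votes;
    # the error is the fraction of cases where j != the true class.
    scored = [(i, c.most_common(1)[0][0]) for i, c in votes.items()]
    return sum(1 for i, j in scored if j != Y[i]) / max(1, len(scored))
```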

Other RF uses

source : [1](1.html)

  • variable importance
  • gini importance
  • proximities
  • scaling
  • prototypes
  • missing values replacement for the training set
  • missing values replacement for the test set
  • detecting mislabeled cases
  • detecting outliers
  • detecting novelties
  • unsupervised learning
  • balancing prediction error

Please refer to [1](1.html) for a detailed description of these uses.

References

[1](1.html) Random Forests - Classification Description. [Random forests - classification description](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm "Random forests - classification description")

[2](2.html) B. Larivière & D. Van Den Poel, 2004. "Predicting Customer Retention and Profitability by Using Random Forests and Regression Forests Techniques," Working Papers of Faculty of Economics and Business Administration, Ghent University, Belgium 04/282. Available online: [Predicting Customer Retention and Profitability by Using Random Forests and Regression Forests Techniques](http://ideas.repec.org/p/rug/rugwps/04-282.html "Predicting Customer Retention and Profitability by Using Random Forests and Regression Forests Techniques")

[3](3.html) Decision Trees - Andrew W. Moore. [http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials)

[4](4.html) Information Gain - Andrew W. Moore. [http://www.cs.cmu.edu/~awm/tutorials](http://www.cs.cmu.edu/~awm/tutorials "http://www.cs.cmu.edu/~awm/tutorials")
