一个大小为N的数据集D. 对数据集随机有放回抽样N次作为一棵CART树的训练集.
根据概率论,可知数据集中有大约1/3的数据是没有被选取的(称为Out of bag),所以就是这没被选取的部分作为小树的测试集.
数据集D中的每一个样本都可以拿来做测试数据, 对于一个样本d, 森林中大约有1/e树是OOB的, 那么这1/e的树就构成了预测样本d的森林,用简单投票法计算分类结果. 从而得到总的error.
(Predict probability for each possible outcome.
Compute the probability estimates for each single sample in X and each possible outcome seen during training (categorical distribution).Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of n averaged over all cases is the oob error estimate. This has proven to be unbiased in many tests)