> it seems to gloss over what differences the trees within the random forest have - as I understand it, they are all slightly different, and this gives them greater accuracy?
They kinda cover it in sections 3.1 and 3.2 with Bagging and Bagging -> Random Forest, but it'd be good for them to explain Boosted Trees here as well.
As far as I understand it, a random forest is an aggregation of trees, each trained on a bootstrap sample — data points drawn at random, with replacement, from the original data set. That doesn't necessarily make the forest more accurate on the training data, but it makes it more generalised and less likely to overfit (https://en.wikipedia.org/wiki/Overfitting), because the different trees are likely to focus on different characteristics of the dataset.
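A quick numpy sketch of why the trees end up slightly different: a bootstrap sample of size n drawn with replacement only contains about 63% of the distinct original rows, so every tree sees (and misses) a different slice of the data. The dataset size here is just an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # hypothetical dataset size

# A bootstrap sample draws n row indices *with replacement*,
# so each tree is trained on a slightly different dataset.
sample = rng.integers(0, n, size=n)
unique_frac = np.unique(sample).size / n
print(f"{unique_frac:.2f}")  # ≈ 0.63, i.e. 1 - 1/e distinct rows per tree
```

The ~37% of rows a given tree never sees ("out-of-bag" samples) are also what random forests use to estimate generalisation error for free.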
Boosted trees, by contrast, do become more accurate on the training data: each new model is fitted with more weight (or priority when resampling) given to the data points that the earlier models misclassified.
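To make the "give more priority to misclassified points" idea concrete, here's a minimal AdaBoost-style weight update in numpy. The labels and the weak learner's predictions are made up for illustration; the update rule itself is the standard one.

```python
import numpy as np

# Toy example: ±1 labels and a hypothetical weak learner's predictions.
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, 1, 1])   # gets points 2 and 4 wrong
w = np.full(len(y_true), 1 / len(y_true))  # start with uniform weights

err = np.sum(w[y_pred != y_true])       # weighted error of this learner
alpha = 0.5 * np.log((1 - err) / err)   # this learner's vote weight
w *= np.exp(-alpha * y_true * y_pred)   # up-weight mistakes, down-weight hits
w /= w.sum()                            # renormalise to a distribution
print(np.round(w, 3))                   # misclassified points now weigh more
```

A nice property of this update: after renormalising, the misclassified points carry exactly half of the total weight, so the next learner is forced to pay attention to them.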
Just to add here that the columns, and not just the rows, are also sampled: at each split only a random subset of features is considered, so the trees are deliberately prevented from learning too much. This helps improve diversity and reduce overfitting. The subset size is controlled by the mtry parameter; R's randomForest defaults to √p for classification and p/3 for regression, where p is the number of columns.
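The column sampling is easy to sketch too. Assuming a hypothetical dataset with 16 features and the √p classification default, each split only ever looks at a small random subset of columns:

```python
import numpy as np

rng = np.random.default_rng(42)
p = 16                        # hypothetical number of features
mtry = int(np.sqrt(p))        # √p, the common classification default

# At each candidate split, only `mtry` randomly chosen columns
# are even considered — the rest are invisible to this split.
candidates = rng.choice(p, size=mtry, replace=False)
print(sorted(candidates))
```

Because a strong predictor column is often absent from the candidate set, weaker features get used too, which decorrelates the trees and is a big part of why averaging them helps.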