Decision Trees

Explains how candidate splits are compared and how to measure which split is best: based on SSE for regression and Gini for classification (good info about Gini here).

  • For classification, the Gini cost function is used. It indicates how “pure” the leaf nodes are (how mixed the training data assigned to each node is).

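Gini = sum(pk * (1 - pk)), where pk is the proportion of training samples of class k in the node. A node with a single class has Gini = 0.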
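A minimal sketch of the Gini formula above in plain Python. The helper names `gini` and `split_gini` are made up for illustration, and weighting each child node by its size when scoring a split is the usual CART convention rather than something stated in these notes.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one node: sum over classes k of pk * (1 - pk)."""
    n = len(labels)
    if n == 0:
        return 0.0
    # pk is the share of samples in the node that belong to class k.
    return sum((count / n) * (1 - count / n) for count in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted Gini of a candidate split into two child nodes (assumed convention)."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


pure_node = gini(["a", "a", "a", "a"])               # one class only
mixed_node = gini(["a", "a", "b", "b"])              # two classes, 50/50
good_split = split_gini(["a", "a"], ["b", "b"])      # separates the classes
bad_split = split_gini(["a", "b"], ["a", "b"])       # separates nothing
print(pure_node, mixed_node, good_split, bad_split)
```

The split with the lowest weighted Gini is the one a tree would prefer, so `good_split` beats `bad_split` here.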

  • Early stopping - require a minimum number of training samples per leaf node: 1 sample per node overfits, 5-10 is a good range.

  • Pruning - evaluate what happens if the leaf nodes are removed; if performance drops sharply, the node is needed and should be kept.
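The two controls above can be sketched with scikit-learn, assuming that library is available. The synthetic dataset and the value `ccp_alpha=0.01` are arbitrary choices for illustration. Note that scikit-learn implements cost-complexity pruning, which is a different technique from the "remove a leaf and check the drop" procedure described in the bullet; it is used here only as the readily available stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data purely for illustration (not from the notes).
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No early stopping: leaves may end up holding a single training sample.
full = DecisionTreeClassifier(min_samples_leaf=1, random_state=0).fit(X_train, y_train)

# Early stopping: every leaf must keep at least 5 training samples.
stopped = DecisionTreeClassifier(min_samples_leaf=5, random_state=0).fit(X_train, y_train)

# Pruning (cost-complexity variant): ccp_alpha > 0 collapses weak subtrees
# of the fully grown tree.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

for name, tree in [("full", full), ("stopped", stopped), ("pruned", pruned)]:
    print(name, "leaves:", tree.get_n_leaves(),
          "test accuracy:", round(tree.score(X_test, y_test), 3))
```

Comparing leaf counts and held-out accuracy across the three trees is one way to see whether the extra leaves of the full tree are earning their keep.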



Using an ensemble of trees to create a high-dimensional, sparse representation of the data, then classifying it with a linear classifier
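A rough sketch of that idea with scikit-learn. Using `RandomTreesEmbedding` as the tree ensemble and `LogisticRegression` as the linear classifier is an assumption; the note does not say which ensemble or classifier was meant, and the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each sample is encoded by the leaf it lands in for every tree (one-hot per tree),
# so the output has many columns but only n_estimators non-zeros per row.
embedder = RandomTreesEmbedding(n_estimators=20, max_depth=3, random_state=0)
X_sparse = embedder.fit_transform(X_train)
print("embedded shape:", X_sparse.shape, "non-zeros per row:", X_sparse.nnz // X_sparse.shape[0])

# The same embedding followed by a linear classifier, wrapped as one model.
model = make_pipeline(
    RandomTreesEmbedding(n_estimators=20, max_depth=3, random_state=0),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print("test accuracy:", round(model.score(X_test, y_test), 3))
```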

How to deal with imbalanced data in random forests:

  1. One approach is based on cost-sensitive learning.

  2. The other is based on a sampling technique.
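Both approaches can be sketched with scikit-learn and NumPy. The concrete choices here are assumptions, not something the notes specify: `class_weight="balanced"` stands in for cost-sensitive learning, and random under-sampling of the majority class stands in for the sampling technique. The 95/5 dataset is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data: roughly 95% class 0, 5% class 1 (illustrative only).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# 1. Cost-sensitive learning: errors on the rare class are weighted more heavily,
#    with weights inversely proportional to class frequency.
weighted = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                                  random_state=0).fit(X, y)

# 2. Sampling: randomly under-sample the majority class (label 0 by construction)
#    until both classes have the same number of training samples.
rng = np.random.default_rng(0)
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
balanced_idx = np.concatenate([minority_idx, kept_majority])
sampled = RandomForestClassifier(n_estimators=50, random_state=0).fit(
    X[balanced_idx], y[balanced_idx])

print("original class counts:", np.bincount(y))
print("after under-sampling: ", np.bincount(y[balanced_idx]))
```

With imbalanced data, plain accuracy is misleading, so the two models are better compared on per-class recall or a similar metric computed on held-out data.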

