These notes cover the lectures provided in the following YouTube links:
Decision Trees Part 1
Part 1 primarily covers the purpose of decision trees, their structure, how they’re used, how they can be implemented in a NumPy 2D array, and some thoughts on their performance. In summary, decision trees:
- Have factors with numerical or boolean values
- Each node in the tree is a factor coupled with the split value
- Split values decide whether to take the left or right branch
- Leaf nodes are nodes within decision trees that have no child nodes
- Leaf nodes are considered the Y value or predictive outcome of a machine learning model built with a decision tree
- Decision trees have a query performance of log(n) (base 2), with n being the number of nodes (decisions) in the tree built from the factors provided (see the query sketch below)
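To make the 2D-array implementation concrete, here is a minimal query sketch. It assumes each row of the array holds [factor, split value, relative offset to left child, relative offset to right child] and that leaf rows are flagged with a "leaf" marker - the same layout the build_tree pseudocode in Part 2 produces. The name query_tree is ours, not from the lecture.

def query_tree(tree, x):
    # walk from the root (row 0) down to a leaf, making one decision per level
    row = 0
    while tree[row, 0] != "leaf":
        factor = int(tree[row, 0])
        split_val = float(tree[row, 1])
        if x[factor] <= split_val:
            row += int(tree[row, 2])    # relative offset to the left subtree
        else:
            row += int(tree[row, 3])    # relative offset to the right subtree
    return tree[row, 1]                 # a leaf row stores the predicted Y value

Each query walks a single root-to-leaf path, which is where the log(n) (base 2) cost for a reasonably balanced tree comes from.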
Decision Trees Part 2
To build a decision tree from a 2D NumPy array, we can leverage the following pseudocode (written here as near-runnable NumPy, assuming Y is the last column of the array):
import numpy as np

def build_tree(data):
    # data is a 2D NumPy array: factor columns first, Y in the last column
    if data.shape[0] == 1:                      # only one row left -> leaf
        return np.array([["leaf", data[0, -1], None, None]])
    if np.all(data[:, -1] == data[0, -1]):      # all Y values identical -> leaf
        return np.array([["leaf", data[0, -1], None, None]])
    else:
        i = best_feature_to_split_on(data)
        split_val = np.median(data[:, i])
        # note: if every value in column i equals the median, one side of the
        # split is empty; a real implementation should return a leaf here
        left_tree = build_tree(data[data[:, i] <= split_val])
        right_tree = build_tree(data[data[:, i] > split_val])
        # relative row offsets: the left child is the next row, the right
        # child starts just after the entire left subtree
        root = np.array([[i, split_val, 1, left_tree.shape[0] + 1]])
        return np.vstack((root, left_tree, right_tree))
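As a quick, hypothetical usage example (the toy data below is made up, and it assumes some implementation of best_feature_to_split_on(), such as the correlation-based sketch in the next section):

# toy dataset: two factor columns, Y in the last column (values are illustrative)
data = np.array([[0.5, 10.0, 1.0],
                 [0.9, 12.0, 2.0],
                 [0.2,  7.0, 0.0],
                 [0.7, 11.0, 2.0]])

tree = build_tree(data)
print(tree)   # one row per node: [factor, split value, left offset, right offset], or a leaf row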
How do we determine the best feature to split on?
During this class, and in particular for regression trees, we’ll be using correlation (between each factor and Y) as our metric for the best feature to split on. When determining the best feature, the term used is information gain. Some example information gain metrics used for the creation of decision trees are listed below, followed by a sketch of a correlation-based selector:
- Entropy
- Correlation
- Gini Index
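A rough sketch of a correlation-based best_feature_to_split_on (not necessarily the lecture's exact implementation); it assumes Y is the last column of the data array and picks the factor with the highest absolute correlation with Y:

def best_feature_to_split_on(data):
    # choose the factor column whose absolute correlation with Y is highest
    y = data[:, -1]
    best_i, best_corr = 0, -1.0
    for i in range(data.shape[1] - 1):          # skip the Y column
        col = data[:, i]
        if np.std(col) == 0 or np.std(y) == 0:  # constant column: correlation undefined
            continue
        corr = abs(np.corrcoef(col, y)[0, 1])
        if corr > best_corr:
            best_i, best_corr = i, corr
    return best_i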
Building trees faster
Given the algorithm to build a decision tree (above), its slowest portions are best_feature_to_split_on() and taking the median of data[:, i] to find the split_val. We can speed up the process of building our decision tree by leveraging random trees; an algorithm to do so is below:
def build_tree(data):
    # same layout as before: factor columns first, Y in the last column
    if data.shape[0] == 1:                      # only one row left -> leaf
        return np.array([["leaf", data[0, -1], None, None]])
    if np.all(data[:, -1] == data[0, -1]):      # all Y values identical -> leaf
        return np.array([["leaf", data[0, -1], None, None]])
    else:
        # pick a random factor instead of searching for the best one
        i = np.random.randint(data.shape[1] - 1)
        # split on the average of that factor's value in two random rows
        # (same caveat as above if the resulting split is empty)
        r1, r2 = np.random.randint(data.shape[0], size=2)
        split_val = (data[r1, i] + data[r2, i]) / 2
        left_tree = build_tree(data[data[:, i] <= split_val])
        right_tree = build_tree(data[data[:, i] > split_val])
        root = np.array([[i, split_val, 1, left_tree.shape[0] + 1]])
        return np.vstack((root, left_tree, right_tree))
In the above algorithm, the major changes from the previous decision tree algorithm are that we select a random feature to split on at each node, and we take the split value to be the average of that feature's value in two randomly chosen rows of the data.
Doesn’t this randomness affect the predictive performance of my tree?
Yes; however, this can be mitigated by bagging - creating an ensemble of models (in this case random decision trees), each trained on a random sample of the data drawn with replacement. This is called a random forest.
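A minimal sketch of that idea, reusing the build_tree above and the query_tree sketch from Part 1; the helper names build_forest and query_forest are ours, and the bootstrap sampling follows the standard bagging recipe rather than anything spelled out in these notes:

def build_forest(data, num_trees=20):
    # bagging: train each random tree on a bootstrap sample (rows drawn with replacement)
    forest = []
    for _ in range(num_trees):
        rows = np.random.randint(data.shape[0], size=data.shape[0])
        forest.append(build_tree(data[rows]))
    return forest

def query_forest(forest, x):
    # average the individual tree predictions to smooth out the randomness
    return np.mean([float(query_tree(tree, x)) for tree in forest])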
Strengths and weaknesses of decision trees
Strengths
- Cost of query - faster to query than KNN but slower than linear regression
- Don’t have to normalize your data
Weaknesses
- Cost of learning - more expensive than creating a KNN or linear regression learner