Random Forest
Random forest is an ensemble method which takes a subset of observations and a subset of variables to build a decision tree. It builds multiple such decision trees and aggregates them to get a more accurate and stable prediction. This follows from the intuition that a majority vote from a panel of independent judges gives a better final decision than even the best single judge.
We generally think of a random forest as a black box which takes in input and gives out predictions, without worrying too much about the calculations going on at the back end. This black box itself has a few levers we can play with. Each of these levers has some effect on either the performance of the model or the resource-time balance.
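As a quick illustration, here is a minimal sketch of using that black box end to end. The use of scikit-learn and a synthetic dataset here are my assumptions, standing in for whatever library and data you actually have; the later snippets in this section reuse this X and y:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for your own X (features) and y (target)
X, y = make_regression(n_samples=1000, n_features=20, random_state=1)

# The black box: feed in inputs, get predictions out
model = RandomForestRegressor(random_state=1)
model.fit(X, y)
predictions = model.predict(X[:5])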
Parameters/levers to tune Random Forest
Parameters in random forest either increase the predictive power of the model or make it easier to train the model.
1. Features which make the predictions of the model better

There are primarily 3 features which can be tuned to improve the predictive power of the model:

a. max_features:

This is the maximum number of features Random Forest is allowed to try in an individual tree. There are multiple options available in Python to assign the maximum features. Here are a few of them:

- Auto/None: This simply takes all the features which make sense in every tree. Here we do not put any restriction on the individual tree.
- sqrt: This option takes the square root of the total number of features for an individual run. For instance, if the total number of variables is 100, each individual tree can consider only 10 of them. "log2" is another similar option for max_features.
How does "max_features" impact performance and speed?
Increasing max_features generally improves the performance of the model, since at each node we now have a higher number of options to consider. However, this is not guaranteed, because it also decreases the diversity of the individual trees, which is what a random forest relies on. What is certain is that increasing max_features decreases the speed of the algorithm. Hence, you need to strike the right balance and choose an optimal max_features.
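To see this trade-off concretely, here is a small sketch (assuming scikit-learn and the X, y defined above) that compares a few max_features settings by cross-validation score:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# None considers every feature at each split; "sqrt" and "log2" restrict it
for mf in [None, "sqrt", "log2"]:
    model = RandomForestRegressor(n_estimators=100, max_features=mf, random_state=1)
    print(mf, cross_val_score(model, X, y, cv=3).mean())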
b. n_estimators:
This is the number of trees you want to build before taking the maximum vote or the average of predictions. A higher number of trees gives you better performance but makes your code slower. You should choose as high a value as your processor can handle, because this makes your predictions stronger and more stable.
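One way to choose this value is to watch the out-of-bag score plateau as trees are added. Here is a rough sketch, again assuming scikit-learn and the X, y from above:

from sklearn.ensemble import RandomForestRegressor

# The OOB score typically rises quickly and then flattens out
for n in [10, 50, 100, 500]:
    model = RandomForestRegressor(n_estimators=n, oob_score=True, random_state=1)
    model.fit(X, y)
    print(n, model.oob_score_)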
c. min_samples_leaf:
If you have built a decision tree before, you can appreciate the importance of the minimum sample leaf size. A leaf is the end node of a decision tree. A smaller leaf makes the model more prone to capturing noise in the training data. Generally, I prefer a minimum leaf size of more than 50. However, you should try multiple leaf sizes to find the most optimal one for your use case.
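A convenient way to try multiple leaf sizes is a small grid search. The following is a sketch assuming scikit-learn's GridSearchCV and the same X, y; the candidate values are arbitrary:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Search a handful of leaf sizes and keep the best by cross-validation
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=1),
    param_grid={"min_samples_leaf": [1, 10, 50, 100]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)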
2. Features which will make the model training easier

There are a few attributes which have a direct impact on model training speed. Following are the key parameters which you can tune for model speed:
a. n_jobs:
This parameter tells the engine how many processors it is allowed to use. A value of "-1" means there is no restriction, whereas a value of "1" means it can use only one processor. Here is a simple experiment you can do with Python to check this metric:
%%timeit
model = RandomForestRegressor(n_estimators=100, oob_score=True, n_jobs=1, random_state=1)
model.fit(X, y)

Output: 1 loop, best of 3: 1.7 sec per loop
%%timeit
model = RandomForestRegressor(n_estimators=100, oob_score=True, n_jobs=-1, random_state=1)
model.fit(X, y)

Output: 1 loop, best of 3: 1.1 sec per loop
"%timeit" is an awsum function which runs a function multiple times and gives the fastest loop run time.This comes out very handy while scaling up a particular function from prototype to final dataset.
b. random_state:
This parameter makes a solution easy to replicate. A definite value of random_state will always produce the same results when given the same parameters and the same training data.
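For example (a sketch under the same assumptions as above), two forests built with the same random_state, parameters, and data give identical predictions:

from sklearn.ensemble import RandomForestRegressor

# Same parameters, same data, same seed -> identical models
model_a = RandomForestRegressor(n_estimators=100, random_state=50).fit(X, y)
model_b = RandomForestRegressor(n_estimators=100, random_state=50).fit(X, y)
print((model_a.predict(X) == model_b.predict(X)).all())  # True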
c. oob_score:
This is a random forest cross-validation method. It is very similar to the leave-one-out validation technique, but much faster. This method simply tags, for every observation, the trees which did not use it during training. It then finds a maximum vote score for every observation based only on those trees.
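In scikit-learn this estimate is exposed as the oob_score_ attribute after fitting. A minimal sketch, reusing the X, y from above:

from sklearn.ensemble import RandomForestRegressor

# oob_score=True scores each observation using only the trees
# that did not see it during training
model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1)
model.fit(X, y)
print(model.oob_score_)  # R^2 estimated from out-of-bag samples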
Here is an example of using all these parameters in a single function:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, oob_score=True, n_jobs=-1,
                              random_state=50, max_features="auto",
                              min_samples_leaf=50)
model.fit(X, y)
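Once fitted, you can inspect the out-of-bag estimate and make predictions directly; a brief usage note:

print(model.oob_score_)         # out-of-bag R^2 estimate
predictions = model.predict(X)  # predictions on new or held-out data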