Python – How to assess Random Forests classifier performance

machine-learning, python, random-forest, scikit-learn

I recently started using a random forest implementation in Python via scikit-learn's sklearn.ensemble.RandomForestClassifier. To hone my skills, I am working through a sample script I found on Kaggle that classifies land cover using random forests (see below). I am interested in assessing the results of the random forest classification.

For example, if I were to perform the analysis using randomForest in R, I would assess the variable importance with varImpPlot() from the randomForest package:

require(randomForest)
...
myrf <- randomForest(predictors, response)
varImpPlot(myrf)

And to get an idea of the out-of-bag (OOB) estimate of the error rate and the confusion (error) matrix for the classification, I would simply type 'myrf' into the interpreter.

How can I programmatically assess these error metrics using Python?

Note that I am aware there are several potentially useful attributes in the documentation (e.g. feature_importances_, oob_score_, and oob_decision_function_), although I am not sure how to actually apply them.


Sample RF Script

import pandas as pd
from sklearn import ensemble

if __name__ == "__main__":
  loc_train = "kaggle_forest\\train.csv"
  loc_test = "kaggle_forest\\test.csv"
  loc_submission = "kaggle_forest\\kaggle.forest.submission.csv"

  df_train = pd.read_csv(loc_train)
  df_test = pd.read_csv(loc_test)

  feature_cols = [col for col in df_train.columns if col not in ['Cover_Type','Id']]

  X_train = df_train[feature_cols]
  X_test = df_test[feature_cols]
  y = df_train['Cover_Type']
  test_ids = df_test['Id']

  clf = ensemble.RandomForestClassifier(n_estimators=500, n_jobs=-1)

  clf.fit(X_train, y)

  # write the Kaggle submission file ("w" rather than "wb", since we are writing text)
  with open(loc_submission, "w") as outfile:
    outfile.write("Id,Cover_Type\n")
    for test_id, val in zip(test_ids, clf.predict(X_test)):
      outfile.write("%s,%s\n" % (test_id, val))
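
For the attributes mentioned in the question (feature_importances_, oob_score_, and oob_decision_function_), here is a minimal sketch of how they could be inspected, assuming the X_train, y, and feature_cols variables from the script above (the clf_oob and importances names are just for illustration); note that oob_score=True has to be passed to the constructor for the OOB attributes to be populated:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# refit with OOB scoring enabled; oob_score_ and oob_decision_function_
# are only populated when oob_score=True is set at construction time
clf_oob = RandomForestClassifier(n_estimators=500, oob_score=True, n_jobs=-1)
clf_oob.fit(X_train, y)

# out-of-bag accuracy, the analogue of the OOB error rate reported by R's
# randomForest (OOB error rate = 1 - oob_score_)
print("OOB accuracy: %.4f" % clf_oob.oob_score_)

# per-sample OOB class-membership probabilities (one column per class)
print(clf_oob.oob_decision_function_[:5])

# feature importances paired with column names and sorted, similar in spirit to varImpPlot()
importances = pd.Series(clf_oob.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False).head(10))

Plotting the sorted importances, e.g. with importances.sort_values().plot(kind='barh'), gives a rough Python counterpart to varImpPlot().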

Best Solution

After training, if you have held-out test data and labels, you can check the overall accuracy and generate an ROC plot / AUC score (for a binary problem) via:

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# overall accuracy on the held-out set
acc = clf.score(X_test, Y_test)
print("Accuracy: %.4f" % acc)

# get ROC/AUC info (probability of the positive class in a binary problem)
Y_score = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(Y_test, Y_score)
roc_auc = auc(fpr, tpr)

# make the plot
plt.figure(figsize=(10, 10))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid(True)
plt.plot(fpr, tpr, label='AUC = {0:.3f}'.format(roc_auc))
plt.legend(loc="lower right", shadow=True, fancybox=True)
plt.show()
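
To get the equivalent of R's confusion (error) matrix and per-class precision/recall, a possible sketch, assuming the fitted clf and the held-out X_test / Y_test from above (Y_pred is just an illustrative name); unlike the binary ROC curve, these also work for the multi-class Cover_Type problem:

from sklearn.metrics import confusion_matrix, classification_report

# predicted class labels for the held-out set
Y_pred = clf.predict(X_test)

# confusion matrix: rows are true classes, columns are predicted classes
print(confusion_matrix(Y_test, Y_pred))

# per-class precision, recall, and F1 score
print(classification_report(Y_test, Y_pred))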