I recently started using a random forest implementation in Python via scikit-learn's `sklearn.ensemble.RandomForestClassifier`. There is a sample script that I found on Kaggle to classify landcover using random forests (see below) that I am trying to use to hone my skills. I am interested in assessing the results of the random forest classification.
For example, if I were to perform the analysis with `randomForest` in R, I would assess the variable importance with `varImpPlot()` from the `randomForest` package:

```r
require(randomForest)
myrf = randomForest(predictors, response)
varImpPlot(myrf)
```
And to get an idea of the out-of-bag (OOB) estimate of the error rate and the error matrix for the classification, I would simply type `myrf` into the interpreter.
How can I programmatically assess these error metrics using Python?
Note that I am aware there are several potentially useful attributes in the documentation (e.g. `oob_decision_function_`), although I am not sure how to actually apply them.
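To illustrate the kind of thing I am after, here is a minimal sketch of what I understand the scikit-learn equivalents to be — `oob_score_` for the OOB accuracy (so `1 - oob_score_` would be the OOB error rate) and `feature_importances_` as the analogue of `varImpPlot()`. The synthetic data from `make_classification` is just a placeholder, not the Kaggle data:

```python
# Sketch: out-of-bag scoring and variable importance in scikit-learn.
# Synthetic placeholder data; the real script would use the Kaggle frames.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# oob_score=True asks each tree to be evaluated on the samples it did
# not see during bootstrap sampling, mirroring R's OOB error estimate.
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X, y)

print(clf.oob_score_)            # OOB accuracy; 1 - this is the OOB error rate
print(clf.feature_importances_)  # per-feature importances, as in varImpPlot()
```

But I am not certain this is the idiomatic way to do it, or how it relates to `oob_decision_function_`.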
Sample RF Script
```python
import pandas as pd
from sklearn import ensemble

if __name__ == "__main__":
    loc_train = "kaggle_forest\\train.csv"
    loc_test = "kaggle_forest\\test.csv"
    loc_submission = "kaggle_forest\\kaggle.forest.submission.csv"

    df_train = pd.read_csv(loc_train)
    df_test = pd.read_csv(loc_test)

    # Use every column except the target and the row identifier as features
    feature_cols = [col for col in df_train.columns
                    if col not in ['Cover_Type', 'Id']]
    X_train = df_train[feature_cols]
    X_test = df_test[feature_cols]
    y = df_train['Cover_Type']
    test_ids = df_test['Id']

    clf = ensemble.RandomForestClassifier(n_estimators=500, n_jobs=-1)
    clf.fit(X_train, y)

    # Open in text mode ("w", not "wb") so the string writes work in Python 3
    with open(loc_submission, "w") as outfile:
        outfile.write("Id,Cover_Type\n")
        for e, val in enumerate(clf.predict(X_test)):
            outfile.write("%s,%s\n" % (test_ids[e], val))
```
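Since the Kaggle test set has no labels, my best guess for reproducing R's error matrix is to hold out part of the training data and compute a confusion matrix on it with `sklearn.metrics`. Again a hedged sketch on synthetic placeholder data rather than the Kaggle frames:

```python
# Sketch: confusion matrix on a held-out split, the analogue of the
# error matrix that printing `myrf` shows in R.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_informative=5, random_state=0)

# Hold out 25% of the labeled data purely for assessment
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_val)

print(confusion_matrix(y_val, pred))  # rows: true class, cols: predicted class
print(accuracy_score(y_val, pred))
```

Is this the right approach, or should the OOB attributes be used instead of a manual split?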