The classifiers in machine learning packages like liblinear and nltk offer a method show_most_informative_features(), which is really helpful for debugging features:
viagra = None     ok : spam   = 4.5 : 1.0
hello  = True     ok : spam   = 4.5 : 1.0
hello  = None     spam : ok   = 3.3 : 1.0
viagra = True     spam : ok   = 3.3 : 1.0
casino = True     spam : ok   = 2.0 : 1.0
casino = None     ok : spam   = 1.5 : 1.0
My question is whether something similar is implemented for the classifiers in scikit-learn. I searched the documentation but couldn't find anything of the kind.
If there is no such function yet, does somebody know a workaround for how to get those values?
Best Answer
The classifiers themselves do not record feature names; they just see numeric arrays. However, if you extracted your features using a Vectorizer/CountVectorizer/TfidfVectorizer/DictVectorizer, and you are using a linear model (e.g. LinearSVC or Naive Bayes), then you can apply the same trick that the document classification example uses. Example (untested, may contain a bug or two):

This is for multiclass classification; for the binary case, I think you should use clf.coef_[0] only. You may have to sort the class_labels.
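The example code the answer refers to can be sketched roughly as follows. The helper name, the toy feature names, and the hand-made weight matrix are all illustrative; in a real pipeline the names would come from vectorizer.get_feature_names_out() and the weights from the fitted model's clf.coef_ attribute (shape (n_classes, n_features) in the multiclass case).

```python
import numpy as np

def top_features_per_class(feature_names, coef, class_labels, n=10):
    """Return the n highest-weighted feature names for each class.

    feature_names: strings, e.g. from vectorizer.get_feature_names_out()
    coef: an (n_classes, n_features) weight matrix, e.g. clf.coef_
    """
    feature_names = np.asarray(feature_names)
    result = {}
    for i, label in enumerate(class_labels):
        # argsort is ascending, so take the last n indices and reverse
        # them to get the largest coefficients first
        top = np.argsort(coef[i])[-n:][::-1]
        result[label] = list(feature_names[top])
    return result

# Toy demo: rows are classes ("ok", "spam"), columns are features
names = ["casino", "hello", "viagra"]
coef = np.array([[-2.0,  1.5, -3.3],   # weights for class "ok"
                 [ 2.0, -1.5,  3.3]])  # weights for class "spam"
print(top_features_per_class(names, coef, ["ok", "spam"], n=2))
# → {'ok': ['hello', 'casino'], 'spam': ['viagra', 'casino']}
```

For the binary case a linear model exposes a single coefficient row, so you would index clf.coef_[0] directly: large positive weights point toward one class and large negative weights toward the other.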