Hadoop – Is this a suitable (or possible) use of HBase


I want to use HBase as a store where I can push in a few million entries of the format {document => {term => weight}}, e.g. "Insert term X into document Y with weight Z", and then issue a command like "Select the top 1000 terms for this document" or "Select the top 1000 terms for each document". This works in my current MySQL implementation, but perhaps the domain is better suited to HBase. I note that HBase and BigTable are used for full-text indexing, which is a similar problem domain.

You can tell that I've not done more than read a few pages on HBase, but I hope you understand the gist of my question. It's related to this question.

Possible barriers might include HBase not allowing queries with the equivalent of a LIMIT clause. Given that I want to query by weight, I would want to associate {weight => term}, which would be problematic for two terms with the same weight (I assume that HBase only allows unique keys). Alternatively, I would have to store a collection of terms for a given weight, but this would make it hard to cap the number of returned terms precisely.
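To make the key problem concrete, here is a minimal sketch in plain Python (not the HBase API) of one possible way around it: fold the document, an inverted weight, and the term into a single sorted key, so that two terms with the same weight still get distinct keys. Everything here is an assumption for illustration: the names `row_key`, `SortedStore`, `MAX_WEIGHT` and `SEP` are hypothetical, weights are assumed to be integers between 0 and `MAX_WEIGHT`, and the sorted list only stands in for a store that keeps keys in lexicographic order.

```python
import bisect

MAX_WEIGHT = 10**9  # assumption: integer weights in the range 0..MAX_WEIGHT
SEP = "\x00"        # assumption: never appears in document or term names


def row_key(document, term, weight):
    """Build a key that sorts by document, then by descending weight, then by term."""
    inverted = MAX_WEIGHT - weight  # larger weights produce smaller numbers
    return f"{document}{SEP}{inverted:010d}{SEP}{term}"


class SortedStore:
    """In-memory stand-in for a store that keeps its keys sorted."""

    def __init__(self):
        self._keys = []

    def put(self, document, term, weight):
        key = row_key(document, term, weight)
        i = bisect.bisect_left(self._keys, key)
        # Insert only if this exact key is not already present.
        if i == len(self._keys) or self._keys[i] != key:
            self._keys.insert(i, key)

    def top_terms(self, document, limit):
        """Scan forward from the document prefix and stop after `limit` keys."""
        prefix = document + SEP
        i = bisect.bisect_left(self._keys, prefix)
        results = []
        while (i < len(self._keys) and len(results) < limit
               and self._keys[i].startswith(prefix)):
            _, inverted, term = self._keys[i].split(SEP)
            results.append((term, MAX_WEIGHT - int(inverted)))
            i += 1
        return results


store = SortedStore()
store.put("Y", "apple", 5)
store.put("Y", "pear", 5)   # same weight as "apple", still a distinct key
store.put("Y", "fig", 9)
store.put("Z", "kiwi", 7)
print(store.top_terms("Y", 2))
```

In this sketch the limit is applied by simply stopping the scan early, and ties on weight are ordered by term. Changing a term's weight would mean removing the old key first, which the sketch does not handle.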

Best Solution

Simple answer: yes.

More complex answer: right now, these "NoSQL" datastores each implement their own programming interface and, as the name "NoSQL" implies, they are not SQL based. So be prepared for some coding, though none of it is difficult. Mostly these datastores are just name-value pair stores, accessed via REST or SOAP (HBase also has a concept of Column Families). What they do lend themselves to, though, is Map Reduce, a very interesting approach to querying and well worth reading up on.
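As a rough illustration of how the "top terms for each document" query maps onto the Map Reduce idea, here is a minimal sketch in plain Python. It only imitates the map, group and reduce phases in memory; it does not use Hadoop or HBase, and the function names (`map_entry`, `reduce_top_terms`, `top_terms_per_document`) and the sample entries are made up for the example.

```python
import heapq
from collections import defaultdict


def map_entry(entry):
    """Map phase: emit (document, (term, weight)) for one stored entry."""
    document, term, weight = entry
    yield document, (term, weight)


def reduce_top_terms(document, pairs, n):
    """Reduce phase: keep the n heaviest terms seen for one document."""
    return document, heapq.nlargest(n, pairs, key=lambda pair: pair[1])


def top_terms_per_document(entries, n):
    # Group step: collect the mapped values under their document key.
    grouped = defaultdict(list)
    for entry in entries:
        for document, pair in map_entry(entry):
            grouped[document].append(pair)
    return dict(reduce_top_terms(doc, pairs, n) for doc, pairs in grouped.items())


entries = [
    ("Y", "apple", 0.5),
    ("Y", "fig", 0.9),
    ("Y", "pear", 0.1),
    ("Z", "kiwi", 0.7),
]
result = top_terms_per_document(entries, 2)
print(result)
```

In a real cluster the grouping would be done by the framework across machines; the point here is only that the query splits cleanly into a per-entry map step and a per-document reduce step.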
