Mysql – Is HBase meaningful if it’s not running in a distributed environment

bigtablehadoophbaseMySQL

I'm building an index of data, which will entail storing lots of triplets in the form (document, term, weight). I will be storing up to a few million such rows. Currently I'm doing this in MySQL as a simple table. I'm storing the document and term identifiers as string values than foreign keys to other tables. I'm re-writing the software and looking for better ways of storing the data.

Looking at the way HBase works, this seems to fit the schema rather well. Instead of storing lots of triplets, I could map document to {term => weight}.

I'm doing this on a single node, so I don't care about distributed nodes etc. Should I just stick with MySQL because it works, or would it be wise to try HBase? I see that Lucene uses it for full-text indexing (which is analogous to what I'm doing). My question is really how would a single HBase node compare with a single MySQL node? I'm coming from Scala, so might a direct Java API have an edge over JDBC and MySQL parsing etc each query?

My primary concern is insertion speed, as that has been the bottleneck previously. After processing, I will probably end up putting the data back into MySQL for live-querying because I need to do some calculations which are better done within MySQL.

I will try prototyping both, but I'm sure the community can give me some valuable insight into this.

Best Answer

Use the right tool for the job.

There are a lot of anti-RDBMSs or BASE systems (Basically Available, Soft State, Eventually consistent), as opposed to ACID (Atomicity, Consistency, Isolation, Durability) to choose from here and here.

I've used traditional RDBMSs and though you can store CLOBs/BLOBs, they do not have built-in indexes customized specifically for searching these objects.

You want to do most of the work (calculating the weighted frequency for each tuple found) when inserting a document.

You might also want to do some work scoring the usefulness of each (documentId,searchWord) pair after each search.

That way you can give better and better searches each time.

You also want to store a score or weight for each search and weighted scores for similarity to other searches.

It's likely that some searches are more common than others and that the users are not phrasing their search query correctly though they mean to do a common search.

Inserting a document should also cause some change to the search weight indexes.

The more I think about it, the more complex the solution becomes. You have to start with a good design first. The more factors your design anticipates, the better the outcome.