Can anyone point me to a reference or provide a high level overview of how companies like Facebook, Yahoo, Google, etc al perform the large scale (e.g. multi-TB range) log analysis that they do for operations and especially web analytics?
Focusing on web analytics in particular, I'm interested in two closely-related aspects: query performance and data storage.
I know that the general approach is to use map reduce to distribute each query over a cluster (e.g. using Hadoop). However, what's the most efficient storage format to use? This is log data, so we can assume each event has a time stamp, and that in general the data is structured and not sparse. Most web analytics queries involve analyzing slices of data between two arbitrary timestamps and retrieving aggregate statistics or anomalies in that data.
Would a column-oriented DB like Big Table (or HBase) be an efficient way to store, and more importantly, query such data? Does the fact that you're selecting a subset of rows (based on timestamp) work against the basic premise of this type of storage? Would it be better to store it as unstructured data, eg. a reverse index?
Unfortunately there is no one size fits all answer.
I am currently using Cascading, Hadoop, S3, and Aster Data to process 100's Gigs a day through a staged pipeline inside of AWS.
Aster Data is used for the queries and reporting since it provides a SQL interface to the massive data sets cleaned and parsed by Cascading processes on Hadoop. Using the Cascading JDBC interfaces, loading Aster Data is quite a trivial process.
Keep in mind tools like HBase and Hypertable are Key/Value stores, so don't do ad-hoc queries and joins without the help of a MapReduce/Cascading app to perform the joins out of band, which is a very useful pattern.
in full disclosure, I am a developer on the Cascading project.