Schema-on-read or schema-on-write?

I was recently asked whether schema-on-read is superior to schema-on-write, and how that relates to “traditional” storage systems like EMC, NetApp, and Teradata seemingly losing ground to commodity-based storage systems. Here are some semi-random thoughts.

First of all, I think schema-on-read vs. schema-on-write is a fancy way of saying whether the data is stored in an unstructured or a structured way. It all boils down to whether there’s a need to store unaltered data or not. If statistics teaches us anything at all, it is that by creatively selecting a subset of the data, or by tweaking a data sample or the model itself, you can prove a correlation between anything imaginable 😉 Hence, there are clear benefits to keeping data ‘as-is’: without any pre-processing, cleaning, dedup’ing, and so on. That allows you to run different models or apply alternative approximations later.
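To make the distinction a bit more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the sample events, the field names, and the helper functions `write_with_schema` and `read_with_schema` are assumptions, not any particular product's API. The first function fixes a schema at ingest time and throws away whatever doesn't fit; the second keeps the raw lines untouched and applies whatever schema the question at hand requires.

```python
import json

# Hypothetical raw events, exactly as they arrived (one JSON document per line).
RAW_EVENTS = [
    '{"user": "alice", "amount": "19.99", "note": "promo"}',
    '{"user": "bob", "amount": "5", "referrer": "ad-42"}',
    '{"user": "alice", "amount": "oops"}',
]


def parse_amount(value):
    """Return the amount as a float, or None if it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def write_with_schema(lines):
    """Schema-on-write: enforce (user, amount) at ingest; the rest is lost."""
    table = []
    for line in lines:
        record = json.loads(line)
        amount = parse_amount(record.get("amount"))
        if amount is None:
            continue  # rejected at write time and gone for good
        table.append((record["user"], amount))
    return table


def read_with_schema(lines, fields):
    """Schema-on-read: keep raw lines, project fields only when querying."""
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(name) for name in fields)


# Clean and compact, but "note", "referrer" and the bad row no longer exist.
structured = write_with_schema(RAW_EVENTS)

# A question nobody thought of at ingest time can still be asked of raw data.
by_referrer = list(read_with_schema(RAW_EVENTS, ("user", "referrer")))
```

The structured table is cheap to query but can only answer the questions its schema anticipated; the raw lines can answer new questions, at the price of re-parsing everything on each read.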

It might appear that the schema-on-read approach is always superior, but there are plenty of cases where it isn’t. All sorts of scientific, engineering, financial, medical, and accounting systems will keep enjoying the benefits of structured data for years to come. And of course, there are good cases for the opposite, unstructured way of storing data: marketing, social studies, economic modeling (which is utter nonsense, of course, but people still believe in it for some reason), and so on.

I don’t think that “schema-on-write” technology is inherently that much more expensive. In the grand scheme of things, with schema-on-read one might save some on the pre-processing stage by using commodity hardware and open software, but will have to pay more in direct and indirect costs related to more expensive and slower BA & BI solutions.

For all I know, we might be witnessing the endgame for EMC/NetApp & co., but not because of the way they pre-process data before storing it. Their real challenge is the huge change in the software development landscape that has happened over the last 20+ years with GNU, Linux, the ASF, and other free and open software models. No doubt, these companies have well-developed sales channels and established brands, but it is almost impossible to outsell something that anyone can download from the net for the cost of the bandwidth and get up and running in a matter of hours or even faster. And there’s a whole spectrum of such open systems, so you don’t have to lock yourself into any one of them. Now, go and compete with that!