Documenting Your Schemas

Last week I spent some time reading about using message queues like RabbitMQ and Apache Kafka as stream data systems. In the process I found an awesome article on why LinkedIn built and open sourced Kafka. It has some excellent lessons that apply to any event stream, and I encourage anyone working in this space to read the key recommendations that LinkedIn makes to other companies adopting Kafka.

The biggest of these, in my opinion, is the value of a good data schema with clearly defined fields. I understand that while a feature or domain object is under development, the need to iterate rapidly can leave no time to lay out the data format. And agile really does stress working software over documentation. But is a receiving team anxiously crossing its fingers, hoping it picked the right field to parse, really better than budgeting a bit of extra time to document the schema of your event stream object?

The key paragraph that made me snap out of my schema-less way of thinking was this:

“…invariably you end up with a sort of informal plain english “schema” passed around between users of the data via wiki or over email which is then promptly lost or obsoleted by changes…We found this lack of documentation lead to people guessing as to the meaning of fields, which inevitably leads to bugs and incorrect data analysis when these guesses are wrong.”

I can relate to this after working on multiple projects that did not clearly define their data format: MongoDB collections full of objects and fields that make no sense in context or are difficult to recreate, and models that don’t quite match up between teams. My naive suggestion, after experiences with RabbitMQ and NoSQL databases, is to enforce some sort of documentation on your data schemas if they are going to hit any broadcast stream that others may want to receive. LinkedIn went one step further:

“When someone wanted to create a new data stream, or evolve the schema for an existing one, the schema for that stream would undergo a quick review by a group of people who cared about data quality.”

I acknowledge that this might not be an appropriate approach for small companies that already have enough work as it is. But I’m assuming (and I’m pretty sure Kreps, the article’s author, did as well) that if you’re large enough to invest in a federated data stream for your company, you’re large enough to devote some budget toward this task.
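
To make the documentation suggestion a bit more concrete, here is a minimal sketch in Python of what a documented event schema could look like when the documentation lives next to the fields themselves instead of on a wiki page. Everything in it is hypothetical and only for illustration: the PageViewEvent name, its fields, and the describe_schema and validate helpers are my own assumptions, not anything LinkedIn or Kafka prescribes. In practice you would more likely reach for a schema format with its own tooling.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime, timezone


@dataclass(frozen=True)
class PageViewEvent:
    """Hypothetical event: one member viewed one page."""

    # Each field carries its own description in the dataclass metadata,
    # so the documentation cannot drift away from the definition.
    member_id: int = field(metadata={"doc": "Internal numeric id of the viewer."})
    page_key: str = field(metadata={"doc": "Stable key of the page, not its URL."})
    viewed_at: datetime = field(metadata={"doc": "Time of the view, in UTC."})


def describe_schema(cls) -> str:
    """Render a plain-text description of a schema class for consumers."""
    lines = [f"{cls.__name__}: {cls.__doc__.strip()}"]
    for f in fields(cls):
        lines.append(f"  {f.name} ({f.type.__name__}): {f.metadata['doc']}")
    return "\n".join(lines)


def validate(event) -> None:
    """Raise TypeError if any field value does not match its declared type."""
    for f in fields(event):
        value = getattr(event, f.name)
        if not isinstance(value, f.type):
            raise TypeError(
                f"{f.name} should be {f.type.__name__}, "
                f"got {type(value).__name__}"
            )


if __name__ == "__main__":
    event = PageViewEvent(
        member_id=42,
        page_key="profile",
        viewed_at=datetime.now(timezone.utc),
    )
    validate(event)  # check the event before publishing it to the stream
    print(describe_schema(PageViewEvent))
```

The point is not this particular helper but the habit: a receiving team can read describe_schema output (or the class itself) instead of guessing what a field means, and a producer that calls validate before publishing fails loudly instead of sending a malformed event.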
