Big Data Architecture Best Practices
Synchronous vs Async pipelines
Synchronous big data pipelines are a series of data processing components that are triggered when a user invokes an action on a screen, e.g. clicking a button. The user typically waits until a response is received with the results. In contrast, in an asynchronous implementation the user initiates the execution of the pipeline and then goes on their merry way until the pipeline notifies them that the task is complete.
Asynchronous pipelines are a best practice because they can be designed for the average load of the system, whereas synchronous pipelines must be sized for the peak load. The asynchronous design therefore aims to maximize asset utilization and minimize costs.
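A minimal sketch of the asynchronous pattern, using only the Python standard library. All names here (submit_job, run_pipeline, notifications) are hypothetical, and the in-memory list stands in for whatever notification channel a real system would use. The point is only that submitting a job returns immediately and the result arrives later.

```python
import queue
import threading

jobs = queue.Queue()             # work waiting to be processed
notifications = []               # stand-in for email, push, or webhook delivery
notifications_lock = threading.Lock()

def run_pipeline(payload):
    """Placeholder for the real multi-step processing (assumed)."""
    return sum(payload)

def worker():
    """Drain the job queue at the worker's own pace."""
    while True:
        item = jobs.get()
        if item is None:         # sentinel value: shut the worker down
            jobs.task_done()
            break
        job_id, payload = item
        result = run_pipeline(payload)
        with notifications_lock:
            notifications.append((job_id, result))
        jobs.task_done()

def submit_job(job_id, payload):
    """Enqueue the job and return at once; the caller does not wait."""
    jobs.put((job_id, payload))
    return job_id

background = threading.Thread(target=worker, daemon=True)
background.start()

submit_job("job-1", [1, 2, 3])
submit_job("job-2", [10, 20])

# Only for this demo: stop the worker and wait so the results can be inspected.
jobs.put(None)
background.join()
print(notifications)
```

Because the worker pulls jobs at a steady rate, a burst of submissions lengthens the queue rather than forcing the system to be sized for that burst.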
Buffering queues
Wherever possible, decouple the producers of data from its consumers. Typically this is done through queues that buffer data for some time. This decoupling lets producers and consumers work at their own pace, and it also allows the data to be filtered so that consumers can select only the data they want.
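As a rough illustration, the sketch below places a bounded in-memory queue between one producer and one consumer, with the consumer keeping only the event types it cares about. The event shape and the names producer, consumer, and wanted_types are assumptions made for this example; a production system would more likely use a message broker than an in-process queue.

```python
import queue
import threading

# Bounded buffer: the producer blocks only if the consumer falls far behind.
buffer = queue.Queue(maxsize=100)
DONE = object()                  # sentinel marking the end of the stream

def producer(events):
    """Write events as fast as they arrive, unaware of who reads them."""
    for event in events:
        buffer.put(event)
    buffer.put(DONE)

def consumer(wanted_types, out):
    """Read at its own pace and keep only the events it is interested in."""
    while True:
        event = buffer.get()
        if event is DONE:
            break
        if event["type"] in wanted_types:
            out.append(event)

events = [
    {"type": "click", "id": 1},
    {"type": "view", "id": 2},
    {"type": "click", "id": 3},
]
selected = []

p = threading.Thread(target=producer, args=(events,))
c = threading.Thread(target=consumer, args=({"click"}, selected))
p.start()
c.start()
p.join()
c.join()
print(selected)
```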
Stateless wherever possible
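Design components to be stateless wherever possible. This enables horizontal scalability, because any instance can handle any request and instances can be added or removed freely.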
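One way to picture this: the handler below keeps nothing between calls, and all state lives in an external store that is passed in. ExternalStore is a hypothetical stand-in for a shared database or cache, and the round-robin loop merely simulates requests landing on different instances.

```python
class ExternalStore:
    """Stand-in for a shared database or cache (hypothetical)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=0):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


def handle_event(store, event):
    """Stateless handler: everything it needs arrives as arguments.

    A real shared store would need an atomic increment here; this sketch
    runs single-threaded, so a plain read-then-write is enough.
    """
    total = store.get(event["user"]) + event["amount"]
    store.put(event["user"], total)
    return total


store = ExternalStore()
instances = [handle_event, handle_event, handle_event]  # interchangeable workers

events = [
    {"user": "alice", "amount": 5},
    {"user": "bob", "amount": 2},
    {"user": "alice", "amount": 7},
    {"user": "alice", "amount": 1},
]

# Route each event to a different instance; the totals are still correct
# because no instance remembers anything between calls.
for i, event in enumerate(events):
    instances[i % len(instances)](store, event)

print(store.get("alice"), store.get("bob"))
```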
Time to Live
It’s important to consider how long the data in question is valid and to exclude from processing any data that is no longer valid. One example of this is the data retention settings in Kafka.
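At the application level, the same idea can be sketched as a freshness check applied before any processing happens. The five-minute window, the record layout, and the function names are assumptions for illustration. In Kafka itself, retention is configured on the broker or topic (for example via the retention.ms topic setting) rather than in consumer code.

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(minutes=5)       # assumed validity window for a record

def still_valid(record, now, ttl=TTL):
    """A record is valid while its age does not exceed the TTL."""
    return now - record["created_at"] <= ttl

def process_fresh(records, now):
    """Skip expired records so no work is spent on stale data."""
    return [r["value"] for r in records if still_valid(r, now)]

# A fixed "now" keeps the example deterministic.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
records = [
    {"value": "a", "created_at": now - timedelta(minutes=1)},
    {"value": "b", "created_at": now - timedelta(minutes=10)},
    {"value": "c", "created_at": now - timedelta(minutes=5)},
]

fresh = process_fresh(records, now)
print(fresh)
```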
Process and deliver what the customer needs
One of the key design elements on both the macro and the micro level is processing only the data that is consumed, and only when it is consumed. An interesting example of this I saw recently was a stock ticker feed that was fed into Kafka. Subscribers typically monitored only a few companies’ feeds. The overall stock tickers were fed into separate topics, one per company, and consumers subscribed only to the companies they were interested in. Any processing on that data was deferred until the user pulled it, which removed the load of processing innumerable other companies.
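The ticker example can be approximated in a few lines, assuming a hypothetical TickerFeed class that stores raw ticks per symbol and computes summary values only when a subscriber pulls them. This is an in-memory illustration of the idea, not Kafka's API, and the symbols and prices are made up.

```python
from collections import defaultdict

class TickerFeed:
    """Per-symbol topics with processing deferred until a pull (hypothetical)."""

    def __init__(self):
        self._topics = defaultdict(list)
        self.ticks_processed = 0         # counts the work actually performed

    def publish(self, symbol, price):
        # Writing is cheap: store the raw tick and do nothing else.
        self._topics[symbol].append(price)

    def pull(self, symbol):
        # Processing happens here, and only for the requested symbol.
        prices = self._topics.get(symbol, [])
        if not prices:
            return None
        self.ticks_processed += len(prices)
        return {
            "symbol": symbol,
            "last": prices[-1],
            "avg": sum(prices) / len(prices),
        }

feed = TickerFeed()
feed.publish("AAA", 10.0)
feed.publish("AAA", 12.0)
feed.publish("BBB", 5.0)
feed.publish("CCC", 7.0)

summary = feed.pull("AAA")               # the subscriber follows only AAA
print(summary, feed.ticks_processed)
```

Four ticks were published, but only the two belonging to the requested symbol were ever processed.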
On a micro level, this is also how Apache Spark works: transformations on an RDD are deferred until an action is invoked, and the processing is optimized at that time.
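Plain Python generators give a loose analogy for this lazy behavior. The sketch below is not Spark code: parse and only_even are hypothetical functions playing the role of transformations, and the call to list() plays the role of the action that finally triggers the work.

```python
log = []                         # records when work actually happens

def parse(lines):
    """Lazy step: converts one line at a time, only when asked."""
    for line in lines:
        log.append(f"parse {line}")
        yield int(line)

def only_even(numbers):
    """Lazy step: filters values as they flow through."""
    for n in numbers:
        if n % 2 == 0:
            yield n

# Building the pipeline performs no work yet.
pipeline = only_even(parse(["1", "2", "3", "4"]))
steps_before = len(log)

# Materializing the result is what triggers the processing.
result = list(pipeline)
steps_after = len(log)

print(steps_before, steps_after, result)
```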