When it comes to the Big Data industry, technological updates are part of the package. At TickSmith, we pride ourselves on using the latest technologies to improve our client experience while fully optimizing our GOLD Platform Modules. Recently, we upgraded our Analytics Solution to move from batch ingestion of data to reading streaming data in near real-time.
We rose to the challenge and implemented an open-source technology called Kafka on one of our clients' platforms. Kafka has been used in production for years, but in the banking and finance industry it's relatively new. We're excited to be exploring new frontiers of technology and innovation, but let's go back to why we needed this change in the first place.
Prior to Kafka, our developers used MapReduce in batch ingestion mode for the Analytics Solution. The batch would run at the end of the day, so data ingestion and processing ran through the night. At the time, batch ingestion was the latest technology, but data was only available on a D+1 basis, that is, the next day.
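For the more technical readers, the end-of-day batch flow can be pictured with a minimal Python sketch. This is purely illustrative, not our actual MapReduce pipeline; the record fields and function names are invented for the example:

```python
# Illustrative sketch of end-of-day batch ingestion (not the real pipeline).
# Records accumulate all day; nothing is processed until the day is over,
# so results only become available the next day (D+1).

def run_batch_day(trades):
    """Stage every record first, then process the whole day at once."""
    staged = []
    for trade in trades:       # records arrive throughout the trading day...
        staged.append(trade)   # ...but are only staged, not processed
    # processing starts only after the last record of the day has landed
    return [t["price"] * t["qty"] for t in staged]

notionals = run_batch_day([
    {"price": 10.0, "qty": 5},
    {"price": 20.0, "qty": 2},
])
print(notionals)  # [50.0, 40.0]
```

The key point is the shape of the loop: no result exists until the entire day's input has been collected.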
This process initially worked on the platform because it was scalable: if you need more processing power, you add more machines. But MapReduce doesn't scale cost-effectively, because its computing power and storage capacity are bound together on the same nodes. To reach the level of cost efficiency we owe our clients, the two capacities must be separate. With Kafka, we can turn the machines off when we don't need them.
The difficulty with scaling MapReduce is that once a new node is added to the cluster, data starts landing on that node. Before turning it off, a developer has to make sure all of that data is stored elsewhere so the machine can be safely removed. MapReduce runs on a traditional Hadoop cluster, with several worker nodes providing HDFS storage, so the cluster accrued server costs 24/7.
To make the switch to Kafka, we had to redesign and adapt our Analytics Solution, as MapReduce is a very different beast from Kafka. The redesign brought awesome results. With Kafka, the platform now starts exactly the number of servers each stage needs and disposes of them as soon as the job is done. Our clients are able to leverage the possibilities of a cloud provider, such as AWS, and rent machines on the fly. Input data and the generated data artefacts are stored in AWS S3 buckets, effectively decoupling compute and storage.
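The per-stage pattern above can be sketched locally in Python, with a thread pool standing in for a group of rented cloud machines. This is a simplified analogy, not how the platform actually provisions AWS servers; the stage names and worker counts are made up:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of per-stage ephemeral compute: a pool of "workers" (threads here,
# standing in for rented machines) is created for one stage and released
# the moment the stage finishes, so nothing sits idle accruing costs.

def run_stage(name, worker_count, task, inputs):
    # entering the `with` block maps to starting machines for the stage...
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        results = list(pool.map(task, inputs))
    # ...and leaving it maps to shutting them all down immediately
    return results

raw = [3, 1, 2]
cleaned = run_stage("normalize", worker_count=2, task=lambda x: x * 10, inputs=raw)
enriched = run_stage("enrich", worker_count=4, task=lambda x: x + 1, inputs=cleaned)
print(enriched)  # [31, 11, 21]
```

Because storage lives in S3 rather than on the workers themselves, tearing a pool down loses nothing; the next stage simply reads the previous stage's output back in.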
For all the non-techies out there, let’s illustrate the improvement from MapReduce to Kafka. Before with MapReduce: It would be as if, at the end of every market day, someone brings you a huge book describing everything that happened on the market. It’s convenient because you just have to wait for that person to come to your door, deliver the book and leave. Moreover, because the book is printed, you can read any page in any order you want.
After with Kafka: Your door is left wide open and someone outside is shouting out what's happening on the market. It's even more convenient because you get the information in near real-time, but you have to keep listening, and as they speak, you have to type out every single detail they shout.
That's the magic of Kafka! With Kafka, you can process information intra-day, directly out of a stream, instead of at the end of the day. Our Analytics Solution now has the capability to do both batch ingestion and streaming, and to switch back and forth between them, but Kafka looks like the future-proof way to go. The results? Our client loved what the improved platform could do. They could see that this capability was more efficient and would save them money.
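The shouting-at-the-door analogy maps to a consumer loop. Here is a minimal Python sketch in which a generator stands in for a Kafka topic; the real platform uses an actual Kafka consumer client, and the record fields and function names below are invented for illustration:

```python
# Sketch of stream processing: the handler runs as each record arrives,
# instead of waiting for an end-of-day batch. A Python generator stands in
# for a Kafka topic here (this is not the real Kafka client API).

def stream_of_trades():
    """Pretend feed: records trickle in during the trading day."""
    yield {"price": 10.0, "qty": 5}
    yield {"price": 20.0, "qty": 2}

def consume(stream, handler):
    """Process each record the moment it is read off the stream."""
    results = []
    for record in stream:
        results.append(handler(record))  # available intra-day, record by record
    return results

notionals = consume(stream_of_trades(), lambda t: t["price"] * t["qty"])
print(notionals)  # [50.0, 40.0]
```

Compared with the batch shape, the output grows continuously while the "day" is still in progress, which is exactly what near real-time availability means.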
In the Big Data industry, we have a lot of decisions to make when it comes to future-proofing and enhancing our technology. So how do we make the right ones? Through researching the benefits and pitfalls, then jumping in with a strong and dedicated work ethic. Our knowledge and confidence come from several years of experience. Technology is ever advancing and evolving, and our GOLD Platform must evolve too, or we risk becoming extinct.