Although this blog is about Kafka, I would like to explain how and why we came across Kafka.
Well, in today’s world the most important thing in terms of computing is DATA.
Data is information translated into a form that can be moved and processed. With increasing processing power and ever-smaller chipsets, the amount of data being generated keeps growing, and extracting meaning from such large volumes of data has been difficult for organizations.
As it is rightly said, “Necessity is the mother of invention”: when the people at LinkedIn faced the problem of managing the large amount of data their website was producing, they came up with a powerful data streaming platform known as “KAFKA”.
At Zenlabs we have been building some amazing stuff, and one of them is the FACE RECOGNITION SYSTEM.
Persona Detector is a facial recognition system that enables video surveillance and detects unknown people on a premises. It is built using the deep learning library dlib and a machine learning model that uses the camera feed to capture face points and identify a person accurately.
This system was built on a microservices architecture.
But this architecture had one disadvantage: WE WERE LOSING FRAMES. I’ll explain with an example.
Suppose my model takes 0.5 seconds to process an image and the FPS (frames per second) of my camera is 10. My camera would then produce 10 frames every second, but my model is capable of processing only two frames per second.
Then, what about the remaining 8 frames? Yeah, we were losing them.
Well, the numbers above are hypothetical; the real model was much faster and the rate of frame loss was lower. But the main point stands: we were losing frames, and a system with the potential to be a smart surveillance system shouldn’t lose any data.
So, we came across Kafka – a distributed streaming platform which helped us reach our goal of a smart surveillance system.
1) WHAT IS KAFKA?
Apache Kafka is a distributed streaming platform capable of handling trillions of events a day. Kafka was open-sourced by LinkedIn in 2011 and graduated from the Apache Incubator in 2012. It aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is a massively scalable publish-subscribe message queue designed as a distributed transaction log. It is based on a commit log, and it allows users to subscribe to it and publish data to any number of systems or real-time applications. It is written in Java and Scala.
Kafka consists of:
- Producers: The application which produces any kind of data. In this scenario, “Agent” is our producer.
- Consumers: The application which consumes the data produced by Producers.
- Zookeeper: ZooKeeper is the heart of the Kafka architecture. Kafka uses it to store cluster metadata, coordinate brokers and consumers, and elect the leader broker for each topic-partition pair.
- Brokers: A Kafka cluster consists of multiple Kafka brokers, each with a unique id. They hold the topic log partitions. A client can connect to the whole cluster by connecting to any one of the brokers. Depending on the application/use-case, a Kafka cluster can have anywhere from one broker to 1000+. Each broker holds information about the topics and partitions in the cluster.
- Topic: A topic is a named stream of records. Kafka appends each record from a producer to the end of the topic log the producer specifies.
- Partition: To obtain high performance, Kafka distributes a topic’s log partitions across different nodes in the cluster, which helps in writing data quickly.
2) HOW WE SOLVED OUR PROBLEM USING KAFKA
We built a Kafka cluster with 5 brokers, with the Agent as the producer and 5 consumer applications that fetched the frames from Kafka and analyzed them by making requests to the server asynchronously over REST APIs. (How the number of consumers required is calculated, I’ll explain later in the blog.) So, the architecture now looked something like this:
3) CODE SNIPPETS
We start the ZooKeeper server and then the brokers on the machine on which Kafka is set up.
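For reference, with a stock Kafka distribution these two daemons are typically started with the bundled scripts (paths assume you are inside the Kafka installation directory; the config files shown are the defaults shipped with Kafka):

```shell
# Start ZooKeeper first, then each broker
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
```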
First, we create a Kafka topic, assigning the partitions and replicas for the topic in the Kafka ecosystem.
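The creation command was originally shown as an image; a typical invocation looks like the following (the topic name `frames`, 5 partitions, and a replication factor of 3 are illustrative assumptions, not necessarily the values we used):

```shell
bin/kafka-topics.sh --create \
  --zookeeper localhost:2181 \
  --replication-factor 3 \
  --partitions 5 \
  --topic frames
```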
We can check the topic details using the following command:
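A describe command along these lines does the job (topic name assumed as above):

```shell
bin/kafka-topics.sh --describe \
  --zookeeper localhost:2181 \
  --topic frames
```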
Output of above command shows the partitions assigned:
We used Python as the programming language, so we used the pykafka library to connect to the Kafka ecosystem.
The above code snippet shows the Producer class, which connects to the Kafka brokers and produces messages (frames) to a Kafka broker so that a consumer can later fetch those messages. Details regarding the arguments of the get_producer() method can be read here. The data has to be sent as bytes; no other format is supported.
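Since the producer snippet was shown as an image, here is a minimal pykafka sketch along the same lines. The broker address, the topic name `frames`, and the pickle-based frame encoding are assumptions for illustration, not our exact code:

```python
import pickle


def encode_frame(frame):
    """Serialize a frame (e.g. a list or numpy array) to bytes.
    pykafka only accepts bytes, so every frame must be encoded first."""
    return pickle.dumps(frame)


def decode_frame(payload):
    """Inverse of encode_frame: rebuild the frame from bytes."""
    return pickle.loads(payload)


def produce_frames(frames, hosts="127.0.0.1:9092", topic_name=b"frames"):
    """Connect to the Kafka brokers and publish each frame to the topic."""
    # Imported here so this module still loads on machines without pykafka.
    from pykafka import KafkaClient

    client = KafkaClient(hosts=hosts)
    topic = client.topics[topic_name]
    # get_producer() returns an asynchronous producer; using it as a
    # context manager flushes any pending messages on exit.
    with topic.get_producer() as producer:
        for frame in frames:
            producer.produce(encode_frame(frame))
```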
Now, to consume messages, we write a consumer class that connects to the Kafka brokers and fetches the data stored in the Kafka brokers.
The offset value is managed by ZooKeeper, so no two consumers in the same group receive the same frame.
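The consumer snippet was also an image; a hedged sketch using pykafka’s balanced consumer is shown below. The group name `detectors`, the broker/ZooKeeper addresses, and the pickle decoding are assumptions; the key idea is that ZooKeeper assigns each partition to exactly one member of the consumer group:

```python
import pickle


def decode_frame(payload):
    """Deserialize a frame that was pickled on the producer side."""
    return pickle.loads(payload)


def consume_frames(process, hosts="127.0.0.1:9092",
                   zookeeper="127.0.0.1:2181", topic_name=b"frames"):
    """Fetch frames from Kafka and hand each one to `process`."""
    # Imported here so this module still loads on machines without pykafka.
    from pykafka import KafkaClient

    client = KafkaClient(hosts=hosts)
    topic = client.topics[topic_name]
    # A balanced consumer joins a consumer group; ZooKeeper assigns each
    # partition to exactly one group member, so no two consumers in the
    # group receive the same frame.
    consumer = topic.get_balanced_consumer(
        consumer_group=b"detectors",
        zookeeper_connect=zookeeper,
        auto_commit_enable=True,  # offsets are committed automatically
    )
    for message in consumer:
        if message is not None:
            process(decode_frame(message.value))
```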
Now, how did this help us? Consider the same example I mentioned above.
Suppose my model takes 0.5 seconds to process an image and the FPS (frames per second) of my camera is 10. The camera would produce 10 frames per second, all 10 frames would be sent to the Kafka brokers, and at the same time the consumers would be fetching the frames. One consumer can still process only two frames per second, but since we have 5 consumers, all 10 frames get processed, avoiding data loss. PROBLEM SOLVED!!
The number of consumers required is calculated as follows:
No. of Consumers = (Frames produced per second) × (Time to process one frame)
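Plugging in the numbers from the example above, a quick sanity check (rounding up, since you can’t run half a consumer):

```python
import math

fps = 10              # frames produced per second by the camera
time_per_frame = 0.5  # seconds one consumer needs to process a frame

# Each consumer handles 1 / time_per_frame frames per second, so we need
# fps * time_per_frame consumers to keep up with the camera.
consumers_needed = math.ceil(fps * time_per_frame)
print(consumers_needed)  # 5
```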
Well, that is pretty much how we used Kafka to run our FACE RECOGNITION SYSTEM.
You can read more about Kafka here.