Kafka Comprehensive Tutorial – Part 1

2. What is Kafka?

Apache Kafka enables communication between producers and consumers through message-based topics. It is a fast, scalable, fault-tolerant publish-subscribe messaging system that provides a platform for modern distributed applications.

It also supports a large number of permanent or ad-hoc consumers. Kafka is highly available, resilient to node failures, and supports automatic recovery. These properties make it well suited for communication and integration between components of large-scale data systems.

Moreover, Kafka can replace conventional message brokers, such as those based on JMS or AMQP, while offering higher throughput, reliability, and replication. Its core abstractions are the Kafka broker, the Kafka producer, and the Kafka consumer. A Kafka broker is a node in the Kafka cluster that persists and replicates data. A Kafka producer pushes messages into a message container called a Kafka topic, whereas a Kafka consumer pulls messages from the topic.

a. Messaging System in Kafka

A messaging system transfers data from one application to another, so applications can focus on the data itself without worrying about how to share it. Distributed messaging is based on the concept of reliable message queuing: messages are queued asynchronously between client applications and the messaging system. Two messaging patterns are available, point-to-point and publish-subscribe (pub-sub), and most messaging systems follow the pub-sub pattern.

Figure: Apache Kafka – Kafka Messaging System

  • Point to Point Messaging System

Here, messages are persisted in a queue. Several consumers may read from the queue, but each message is consumed by at most one of them. As soon as a consumer reads a message, it disappears from the queue.

  • Publish-Subscribe Messaging System

Here, messages are persisted in a topic. Kafka consumers can subscribe to one or more topics and consume all the messages in those topics. In this model, message producers are called publishers and message consumers are called subscribers.
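The difference between the two patterns can be sketched with a small in-memory model. This is a toy illustration in plain Python, not Kafka code; the class names `PointToPointQueue` and `PubSubTopic` are hypothetical.

```python
from collections import deque


class PointToPointQueue:
    """Toy queue: each message is handed to exactly one consumer, then removed."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Return the oldest message and remove it, or None if the queue is empty.
        return self._messages.popleft() if self._messages else None


class PubSubTopic:
    """Toy topic: every subscriber gets its own copy of each published message."""

    def __init__(self):
        self._inboxes = {}

    def subscribe(self, name):
        self._inboxes[name] = []

    def publish(self, message):
        for inbox in self._inboxes.values():
            inbox.append(message)

    def messages_for(self, name):
        return list(self._inboxes[name])


queue = PointToPointQueue()
queue.send("order-1")
first = queue.receive()    # one consumer takes the message
second = queue.receive()   # a second consumer finds the queue empty

topic = PubSubTopic()
topic.subscribe("billing")
topic.subscribe("shipping")
topic.publish("order-1")   # both subscribers receive it
```

In the queue, the second `receive()` finds nothing because the message was already consumed; in the topic, both subscribers see the same message.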

3. History of Apache Kafka

LinkedIn needed low-latency ingestion of huge amounts of event data from its website into a lambda architecture capable of processing real-time events. No existing solution addressed this need, so Apache Kafka was developed in 2010.
Technologies for batch processing were available, but they shared their deployment details with downstream users and were not well suited to real-time processing. Kafka was made public in 2011.

4. Why Should We Use an Apache Kafka Cluster?

Big Data involves enormous volumes of data and poses two main challenges: collecting the data and analyzing it. A messaging system helps overcome both challenges, and Apache Kafka has proved its utility here. Common uses of Apache Kafka include:

  • Tracking web activities by storing/sending the events for real-time processes.
  • Alerting and reporting the operational metrics.
  • Transforming data into the standard format.
  • Continuous processing of streaming data to the topics.

Because of its wide use, Kafka competes strongly with popular alternatives such as ActiveMQ, RabbitMQ, and messaging services on AWS.

5. Kafka Tutorial – Audience

Professionals who aspire to a career in Big Data analytics using the Apache Kafka messaging system should refer to this Kafka tutorial. It gives a complete understanding of Apache Kafka.

6. Kafka Tutorial – Prerequisites

You should have a good understanding of Java, Scala, distributed messaging systems, and the Linux environment before proceeding with this Apache Kafka tutorial.

7. Kafka Architecture

Below we discuss the four core APIs of Apache Kafka:

Figure: Apache Kafka – Kafka Architecture

a. Kafka Producer API
The Producer API allows an application to publish a stream of records to one or more Kafka topics.
b. Kafka Consumer API
The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
c. Kafka Streams API
The Streams API allows an application to act as a stream processor: it consumes an input stream from one or more topics and produces an output stream to one or more output topics, effectively transforming the input streams into output streams.
d. Kafka Connector API
The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
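To make the stream-processor idea concrete, here is a minimal sketch in plain Python. It does not use the Kafka Streams API; the lists standing in for topics and the function name `stream_processor` are assumptions made for illustration only.

```python
def stream_processor(input_topic, transform):
    """Read every record from the input topic and yield the transformed record."""
    for record in input_topic:
        yield transform(record)


# Plain lists stand in for an input topic and an output topic.
input_topic = ["click:home", "click:docs"]
output_topic = list(stream_processor(input_topic, str.upper))
```

The processor consumes from one place, applies a transformation, and produces to another, which is the shape of work the Streams API is described as supporting.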

8. Kafka Components

a. Kafka Topic

Topics are how Kafka stores and organizes messages across its system; a topic is essentially a collection of messages. Topics can be replicated and partitioned.

Replication means keeping copies, and partitioning means dividing a topic into parts. You can picture each partition as a log in which Kafka stores messages. This ability to replicate and partition topics is one of the factors behind Kafka’s fault tolerance and scalability.

Figure: Apache Kafka – Kafka Topic
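One way to picture partitioning is a function that maps each message key to a partition. The sketch below is a simplified model: it uses CRC32 as a stand-in hash (an assumption for illustration; Kafka’s own partitioner uses a different hash), and the helper names are hypothetical.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    # Stand-in hash: CRC32 of the key, modulo the partition count.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def append(topic, key, value):
    # Append the record to the partition chosen for its key.
    partition = choose_partition(key, len(topic))
    topic[partition].append((key, value))
    return partition


topic = [[] for _ in range(3)]   # a hypothetical topic with 3 partitions
p1 = append(topic, "user-42", "login")
p2 = append(topic, "user-42", "view-page")
p3 = append(topic, "user-7", "login")
```

Because the partition depends only on the key, both `user-42` records land in the same partition and keep their relative order there.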

b. Kafka Producer

It publishes messages to a Kafka topic.

c. Kafka Consumer

This component subscribes to a topic(s), reads and processes messages from the topic(s).

d. Kafka Broker

The Kafka broker manages the storage of messages in the topic(s). A set of brokers working together forms a Kafka cluster.

e. Kafka Zookeeper

Kafka uses ZooKeeper to provide the brokers with metadata about the processes running in the system and to facilitate health checking and broker leadership election.

9. Kafka Tutorial – Log Anatomy

In this Kafka tutorial, we view a partition as a log. A data source writes messages to the log, and one or more consumers can read from the log at any time, each from the position it chooses. The diagram shows a log being written by a data source and read by consumers at different offsets.

Figure: Apache Kafka Tutorial – Log Anatomy
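The log anatomy can be sketched as an append-only list in which a message’s offset is simply its position. This is a toy model in plain Python; `PartitionLog` and the consumer names are hypothetical.

```python
class PartitionLog:
    """Toy append-only log; a message's offset is its position in the list."""

    def __init__(self):
        self._records = []

    def append(self, message):
        self._records.append(message)
        return len(self._records) - 1   # offset of the new record

    def read(self, offset, max_records=10):
        # Reading never removes anything; it returns records from `offset` onward.
        return self._records[offset:offset + max_records]


log = PartitionLog()
for event in ["a", "b", "c", "d"]:
    log.append(event)

# Two consumers track their own positions independently.
offsets = {"consumer_a": 1, "consumer_b": 3}
batch_a = log.read(offsets["consumer_a"])
batch_b = log.read(offsets["consumer_b"])
```

Each consumer reads from its own offset, and neither read changes what the other consumer (or a later reader starting at offset 0) can see.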

10. Kafka Tutorial – Data Log

Kafka retains messages for a configurable amount of time, so consumers can read at their convenience. However, if Kafka is configured to keep messages for 24 hours and a consumer is down for longer than 24 hours, the consumer will lose messages. If the consumer is down for only 60 minutes, it can resume reading from its last known offset. Kafka does not track which individual messages each consumer has read; a consumer’s position is captured only by its offset.
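The retention arithmetic can be worked through with a small simulation. The sketch assumes one message written per hour and a 24-hour retention window; the function `readable_records` and the tuple layout are hypothetical, chosen only to illustrate the reasoning.

```python
RETENTION_HOURS = 24  # retention window from the example in the text


def readable_records(records, now_hours, next_offset):
    """Return (values still readable from next_offset, number of records lost).

    `records` is a list of (offset, written_at_hours, value) tuples in offset order.
    """
    retained = [r for r in records if now_hours - r[1] <= RETENTION_HOURS]
    oldest_retained = retained[0][0] if retained else len(records)
    lost = max(0, oldest_retained - next_offset)
    readable = [r[2] for r in retained if r[0] >= next_offset]
    return readable, lost


# One record per hour, written at hours 0..29; the current time is hour 30.
records = [(i, i, f"msg-{i}") for i in range(30)]

# Consumer that was down for about an hour: its next offset is still retained.
short_outage = readable_records(records, now_hours=30, next_offset=29)

# Consumer that stopped at offset 2, 28 hours ago: offsets 2..5 have expired.
long_outage = readable_records(records, now_hours=30, next_offset=2)
```

The short outage loses nothing, while the long outage skips the four records that aged out before the consumer came back.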

11. Kafka Tutorial – Partition in Kafka

Each Kafka broker hosts a number of partitions, and each partition copy on a broker is either the leader or a replica for that partition of a topic. The leader handles all writes and reads for the partition and keeps the replicas updated with new data. If the leader fails, a replica takes over as the new leader.

Figure: Apache Kafka Tutorial – Partition In Kafka
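Leader failover can be sketched as follows. This is a deliberately simplified model: the leader pushes each write to every follower and the first follower is promoted on failure, whereas real Kafka leader election is more involved. The `Partition` class and broker names are hypothetical.

```python
class Partition:
    """Toy partition: one leader broker plus follower replicas."""

    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = list(followers)
        self.logs = {broker: [] for broker in [leader, *followers]}

    def write(self, message):
        # Writes go to the leader, and every follower copy is then updated.
        self.logs[self.leader].append(message)
        for follower in self.followers:
            self.logs[follower].append(message)

    def fail_leader(self):
        # Simplified election: drop the failed leader, promote the first follower.
        del self.logs[self.leader]
        self.leader = self.followers.pop(0)


partition = Partition(leader="broker-1", followers=["broker-2", "broker-3"])
partition.write("m1")
partition.write("m2")
partition.fail_leader()
partition.write("m3")
```

After the failure, `broker-2` already holds `m1` and `m2`, so it can continue accepting writes without losing earlier messages.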

12. Importance of Java in Apache Kafka

Apache Kafka is written in Scala and Java, and its native client API is Java. Clients also exist for many other languages, such as C++, Python, .NET, and Go. Still, Java is the platform where no third-party library is needed, and writing code in other languages adds a little overhead.
In addition, Java is a good fit when we need the high processing rates that Kafka offers, and there is good community support for Java consumer clients. Hence, Java is a sound choice for implementing Kafka applications.

13. Kafka Use Cases

There are several use cases that show why we actually use Apache Kafka.

  • Messaging

Kafka works well as a replacement for a more traditional message broker. It has better throughput, built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale message processing applications.

  • Metrics

Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.

  • Event Sourcing

Because Kafka supports very large stored logs, it is an excellent backend for event-sourcing applications.

14. Kafka Tutorial – Comparisons in Kafka

Many tools, such as ActiveMQ, RabbitMQ, Apache Flume, Storm, and Spark, offer functionality that overlaps with Kafka’s. So why choose Apache Kafka over the others?
Let’s see the comparisons below:

a. Apache Kafka vs Apache Flume

Figure: Kafka Tutorial – Apache Kafka vs Flume

i. Type of tool
Apache Kafka – A general-purpose tool for multiple producers and consumers.
Apache Flume – A special-purpose tool for specific applications.
ii. Replication feature
Apache Kafka – It replicates events using ingest pipelines.
Apache Flume – It does not replicate events.

b. RabbitMQ vs Apache Kafka

RabbitMQ is one of the foremost Apache Kafka alternatives. Let’s see how they differ:

Figure: Kafka Tutorial – Kafka vs RabbitMQ

i. Features
Apache Kafka – Kafka is distributed; data is shared and replicated with guaranteed durability and availability.
RabbitMQ – It offers relatively less support for these features.
ii. Performance rate
Apache Kafka – Its throughput is high, on the order of 100,000 messages/second.
RabbitMQ – Its throughput is around 20,000 messages/second.
iii. Processing
Apache Kafka – It allows reliable, distributed log processing, and stream processing semantics are built into Kafka Streams.
RabbitMQ – The consumer is simply FIFO based, reading from the head of the queue and processing messages one by one.

c. Traditional queuing systems vs Apache Kafka

Figure: Kafka Tutorial – Traditional queuing systems vs Apache Kafka

i. Message retention
Traditional queuing systems – Most queuing systems remove a message after it has been processed, typically from the end of the queue.
Apache Kafka – Messages persist even after being processed; they are not removed as consumers receive them.
ii. Logic-based processing
Traditional queuing systems – They do not allow processing logic based on similar messages or events.
Apache Kafka – It allows processing logic based on similar messages or events.

Top 10 Kafka Features | Why Apache Kafka Is So Popular

2. What is Apache Kafka?

Apache Kafka is a distributed publish-subscribe messaging system that handles a high volume of data and lets us pass messages from one endpoint to another. It is suitable for both offline and online message consumption. To prevent data loss, Kafka messages are persisted on disk and replicated within the cluster. Kafka is built on top of the ZooKeeper synchronization service. For real-time streaming data analysis, it also integrates very well with Apache Storm and Spark. Kafka has many more features; let’s discuss them in detail.

3. Top 10 Apache Kafka Features

a. Scalability

Apache Kafka scales in all four dimensions: event producers, event processors, event consumers, and event connectors. In other words, Kafka scales easily without downtime.

b. High-Volume

Kafka works easily with huge volumes of data streams.

c. Data Transformations

Kafka can derive new data streams from the data streams of producers.

d. Fault Tolerance

The Kafka cluster can handle failures with the masters and databases.

e. Reliability

Since Kafka is distributed, partitioned, replicated, and fault tolerant, it is very reliable.

f. Durability

Kafka is durable because it uses a distributed commit log, which means messages are persisted on disk as fast as possible.

g. Performance

Kafka has high throughput for both publishing and subscribing. It maintains stable performance even when many terabytes of messages are stored.

h. Zero Downtime

Kafka is very fast and is designed for zero downtime and zero data loss.

i. Extensibility

There are many ways for applications to plug in and make use of Kafka. In addition, it offers ways to write new connectors as needed.

j. Replication

Kafka can replicate events by using ingest pipelines.

So, this was all about Apache Kafka features. We hope you like our explanation.

Terminologies and Concepts

2. List of Kafka Terminologies

Below is a list of the most prominent Kafka terms, which helps build a strong foundation of Kafka knowledge.

i. Kafka Broker

An Apache Kafka cluster consists of one or more servers; each of these servers is called a broker.

ii. Kafka Topics

Kafka maintains feeds of messages in categories. The category or feed name to which messages are published, and under which they are stored, is called a topic. All Kafka messages are organized into topics.

iii. Kafka Partitions

Each broker in Kafka holds some number of partitions. Each partition copy is either the leader or a replica for that partition of a topic. The leader is responsible for all writes and reads for the partition; if the leader fails, a replica takes over as the new leader.

iv. Kafka Producers

In simple words, the processes that publish messages to Kafka are called producers. They publish data to the topics of their choice.

v. Kafka Consumers

The processes that subscribe to topics and read and process the feed of published messages are called consumers.

vi. Offset in Kafka

The offset is the position of the consumer in the log. It is the only metadata retained on a per-consumer basis.

vii. Kafka Consumer Group

The consumer group is a consumer abstraction offered by Kafka that generalizes both traditional messaging models, queuing and publish-subscribe. Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group.
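How a consumer group generalizes both models can be sketched by assigning partitions to group members. The round-robin rule below is a simple choice made for illustration, and `assign_partitions` is a hypothetical helper, not a Kafka API.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment of partitions to the consumers in one group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# One group with two members: the partitions are split (queue-like behavior).
shared = assign_partitions(partitions, ["c1", "c2"])

# Two groups with one member each: every group sees every partition (pub-sub-like).
group_a = assign_partitions(partitions, ["a1"])
group_b = assign_partitions(partitions, ["b1"])
```

Within one group the work is divided, as in a queue; across groups every group receives the full stream, as in publish-subscribe.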

viii. Kafka Log Anatomy

A log is simply another way to view a partition. A data source writes messages to the log, and one or more consumers read from the log whenever they want. In the diagram, a data source is writing to the log while consumers A and B read from it at different offsets.

Figure: Log Anatomy in Kafka

ix. Kafka Message Ordering and Client Acknowledgments

In Kafka, messages are delivered from a partition in the same order in which the partition received them.

x. Node in Kafka

In the Apache Kafka cluster, a node is a single computer.

xi. Kafka Cluster

A group of computers acting together for a common purpose is called a cluster. In Kafka the term has the same meaning: a group of computers, each running one instance of the Kafka broker.

xii. Kafka Replicas

Here, the word replica refers to a backup: a replica of a partition is a “backup” of that partition. Replicas exist to prevent data loss; they do not serve reads or writes themselves.

xiii. Kafka Message

In short, a message in Kafka is a piece of information that travels from a producer to a consumer through Apache Kafka.

xiv. Kafka Leader

The node responsible for all reads and writes for a given partition is called the Kafka leader. Every partition has one server that acts as its leader.

xv. Follower in Kafka

Simply put, a node that follows the leader’s instructions is called a follower. If the leader fails, one of the followers automatically becomes the new leader. A follower acts like a normal consumer: it pulls messages and updates its own data store.

xvi. Kafka Data Log

Kafka preserves messages for a considerable amount of time, so consumers can read at their convenience. If Kafka is configured to keep messages for 24 hours and a consumer is down for longer than 24 hours, the consumer will lose messages. If the consumer is down for only 60 minutes, it can still read the messages from its last known offset.

xvii. Kafka Connector API

The Connector API allows building and running reusable producers or consumers that connect existing applications or data systems to Kafka topics.

Advantages and Disadvantages of Kafka

2. Advantages of Kafka

Figure: Kafka Pros and Cons – Kafka Advantages

a. High-throughput
Even without very large hardware, Kafka can handle high-velocity, high-volume data and support throughput of thousands of messages per second.
b. Low Latency
It handles these messages with very low latency, in the range of milliseconds, as most new use cases demand.
c. Fault-Tolerant
Fault tolerance is one of Kafka’s best advantages. Kafka has an inherent capability to withstand node/machine failure within a cluster.
d. Durability
Here, durability refers to the persistence of data/messages on disk. Message replication is another reason behind durability, since it protects messages from being lost.
e. Scalability
Kafka can be scaled out on the fly by adding nodes, without incurring any downtime. Moreover, message handling inside the Kafka cluster is fully transparent and seamless.
f. Distributed
The distributed architecture of Kafka makes it scalable through capabilities like replication and partitioning.
g. Message Broker Capabilities
Kafka works very well as a replacement for a more traditional message broker. Here, a message broker refers to an intermediary program that translates messages from the formal messaging protocol of the publisher to the formal messaging protocol of the receiver.
h. High Concurrency
Kafka can handle thousands of messages per second with low latency and high throughput. In addition, it permits reading and writing messages at high concurrency.
i. By Default Persistent
As discussed above, messages are persistent by default, which makes Kafka durable and reliable.
j. Consumer Friendly
Kafka can integrate with a variety of consumers. It can behave differently depending on the consumer it integrates with, because each consumer has a different ability to handle the messages coming out of Kafka. Moreover, Kafka integrates well with consumers written in a variety of languages.
k. Batch Handling Capable (ETL-like functionality)
Because it persists messages, Kafka can also be employed for batch-like use cases and can do the work of a traditional ETL.
l. Variety of Use Cases
It can manage the variety of use cases commonly required for a data lake, for example log aggregation and web activity tracking.
m. Real-Time Handling
Kafka can handle real-time data pipelines. The need for a technology that handles real-time messages from applications is one of the core reasons to choose Kafka.

3. Disadvantages of Kafka

Figure: Cons of Kafka – Apache Kafka Disadvantages

It is good to know Kafka’s limitations, even if its advantages appear more prominent than its disadvantages. Kafka is worth choosing when its advantages are too compelling to pass up. Note that some disadvantages may matter for one use case and be irrelevant to another. Here are some of the disadvantages associated with Kafka:
a. No Complete Set of Monitoring Tools
Kafka lacks a full set of management and monitoring tools, which makes enterprise support staff anxious about choosing Kafka and supporting it in the long run.
b. Issues with Message Tweaking
The broker uses certain system calls to deliver messages to the consumer. Kafka performs well when messages are delivered unchanged, because it relies on these system capabilities; its performance drops significantly if a message needs tweaking.
c. No Support for Wildcard Topic Selection
Kafka matches only the exact topic name, which means it does not support wildcard topic selection. This makes it incapable of addressing certain use cases.
d. Lack of Pace
The APIs needed by other languages are maintained by different individuals and companies, so they can lag behind.
e. Reduces Performance
In general, individual message size is not an issue. However, as message size increases, brokers and consumers start compressing messages, and node memory is slowly used up when the messages are decompressed. Compression also happens while data flows through the pipeline, which affects throughput and performance.
f. Behaves Clumsy
Kafka can become clumsy and slow when the number of queues in a cluster increases.
g. Lacks some Messaging Paradigms
Some messaging paradigms, such as request/reply and point-to-point queues, are missing in Kafka. This is a problem for certain use cases, though not always.

So, this was all about the advantages and disadvantages of Kafka. We hope you like our explanation.

Apache Kafka Use cases | Kafka Applications

2. Apache Kafka Use Cases and Applications

i. Kafka Use cases

There are many use cases of Apache Kafka. Here are some of the most common ones:

Figure: Kafka Use Cases

a. Kafka Messaging

As we know, Kafka is a distributed publish-subscribe messaging system, so it works well as a replacement for a more traditional message broker. Message brokers are used for a variety of reasons, for example to decouple processing from data producers and to buffer unprocessed messages.
In comparison to most other messaging systems, Kafka has better throughput, built-in partitioning, replication, and fault tolerance. That makes it a good solution for large-scale message processing applications.

b. Website Activity Tracking

The original use case for Kafka was to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. Site activity is published to central topics, with one topic per activity type. Here, site activity refers to page views, searches, or other actions users may take.
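The "one topic per activity type" layout can be sketched like this. A dictionary of lists stands in for Kafka topics, and the `activity.<type>` naming scheme and `publish_activity` helper are assumptions made for illustration.

```python
from collections import defaultdict

topics = defaultdict(list)   # topic name -> list of events (stand-in for Kafka topics)


def publish_activity(event):
    # Route each event to a topic named after its activity type.
    topic_name = f"activity.{event['type']}"
    topics[topic_name].append(event)
    return topic_name


publish_activity({"type": "page_view", "user": "u1", "url": "/home"})
publish_activity({"type": "search", "user": "u1", "query": "kafka"})
publish_activity({"type": "page_view", "user": "u2", "url": "/docs"})
```

A consumer interested only in searches would then read from the search topic without seeing page views.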

c. Kafka Metrics

Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.

d. Kafka Log Aggregation

Across an organization, we can use Kafka to collect logs from multiple services and make them available in a standard format to multiple consumers.

e. Stream Processing

Popular frameworks such as Storm and Spark Streaming read data from a topic, process it, and write the processed data to a new topic, where it becomes available to users and applications. Kafka’s strong durability is also very useful in the context of stream processing.

f. Kafka Event Sourcing

Event sourcing is a style of application design in which state changes are logged as a time-ordered sequence of records. Kafka’s support for very large stored logs makes it an excellent backend for applications built in this style.
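As a minimal sketch of event sourcing, the current state can be rebuilt by replaying the logged events in order. The account example, the event shapes, and the function names are hypothetical and independent of Kafka.

```python
def apply(balance, event):
    # Apply a single state-change event to the current balance.
    kind, amount = event
    if kind == "deposit":
        return balance + amount
    if kind == "withdraw":
        return balance - amount
    raise ValueError(f"unknown event type: {kind}")


def replay(events, initial=0):
    # Rebuild state by applying every event in time order.
    state = initial
    for event in events:
        state = apply(state, event)
    return state


event_log = [("deposit", 100), ("withdraw", 30), ("deposit", 5)]
current_balance = replay(event_log)
balance_after_two = replay(event_log[:2])
```

Because the log is kept rather than overwritten, any earlier state can be recovered by replaying a prefix of it.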

g. Commit Log

Kafka can serve as a kind of external commit log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data. Kafka’s log compaction feature helps support this usage. In this usage, Kafka is similar to the Apache BookKeeper project.
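The idea behind log compaction can be sketched as keeping only the most recent record for each key. This is a simplified model: it ignores details of real compaction such as delete markers and the preservation of original offsets, and `compact` is a hypothetical helper.

```python
def compact(log):
    """Keep only the most recent record for each key, preserving log order."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset
    return [record for offset, record in enumerate(log)
            if latest_offset[record[0]] == offset]


log = [("k1", "v1"), ("k2", "v1"), ("k1", "v2"), ("k3", "v1"), ("k2", "v2")]
compacted = compact(log)
restored = dict(compacted)   # a recovering node can rebuild its latest state
```

A failed node that reads the compacted log still ends up with the latest value for every key, which is what makes the log usable for re-syncing.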

Now, let’s move towards Kafka Applications.

ii. Apache Kafka Applications

Figure: Kafka Applications

Kafka supports many of today’s best industrial applications. So, here we are listing some of the most notable applications of Kafka:

a. Twitter

Twitter is one of the best-known Kafka applications. Twitter is an online social networking service that lets users send and receive tweets: registered users can read and post tweets, while unregistered users can only read them. Twitter uses Storm-Kafka as part of its stream processing infrastructure.

b. LinkedIn

LinkedIn is another Kafka application. LinkedIn uses Apache Kafka for activity stream data and operational metrics. Kafka powers products such as LinkedIn Newsfeed and LinkedIn Today for online message consumption, in addition to offline analytics systems like Hadoop. Kafka’s strong durability is also one of the key factors in its use at LinkedIn.

c. Netflix

Netflix, an American multinational provider of on-demand internet streaming media, also uses Kafka, for real-time monitoring and event processing.

d. Mozilla

In 1998, members of Netscape created Mozilla, a free-software community. Kafka will soon replace part of Mozilla’s current production system for collecting performance and usage data from end users’ browsers for projects like Telemetry and Test Pilot.

e. Oracle

Oracle offers native connectivity to Kafka from its Enterprise Service Bus product, Oracle Service Bus (OSB). This lets developers leverage OSB’s built-in mediation capabilities to implement staged data pipelines.

Amir Masoud Sefidian
Data Scientist, Machine Learning Engineer, Researcher, Software Developer