What is Apache Kafka?


Kafka’s growth is exploding: more than a third of all Fortune 500 companies use Kafka. These companies include the top ten travel companies, 7 of the top ten banks, 8 of the top ten insurance companies, 9 of the top ten telecom companies, and many more. LinkedIn, Microsoft, and Netflix process four-comma messages a day with Kafka (1,000,000,000,000). Kafka is used for real-time streams of data, to collect big data, or to do real-time analysis (or both). Kafka is used with in-memory microservices to provide durability, and it can be used to feed events to CEP (complex event processing) systems and IoT/IFTTT-style automation systems.

Why Kafka?

Kafka often gets used in real-time streaming data architectures to provide real-time analytics. Since Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system, Kafka is used in use cases where JMS, RabbitMQ, and AMQP may not even be considered due to volume and responsiveness. Kafka has higher throughput, reliability, and replication characteristics, which make it applicable for things like tracking service calls (it tracks every call) or tracking IoT sensor data where a traditional MOM might not be considered.

Kafka can work with Flume/Flafka, Spark Streaming, Storm, HBase, Flink, and Spark for real-time ingesting, analysis, and processing of streaming data. Kafka is a data stream used to feed Hadoop BigData lakes. Kafka brokers support massive message streams for low-latency follow-up analysis in Hadoop or Spark. Also, Kafka Streams (a subproject) can be used for real-time analytics.

Kafka use cases

In short, Kafka gets used for stream processing, website activity tracking, metrics collection and monitoring, log aggregation, real-time analytics, CEP, ingesting data into Spark, ingesting data into Hadoop, CQRS, replay messages, error recovery, and guaranteed distributed commit log for in-memory computing (microservices).

Who uses Kafka?

A lot of large companies that handle a lot of data use Kafka. LinkedIn, where it originated, uses it to track activity data and operational metrics. Twitter uses it as part of Storm to provide a stream processing infrastructure. Square uses Kafka as a bus to move all system events to various Square data centers (logs, custom events, metrics, and so on), to output to Splunk and Graphite (dashboards), and to implement Esper-like/CEP alerting systems. It gets used by other companies too, like Spotify, Uber, Tumblr, Goldman Sachs, PayPal, Box, Cisco, Netflix, and many more.

Kafka has operational simplicity. Kafka is easy to set up and use, and it is easy to reason about how Kafka works. However, the main reason Kafka is very popular is its excellent performance. It has other characteristics as well, but so do other messaging systems. Kafka has great performance, is stable, provides reliable durability, has a flexible publish-subscribe/queue that scales well with N numbers of consumer groups, has robust replication, provides producers with tunable consistency guarantees, and provides preserved ordering at the shard level (the Kafka topic partition). In addition, Kafka works well with systems that have data streams to process, and it enables those systems to aggregate, transform, and load into other stores. But none of those characteristics would matter if Kafka were slow. The most important reason Kafka is popular is Kafka’s exceptional performance.

Why is Kafka so Fast?

Kafka relies heavily on the OS kernel to move data around quickly. It relies on the principles of Zero Copy. Kafka enables you to batch data records into chunks. These batches of data can be seen end to end from the producer to the file system (the Kafka topic log) to the consumer. Batching allows for more efficient data compression and reduces I/O latency. Kafka writes its immutable commit log to disk sequentially, thus avoiding random disk access and slow disk seeking. Kafka provides horizontal scale through sharding. It shards a topic log into hundreds, potentially thousands, of partitions spread across thousands of servers. This sharding allows Kafka to handle a massive load.

Kafka: Streaming Architecture

Kafka gets used most often for real-time streaming of data into other systems. Kafka is a middle layer that decouples your real-time data pipelines. Kafka core is not good for direct computations such as data aggregations or CEP. Kafka Streams, which is part of the Kafka ecosystem, does provide the ability to do real-time analytics. Kafka can be used to feed fast-lane systems (real-time and operational data systems) like Storm, Flink, Spark Streaming, and your services and CEP systems. Kafka is also used to stream data for batch data analysis. Kafka feeds Hadoop. It streams data into your big data platform or into RDBMS, Cassandra, Spark, or even S3 for some future data analysis. These data stores often support data analysis, reporting, data science crunching, compliance auditing, and backups.

Kafka Streaming Architecture Diagram


Now let’s truly answer the question.

Kafka is…

Kafka is a distributed streaming platform that is used to publish and subscribe to streams of records. Kafka gets used for fault-tolerant storage. Kafka replicates topic log partitions to multiple servers. Kafka is designed to allow your apps to process records as they occur. Kafka is fast and uses IO efficiently by batching and compressing records. Kafka gets used for decoupling data streams. Kafka is used to stream data into data lakes, applications, and real-time stream analytics systems.
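
To make the publish-subscribe idea concrete, here is a minimal sketch of publishing a record with the Java producer client; the broker address and the payments topic are placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HelloKafkaProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Publish a record to the hypothetical "payments" topic; records with the same key
      // always land in the same partition, which preserves per-key ordering.
      producer.send(new ProducerRecord<>("payments", "account-42", "debit:10.00"));
      producer.flush();
    }
  }
}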

Kafka Decoupling Data Streams


Kafka is Polyglot

Kafka communication between clients and servers uses a wire protocol over TCP that is versioned and documented. Kafka promises to maintain backward compatibility with older clients, and many languages are supported. There are clients in C#, Java, C, Python, Ruby, and many more languages. The Kafka ecosystem also provides a REST proxy that allows integration via HTTP and JSON, which makes integration even easier. Kafka also supports Avro schemas via the Confluent Schema Registry for Kafka. Avro and the Schema Registry allow complex records to be produced and read by clients in many programming languages and allow for the evolution of the records. Kafka is truly polyglot.

Kafka is useful – Kafka Usage

Kafka allows you to build real-time streaming data pipelines. Kafka enables in-memory microservices (actors, Akka, Baratine.io, QBit, reactors, reactive, Vert.x, RxJava, Spring Reactor). Kafka allows you to build real-time streaming applications that react to streams to do real-time data analytics, transform, react, aggregate, join real-time data flows, and perform CEP (complex event processing).

You can use Kafka to aid in gathering metrics/KPIs, to aggregate statistics from many sources, and to implement event sourcing; you can use it with microservices (in-memory) and actor systems to implement in-memory services (an external commit log for distributed systems).

You can use Kafka to replicate data between nodes, to re-sync nodes, and to restore state. While it is true that Kafka is used for real-time data analytics and stream processing, you can also use it for log aggregation, messaging, click-stream tracking, audit trails, and much more.

In a world where data science and analytics are a big deal, capturing data to feed into your data lakes and real-time analytics systems is also a big deal, and since Kafka can hold up to these kinds of strenuous use cases, Kafka is a big deal.

Kafka is a scalable message storage

Kafka is a good storage system for records/messages. Kafka acts like a high-speed file system for commit log storage and replication. These characteristics make Kafka useful for all manner of applications. Records written to Kafka topics are persisted to disk and replicated to other servers for fault tolerance. Since modern drives are fast and quite large, this fits well and is very useful. Kafka producers can wait on acknowledgment, so messages are durable: the producer’s write is not complete until the message replicates. The Kafka disk structure scales well. Modern disk drives have very high throughput when writing in large streaming batches. Also, Kafka clients/consumers can control the read position (offset), which allows for use cases like replaying the log if there was a critical bug (fix the bug and then replay). And since offsets are tracked per consumer group, which we talk about in the Kafka Architecture article, consumers can be quite flexible (e.g., replaying the log).
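
A minimal sketch of that offset control: the Java consumer below is assigned a partition and rewound to the beginning of the log to replay it (broker address, topic, partition, and group id are placeholders).

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
    props.put("group.id", "replay-after-bugfix");      // hypothetical consumer group
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      TopicPartition partition = new TopicPartition("payments", 0);    // hypothetical topic/partition
      consumer.assign(Collections.singletonList(partition));
      consumer.seekToBeginning(Collections.singletonList(partition));  // rewind: reprocess the whole log
      ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
      for (ConsumerRecord<String, String> record : records) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
      }
    }
  }
}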

Kafka Record Retention

A Kafka cluster retains all published records, and if you don’t set a limit, it will keep records until it runs out of disk space. You can set time-based limits (a configurable retention period), size-based limits (configurable based on size), or use compaction (which keeps the latest version of a record, using the key). You can, for example, set a retention policy of three days, two weeks, or a month. The records in the topic log are available for consumption until discarded by time, size, or compaction. The consumption speed is not impacted by size, as Kafka always writes to the end of the topic log.
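
As a sketch of how those limits look in practice, the topic-level settings retention.ms, retention.bytes, and cleanup.policy can be supplied when creating a topic with the Java AdminClient; the topic name, partition count, and values here are illustrative.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithRetention {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");  // assumed broker address

    Map<String, String> configs = new HashMap<>();
    configs.put("retention.ms", "259200000");      // time-based limit: keep records for three days
    configs.put("retention.bytes", "1073741824");  // size-based limit: about 1 GB per partition
    // configs.put("cleanup.policy", "compact");   // alternative: keep only the latest record per key

    try (AdminClient admin = AdminClient.create(props)) {
      NewTopic topic = new NewTopic("payments", 3, (short) 3).configs(configs);  // 3 partitions, replication factor 3
      admin.createTopics(Collections.singletonList(topic)).all().get();
    }
  }
}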

Let’s Review

Why is Kafka so fast?

Kafka is fast because it avoids copying buffers in-memory (Zero Copy), and streams data to immutable logs instead of using random access.

How fast is Kafka usage growing?

When you consider that Kafka is six years old, and over a third of Fortune 500 companies use Kafka, then the only answer is fast, very fast.

How is Kafka getting used?

Kafka is used to feed data lakes like Hadoop, and to feed real-time analytics systems like Flink, Storm, and Spark Streaming.

Where does Kafka fit in the Big Data Architecture?

Kafka is a data stream that fills up Big Data’s data lakes.

How does Kafka relate to real-time analytics?

Kafka feeds data to real-time analytics systems like Storm, Spark Streaming, Flink, and Kafka Streams.

Who uses Kafka?

The top ten travel companies, 7 of the top ten banks, 8 of the top ten insurance companies, 9 of the top ten telecom companies, LinkedIn, Microsoft, Netflix, and many more companies.

How does Kafka decouple streams of data?

It decouples streams of data by allowing multiple consumer groups, each of which controls where in the topic partition it is (its offset). The producers don’t know about the consumers. Since the Kafka broker delegates the log partition offset (where the consumer is in the record stream) to the clients (consumers), message consumption is flexible. This allows you to feed your high-latency daily or hourly data analysis in Spark and Hadoop while, at the same time, feeding microservices real-time messages, sending events to your CEP system, and feeding data to your real-time analytics systems.
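
As a rough sketch of that decoupling, two consumers that subscribe to the same topic but use different group ids each receive the full stream and track their own offsets independently (the group ids and topic name are made up).

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TwoIndependentGroups {
  static KafkaConsumer<String, String> consumerFor(String groupId) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
    props.put("group.id", groupId);                    // the only difference between the two consumers
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(Collections.singletonList("clickstream"));  // hypothetical topic
    return consumer;
  }

  public static void main(String[] args) {
    // Both groups read the same topic log; each keeps its own offsets, so the slow batch
    // pipeline never holds back the real-time pipeline, and neither one affects the producers.
    KafkaConsumer<String, String> batchFeed = consumerFor("hourly-spark-jobs");
    KafkaConsumer<String, String> realtimeFeed = consumerFor("realtime-analytics");
  }
}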

What are some common use cases for Kafka?

Kafka feeds data to real-time analytics systems like Storm, Spark Streaming, Flink, and Kafka Streams. It also gets used for log aggregation, feeding events to CEP systems, and as a commit log for in-memory microservices.

Kafka Design Motivation

LinkedIn engineering built Kafka to support real-time analytics. Kafka was designed to feed an analytics system that did real-time processing of streams. LinkedIn developed Kafka as a unified platform for real-time handling of streaming data feeds. The goal behind Kafka was to build a high-throughput streaming data platform that supports high-volume event streams like log aggregation, user activity, etc.

To scale to meet the demands of LinkedIn, Kafka is distributed and supports sharding and load balancing. Scaling needs inspired Kafka’s partitioning and consumer model. Kafka scales writes and reads with partitioned, distributed commit logs. Kafka’s sharding is called partitioning. (Kinesis, which is similar to Kafka, calls partitions shards.)

A database shard is a horizontal partition of data in a database or search engine. Each individual partition is referred to as a shard or database shard. Each shard is held on a separate database server instance to spread the load.

Kafka was designed to handle periodic large data loads from offline systems as well as traditional messaging use cases, which require low latency.
MOM is message-oriented middleware; think IBM MQSeries, JMS, ActiveMQ, and RabbitMQ. Like many MOMs, Kafka is fault-tolerant to node failures through replication and leadership election. However, the design of Kafka is more like a distributed database transaction log than a traditional messaging system. Unlike many MOMs, Kafka replication was built into the low-level design and is not an afterthought.

Persistence: Embrace filesystem

Kafka relies on the filesystem for storing and caching records.

The sequential-write performance of hard drives is fast (really fast). JBOD is just a bunch of disk drives. A JBOD configuration with six 7200rpm SATA drives in a RAID-5 array can write about 600MB/sec. Like Cassandra tables, Kafka logs are write-only structures, meaning data gets appended to the end of the log. When using HDDs, sequential reads and writes are fast, predictable, and heavily optimized by operating systems. Using HDDs, sequential disk access can be faster than random memory access and SSD.

While JVM GC overhead can be high, Kafka leans on the OS a lot for caching, which is a big, fast, and rock-solid cache. Also, modern operating systems use all available main memory for disk caching. OS file caches are almost free and don’t have the overhead of the JVM garbage collector. Implementing cache coherency is challenging to get right, but Kafka relies on the rock-solid OS for cache coherence. Using the OS for caching also reduces the number of buffer copies. Since Kafka disk usage tends toward sequential reads, the OS read-ahead cache is impressive.

Cassandra, Netty, and Varnish use similar techniques. All of this is explained well in the Kafka documentation. And, there is a more entertaining explanation at the Varnish site.

Big fast HDDs and long sequential access

Kafka favors long sequential disk access for reads and writes. Like Cassandra, LevelDB, RocksDB, and others Kafka uses a form of log-structured storage and compaction instead of an on-disk mutable BTree. Like Cassandra, Kafka uses tombstones instead of deleting records right away.

Since disks these days have somewhat unlimited space and are very fast, Kafka can provide features not usually found in a messaging system like holding on to old messages for a long time. This flexibility allows for interesting applications of Kafka.

Kafka Producer Load Balancing

The producer asks the Kafka broker for metadata about which Kafka broker is the leader for which topic partitions, so no routing layer is needed. This leadership data allows the producer to send records directly to the Kafka broker that leads the partition.

The Producer client controls which partition it publishes messages to, and can pick a partition based on some application logic. Producers can partition records by a key, round-robin, or use a custom application-specific partitioner logic.
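
For illustration, a custom partitioner in the Java client is a small class registered through the partitioner.class producer setting. The sketch below hashes the record key; the class name is made up, and it assumes every record has a non-null key.

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Routes records to partitions by hashing the key, so the same key always maps to the same partition.
public class AccountPartitioner implements Partitioner {
  @Override
  public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
    int numPartitions = cluster.partitionsForTopic(topic).size();
    return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;  // assumes keyBytes is never null
  }

  @Override
  public void close() {}

  @Override
  public void configure(Map<String, ?> configs) {}
}

// Registered on the producer with:
//   props.put("partitioner.class", AccountPartitioner.class.getName());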

Kafka Producer Record Batching

Kafka producers support record batching. Batching can be configured by the maximum size of a batch in bytes. Batches can be auto-flushed based on time.

Batching is good for network IO throughput.

Batching speeds up throughput drastically.

Buffering is configurable and lets you make a trade-off between additional latency and better throughput. Or, in the case of a heavily used system, it can mean both better average throughput and reduced overall latency.

Batching allows the accumulation of more bytes to send, which equates to a few larger I/O operations on Kafka Brokers and increases compression efficiency. For higher throughput, the Kafka Producer configuration allows buffering based on time and size. The producer sends multiple records as a batch with fewer network requests than sending each record one by one.
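
For illustration, this buffering trade-off is controlled by a few producer settings, added to the producer Properties from the earlier sketch; the values are illustrative, not recommendations.

props.put("batch.size", "65536");        // maximum bytes per batch per partition (the default is 16384)
props.put("linger.ms", "20");            // wait up to 20 ms for more records before sending a batch
props.put("buffer.memory", "67108864");  // total memory the producer may use to buffer unsent records (64 MB)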

Kafka Producer Batching


Kafka compression

In large streaming platforms, the bottleneck is not always CPU or disk but often network bandwidth. There are even more network bandwidth issues in cloud, containerized, and virtualized environments, as multiple services could be sharing a NIC. Also, network bandwidth issues can be problematic when talking data center to data center or over a WAN.

Batching is beneficial for efficient compression and network IO throughput.

Kafka provides end-to-end batch compression: instead of compressing one record at a time, Kafka efficiently compresses a whole batch of records. The same message batch can be compressed and sent to the Kafka broker/server in one go and written in compressed form into the log partition. You can even configure the compression so that no decompression happens until the Kafka broker delivers the compressed records to the consumer.

Kafka supports GZIP, Snappy, and LZ4 compression protocols.
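
As a sketch, end-to-end batch compression is switched on from the producer side, again by adding to the producer Properties from the earlier sketch.

props.put("compression.type", "lz4");  // the producer compresses each whole batch before sending it
// With the topic-level config compression.type=producer (the default), the broker stores the batch
// as-is, so decompression only happens in the consumer.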

Pull vs. Push/Streams

With Kafka, consumers pull data from brokers. Other systems’ brokers push data or stream data to consumers. Messaging is usually a pull-based system (SQS and most MOMs use pull). With a pull-based system, if a consumer falls behind, it catches up later when it can.

Since Kafka is pull-based, it implements aggressive batching of data. Kafka like many pull-based systems implements a long poll (SQS, Kafka both do). A long poll keeps a connection open after a request for a period and waits for a response.

A pull-based system has to pull data and then process it, and there is always a pause between the pull and getting the data.

Push-based systems push data to consumers (Scribe, Flume, reactive streams, RxJava, Akka). Push-based or streaming systems have problems dealing with slow or dead consumers. It is possible for a push-system consumer to get overwhelmed when its rate of consumption falls below the rate of production. Some push-based systems use a back-off protocol based on back pressure that allows a consumer to indicate it is overwhelmed; see reactive streams. This problem of not flooding a consumer and allowing consumer recovery is tricky when trying to track message acknowledgments.

Push-based or streaming systems can send a request immediately or accumulate requests and send them in batches (or a combination based on backpressure). Push-based systems are always pushing data. The consumer can accumulate messages while it is processing data already sent, which is an advantage in reducing the latency of message processing. However, if the consumer died when it was behind on processing, how does the broker know where the consumer was, and when does the data get sent again to another consumer? This problem is not an easy one to solve. Kafka gets around these complexities by using a pull-based system.

Traditional MOM Consumer Message State Tracking

With most MOM it is the broker’s responsibility to keep track of which messages get marked consumed. Message tracking is not an easy task. As the consumer consumes messages, the broker keeps track of the state.

The goal in most MOM systems is for the broker to delete data quickly after consumption. Remember most MOMs were written when disks were a lot smaller, less capable, and more expensive.

This message tracking is trickier than it sounds (the acknowledgment feature), as brokers must maintain a lot of state per message (sent, acknowledged) and know when to delete or resend each message.

Kafka Consumer Message State Tracking

Remember that Kafka topics get divided into ordered partitions. Each message has an offset in this ordered partition. Each topic partition is consumed by exactly one consumer per consumer group at a time.

This partition layout means the broker does not track offset data per message, as a MOM does; it only needs to store the offset of each consumer group, partition offset pair. This equates to a lot less data to track.

The consumer periodically sends its location data (the consumer group, partition offset pair) to the Kafka broker, and the broker stores this offset data in an internal offsets topic (__consumer_offsets).

The offset style message acknowledgment is much cheaper compared to MOM. Also, consumers are more flexible and can rewind to an earlier offset (replay). If there was a bug, then fix the bug, rewind the consumer and replay the topic. This rewind feature is a killer feature of Kafka as Kafka can hold topic log data for a very long time.

Message Delivery Semantics

There are three message delivery semantics: at most once, at least once, and exactly once. At most once means messages may be lost but are never redelivered. At least once means messages are never lost but may be redelivered. Exactly once means each message is delivered once and only once. Exactly once is preferred but more expensive, and requires more bookkeeping for the producer and consumer.

Kafka Consumer and Message Delivery Semantics

Recall that all replicas have exactly the same log partitions with the same offsets and the consumer groups maintain their position in the log per topic partition.

To implement “at-most-once”, the consumer reads a message, then saves its offset in the partition by sending it to the broker, and finally processes the message. The issue with “at-most-once” is that a consumer could die after saving its position but before processing the message. Then the consumer that takes over or gets restarted would pick up at the last saved position, and the message in question never gets processed.

To implement “at-least-once”, the consumer reads a message, processes it, and finally saves its offset to the broker. The issue with “at-least-once” is that a consumer could crash after processing a message but before saving the last offset position. Then, if the consumer is restarted or another consumer takes over, the consumer could receive a message that was already processed. “At-least-once” is the most common setup for messaging, and it is your responsibility to make the messages idempotent, which means getting the same message twice will not cause a problem (e.g., two debits).

To implement “exactly once” on the consumer side, the consumer would need a two-phase commit between storage for the consumer position, and storage of the consumer’s message process output. Or, the consumer could store the message process output in the same location as the last offset.
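
A minimal sketch of an “at-least-once” consumer loop with the Java client: it turns off auto-commit, processes each batch of records, and only then commits its offsets back to the broker (the topic, group id, and processRecord helper are placeholders).

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
    props.put("group.id", "billing-service");          // hypothetical consumer group
    props.put("enable.auto.commit", "false");          // we commit offsets ourselves
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(Collections.singletonList("payments"));  // hypothetical topic
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          processRecord(record);  // process first...
        }
        consumer.commitSync();    // ...then save offsets; a crash in between causes redelivery, not loss
      }
    }
  }

  static void processRecord(ConsumerRecord<String, String> record) {
    // Must be idempotent: after a crash the same record may be delivered again.
  }
}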

Kafka offers the first two, and it is up to you to implement the third from the consumer perspective.

Kafka Producer Durability and Acknowledgement

Kafka offers operational predictability semantics for durability. When publishing a message, a message gets “committed” to the log which means all ISRs accepted the message. This commit strategy works out well for durability as long as at least one replica lives.

The producer connection could go down in the middle of a send, the producer may not be sure whether a message it sent went through, and then the producer resends the message. This resend logic is why it is important to use message keys and idempotent messages (duplicates OK). Kafka did not guarantee that messages would not get duplicated by producer retries until recently (June 2017).
The producer can resend a message until it receives confirmation, i.e., an acknowledgment.

A producer resending a message without knowing whether the message it previously sent made it or not negates the “exactly once” and “at-most-once” message delivery semantics.

Producer Durability

The producer can specify the durability level. The producer can wait on a message being committed. Waiting for commit ensures all replicas have a copy of the message.

The producer can send with no acknowledgments (0). The producer can send and get just one acknowledgment from the partition leader (1). The producer can send and wait on acknowledgments from all replicas (-1), which is the default.
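
In the Java producer these levels map to the acks setting, added to the producer Properties from the earlier sketch; “all” is shown as an example.

// acks=0: fire and forget, acks=1: wait for the partition leader only, acks=all (same as -1): wait for all ISRs
props.put("acks", "all");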

As of June 2017: the producer can ensure a message or group of messages was sent “exactly once”.

Improved Producer (June 2017 release)

Kafka now supports “exactly once” delivery from the producer, performance improvements, and atomic writes across partitions. This is achieved by the producer sending a sequence id; the broker keeps track of whether the producer already sent this sequence, and if the producer tries to send it again, it gets an ack for the duplicate message, but nothing is saved to the log. This improvement requires no API change.
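
As a sketch, this duplicate-free behavior is enabled with a single producer setting, added to the producer Properties from the earlier sketch.

props.put("enable.idempotence", "true");  // the producer attaches sequence numbers; the broker discards resends it already wrote
// Enabling idempotence requires acks=all and retries > 0 on the producer.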

Kafka Producer Atomic Log Writes (June 2017 Release)

Another improvement to Kafka is that Kafka producers can make atomic writes across partitions. Atomic writes mean Kafka consumers can only see committed records (configurable via the consumer isolation level). Kafka has a coordinator that writes a marker to the topic log to signify what has been successfully transacted. The transaction coordinator and transaction log maintain the state of the atomic writes.

The atomic writes do require a new producer API for transactions.

Here is an example of using the new producer API.

New Producer API for transactions

producer.initTransactions();

try {
  producer.beginTransaction();
  producer.send(debitAccountMessage);
  producer.send(creditOtherAccountMessage);
  producer.sendOffsetsToTransaction(...);  // commit consumer offsets as part of the same transaction
  producer.commitTransaction();
} catch (ProducerFencedException pfe) {
  // another producer with the same transactional.id took over; this producer must close
  producer.close();
} catch (KafkaException ke) {
  // abort so the partial writes are never visible to read_committed consumers
  producer.abortTransaction();
}

Kafka Replication

Kafka replicates each topic’s partitions across a configurable number of Kafka brokers. Kafka’s replication model is the default mode of operation, not a bolt-on feature as in most MOMs, since Kafka was meant to work with partitions and multiple nodes from the start. Each topic partition has one leader and zero or more followers.

Leaders and followers are called replicas. A replication factor is the leader node plus all of the followers. Partition leadership is evenly shared among Kafka brokers. Consumers only read from the leader. Producers only write to the leaders.

The topic log partitions on followers are kept in sync with the leader’s log; ISRs are an exact copy of the leader minus the in-flight records that are yet to be replicated. Followers pull records in batches from their leader, like a regular Kafka consumer.

Kafka Broker Failover

Kafka keeps track of which Kafka brokers are alive. To be alive, a Kafka Broker must maintain a ZooKeeper session using ZooKeeper’s heartbeat mechanism and must have all of its followers in-sync with the leaders and not fall too far behind.

Both the ZooKeeper session and being in-sync are needed for broker liveness which is referred to as being in-sync. An in-sync replica is called an ISR. Each leader keeps track of a set of “in sync replicas”.

If an ISR/follower dies or falls behind, then the leader will remove the follower from the set of ISRs. Falling behind is when a replica is not in-sync after the replica.lag.time.max.ms period.

A message is considered “committed” when all ISRs have applied the message to their log. Consumers only see committed messages. Kafka’s guarantee: a committed message will not be lost, as long as there is at least one ISR.

Replicated Log Partitions

A Kafka partition is a replicated log. A replicated log is a distributed data system primitive. A replicated log is useful for implementing other distributed systems using state machines. A replicated log models “coming into consensus” on an ordered series of values.

While a leader stays alive, all followers just need to copy values and orders from their leader. If the leader does die, Kafka chooses a new leader from its followers which are in-sync. If a producer is told a message is committed, and then the leader fails, then the newly elected leader must have that committed message.

The more ISRs you have, the more candidates there are to elect during a leadership failure.

Kafka and Quorum

A quorum is the number of acknowledgments required and the number of logs that must be compared to elect a leader such that there is guaranteed to be an overlap for availability. Most systems use a majority vote; Kafka does not use a simple majority vote, in order to improve availability.

In Kafka, leaders are selected based on having a complete log. If we have a replication factor of 3, then at least two ISRs must be in-sync before the leader declares a sent message committed. If a new leader needs to be elected then, with no more than 3 failures, the new leader is guaranteed to have all committed messages.

Among the followers, there must be at least one replica that contains all committed messages. The problem with a majority-vote quorum is that it does not take many failures to have an inoperable cluster.

Kafka Quorum Majority of ISRs

Kafka maintains a set of ISRs per leader. Only members of this set of ISRs are eligible for leadership election. What the producer writes to a partition is not committed until all ISRs acknowledge the write. The ISR set is persisted to ZooKeeper whenever it changes. Only replicas that are members of the ISR set are eligible to be elected leader.

This style of ISR quorum allows producers to keep working without a majority of all nodes, relying only on the ISR set. This style of ISR quorum also allows a replica to rejoin the ISR set and have its vote count, but it has to be fully re-synced before rejoining, even if the replica lost un-flushed data during its crash.

All nodes die at the same time. Now what?

Kafka’s guarantee about data loss is only valid if at least one replica is in-sync.

If all followers that are replicating a partition leader die at once, then Kafka’s data loss guarantee is not valid. If all replicas are down for a partition, Kafka, by default, chooses the first replica (not necessarily in the ISR set) that comes alive as the leader (the config unclean.leader.election.enable=true is the default). This choice favors availability over consistency.

If consistency is more important than availability for your use case, then you can set the config unclean.leader.election.enable=false; then, if all replicas are down for a partition, Kafka waits for the first ISR member (not the first replica) that comes alive to elect a new leader.

Producers pick Durability

Producers can choose durability by setting acks to none (0), the leader only (1), or all replicas (-1).

The acks=all is the default. With all, the acks happen when all current in-sync replicas (ISRs) have received the message.

You can make the trade-off between consistency and availability. If durability over availability is preferred, then disable unclean leader election and specify a minimum ISR size.

The higher the minimum ISR size, the better the guarantee is for consistency. But a higher minimum ISR reduces availability, since the partition will be unavailable for writes if the size of the ISR set is less than the minimum threshold.
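
A sketch of a consistency-over-availability setup built from these knobs; the values are illustrative, not recommendations.

// Producer side (Java client), added to the producer Properties from the earlier sketch:
props.put("acks", "all");  // a send only succeeds once every current in-sync replica has the record

// Broker or topic side (real config keys, illustrative values):
//   min.insync.replicas=2                  reject writes when fewer than two replicas are in-sync
//   unclean.leader.election.enable=false   never elect an out-of-sync replica as the leader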

Quotas

Kafka has quotas for consumers and producers to limit the bandwidth they are allowed to consume. These quotas prevent consumers or producers from hogging up all the Kafka broker resources. The quota is by client id or user. The quota data is stored in ZooKeeper, so changes do not necessitate restarting Kafka brokers.


Kafka Low-Level Design and Architecture Review

How would you prevent a denial of service attack from a poorly written consumer?

Use Quotas to limit the consumer’s bandwidth.

What is the default producer durability (acks) level?

All. This means all ISRs have to write the message to their log partition.

What happens by default if all of the Kafka nodes go down at once?

Kafka chooses the first replica (not necessarily in ISR set) that comes alive as the leader as unclean.leader.election.enable=true is default to support availability.

Why is Kafka record batching important?

Optimized IO throughput over the wire as well as to the disk. It also improves compression efficiency by compressing an entire batch.

What are some of the design goals for Kafka?

To be a high-throughput, scalable streaming data platform for real-time analytics of high-volume event streams like log aggregation, user activity, etc.

What are some of the new features in Kafka as of June 2017?

Producer atomic writes, performance improvements, and producer not sending duplicate messages.

What are the different message delivery semantics?

There are three message delivery semantics: at most once, at least once, and exactly once.

Useful Links:

https://medium.com/@durgaswaroop/a-practical-introduction-to-kafka-storage-internals-d5b544f6925f

https://hackernoon.com/thorough-introduction-to-apache-kafka-6fbf2989bbc1

https://data-flair.training/blogs/apache-kafka-tutorial/

https://medium.com/rahasak/apache-zookeeper-31b2091657a8

https://www.tutorialspoint.com/apache_kafka/index.htm

http://cloudurable.com/blog/what-is-kafka/index.html

Amir Masoud Sefidian
Machine Learning Engineer