Kafka Tuning

Kafka Producer/Consumer Configuration

Message Batching

One of the most important concepts when producing to Kafka is batching. If the Kafka producer property linger.ms is 0 (the default in clients before Kafka 4.0), the producer does minimal batching and sends most messages as soon as they arrive. This can lower latency, but it also means a very high number of I/O operations and Kafka requests, which hurts throughput and efficiency and can overload our Kafka brokers.

  • linger.ms: the number of milliseconds the producer is willing to wait before sending a batch out -> 0 by default in clients before Kafka 4.0 (5 from 4.0 onwards) -> at 0, messages are sent straight away
  • batch.size: the maximum size of a batch in bytes; if a batch fills up before the end of the linger.ms period, it is sent to Kafka right away -> 16 KB (16384 bytes) by default -> if linger.ms is 0, the producer does not wait for a batch to fill, so messages are sent straight away regardless of the batch.size value
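
To make the interaction between the two settings concrete, here is a minimal toy model in Python. It is not the real producer's accumulator: the `Message` type and `simulate_batches` function are hypothetical names invented for this sketch, and the model ignores partitions, compression and the batching that still happens at linger.ms=0 while the sender is busy. A batch closes when the next message would overflow batch.size or arrives once the linger window has elapsed.

```python
from dataclasses import dataclass


@dataclass
class Message:
    arrival_ms: int   # when the application hands the message to the producer
    size_bytes: int


def simulate_batches(messages, linger_ms, batch_size):
    """Group messages into batches with a simplified linger.ms / batch.size rule."""
    batches = []
    current = []
    current_bytes = 0
    deadline_ms = None
    for msg in messages:
        if current:
            expired = msg.arrival_ms >= deadline_ms               # linger window is over
            full = current_bytes + msg.size_bytes > batch_size    # message would not fit
            if expired or full:
                batches.append(current)
                current, current_bytes = [], 0
        if not current:
            # The first message of a batch starts the linger window.
            deadline_ms = msg.arrival_ms + linger_ms
        current.append(msg)
        current_bytes += msg.size_bytes
    if current:
        batches.append(current)
    return batches


# Ten 1000-byte messages, one per millisecond (made-up workload).
msgs = [Message(arrival_ms=t, size_bytes=1000) for t in range(10)]
print(len(simulate_batches(msgs, linger_ms=0, batch_size=16384)))  # 10
print(len(simulate_batches(msgs, linger_ms=5, batch_size=16384)))  # 2
```

In this made-up workload, a 5 ms linger turns ten requests into two, which is the kind of reduction in request count the section describes.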

kafka-batching.png

Best practices

Visit this AWS Best Practices Documentation for more details. Based on that guidance, some starting values could be:

Producer Configuration:

  • linger.ms: at least 5 ms for all cases, including low latency, and a higher value of 25 ms in most cases
  • batch.size: at least 64 KB or 128 KB (when using larger batch sizes, also raise buffer.memory, for example to 64 MB)
  • send.buffer.bytes: -1 (use the OS default)
  • compression.type: lz4 or zstd
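
As a rough sketch, the producer starting values can be collected in one place and rendered as Java-style properties lines. The `producer_config` dict and `to_properties` helper are hypothetical names used only for illustration; the property keys are the ones listed above, and picking 25 ms, 128 KB and lz4 from the suggested ranges is an assumption, not a recommendation for every workload.

```python
KIB = 1024
MIB = 1024 * KIB

# Starting values taken from the list above; tune them against your own workload.
producer_config = {
    "linger.ms": 25,               # 5 for the most latency-sensitive cases
    "batch.size": 128 * KIB,       # at least 64 KB
    "buffer.memory": 64 * MIB,     # raised alongside the larger batch size
    "send.buffer.bytes": -1,       # -1 means "use the OS default"
    "compression.type": "lz4",     # or "zstd"
}


def to_properties(config):
    """Render a config dict as lines for a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in config.items())


print(to_properties(producer_config))
```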

Consumer Configuration:

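  • fetch.min.bytes: at least 32 bytes, or even 128 bytes
  • fetch.max.wait.ms: 1000 ms
  • Number of consumers: at least equal to the number of partitions (within one consumer group, any consumers beyond the partition count sit idle)
  • receive.buffer.bytes: -1 to use the OS default (the Kafka default is 64 KiB)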
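
The consumer values can be sketched the same way, together with a quick check of how partitions spread across the consumers of one group. The `spread_partitions` and `idle_consumers` helpers are hypothetical and use a simplified deal-one-at-a-time assignment, not any of Kafka's real assignors; they only illustrate that a partition is read by at most one consumer in a group.

```python
# Starting values from the list above (assumed, to be tuned per workload).
consumer_config = {
    "fetch.min.bytes": 32,
    "fetch.max.wait.ms": 1000,
    "receive.buffer.bytes": -1,   # -1 means "use the OS default"
}


def spread_partitions(num_partitions, num_consumers):
    """Deal partitions out to the consumers of one group, one at a time."""
    assignment = {consumer: [] for consumer in range(num_consumers)}
    for partition in range(num_partitions):
        assignment[partition % num_consumers].append(partition)
    return assignment


def idle_consumers(assignment):
    """Return the consumers that received no partition at all."""
    return [consumer for consumer, parts in assignment.items() if not parts]


print(spread_partitions(6, 3))                  # every consumer reads two partitions
print(idle_consumers(spread_partitions(6, 8)))  # consumers 6 and 7 have nothing to read
```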

Auto vs Manual Kafka Commits

We are currently using auto commits in our applications, which means that offsets are committed at fixed time intervals. This is more straightforward and good enough at the moment, but in the future we may want to consider using manual commits. Manual committing needs both a configuration change (enable.auto.commit=false) and explicit calls in the code (commitSync() or commitAsync()). The commit can be done after we process a batch of activities, so that offsets are only committed once this processing has been completed. Synchronous commits (commitSync()) block until Kafka acknowledges the commit before the consumer moves on to the next messages, and this can be slow, especially under high load. Consider using asynchronous commits (commitAsync()), which don't block the consumer from processing new messages.
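
The "commit only after the batch is processed" ordering can be sketched without a real broker. `FakeConsumer` below is a hypothetical in-memory stand-in, not a Kafka client API; its `commit()` plays the role of commitSync(). The point is only the order of operations: if processing crashes before the commit, the whole uncommitted batch is read again after a restart, so an already-handled record can be processed twice.

```python
class FakeConsumer:
    """In-memory stand-in for a Kafka consumer (hypothetical, for illustration only)."""

    def __init__(self, records):
        self._records = list(records)
        self._position = 0    # next offset to hand out
        self.committed = 0    # offset to resume from after a restart

    def poll(self, max_records):
        batch = self._records[self._position:self._position + max_records]
        self._position += len(batch)
        return batch

    def commit(self):
        self.committed = self._position

    def restart(self):
        # After a crash the consumer resumes from the last committed offset.
        self._position = self.committed


def consume(consumer, handler, max_records=2):
    """Process each polled batch fully, then commit its offsets."""
    while True:
        batch = consumer.poll(max_records)
        if not batch:
            break
        for record in batch:
            handler(record)
        consumer.commit()   # reached only if the whole batch succeeded


processed = []
state = {"crashed": False}


def handler(record):
    # Simulate a single crash while handling record "d".
    if record == "d" and not state["crashed"]:
        state["crashed"] = True
        raise RuntimeError("simulated crash")
    processed.append(record)


consumer = FakeConsumer(["a", "b", "c", "d", "e"])
try:
    consume(consumer, handler)
except RuntimeError:
    consumer.restart()
    consume(consumer, handler)

print(processed)  # ['a', 'b', 'c', 'c', 'd', 'e']
```

Record "c" shows up twice because its batch was never committed before the crash, which is why the handler should be idempotent even with manual commits.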

Use Auto Commit (enable.auto.commit=true) When:

  • You need low-latency, high-throughput processing.
  • Reprocessing some messages twice (at-least-once) is acceptable.
  • Your processing is idempotent (e.g., updating a cache).
  • The consumer logic is stateless (e.g., simple logging, monitoring).

Use Manual Commit (enable.auto.commit=false) When:

  • Data integrity is critical (avoid lost messages and keep duplicates to a minimum).
  • You need exactly-once processing (e.g., Flink with checkpoints).
  • Processing is stateful, and you must commit offsets only after successful computation.
  • Consumers handle long or multi-step operations (e.g., writing to a database).

auto-manual-kafka-commits.png

Expected Kafka latencies

Whilst investigating all these topics, I found it really hard to understand what "normal" looks like in terms of Kafka latencies. Here are some rough guidelines, although it will depend on individual cases:

expected-kafka-latencies.png

Acknowledge all vs Acknowledge Zero or One

There is another setting which can affect Kafka latencies and is worth discussing. The acks= setting will determine how many acknowledgments our producer needs to receive before it moves on after sending data to Kafka.

  • acks=all: the leader broker and all in-sync follower replicas need to acknowledge that they have received the data (strong durability) → Higher latency (10-100ms per write).
  • acks=1: only the leader broker needs to acknowledge that the data has been received. If the leader fails before the data is replicated to the followers, the data can be lost (lower durability) → Lower latency (1-10ms per write).
  • acks=0: the producer sends the message but doesn't wait for any acknowledgment from the broker. If the leader broker never receives the data, it is lost without the producer noticing, but this setting also gives the lowest latency and the highest throughput.
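
A very rough way to see why the three settings differ in latency is to model how long the producer waits for each. The `ack_wait_ms` function and all timings below are made up for illustration, and the model assumes that acks=all waits for the leader's write plus the slowest in-sync follower's replication, ignoring network round trips and request queueing.

```python
def ack_wait_ms(acks, leader_ms, follower_ms):
    """Toy estimate of how long the producer waits for an acknowledgment.

    leader_ms:   time for the leader to append the record (hypothetical)
    follower_ms: replication time of each in-sync follower (hypothetical)
    """
    if acks == "0":
        return 0.0                      # fire and forget
    if acks == "1":
        return leader_ms                # leader only
    if acks == "all":
        # Leader write, then the slowest in-sync follower catches up.
        return leader_ms + max(follower_ms, default=0.0)
    raise ValueError(f"unsupported acks value: {acks!r}")


for setting in ("0", "1", "all"):
    print(setting, ack_wait_ms(setting, leader_ms=2.0, follower_ms=[5.0, 8.0]))
```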

kafka-acks-all.png kafka-acks1.png