19 November 2024

Kafka 101

 

What is Kafka?

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. (kafka.apache.org)

High throughput
Scalable
Permanent storage
High availability

In a synchronous system, a service makes a request and the calling thread/process must wait until it receives a response. While this flow is simpler to code (a straightforward structure), it can introduce delays and create bottlenecks.

In contrast, asynchronous processing, as explained in the EDA (Event-Driven Architecture) presentation by Fusse, allows events to be processed without waiting for an immediate response. Messaging is a common way to implement loose coupling: messages are sent asynchronously, allowing services to continue working and receive responses later.

By using loosely coupled services, we can achieve a scalable architecture without blocking the system while waiting for responses.
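The decoupling idea above can be sketched with a plain in-memory queue (this is not Kafka itself, just an illustration of how a buffer lets the producer continue without waiting for the consumer):

```python
# Sketch (not Kafka): a queue decouples producer and consumer,
# so the producer never blocks waiting for a response.
import queue
import threading

events = queue.Queue()
processed = []

def consumer():
    while True:
        event = events.get()
        if event is None:           # sentinel: stop consuming
            break
        processed.append(event.upper())  # "process" the event

worker = threading.Thread(target=consumer)
worker.start()

# The producer enqueues events and immediately continues working.
for payload in ["order-created", "order-paid"]:
    events.put(payload)

events.put(None)    # signal shutdown
worker.join()
print(processed)    # ['ORDER-CREATED', 'ORDER-PAID']
```

A broker like Kafka plays the role of this queue, but durable, replicated, and shared across processes.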




Consumer Offsets:

Offsets are used as indexes into the partition log; the consumer sends them to the broker to mark how far it has read.

Kafka tracks the last processed message for each consumer group in a topic using an internal topic called __consumer_offsets.
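A toy model of this bookkeeping (the names below are illustrative, not the broker's real implementation): the last committed offset is stored per (group, topic, partition), so a restarted consumer resumes where it left off.

```python
# Toy model of __consumer_offsets: Kafka stores the last committed
# offset per (group, topic, partition) so a consumer can resume
# after a restart.
committed_offsets = {}  # (group, topic, partition) -> next offset to read

def commit(group, topic, partition, offset):
    committed_offsets[(group, topic, partition)] = offset

def resume_position(group, topic, partition):
    # Unknown groups start from offset 0 (like auto.offset.reset=earliest)
    return committed_offsets.get((group, topic, partition), 0)

log = ["m0", "m1", "m2", "m3"]            # one partition's log
pos = resume_position("billing", "orders", 0)
for offset in range(pos, 2):              # consume the first two messages
    commit("billing", "orders", 0, offset + 1)

print(resume_position("billing", "orders", 0))  # 2 -> resumes at "m2"
```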

In a monolithic architecture, all components are packaged together as a single codebase/deployable and communicate by calling functions/methods inside the same runtime.
In a microservice architecture, services communicate with each other over the network through APIs or messaging systems (asynchronously).
In distributed systems, communication brings more challenges than in a monolithic architecture. However, it also brings benefits related to building scalable, fault-tolerant applications.




Apache Kafka is a scalable, fault-tolerant, publish-subscribe, streaming messaging system widely used for communication and integration in distributed architectures.

Kafka operates with the following core components:
Message: the unit of data within Kafka, also called an event or record.
Broker: manages data persistence and replication across nodes in a Kafka cluster.
Cluster: composed of multiple brokers.
Producer: generates messages and sends them to specific Kafka topics.
Consumer: reads messages from topics.

Consumer Groups: enable multiple consumers to process data collaboratively, enhancing scalability and fault tolerance. One consumer from a group can read several partitions, but a partition can be assigned to only one consumer within a consumer group.
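The one-consumer-per-partition rule can be sketched with a simple round-robin assignment (real Kafka assignors are more sophisticated, e.g. range, sticky, or cooperative):

```python
# Sketch of round-robin partition assignment within one consumer group:
# every partition gets exactly one consumer, while a consumer may own
# several partitions.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = ["orders-0", "orders-1", "orders-2", "orders-3"]
print(assign(partitions, ["c1", "c2"]))
# {'c1': ['orders-0', 'orders-2'], 'c2': ['orders-1', 'orders-3']}
```

Note that with more consumers than partitions, the extra consumers simply sit idle.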

Topics: named channels to which messages are dispatched; each topic is divided into partitions to distribute load.

Partitions & Replication:

Each topic can have multiple partitions, spread across different brokers.
Partitions can be replicated for fault tolerance; each partition has a designated leader responsible for writes, while replicas handle failover.
Partitions are atomic/indivisible.
The total number of replicas must be <= the number of brokers; exceeding this causes an InvalidReplicationFactorException.
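The replica constraint above can be sketched as follows (a toy placement function, not Kafka's actual replica-assignment algorithm):

```python
# Sketch of the constraint: each replica of a partition must live on a
# different broker, so the replication factor cannot exceed broker count.
def place_replicas(partition, replication_factor, brokers):
    if replication_factor > len(brokers):
        # Real Kafka raises InvalidReplicationFactorException here.
        raise ValueError(
            f"replication factor {replication_factor} > {len(brokers)} brokers")
    # The first assigned broker acts as the partition leader.
    return brokers[:replication_factor]

brokers = ["broker-1", "broker-2", "broker-3"]
print(place_replicas("orders-0", 3, brokers))   # all three brokers used

try:
    place_replicas("orders-0", 4, brokers)
except ValueError as e:
    print("rejected:", e)
```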


Serialization: converts data into binary format for efficient transmission and storage, supporting both native and custom serialization.
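Since Kafka transmits and stores raw bytes, producers serialize and consumers deserialize. A minimal custom JSON serde sketch (real deployments often use Avro, Protobuf, or JSON Schema with a schema registry):

```python
# Kafka moves bytes; the application decides how to encode them.
import json

def serialize(value) -> bytes:
    return json.dumps(value).encode("utf-8")

def deserialize(data: bytes):
    return json.loads(data.decode("utf-8"))

event = {"order_id": 42, "status": "paid"}
wire = serialize(event)           # bytes sent to the broker
assert isinstance(wire, bytes)
print(deserialize(wire))          # {'order_id': 42, 'status': 'paid'}
```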

 

Security Layers in Kafka

Kafka provides robust security options, including:

Encryption: SSL/TLS is used for secure communication between clients and brokers, as well as between brokers.
Authentication: Supports various SASL mechanisms, including OAuth 2, Kerberos, and LDAP, for flexible and secure access control.
Access Control: Kafka implements Access Control Lists (ACLs) and offers Role-Based Access Control (RBAC) for managing permissions.
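An illustrative client configuration combining the first two layers (the property names are standard Kafka client settings; the paths, usernames, and passwords are placeholders):

```properties
# Encryption in transit plus SASL/SCRAM authentication
security.protocol=SASL_SSL
ssl.truststore.location=/etc/kafka/client.truststore.jks
ssl.truststore.password=changeit
sasl.mechanism=SCRAM-SHA-512
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="app-user" password="app-secret";
```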


Design Patterns & Anti-Patterns

As mentioned in previous presentations, some key patterns to use with Kafka include:

CQRS (Command and Query Responsibility Segregation): Separates read and write operations to enhance performance and scalability.
Event Sourcing: Logs all changes as events to ensure an auditable and re-constructable history of the system's state.   
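Event Sourcing maps naturally onto Kafka's append-only topics. A minimal sketch of rebuilding state by replaying events (a toy account example, not tied to any specific library):

```python
# Event-sourcing sketch: state is never stored directly; it is rebuilt
# by replaying the append-only event log (which a Kafka topic models well).
events = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
    {"type": "deposited", "amount": 5},
]

def replay(events):
    balance = 0
    for e in events:
        if e["type"] == "deposited":
            balance += e["amount"]
        elif e["type"] == "withdrawn":
            balance -= e["amount"]
    return balance

print(replay(events))  # 75
```

Because the log is the source of truth, the state is auditable and can be reconstructed at any point by replaying from the beginning.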

Performance Considerations
Compression Trade-offs: Compressing data within Kafka can reduce storage requirements, but it also impacts performance. Brokers and consumers expend resources on data compression and decompression, which can lower both Kafka's throughput and latency.
Fine-tuning Stream Caching and Buffering: Kafka Streams uses caching to reduce the load on state stores and improve processing throughput.
OS Page Cache: Kafka optimizes performance by leveraging the operating system's page cache. By avoiding caching in the JVM heap, brokers help prevent some of the issues that large heaps may have (for example, long or frequent garbage-collection pauses).
Message size: smaller messages are preferred over large ones.
Among others…
There is no silver bullet (one-size-fits-all) solution!
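The compression trade-off above can be illustrated with gzip (one of the codecs Kafka supports, alongside snappy, lz4, and zstd): a repetitive batch shrinks substantially, but compressing and decompressing costs CPU on brokers, producers, and consumers.

```python
# Illustrating the trade-off: gzip shrinks a repetitive batch of
# messages, but costs CPU on both the compress and decompress sides.
import gzip

batch = ('{"user": "alice", "action": "click"}\n' * 1000).encode()

compressed = gzip.compress(batch)
print(len(batch), "->", len(compressed), "bytes")  # large reduction
assert gzip.decompress(compressed) == batch        # lossless round-trip
```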

Schema Registry

A centralized repository for managing and validating the schemas used in serialization, ensuring data consistency and compatibility between producers and consumers.
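A toy sketch of the kind of compatibility check a schema registry performs (this is a simplification; real registries such as Confluent Schema Registry implement several compatibility modes over Avro/Protobuf/JSON Schema): a new schema stays backward compatible if the fields it adds have defaults, so consumers on the new schema can still read old records.

```python
# Toy backward-compatibility check in the spirit of a schema registry.
def backward_compatible(old_schema, new_schema):
    added = set(new_schema) - set(old_schema)
    return all(new_schema[f].get("default") is not None for f in added)

v1 = {"order_id": {"type": "int"}}
v2 = {"order_id": {"type": "int"},
      "currency": {"type": "string", "default": "USD"}}
v3 = {"order_id": {"type": "int"},
      "currency": {"type": "string"}}   # no default -> breaking change

print(backward_compatible(v1, v2))  # True
print(backward_compatible(v1, v3))  # False
```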

 

 


Disclaimer

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.