What is Kafka?
Apache Kafka is an open-source
distributed event streaming platform used by thousands of companies for
high-performance data pipelines, streaming analytics, data integration, and
mission-critical applications. (kafka.apache.org)
•High throughput
•Scalable
•Permanent storage
•High availability
•In a synchronous system, a service makes a request, and that thread/process
must wait until it receives a response. While this flow is simpler to code (a
straightforward structure), it may introduce delays and create a bottleneck.
•In contrast, asynchronous processing, as explained in the EDA (Event-Driven
Architecture) presentation by Fusse, allows events to be processed without
waiting for an immediate response. Messaging is a common way to implement
loose coupling: messages are sent asynchronously, so services can continue
working and receive responses later.
•By using loosely coupled services, we can achieve a scalable architecture
without blocking the system while waiting for responses.
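The difference between the two styles can be sketched with Python's standard library alone. This is a minimal illustration, not Kafka code: the `handle` function, the in-process queue, and the order IDs are all hypothetical stand-ins for a downstream service and a message broker.

```python
import queue
import threading


def handle(order_id):
    """Stand-in for the work a downstream service performs."""
    return f"processed {order_id}"


# Synchronous style: the caller blocks until handle() returns.
sync_result = handle("order-1")

# Asynchronous style: the caller puts a message on a queue and moves on;
# a separate worker picks it up and processes it later.
inbox = queue.Queue()
results = []


def worker():
    while True:
        message = inbox.get()
        if message is None:  # sentinel: no more messages
            break
        results.append(handle(message))


consumer_thread = threading.Thread(target=worker)
consumer_thread.start()

for order_id in ("order-2", "order-3"):
    inbox.put(order_id)  # returns immediately, without waiting on handle()

inbox.put(None)
consumer_thread.join()
```

The sender never calls `handle` directly in the asynchronous half, which is the loose coupling described above: sender and worker only share the queue.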
Consumer Offsets:
•Offsets are indexes into the log; the consumer sends (commits) them to the
broker to record its position.
•Kafka tracks the last processed message for each consumer group in a topic
using an internal topic called __consumer_offsets.
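As a rough mental model of offset tracking, the sketch below uses a plain list as one partition's log and a dictionary in place of the __consumer_offsets topic. The `PartitionLog` class and the group names are hypothetical; real Kafka clients handle this through their own commit APIs.

```python
class PartitionLog:
    """Toy model of one partition: an append-only list plus committed offsets."""

    def __init__(self):
        self.records = []
        self.committed = {}  # group id -> next offset to read

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset of the new record

    def poll(self, group, max_records=2):
        start = self.committed.get(group, 0)
        return list(enumerate(self.records[start:start + max_records], start))

    def commit(self, group, offset):
        # By convention the committed offset is the next record to read.
        self.committed[group] = offset + 1


log = PartitionLog()
for value in ("a", "b", "c"):
    log.append(value)

batch = log.poll("billing")          # first two records for this group
log.commit("billing", batch[-1][0])  # "billing" will resume at offset 2
remaining = log.poll("billing")
audit_view = log.poll("audit")       # a different group starts from offset 0
```

Each group keeps its own position, so "audit" can re-read records that "billing" has already processed.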
•In a monolithic architecture, all components are packaged together as a
single code base/entity and communicate by calling functions/methods within
the same runtime.
•In a microservice architecture, services communicate with each other over the
network through APIs or messaging systems (asynchronous).
•In distributed systems, communication brings more challenges than in a
monolithic architecture. However, it also brings benefits: applications can be
scalable and fault tolerant.
Apache Kafka is a scalable, fault-tolerant, publish-subscribe, streaming
messaging system widely used for communication and integration in distributed
architectures.
Kafka operates with the following core components:
•Message: the unit of data within Kafka, also called an event or record.
•Broker: manages data persistence and replication across nodes in a Kafka
cluster.
•Cluster: is composed of multiple brokers.
•Producer: generates messages and sends them to specific Kafka topics.
•Consumer: reads messages from topics.
Consumer Groups: enable multiple consumers to process data collaboratively,
enhancing scalability and fault tolerance. One consumer in a group can read
several partitions, but each partition is read by only one consumer within
that group.
Topics: logical channels to which messages are published; each topic is
divided into partitions to distribute load.
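The consumer-group rule above (a consumer may own several partitions, but each partition has exactly one owner per group) can be illustrated with a simplified round-robin assignment. Kafka ships several assignment strategies of its own; the function below is only a hypothetical sketch of the idea, not Kafka's implementation.

```python
def assign_round_robin(partitions, consumers):
    """Give each partition to exactly one consumer in the group."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Four partitions shared by two consumers: each consumer owns two.
two_consumers = assign_round_robin([0, 1, 2, 3], ["c1", "c2"])

# More consumers than partitions: the extra consumer stays idle.
three_consumers = assign_round_robin([0, 1], ["c1", "c2", "c3"])
```

The second call shows why adding consumers beyond the partition count does not increase parallelism within a group.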
Partitions & Replication:
•Each topic can have multiple partitions, spread across different brokers.
•Partitions can be replicated for fault tolerance; each has a designated
leader responsible for writes, while replicas handle failover.
•Partitions are atomic/indivisible.
•The replication factor (total number of replicas per partition) must be <=
the number of brokers; exceeding this causes an
InvalidReplicationFactorException.
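To make the replication constraint concrete, here is a small sketch that spreads replicas over brokers with a simple rotation and rejects a replication factor larger than the broker count. The placement scheme and the broker names are assumptions for illustration; Kafka's real placement logic is more involved, and it signals the error with InvalidReplicationFactorException rather than a Python `ValueError`.

```python
def place_replicas(num_partitions, replication_factor, brokers):
    """Return {partition: [leader, follower, ...]} using a simple rotation."""
    if replication_factor > len(brokers):
        # Kafka rejects this case with InvalidReplicationFactorException.
        raise ValueError(
            f"replication factor {replication_factor} exceeds "
            f"{len(brokers)} available brokers"
        )
    placement = {}
    for partition in range(num_partitions):
        # Each replica of a partition lands on a different broker.
        placement[partition] = [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
    return placement


layout = place_replicas(3, 2, ["b1", "b2", "b3"])
```

In this toy layout the first broker in each list plays the leader role and the remaining ones are followers ready for failover.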
Serialization: converts data into a binary format for efficient transmission
and storage, with support for both native (built-in) and custom serializers.
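A custom serializer can be as simple as a pair of functions that turn an object into bytes and back. The sketch below uses JSON from the Python standard library; the function names and the sample event are hypothetical, and a real client would be configured to call functions like these on each message.

```python
import json


def serialize(event):
    """Encode a dict as compact UTF-8 JSON bytes."""
    return json.dumps(event, separators=(",", ":"), sort_keys=True).encode("utf-8")


def deserialize(payload):
    """Decode UTF-8 JSON bytes back into a dict."""
    return json.loads(payload.decode("utf-8"))


event = {"order_id": 42, "status": "created"}
payload = serialize(event)      # bytes, ready to be sent as a message value
restored = deserialize(payload)
```

Producer and consumer must agree on the format: a consumer that expects JSON cannot read a payload written with a different serializer.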
Security Layers in Kafka
Kafka provides robust security options, including:
•Encryption:
SSL/TLS is used for secure communication between clients and brokers, as well
as between brokers.
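•Authentication: Supports various SASL mechanisms, including GSSAPI
(Kerberos), PLAIN, SCRAM, and OAUTHBEARER (OAuth 2), for flexible and secure
authentication. LDAP is not a SASL mechanism itself; LDAP directories can be
integrated behind a mechanism such as PLAIN.
•Access Control: Kafka implements Access Control Lists (ACLs); Role-Based
Access Control (RBAC) is offered by some distributions, such as Confluent
Platform, for managing permissions.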
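On the client side, encryption and authentication are usually switched on through configuration. The sketch below builds such a configuration as a plain dictionary. The property names follow those used by librdkafka-based clients such as confluent-kafka (an assumption about which client is in use), and the host, credentials, and certificate path are placeholders.

```python
def secure_client_config(bootstrap_servers, username, password, ca_path):
    """Build a client configuration dict for TLS encryption plus SASL/SCRAM."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "security.protocol": "SASL_SSL",   # TLS for encryption, SASL for authentication
        "sasl.mechanism": "SCRAM-SHA-512",
        "sasl.username": username,
        "sasl.password": password,
        "ssl.ca.location": ca_path,        # CA certificate used to verify the brokers
    }


# Placeholder values for illustration only.
config = secure_client_config(
    "broker1.example.com:9093", "app-user", "app-secret", "/etc/kafka/ca.pem"
)
```

Authorization (ACLs) is enforced on the broker side, so it does not appear in the client configuration.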
Design Patterns & Anti-Patterns
As mentioned in previous presentations, some key patterns to use with Kafka
include:
•CQRS (Command and Query Responsibility Segregation): Separates read and write
operations to enhance performance and scalability.
•Event Sourcing: Logs all changes as events to ensure an auditable and
reconstructable history of the system's state.
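Both patterns can be shown together in a few lines. In this sketch an in-memory list stands in for a Kafka topic, commands append events to it (the write side), and queries read a separate projection (the read side). The account example and every name in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    account: str
    kind: str    # "deposited" or "withdrawn"
    amount: int


event_log = []   # write side: append-only history of changes
balances = {}    # read side: a projection optimized for queries


def apply_event(state, event):
    """Fold one event into a balances dict."""
    delta = event.amount if event.kind == "deposited" else -event.amount
    state[event.account] = state.get(event.account, 0) + delta


def handle_command(account, kind, amount):
    """Command path: record the change as an event, then update the projection."""
    event = Event(account, kind, amount)
    event_log.append(event)
    apply_event(balances, event)


def get_balance(account):
    """Query path: reads only the projection, never the log."""
    return balances.get(account, 0)


def replay(events):
    """Rebuild state from scratch by replaying the full history."""
    state = {}
    for event in events:
        apply_event(state, event)
    return state


handle_command("alice", "deposited", 100)
handle_command("alice", "withdrawn", 30)
handle_command("bob", "deposited", 5)
rebuilt = replay(event_log)
```

Because the log keeps every change, the projection can be discarded and rebuilt at any time, which is what makes the history auditable and reconstructable.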
Performance Considerations
•Compression Trade-offs: Compressing data within Kafka can reduce storage
requirements, but it also impacts performance. Producers, brokers, and
consumers expend CPU on compression and decompression, which can lower
Kafka's throughput and increase latency.
•Fine-tuning Stream Caching and Buffering: Kafka Streams uses caching to
reduce the load on state stores and improve processing.
•OS Page Cache: Kafka optimizes performance by leveraging the operating
system's page cache. By avoiding caching in the JVM heap, brokers can help
prevent some of the issues that large heaps may have (for example, long or
frequent garbage collection pauses).
•Message size: smaller messages are preferred over large ones.
•Among others…
•There is no silver bullet (one-size-fits-all solution)!
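The compression trade-off from the list above can be observed directly: compressing a batch saves bytes but costs CPU time on both ends. The sketch below uses gzip from the Python standard library on a made-up, highly repetitive batch. Kafka supports several codecs (gzip, snappy, lz4, zstd), and actual ratios and timings depend heavily on the data and the hardware.

```python
import gzip
import time

# Repetitive JSON-like payload; real-world ratios depend heavily on the data.
batch = b'{"sensor":"s-1","status":"ok","reading":21.5}\n' * 2000

start = time.perf_counter()
compressed = gzip.compress(batch)
compress_seconds = time.perf_counter() - start

start = time.perf_counter()
restored = gzip.decompress(compressed)
decompress_seconds = time.perf_counter() - start

ratio = len(compressed) / len(batch)
print(f"raw={len(batch)} B, compressed={len(compressed)} B, ratio={ratio:.3f}")
print(f"compress={compress_seconds:.6f} s, decompress={decompress_seconds:.6f} s")
```

Running the same measurement on representative production messages is a reasonable way to decide whether compression pays off for a given topic.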
Schema Registry
•Centralized repository for managing schemas and validating data structures
against them during serialization, to ensure data consistency and
compatibility.
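The idea of a schema registry can be sketched with a toy in-memory version that stores schema versions per subject and checks records against the latest one. Everything here is hypothetical: real registries work with formats such as Avro, JSON Schema, or Protobuf and apply compatibility rules between versions, which this sketch does not attempt.

```python
class InMemorySchemaRegistry:
    """Toy registry: subject -> list of schema versions (field name -> type)."""

    def __init__(self):
        self._subjects = {}

    def register(self, subject, schema):
        versions = self._subjects.setdefault(subject, [])
        versions.append(dict(schema))
        return len(versions)  # version numbers start at 1

    def latest(self, subject):
        return self._subjects[subject][-1]

    def validate(self, subject, record):
        """Check that the record has exactly the schema's fields and types."""
        schema = self.latest(subject)
        return set(record) == set(schema) and all(
            isinstance(record[name], kind) for name, kind in schema.items()
        )


registry = InMemorySchemaRegistry()
version = registry.register("orders-value", {"order_id": int, "status": str})
ok = registry.validate("orders-value", {"order_id": 1, "status": "created"})
bad = registry.validate("orders-value", {"order_id": "1", "status": "created"})
```

A producer would run a check like `validate` before serializing, so malformed records are rejected before they reach a topic.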