Kafka Quotas and per-tenant cost center charging

Question

How can I allocate the costs of my streaming data platform across different cost centers and tenants, i.e. product teams?

Answer

This article shows how cost center allocation (across one or more Apache Kafka clusters) can be implemented using the Kafka Quotas capability. Consumption patterns are used to identify how different product teams consume platform resources, so that they can share the operational cost.

The two key technical elements the reader should be aware of are:

Kafka Quotas - A native capability of Apache Kafka that can throttle the network resources that clients consume.
Client-ID - An optional identifier of a Kafka consumer or Kafka producer that is passed to the Kafka broker with every request. Its sole purpose is to track the source of requests by a logical application name, for monitoring and aggregation.

Multi-tenant example

For example, a Kafka cluster is shared across 3 different product teams (3 lines of business). The total available network I/O is 20 MBytes/sec.

Product team A requires guaranteed produce and consume rates of 10 MB/sec
Product team B requires guaranteed produce and consume rates of 6 MB/sec
Product team C requires guaranteed produce and consume rates of 4 MB/sec

Cost allocation for the above should be 50% Team A, 30% Team B and 20% Team C, i.e. each team's guaranteed rate divided by the 20 MB/sec total.
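The arithmetic behind these percentages can be sketched in a few lines of Python. This is only an illustration: the function name allocation_shares is hypothetical, and it assumes that the guaranteed rates add up to the capacity being charged for.

```python
def allocation_shares(guaranteed_mb_per_sec):
    """Return each tenant's share of the cluster cost, in percent.

    Assumption: cost is split in proportion to the guaranteed rate,
    measured against the sum of all guaranteed rates.
    """
    total = sum(guaranteed_mb_per_sec.values())
    if total <= 0:
        raise ValueError("total guaranteed rate must be positive")
    return {
        tenant: 100 * rate / total
        for tenant, rate in guaranteed_mb_per_sec.items()
    }


# Guaranteed rates (MB/sec) taken from the multi-tenant example above.
shares = allocation_shares({"Team A": 10, "Team B": 6, "Team C": 4})
for tenant, pct in shares.items():
    print(f"{tenant}: {pct:.0f}%")
```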

How to implement Cost allocation

Example Quotas for a multi-tenant Kafka cluster with 3 main projects:

In this example we have implemented one Kafka quota per cost center (or per project team). We use the Client-ID as the main identifier and have set the guaranteed consume and produce rates. (The request percentage quota has been intentionally omitted, as it would not add any value here.)

Note: In addition to the predefined quotas, we have added a threshold of 1 MB/sec for CLIENTS DEFAULT. It is highly recommended to over-allocate and provide a default value for any “unnamed” client. This allows developers to keep using their favorite tools for data productivity and observability, such as kafka-console-consumer or Lenses. Any application (microservice, machine learning, data pipeline) can also still operate in a “slow lane” until it has migrated and properly declares which project it belongs to.
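For readers who prefer the command line, the quotas described above could be applied with the kafka-configs tool that ships with Apache Kafka. The sketch below only builds the command lines and does not run them. The bootstrap address localhost:9092, the client IDs COST-CENTER-1 to COST-CENTER-3 and the conversion of 1 MB to 1,048,576 bytes are assumptions made for illustration.

```python
import shlex

BYTES_PER_MB = 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes


def quota_command(bootstrap_server, rate_mb_per_sec, client_id=None):
    """Build a kafka-configs command that sets produce and consume quotas.

    With client_id=None the quota targets the default for all clients.
    """
    rate = int(rate_mb_per_sec * BYTES_PER_MB)
    entity = ["--entity-name", client_id] if client_id else ["--entity-default"]
    args = [
        "kafka-configs", "--bootstrap-server", bootstrap_server,
        "--alter",
        "--add-config", f"producer_byte_rate={rate},consumer_byte_rate={rate}",
        "--entity-type", "clients", *entity,
    ]
    return shlex.join(args)


# Hypothetical client IDs, one per cost center, with rates in MB/sec.
tenants = {"COST-CENTER-1": 10, "COST-CENTER-2": 6, "COST-CENTER-3": 4}
commands = [quota_command("localhost:9092", mb, cid) for cid, mb in tenants.items()]
commands.append(quota_command("localhost:9092", 1))  # the "slow lane" default

for command in commands:
    print(command)
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; review the output before applying it to a real cluster.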

For the technical reader: keep in mind that Apache Kafka implements a specific set of quota precedence rules. For example, a “named client” will always be allocated to the first matching /clients/<client-id> quota, and any “unnamed client” will fall back to /clients/<default>.
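The client-level part of that precedence can be modelled as a simple lookup. The sketch below is a simplification: it ignores the user-level quotas that Kafka also supports, and the client IDs and the resolve_quota helper are hypothetical names used only for illustration.

```python
DEFAULT = "<default>"

# Quotas in MB/sec, keyed by client ID (hypothetical cost center names).
QUOTAS_MB = {
    "COST-CENTER-1": 10,
    "COST-CENTER-2": 6,
    "COST-CENTER-3": 4,
    DEFAULT: 1,
}


def resolve_quota(client_id, quotas):
    """Return (matched entry, quota) for a client ID.

    An exact /clients/<client-id> match wins; otherwise the client
    falls back to /clients/<default>, if one is defined.
    """
    if client_id in quotas:
        return client_id, quotas[client_id]
    if DEFAULT in quotas:
        return DEFAULT, quotas[DEFAULT]
    return None, None  # no client quota applies in this simplified model


print(resolve_quota("COST-CENTER-2", QUOTAS_MB))
print(resolve_quota("unannotated-app", QUOTAS_MB))
```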

Cost allocation reporting

In a large-scale organization with tens of Kafka clusters, all the Kafka quotas can be exported (consume and produce rates in MB/sec):

project	consume	produce	cluster
COST-CENTER-1	10	10	US-EAST-MSK-1
COST-CENTER-2	6	6	US-EAST-MSK-1
COST-CENTER-3	4	4	US-EAST-MSK-1
COST-CENTER-1	10	10	US-WEST-AZURE-1
COST-CENTER-5	10	10	US-WEST-AZURE-1
COST-CENTER-8	5	5	US-WEST-AZURE-1

And when joined with cost reports (capacity in MB/sec):

cluster	capacity	monthly_cost
US-EAST-MSK-1	20	2000
US-WEST-AZURE-1	25	2500

We can produce rich real-time views in dashboards:

Cost center allocation of streaming data platform
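The join of the two tables above can be sketched in plain Python. The allocation rule is an assumption: each cost center pays monthly_cost × rate / capacity on each cluster, where rate is the larger of its consume and produce quotas. Any capacity that is not covered by a quota stays unallocated in this sketch.

```python
from collections import defaultdict

# Rows copied from the two tables above; rates and capacity in MB/sec.
QUOTAS = [
    ("COST-CENTER-1", 10, 10, "US-EAST-MSK-1"),
    ("COST-CENTER-2", 6, 6, "US-EAST-MSK-1"),
    ("COST-CENTER-3", 4, 4, "US-EAST-MSK-1"),
    ("COST-CENTER-1", 10, 10, "US-WEST-AZURE-1"),
    ("COST-CENTER-5", 10, 10, "US-WEST-AZURE-1"),
    ("COST-CENTER-8", 5, 5, "US-WEST-AZURE-1"),
]
CLUSTERS = {
    "US-EAST-MSK-1": {"capacity": 20, "monthly_cost": 2000},
    "US-WEST-AZURE-1": {"capacity": 25, "monthly_cost": 2500},
}


def allocate_costs(quotas, clusters):
    """Return {project: monthly cost}, summed across all clusters."""
    costs = defaultdict(float)
    for project, consume, produce, cluster in quotas:
        info = clusters[cluster]
        # Assumption: bill on the larger of the two guaranteed rates.
        rate = max(consume, produce)
        costs[project] += info["monthly_cost"] * rate / info["capacity"]
    return dict(costs)


monthly = allocate_costs(QUOTAS, CLUSTERS)
for project, cost in sorted(monthly.items()):
    print(f"{project}: {cost:.2f}")
```

With the sample data, COST-CENTER-1 is charged on both clusters, which is why the result is aggregated per project rather than per cluster.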

Kafka Connect and tenants

When the data platform tenants also use Kafka Connect to bring data in or out of Apache Kafka, the following section is relevant. Additional information can be found in KIP-411.

Kafka Connect assigns a default client.id to tasks in the form:

connector-consumer-{connectorId}-{taskId}     # for sink tasks
connector-producer-{connectorId}-{taskId}     # for source tasks
connector-dlq-producer-{connectorId}-{taskId} # for the dead-letter queue of sink tasks

That means that the above quota-based model for cost allocation will not work for Kafka Connect out of the box, because these generated client IDs do not match any cost center quota.

The solution is to specify producer.client.id and consumer.client.id in the worker configuration properties, as they take precedence over the generated defaults.

grep -i client connect-avro-distributed.properties
producer.client.id=COSTCENTER-1
consumer.client.id=COSTCENTER-1
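The same check can be automated, for example in a deployment pipeline. The sketch below uses a deliberately simplified properties parser (key=value lines only, no line continuations or escapes). The helper names and the sample text are illustrative; only the two property names come from the configuration shown above.

```python
REQUIRED_KEYS = ("producer.client.id", "consumer.client.id")


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_client_ids(text):
    """Return the required client.id keys that are absent or empty."""
    props = parse_properties(text)
    return [key for key in REQUIRED_KEYS if not props.get(key)]


# Illustrative worker configuration fragment.
sample = """
# cost allocation
producer.client.id=COSTCENTER-1
consumer.client.id=COSTCENTER-1
"""

print(missing_client_ids(sample))
```

An empty list means both client IDs are set, so the worker's producers and consumers can be matched to a cost center quota.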

Setting these properties in the Connect worker configuration makes the above solution feasible, as the Client-ID will propagate to the consumers and producers of the Kafka Connect cluster:

kafka-consumer-groups  --describe --group connect-nullsink --bootstrap-server localhost:909
GROUP            TOPIC                       PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                       HOST            CLIENT-ID
connect-mongodb  topic_telecom_italia        0          764             830             66              COSTCENTER-1-2d388c25-6532-43a8-b8cf-fd3bb4b06268 /10.156.0.16    COSTCENTER-1
connect-elastic  topic_iot_position_reports  0          1327            1478            151             COSTCENTER-1-2d388c25-6532-43a8-b8cf-fd3bb4b06268 /10.156.0.16    COSTCENTER-1

To keep a sound architecture around Kafka Connect multi-tenancy, keep best practices such as the single responsibility principle in mind. The ideal architecture is to deploy a small Kafka Connect cluster per data pipeline, rather than overloading a single large Kafka Connect cluster with multiple types of connectors.

How Lenses can help

Lenses can help deliver a multi-tenant data platform in the following key areas:

Quotas / Cost Allocation - Apply quotas and allocate cost with automation
Data-centric security model - Empower people to access the data platform (avoiding the security gaps of the Kafka ecosystem) and enable different roles and permissions per tenant and team
RBAC security over Kafka Connect - Introduced in Lenses version 4.1 (see Kafka Connect security)
DAD / Distributed Application Deployment framework (aimed to be released in 2021) - Automates the deployment of Kafka Connect pipelines natively within Kubernetes, with embedded monitoring, alerting and cost allocation