Latest version: 4.3.x
Query performance on Apache Kafka with SQL
There are many reasons why one would want to explore or query the data using SQL on Apache Kafka. They can range from democratizing and socializing your data (and deliver better data experiences) to debugging a specific transaction or device identifier.
With the native SQL queries that comes with Lenses, that capability is available over any data format (AVRO, XML, JSON, ProtoBuf), and is governed via a data-centric and field-centric security model.
Testing environment
As a testing environment, we are using 5 x Apache Kafka brokers within a DMZ zone, with 1 x instance of Lenses serving traffic over a secure (HTTPS) connection.

SQL queries on Kafka topics with Avro messages
For the first exploration, we will be running a SQL query over a Kafka topic that contains AVRO serialized data in both the key and the value of the messages. The total payload of the data is 500 bytes when un-serialized.
We will be fetching 100K messages, thus downloading a chunk of 50MB of data over the internet by using a simple SQL query over a Kafka topic:

Executing this query takes 18 seconds

Let’s check our Prometheus monitoring to review the performance further. We notice network traffic increased with a peak at ~6MB/sec. CPU usage increased by 30% during the execution of the query. This is because of the deserializing of more than 5K Avro messages/sec, the JVM memory usage, and GC time slightly increased due to the intermediate object creation.

SQL queries on Kafka topics with JSON messages
For the second case, we will run a SQL query over a Kafka topic that contains JSON serialized data. The total payload of the data is 1000+ bytes when un-serialized.
We will be fetching 50K messages, thus downloading a chunk of ~50MB of data over the internet by using simple SQL query over a Kafka topic:

Executing this query takes 15 seconds

Let’s check our Prometheus monitoring to review the performance further. We notice network traffic increased with a peak at ~2MB/sec. CPU usage increased by ~50% during the execution of the query. This is because of the deserializing of more than ~3.3K JSON messages/sec, the JVM memory usage, and GC time slightly increased due to the intermediate object creation.

Summary
While working remotely, in this performance article, we analyzed how to query data on Apache Kafka using SQL queries. It highlights how to deliver secure data access to a real-time environment, and how one can execute queries that run up to 6MB / sec ( about 400% faster than a Netflix stream ) over AVRO data.
Note that the analysis was done over the internet and on a live Apache Kafka cluster. Performance characteristics will also depend on cluster configuration and whether the data is cached, so it might vary.
Technical product notes
For the above article we used the April 2020 release of Lenses 3.0.9 and 3.1.1