Latest version: 4.3.x
Regex within Lenses.io
Regex matching and extraction
In this tutorial you will learn how to use the regexp
and regex_matches
functions to match and extract data from text/string fields.
Applying a regex to validate or extract text data is a very common scenario especially when working with string/text data sources like log files.
This tutorial is split in 5 main stages:
- Creating and populating an example topic.
- Filtering out non-standard log messages.
- Extracting patterns from a string (capturing groups)
- Creating complex structs from arrays of capturing groups
- Turning it into a processor
NOTE: All the code snippets bellow apart from step 5) can be used within SQL Studio. Step 5) is to be used to create a new processor.
Step 01 - Demo data
Running the following code in SQL Studio should enable you to create a table (log_lines
) with some logs taken from a
Zookeeper instance:
CREATE TABLE log_lines(_key LONG, _value.log_line STRING) FORMAT (LONG, AVRO);
INSERT INTO log_lines(_key, log_line) VALUES
(1, '[2021-09-22 08:51:39,347] INFO clientPortAddress is 0.0.0.0:2181 (org.apache.zookeeper.server.quorum.QuorumPeerConfig)'),
(2, '[2021-09-22 08:51:39,347] INFO secureClientPort is not set (org.apache.zookeeper.server.quorum.QuorumPeerConfig)'),
(3, '[2021-09-22 08:51:39,347] INFO Starting server (org.apache.zookeeper.server.ZooKeeperServerMain)'),
(4, '[2021-09-22 08:51:39,351] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)'),
(5, '[2021-09-22 10:52:39,347] INFO Starting server (org.apache.zookeeper.server.ZooKeeperServerMain)'),
(6, '[2021-09-22 10:52:42,351] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)'),
(7, '[2021-09-22 10:52:50,365] INFO Server environment:zookeeper.version=3.5.9-83df9301aa5c2a5d284a9940177808c01bc35cef, built on 01/06/2021 20:03 GMT (org.apache.zookeeper.server.ZooKeeperServer)')
(8, '[19283312] JFHR Server environment:zookeeper.version=3.5.9-83df9301aa5c2a5d284a9940177808c01bc35cef, built on 01/06/2021 20:03 GMT (org.apache.zookeeper.server.ZooKeeperServer)')
;
Step 02 - Filtering messages
Since we’re dealing with strings we should first filter any log lines that do not match our existing assumptions and in that
way avoid errors. As such we first filter out any messages that do not respect the following regex ^\[\d{4}-\d{2}-\d{2} \d{2}\:\d{2}:\d{2},\d{3}\] (INFO|ERROR|WARN).*$
:
SELECT
*
FROM
log_lines
WHERE
SIZEOF(
regexp(log_line, "^\[\d{4}-\d{2}-\d{2} \d{2}\:\d{2}:\d{2},\d{3}\] (INFO|ERROR|WARN).*$")
) >= 1
Notice that _key
= 8
was filtered out from the result set since it does not start with a valid date format.
Step 03 - Capturing groups
Now that we know the data we’re working with matches our expectations, the next step is to extract the relevant fields.
For that we’ll use the regex_matches
function which allows us to define capturing groups allowing us to deconstruct our
string based on a pattern:
SELECT
log_line,
regex_matches(log_line, '^\[(\d{4})-(\d{2})-(\d{2}) (\d{2})\:(\d{2}):(\d{2}),(\d{3})\] (INFO|ERROR|WARN) (.*)$') as matches
FROM
log_lines
WHERE
SIZEOF(regexp(log_line, '^\[\d{4}-\d{2}-\d{2} \d{2}\:\d{2}:\d{2},\d{3}\] (INFO|ERROR|WARN).*$')) >= 1
Notice that in the result set, we now have a new property “matches” which is a 2D array. This is because regex_matches
will return one set of results every time the whole regex provided is matched, since our regex matches the whole string,
we only get one result per message.
Step 04 - Creating a struct
A closer look at the matches property seems to indicate that:
- all
matches
have only size = 1 - all arrays inside matches have size = 10 (the same number of capturing groups we defined plus one)
As such we can extract the fields using the following query:
SELECT
CAST(matches[0][1] AS INT) as year,
CAST(matches[0][2] AS INT) as month,
CAST(matches[0][3] AS INT) as day,
CAST(matches[0][4] AS INT) as hour,
CAST(matches[0][5] AS INT) as minutes,
CAST(matches[0][6] AS INT) as seconds,
CAST(matches[0][7] AS INT) as millis,
matches[0][8] as log_level,
matches[0][9] as log_msg
FROM
(
SELECT
log_line,
regex_matches(log_line, '^\[(\d{4})-(\d{2})-(\d{2}) (\d{2})\:(\d{2}):(\d{2}),(\d{3})\] (INFO|ERROR|WARN) (.*)$') as matches
FROM
log_lines
WHERE
SIZEOF(regexp(log_line, '^\[\d{4}-\d{2}-\d{2} \d{2}\:\d{2}:\d{2},\d{3}\] (INFO|ERROR|WARN).*$')) >= 1
)
Step 05 - Creating a processor
Finally, the last step is to make a processor that will continuously apply the changes we’ve describe above. That can be done by creating a processor with the following query:
SET defaults.topic.autocreate=true;
INSERT INTO
structured_logs
SELECT STREAM
CAST(matches[0][1] AS INT) as year,
CAST(matches[0][2] AS INT) as month,
CAST(matches[0][3] AS INT) as day,
CAST(matches[0][4] AS INT) as hour,
CAST(matches[0][5] AS INT) as minutes,
CAST(matches[0][6] AS INT) as seconds,
CAST(matches[0][7] AS INT) as millis,
matches[0][8] as log_level,
matches[0][9] as log_msg
FROM
(
SELECT STREAM
log_line,
regex_matches(log_line, '^\[(\d{4})-(\d{2})-(\d{2}) (\d{2})\:(\d{2}):(\d{2}),(\d{3})\] (INFO|ERROR|WARN) (.*)$') as matches
FROM
log_lines
WHERE
SIZEOF(regexp(log_line, '^\[\d{4}-\d{2}-\d{2} \d{2}\:\d{2}:\d{2},\d{3}\] (INFO|ERROR|WARN).*$')) >= 1
)