Splitting string functions

String splitting functions provide powerful tools for dividing strings into arrays based on various criteria.

ClickHouse function reference

splitByChar

Splits a string into substrings separated by a specified character. Syntax:

splitByChar(separator, s[, max_substrings])

Arguments:

separator (String): The separator character.
s (String): The string to split.
max_substrings (UInt64, optional): Maximum number of substrings to return. Default is 0 (return all substrings).

Returns: An array of selected substrings. (Array(String)) Example:

SELECT
  splitByChar(',', 'carnitas,al_pastor,barbacoa,lengua');

Result:

| splitByChar(',', 'carnitas,al_pastor,barbacoa,lengua') |
|--------------------------------------------------------|
| ['carnitas','al_pastor','barbacoa','lengua']           |

In this example, we split a string of taco fillings into an array using a comma as the separator.

Empty substrings may be selected when:

A separator occurs at the beginning or end of the string
There are multiple consecutive separators
The original string is empty

When max_substrings is 0 or omitted, the function returns as many substrings as possible.

splitByString

Splits a string into substrings separated by a specified string delimiter. Syntax:

splitByString(separator, s[, max_substrings])

Arguments:

separator (String): The string separator.
s (String): The string to split.
max_substrings (UInt64, optional): Maximum number of substrings to return. Default is 0 (return all substrings).

Returns: An array of selected substrings. (Array(String))

Empty substrings may be selected when:
- A non-empty separator occurs at the beginning or end of the string
- There are multiple consecutive non-empty separators
- The original string s is empty while the separator is not empty
If separator is an empty string, it splits the string s into an array of single characters

Example:

SELECT
  splitByString(', ', 'beef, chicken, carnitas, barbacoa') AS taco_fillings;

Result:

| taco_fillings                                |
|----------------------------------------------|
| ['beef','chicken','carnitas','barbacoa']     |

In this example, we split a string of taco fillings into an array using ’, ’ as the separator. Example with empty separator:

SELECT
  splitByString('', 'salsa') AS salsa_letters;

Result:

| salsa_letters       |
|---------------------|
| ['s','a','l','s']   |

When the separator is an empty string, the function splits the input string into individual characters. Example with max_substrings:

SELECT
  splitByString(', ', 'mild, medium, hot, extra hot', 2) AS sauce_levels;

Result:

| sauce_levels      |
|-------------------|
| ['mild','medium'] |

By specifying max_substrings as 2, we limit the output to the first two substrings.

splitByRegexp

Splits a string into substrings using a regular expression as the separator. Syntax:

splitByRegexp(regexp, s[, max_substrings])

Arguments:

regexp (String or FixedString): Regular expression used as the separator.
s (String): The string to split.
max_substrings (UInt64, optional): Maximum number of substrings to return. Default is 0 (return all substrings).

Returns: An array of selected substrings. (Array(String)) Example:

SELECT
  splitByRegexp('\\s+', 'carnitas tacos al pastor barbacoa');

Result:

| splitByRegexp('\\s+', 'carnitas tacos al pastor barbacoa') |
|------------------------------------------------------------|
| ['carnitas','tacos','al','pastor','barbacoa']              |

In this example, the function splits the taco order into individual words using whitespace as the separator.

Empty substrings may be selected when:
- A non-empty regular expression match occurs at the beginning or end of the string
- There are multiple consecutive non-empty regular expression matches
- The original string s is empty while the regular expression is not empty
If no match is found for the regular expression, the string s won’t be split.

Example with max_substrings:

SELECT
  splitByRegexp('\\s+', 'carnitas tacos al pastor barbacoa', 3);

Result:

| splitByRegexp('\\s+', 'carnitas tacos al pastor barbacoa', 3) |
|---------------------------------------------------------------|
| ['carnitas','tacos','al']                                     |

This function is useful for parsing structured text data, extracting specific patterns, or tokenizing strings based on complex separators.

splitByWhitespace

Splits a string into substrings separated by whitespace characters. Syntax:

splitByWhitespace(s[, max_substrings])

Arguments:

s (String): The string to split.
max_substrings (UInt64, optional): The maximum number of substrings to return.
- Default value: 0 (return all possible substrings)

Returns: An array of selected substrings. (Array(String)) Example:

SELECT
  splitByWhitespace('Carne asada tacos al pastor') AS taco_types;

Result:

| taco_types                                 |
|--------------------------------------------|
| ['Carne','asada','tacos','al','pastor']    |

In this example, the function splits the taco order into individual words.

Empty substrings are not included in the result.
When max_substrings > 0, the function returns at most that many substrings.

Example with max_substrings:

SELECT
	splitByWhitespace('Carne asada tacos al pastor', 3) AS limited_taco_types;

Result:

| limited_taco_types         |
|----------------------------|
| ['Carne','asada','tacos']  |

This function is useful for parsing space-separated values or tokenizing text for further analysis.

splitByNonAlpha

Splits a string into substrings separated by non-alphabetic characters (whitespace and punctuation). Returns an array of selected substrings. Syntax:

splitByNonAlpha(s[, max_substrings])

Arguments:

s (String): The string to split.
max_substrings (UInt64, optional): The maximum number of substrings to return. Default is 0 (return all substrings).

Returns: An array of selected substrings. (Array(String)) Example:

SELECT
  splitByNonAlpha('Crunchy tacos: $3.99, Soft tacos: $3.50!');

Result:

| splitByNonAlpha('Crunchy tacos: $3.99, Soft tacos: $3.50!')      |
|------------------------------------------------------------------|
| ['Crunchy','tacos','3','99','Soft','tacos','3','50']             |

In this example, the function splits the taco menu string into an array of words and numbers, removing all non-alphabetic characters. Example with max_substrings:

SELECT
	splitByNonAlpha('Crunchy tacos: $3.99, Soft tacos: $3.50!', 3) AS limited_taco_types;

Result:

| splitByNonAlpha('Crunchy tacos: $3.99, Soft tacos: $3.50!', 3)  |
|-----------------------------------------------------------------|
| ['Crunchy','tacos','3']                                         |

This function is useful for tokenizing text, extracting words from sentences, or parsing structured strings where non-alphabetic characters serve as delimiters.

arrayStringConcat

Concatenates string representations of values listed in the array with an optional separator. Syntax:

arrayStringConcat(arr[, separator])

Arguments:

arr (Array): The input array of strings to concatenate.
separator (String, optional): The separator to use between array elements. Defaults to an empty string.

Returns: A string containing the concatenated elements of the input array. (String) Example:

SELECT
  arrayStringConcat(['Carne Asada', 'Al Pastor', 'Pollo'], ', ') AS taco_order;

Result:

| taco_order                      |
|---------------------------------|
| Carne Asada, Al Pastor, Pollo   |

In this example, arrayStringConcat joins the array of taco types into a single string, separating each type with a comma and space.

If the separator is omitted, the array elements are concatenated without any characters between them.

alphaTokens

Selects substrings of consecutive bytes from the ranges a-z and A-Z. Returns an array of substrings. Syntax:

alphaTokens(s[, max_substrings])

Alias:

splitByAlpha

Arguments:

s (String): The string to split.
max_substrings (UInt64, optional): Maximum number of substrings to return. Default is 0 (return all substrings).

Returns: An array of selected substrings. [Array(String)] Example:

SELECT
  alphaTokens('Crunchy Taco Supreme 2.99') AS taco_tokens;

Result:

| taco_tokens                     |
|---------------------------------|
| ['Crunchy','Taco','Supreme']    |

In this example, alphaTokens extracts all consecutive alphabetic substrings from the taco menu item, ignoring numbers and spaces.

extractAllGroups

Extracts all groups from non-overlapping substrings matched by a regular expression. Syntax:

extractAllGroups(text, regexp)

Arguments:

text (String or FixedString): The input text to search for matches.
regexp (String or FixedString): The regular expression to match against. Must be a constant.

Returns:

An array of arrays containing the matched groups. [Array(Array(String))]
- If at least one matching group is found, returns an array of arrays, where each inner array represents a match and contains the captured groups.
- If no matching groups are found, returns an empty array.

Example:

SELECT
	extractAllGroups('Carnitas Taco=$5.99, Fish Taco="$6.50"', '("[^"]+"|\\w+)=("[^"]+"|\\w+)') AS taco_prices;

Result:

| taco_prices                                         |
|-----------------------------------------------------|
| [['Carnitas Taco','5.99'],['Fish Taco','"$6.50"']]  |

In this example:

The function extracts price information for different types of tacos.
Each inner array contains two elements: the taco name and its price.
The regular expression matches patterns like name=price, where both name and price can be quoted or unquoted.

This function is particularly useful for parsing structured text data, such as log files or semi-structured data formats, where you need to extract multiple pieces of information based on a pattern.

ngrams

Splits a UTF-8 string into n-grams of a specified size. Syntax:

ngrams(string, ngramsize)

Arguments:

string (String or FixedString): The input string to split into n-grams.
ngramsize (UInt): The size of each n-gram.

Returns: An array of strings, where each string is an n-gram of the specified size. [Array(String)] Example:

SELECT
  ngrams('Taco Tuesday', 3) AS taco_grams;

Result:

| taco_grams                                          |
|-----------------------------------------------------|
| ['Tac','aco',' Tu','Tue','ues','esd','sda','day']   |

In this example, the function splits the phrase “Taco Tuesday” into 3-grams. Each element in the resulting array represents a substring of 3 consecutive characters from the input string.

The function considers UTF-8 characters, so it works correctly with multi-byte characters as well.

This function is useful for text analysis, particularly in natural language processing tasks such as text classification or similarity measurement.

tokens

Splits a string into tokens using non-alphanumeric ASCII characters as separators. Syntax:

tokens(input_string)

Arguments:

input_string (String): The string to be tokenized.

Returns: An array of tokens extracted from the input string. [Array(String)] Example:

SELECT
  tokens('spicy_taco, mild_salsa; extra_guacamole') AS taco_ingredients;

Result:

| taco_ingredients                                       |
|--------------------------------------------------------|
| ['spicy','taco','mild','salsa','extra','guacamole']    |

In this example, the tokens function splits the input string into separate words, removing the non-alphanumeric separators (underscore, comma, semicolon, and space).

The function treats any sequence of non-alphanumeric ASCII characters as a separator. Multiple consecutive separators are treated as a single separator.

This function is useful for text processing tasks such as tokenizing ingredient lists, parsing user input, or preparing text for further analysis.

Introduction

Statements

Splitting string functions

ClickHouse function reference

splitByChar

splitByString

splitByRegexp

splitByWhitespace

splitByNonAlpha

arrayStringConcat

alphaTokens

extractAllGroups

ngrams

tokens

Introduction

Statements

​ClickHouse function reference

​splitByChar

​splitByString

​splitByRegexp

​splitByWhitespace

​splitByNonAlpha

​arrayStringConcat

​alphaTokens

​extractAllGroups

​ngrams

​tokens

ClickHouse function reference

splitByChar

splitByString

splitByRegexp

splitByWhitespace

splitByNonAlpha

arrayStringConcat

alphaTokens

extractAllGroups

ngrams

tokens