String splitting functions provide powerful tools for dividing strings into arrays based on various criteria.

ClickHouse function reference

splitByChar

Splits a string into substrings separated by a specified character. Syntax:
splitByChar(separator, s[, max_substrings])
Arguments:
  • separator (String): The separator character.
  • s (String): The string to split.
  • max_substrings (UInt64, optional): Maximum number of substrings to return. Default is 0 (return all substrings).
Returns: An array of selected substrings. (Array(String)) Example:
SELECT
  splitByChar(',', 'carnitas,al_pastor,barbacoa,lengua');
Result:
| splitByChar(',', 'carnitas,al_pastor,barbacoa,lengua') |
|--------------------------------------------------------|
| ['carnitas','al_pastor','barbacoa','lengua']           |
In this example, we split a string of taco fillings into an array using a comma as the separator.
Empty substrings may be selected when:
  • A separator occurs at the beginning or end of the string
  • There are multiple consecutive separators
  • The original string is empty
When max_substrings is 0 or omitted, the function returns as many substrings as possible.

splitByString

Splits a string into substrings separated by a specified string delimiter. Syntax:
splitByString(separator, s[, max_substrings])
Arguments:
  • separator (String): The string separator.
  • s (String): The string to split.
  • max_substrings (UInt64, optional): Maximum number of substrings to return. Default is 0 (return all substrings).
Returns: An array of selected substrings. (Array(String))
  • Empty substrings may be selected when:
    • A non-empty separator occurs at the beginning or end of the string
    • There are multiple consecutive non-empty separators
    • The original string s is empty while the separator is not empty
  • If separator is an empty string, it splits the string s into an array of single characters
Example:
SELECT
  splitByString(', ', 'beef, chicken, carnitas, barbacoa') AS taco_fillings;
Result:
| taco_fillings                                |
|----------------------------------------------|
| ['beef','chicken','carnitas','barbacoa']     |
In this example, we split a string of taco fillings into an array using ’, ’ as the separator. Example with empty separator:
SELECT
  splitByString('', 'salsa') AS salsa_letters;
Result:
| salsa_letters       |
|---------------------|
| ['s','a','l','s']   |
When the separator is an empty string, the function splits the input string into individual characters. Example with max_substrings:
SELECT
  splitByString(', ', 'mild, medium, hot, extra hot', 2) AS sauce_levels;
Result:
| sauce_levels      |
|-------------------|
| ['mild','medium'] |
By specifying max_substrings as 2, we limit the output to the first two substrings.

splitByRegexp

Splits a string into substrings using a regular expression as the separator. Syntax:
splitByRegexp(regexp, s[, max_substrings])
Arguments:
  • regexp (String or FixedString): Regular expression used as the separator.
  • s (String): The string to split.
  • max_substrings (UInt64, optional): Maximum number of substrings to return. Default is 0 (return all substrings).
Returns: An array of selected substrings. (Array(String)) Example:
SELECT
  splitByRegexp('\\s+', 'carnitas tacos al pastor barbacoa');
Result:
| splitByRegexp('\\s+', 'carnitas tacos al pastor barbacoa') |
|------------------------------------------------------------|
| ['carnitas','tacos','al','pastor','barbacoa']              |
In this example, the function splits the taco order into individual words using whitespace as the separator.
  1. Empty substrings may be selected when:
    • A non-empty regular expression match occurs at the beginning or end of the string
    • There are multiple consecutive non-empty regular expression matches
    • The original string s is empty while the regular expression is not empty
  2. If no match is found for the regular expression, the string s won’t be split.
Example with max_substrings:
SELECT
  splitByRegexp('\\s+', 'carnitas tacos al pastor barbacoa', 3);
Result:
| splitByRegexp('\\s+', 'carnitas tacos al pastor barbacoa', 3) |
|---------------------------------------------------------------|
| ['carnitas','tacos','al']                                     |
This function is useful for parsing structured text data, extracting specific patterns, or tokenizing strings based on complex separators.

splitByWhitespace

Splits a string into substrings separated by whitespace characters. Syntax:
splitByWhitespace(s[, max_substrings])
Arguments:
  • s (String): The string to split.
  • max_substrings (UInt64, optional): The maximum number of substrings to return.
    • Default value: 0 (return all possible substrings)
Returns: An array of selected substrings. (Array(String)) Example:
SELECT
  splitByWhitespace('Carne asada tacos al pastor') AS taco_types;
Result:
| taco_types                                 |
|--------------------------------------------|
| ['Carne','asada','tacos','al','pastor']    |
In this example, the function splits the taco order into individual words.
  1. Empty substrings are not included in the result.
  2. When max_substrings > 0, the function returns at most that many substrings.
Example with max_substrings:
SELECT
	splitByWhitespace('Carne asada tacos al pastor', 3) AS limited_taco_types;
Result:
| limited_taco_types         |
|----------------------------|
| ['Carne','asada','tacos']  |
This function is useful for parsing space-separated values or tokenizing text for further analysis.

splitByNonAlpha

Splits a string into substrings separated by non-alphabetic characters (whitespace and punctuation). Returns an array of selected substrings. Syntax:
splitByNonAlpha(s[, max_substrings])
Arguments:
  • s (String): The string to split.
  • max_substrings (UInt64, optional): The maximum number of substrings to return. Default is 0 (return all substrings).
Returns: An array of selected substrings. (Array(String)) Example:
SELECT
  splitByNonAlpha('Crunchy tacos: $3.99, Soft tacos: $3.50!');
Result:
| splitByNonAlpha('Crunchy tacos: $3.99, Soft tacos: $3.50!')      |
|------------------------------------------------------------------|
| ['Crunchy','tacos','3','99','Soft','tacos','3','50']             |
In this example, the function splits the taco menu string into an array of words and numbers, removing all non-alphabetic characters. Example with max_substrings:
SELECT
	splitByNonAlpha('Crunchy tacos: $3.99, Soft tacos: $3.50!', 3) AS limited_taco_types;
Result:
| splitByNonAlpha('Crunchy tacos: $3.99, Soft tacos: $3.50!', 3)  |
|-----------------------------------------------------------------|
| ['Crunchy','tacos','3']                                         |
This function is useful for tokenizing text, extracting words from sentences, or parsing structured strings where non-alphabetic characters serve as delimiters.

arrayStringConcat

Concatenates string representations of values listed in the array with an optional separator. Syntax:
arrayStringConcat(arr[, separator])
Arguments:
  • arr (Array): The input array of strings to concatenate.
  • separator (String, optional): The separator to use between array elements. Defaults to an empty string.
Returns: A string containing the concatenated elements of the input array. (String) Example:
SELECT
  arrayStringConcat(['Carne Asada', 'Al Pastor', 'Pollo'], ', ') AS taco_order;
Result:
| taco_order                      |
|---------------------------------|
| Carne Asada, Al Pastor, Pollo   |
In this example, arrayStringConcat joins the array of taco types into a single string, separating each type with a comma and space.
If the separator is omitted, the array elements are concatenated without any characters between them.

alphaTokens

Selects substrings of consecutive bytes from the ranges a-z and A-Z. Returns an array of substrings. Syntax:
alphaTokens(s[, max_substrings])
Alias:
  • splitByAlpha
Arguments:
  • s (String): The string to split.
  • max_substrings (UInt64, optional): Maximum number of substrings to return. Default is 0 (return all substrings).
Returns: An array of selected substrings. [Array(String)] Example:
SELECT
  alphaTokens('Crunchy Taco Supreme 2.99') AS taco_tokens;
Result:
| taco_tokens                     |
|---------------------------------|
| ['Crunchy','Taco','Supreme']    |
In this example, alphaTokens extracts all consecutive alphabetic substrings from the taco menu item, ignoring numbers and spaces.

extractAllGroups

Extracts all groups from non-overlapping substrings matched by a regular expression. Syntax:
extractAllGroups(text, regexp)
Arguments:
  • text (String or FixedString): The input text to search for matches.
  • regexp (String or FixedString): The regular expression to match against. Must be a constant.
Returns:
  • An array of arrays containing the matched groups. [Array(Array(String))]
    • If at least one matching group is found, returns an array of arrays, where each inner array represents a match and contains the captured groups.
    • If no matching groups are found, returns an empty array.
Example:
SELECT
	extractAllGroups('Carnitas Taco=$5.99, Fish Taco="$6.50"', '("[^"]+"|\\w+)=("[^"]+"|\\w+)') AS taco_prices;
Result:
| taco_prices                                         |
|-----------------------------------------------------|
| [['Carnitas Taco','5.99'],['Fish Taco','"$6.50"']]  |
In this example:
  • The function extracts price information for different types of tacos.
  • Each inner array contains two elements: the taco name and its price.
  • The regular expression matches patterns like name=price, where both name and price can be quoted or unquoted.
This function is particularly useful for parsing structured text data, such as log files or semi-structured data formats, where you need to extract multiple pieces of information based on a pattern.

ngrams

Splits a UTF-8 string into n-grams of a specified size. Syntax:
ngrams(string, ngramsize)
Arguments:
  • string (String or FixedString): The input string to split into n-grams.
  • ngramsize (UInt): The size of each n-gram.
Returns: An array of strings, where each string is an n-gram of the specified size. [Array(String)] Example:
SELECT
  ngrams('Taco Tuesday', 3) AS taco_grams;
Result:
| taco_grams                                          |
|-----------------------------------------------------|
| ['Tac','aco',' Tu','Tue','ues','esd','sda','day']   |
In this example, the function splits the phrase “Taco Tuesday” into 3-grams. Each element in the resulting array represents a substring of 3 consecutive characters from the input string.
The function considers UTF-8 characters, so it works correctly with multi-byte characters as well.
This function is useful for text analysis, particularly in natural language processing tasks such as text classification or similarity measurement.

tokens

Splits a string into tokens using non-alphanumeric ASCII characters as separators. Syntax:
tokens(input_string)
Arguments:
  • input_string (String): The string to be tokenized.
Returns: An array of tokens extracted from the input string. [Array(String)] Example:
SELECT
  tokens('spicy_taco, mild_salsa; extra_guacamole') AS taco_ingredients;
Result:
| taco_ingredients                                       |
|--------------------------------------------------------|
| ['spicy','taco','mild','salsa','extra','guacamole']    |
In this example, the tokens function splits the input string into separate words, removing the non-alphanumeric separators (underscore, comma, semicolon, and space).
The function treats any sequence of non-alphanumeric ASCII characters as a separator. Multiple consecutive separators are treated as a single separator.
This function is useful for text processing tasks such as tokenizing ingredient lists, parsing user input, or preparing text for further analysis.