- Web analytics
- Log processing
- Any task involving URL data
Extract key components of a URL:
- Protocol (e.g., http, https)
- Domain
- Path
- Query parameters
Perform common URL operations:
- Encoding
- Decoding
ClickHouse function reference
protocol
Extracts the protocol from a URL.
Syntax: protocol(url)
Arguments:
- url (String): URL to extract the protocol from.
Returns:
- The protocol, or an empty string if it cannot be determined. (String)
This function is optimized for performance and may not strictly follow RFC 3986. For RFC-compliant parsing, use protocolRFC instead.
domain
Extracts the hostname from a URL.
Syntax: domain(url)
Arguments:
- url (String): URL.
Returns:
- The host name if the input string can be parsed as a URL, otherwise an empty string. (String)
This function is optimized for performance and may not strictly follow RFC 3986. For RFC-compliant parsing, use domainRFC.
domainRFC
Extracts the hostname from a URL, conforming to RFC 3986.
Syntax: domainRFC(url)
Arguments:
- url (String): URL.
Returns:
- The host name if the input string can be parsed as a URL, otherwise an empty string. (String)
This function is similar to domain, but it strictly follows RFC 3986. It is particularly useful for URLs that contain special characters or complex structures.
The domainRFC function can handle URLs with user information and non-standard ports, which the non-RFC variant might struggle with.
domainWithoutWWW
Returns the domain name without the leading www. subdomain, if present.
Syntax: domainWithoutWWW(url)
Arguments:
- url (String): URL.
Returns:
- The domain name without the leading www. subdomain, if present. (String)
If the domain does not start with www., the function returns the domain as is. If the input is not a valid URL or does not contain a domain, an empty string is returned.
domainWithoutWWWRFC
Returns the domain without the leading www., if present. This function conforms to RFC 3986.
Syntax: domainWithoutWWWRFC(url)
Arguments:
- url (String): URL.
Returns:
- The domain name without the leading www., or an empty string if the input cannot be parsed as a URL. (String)
For a URL whose host is www.tacosoft.com, domainWithoutWWWRFC returns ‘tacosoft.com’, removing the www. prefix in line with RFC 3986.
This function is particularly useful when you need to extract the domain from URLs that may contain usernames, passwords, ports, and query parameters while staying compliant with RFC 3986.
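As a rough illustration of what "domain without www." means for a URL that carries user information and a port, here is a minimal Python sketch. It uses the standard library's urlsplit rather than ClickHouse, the helper name domain_without_www and the example.com URL are made up for this example, and edge-case behavior may differ from the ClickHouse functions.

```python
from urllib.parse import urlsplit


def domain_without_www(url: str) -> str:
    """Hypothetical helper: host of `url` with one leading "www." removed."""
    # urlsplit separates user info and port from the host;
    # hostname is None when no host can be found.
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


print(domain_without_www("https://user:secret@www.example.com:8080/menu?item=1"))
# example.com
```

The user name, password, port, path, and query string are all ignored; only the host is inspected.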
topLevelDomain
Extracts the top-level domain from a URL.
Syntax: topLevelDomain(url)
Arguments:
- url (String): URL.
Returns:
- The top-level domain if the input string can be parsed as a URL, otherwise an empty string. (String)
The URL can be specified with or without a protocol; a .com address written without a protocol also returns ‘com’.
If the function cannot parse the input as a URL, or if there is no valid top-level domain, it returns an empty string.
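The "with or without a protocol" behavior can be sketched in Python as follows. This is an approximation built on urlsplit, not ClickHouse's parser; the helper name top_level_domain is hypothetical, and returning an empty string for hosts without a dot is an assumption of this sketch.

```python
from urllib.parse import urlsplit


def top_level_domain(url: str) -> str:
    """Hypothetical helper: last label of the host, or "" if there is none."""
    # urlsplit only recognises a host after "//", so add it when
    # the input has no scheme.
    if "://" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    # Assumption of this sketch: a host without a dot has no TLD.
    if "." not in host:
        return ""
    return host.rsplit(".", 1)[1]


print(top_level_domain("https://www.example.com/menu"))  # com
print(top_level_domain("www.example.com/menu"))          # com
```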
topLevelDomainRFC
Extracts the top-level domain from a URL. This function conforms to RFC 3986.
Syntax: topLevelDomainRFC(url)
Arguments:
- url (String): URL.
Returns:
- The top-level domain name if the input string can be parsed as a URL, otherwise an empty string. (String)
Unlike its non-RFC counterpart, topLevelDomainRFC can correctly handle URLs with special characters in the user info part (the part before the @ symbol), such as %, ;, =, &, and others, as defined in RFC 3986.
firstSignificantSubdomain
Returns the “first significant subdomain” of a URL.
Syntax: firstSignificantSubdomain(url)
Arguments:
- url (String): URL.
Returns:
- The first significant subdomain. (String)
The first significant subdomain is determined as follows:
- If the second-level domain is com, net, org, or co, it is the third-level domain.
- In all other cases, it is the second-level domain.
The list of “insignificant” second-level domains and other implementation details may change in future versions.
This function is optimized for performance and may not strictly follow URL parsing standards. For RFC-compliant parsing, use firstSignificantSubdomainRFC.
firstSignificantSubdomainRFC
Returns the “first significant subdomain” of a URL, conforming to RFC 3986.
Syntax: firstSignificantSubdomainRFC(url)
Arguments:
- url (String): URL.
Returns:
- The first significant subdomain. (String)
The first significant subdomain is determined as follows:
- If the second-level domain is com, net, org, or co, it is the third-level domain.
- For other domains, it is the second-level domain.
This function is similar to firstSignificantSubdomain, but strictly adheres to RFC 3986 for URL parsing.
The list of “insignificant” second-level domains and other implementation details may change in future versions.
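The selection rule for the first significant subdomain can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this reference: the helper name, the INSIGNIFICANT set, and the choice to take a bare host name instead of a full URL are assumptions of the sketch, and the real list of insignificant domains may be different.

```python
# Second-level labels treated as "insignificant" (taken from the text above).
INSIGNIFICANT = {"com", "net", "org", "co"}


def first_significant_subdomain(host: str) -> str:
    """Hypothetical helper operating on a host name, not a full URL."""
    labels = host.split(".")
    if len(labels) < 2:
        return ""
    # If the second-level label is insignificant (as in example.com.tr),
    # step one level further to the third-level label.
    if len(labels) >= 3 and labels[-2] in INSIGNIFICANT:
        return labels[-3]
    return labels[-2]


print(first_significant_subdomain("news.example.com"))     # example
print(first_significant_subdomain("news.example.com.tr"))  # example
```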
cutToFirstSignificantSubdomain
Returns the part of the domain that includes top-level subdomains up to the “first significant subdomain”.
Syntax: cutToFirstSignificantSubdomain(url)
Arguments:
- url (String): URL.
Returns:
- The part of the domain that includes top-level subdomains up to the first significant subdomain if possible, otherwise an empty string. (String)
Example:
- subdomain returns ‘tacosoft.com’, which is the part of the domain up to the first significant subdomain.
- subdomain2 returns ‘tacosoft’, as ‘www’ is not considered significant.
- subdomain3 returns an empty string, as ‘tacosoft’ alone is not a valid domain with a significant subdomain.
cutToFirstSignificantSubdomainRFC
Returns the part of the domain that includes top-level subdomains up to the “first significant subdomain”. Similar to cutToFirstSignificantSubdomain, but conforms to RFC 3986.
Syntax: cutToFirstSignificantSubdomainRFC(url)
Arguments:
- url (String): URL.
Returns:
- The part of the domain that includes top-level subdomains up to the first significant subdomain if possible, otherwise an empty string. (String)
For a URL that contains user information and a port, cutToFirstSignificantSubdomainRFC correctly extracts ‘delicious-tacos.com’, handling both as per RFC 3986. The non-RFC version fails to parse such a URL correctly because of the user information.
This function is particularly useful for complex URLs that may contain user information, non-standard ports, or other elements that require strict adherence to URL standards.
cutToFirstSignificantSubdomainWithWWW
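Returns the part of the domain that includes top-level subdomains up to the “first significant subdomain”, without stripping www.
Syntax: cutToFirstSignificantSubdomainWithWWW(url)
Arguments:
- url (String): URL.
Returns:
- The part of the domain that includes top-level subdomains up to the first significant subdomain (with www) if possible, otherwise an empty string. (String)
Example:
- For domain, the function returns tacosoft.com, preserving the www if it was present.
- For domain2, it returns tacosoft.co, keeping the www.
- For domain3, it returns tacosoft.io, as there is no subdomain to cut.
The function preserves the www subdomain if it exists.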
cutToFirstSignificantSubdomainWithWWWRFC
Returns the part of the domain that includes top-level subdomains up to the “first significant subdomain”, without stripping www. This function conforms to RFC 3986.
Syntax: cutToFirstSignificantSubdomainWithWWWRFC(url)
Arguments:
- url (String): URL.
Returns:
- The part of the domain that includes top-level subdomains up to the first significant subdomain (with www) if possible, otherwise an empty string. (String)
The function adheres to RFC 3986, ensuring proper handling of special characters and edge cases in URLs.
port
Extracts the port number from a URL, or returns a default port if none is specified.
Syntax: port(url[, default_port])
Arguments:
- url (String): URL to extract the port from.
- default_port (UInt16, optional): The port number to return if no port is specified in the URL.
Returns:
- The port number from the URL, or the default port if not specified. (UInt16)
Example:
- port_number extracts the explicitly specified port (8443) from the URL.
- default_port returns the provided default value (80), since no port is specified in the URL.
If the URL cannot be parsed or does not contain a port, and no default port is provided, the function returns 0.
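The fallback logic described above (explicit port, then the supplied default, then 0) can be mirrored in Python. This is a sketch using urlsplit, not the ClickHouse implementation; the helper name port and the example.com URLs are assumptions, and returning the default for an unparsable port is a choice made for this sketch.

```python
from urllib.parse import urlsplit


def port(url: str, default_port: int = 0) -> int:
    """Hypothetical helper: explicit port of `url`, else `default_port`."""
    try:
        explicit = urlsplit(url).port  # None when the URL names no port
    except ValueError:
        # urlsplit raises ValueError for a non-numeric or out-of-range port;
        # this sketch falls back to the default in that case.
        return default_port
    return explicit if explicit is not None else default_port


print(port("https://example.com:8443/orders"))  # 8443
print(port("https://example.com/orders", 80))   # 80
print(port("https://example.com/orders"))       # 0
```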
portRFC
Returns the port number from a URL, or a default port if none is specified. This function conforms to RFC 3986.
Syntax: portRFC(url[, default_port])
Arguments:
- url (String): URL to extract the port from.
- default_port (UInt16, optional): The port number to return if no port is specified in the URL. Default: 0.
Returns:
- The port number from the URL, or the default port if not specified. (UInt16)
Example:
- port_with_url returns 8080, which is explicitly specified in the URL.
- port_with_default returns 443 (the default HTTPS port), since no port is specified in the URL.
This function is RFC 3986 compliant, which means it correctly handles URLs with special characters or unusual formats. For non-RFC-compliant URL parsing, use the port function instead.
path
Extracts the path from a URL, without the query string.
Syntax: path(url)
Arguments:
- url (String): URL.
Returns:
- The path component of the URL without the query string. (String)
Example: path extracts /menu/burritos from the URL, omitting the query string ?size=large&extra=guac and the fragment #nutrition.
If the URL does not contain a path, an empty string is returned.
pathFull
Returns the full path of a URL, including the query string and fragment.
Syntax: pathFull(url)
Arguments:
- url (String): URL.
Returns:
- The full path of the URL, including the query string and fragment. (String)
Example: pathFull extracts the complete path from the URL, including the query string ?size=large and the fragment #spiciness.
If the URL does not contain a path, query string, or fragment, an empty string is returned.
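To make the difference between the path alone and the full path concrete, here is a small Python sketch. It relies on urlsplit and is only an analogue of the ClickHouse functions; the helper names path and path_full and the example.com URL are made up for illustration.

```python
from urllib.parse import urlsplit


def path(url: str) -> str:
    """Hypothetical helper: path only, no query string or fragment."""
    return urlsplit(url).path


def path_full(url: str) -> str:
    """Hypothetical helper: path plus query string and fragment."""
    parts = urlsplit(url)
    full = parts.path
    if parts.query:
        full += "?" + parts.query
    if parts.fragment:
        full += "#" + parts.fragment
    return full


url = "https://example.com/menu/burritos?size=large&extra=guac#nutrition"
print(path(url))       # /menu/burritos
print(path_full(url))  # /menu/burritos?size=large&extra=guac#nutrition
```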
queryString
Extracts the query string from a URL, without the initial question mark and without # and everything after it.
Syntax: queryString(url)
Arguments:
- url (String): URL to extract the query string from.
Returns:
- The query string without the initial question mark and fragment identifier. (String)
Example: queryString extracts ‘items=3&sauce=hot’ from the URL, omitting the initial ‘?’ and everything after ‘#’.
If the URL does not contain a query string, an empty string is returned.
fragment
Extracts the fragment identifier from a URL, without the initial hash symbol.
Syntax: fragment(url)
Arguments:
- url (String): URL to extract the fragment from.
Returns:
- The fragment identifier without the initial hash symbol, or an empty string if there is no fragment. (String)
Example: fragment extracts ‘spicy-tacos’ from the URL, which represents the specific section of the taco menu being referenced.
If the URL does not contain a fragment identifier, an empty string is returned.
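The split between query string and fragment can be seen with Python's urlsplit, used here purely as an analogue. The example.com URL is made up, and nothing below calls ClickHouse.

```python
from urllib.parse import urlsplit

url = "https://example.com/order?items=3&sauce=hot#spicy-tacos"
parts = urlsplit(url)

# The query excludes the leading "?" and stops before "#".
print(parts.query)     # items=3&sauce=hot
# The fragment excludes the leading "#".
print(parts.fragment)  # spicy-tacos

# Both are empty strings when the URL has neither component.
bare = urlsplit("https://example.com/order")
print(repr(bare.query), repr(bare.fragment))  # '' ''
```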
queryStringAndFragment
Returns the query string and fragment identifier from a URL.
Syntax: queryStringAndFragment(url)
Arguments:
- url (String): URL.
Returns:
- The query string and fragment identifier. (String)
Notes:
- This function does not decode URL-encoded characters.
- If only a fragment identifier is present (without a query string), it is still returned.
extractURLParameter
Extracts the value of a specified parameter from a URL.
Syntax: extractURLParameter(url, name)
Arguments:
- url (String): The URL to extract the parameter from.
- name (String): The name of the parameter to extract.
Returns:
- The value of the specified parameter if present in the URL, otherwise an empty string. (String)
Notes:
- If there are multiple parameters with the same name, the function returns the value of the first occurrence.
- The function assumes that the parameter in the URL is encoded in the same way as in the name argument.
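The two notes above (first occurrence wins, and names are compared without decoding) can be sketched in Python. The helper extract_url_parameter and the sample URLs are hypothetical, and the sketch approximates the described behavior rather than reproducing ClickHouse's parser.

```python
from urllib.parse import urlsplit


def extract_url_parameter(url: str, name: str) -> str:
    """Hypothetical helper: raw value of the first parameter called `name`."""
    query = urlsplit(url).query
    for pair in query.split("&"):
        key, sep, value = pair.partition("=")
        # Compare the raw, still-encoded key: no decoding is applied.
        if sep and key == name:
            return value
    return ""


url = "https://example.com/order?size=large&sauce=hot&size=small"
print(extract_url_parameter(url, "size"))    # large
print(extract_url_parameter(url, "cheese"))  # prints an empty line
```

Because nothing is decoded, looking up "Salsa Type" does not match a parameter written as Salsa%20Type in the URL.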
extractURLParameters
Extracts all parameters and their values from a URL query string.
Syntax: extractURLParameters(url)
Arguments:
- url (String): The URL to extract parameters from.
Returns:
- An array of name=value strings corresponding to the URL parameters. (Array(String))
Each element has the form name=value. The values are not decoded.
If the URL does not contain any parameters, an empty array is returned.
extractURLParameterNames
Extracts the names of parameters from a URL.
Syntax: extractURLParameterNames(url)
Arguments:
- url (String): URL to extract parameter names from.
Returns:
- An array of strings containing the names of URL parameters. (Array(String))
Notes:
- If the URL has no parameters, an empty array is returned.
- The function does not handle duplicate parameter names in any special way; all occurrences are included in the result.
- The order of parameter names in the resulting array matches their order in the URL.
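A short Python sketch of the two list-returning behaviors described above: name=value pairs kept undecoded, and names kept in order with duplicates. Both helper names and the sample URL are assumptions for illustration, not ClickHouse code.

```python
from urllib.parse import urlsplit


def extract_url_parameters(url: str) -> list[str]:
    """Hypothetical helper: raw name=value pairs, in URL order."""
    query = urlsplit(url).query
    return [pair for pair in query.split("&") if pair]


def extract_url_parameter_names(url: str) -> list[str]:
    """Hypothetical helper: parameter names, duplicates kept, in URL order."""
    return [pair.partition("=")[0] for pair in extract_url_parameters(url)]


url = "https://example.com/order?size=large&sauce=hot%20salsa&size=small"
print(extract_url_parameters(url))
# ['size=large', 'sauce=hot%20salsa', 'size=small']
print(extract_url_parameter_names(url))
# ['size', 'sauce', 'size']
```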
URLHierarchy
Returns an array containing the URL, truncated at the end by the symbols / and ? in the path and query string. Consecutive separator characters are counted as one. The cut is made in the position after all the consecutive separator characters.
Syntax: URLHierarchy(url)
Arguments:
- url (String): URL.
Returns:
- An array of strings containing the hierarchical parts of the URL. (Array(String))
The function includes the protocol and domain in the result, unlike URLPathHierarchy, which focuses only on the path.
URLPathHierarchy
Returns an array containing the URL path hierarchy, excluding the protocol and host.
Syntax: URLPathHierarchy(url)
Arguments:
- url (String): URL.
Returns:
- An array of strings representing the URL path hierarchy. (Array(String))
The function works as follows:
- It removes the protocol and domain.
- It splits the remaining path at each forward slash (/).
- It builds an array where each element is a progressively longer portion of the path.
The root path (‘/’) is not included in the result array.
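The three steps above can be approximated in Python as shown below. This is a sketch under stated assumptions: the helper name url_path_hierarchy is hypothetical, only the path is considered (not the query string), and whether each prefix keeps its trailing slash exactly as ClickHouse does is not guaranteed.

```python
from urllib.parse import urlsplit


def url_path_hierarchy(url: str) -> list[str]:
    """Hypothetical helper: progressively longer prefixes of the URL path."""
    path = urlsplit(url).path  # step 1: drop protocol and host
    prefixes = []
    for i, ch in enumerate(path):
        # Cut after each "/" except the leading root slash; a run of
        # consecutive slashes yields a single cut after the whole run.
        end_of_run = i + 1 == len(path) or path[i + 1] != "/"
        if ch == "/" and i > 0 and end_of_run:
            prefixes.append(path[: i + 1])
    # The full path is the last element unless it already ended in "/".
    if path and not path.endswith("/"):
        prefixes.append(path)
    return prefixes


print(url_path_hierarchy("https://example.com/browse/CONV-6788"))
# ['/browse/', '/browse/CONV-6788']
```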
encodeURLComponent
Encodes a URL component by replacing certain characters with their percent-encoded equivalents.
Syntax: encodeURLComponent(url)
Arguments:
- url (String): The URL component to encode.
Returns:
- The encoded URL component. (String)
This function is useful when you need to include special or non-ASCII characters in a URL while keeping the URL valid and properly formatted. It is particularly helpful when constructing URLs with query parameters that may contain special characters.
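Percent-encoding of a single component can be illustrated with Python's urllib.parse.quote. This is an analogue only: the exact set of characters that ClickHouse leaves unescaped may differ from Python's, and the sample string is made up.

```python
from urllib.parse import quote, unquote

raw = "taco & salsa/verde"

# safe="" makes quote() escape "/" too, as needed for a single component.
encoded = quote(raw, safe="")
print(encoded)  # taco%20%26%20salsa%2Fverde

# Decoding restores the original string.
print(unquote(encoded) == raw)  # True
```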
decodeURLComponent
Decodes a URL-encoded string.
Syntax: decodeURLComponent(url)
Arguments:
- url (String): The URL-encoded string to decode.
Returns:
- The decoded string. (String)
Example: %20 is decoded to a space, %3A to a colon, and %2F to a forward slash.
The decodeURLComponent function is the inverse of encodeURLComponent. It is particularly useful when working with URLs or query parameters that may contain special characters or spaces.
encodeURLFormComponent
Encodes a URL component following RFC 1866, where spaces are encoded as plus signs (+).
Syntax: encodeURLFormComponent(url)
Arguments:
- url (String): URL component to encode.
Returns:
- The encoded URL component. (String)
This function differs from encodeURLComponent in that it encodes spaces as plus signs, which is the expected behavior for URL-encoded form data.
decodeURLFormComponent
Decodes a URL-encoded form component string.
Syntax: decodeURLFormComponent(encoded_string)
Arguments:
- encoded_string (String): A URL-encoded string.
Returns:
- The decoded string. (String)
The function:
- Converts + (plus) to a space character.
- Decodes percent-encoded sequences (e.g., %20 to space, %2B to +).
For example, given ‘Spicy+Taco%21’, decodeURLFormComponent converts the + to a space and decodes %21 to an exclamation mark, resulting in “Spicy Taco!”.
This function is particularly useful when working with form data submitted via HTTP POST requests or when processing URL query parameters.
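The difference between form encoding and plain component encoding comes down to how a space is written. The Python standard library has the same pair of conventions, which makes for a compact illustration; this is an analogue and not ClickHouse code.

```python
from urllib.parse import quote, quote_plus, unquote_plus

text = "Spicy Taco!"

# Component style: space becomes %20.
print(quote(text, safe=""))  # Spicy%20Taco%21
# Form style: space becomes "+".
print(quote_plus(text))      # Spicy+Taco%21

# Form decoding turns "+" into a space and resolves percent escapes,
# so a literal plus sign must arrive as %2B.
print(unquote_plus("Spicy+Taco%21"))  # Spicy Taco!
print(unquote_plus("1%2B1"))          # 1+1
```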
netloc
Extracts the network locality (username:password@host:port) from a URL.
Syntax: netloc(url)
Arguments:
- url (String): URL.
Returns:
- The network locality part of the URL (username:password@host:port). (String)
Example: netloc extracts the network locality part from a URL for a taco restaurant’s online ordering system, including the username, password, host, and port.
If the URL does not contain any network locality information, an empty string is returned.
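Python's urlsplit exposes the same username:password@host:port portion under the attribute netloc, so it can serve as a quick illustration. The URL below is invented, and this does not call ClickHouse.

```python
from urllib.parse import urlsplit

url = "https://chef:secret@orders.example.com:8443/checkout?id=7"

# Everything between "//" and the start of the path.
print(urlsplit(url).netloc)  # chef:secret@orders.example.com:8443

# A relative reference has no network location at all.
print(repr(urlsplit("/checkout?id=7").netloc))  # ''
```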
cutWWW
Removes the leading www. from a URL’s domain, if present.
Syntax: cutWWW(url)
Arguments:
- url (String): The URL to process.
Returns:
- The URL with the leading www. removed from the domain, if present. Otherwise, the original URL. (String)
Example: cutWWW removes the www. from the domain of the Taco Bell website URL.
This function only removes the www. prefix if it appears at the beginning of the domain. It does not affect other parts of the URL or remove www. if it appears elsewhere in the URL.
cutQueryString
Removes the query string from a URL, including the question mark.
Syntax: cutQueryString(url)
Arguments:
- url (String): The URL to process.
Returns:
- The URL with the query string removed. (String)
Example: cutQueryString removes the query string ?category=burritos&spicy=true from the URL, leaving only the base URL.
If the URL does not contain a query string, the function returns the original URL unchanged.
cutFragment
Removes the fragment identifier from a URL, including the hash symbol (#).
Syntax: cutFragment(url)
Arguments:
- url (String): The URL to process.
Returns:
- The URL with the fragment identifier removed. (String)
If the URL does not contain a fragment identifier, the function returns the original URL unchanged.
cutQueryStringAndFragment
Removes the query string and fragment identifier from a URL, including the question mark and number sign.
Syntax: cutQueryStringAndFragment(url)
Arguments:
- url (String): The URL to process.
Returns:
- The URL with the query string and fragment removed. (String)
If the URL does not contain a query string or fragment, it remains unchanged.
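Cutting the fragment, or the query string together with the fragment, amounts to truncating the URL at the first # or at the first of ? and #. The Python sketch below shows that idea with plain string operations. The helper names and URL are hypothetical, and this is not how ClickHouse necessarily implements it.

```python
def cut_fragment(url: str) -> str:
    """Hypothetical helper: drop "#" and everything after it."""
    return url.split("#", 1)[0]


def cut_query_string_and_fragment(url: str) -> str:
    """Hypothetical helper: truncate at the first "?" or "#"."""
    cut = len(url)
    for sep in "?#":
        pos = url.find(sep)
        if pos != -1:
            cut = min(cut, pos)
    return url[:cut]


url = "https://example.com/menu?category=burritos&spicy=true#top"
print(cut_fragment(url))                   # https://example.com/menu?category=burritos&spicy=true
print(cut_query_string_and_fragment(url))  # https://example.com/menu
```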
cutURLParameter
Removes one or more specified parameters from a URL.
Syntax: cutURLParameter(url, name)
Arguments:
- url (String): The URL to modify.
- name (String or Array(String)): The name of the parameter to remove, or an array of names.
Returns:
- The modified URL with the specified parameter(s) removed. (String)
Example:
- url_without_toppings removes the ‘toppings’ parameter from the URL.
- url_without_size_and_sauce removes both the ‘size’ and ‘sauce’ parameters from the URL.
This function does not encode or decode characters in parameter names. For example, ‘Salsa Type’ and ‘Salsa%20Type’ are treated as different parameter names.
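Removing parameters by name, with either a single name or a list of names and no decoding, might look like the following in Python. The helper cut_url_parameter and the sample URL are invented for illustration, and details such as what happens to a trailing ? may differ in ClickHouse.

```python
def cut_url_parameter(url: str, names) -> str:
    """Hypothetical helper: drop every parameter whose raw name is in `names`."""
    if isinstance(names, str):
        names = [names]
    base, qmark, rest = url.partition("?")
    if not qmark:
        return url  # no query string, nothing to remove
    query, hash_sign, frag = rest.partition("#")
    # Names are compared as written in the URL, without decoding.
    kept = [p for p in query.split("&") if p.partition("=")[0] not in names]
    result = base + ("?" + "&".join(kept) if kept else "")
    return result + hash_sign + frag


url = "https://example.com/order?size=large&toppings=cheese&sauce=hot"
print(cut_url_parameter(url, "toppings"))
# https://example.com/order?size=large&sauce=hot
print(cut_url_parameter(url, ["size", "sauce"]))
# https://example.com/order?toppings=cheese
```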