Remember, correlation does not imply causation.

ClickHouse function reference

contingency

Calculates the contingency coefficient between two columns. The contingency function is similar to the cramersV function but uses a different denominator in the square root calculation. Syntax:
contingency(column1, column2)
Arguments:
  • column1 (any): The first column to compare.
  • column2 (any): The second column to compare.
Returns: A value between 0 and 1, where a larger result indicates a closer association between the two columns. [Float64] Example:
SELECT
	contingency(taco_type, salsa_type) AS taco_salsa_association
FROM
(
    SELECT 'Beef' AS taco_type, 'Mild' AS salsa_type
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Medium' AS salsa_type
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Hot' AS salsa_type
    UNION ALL
    SELECT 'Beef' AS taco_type, 'Medium' AS salsa_type
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Hot' AS salsa_type
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Mild' AS salsa_type
) AS taco_orders;
Result:
| taco_salsa_association |
|------------------------|
| 0.1643989873053573     |
In this example, we calculate the association between taco types and salsa types in taco orders.

cramersV

Calculates the [Cramér’s V statistic](https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V#:~:text=In%20statistics%2C%20Cram%C3%A9r’s%20V%20(sometimes,by%20Harald%20Cram%C3%A9r%20in%201946.), which measures the strength of association between two categorical variables. Syntax:
cramersV(x, y)
Arguments:
  • x (any): The first categorical variable.
  • y (any): The second categorical variable.
Returns: A value between 0 and 1, where:
  • 0 indicates no association
  • 1 indicates perfect association
Return type: Float64 Example:
SELECT
	cramersV(taco_type, salsa_preference) AS association_strength
FROM
(
    SELECT 'Beef' AS taco_type, 'Mild' AS salsa_preference
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Medium' AS salsa_preference
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Hot' AS salsa_preference
    UNION ALL
    SELECT 'Beef' AS taco_type, 'Medium' AS salsa_preference
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Hot' AS salsa_preference
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Mild' AS salsa_preference
) AS taco_orders;
Result:
| association_strength |
|----------------------|
| 0.2886751345948129   |
In this example, we calculate the association strength between taco_type and salsa_preference. The result of 0.28 suggests a moderate association between these two variables in taco orders.
Cramér’s V is particularly useful for comparing the strength of association between pairs of categorical variables, even when they have different numbers of categories.

cramersVBiasCorrected

Calculates the bias-corrected Cramer’s V, a measure of association between two columns in a table. Cramer’s V measures the strength of association between two categorical variables. This function uses a bias correction to provide a more accurate measure, especially for small sample sizes or when variables have many categories. The bias-corrected version typically returns lower values compared to the uncorrected cramersV function, offering a more conservative and often more realistic estimate of the association. Syntax:
cramersVBiasCorrected(column1, column2)
Arguments:
  • column1 (any): The first column to be compared.
  • column2 (any): The second column to be compared.
Returns: A value between 0 and 1, where:
  • 0 indicates no association between the columns’ values
  • 1 indicates complete association
Return type: Float64 Example:
SELECT
	cramersV(taco_type, salsa_type) AS uncorrected,
	cramersVBiasCorrected(taco_type, salsa_type) AS bias_corrected
FROM
(
    SELECT 'Beef' AS taco_type, 'Mild' AS salsa_type
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Medium' AS salsa_type
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Hot' AS salsa_type
    UNION ALL
    SELECT 'Beef' AS taco_type, 'Medium' AS salsa_type
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Hot' AS salsa_type
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Mild' AS salsa_type
) AS taco_orders;
Result:
| uncorrected         | bias_corrected       |
|---------------------|----------------------|
| 0.2886751345948129  | 0                    |
In this example, we compare the association between taco types and salsa types in orders. The bias-corrected version shows no association, providing a more conservative estimate of the relationship between these variables.
The bias-corrected version is generally preferred, especially when dealing with smaller datasets or variables with many categories, as it provides a more accurate representation of the true association.

entropy

Calculates the Shannon entropy of a column of values. Syntax:
entropy(val)
Arguments:
  • val (any type): Column of values.
Returns: The Shannon entropy as a Float64. Example:
SELECT
	entropy(filling) AS filling_entropy,
	entropy(quantity) AS quantity_entropy
FROM (
    SELECT 'beef' AS filling, 5 AS quantity
    UNION ALL
    SELECT 'beef' AS filling, 5 AS quantity
    UNION ALL
    SELECT 'chicken' AS filling, 3 AS quantity
    UNION ALL
    SELECT 'fish' AS filling, 2 AS quantity
    UNION ALL
    SELECT 'veggie' AS filling, 1 AS quantity
) AS taco_orders;
Result:
| filling_entropy    | quantity_entropy   |
|--------------------|--------------------|
| 1.9219280948873623 | 1.9219280948873623 |
In this example:
  • filling_entropy shows the entropy of taco fillings, indicating the diversity of choices.
  • quantity_entropy represents the entropy of order quantities, reflecting the variability in order sizes.
A higher entropy value suggests more diversity or randomness in the data, while a lower value indicates more uniformity or predictability.
The Shannon entropy is a measure of the average amount of information contained in each element of a set. It’s useful for analyzing the distribution and unpredictability of data in various fields, including information theory and data compression.

rankCorr

Calculates the rank correlation coefficient between two columns. Syntax:
rankCorr(x, y)
Arguments:
  • x (Float32 or Float64): The first set of values.
  • y (Float32 or Float64): The second set of values.
Returns: The rank correlation coefficient as a Float64 value ranging from -1 to +1.
  • A value close to +1 indicates a strong positive correlation.
  • A value close to -1 indicates a strong negative correlation.
  • A value close to 0 indicates little to no correlation.
Example:
SELECT
	rankCorr(taco_spiciness, customer_satisfaction) AS spice_satisfaction_correlation
FROM
(
    SELECT 5 AS taco_spiciness, 8 AS customer_satisfaction
    UNION ALL
    SELECT 7 AS taco_spiciness, 9 AS customer_satisfaction
    UNION ALL
    SELECT 6 AS taco_spiciness, 7 AS customer_satisfaction
    UNION ALL
    SELECT 8 AS taco_spiciness, 10 AS customer_satisfaction
) AS taco_reviews;
Result:
| spice_satisfaction_correlation |
|--------------------------------|
| 0.80                           |
In this example, we calculate the rank correlation between taco spiciness and customer satisfaction. The result of 0.8 suggests a strong positive correlation, between taco spiciness and customer satisfaction.
The function requires at least two non-null pairs of observations to compute the correlation. If there are fewer than two pairs, an exception will be thrown.
See also: Spearman’s rank correlation coefficient