Correlation and information theory

Remember, correlation does not imply causation.

ClickHouse function reference

contingency

Calculates the contingency coefficient between two columns. The contingency function is similar to the cramersV function but uses a different denominator in the square root calculation. Syntax:

contingency(column1, column2)

Arguments:

column1 (any): The first column to compare.
column2 (any): The second column to compare.

Returns: A value between 0 and 1, where a larger result indicates a closer association between the two columns. [Float64] Example:

SELECT
	contingency(taco_type, salsa_type) AS taco_salsa_association
FROM
(
    SELECT 'Beef' AS taco_type, 'Mild' AS salsa_type
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Medium' AS salsa_type
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Hot' AS salsa_type
    UNION ALL
    SELECT 'Beef' AS taco_type, 'Medium' AS salsa_type
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Hot' AS salsa_type
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Mild' AS salsa_type
) AS taco_orders;

Result:

| taco_salsa_association |
|------------------------|
| 0.1643989873053573     |

In this example, we calculate the association between taco types and salsa types in taco orders.

cramersV

Calculates the [Cramér’s V statistic](https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V#:~:text=In%20statistics%2C%20Cram%C3%A9r’s%20V%20(sometimes,by%20Harald%20Cram%C3%A9r%20in%201946.), which measures the strength of association between two categorical variables. Syntax:

cramersV(x, y)

Arguments:

x (any): The first categorical variable.
y (any): The second categorical variable.

Returns: A value between 0 and 1, where:

0 indicates no association
1 indicates perfect association

Return type: Float64 Example:

SELECT
	cramersV(taco_type, salsa_preference) AS association_strength
FROM
(
    SELECT 'Beef' AS taco_type, 'Mild' AS salsa_preference
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Medium' AS salsa_preference
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Hot' AS salsa_preference
    UNION ALL
    SELECT 'Beef' AS taco_type, 'Medium' AS salsa_preference
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Hot' AS salsa_preference
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Mild' AS salsa_preference
) AS taco_orders;

Result:

| association_strength |
|----------------------|
| 0.2886751345948129   |

In this example, we calculate the association strength between taco_type and salsa_preference. The result of 0.28 suggests a moderate association between these two variables in taco orders.

Cramér’s V is particularly useful for comparing the strength of association between pairs of categorical variables, even when they have different numbers of categories.

cramersVBiasCorrected

Calculates the bias-corrected Cramer’s V, a measure of association between two columns in a table. Cramer’s V measures the strength of association between two categorical variables. This function uses a bias correction to provide a more accurate measure, especially for small sample sizes or when variables have many categories. The bias-corrected version typically returns lower values compared to the uncorrected cramersV function, offering a more conservative and often more realistic estimate of the association. Syntax:

cramersVBiasCorrected(column1, column2)

Arguments:

column1 (any): The first column to be compared.
column2 (any): The second column to be compared.

Returns: A value between 0 and 1, where:

0 indicates no association between the columns’ values
1 indicates complete association

Return type: Float64 Example:

SELECT
	cramersV(taco_type, salsa_type) AS uncorrected,
	cramersVBiasCorrected(taco_type, salsa_type) AS bias_corrected
FROM
(
    SELECT 'Beef' AS taco_type, 'Mild' AS salsa_type
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Medium' AS salsa_type
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Hot' AS salsa_type
    UNION ALL
    SELECT 'Beef' AS taco_type, 'Medium' AS salsa_type
    UNION ALL
    SELECT 'Chicken' AS taco_type, 'Hot' AS salsa_type
    UNION ALL
    SELECT 'Vegetarian' AS taco_type, 'Mild' AS salsa_type
) AS taco_orders;

Result:

| uncorrected         | bias_corrected       |
|---------------------|----------------------|
| 0.2886751345948129  | 0                    |

In this example, we compare the association between taco types and salsa types in orders. The bias-corrected version shows no association, providing a more conservative estimate of the relationship between these variables.

The bias-corrected version is generally preferred, especially when dealing with smaller datasets or variables with many categories, as it provides a more accurate representation of the true association.

entropy

Calculates the Shannon entropy of a column of values. Syntax:

entropy(val)

Arguments:

val (any type): Column of values.

Returns: The Shannon entropy as a Float64. Example:

SELECT
	entropy(filling) AS filling_entropy,
	entropy(quantity) AS quantity_entropy
FROM (
    SELECT 'beef' AS filling, 5 AS quantity
    UNION ALL
    SELECT 'beef' AS filling, 5 AS quantity
    UNION ALL
    SELECT 'chicken' AS filling, 3 AS quantity
    UNION ALL
    SELECT 'fish' AS filling, 2 AS quantity
    UNION ALL
    SELECT 'veggie' AS filling, 1 AS quantity
) AS taco_orders;

Result:

| filling_entropy    | quantity_entropy   |
|--------------------|--------------------|
| 1.9219280948873623 | 1.9219280948873623 |

In this example:

filling_entropy shows the entropy of taco fillings, indicating the diversity of choices.
quantity_entropy represents the entropy of order quantities, reflecting the variability in order sizes.

A higher entropy value suggests more diversity or randomness in the data, while a lower value indicates more uniformity or predictability.

The Shannon entropy is a measure of the average amount of information contained in each element of a set. It’s useful for analyzing the distribution and unpredictability of data in various fields, including information theory and data compression.

rankCorr

Calculates the rank correlation coefficient between two columns. Syntax:

rankCorr(x, y)

Arguments:

x (Float32 or Float64): The first set of values.
y (Float32 or Float64): The second set of values.

Returns: The rank correlation coefficient as a Float64 value ranging from -1 to +1.

A value close to +1 indicates a strong positive correlation.
A value close to -1 indicates a strong negative correlation.
A value close to 0 indicates little to no correlation.

Example:

SELECT
	rankCorr(taco_spiciness, customer_satisfaction) AS spice_satisfaction_correlation
FROM
(
    SELECT 5 AS taco_spiciness, 8 AS customer_satisfaction
    UNION ALL
    SELECT 7 AS taco_spiciness, 9 AS customer_satisfaction
    UNION ALL
    SELECT 6 AS taco_spiciness, 7 AS customer_satisfaction
    UNION ALL
    SELECT 8 AS taco_spiciness, 10 AS customer_satisfaction
) AS taco_reviews;

Result:

| spice_satisfaction_correlation |
|--------------------------------|
| 0.80                           |

In this example, we calculate the rank correlation between taco spiciness and customer satisfaction. The result of 0.8 suggests a strong positive correlation, between taco spiciness and customer satisfaction.

The function requires at least two non-null pairs of observations to compute the correlation. If there are fewer than two pairs, an exception will be thrown.

Introduction

Statements

Correlation and information theory

ClickHouse function reference

contingency

cramersV

cramersVBiasCorrected

entropy

rankCorr

Introduction

Statements

​ClickHouse function reference

​contingency

​cramersV

​cramersVBiasCorrected

​entropy

​rankCorr

ClickHouse function reference

contingency

cramersV

cramersVBiasCorrected

entropy

rankCorr