Remember, correlation does not imply causation.
ClickHouse function reference
contingency
Calculates the contingency coefficient between two columns. The contingency
function is similar to the cramersV
function but uses a different denominator in the square root calculation.
Syntax:
contingency(column1, column2)
Arguments:
column1
(any): The first column to compare.
column2
(any): The second column to compare.
Returns:
A value between 0 and 1, where a larger result indicates a closer association between the two columns. [Float64
]
Example:
SELECT
contingency(taco_type, salsa_type) AS taco_salsa_association
FROM
(
SELECT 'Beef' AS taco_type, 'Mild' AS salsa_type
UNION ALL
SELECT 'Chicken' AS taco_type, 'Medium' AS salsa_type
UNION ALL
SELECT 'Vegetarian' AS taco_type, 'Hot' AS salsa_type
UNION ALL
SELECT 'Beef' AS taco_type, 'Medium' AS salsa_type
UNION ALL
SELECT 'Chicken' AS taco_type, 'Hot' AS salsa_type
UNION ALL
SELECT 'Vegetarian' AS taco_type, 'Mild' AS salsa_type
) AS taco_orders;
Result:
| taco_salsa_association |
|------------------------|
| 0.1643989873053573 |
In this example, we calculate the association between taco types and salsa types in taco orders.
cramersV
Calculates the [Cramér’s V statistic](https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V#:~:text=In%20statistics%2C%20Cram%C3%A9r’s%20V%20(sometimes,by%20Harald%20Cram%C3%A9r%20in%201946.), which measures the strength of association between two categorical variables.
Syntax:
Arguments:
x
(any): The first categorical variable.
y
(any): The second categorical variable.
Returns:
A value between 0 and 1, where:
- 0 indicates no association
- 1 indicates perfect association
Return type: Float64
Example:
SELECT
cramersV(taco_type, salsa_preference) AS association_strength
FROM
(
SELECT 'Beef' AS taco_type, 'Mild' AS salsa_preference
UNION ALL
SELECT 'Chicken' AS taco_type, 'Medium' AS salsa_preference
UNION ALL
SELECT 'Vegetarian' AS taco_type, 'Hot' AS salsa_preference
UNION ALL
SELECT 'Beef' AS taco_type, 'Medium' AS salsa_preference
UNION ALL
SELECT 'Chicken' AS taco_type, 'Hot' AS salsa_preference
UNION ALL
SELECT 'Vegetarian' AS taco_type, 'Mild' AS salsa_preference
) AS taco_orders;
Result:
| association_strength |
|----------------------|
| 0.2886751345948129 |
In this example, we calculate the association strength between taco_type
and salsa_preference
. The result of 0.28 suggests a moderate association between these two variables in taco orders.
Cramér’s V is particularly useful for comparing the strength of association between pairs of categorical variables, even when they have different numbers of categories.
cramersVBiasCorrected
Calculates the bias-corrected Cramer’s V, a measure of association between two columns in a table.
Cramer’s V measures the strength of association between two categorical variables. This function uses a bias correction to provide a more accurate measure, especially for small sample sizes or when variables have many categories.
The bias-corrected version typically returns lower values compared to the uncorrected cramersV
function, offering a more conservative and often more realistic estimate of the association.
Syntax:
cramersVBiasCorrected(column1, column2)
Arguments:
column1
(any): The first column to be compared.
column2
(any): The second column to be compared.
Returns:
A value between 0 and 1, where:
- 0 indicates no association between the columns’ values
- 1 indicates complete association
Return type: Float64
Example:
SELECT
cramersV(taco_type, salsa_type) AS uncorrected,
cramersVBiasCorrected(taco_type, salsa_type) AS bias_corrected
FROM
(
SELECT 'Beef' AS taco_type, 'Mild' AS salsa_type
UNION ALL
SELECT 'Chicken' AS taco_type, 'Medium' AS salsa_type
UNION ALL
SELECT 'Vegetarian' AS taco_type, 'Hot' AS salsa_type
UNION ALL
SELECT 'Beef' AS taco_type, 'Medium' AS salsa_type
UNION ALL
SELECT 'Chicken' AS taco_type, 'Hot' AS salsa_type
UNION ALL
SELECT 'Vegetarian' AS taco_type, 'Mild' AS salsa_type
) AS taco_orders;
Result:
| uncorrected | bias_corrected |
|---------------------|----------------------|
| 0.2886751345948129 | 0 |
In this example, we compare the association between taco types and salsa types in orders. The bias-corrected version shows no association, providing a more conservative estimate of the relationship between these variables.
The bias-corrected version is generally preferred, especially when dealing with smaller datasets or variables with many categories, as it provides a more accurate representation of the true association.
entropy
Calculates the Shannon entropy of a column of values.
Syntax:
Arguments:
val
(any type): Column of values.
Returns:
The Shannon entropy as a Float64
.
Example:
SELECT
entropy(filling) AS filling_entropy,
entropy(quantity) AS quantity_entropy
FROM (
SELECT 'beef' AS filling, 5 AS quantity
UNION ALL
SELECT 'beef' AS filling, 5 AS quantity
UNION ALL
SELECT 'chicken' AS filling, 3 AS quantity
UNION ALL
SELECT 'fish' AS filling, 2 AS quantity
UNION ALL
SELECT 'veggie' AS filling, 1 AS quantity
) AS taco_orders;
Result:
| filling_entropy | quantity_entropy |
|--------------------|--------------------|
| 1.9219280948873623 | 1.9219280948873623 |
In this example:
filling_entropy
shows the entropy of taco fillings, indicating the diversity of choices.
quantity_entropy
represents the entropy of order quantities, reflecting the variability in order sizes.
A higher entropy value suggests more diversity or randomness in the data, while a lower value indicates more uniformity or predictability.
The Shannon entropy is a measure of the average amount of information contained in each element of a set. It’s useful for analyzing the distribution and unpredictability of data in various fields, including information theory and data compression.
rankCorr
Calculates the rank correlation coefficient between two columns.
Syntax:
Arguments:
x
(Float32
or Float64
): The first set of values.
y
(Float32
or Float64
): The second set of values.
Returns:
The rank correlation coefficient as a Float64
value ranging from -1 to +1.
- A value close to +1 indicates a strong positive correlation.
- A value close to -1 indicates a strong negative correlation.
- A value close to 0 indicates little to no correlation.
Example:
SELECT
rankCorr(taco_spiciness, customer_satisfaction) AS spice_satisfaction_correlation
FROM
(
SELECT 5 AS taco_spiciness, 8 AS customer_satisfaction
UNION ALL
SELECT 7 AS taco_spiciness, 9 AS customer_satisfaction
UNION ALL
SELECT 6 AS taco_spiciness, 7 AS customer_satisfaction
UNION ALL
SELECT 8 AS taco_spiciness, 10 AS customer_satisfaction
) AS taco_reviews;
Result:
| spice_satisfaction_correlation |
|--------------------------------|
| 0.80 |
In this example, we calculate the rank correlation between taco spiciness and customer satisfaction. The result of 0.8 suggests a strong positive correlation, between taco spiciness and customer satisfaction.
The function requires at least two non-null pairs of observations to compute the correlation. If there are fewer than two pairs, an exception will be thrown.
See also: Spearman’s rank correlation coefficient