# Expanding the boundaries, or on testing the hypothesis of multivariate normality

A short story about the MVN package

A moment of theory

Suppose we have a joint distribution of p variables and need to check whether it is normal. One small fact gets in the way: if the multivariate distribution is normal, then each variable taken separately is normal too, but the converse holds only under extra conditions such as independence of the components, which is almost never satisfied in practice. So dedicated tests are needed.

The procedure for testing the hypothesis of multivariate normality is the same as in the univariate case; only the tests themselves are different.
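A small simulation can make the gap between marginal and joint normality concrete. The sketch below is only an illustration and is not part of the original analysis: the names `x`, `s` and `y` are made up for the example, and flipping the sign of a normal variable at random is just one classic counterexample among many.

```r
# Marginally normal, but not jointly normal: a base-R illustration.
set.seed(42)
n <- 1000
x <- rnorm(n)                              # x ~ N(0, 1)
s <- sample(c(-1, 1), n, replace = TRUE)   # random sign, independent of x
y <- s * x                                 # y ~ N(0, 1) as well, by symmetry

# Each margin is N(0, 1), so a univariate test will usually not reject.
shapiro.test(x)$p.value
shapiro.test(y)$p.value

# If (x, y) were bivariate normal, x + y would be normal too.
# Here x + y is exactly 0 whenever the sign was flipped.
share_zero <- mean(x + y == 0)
share_zero
```

About half of the sums are exactly zero, so the pair cannot be bivariate normal even though both margins are.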

The Mardia test (original work: K. V. Mardia. Measures of multivariate skewness and kurtosis with applications. Biometrika, 57 (3): 519-530, 1970) is based on calculating the skewness and kurtosis of the multivariate distribution by the formulas
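For reference, the two measures are commonly written as below. The notation is an assumption made for this sketch: $x_i$ are the $n$ observations of dimension $p$, $\bar{x}$ is the sample mean, and $S$ is the sample covariance matrix (taken here with divisor $n$).

$$
b_{1,p} = \frac{1}{n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n} m_{ij}^{3},
\qquad
b_{2,p} = \frac{1}{n} \sum_{i=1}^{n} m_{ii}^{2},
\qquad
m_{ij} = (x_i - \bar{x})^{\top} S^{-1} (x_j - \bar{x}).
$$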

Here, m is the Mahalanobis distance between the i-th and j-th observations.

Under the null hypothesis of multivariate normality, the calculated skewness value multiplied by n / 6 follows a chi-square distribution with p (p + 1) (p + 2) / 6 degrees of freedom, and the kurtosis value is approximately normally distributed with mean p (p + 2) and variance 8p (p + 2) / n.
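The calculation described above can be sketched in a few lines of base R. This is an illustrative reimplementation under stated assumptions, not the code of the MVN package: the function name `mardia_by_hand` is hypothetical, the covariance matrix is estimated with divisor n, and the kurtosis p-value is taken as two-sided.

```r
# Mardia's skewness and kurtosis computed by hand (base R only).
mardia_by_hand <- function(X) {
  X  <- as.matrix(X)
  n  <- nrow(X)
  p  <- ncol(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)  # centered observations
  S  <- crossprod(Xc) / n                       # covariance with divisor n (assumption)
  M  <- Xc %*% solve(S) %*% t(Xc)               # matrix of m_ij values

  b1 <- sum(M^3) / n^2                          # multivariate skewness
  b2 <- sum(diag(M)^2) / n                      # multivariate kurtosis

  skew_stat <- n * b1 / 6                       # ~ chi-square under H0
  skew_df   <- p * (p + 1) * (p + 2) / 6
  kurt_z    <- (b2 - p * (p + 2)) / sqrt(8 * p * (p + 2) / n)  # ~ N(0, 1) under H0

  list(b1 = b1, b2 = b2,
       p_skew = pchisq(skew_stat, df = skew_df, lower.tail = FALSE),
       p_kurt = 2 * pnorm(abs(kurt_z), lower.tail = FALSE))
}

# Usage on simulated bivariate normal data
set.seed(1)
res <- mardia_by_hand(matrix(rnorm(400), ncol = 2))
res
```

For data that really are multivariate normal, `b2` should land close to p (p + 2), which is 8 for two variables.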

The Henze-Zirkler test (base work: N. Henze and B. Zirkler. A class of invariant consistent tests for multivariate normality. Communications in Statistics – Theory and Methods, 19 (10): 3595–3617, 1990) is based on the following formula for the test statistic:
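The statistic is usually given in the form below. The notation is assumed for this sketch: $n$ observations $x_i$ of dimension $p$, sample mean $\bar{x}$, sample covariance matrix $S$, and a smoothing parameter $\beta$ chosen as shown.

$$
HZ = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} e^{-\frac{\beta^{2}}{2} D_{ij}}
 - 2 \left(1 + \beta^{2}\right)^{-p/2} \sum_{i=1}^{n} e^{-\frac{\beta^{2}}{2 (1 + \beta^{2})} D_{i}}
 + n \left(1 + 2 \beta^{2}\right)^{-p/2}
$$

$$
\beta = \frac{1}{\sqrt{2}} \left( \frac{n (2p + 1)}{4} \right)^{1/(p+4)},
\qquad
D_{ij} = (x_i - x_j)^{\top} S^{-1} (x_i - x_j),
\qquad
D_{i} = (x_i - \bar{x})^{\top} S^{-1} (x_i - \bar{x}).
$$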

The values of the criterion are distributed according to the lognormal law with the parameters

Royston’s test is based on the idea of the Shapiro-Wilk test. The value of the statistical criterion is calculated by the formula

Its value is distributed according to the Chi-square law with the number of degrees of freedom equal to e. The chain of calculations is as follows:

The Doornik-Hansen test (original work: Doornik, J. A., and H. Hansen. 2008. An omnibus test for univariate and multivariate normality. Oxford Bulletin of Economics and Statistics 70: 927–939) is based on transforming the multivariate observations and then calculating the skewness and kurtosis of each resulting univariate variable.

The transformation is carried out according to the formula

Next, the kurtosis and skewness are calculated for each variable in the new matrix.

The skewness (b1) and kurtosis (b2) values are not normally distributed, so they are transformed as follows:

The resulting values z1 and z2 are collected into vectors Z1 and Z2, and the test statistic Z1'Z1 + Z2'Z2 follows a chi-square distribution with 2k degrees of freedom, where k is the number of variables.

The E-statistic (energy) test (Székely-Rizzo test, base work: G. J. Székely, M. L. Rizzo. A new test for multivariate normality. Journal of Multivariate Analysis 93 (2005) 58–80) calculates the test statistic with the help of a Taylor series expansion:
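In the usual presentation the statistic has the form below. The notation is an assumption of this sketch: $y_1, \dots, y_n$ are the standardized observations of dimension $p$, and $Z$, $Z'$ are independent standard $p$-variate normal vectors.

$$
\mathcal{E} = n \left( \frac{2}{n} \sum_{i=1}^{n} E \lVert y_i - Z \rVert
 - E \lVert Z - Z' \rVert
 - \frac{1}{n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n} \lVert y_i - y_j \rVert \right),
\qquad
E \lVert Z - Z' \rVert = \frac{2\, \Gamma\!\left(\frac{p+1}{2}\right)}{\Gamma\!\left(\frac{p}{2}\right)}.
$$

The term $E \lVert y_i - Z \rVert$ has no simple closed form for general $p$, which is where the series expansion is needed.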

y is the centered observation matrix obtained by column-wise transformations as

Methodology

As an example, let’s take the “Crime” dataset from the plm package and pick three variables from it:

prbpris – probability of imprisonment

avgsen – average term of imprisonment, days

pctymle – share of males aged 15-24 in the population

From these three variables we will build two datasets, one with two variables and one with three:

```r
library(MVN)
library(tidyverse)
library(plm)

data("Crime")
glimpse(Crime)

# Density of the average sentence length
ggplot(Crime, aes(x = avgsen)) + geom_density()
# Rough visual odds of normality: prbpris - definitely, avgsen - 70/30, pctymle - 50/50

# Columns 6, 7 and 24 of Crime, selected by name
Data_1 <- Crime[, c("prbpris", "avgsen")]
Data_2 <- Crime[, c("prbpris", "avgsen", "pctymle")]
```

Calculations and description

The basic calculation function is the mvn function with the following parameters:

data – the dataset (a matrix or data frame)

subset – grouping (factor) variable

mvnTest – the multivariate normality test to use

desc – Boolean variable. If it is true, descriptive statistics are output.

univariateTest – Defines a statistical test that checks the normality of individual variables

univariatePlot – Determines the kind of univariate normality plot to display

multivariatePlot – Determines the kind of multivariate plot to display (Q-Q, perspective or contour)

multivariateOutlierMethod – Selects the method for determining outliers
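A call that combines several of these parameters might look like the sketch below. It assumes the argument names used in this article and the option values `"hz"` and `"SW"`; the result component names (`multivariateNormality`, `univariateNormality`, `Descriptives`) are likewise assumed and may differ between releases of the package, so checking `names(result)` first is a safe habit.

```r
library(MVN)
library(plm)

data("Crime")
Data_1 <- Crime[, c("prbpris", "avgsen")]

# Henze-Zirkler test for the pair of variables, Shapiro-Wilk test for
# each variable separately, descriptive statistics switched on.
result <- mvn(data = Data_1,
              mvnTest = "hz",
              univariateTest = "SW",
              desc = TRUE)

names(result)                   # which components were returned
result$multivariateNormality    # multivariate test result (assumed name)
result$univariateNormality      # per-variable test results (assumed name)
result$Descriptives             # descriptive statistics (assumed name)
```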

Let’s check our data for normality using the classic Mardia test

The result can be inspected visually on a Q-Q plot

``mvn(data = Data_1, mvnTest = "mardia", multivariatePlot = "qq")``

And also on a perspective plot of the two-dimensional distribution

``mvn(data = Data_1, mvnTest = "mardia", multivariatePlot = "persp")``

And on a 2D contour plot

``mvn(data = Data_1, mvnTest = "energy", multivariatePlot = "contour")``

You can also display, for example, a QQ graph for each variable separately.

``mvn(data = Data_1, mvnTest = "mardia", univariatePlot = "qqplot")``

Of particular interest is the capability provided by the subset parameter. If the data contain a grouping variable, multivariate and univariate normality can be checked separately for each of its values.
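A grouped check could be sketched as follows. The choice of `region` as the grouping column is an assumption made for illustration (it is a categorical column of the Crime data), as is passing its name as a string to `subset`.

```r
library(MVN)
library(plm)

data("Crime")

# Two numeric variables plus the grouping column (illustrative choice)
grouped <- Crime[, c("prbpris", "avgsen", "region")]

# Mardia's test is run separately within each value of region
res_by_region <- mvn(data = grouped,
                     subset = "region",
                     mvnTest = "mardia")

res_by_region$multivariateNormality   # one result per group (assumed name)
```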

These are the basics of the MVN package functionality. All materials are available on https://github.com/acheremuhin/Multivariate_normal