[Institute of Statistics and Data Science] Distance Correlations of Kernel Density Estimators
Abstract/Objectives
Results/Contributions
This paper studies distance correlation computed from kernel density estimates (KDE) and evaluates it through simulations and an application to real data. Distance correlation is a statistical measure of the dependence between two random vectors, which may be multidimensional. The traditional Pearson correlation detects only linear relationships, whereas distance correlation also detects a wide range of nonlinear dependence. Kernel density estimation is a non-parametric curve estimation method that estimates the density at a point of interest from the samples around it. This paper uses a Gaussian kernel to estimate the joint and marginal probability density functions of the variables, and from these obtains estimates of the joint and marginal characteristic functions.
The paper first presents the theory of distance correlation and defines the relevant mathematical formulas. It then reports a series of simulation experiments that generate highly correlated and weakly correlated data and compare three measures: the kernel-density-estimated distance correlation (R_KDE), the sample distance correlation (Rn), and the Pearson correlation (ρ). The simulation results show that the distance-correlation statistics are better than the Pearson correlation at detecting nonlinear dependence.
Finally, the kernel-density-estimated distance correlation is applied to real data: the numbers of male and female deaths in Taiwan at every age from 0 to 110, for each year from 1970 to 2021, a total of 52 years. For each year, R_KDE, Rn, and ρ between the male and female death counts were calculated. Historically, men were predominantly the ones who earned a living, worked in dangerous professions, and served in the military.
It can therefore be expected that men in that period were exposed to considerably more mortality risk factors than women, who spent most of their time at home. As a result, historical male mortality would have fluctuated relative to female mortality more strongly than it does in recent years, so the correlation between male and female mortality is expected to be lower in earlier years than in recent ones. The results from R_KDE and Rn were consistent with each other: both were lower in the earlier years, in line with this expectation.
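To illustrate the contrast between the Pearson correlation and distance correlation described in this abstract, the following is a minimal sketch of the standard sample distance correlation (the double-centered distance-matrix estimator), corresponding in spirit to Rn. It is not the paper's implementation, and it does not cover the kernel-density-based R_KDE, whose formulas are not given here. The function name `distance_correlation`, the variable names, and the quadratic test data are assumptions chosen for illustration only.

```python
import numpy as np


def distance_correlation(x, y):
    """Sample distance correlation of two paired samples (illustrative sketch).

    x and y hold n paired observations each; rows are observations and
    any remaining axis is treated as the variable dimension.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x.reshape(len(x), -1)
    y = y.reshape(len(y), -1)
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of observations")

    # Pairwise Euclidean distance matrices within each sample.
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)

    # Double centering: subtract row and column means, add back the grand mean.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()

    dcov2 = (A * B).mean()    # squared sample distance covariance
    dvar_x = (A * A).mean()   # squared sample distance variance of x
    dvar_y = (B * B).mean()   # squared sample distance variance of y

    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


# Hypothetical example data: y depends on x only through a nonlinear (quadratic) law.
x = np.linspace(-1.0, 1.0, 201)
y_quadratic = x ** 2

pearson_r = float(np.corrcoef(x, y_quadratic)[0, 1])
dcor_r = distance_correlation(x, y_quadratic)

print("Pearson correlation:", pearson_r)
print("Distance correlation:", dcor_r)
```

On a symmetric design like this one, the Pearson correlation is essentially zero even though y is a deterministic function of x, while the distance correlation stays clearly positive. For an exactly linear relationship between two one-dimensional samples, the same function returns 1.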