[Institute of Statistics and Data Science] Distance Correlations of Kernel Density Estimators

Distribution plot of the correlation coefficients between male and female death counts (data points colored).

Abstract/Objectives

This thesis explores the distance correlation coefficient computed from kernel density estimators (KDE), through both simulations and real-data applications. The distance correlation coefficient is a statistical measure of the association between two sets of multidimensional random variables. Unlike the traditional Pearson correlation coefficient, which can only detect linear relationships, the distance correlation coefficient can identify a broader range of nonlinear associations. KDE is a non-parametric method that estimates the density at a point from the surrounding data, giving closer observations more weight. This study uses Gaussian kernel functions to estimate the joint and marginal probability density functions of the variables, and from these obtains the estimated joint and marginal characteristic functions. The thesis first reviews the theory of the distance correlation coefficient and defines the relevant mathematical notation. A series of simulation experiments then generates data with high and low correlations and compares the KDE-based distance correlation coefficient, the sample distance correlation coefficient, and the Pearson correlation coefficient. The simulation results show that the distance correlation coefficient is superior in detecting nonlinear correlations. Finally, the KDE-based distance correlation coefficient is applied to mortality data to verify its applicability across different datasets. (I wrote the Chinese abstract myself and translated it into English using ChatGPT. Both my professor and I have revised the translation.)
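The step from a Gaussian KDE to an estimated characteristic function can be illustrated in one dimension. The sketch below is not the thesis's implementation: it assumes a univariate sample, a hand-picked bandwidth `h`, and hypothetical helper names (`gaussian_kde`, `kde_characteristic_function`). It relies on the fact that a Gaussian KDE is an equal-weight mixture of N(x_i, h^2) densities, so its characteristic function is the empirical characteristic function multiplied by exp(-h^2 t^2 / 2), and it checks this against a direct numerical integral.

```python
import cmath
import math


def gaussian_kde(sample, h):
    """Return the Gaussian kernel density estimate as a function of x."""
    n = len(sample)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)

    return density


def kde_characteristic_function(sample, h):
    """Characteristic function of the Gaussian KDE.

    The KDE is a mixture of N(x_i, h^2) densities, so its characteristic
    function is the empirical one damped by exp(-h^2 t^2 / 2).
    """
    n = len(sample)

    def phi(t):
        empirical = sum(cmath.exp(1j * t * xi) for xi in sample) / n
        return math.exp(-0.5 * (h * t) ** 2) * empirical

    return phi


# Hypothetical toy sample and bandwidth, chosen only for illustration.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
h = 0.5
density = gaussian_kde(sample, h)
phi = kde_characteristic_function(sample, h)

# Numerical check: integrate exp(i t x) * density(x) over a grid (Riemann sum).
t = 0.7
step = 0.01
grid = [-10.0 + step * k for k in range(2001)]
numeric = sum(cmath.exp(1j * t * x) * density(x) for x in grid) * step
print(abs(numeric - phi(t)))  # discrepancy between the two computations
```

The closed form avoids numerical integration altogether. The joint case works the same way with a product Gaussian kernel, although the bandwidth choice there is a separate question that this sketch does not address.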

Results/Contributions

This thesis explores distance correlation under the kernel density estimation (KDE) method, with simulations and validation on real data. Distance correlation is a statistical measure of the dependence between two sets of multidimensional random variables. The traditional Pearson correlation can only detect linear relationships, whereas distance correlation can detect a wider range of nonlinear associations. KDE is a non-parametric curve estimation method that estimates the density at a point of interest from the samples around it. The thesis uses a Gaussian kernel to estimate the joint and marginal probability density functions of the variables, and from these obtains the estimated joint and marginal characteristic functions. It first discusses the theory of distance correlation and defines the relevant mathematical formulas. A series of simulation experiments follows, in which highly correlated and weakly correlated data are generated and the KDE-based distance correlation (R_KDE), the sample distance correlation (Rn), and the Pearson correlation (ρ) are calculated for comparison. The simulation results show that distance correlation is superior at detecting nonlinear correlations.

Finally, the KDE-based distance correlation is applied to real data: the numbers of male and female deaths in Taiwan at every age from 0 to 110, for each year from 1970 to 2021, a total of 52 years. R_KDE, Rn, and ρ between male and female death counts were calculated for each year. Historically, men were predominantly the ones who earned a living, worked in dangerous professions, and served in the military. The factors contributing to male mortality in that period can therefore be expected to have been considerably greater than those for women, who spent most of their time at home. As a result, male mortality would have fluctuated more relative to female mortality in the past than in recent years, and the correlation between male and female mortality is expected to be lower in earlier years than in modern ones. The results from R_KDE and Rn were consistent with each other: both took lower values in the earlier years, in line with this expectation.
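The contrast between Rn and ρ on nonlinear data can be reproduced with a small experiment. The sketch below is an illustration under stated assumptions, not the thesis's simulation design: it uses the standard double-centered distance matrix form of the sample distance correlation for univariate samples, a hypothetical data-generating model (y = x^2 with x uniform on [-1, 1]), and plain Python for self-containment. R_KDE is not computed here.

```python
import math
import random


def _centered_distances(values):
    """Double-centered matrix of pairwise absolute distances."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation Rn for two univariate samples."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x = sum(v * v for row in a for v in row) / n**2
    dvar_y = sum(v * v for row in b for v in row) / n**2
    denom = math.sqrt(dvar_x * dvar_y)
    return math.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical nonlinear example: y depends on x only through x squared.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(200)]
y = [u * u for u in x]
r_pearson = pearson(x, y)
r_dcor = distance_correlation(x, y)
print(f"Pearson: {r_pearson:.3f}, distance correlation: {r_dcor:.3f}")
```

Because the relationship is symmetric about zero, the Pearson coefficient has little linear trend to pick up, while the distance correlation remains clearly positive. For an exactly linear relationship the function returns 1, up to floating-point error.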

Keywords

distance covariance; kernel density estimator; correlation; nonlinear correlation; distance correlation; distance variance