[統計與數據科學研究所]解決測量誤差問題之新的局部線性估計方法

[Institute of Statistics and Data Science] A New Local Linear Estimator for the Errors-in-Variables Problem

Figure: Estimation results of the four methods on the 2021 data, for h = 2 and b = 3.5, 4.

Abstract/Objectives

This thesis studies estimation when the explanatory variable X in (X, Y) data is subject to measurement error. When the true values of X are unobservable and only error-contaminated data are available, standard statistical estimation methods must be adjusted. We use local linear regression for estimation, incorporating the asymptotic projection matrix and making some adjustments to the kernel function. The thesis mainly derives the theoretical properties of the proposed estimator. We find that its bias is of order $h^4$, which is smaller than the usual order $h^2$. The simulations show the same pattern: in some scenarios, the mean squared errors (MSE) of the proposed estimator are generally smaller than those of existing methods. In the real data analysis, the proposed estimator also performs well, and its estimated regression curve fits the data more closely.
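For orientation, the classical local linear estimator that serves as the starting point can be sketched as below. This is not the proposed estimator: it leaves out the asymptotic projection matrix and the kernel adjustment. The Gaussian kernel, the regression function sin(2x), the sample size, the bandwidth, and the error variance are all illustrative assumptions, and `local_linear` is a hypothetical helper name.

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Classical local linear estimate of m(x0); Gaussian kernel, bandwidth h (assumed)."""
    d = x - x0
    w = np.exp(-0.5 * (d / h) ** 2)               # kernel weights around x0
    s0, s1, s2 = w.sum(), (w * d).sum(), (w * d * d).sum()
    t0, t1 = (w * y).sum(), (w * d * y).sum()
    # Intercept of the kernel-weighted least-squares line in d = x - x0
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 ** 2)

# Simulated errors-in-variables data (all settings are assumptions for illustration)
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-2.0, 2.0, n)                     # true, unobserved covariate X
y = np.sin(2.0 * x) + rng.normal(0.0, 0.1, n)     # response Y
w_obs = x + rng.normal(0.0, 0.5, n)               # observed W = X + U

grid = np.linspace(-1.5, 1.5, 31)
truth = np.sin(2.0 * grid)
fit_clean = np.array([local_linear(g, x, y, 0.15) for g in grid])      # uses true X
fit_naive = np.array([local_linear(g, w_obs, y, 0.15) for g in grid])  # ignores U

mse_clean = np.mean((fit_clean - truth) ** 2)
mse_naive = np.mean((fit_naive - truth) ** 2)
print(f"MSE with true X: {mse_clean:.4f}, naive MSE with W: {mse_naive:.4f}")
```

In this simulated setting the naive fit on W is flatter than the true curve. That distortion is what errors-in-variables corrections aim to remove.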

Results/Contributions

This thesis addresses estimation when the observed explanatory variable W in bivariate data (W, Y) is contaminated by measurement error. When the true values X cannot be observed and only W = X + U is available, where U is the measurement error, the local linear regression we use must be adjusted. We propose a new local linear estimation method and derive its theoretical properties.

We apply the method to a real-world example: mortality data by age in Taiwan for 2021. Age is recorded as an integer in the data, whereas actual age is measured in years and months and is not necessarily an integer, so we treat age as potentially containing measurement error and use the estimation method studied here. In total, there were 182,966 deaths in 2021. The explanatory variable W is age, ranging over [0, 99], and the response variable Y is the number of deaths. (Some of the death figures for ages above 100 are not whole numbers, so we consider the data above age 100 to be potentially inaccurate and removed them from the analysis.) From the age distribution of the deceased, the median age at death is 77 years, the mean is 73.67 years, and the standard deviation is 16.35. On these data, our estimation method (case 1) has a smaller root mean squared error (RMSE) and yields a regression curve that better follows the trend of the data.
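The summary statistics and the RMSE mentioned above can be computed from an age-by-count table along the following lines. This is a minimal sketch, not the thesis code: the helper names are hypothetical, the toy counts are made up and are not the Taiwan data, and the population form of the standard deviation is an assumption, since the source does not say which form was used.

```python
import numpy as np

def weighted_summary(ages, counts):
    """Mean, median and SD of age at death from a frequency table (ages sorted ascending)."""
    ages = np.asarray(ages, dtype=float)
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    mean = (ages * counts).sum() / total
    # Population standard deviation (assumed form)
    sd = np.sqrt((counts * (ages - mean) ** 2).sum() / total)
    # Median: smallest age whose cumulative count reaches half of all deaths
    cum = np.cumsum(counts)
    median = ages[np.searchsorted(cum, total / 2.0)]
    return mean, median, sd

def rmse(y, y_hat):
    """Root mean squared error between observed and fitted values."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

# Toy frequency table (made-up numbers, for illustration only)
toy_ages = [0, 1, 2, 3]
toy_counts = [1, 1, 1, 5]
toy_mean, toy_median, toy_sd = weighted_summary(toy_ages, toy_counts)
print(toy_mean, toy_median, toy_sd)
```

With a fitted curve evaluated at each age, `rmse(counts, fitted)` gives the quantity used to compare the methods.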

Keywords

measurement error