TY - JOUR
T1 - Development and Evaluation of Machine Learning-Based High-Cost Prediction Model Using Health Check-Up Data by the National Health Insurance Service of Korea
AU - Choi, Yeongah
AU - An, Jiho
AU - Ryu, Seiyoung
AU - Kim, Jaekyeong
N1 - Publisher Copyright: © 2022 by the authors.
PY - 2022/10
Y1 - 2022/10
AB - This study analyzed socioeconomic, medical treatment, and health check-up data from 2010 to 2017 from the National Health Insurance Service (NHIS) of Korea. Socioeconomic, treatment, and health check-up data from a given year were used to develop a model that predicts high medical expenses in the following year. A distinguishing feature of this study is that it derives important variables related to high-cost users of medical care in Korea from data on the items of the nationally administered health check-up. High-cost medical expense prediction was treated as a supervised classification task and evaluated with logistic regression, random forest, and XGBoost, which were reported to offer the best performance and explanatory power among the machine learning algorithms used in previous studies. The experimental results show that the XGBoost model performed best, with 77.1% accuracy. The contribution of this study is to identify the variables that affect the prediction of high-cost medical expenses by analyzing medical bills with the health check-up variables and the Korea Classification Disease (KCD) large groups as input variables. The study confirmed that musculoskeletal disorders (M) and respiratory diseases (J), the most frequently treated diseases, are important KCD disease groups for predicting future high costs in Korea. In addition, malignant neoplasms (C), which have a high medical cost per treatment, were confirmed to be a disease group related to the prediction of high future medical costs. Unlike previous studies, these results are based on an analysis of data for all diseases, so the study is expected to be more meaningful when compared with the results of other national health check-up data.
KW - Korea NHIS
KW - XGBoost
KW - data imbalance
KW - health checkup cohort DB
KW - logistic regression
KW - machine learning
KW - medical cost prediction
KW - random forest
UR - http://www.scopus.com/inward/record.url?scp=85140906175&partnerID=8YFLogxK
U2 - 10.3390/ijerph192013672
DO - 10.3390/ijerph192013672
M3 - Article
C2 - 36294248
AN - SCOPUS:85140906175
SN - 1661-7827
VL - 19
JO - International Journal of Environmental Research and Public Health
JF - International Journal of Environmental Research and Public Health
IS - 20
M1 - 13672
ER -