Application of text-mining technique and machine-learning model with clinical text data obtained from case reports for Sasang constitution diagnosis: a feasibility study

Article information

J Korean Med. 2024;45(3):193-210
Publication date (electronic): 2024 September 1
doi: https://doi.org/10.13048/jkm.24049
1 Department of Korean Medicine, College of Korean Medicine, Sangji University
2 Sogang Univ. Computer Science & Engineering
3 Department of Sasang Constitutional Medicine, College of Korean Medicine, Sangji University
4 Research Institute of Korean Medicine, Sangji University
Correspondence to: Jun-sang Yu, Korean Medicine Hospital of Sangji University, 80 Sangjidae-gil, Wonju-si, Gangwon-do, 26338, Republic of Korea, Tel: +82-33-741-9203, E-mail: hiruok@sangji.ac.kr
§ These authors contributed equally to this work.

Received 2024 August 2; Revised 2024 August 28; Accepted 2024 August 28.

Abstract

Objectives

We applied text mining to Sasang constitution case reports to derive network analysis results, and we built machine-learning classifiers to identify a model suitable for classifying Sasang constitution from text data.

Methods

Case reports on Sasang constitution published from January 1, 2000, to December 31, 2022, were searched. The search selected 343 papers, yielding 454 cases. Extracted texts were preprocessed and tokenized with the Python-based KoNLPy package, and each morpheme was vectorized using its TF-IDF value. Word cloud visualization and centrality analysis identified the keywords mainly used for classifying Sasang constitution in clinical practice. To select the most suitable classification model for diagnosing Sasang constitution, the performance of five models (XGBoost, LightGBM, SVC, Logistic Regression, and Random Forest Classifier) was evaluated using accuracy and F1-Score.
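
The TF-IDF step described above can be illustrated with a minimal sketch. It assumes the texts are already tokenized (the KoNLPy step is not reproduced), uses hypothetical English placeholder tokens rather than study data, and adopts the smoothed variant idf = ln((1 + N) / (1 + df)) + 1, since the abstract does not state which weighting was used.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {token: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumed variant; the abstract does not specify one).
    idf = {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in df.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Term frequency is the token count divided by the document length.
        vectors.append({tok: (c / total) * idf[tok] for tok, c in counts.items()})
    return vectors


# Hypothetical, already-tokenized case texts (placeholders, not study data).
cases = [
    ["headache", "nausea", "vomiting", "headache"],
    ["dizziness", "headache", "gait"],
    ["thin", "abdomen", "abdominal pain"],
]
vectors = tfidf(cases)
```

In this toy corpus `headache` appears in two of the three documents, so its IDF is lower than that of `nausea`, but its higher term frequency in the first case still gives it the larger weight there.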

Results

Through word cloud visualization and centrality analysis, specific keywords for each constitution were identified. Logistic regression showed the highest accuracy (0.839416), while random forest classifier showed the lowest (0.773723). Based on F1-Score, XGBoost scored the highest (0.739811), and random forest classifier scored the lowest (0.643421).

Conclusions

This is the first study to analyze constitution classification by applying text mining and machine learning to case reports, and it provides a concrete research model for follow-up studies. The keywords selected through text mining were confirmed to reflect the characteristics of each Sasang constitution type. Among the five models compared on text data from case reports, logistic regression (highest accuracy) and XGBoost (highest F1-Score) were the most suitable for diagnosing Sasang constitution.

Fig. 1

Study flow of text-mining and machine learning.

Fig. 2

Flow chart of literature searches and screening results.

Fig. 3

Word cloud visualization results for each Sasang constitution.

Table 1

English Word Translation Criteria

| Translation exclusion criteria | Examples |
| --- | --- |
| Words written in English in most of the research papers | VAS, QSCCII |
| Words that represent a unit | kg, cm |
| Names of medicines | trolac, NSAID |

Table 2

Data Refining Criteria

| Category | Criterion | Example (before) | Example (after) |
| --- | --- | --- | --- |
| Exclusions | Not a key variable, and used conventionally | Above-mentioned, Opinion, Not, usual, And, When, time | Delete |
| Exclusions | Terms related to Korean medicine, but used conventionally | Common Questions in the Constitutional Questionnaire (Address, Symptoms usually present, Medical history, Body type, Temperament, Abilities); Oriental Medicine, Diagnosis, Defecation, Urine | Delete |
| Synonyms | Cases with the same or similar meanings but different spellings | ‘ears, eyes, mouth, and nose’, ‘ears, eyes, nose, and mouth’, ‘eyes, nose, and mouth’ | ‘ears, eyes, nose, and mouth’ |
| Synonyms | Cases where a single word represents or encompasses other words | Sleep disorder, Difficulty falling asleep, Nocturnal sleep disorder, Difficulty maintaining sleep, Insomnia, Sleep difficulties, Difficulty falling asleep | Sleep disorder |
| Native words | Cases where a compound word is perceived as separate components | cold, sweat | Cold sweat |
| Native words | Cases where a compound word is perceived as separate components | Hyung, Geumji, Pose | Hyunggeumjipose |
| Native words | Cases where multiple words should be considered as a single phrase | Abdominal, bloating | Abdominal bloating |
| Native words | Cases where multiple words should be considered as a single phrase | Nocturnal, sleep, disorder | Nocturnal sleep disorder |

Table 3

Number of Data by Sasang Constitution

| Sasang Constitution | Number of data |
| --- | --- |
| Soeumin | 92 |
| Soyangin | 198 |
| Taeeumin | 148 |
| Taeyangin | 16 |
| Total | 454 |
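
The class counts in Table 3 are uneven, which matters when reading the accuracy figures. The short calculation below uses only the counts in the table to show each class share and the accuracy a trivial majority-class predictor would reach. It assumes that an evaluation split preserves these proportions, which the abstract does not state.

```python
# Case counts per constitution, copied from Table 3.
counts = {"Soeumin": 92, "Soyangin": 198, "Taeeumin": 148, "Taeyangin": 16}
total = sum(counts.values())

# Share of each class in the data set.
shares = {name: n / total for name, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the largest class,
# assuming the evaluation data follow the same class proportions.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

for name, share in shares.items():
    print(f"{name:10s} {counts[name]:4d} {share:.1%}")
print(f"majority baseline ({majority_class}): {baseline_accuracy:.3f}")
```

Taeyangin makes up only about 3.5% of the cases, so a metric that averages over classes, such as a macro-averaged F1-Score, is more sensitive to errors on that class than accuracy is.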

Table 4

Combined Centrality (Top10)

| Rank | Soeumin | TI | CC | Soyangin | TI | CC | Taeeumin | TI | CC | Taeyangin | TI | CC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Thin | 0.136 | 0.217 | Vomiting | 0.353 | 0.313 | Dizziness | 0.245 | 0.211 | nothing particular | 0.122 | 0.182 |
| 2 | Chest | 0.082 | 0.201 | Headache | 0.163 | 0.282 | Headache | 0.163 | 0.210 | Evening | 0.082 | 0.182 |
| 3 | Severe | 0.027 | 0.196 | nausea | 0.381 | 0.280 | Gait | 0.109 | 0.201 | Weakness | 0.218 | 0.155 |
| 4 | Bilateral | 0.082 | 0.193 | Thin | 0.136 | 0.276 | Bilateral | 0.082 | 0.197 | Gait | 0.109 | 0.155 |
| 5 | Abdomen | 0.109 | 0.186 | Pain | 0.082 | 0.260 | Head | 0.082 | 0.193 | Exercise | 0.109 | 0.155 |
| 6 | abdominal pain | 0.082 | 0.181 | Dizziness | 0.245 | 0.258 | Thorax | 0.109 | 0.190 | -ed | | 0.155 |
| 7 | Physique | 0.272 | 0.176 | Entire body | 0.163 | 0.249 | Abdomen | 0.109 | 0.186 | Duration | 0.163 | 0.152 |
| 8 | Lower extremities | | 0.176 | Above | | 0.236 | Drug | | 0.181 | Bilateral | 0.082 | 0.152 |
| 9 | Shoulder | 0.054 | 0.174 | Physique | 0.272 | 0.225 | stress | 0.054 | 0.175 | Defecation | 0.109 | 0.125 |
| 10 | Drug | | 0.172 | Administration | 0.190 | 0.220 | Nocturnal | 0.136 | 0.172 | Limbs | 0.109 | 0.125 |

TI: TF-IDF, CC: Combined Centrality, ■: TF-IDF<0.1
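
How the combined centrality in Table 4 is computed is not stated in this section. As a rough illustration of what a centrality score on a keyword network measures, the sketch below computes plain degree centrality on a document-level co-occurrence graph. The choice of degree centrality, the rule that two tokens are linked when they share a document, and the placeholder tokens are all assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict
from itertools import combinations


def degree_centrality(docs):
    """Degree centrality of each token in a document-level co-occurrence graph."""
    neighbors = defaultdict(set)
    for doc in docs:
        # Two distinct tokens are linked if they appear in the same document.
        for a, b in combinations(sorted(set(doc)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {tok: 0.0 for tok in neighbors}
    # Normalize by the maximum possible number of neighbors (n - 1).
    # Tokens that never co-occur with another token are not in the graph.
    return {tok: len(nb) / (n - 1) for tok, nb in neighbors.items()}


# Hypothetical tokenized case texts (placeholders, not study data).
docs = [
    ["headache", "nausea", "vomiting"],
    ["headache", "dizziness"],
    ["dizziness", "gait"],
]
centrality = degree_centrality(docs)
```

In this toy graph `headache` is linked to three of the four other tokens and therefore ranks highest, which is the kind of ordering a centrality column expresses.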

Table 5

Best Parameter of Algorithm

| Algorithm | Best Params. | F1-score |
| --- | --- | --- |
| XGBoost | {'learning_rate': 0.5, 'max_depth': 20, ···} | 0.696374 |
| LightGBM | {'learning_rate': 1, 'max_depth': 10, ···} | 0.695833 |
| SVC | {'C': 10, 'kernel': 'linear'} | 0.651028 |
| Logistic Regression | {'C': 20} | 0.668290 |
| Random Forest Classifier | {'n_estimators': 50} | 0.603950 |

Table 6

Accuracy and F1-Score of Algorithms

| Algorithm | Accuracy | F1-Score | Precision | Recall |
| --- | --- | --- | --- | --- |
| XGBoost | 0.810219 | 0.739811 | 0.859072 | 0.696374 |
| LightGBM | 0.795620 | 0.730692 | 0.835910 | 0.695833 |
| SVC | 0.817518 | 0.688447 | 0.872854 | 0.651028 |
| Logistic Regression | 0.839416 | 0.705982 | 0.889106 | 0.668290 |
| Random Forest Classifier | 0.773723 | 0.643421 | 0.853030 | 0.603950 |
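
In Table 6 each F1-Score is lower than the harmonic mean of the precision and recall listed in the same row (for XGBoost the harmonic mean is about 0.769, against a reported 0.739811). One reading consistent with this is that all three metrics are macro-averaged over the four classes, because a macro F1 is the mean of per-class F1 values rather than the harmonic mean of macro precision and macro recall. The averaging scheme is not stated in this section, so that reading is an assumption. The sketch below uses hypothetical labels to show the effect.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label in y_true."""
    scores = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall and F1 over classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Hypothetical labels (not study data): class "B" is rare and poorly recalled.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]

macro_p, macro_r, macro_f1 = macro_average(per_class_scores(y_true, y_pred))
harmonic = 2 * macro_p * macro_r / (macro_p + macro_r)

# Same comparison for the XGBoost row of Table 6.
p, r, reported_f1 = 0.859072, 0.696374, 0.739811
table_harmonic = 2 * p * r / (p + r)

print(f"toy macro F1 {macro_f1:.3f} vs harmonic mean {harmonic:.3f}")
print(f"XGBoost reported F1 {reported_f1:.3f} vs harmonic mean {table_harmonic:.3f}")
```

In the toy example the rare class pulls the macro F1 below the harmonic mean of the macro precision and recall, the same direction as the gap in the table.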