A Novel COVID-19 Data Set and an Effective Deep Learning Approach for the De-Identification of Italian Medical Records

Authors: Catelli, Rosario (57210848004); Gargiulo, Francesco (36020088700); Casola, Valentina (6506461115); De Pietro, Giuseppe (6508247659); Fujita, Hamido (35611951900); Esposito, Massimo (25926920500)
Affiliations: Institute for High Performance Computing and Networking (ICAR), National Research Council, Naples, 80131, Italy | Faculty of Software and Information Science, Iwate Prefectural University, Iwate, 020-0611, Japan | Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI), University of Granada, Granada, 18014, Spain | Faculty of Information Technology, Ho Chi Minh City University of Technology (HUTECH), Ho Chi Minh, 723000, Viet Nam | Department of Electrical Engineering and Information Technologies, University of Naples Federico II, Naples, 80125, Italy

IEEE Access, 2021 (Volume 9, pages 19097-19110)

ISSN: 21693536

DOI: 10.1109/ACCESS.2021.3054479

Keywords: Electronic document exchange; Long short-term memory; Natural language processing systems; Semantics; Statistical tests; Syntactics; Electronic health records (EHRs); Named entity recognition; Natural language processing; Semantic similarity; Sensitive information; Syntactic variations; Training and development; Word representations; Deep learning
English abstract
In recent years, the need to de-identify privacy-sensitive information within Electronic Health Records (EHRs) has become increasingly pressing, in order to encourage the sharing and publication of their content in accordance with the restrictions imposed by both national and supranational privacy authorities. In the field of Natural Language Processing (NLP), several deep learning techniques for Named Entity Recognition (NER) have been applied to this problem, significantly improving the identification of sensitive information in EHRs written in English. However, the lack of data sets in other languages has strongly limited their applicability and performance evaluation. To address this gap, a new de-identification data set in Italian has been developed in this work, starting from the 115 COVID-19 EHRs provided by the Italian Society of Radiology (SIRM): 65 were used for training and development, and the remaining 50 for testing. The data set was labelled following the guidelines of the i2b2 2014 de-identification track. As an additional contribution, a stacked word representation, not previously tested in the Italian clinical de-identification scenario, has been combined with the best-performing Bi-LSTM + CRF sequence labeling architecture. This representation is based both on a contextualized language model, to handle word polysemy and morpho-syntactic variation, and on sub-word embeddings, to better capture latent syntactic and semantic similarities. Finally, the proposed model was compared with other cutting-edge approaches and achieved the best performance, supporting the effectiveness of the proposed approach. © 2013 IEEE.
