Evaluating semantic textual similarity in clinical sentences using deep learning and sentence embeddings

The wide adoption of electronic health records (EHRs) has fostered an improvement in healthcare quality, with EHRs currently representing a major source of medical information. Nevertheless, this process has also brought new challenges to the medical environment, since the facilitated replication of information (e.g. using copy-paste) has resulted in less concise and sometimes incorrect information, which hinders the understandability of this data and can compromise the quality of medical decisions drawn from it. Due to the high volume and redundancy in medical data, it is imperative to develop solutions that can condense information whilst retaining its value, with a possible methodology involving the assessment of the semantic similarity between clinical text excerpts. In this paper we present an approach that explores neural networks and different types of text preprocessing pipelines, and that evaluates the impact of using word embeddings or sentence embeddings. We present the results following our participation in the n2c2 shared task on clinical semantic textual similarity, perform an error analysis, and discuss the obtained results along with possible future improvements.


INTRODUCTION
For years, technology advancements have been applied in health care with the goal of preventing, diagnosing and treating diseases, as well as improving the quality of life of the general population. These technological breakthroughs have brought new tools and information sources to physicians, aiding them in clinical workflows like patient follow-up and clinical decision making, whilst also playing an important role in the transition into the personalized medicine paradigm.
Electronic health records (EHRs) are an example of such widely adopted technologies in medicine, providing an electronic infrastructure that can aggregate a multitude of medical information and support the medical act. EHRs provide a longitudinal view of the patient trajectory, comprising the past and present of a patient's health condition 1. Despite containing rich contextual information in structured data that can, for instance, be explored for prediction modelling purposes [14,15,32], large amounts of valuable patient information are stored in unstructured notes, commonly referred to as free text, which are often underexplored due to the difficulties in processing this type of text [24].
Due to the large extent of information contained in an EHR, physicians are provided with the key aspect of context when reasoning on their medical decisions, making EHRs a key component of the patient-centered notion of health care. However, even though their wide adoption has enabled an improvement in healthcare quality, certain challenges have also arisen with it. An example of such an issue is the facilitated process of replicating information in medical text reports through copy-paste actions or by using pseudo-templates, which has impacted EHR data quality: it can lead to less concise documentation and an increased chance of introducing erroneous information, which can consequently compromise the quality of the medical act [13,27].
Owing to the utmost importance of reducing the dimension and redundancy of EHR data, solutions for annotating relevant data and summarizing clinical text [26] have been the focus of much research, mostly targeting clinical natural language processing (NLP). A recently explored approach to reduce clinical text redundancy is to assess the semantic textual similarity (STS) between different text excerpts from an EHR. STS has attracted particular attention in recent years. SemEval, an ongoing series of evaluations of computational semantic analysis systems, started a pilot STS challenge track in 2012 which attracted the attention of the research community [5] and, due to its success, organized a series of STS challenge tracks from 2012 through 2017 [1-4,8]. However, these shared tasks had the major drawback of being centered on general-domain text, whereas clinical text is inherently different in its characteristics. Therefore, an additional effort for the clinical domain was required.
The increasing interest in exploring and pushing forward existing research with clinical data, to further improve healthcare quality, led to the creation of dedicated resources and challenges. To that extent, in recent years, text corpora specifically focused on clinical text were created, along with shared tasks on clinical STS that leverage those corpora. More detailed coverage of clinical resources and STS challenges is provided in Section 2.
In the present paper we describe an approach for evaluating the STS between clinical sentence pairs. For that, we explored the use of neural networks and evaluated the usage of different clinical text preprocessing pipelines. Moreover, we compared the impact of using word embeddings versus sentence embeddings. The proposed method was used in our participation in the 2019 n2c2 shared task on clinical semantic textual similarity 2 .

RELATED WORK
Over the years, healthcare quality has benefited from the symbiosis between the medicine and technology fields. Despite the significant research focus given to certain data modalities, such as medical imaging or -omics data, text data from the clinical narratives stored in EHRs has been gaining renewed interest, since natural language provides a flexible medium for physicians to track and report each medical situation, enabling complete descriptions and overcoming the limitations of the ambiguous and unspecific terms that physicians can find when using coding standards such as RxNorm [23] or SNOMED-CT [29]. Therefore, the potential of free text is great, as it frequently contains information that is otherwise not obtainable from other data sources [16].
However, free text is very challenging to process and, on top of that, can be cluttered with redundant information. Data redundancy is in fact an increasing problem since the digital processes used by physicians to document medical data have the major issue of facilitating information replication (e.g. through copy-paste actions), which can reduce the quality of EHR data and increase the probability of introducing incorrect information that can lead to clinical error [13,27]. This problem has been acknowledged by physicians, who state the need to improve the clinical documentation process to cope with these growing issues [18]. A possible approach to tackle the abovementioned issue is by exploring clinical STS to detect text excerpts containing repeated and/or similar content within EHR documents.

Biomedical and Clinical STS Resources
Recent years have shown efforts to bring STS to the clinical data domain. In 2017, motivated by the rapidly increasing availability of textual data in the biomedical domain, by the need to facilitate the retrieval and analysis of this data, and by the lack of suitable datasets for the development of appropriate systems, Soğancıoğlu et al. created BIOSSES: a benchmark dataset containing 100 sentence pairs from biomedical literature, where each sentence pair was scored on a scale from 0 to 4 regarding its semantic similarity, and used it for the development of STS methods [28]. Although this was an interesting initial effort, the type of text found in biomedical literature differs significantly from that of medical narratives, so an additional effort was still required.
In 2018, with the goal of exploring clinical STS to reduce clinical text redundancy, Wang et al. assembled MedSTS, a dataset containing 174 629 pairs of clinical sentences extracted from a clinical corpus at Mayo Clinic. From this pool of clinical sentence pairs, 1 068 pairs were annotated by two medical experts, who scored the semantic similarity of each pair with a value between 0 (dissimilar) and 5 (equivalent), resulting in the MedSTS_ann dataset. The authors used MedSTS_ann to compare the performance of existing STS approaches on general and clinical domain STS datasets and, as expected, observed that performances obtained on MedSTS_ann were in general lower, demonstrating the higher complexity of clinical text [30].

Clinical STS Shared Tasks
Driven by the interest of motivating the research community to solve real-world clinical problems, Wang et al. organized a shared task on clinical STS where they released MedSTS_ann, hence making it the first available resource for the study of clinical STS. The pioneering shared task was titled BioCreative/OHNLP Challenge 2018 Task 2: Clinical Semantic Textual Similarity 3 and attracted the participation of 4 teams, who developed automatic systems to measure the semantic relation between sentence pairs from MedSTS_ann [31].
Submitted solutions explored various techniques ranging from conventional machine learning to deep learning models. The winning system consisted of a regression model applied on 8 trained models: Random Forest, Bayesian Ridge regression, Lasso regression, linear regression, Extra Tree model, Dense Neural Network (DNN) using the Universal Sentence Encoder, DNN using the inferSent encoder, and Encoder-MLP using the inferSent encoder. This system attained the highest Pearson correlation of 0.8328 [9].
The second placing team obtained its best result (0.8143) with an ensemble that averaged the predictions from two models: (i) an Attention-Based Convolutional Neural Network (ABCNN) with NLP features, and (ii) a Bidirectional Long Short-Term Memory network (Bi-LSTM) [33].
The third team used sentence embeddings obtained by performing a weighted average of word vectors followed by a soft projection. To address the common component bias in linear sentence embeddings, a self-regularized identity map (conceptors) was used. This approach obtained a Pearson correlation of 0.7789 [20]. The final team attained a best Pearson correlation of 0.7090 and did not disclose a detailed approach.
On a more recent note, building upon the experience from the BioCreative/OHNLP shared task on clinical STS, a collaboration between n2c2 and OHNLP resulted in the 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity with the objective of expanding previous work by providing novel annotated data, thus allowing further development and evaluation of systems on previously unseen data.

Sentence Representation: BioWordVec versus BioSentVec
Following the BioCreative/OHNLP shared task, members of the winning team continued their work by exploring the field of sentence representation, focusing on word and sentence embeddings. Due to the limitations of traditional word embeddings, which are computed at the word level and trained on general-domain text, Zhang et al. decided to develop a set of biomedical word embeddings that combined subword information from biomedical text with the biomedical vocabulary MeSH (Medical Subject Headings). The resulting biomedical word embeddings, named BioWordVec, performed better than word2vec embeddings in medical word pair similarity benchmarks and were made publicly available to the community [34]. The authors further improved these embeddings by training them on a new corpus containing over 30 million documents from PubMed articles and clinical notes from the MIMIC-III Clinical Database [17].

Word embeddings capture representations at the word level, so to represent a sentence it is necessary to decompose it into words and combine the representations of each word. However, it is also possible to represent sentence text in a more straightforward manner, using sentence embeddings instead. Similarly to the word embedding scenario, despite the existence of pre-trained sentence encoders for the general domain, biomedical and clinical text remained an unexplored field, so Chen et al. trained sentence embeddings on the same PubMed + MIMIC-III dataset used for BioWordVec. The result was BioSentVec, a set of publicly available sentence embeddings suitable for biomedical and clinical text applications [11] (both BioWordVec and BioSentVec are available at https://github.com/ncbi-nlp/BioSentVec).
To assess the adequacy and effectiveness of these sentence embeddings, Chen et al. tested BioSentVec on BIOSSES and MedSTS_ann. Using a much simpler 5-layer deep learning model that only received the two sentence vectors generated by BioSentVec as input, the authors managed to obtain a Pearson correlation of 0.836 on MedSTS_ann, slightly improving the previous state of the art. This demonstrated that BioSentVec embeddings can effectively capture sentence semantics in clinical text [11]. The authors performed a further comparative evaluation on MedSTS_ann using various sentence similarity models, including the latest bidirectional transformers in the clinical domain, such as BioBERT [19], and observed that (i) the simpler approach proposed in [11] still obtained the best performance, and (ii) embeddings trained on large corpora are the best solution to capture sentence semantics in small datasets [10].

MATERIALS AND METHODS
In this work we evaluate a machine learning system in a supervised setting for predicting the semantic relatedness between clinical sentences. For this task, a real value in the interval [0, 5] is attributed to each pair of sentences: if two sentences are completely unrelated they have a similarity score of 0 (minimum), whereas if they share the same semantic meaning their similarity value is 5 (maximum). As in previous challenges on STS [8,31], the evaluation metric is the Pearson correlation coefficient, which compares two lists of similarity values (system predictions and ground truth).
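For clarity, the evaluation boils down to computing the Pearson correlation between the list of system predictions and the list of ground-truth scores. A minimal sketch using SciPy follows; the scores shown are invented for illustration only:

```python
from scipy.stats import pearsonr

# Hypothetical system predictions and ground-truth similarity scores in [0, 5]
system_predictions = [0.2, 4.8, 2.5, 3.9, 1.1]
ground_truth = [0.0, 5.0, 3.0, 4.1, 0.8]

r, _ = pearsonr(system_predictions, ground_truth)
print(f"Pearson correlation: {r:.4f}")
```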
In the following subsections we describe the dataset used in this work, the different approaches used to preprocess clinical text, the process of feature representation, and finally the deep learning models employed in our simulations.

Dataset
The dataset contained a total of 2 054 sentence pairs and was split beforehand by the n2c2 organizers into training and test subsets. The training data allowed for model development, whereas the test data was used solely for official evaluation in the challenge; only afterwards were the ground truth scores of the test set made available to the participating teams. Table 1 presents some statistics about the dataset. Surprisingly, the distributions of the similarity scores in the two subsets (training and test) are somewhat discrepant, since the test data contains a higher number of pairs with scores in the interval [0, 1].
A list of example clinical sentence pairs is provided in Table 2, containing the sentence pairs, their respective similarity score, and an explanation of the criteria used to assign the score.

Clinical Text Preprocessing
Before converting sentences into embedding representations, it was necessary to apply a preprocessing step to the clinical sentence pairs. In this work, three different pipelines were tested.

Base Preprocessing. The baseline preprocessing pipeline was simple, consisting only of two steps: (i) lowercase conversion of the text, and (ii) tokenization using the NLTK tokenizer 5 .

Advanced Preprocessing With Full Stopword Removal. This pipeline was inspired by the preprocessing used by Chen et al. in the BioCreative/OHNLP 2018 Task on Clinical Semantic Textual Similarity [9], and had the objective of retaining as much semantic information as possible in the preprocessed sentences.
The pipeline started with the separation of number ranges (e.g. 0.3-1.8 → 0.3 to 1.8), which are frequent in lab analysis data. The second step was to expand numbers into their textual counterparts (e.g. 78 → seventy-eight; 0.9 → zero point nine). Next, words connected by slashes, dashes and dots were separated with spaces (e.g. yes/no → yes / no; point-of-care → point - of - care). Then, leading white spaces were removed and double spaces were converted to single spaces. The resulting text was converted to lowercase and tokenized with the NLTK tokenizer. Finally, stopwords were removed using a complete stopword list for biomedical literature (https://www.ncbi.nlm.nih.gov/IRET/DATASET), punctuation was removed from the tokens, and tokens composed of a single character were discarded.

Table 2: Example clinical sentence pairs, their similarity scores, and the criteria used to assign each score.

4 (the two sentences are mostly equivalent, but some unimportant details differ)
S1: Discussed goals, risks, alternatives, advanced directives, and the necessity of other members of the surgical team participating in the procedure with the patient
S2: Discussed risks, goals, alternatives, advance directives, and the necessity of other members of the healthcare team participating in the procedure with the patient and his mother

3 (the two sentences are roughly equivalent, but some important information differs or is missing)
S1: Cardiovascular assessment findings include heart rate normal, Heart rhythm, atrial fibrillation with controlled ventricular response
S2: Cardiovascular assessment findings include heart rate, bradycardic, Heart rhythm, first degree AV Block

2 (the two sentences are not equivalent, but share some details)
S1: Discussed risks, goals, alternatives, advance directives, and the necessity of other members of the healthcare team participating in the procedure with (patient) (legal representative and others present during the discussion)
S2: We discussed the low likelihood that a blood transfusion would be required during the postoperative period and the necessity of other members of the surgical team participating in the procedure

1 (the two sentences are not equivalent, but are on the same topic)
S1: No: typical 'cold' symptoms; fever present (greater than or equal to 100.4 ºF or 38 ºC) or suspected fever; rash; white patches on lips, tongue or mouth (other than throat); blisters in the mouth; swollen or 'bull' neck; hoarseness or lost voice or ear pain
S2: New wheezing or chest tightness, runny or blocked nose, or discharge down the back of the throat, hoarseness or lost voice

0 (the two sentences are completely dissimilar)
S1: The risks and benefits of the procedure were discussed, and the patient consented to this procedure
S2: The content of this note has been reproduced, signed by an authorized physician in the space above, and mailed to the patient's parents, the patient's home care company

Advanced Preprocessing With Partial Stopword Removal. This approach is similar to the previous pipeline, differing only in the stopword removal step. Since the full stopword list for biomedical text contains terms that can be important for retaining the semantics of clinical text, a smaller stopword list from Luo et al. [21] was used instead, composed of the following terms: 's, a, an, any, her, his, patient, that, the, these, this, those, your.
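To make the pipeline concrete, the following is a minimal sketch of the advanced preprocessing under stated assumptions: the num2words package is assumed for number expansion, NLTK for tokenization, and the regular expressions are our own approximation of the rules described above. Swapping the stopword list switches between the full and partial variants.

```python
import re
import string

import nltk  # requires running nltk.download("punkt") once
from num2words import num2words  # assumed third-party helper for number expansion

# Partial stopword list from Luo et al. [21]
PARTIAL_STOPWORDS = {"'s", "a", "an", "any", "her", "his", "patient",
                     "that", "the", "these", "this", "those", "your"}

def expand_number(match):
    text = match.group()
    return num2words(int(text)) if text.isdigit() else num2words(float(text))

def preprocess(sentence, stopwords=PARTIAL_STOPWORDS):
    # Separate number ranges: "0.3-1.8" -> "0.3 to 1.8"
    sentence = re.sub(r"(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)", r"\1 to \2", sentence)
    # Expand numbers into words: "78" -> "seventy-eight", "0.9" -> "zero point nine"
    sentence = re.sub(r"\d+(?:\.\d+)?", expand_number, sentence)
    # Split words connected by slashes, dashes and dots: "yes/no" -> "yes / no"
    sentence = re.sub(r"(?<=\w)([/.-])(?=\w)", r" \1 ", sentence)
    # Normalize whitespace, lowercase, and tokenize
    tokens = nltk.word_tokenize(" ".join(sentence.split()).lower())
    # Drop stopwords, punctuation, and single-character tokens
    return [t for t in tokens if t not in stopwords
            and t not in string.punctuation and len(t) > 1]

print(preprocess("Lab range 0.3-1.8 at point-of-care testing"))
```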

Feature Representation
A crucial step in machine learning is feature representation, which aims to transform any kind of data (text, images, and others) into a numeric representation. Regarding textual representation, the bag-of-words technique (one-hot encoding) is commonly employed in traditional machine learning classifiers for tasks such as document classification or word sense disambiguation. However, in recent years, more efficient methods for textual representation have been proposed. Distributed representations of words, or word embeddings, allow the computation of word vector representations of relatively small dimension from large unlabeled textual data [6,7,22]. Vector representations of sentences, or sentence embeddings, have also been investigated [25]. Word and sentence embeddings have been successfully combined with neural network models, achieving state-of-the-art results in many NLP tasks.

For this clinical STS task we evaluated, separately, the use of word embeddings and sentence embeddings, employing the publicly available BioWordVec [34] and BioSentVec [11] models. To encode a sentence using word embeddings, we summed the embedding vectors of its constituent words and normalized the result.
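As an illustration, sentence encoding with each representation might look as follows. This is a sketch under assumptions: we load BioWordVec with gensim and BioSentVec with the sent2vec package, and the file names are placeholders for the pre-trained models published at https://github.com/ncbi-nlp/BioSentVec.

```python
import numpy as np
from gensim.models import KeyedVectors
import sent2vec

# Placeholder paths to the published pre-trained models
word_model = KeyedVectors.load_word2vec_format(
    "bio_embedding_extrinsic", binary=True)  # BioWordVec word vectors
sent_model = sent2vec.Sent2vecModel()
sent_model.load_model("BioSentVec_PubMed_MIMICIII-bigram_d700.bin")  # 700-d

def encode_with_word_embeddings(tokens):
    """Sum the word vectors of the constituent words and L2-normalize."""
    vectors = [word_model[t] for t in tokens if t in word_model]
    if not vectors:
        return np.zeros(word_model.vector_size)
    summed = np.sum(vectors, axis=0)
    return summed / np.linalg.norm(summed)

def encode_with_sentence_embeddings(sentence):
    """Encode the whole (preprocessed) sentence in a single step."""
    return sent_model.embed_sentence(sentence)[0]
```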

Deep Learning Model
The Keras library [12] was used to implement and test different neural network models. Our proposed model was derived from other state-of-the-art works [9,11] that use word and sentence embeddings. We tested various types and configurations of deep learning models, yet simpler models yielded better results, similarly to what was observed in [11].
We present a neural network whose inputs are the embedding representations of the two sentences, encoded with either (i) word embeddings or (ii) sentence embeddings. In both cases we also included their element-wise multiplication and dot product as input features. In the latter case we additionally included the cosine similarity, since the sentence embedding vectors were not normalized and this feature provided valuable information (preliminary results were higher).
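A sketch of how the model input could be assembled from a pair of sentence vectors; the concatenation order is our assumption, and the cosine term is only appended in the sentence embedding setting:

```python
import numpy as np

def build_features(v1, v2, include_cosine=False):
    """Concatenate both sentence vectors with their element-wise product,
    dot product and, optionally, cosine similarity."""
    dot = float(np.dot(v1, v2))
    parts = [v1, v2, v1 * v2, [dot]]
    if include_cosine:  # used with the (unnormalized) sentence embeddings
        parts.append([dot / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)])
    return np.concatenate(parts)
```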
The neural network contained a first layer with 512 units and ReLU activation, with Xavier normal initialization, a bias constant of 0.01, and L2 regularization of 0.001. Afterwards, a dropout rate of 0.4 was applied, and a final unit with sigmoid activation produced the predictions. The stochastic gradient descent optimizer was employed with a learning rate of 1.0, and the mean squared error was used as loss function. A simplified scheme of the neural network with sentence embeddings is presented in Figure 1.

For fine-tuning the hyperparameters and adapting the model architecture we used repeated K-fold cross-validation: in each "repetition" we applied cross-validation, splitting the training data into three subsets: training, validation, and test. This allowed for consistent model development without biasing the model towards the official test set. We used the training subset to update the network weights, the validation subset to evaluate model performance in each epoch, and the test subset for unbiased evaluation. The model that obtained the highest result on the validation subset was chosen to evaluate performance on the test subset, and training was halted when performance on the validation subset did not improve for 20 epochs. After this intensive model refinement, with thorough evaluation on training data, the model configuration was left unmodified to be applied on unseen test data. In the next section we evaluate the impact of using different preprocessing approaches, as well as using word embeddings or sentence embeddings as input vectors to the proposed neural network model.
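A minimal Keras sketch consistent with this description follows. Two details are our assumptions: the similarity targets are scaled to [0, 1] to match the sigmoid output (and rescaled afterwards), and validation performance is monitored via the loss.

```python
from tensorflow import keras
from tensorflow.keras import layers, initializers, regularizers

def build_model(input_dim):
    model = keras.Sequential([
        layers.Dense(512, activation="relu", input_shape=(input_dim,),
                     kernel_initializer=initializers.GlorotNormal(),  # Xavier normal
                     bias_initializer=initializers.Constant(0.01),
                     kernel_regularizer=regularizers.l2(0.001)),
        layers.Dropout(0.4),
        layers.Dense(1, activation="sigmoid"),  # output later rescaled to [0, 5]
    ])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=1.0),
                  loss="mean_squared_error")
    return model

# Halt training when validation performance stops improving for 20 epochs
early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=20, restore_best_weights=True)
# model.fit(X_train, y_train / 5.0, validation_data=(X_val, y_val / 5.0),
#           epochs=200, callbacks=[early_stopping])
```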

RESULTS AND DISCUSSION
To gather results we applied different text preprocessing pipelines and sentence representations: word embeddings versus sentence embeddings. We evaluated performance on (i) training data, using repeated cross-validation, and (ii) test data, from the predictions of a single evaluation. Results presented for the training set were obtained by averaging 30 separate scores, whereas evaluation on test data was slightly different: we averaged the predictions of 30 individual models. The idea behind this was to increase model robustness on the test data, because the ensemble is able to 'see', and learn from, the whole training data through the diverse folds used for training and validation.
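The test-time ensembling can be sketched as follows, assuming `models` holds the 30 fold-trained networks (hypothetical name) and predictions are rescaled from the sigmoid range back to [0, 5], in line with the scaling assumed above:

```python
import numpy as np

# Average the predictions of the 30 fold-trained models over the test features
test_predictions = 5.0 * np.mean(
    [m.predict(X_test).ravel() for m in models], axis=0)
```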
During model development on training data, a compromise between the sizes of the training, validation, and test subsets was necessary. Because of that, we explored different K-fold split values to assess the impact of using less data for training and more for validation, and vice versa. We hypothesized that an 'optimal' split, with enough training data and a solid evaluation on validation data, could translate into improvements on unseen data. Table 3 presents detailed results from all these experiments. In general, the use of sentence embeddings provided superior results, especially when using the base text preprocessing. Surprisingly, the full text preprocessing provided better results when using word embeddings, suggesting a benefit from stopword removal. To assess the effect of using different K-fold splits, we highlight the top scores according to the evaluation on the training set: for 10, 5, and 3 folds these were, respectively, 0.811, 0.812, and 0.792 on training data, and 0.837, 0.819, and 0.836 on test data. Therefore, we conclude that the use of different splits for training and validation did not significantly affect the results, and thus any of the fold counts we used would be acceptable.
The highest scoring model on training data (0.812), which consisted of sentence embeddings with partial text preprocessing (5-fold setting), produced a correlation of 0.819 on test data. However, better results on test data were achieved when using sentence embeddings with the base text preprocessing (0.837, 0.831, and 0.836 for 10, 5, and 3 folds, respectively). Furthermore, it is interesting to notice that word embeddings with full text preprocessing also attained good test results (0.823, 0.826, and 0.824 for 10, 5, and 3 folds, respectively), similar to those with sentence embeddings. Based on this, we believe that the combination of word and sentence embeddings may provide further improvement.

Overall, when analysing training and test performances, it is noteworthy that systems obtained higher scores on test data by a margin of approximately 0.02. We suspect that one of the reasons for this is that evaluation on test data was performed by averaging several models (each trained and validated on distinct data subsets), whilst the results reported on training data are the average of several simulations made by individual models.

Finally, we compare our highest test result (0.837) with those achieved during the n2c2/OHNLP 2019 Clinical STS task, where a total of 90 system predictions were submitted: the final aggregated results (Pearson correlation) presented a mean of 0.712, a median of 0.829, and a maximum of 0.901. We consider that our model attained a positive performance given its simplicity, being slightly above the median score, but also that there is still a large margin for progress given the maximum result achieved.

Error Analysis
A detailed error analysis was performed to better understand model behaviour, i.e. which similarity levels the model predicted more correctly, and to perceive whether this could be another reason why test results are higher than those on training (due to the dataset imbalance). As such, we used the same similarity intervals as expressed in the dataset statistics (Table 1). To perform this evaluation, we started by computing the number of true positives, false positives, and false negatives to calculate precision, recall, and F1-score. A prediction was considered a true positive if the ground truth was in the same similarity interval; otherwise, the prediction was counted as a false positive and a false negative in the corresponding intervals. To demonstrate this procedure, assume the model prediction is 3.7 whereas the ground truth is 4.1: in this case, a false positive and a false negative would be added in the similarity intervals ]3, 4] and ]4, 5], respectively. Table 4 presents this detailed error analysis on training and test data, for a model with the following configuration: sentence embeddings, base text preprocessing pipeline, and 10-fold evaluation. Overall, it is noticeable that the system had more difficulty in correctly predicting sentence pairs with scores in the interval ]1, 3].
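A sketch of the binning procedure described above; the handling of interval boundaries is our interpretation (a score exactly on an edge, e.g. 1.0, falls into the lower bin):

```python
import numpy as np
from collections import Counter

def interval(score):
    """Map a score in [0, 5] to bins [0,1], ]1,2], ]2,3], ]3,4], ]4,5]."""
    return min(max(int(np.ceil(score)) - 1, 0), 4)

def per_interval_scores(predictions, ground_truth):
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, true in zip(predictions, ground_truth):
        p_bin, t_bin = interval(pred), interval(true)
        if p_bin == t_bin:
            tp[t_bin] += 1
        else:
            fp[p_bin] += 1  # counted against the predicted interval
            fn[t_bin] += 1  # counted against the true interval
    for b in range(5):
        precision = tp[b] / (tp[b] + fp[b]) if tp[b] + fp[b] else 0.0
        recall = tp[b] / (tp[b] + fn[b]) if tp[b] + fn[b] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        print(f"bin {b}: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```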
It is interesting to note that the highest precision (around 0.8) was observed in the interval [0, 1], showing the model's ability to correctly detect completely dissimilar sentences. Since the majority of samples (around 58%) in the test data fall in this interval, this also supports our assumption that test results were higher because of the score imbalance. Additionally, one can observe that the F1-scores were higher in the extreme similarity intervals, corroborating the assumption from Wang et al. that machines can succeed at distinguishing completely similar or dissimilar sentence pairs but, like humans, struggle with less clear relations of semantic similarity [31].
Finally, despite the fact that F1-scores on test data are somewhat lower than those on training, we emphasize that this type of error analysis does not directly elucidate model performance, since (i) the test set is highly imbalanced and (ii) near-correct scores can be misinterpreted as complete failures when they fall in a different interval by a slight margin (e.g. the model predicts a score of 0.9 whereas the ground truth is 1.1).

CONCLUSIONS AND FUTURE WORK
The increasing involvement of technology in the medical field has resulted in an improvement in the quality of healthcare. However, the wide use of digital processes in medicine also has its shortcomings, namely in the clinical documentation process, where the increased ease of replicating information results in more redundant free text and in an increased chance of introducing erroneous information. This is a serious concern as it can potentially lead to clinical error. A possible solution to reduce redundancy in free text is the use of STS methods to detect text excerpts with equivalent semantic content.
This paper presents the approach used in the 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity. Our system used a simple neural network to score the degree of similarity between snippets of clinical text, and explored the impact of using different preprocessing methods, as well as different feature representation methods (word embeddings versus sentence embeddings).
Our study demonstrated that sentence embeddings provided a better text representation than word embeddings, better capturing sentence semantics. However, word embeddings were not far behind in terms of system performance (Pearson correlation of 0.823 vs 0.837). It was also noticeable that word embeddings benefited from a more thorough text preprocessing pipeline, whereas sentence embeddings obtained better test results with a basic preprocessing approach.
As future work we intend to explore the fusion of features from word embeddings and sentence embeddings in the same model. Moreover, other architectures could be explored such as convolutional neural networks or bidirectional long short-term memory networks with attention mechanisms, along with other types of embedding representations, such as contextual embeddings. We also aim to investigate data augmentation or distant supervision techniques.