Automated ICD-9-CM medical coding of diabetic patients' clinical reports

The assignment of ICD-9-CM codes to patients' clinical reports is a costly and wearing process performed manually by medical personnel, estimated to cost about $25 billion per year in the United States. Developing a system that automates this process has been an ambition of researchers, but it remains an unsolved problem due to the inherent difficulties of processing unstructured clinical text. Here, the problem is formulated as a multi-label supervised learning task in which the independent variable is the report's text and the dependent variables are the ICD-9-CM labels assigned to it. Different variations of two neural network based models, the Bag-of-Tricks and the Convolutional Neural Network (CNN), are investigated. The models are trained on the diabetic patient subset of the freely available MIMIC-III dataset. The results show that a CNN with three parallel convolutional layers achieves F1 scores of 44.51% for five-digit codes and 51.73% for three-digit, rolled up codes. Although fully automated coding is not achievable, these results suggest that automated classification could be used to aid clinical staff by selecting the most probable codes.


INTRODUCTION
Most patient data resides in Electronic Health Records (EHR), written by health professionals. These records capture the state of a patient across time and are invaluable in diagnosis. Besides this, they play a significant part in the reimbursement of health institutions by insurance companies. Medical codes, like ICD-9-CM, assigned to patients' reports after treatment serve as a justification for carrying out said treatment. Failure to assign these codes correctly is a liability for healthcare institutions: a missing code represents a loss of revenue and an extra one can constitute fraud. Assigning and correcting these codes is done manually by specialized medical personnel, called medical coders, and is estimated to cost about $25 billion per year in the United States [3].
In this paper, a system to automate the assignment of ICD-9-CM codes to clinical reports is described. This is achieved by developing a statistical classification model which approximates the function relating a report's textual contents to its assigned codes. The classifiers used are based on artificial neural networks, which have been shown to be capable of handling large volumes of relatively messy data [12].
The classifier is developed on the diabetic patient subset of the MIMIC-III dataset. Constraining the system to this subset, with its diseases, terminology and patterns, makes it less versatile, but the reduced variability sets it up to produce more accurate results.

BACKGROUND

2.1 Clinical Text
Text mining clinical text presents some unique challenges when compared to other types of usually mined text such as biomedical literature. Meystre et al. sum up these challenges [9].
First, a single patient's record has several types of report with no single neatly defined structure. These can be the patient's description, the patient's and family's medical history, remarks made by the physician during an examination, etc. The length of the reports also varies greatly, e.g. writing down a couple of vital signs during a routine check-up vs. the patient's full medical history during the first consultation. Related to length is the structure of the reports. EHR software is usually form-based ("symptoms", "suggested causes", "suggested treatment", ...), which could result in thematically well-defined short sentences for each of the form's fields. However, the clinician might opt to ignore the form's structure and use the free-form data entry field, if it is available, or, in the worst case, misuse one of the other form entries, if it is not, to enter the report's data.
The clinician also adapts writing style and language depending on whether the text is going to be read by other practitioners. A report written by a radiologist to accompany an x-ray is composed for clear communication. In contrast, notes written by a physician during a consultation are not. These can be ungrammatical, ignoring capitalization and punctuation rules, and composed of short, telegraphic phrases where shorthand abounds: homonyms, acronyms, abbreviations and local dialectal shorthand phrases. Furthermore, the shorthand can be overloaded, i.e. the same shorthand unit takes a different meaning depending on the surrounding context. According to the 2015AA version of UMLS, the abbreviation "RA" can stand for 25 meanings as distant as "rheumatoid arthritis", "radioactive" and "ragweed antigen" [11]. Liu et al. show that, 33% of the time, acronyms in UMLS are highly ambiguous even in context [8]. Lexical, morphological and spelling errors also proliferate in clinical text. Ruch and Gaudinat report that up to one in every five sentences in a corpus of discharge summaries, surgical reports and laboratory results has a spelling error [14]. It is not uncommon to find the acronyms themselves misspelled [9]. Text can also be interrupted by values copy-pasted from laboratory results, vital signs or drug dosages, which can confuse standard text mining algorithms like sentence segmentation.

Prior Work
Assigning medical codes to health records is usually formulated as a supervised, multi-label classification task where the features are the report's text and the outputs are the assigned medical codes. Therefore, most techniques generally used in classification tasks, such as the support vector machine (SVM), apply. Increasingly, neural networks are preferred to other machine learning methods. These have proven able to generalize over large volumes of relatively messy data, like clinical text [12].
Due to the mentioned heterogeneity in clinical records, researchers usually choose to focus on a restricted subset of data with shared characteristics. Baumel

DATASET
The Medical Information Mart for Intensive Care III (MIMIC-III) is a large, freely available database of anonymized health-related data associated with patients of the intensive care unit (ICU) of the Beth Israel Deaconess Medical Center between 2001 and 2012 [5].
MIMIC-III is composed of several tables which host several types of data, from admissions and transfers to patient information and test results. For the purposes of this thesis only two are used: the notes table (NOTEEVENTS), which contains reports input by clinicians and nurses into the EHR at each patient admission, and the diagnosis table (DIAGNOSIS_ICD), which contains ICD-9-CM codes, associated with each admission, generated for billing purposes at the end of each patient's hospital stay. Furthermore, diagnosis entries, and their associated notes, are only kept for patients who have been diagnosed with diabetes. This filtering adds up to a total of 406 190 reports, which will be further reduced during preprocessing.
Each report in the notes table is assigned to one of fifteen categories, which are heterogeneous in their structure and language, much as described in Section 2.1.
The disease labels associated with each report use the ICD-9-CM taxonomy. The International Statistical Classification of Diseases and Related Health Problems (ICD) is a disease classification system maintained by the World Health Organization. These codes are composed of three primary digits and either one or two optional digits which convey extra information. The Clinical Modification (CM) augments ICD-9 with additional morbidity detail and with codes for surgical, diagnostic and therapeutic procedures.
For the classification task, two different representations of the codes are considered: a regular one, with codes which range from three to five digits, and a rolled up version, where all codes are limited to their first three (primary) digits.
The MIMIC-III subset used has 4 006 different regular codes and 779 different rolled up codes. Each of the codes is used in labeling at least one report. Procedure codes are not used. The dataset has a very low label cardinality (defined as the average number of codes per report) of 17 for regular and 15 for rolled up labels. Label densities (defined as the number of codes per report divided by the number of unique codes, averaged over the full dataset) are 0.0042 and 0.0198, respectively.
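As an illustration of these definitions, the following minimal sketch rolls codes up and computes label cardinality and density on a toy set of reports. It assumes codes are stored as digit strings without a decimal point and that rolling up is plain truncation to three characters; the function names and the toy data are hypothetical and not taken from the original work.

```python
def roll_up(code):
    """Truncate a code to its first three (primary) digits (assumed rule)."""
    return code[:3]


def label_stats(report_labels):
    """Return (cardinality, density) for a list of per-report label sets.

    Cardinality is the average number of labels per report; density is the
    cardinality divided by the number of unique labels in the dataset.
    """
    unique = set().union(*report_labels)
    cardinality = sum(len(labels) for labels in report_labels) / len(report_labels)
    density = cardinality / len(unique)
    return cardinality, density


# Toy data: three reports with made-up label assignments.
reports = [{"25000", "4019"}, {"25000", "42731", "4280"}, {"25001"}]

regular = label_stats(reports)
rolled = label_stats([{roll_up(c) for c in labels} for labels in reports])
print("regular:", regular)
print("rolled up:", rolled)
```

In this toy example, rolling up leaves the cardinality unchanged but raises the density, because fewer unique codes remain.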

PREPROCESSING
Preprocessing starts with a simple four step process that transforms each report into a sequence of tokens:
(1) Punctuation characters, except for the apostrophe, are converted to whitespace.
(2) Digits are replaced by the character 'd'.
(3) All characters are converted to lowercase.
(4) The report is split at whitespace characters.
As previously noted, clinical text is rife with misspellings that can interfere with a classifier's correctness. Baumel et al. suggest a simple vocabulary reduction step as a partial remedy [2]. Tokens that occur fewer than five times in the whole corpus are eliminated. These eliminated tokens are then mapped to the most similar of the tokens which were kept. The most similar token is defined as the one which minimizes the Levenshtein distance.
The assumption for this step is that tokens which occur infrequently are misspellings of other tokens which occur more frequently. This does not always hold true but good results were observed, with the added advantage that the vocabulary was reduced from 162 784 to 53 229 unique tokens.
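The tokenization and vocabulary reduction steps could be sketched roughly as follows. This is an illustrative reading of the description, not the original implementation: the function names are hypothetical, ties between equally distant tokens are broken alphabetically (an assumption), and the brute-force search over all kept tokens would be slow on a vocabulary of the size reported above.

```python
import string
from collections import Counter

# Steps (1) and (2): punctuation (except the apostrophe) -> space, digits -> 'd'.
PUNCT = string.punctuation.replace("'", "")
TABLE = str.maketrans({**{c: " " for c in PUNCT}, **{d: "d" for d in string.digits}})


def tokenize(report):
    """Apply the four preprocessing steps and return a list of tokens."""
    return report.translate(TABLE).lower().split()


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def reduce_vocabulary(token_lists, min_count=5):
    """Map tokens seen fewer than min_count times to the closest kept token."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    kept = sorted(t for t, c in counts.items() if c >= min_count)
    if not kept:
        return token_lists
    mapping = {t: min(kept, key=lambda k: levenshtein(t, k))
               for t, c in counts.items() if c < min_count}
    return [[mapping.get(t, t) for t in tokens] for tokens in token_lists]


tokens = tokenize("BP 120/80, pt's HR=72.")
corpus = [["diabetes"] * 5 + ["insulin"] * 5 + ["diabtes"]]
reduced = reduce_vocabulary(corpus)
print(tokens)
print(reduced[0][-1])
```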
Most text classification tasks opt to use word embeddings to represent tokens. Word2vec's skip-gram neural network model was trained on the corpus and used to generate 300-dimensional word embeddings. The concrete implementation of word2vec used was the one from the gensim library. Pre-trained word embeddings were also experimented with but produced worse results, most likely because they do not contain enough domain-specific terminology.
Based on an analysis of the data, it was chosen to keep only reports which have more than 9 and fewer than 2200 tokens. By doing this, 98.38% of the reports are kept unaltered and can fit in working memory, while 6 567 are eliminated. The average length of the reports after this step is 309 tokens. Table 1 summarizes the preprocessed dataset.

METHODS
To better understand how this problem can be solved and determine the most advantageous form of classification, two different methods are experimented with:
• Multi-label classification of the reports. Here, the classifier must learn to predict all ICD-9-CM labels assigned to each report. Separate classifiers are trained to predict labels in their regular and rolled up forms. For both forms, two variations of these models are considered: one where the model uses just the tokenized reports and another where, besides the reports, it also takes as input the category of the report.
• Binary classification of reports for each of the most common rolled up labels (see Table 2). Here, each classifier is trained to predict a single label, in contrast with the previous task where a classifier predicts multiple labels. Both baseline and category augmented classifiers are also used.
Reports are encoded as a sequence of indexes which map to tokens in the known vocabulary; categories are encoded as a 1-hot vector where the category assigned to the report is positive; labels are encoded as a k-hot vector, where all labels assigned to the report are positive.
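A minimal sketch of these three encodings is shown below. The vocabulary, category names and label list are toy values chosen for illustration, and the helper names are hypothetical; reserving index 0 (for example, for padding) is an assumption rather than something stated in the text.

```python
def encode_report(tokens, vocab):
    """Map each token to its integer index in the known vocabulary."""
    return [vocab[t] for t in tokens]


def one_hot(category, categories):
    """1-hot vector: only the report's category is positive."""
    return [1 if c == category else 0 for c in categories]


def k_hot(labels, all_labels):
    """k-hot vector: every label assigned to the report is positive."""
    return [1 if label in labels else 0 for label in all_labels]


# Toy vocabulary; index 0 is left unused here (assumed reserved for padding).
vocab = {"pt": 1, "has": 2, "dm": 3}
categories = ["Discharge summary", "Nursing", "Radiology"]
all_labels = ["250", "401", "428"]

x_tokens = encode_report(["pt", "has", "dm"], vocab)
x_category = one_hot("Nursing", categories)
y = k_hot({"250", "401"}, all_labels)
print(x_tokens, x_category, y)
```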
The models are based on two different types of classifier: the Bag of Tricks and the Convolutional Neural Network.

Bag of Tricks
The first method is a simple linear model based on Armand et al.'s Bag of Tricks (BoT) [6]. BoT uses a very similar architecture to Mikolov et al.'s Continuous Bag-of-Words [10] but is not limited to words and can predict labels of arbitrary type.
The baseline version of this classifier is composed of an embedding layer that transforms the report's tokens into high-dimensional word embeddings. These embeddings are then averaged into a fixed-dimension vector before being fed into a dense (fully connected) layer. The size of the dense layer varies according to the task: 4 006 for regular multi-label classification, 779 for rolled up label classification and 1 for binary classification.
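The forward pass just described (embedding lookup, averaging, dense layer) can be sketched in plain Python as below. The embedding table, weights and dimensions are tiny made-up values, and the function names are hypothetical; the sketch only illustrates the data flow, not the trained model.

```python
def average_embeddings(token_ids, embeddings):
    """Look up each token's embedding and average them into one vector."""
    vectors = [embeddings[i] for i in token_ids]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def dense(x, weights, bias):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


# Toy 2-dimensional embedding table for a 3-token vocabulary.
embeddings = [[0.0, 0.0], [1.0, 3.0], [3.0, 1.0]]
weights = [[1.0, -1.0], [0.5, 0.5]]  # two output units (two labels)
bias = [0.0, 1.0]

averaged = average_embeddings([1, 2], embeddings)
logits = dense(averaged, weights, bias)
print(averaged, logits)
```

Because the embeddings are averaged, `average_embeddings([2, 1], embeddings)` gives the same vector as `average_embeddings([1, 2], embeddings)`: the order of the tokens has no effect on the output.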
The significant departure from Armand et al. is the use of the sigmoid activation function at the dense layer's outputs instead of the softmax function. Softmax squashes the layer's inputs into a single probability distribution over all labels, which would not be appropriate in a multi-label scenario as labels are not mutually exclusive, e.g. both "diabetes" and "heart arrhythmia" can be assigned to a single report. The sigmoid function, instead, squashes each neuron's input into a value in the range [0, 1] so that each neuron outputs the parameter of a Bernoulli distribution independent of all other neurons. To determine whether the neuron's label is assigned, a fixed threshold of 0.5 is used.
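The difference between the two activations can be seen on a small made-up example. In the sketch below, the logits are arbitrary values, and treating a probability exactly equal to 0.5 as positive is an assumption.

```python
import math


def sigmoid(x):
    """Squash a single logit into (0, 1), independently of other logits."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Squash all logits into one distribution that sums to 1."""
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, -3.0]  # arbitrary scores for three labels
sig = [sigmoid(z) for z in logits]
soft = softmax(logits)

# Fixed threshold of 0.5 applied to each output.
sigmoid_labels = [p >= 0.5 for p in sig]
softmax_labels = [p >= 0.5 for p in soft]
print(sigmoid_labels, softmax_labels)
```

With the sigmoid, the first two labels are both assigned. With softmax, the labels compete for probability mass, so at most one of them can exceed 0.5.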
The classifier is trained by minimizing the binary cross-entropy loss function, as is usually the case in multi-label and binary classification scenarios.
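For reference, binary cross-entropy over a k-hot target can be sketched as follows. Averaging over the labels and clipping probabilities with a small `eps` are common conventions assumed here, not details taken from the text.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


target = [1, 0, 1]
good = binary_cross_entropy(target, [0.9, 0.1, 0.8])
bad = binary_cross_entropy(target, [0.2, 0.7, 0.4])
print(good, bad)
```

Each label contributes its own term, so the loss treats the output neurons as independent binary predictions.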
The Adam optimizer [7] is used and examples are fed into the classifier in batches of size 32. The classifier uses no regularization other than early stopping, with a minimum delta of 0.0001 and a patience of 2 epochs. Choosing the Adam optimizer, instead of the more traditional Stochastic Gradient Descent, enables the adaptive tuning of per-parameter learning rates during training [4].
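The early stopping rule can be sketched as below. It assumes that the monitored quantity is the validation loss and that an epoch only counts as an improvement when the loss drops by more than the minimum delta; both are assumptions about the setup, and the function name is hypothetical.

```python
def early_stopping_epoch(val_losses, min_delta=1e-4, patience=2):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if best - loss > min_delta:
            best = loss  # real improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1    # no meaningful improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran all epochs


# Made-up validation losses: improvement stalls after the second epoch.
stopped_at = early_stopping_epoch([0.50, 0.40, 0.39995, 0.39999, 0.30])
print(stopped_at)
```

In this example training stops before the final, lower loss is ever reached, which is the usual trade-off of a short patience.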
The main advantage of this method, when compared to the CNN, is that, due to its simplicity, it trains very quickly. However, its disadvantage is significant: linear classifiers do not share parameters among features and labels. This is particularly consequential for textual data, as it means word order is ignored and multi-word expressions are not captured.

Convolutional Neural Network
The Convolutional Neural Network (CNN), by using parameter sharing, is able to address BoT's main problem. This comes at the cost of significantly slower training.
This classifier's baseline architecture is very similar to BoT's, as can be seen in Figure 1a. Instead of being averaged, the word embeddings are input into a typical three-stage CNN computation. First, a one-dimensional convolution layer with 250 filters and kernel size 3 outputs a set of linear activations. Next, each linear activation is run through a (non-linear) rectified linear unit. A final max pooling layer helps to make the obtained representation invariant to small translations in the input [4].
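The three stages can be sketched in plain Python on a toy input. The sketch assumes that the pooling is a global max over all positions of the sequence, which the text does not state explicitly; the kernels, biases and one-dimensional embeddings are made up for illustration.

```python
def conv1d_relu_maxpool(embedded, kernels, biases):
    """Apply 1-D convolution, ReLU and global max pooling.

    embedded: list of T embedding vectors, each of dimension D.
    kernels:  list of F kernels, each a list of k weight vectors of dimension D.
    Returns one pooled feature per kernel.
    """
    features = []
    for kernel, bias in zip(kernels, biases):
        k = len(kernel)
        activations = []
        for start in range(len(embedded) - k + 1):
            window = embedded[start:start + k]
            linear = bias + sum(w * x
                                for k_row, x_row in zip(kernel, window)
                                for w, x in zip(k_row, x_row))
            activations.append(max(0.0, linear))  # rectified linear unit
        features.append(max(activations))         # global max pooling
    return features


# Toy sequence of four 1-dimensional embeddings and two kernels of size 2.
embedded = [[1.0], [2.0], [-1.0], [3.0]]
kernels = [[[1.0], [1.0]], [[-1.0], [0.0]]]
biases = [0.0, 0.0]

pooled = conv1d_relu_maxpool(embedded, kernels, biases)
shifted = conv1d_relu_maxpool([[0.0]] + embedded, kernels, biases)
print(pooled, shifted)
```

Shifting the toy sequence by one position leaves the pooled features unchanged, which illustrates the translation invariance mentioned above.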
Another architecture with three parallel convolutional layers, as shown in Figure 1b, is also experimented with. Its kernel sizes, from left to right, are 2, 3 and 4. The "concat" operation merely concatenates its input vectors. All other hyperparameters are shared with the BoT classifier.
Figure 2 shows the category augmented architecture, which is used with both BoT and CNN models. An additional input is added for the category 1-hot vectors, which is then concatenated with the classifier-specific layer's output (see the yellow color coded layers in Figures 1a and 1b). Four variations of the category augmented architecture were tested: the shallowest uses no additional dense layers and the deepest uses three additional ones. All the added dense layers, colored in blue, have output size 64 and a rectified linear unit activation.

Multi-Label Classification Task
The displayed results were obtained using 5-fold cross validation. In each fold, 10% of the training subsamples were used for validation. Table 3 shows the results for the multi-labeling sub-task for both regular and rolled up ICD-9-CM codes. The CNN with three parallel convolutional layers achieved the best precision, recall and, consequently, F1 score in both code formats. Rolled up F1 scores are, on average, 12% higher than regular ones. This is to be expected as, by truncating codes to three digits, label sparsity is reduced and each code has more examples. All CNN based models are better than BoT based ones. The worst performing CNN, CNN Baseline w/ Categories 2 Dense, achieved a 7% higher F1 score on the regular codes than the best performing BoT, BoT Baseline w/ Categories 3 Dense. For rolled up codes, the difference is smaller, at a 3% improvement.
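For readers unfamiliar with multi-label metrics, the sketch below shows one common way of computing precision, recall and F1 over k-hot vectors. It uses micro-averaging, i.e. counts pooled over all reports and labels; this is an assumption, since the averaging scheme is not stated in the text, and the data is made up.

```python
def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over k-hot label vectors."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Two toy reports, three possible labels.
y_true = [[1, 1, 0], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
precision, recall, f1 = micro_prf(y_true, y_pred)
print(precision, recall, f1)
```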
Category augmented architectures managed to improve the baseline BoT model's F1 score by a maximum of 6%, mostly through increased recall. It can be observed that deeper models marginally improved results. This is consistent with the empirical observation that deeper models result in better generalization [4]. Interestingly, not only did category augmentation fail to improve the baseline CNN model, it decreased its F1 score by a maximum of 9%.
Another observation is that category augmented models with no additional dense layers produced no change when compared to the baseline ones. A possible explanation is that the model has the category information but is not able to learn how to use it.
Even though CNN based models performed the best, this came at the cost of a remarkable increase in training time when compared to BoT. The baseline CNN took roughly 5x longer to train than the BoT baseline. The CNN with triple convolution took more than twice as long to train as the baseline CNN.
As far as the author is aware, no other publication exists where the classification of the MIMIC-III diabetic subset (or any other subset of similar magnitude) is attempted. Other authors opt to limit themselves to a single category of report. Baumel et al. use 52 139 reports belonging to the Discharge Summary category with 6 527 regular and 1 047 rolled up unique labels [2]. Their CNN's F1 scores are similar to those obtained here, which are only 2.59% and 4.26% lower for regular and rolled up labels, respectively. Curiously, Baumel et al.'s BoT model (CBOW in the paper) performed much better, with 14.78% and 13.05% higher F1 scores.

Binary Classification Task
This task restricts classification to the ten most common labels (250, diabetes mellitus, is ignored). A binary classifier is trained for each label and all of them are then combined, using the binary relevance method, to make multi-label predictions [13]. This method simply consists of combining the ten distinct binary models into a single multi-label classifier. To establish a comparison, the already trained multi-label classifiers are used as is, except that all labels which are not among the most common are ignored. Table 4 shows the results for this task. Multi-label and binary BoT based models performed similarly, with the category augmented binary version achieving a 3% higher F1 score due to a 4% increase in recall. CNN based models were once more the most accurate. Binary CNN models performed better than multi-label ones, with F1 scores 5.73% higher for the baseline model and 6.94% higher for the category augmented version. Notably, Table 4 shows that the category augmented version performed marginally better than the baseline one, with a 0.45% increase in F1 score. What is surprising about this is that it contradicts what is observed in Table 3, where augmenting the CNN baseline with category information produces much worse results. This leads to the conclusion that training multi-label category augmented CNNs on less sparse labels (see Table 2) could be fruitful.
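The binary relevance method can be sketched as below: each label has its own binary classifier, and the multi-label prediction is the set of labels whose classifier fires. The keyword-based lambdas are hypothetical stand-ins for trained models, and the two labels are chosen purely for illustration.

```python
def binary_relevance_predict(report, classifiers, threshold=0.5):
    """Combine independent binary classifiers into one multi-label prediction.

    classifiers maps each label to a callable returning the probability that
    the label applies to the report.
    """
    return {label for label, clf in classifiers.items()
            if clf(report) >= threshold}


# Stand-in "models": simple keyword rules, NOT trained classifiers.
classifiers = {
    "401": lambda tokens: 0.9 if "hypertension" in tokens else 0.1,
    "428": lambda tokens: 0.8 if "chf" in tokens else 0.2,
}

one_label = binary_relevance_predict(["pt", "with", "hypertension"], classifiers)
two_labels = binary_relevance_predict(["chf", "and", "hypertension"], classifiers)
print(one_label, two_labels)
```

Since every classifier decides on its own, the method cannot exploit correlations between labels, but each model only has to solve a simpler, single-label problem.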

CONCLUSION
The problem of automatically assigning ICD-9-CM codes to patient reports was formulated as a multi-label supervised learning task to be run on the diabetic patient subset of the MIMIC-III dataset using artificial neural network based models. F1 scores of 44.51% and 51.73% were obtained for regular and rolled up labels, respectively. As far as the author knows, no other publications address a similar overarching problem but, when compared to other, more localized, experiments, the results appear to be good.
Furthermore, it was explored whether using a collection of binary models, instead of a single multi-label one, was a better alternative. This experimentation was conducted on the restricted subset of the ten most common labels. It was shown that the best binary classifiers produced an almost 7% improvement in F1 score over their multi-label equivalents.
The results obtained indicate that such a method could be used to aid clinical coders. Various research directions can also be taken to improve these results. The first is the development of new classifiers. Despite the triple convolution classifier being the one with the best results, a category augmented version of it was not trained due to very long training times. Doing so is certainly worthwhile. Given the improvements deeper architectures showed in the BoT classifier, even deeper architectures should be experimented with. Besides these, entirely new models, like an LSTM based neural network, should be tried. All classifiers use early stopping as the single regularization technique. Trying other methods, such as L1 regularization, or more sophisticated ones, like Dropout, could improve results at the cost of even longer training times.
The second is leveraging more features of the MIMIC-III dataset. While the reports are used, the model does not have any information regarding their temporal relation. Adding this information could enable it to develop associations between diseases. For example, if a report is classified with diabetes, the model could discern that subsequent reports have a higher probability of being classified with diseases which are complications of diabetes, like kidney and foot damage. It was also mentioned that the codes associated with each report have a sequence number assigned. Finding a way to factor it into the models and the evaluation method could be useful in better adapting them to the task at hand.