1. Introduction

Navigating the shift of clinical laboratory data from primary, everyday clinical use to secondary research purposes presents a significant challenge. Given the substantial time and expertise required to preprocess and clean these data, and the lack of all-in-one tools tailored to this need, we developed our algorithm lab2clean as an open-source R package. The lab2clean package automates and standardizes the intricate process of cleaning clinical laboratory results. It focuses on improving the data quality of laboratory result values, while assuming compliance with established standards such as LOINC and UCUM for test identifiers and units. Our goal is to equip researchers with a straightforward, plug-and-play tool that makes it easier to unlock the true potential of clinical laboratory data in clinical research and clinical machine learning (ML) model development. Version 1.0 of the algorithm is described in detail in Zayed et al. 2024 (https://doi.org/10.1186/s12911-024-02652-7).

The lab2clean package contains two key functions: clean_lab_result() and validate_lab_result(). The clean_lab_result() function cleans and standardizes laboratory result values, while the validate_lab_result() function checks the plausibility of these results. This vignette explains the theoretical background, usage, and customization of these functions.


2. Setup

Installing and loading the lab2clean package

You can install and load the lab2clean package directly in R.

#install.packages("lab2clean")

After installation, load the package:

library(lab2clean)

3. Function 1: Clean and Standardize results

The clean_lab_result() function has five arguments:

  • lab_data : A data frame containing laboratory data

  • raw_result : The column in lab_data that contains the raw result values to be cleaned

  • locale : A string specifying the locale of the laboratory data, used to interpret locale-dependent number formats. Defaults to "NO" (not specified)

  • report : If TRUE, a summary report is printed to the console. Defaults to TRUE

  • n_records : If you are loading a grouped list of distinct results, assign n_records to the column that contains the frequency of each distinct result. Defaults to NA

Let us demonstrate the clean_lab_result() function using the Function_1_dummy dataset and inspect its first six rows:

data("Function_1_dummy", package = "lab2clean")
head(Function_1_dummy,6)
raw_result    frequency
?                   108
*                   243
[                   140
_                   268
1.1 x 10^9          284
2.34 x 10E12         42

This demonstration dataset contains two columns: raw_result, which holds the raw laboratory result values, and frequency, which indicates how often each distinct result appeared. Let us explore the report and n_records arguments:

cleaned_results <- clean_lab_result(Function_1_dummy, raw_result = "raw_result", report = TRUE, n_records = "frequency")

#> Step 1: Handling records with extra variables stored with the result value, removing interpretative flags or units
#> ==========================================================================================
#> 8 distinct results (8.742% of the total result records) with interpretative flags (e.g. positive, negative, H, L) -> flags removed with cleaning comment added (flag).
#> 17 distinct results (20.043% of the total result records) with units (%, exponents, or other units) -> units removed with cleaning comment added (Percent, Exponent, or Units).
#> Step 2: classify and standardize different scale types - part 1
#> ==========================================================================================
#> 3 distinct results (5.373% of the total result records) of scale type ‘Ord.2’, which describes grades of positivity (e.g. 2+, 3+).
#> 7 distinct results (7.966% of the total result records) of scale type ‘Qn.2’, which describes inequality results (e.g. >120, <1).
#> 4 distinct results (6.233% of the total result records) of scale type ‘Qn.3’, which describes numeric range results (e.g. 2-4).
#> 4 distinct results (3.092% of the total result records) of scale type ‘Qn.4’, which describes titer results (e.g. 1/40).
#> 55 distinct results (61.335% of the total result records) of scale type ‘Qn.1’, which describes numeric results (e.g. 56, 5.6, 5600).
#> 4 distinct results (4.853% of the total result records) with numeric result values that cannot be determined without a predefined locale setting (US or DE) -> cleaning comment added (locale_check).
#> 4 distinct results (4.888% of the total result records) of scale type ‘Ord.1’, which describes positive or negative results (Neg, Pos, or Normal).
#> 1 distinct results (1.019% of the total result records) of scale type ‘Nom.1’, which describes blood groups (e.g. A+, AB).
#> Last Step: Classifying non-standard text records
#> ==========================================================================================
#> 0 distinct results (0% of the total result records) with multiple result values (e.g. positive X & negative Y) -> cleaning comment added (multiple_results).
#> 0 distinct results (0% of the total result records) with words about sample or specimen (e.g. sample not found) -> cleaning comment added (test_not_performed).
#> 8 distinct results (8.777% of the total result records) with meaningless inputs (e.g. = , .) -> cleaning comment added (No_result).
#> 1 distinct results (1.317% of the total result records) that could not be standardized or classified -> cleaning comment added (not_standardized).
#> ==========================================================================================
#> 78 distinct results (89.906% of the total result records) were cleaned, classified, and standardized.
#> ⏰ Time taken is 0.012 minutes.
#>

The report gives a detailed, step-by-step account of how the cleaning process is performed and offers some descriptive insights into it. The n_records argument adds record-level percentages to each of the aforementioned steps to enrich the reporting. For simplicity, we will use report = FALSE in the rest of this tutorial:

cleaned_results <- clean_lab_result(Function_1_dummy, raw_result = "raw_result", report = FALSE)

#> 78 result records were cleaned, classified, and standardized.
#> ⏰ Time taken is 0.01 minutes.
#>

cleaned_results
raw_result        frequency  clean_result  scale_type  cleaning_comments
?                 108        NA            NA          No_result
*                 243        NA            NA          No_result
[                 140        NA            NA          No_result
_                 268        NA            NA          No_result
1.1 x 10^9        284        1.1           Qn.1        Exponents
2.34 x 10E12      42         2.34          Qn.1        Exponents
2,34 X 10^12      173        2.34          Qn.1        Exponents
3.14159 * 10^30   271        3.142         Qn.1        Exponents
1.1x10+9          179        1.1           Qn.1        Exponents
2,34X10^12        153        2.34          Qn.1        Exponents
3.14159*10^30     288        3.142         Qn.1        Exponents
3.142*10^30       152        3.142         Qn.1        Exponents
1,1 x 10e9        213        1.1           Qn.1        Exponents
3                 185        3             Qn.1        NA
1.1 x 10^-9       58         1.1           Qn.1        Exponents
2.34 X 10-12      273        2.34          Qn.1        Exponents
3.14159E-30       96         3.142         Qn.1        Exponents
1x10^9            41         1             Qn.1        Exponents
1E9               119        1             Qn.1        Exponents
2+                288        2+            Ord.2       NA
+                 270        1+            Ord.2       NA
+++               217        3+            Ord.2       NA
0-1               203        0-1           Qn.3        NA
1-2               298        1-2           Qn.3        NA
1-                207        1             Qn.1        flag
01-02             221        1-2           Qn.3        NA
1 -2              177        1-2           Qn.3        NA
3 - 2             190        NA            NA          not_standardized
-                 108        Neg           Ord.1       NA
+ 230             70         230           Qn.1        flag
100*              290        100           Qn.1        NA
+56               274        56            Qn.1        flag
- 5               216        5             Qn.1        flag
80%               245        80            Qn.1        Percent
-5                37         -5            Qn.1        NA
> 12              159        >12           Qn.2        NA
<1050             235        <1050         Qn.2        NA
< 02              88         <2            Qn.2        NA
>= 20.3           116        >=20.3        Qn.2        NA
>1:40             93         >1:40         Qn.4        NA
1/80              69         1:80          Qn.4        NA
<1/20             142        <1:20         Qn.4        NA
< 1/020           142        <1:020        Qn.4        NA
=                 130        NA            NA          No_result
/                 71         NA            NA          No_result
0.2               67         0.2           Qn.1        NA
33 Normal         93         33            Qn.1        flag
negative 0.1      156        0.1           Qn.1        flag
H 256             102        256           Qn.1        flag
30%               262        30            Qn.1        Percent
23 %              42         23            Qn.1        Percent
1056              149        1056          Qn.1        NA
1056040           246        1056040       Qn.1        NA
3560              63         3560          Qn.1        NA
0,3               181        0.3           Qn.1        NA
15,6              86         15.6          Qn.1        NA
2.9               64         2.9           Qn.1        NA
02.9              233        2.9           Qn.1        NA
2.90              272        2.9           Qn.1        NA
250               131        250           Qn.1        NA
1.025             210        1.025         Qn.1        locale_check
1.025             56         1.025         Qn.1        locale_check
1025              134        1025          Qn.1        NA
1025              104        1025          Qn.1        NA
1025.7            250        1025.7        Qn.1        NA
1.025,7           151        1025.7        Qn.1        NA
1.025,36          249        1025.36       Qn.1        NA
1,025.36          249        1025.36       Qn.1        NA
>1.025,36         244        >1025.36      Qn.2        NA
<=1,025.36        149        <=1025.36     Qn.2        NA
1.015             234        1.015         Qn.1        locale_check
1,060             200        1,060         Qn.1        locale_check
2,5               222        2.5           Qn.1        NA
2.5               30         2.5           Qn.1        NA
>3,48             158        >3.48         Qn.2        NA
3.48              89         3.48          Qn.1        NA
93                133        93            Qn.1        NA
,825              195        0.825         Qn.1        NA
0,825             125        0.825         Qn.1        NA
1.256894          60         1.257         Qn.1        NA
.                 96         NA            NA          No_result
,                 210        NA            NA          No_result
Négatif 0.3       143        0.3           Qn.1        flag
Négatif           243        Neg           Ord.1       NA
Pøsitivo          58         Pos           Ord.1       NA
A+                147        A             Nom.1       NA
pos & negative Y  296        Neg           Ord.1       NA

This function creates three different columns:

1- clean_result: The cleaned version of the raw_result column. For example, “?” is converted to NA, “3.14159 * 10^30” to “3.142”, and “+++” to “3+”.

2- scale_type : Categorizes the cleaned results into specific types like Quantitative (Qn), Ordinal (Ord), or Nominal (Nom), with further subcategories for nuanced differences, such as differentiating simple numeric results (Qn.1) from inequalities (Qn.2), range results (Qn.3), or titer results (Qn.4) within the Quantitative scale.

3- cleaning_comments: Provides insights on how the results were cleaned.
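
A few lines of base R are enough to get an overview of the cleaned output. The sketch below assumes the cleaned_results object created above and the column names shown in the table; adapt it if your columns differ:

table(cleaned_results$scale_type, useNA = "ifany")         # distribution of assigned scale types
table(cleaned_results$cleaning_comments, useNA = "ifany")  # overview of the cleaning comments
cleaned_results[cleaned_results$cleaning_comments %in% c("No_result", "not_standardized"), ]  # results that could not be cleaned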

The process above gives a generic description of how the clean_lab_result() function operates. It is worth looking in more detail at exactly how some specific raw results are cleaned:

  • Locale variable:

In the clean_lab_result() function, we have an argument named locale. It addresses variations in number formats, with different decimal and thousand separators, that arise from locale-specific settings used internationally. We chose to standardize these varying locale-specific formats so that the cleaned results follow the US English convention. If the user does not specify the locale of the dataset, the default is “NO”, meaning not specified. For example, rows 71 and 72 carry a locale_check cleaning comment, with result values 1.015 and 1,060 respectively. This means that either the “US” or “DE” locale must be specified before these values can be interpreted unambiguously. If we specify the locale as US or DE, we obtain different values, as follows:

Function_1_dummy_subset <- Function_1_dummy[c(71,72),, drop = FALSE]
cleaned_results <- clean_lab_result(Function_1_dummy_subset, raw_result = "raw_result", report = FALSE, locale = "US")

#> 2 result records were cleaned, classified, and standardized.
#> ⏰ Time taken is 0.007 minutes.
#>

cleaned_results
   raw_result  frequency  clean_result  scale_type  cleaning_comments
71 1.015       234        1.015         Qn.1
72 1,060       200        1060          Qn.1

cleaned_results <- clean_lab_result(Function_1_dummy_subset, raw_result = "raw_result", report = FALSE, locale = "DE")

#> 2 result records were cleaned, classified, and standardized.
#> ⏰ Time taken is 0.008 minutes.
#>

cleaned_results
   raw_result  frequency  clean_result  scale_type  cleaning_comments
71 1.015       234        1015          Qn.1
72 1,060       200        1.06          Qn.1

  • Language in common words:

In the clean_lab_result() function, we support 19 distinct languages for frequently used terms such as “high”, “low”, “positive”, and “negative”. For example, the word Pøsitivo is matched against these common words and is cleaned to Pos.

Let us look at the common_words table used by the function:

data("common_words", package = "lab2clean")
common_words
Language    Positive      Negative      Not_detected       High     Low       Normal     Sample       Specimen
English     Positive      Negative      Not detected       High     Low       Normal     Sample       Specimen
Spanish     Positivo      Negativo      No detectado       Alto     Bajo      Normal     Muestra      Especimen
Portuguese  Positivo      Negativo      Nao detectado      Alto     Baixo     Normal     Amostra      Especime
French      Positif       Negatif       Non detecte        Eleve    Bas       Normal     Echantillon  Specimen
German      Positiv       Negativ       Nicht erkannt      Hoch     Niedrig   Normal     Probe        Probe
Italian     Positivo      Negativo      Non rilevato       Alto     Basso     Normale    Campione     Campione
Dutch       Positief      Negatief      Niet gedetecteerd  Hoog     Laag      Normaal    Staal        Monster
Polish      Dodatni       Ujemny        Nie wykryto        Wysoki   Niski     Normalny   Probka       Probka
Swedish     Positiv       Negativ       Inte upptackt      Hog      Lag       Normal     Prov         Prov
Danish      Positiv       Negativ       Ikke opdaget       Hoj      Lav       Normal     Prove        Prove
Norwegian   Positiv       Negativ       Ikke oppdaget      Hoy      Lav       Normal     Prove        Prove
Finnish     Positiivinen  Negatiivinen  Ei havaittu        Korkea   Matala    Normaali   Nayte        Nayte
Czech       Pozitivni     Negativni     Nezjisteno         Vysoky   Nizky     Normalni   Vzorek       Vzorek
Hungarian   Pozitiv       Negativ       Nem eszlelt        Magas    Alacsony  Normal     Mintavetel   Mintadarab
Croatian    Pozitivan     Negativan     Nije otkriveno     Visok    Nizak     Normalan   Uzorak       Uzorak
Slovak      Pozitivny     Negativny     Nezistene          Vysoky   Nizky     Normalny   Vzorka       Vzorka
Slovenian   Pozitiven     Negativen     Ni zaznano         Visok    Nizek     Normalno   Vzorec       Vzorec
Estonian    Positiivne    Negatiivne    Ei tuvastatud      Korge    Madal     Normaalne  Proov        Proov
Lithuanian  Teigiamas     Neigiamas     Neaptiktas         Aukstas  Zemas     Normalus   Pavyzdys     Pavyzdys

As seen in this table, there are 19 languages for 8 common words. If a word means positive or negative, the result is cleaned to Pos or Neg, unless the word accompanies a number; in that case the word is removed and a flag comment is added to cleaning_comments. For example, Négatif 0.3 is cleaned to 0.3 and 33 Normal is cleaned to 33. Finally, if the result contains one of the words for “Sample” or “Specimen”, a test_not_performed comment is added to indicate that the test was not performed.
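
To see which terms are recognized for a particular language, you can filter the common_words table directly. A minimal sketch, assuming the table as loaded above:

data("common_words", package = "lab2clean")
common_words[common_words$Language == "French", ]  # terms recognized for French results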

  • Flag creation:

In addition to the common words, a space between a minus sign and a numeric value also creates a flag. For example, the result - 5 is cleaned to 5 with a flag, whereas the result -5 is cleaned to -5 and no flag is created, because we can assume it is a negative value.
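
This behaviour is easy to verify on a toy input. The data frame below is made up purely for illustration; the expected cleaning follows the examples in the table above:

toy <- data.frame(raw_result = c("- 5", "-5", "+56", "33 Normal"))
clean_lab_result(toy, raw_result = "raw_result", report = FALSE)
# "- 5", "+56", and "33 Normal" should be cleaned to 5, 56, and 33 with a 'flag' comment,
# while "-5" should be kept as -5 with no comment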

4. Function 2: Validate results

The validate_lab_result() function has seven arguments:

  • lab_data : A data frame containing laboratory data

  • result_value : The column in lab_data with quantitative result values for validation

  • result_unit : The column in lab_data with result units in a UCUM-valid format

  • loinc_code : The column in lab_data indicating the LOINC code of the laboratory test

  • patient_id : The column in lab_data indicating the identifier of the tested patient.

  • lab_datetime : The column in lab_data with the date or datetime of the laboratory test.

  • report : If TRUE, a summary report is printed to the console. Defaults to TRUE.

Let us check how our package validates results using the validate_lab_result() function. Consider the Function_2_dummy dataset, which contains 86,863 rows, and inspect its first six rows:

data("Function_2_dummy", package = "lab2clean")
head(Function_2_dummy, 6)
patient_id  lab_datetime1  loinc_code  result_value  result_unit
10000003    2023-08-09     1975-2      19            umol/L
10000003    2023-08-09     1968-7      20            umol/L
10000003    2023-09-09     1975-2      19            mmol/L
10000003    2023-09-09     1968-7      20            umol/L
10000003    2023-09-09     1968-7      20            umol/L
10000011    2023-10-09     1975-2      19            umol/L

Let us apply the validate_lab_result() and see its functionality:

validate_results <- validate_lab_result(Function_2_dummy, 
                                        result_value="result_value",
                                        result_unit="result_unit",
                                        loinc_code="loinc_code",
                                        patient_id = "patient_id" , 
                                        lab_datetime="lab_datetime1")

#> Preprocessing Step for Duplicate Records
#> ===============================================================================================
#> 166 duplicate records were flagged.
#> These are multiple records of the same test for the same patient at the same result timestamp.
#> Check 1: Reportable Limits Check
#> ===============================================================================================
#> 5 extremely low result records were flagged (low_unreportable).
#> 2 extremely high records were flagged (high_unreportable).
#> Check 2: Logic Consistency Checks
#> ===============================================================================================
#> 7 result records were flagged for violating relational logic rules (logic_flag).
#> Check 3: Delta Change Limits Checks
#> ===============================================================================================
#> 55 records were flagged for having extreme change values from previous results within 7 days (delta_flag_7d).
#> 15 records were flagged for having extreme change values from previous results within 8-90 days (delta_flag_8_90d).
#> ===============================================================================================
#> 99.712% of the lab data records were validated with no flag detected.
#> ⏰ Time taken is 1.596 minutes.
#>

The validate_lab_result() function generates a flag column that records the outcome of the different checks:

head(validate_results, 6)
loinc_code  result_unit  patient_id  lab_datetime1  result_value  flag
13457-7     mg/dL        1e+07       2023-09-09     100.0         NA
13457-7     mg/dL        1e+07       2023-10-09     100.0         logic_flag
1751-7      g/dl         1e+07       2023-08-09     3.1           NA
1751-7      g/dl         1e+07       2023-09-09     7.5           logic_flag
1751-7      g/dl         1e+07       2023-10-09     7.5           NA
18262-6     mg/dL        1e+07       2023-11-09     100.0         NA

levels(factor(validate_results$flag))

#> [1] "delta_flag_7d"     "delta_flag_8_90d"  "duplicate"
#> [4] "high_unreportable" "logic_flag"        "low_unreportable"

We can now subset specific patients to explain the flags:

subset_patients <- validate_results[validate_results$patient_id %in% c("14236258", "10000003", "14499007"), ]
subset_patients
loinc_code  result_unit  patient_id  lab_datetime1        result_value  flag
13457-7     mg/dL        10000003    2023-09-09           100.0         NA
13457-7     mg/dL        10000003    2023-10-09           100.0         logic_flag
1751-7      g/dl         10000003    2023-08-09           3.1           NA
1751-7      g/dl         10000003    2023-09-09           7.5           logic_flag
1751-7      g/dl         10000003    2023-10-09           7.5           NA
18262-6     mg/dL        10000003    2023-11-09           100.0         NA
1968-7      umol/L       10000003    2023-08-09           20.0          logic_flag
1968-7      umol/L       10000003    2023-09-09           20.0          duplicate
1968-7      umol/L       10000003    2023-09-09           20.0          duplicate
1968-7      umol/L       10000003    2023-10-09           20.0          NA
1975-2      umol/L       10000003    2023-08-09           19.0          logic_flag
1975-2      mmol/L       10000003    2023-09-09           19.0          NA
2085-9      mg/dL        10000003    2023-09-09           130.0         NA
2085-9      mg/dL        10000003    2023-10-09           130.0         logic_flag
2085-9      mg/dL        10000003    2023-11-09           130.0         NA
2093-3      mg/dL        10000003    2023-08-09           230.0         NA
2093-3      mg/dL        10000003    2023-09-09           230.0         duplicate
2093-3      mg/dL        10000003    2023-09-09           215.0         duplicate
2093-3      mg/dL        10000003    2023-10-09           230.0         logic_flag
2093-3      ng/dL        10000003    2023-11-09           230.0         NA
2885-2      g/dl         10000003    2023-08-09           7.0           NA
2885-2      g/dl         10000003    2023-09-09           7.0           logic_flag
2885-2      mg/dl        10000003    2023-10-09           7.0           NA
2160-0      mg/dL        14236258    2180-11-23 22:30:00  13.2          NA
2160-0      mg/dL        14236258    2181-02-22 08:10:00  13.1          NA
2160-0      mg/dL        14236258    2181-03-07 11:00:00  9.4           NA
2160-0      mg/dL        14236258    2181-03-24 16:35:00  27.2          delta_flag_8_90d
2160-0      mg/dL        14236258    2181-03-25 06:25:00  16.8          delta_flag_7d
2160-0      mg/dL        14236258    2181-03-26 06:10:00  19.0          NA
2160-0      mg/dL        14236258    2181-04-02 10:00:00  9.7           delta_flag_7d
2160-0      mg/dL        14236258    2181-06-29 14:00:00  16.9          delta_flag_8_90d
2160-0      mg/dL        14236258    2181-06-30 05:32:00  10.8          delta_flag_7d
2160-0      mg/dL        14236258    2181-07-10 22:44:00  10.0          NA
2160-0      mg/dL        14236258    2181-07-10 23:25:00  10.3          NA
2160-0      mg/dL        14236258    2181-07-11 10:00:00  11.6          NA
2160-0      mg/dL        14236258    2181-07-12 02:30:00  13.6          NA
2160-0      mg/dL        14236258    2181-10-17 17:10:00  10.6          NA
2160-0      mg/dL        14236258    2181-10-18 06:40:00  12.6          NA
2160-0      mg/dL        14236258    2181-11-30 07:00:00  19.7          delta_flag_8_90d
2160-0      mg/dL        14236258    2181-12-17 06:44:00  12.1          delta_flag_8_90d
2160-0      mg/dL        14499007    2180-06-02 07:10:00  1.0           NA
2160-0      mg/dL        14499007    2180-10-26 15:00:00  0.8           NA
2160-0      mg/dL        14499007    2180-10-27 05:53:00  1.0           NA
2160-0      mg/dL        14499007    2180-10-27 15:15:00  0.0           low_unreportable
2160-0      mg/dL        14499007    2180-10-28 06:35:00  0.9           NA
2160-0      mg/dL        14499007    2180-10-29 05:52:00  1.0           NA
2160-0      mg/dL        14499007    2180-10-30 12:26:00  0.9           NA
2160-0      mg/dL        14499007    2180-10-31 03:11:00  0.8           NA
2160-0      mg/dL        14499007    2180-11-01 06:20:00  1.0           NA
2160-0      mg/dL        14499007    2180-11-02 04:22:00  1.0           NA

  • Patient 14236258 has both delta_flag_8_90d and delta_flag_7d flags, which are calculated using lower and upper percentile limits of 0.0005 and 0.9995, respectively. While the delta check is effective in identifying potentially erroneous result values, we acknowledge that it may also flag clinically relevant changes. Therefore, it is crucial that users interpret flagged results in conjunction with the patient’s clinical context. The delta-flagged records can be pulled out for review as shown below.
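
To review the delta-flagged records in context, they can be subset directly from the validated output. A minimal sketch using the validate_results object created above:

delta_flagged <- validate_results[validate_results$flag %in% c("delta_flag_7d", "delta_flag_8_90d"), ]
head(delta_flagged)  # inspect the flagged records together with their timestamps and values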

Let us also look at the two reference tables used by the validation function, beginning with the reportable interval table.

data("reportable_interval", package = "lab2clean")
reportable_interval_subset <- reportable_interval[reportable_interval$interval_loinc_code == "2160-0", ]
reportable_interval_subset
interval_loinc_code  UCUM_unit  low_reportable_limit  high_reportable_limit
2160-0               mg/dL      1e-04                 120

  • Patient 14499007 has a low_unreportable flag. As we can see, for LOINC code “2160-0” this patient’s result was 0.0, which lies outside the reportable range (0.0001, 120). Similarly, patient 17726236 has a high_unreportable flag. All records outside the reportable limits can be listed as shown below.
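
A minimal sketch for listing every record outside the reportable limits, again using the validate_results object created above:

validate_results[validate_results$flag %in% c("low_unreportable", "high_unreportable"), ]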

Logic rules ensure that related test results are consistent:

data("logic_rules", package = "lab2clean")
logic_rules_subset <- logic_rules[logic_rules$rule_id == 3, ]
logic_rules_subset
rule_id  rule_index  rule_part  rule_part_type
3        1           2093-3     loinc_code
3        2           >(         operator
3        3           2085-9     loinc_code
3        4           +          operator
3        5           13457-7    loinc_code
3        6           )          operator

  • Patient 10000003 has both logic_flag and duplicate flags. The duplicate flag means that this patient has duplicate rows (multiple records of the same test at the same timestamp). The logic_flag should be interpreted as follows: for total cholesterol (LOINC code “2093-3”), rule 3 above requires that “2093-3” > “2085-9” + “13457-7”, i.e. total cholesterol > HDL cholesterol + LDL cholesterol. For patient 10000003, LDL (“13457-7”) equals 100.0, HDL (“2085-9”) equals 130.0, and total cholesterol (“2093-3”) equals 230.0. The rule therefore reads 230 > 130 + 100, i.e. 230 > 230, which is false, so a logic_flag is raised.
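
The arithmetic behind this particular flag can be checked by hand; a one-line sanity check in R:

230 > 130 + 100  # total cholesterol > HDL + LDL evaluates to FALSE, hence the logic_flag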

5. Customization

We fully acknowledge the importance of customization to accommodate diverse user needs and tailor the functions to specific datasets. To this end, the data in logic_rules, reportable_interval, and common_words are not hard-coded within the function scripts but are instead provided as separate data files in the “data” folder of the package. This approach allows users to benefit from the default data we have included, which reflects our best knowledge, while also providing the flexibility to append or modify the data as needed.

For example, users can easily customize the common_words data file by adding phrases that are used across different languages and laboratory settings, allowing the clean_lab_result() function to better accommodate the specific linguistic and contextual nuances of their datasets. Similarly, users can adjust the logic_rules and reportable_interval data files for the validate_lab_result() function to reflect the unique requirements or standards of their research or clinical environment, as illustrated below.
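
As an illustration, the sketch below appends a hypothetical additional language to the common_words table. The new row and its terms are made up for demonstration only, and how you persist the modified table (for example, by editing the data file in the package source and reinstalling) depends on your own setup:

data("common_words", package = "lab2clean")
# Hypothetical new row -- adjust the terms to your own language or laboratory setting
new_language <- data.frame(Language = "Romanian", Positive = "Pozitiv", Negative = "Negativ",
                           Not_detected = "Nedetectat", High = "Ridicat", Low = "Scazut",
                           Normal = "Normal", Sample = "Proba", Specimen = "Specimen")
common_words <- rbind(common_words, new_language)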

By providing these customizable data files, we aim to ensure that the lab2clean package is not only powerful but also adaptable to the varied needs of the research and clinical communities.