CLINICAL DATA CLEANING AND VALIDATION STEPS
Clinical data are among the most important assets of a pharmaceutical company. They serve as the basis for the analysis, submission, approval, labeling, and marketing of a compound. Well-organized, easily accessible, and properly cleaned data are essential to a drug's value. The Data Warehousing Institute (TDWI) has estimated the cost of bad or ‘dirty’ data at around $600 billion annually.
No matter how well a study is designed and implemented, it has to deal with errors from various sources, and these errors influence the study results. Although study designs are often new or unique and require different handling depending on the type of trial, their data manipulation methods share a similar set of tasks performed at specific stages. Every stage involves some data cleaning and validation to ensure data consistency and accuracy.
DATA CLEANING AS A PROCESS
Data cleaning deals with problems in the data once they have occurred. Error prevention can reduce many problems but cannot eliminate them. Data cleaning is a three-step process involving repeated cycles of screening, diagnosing, and treating suspect data. Many data errors are identified incidentally during study activities other than data cleaning. However, it is more effective to detect errors by actively searching for them in a planned way. It is not always immediately clear whether a data point is erroneous; often, careful examination is required. Similarly, missing values require additional checks. Missing values may be due to interruptions in the data flow or to the target information being unavailable. Hence, predefined rules for handling errors, true missing values, and true extreme values are part of good practice.
SCREENING PHASE
When screening data, it is useful to distinguish a few types of oddities: lack or excess of data, inconsistencies, strange patterns in distributions, unexpected analysis results, and other types of inferences and abstractions. Screening methods need not be only statistical. Oddities can also be detected through perceived nonconformity with prior expectations, based on the investigator's experience, pilot studies, evidence in the literature, or common sense. Detection can occur even during article review or after publication.
SCREENING METHODS
Checking of questionnaires using fixed algorithms.
Validated data entry and double data entry.
Browsing of data tables after sorting.
Printouts of variables not passing range checks and of records not passing consistency checks.
Graphical exploration of distributions: box plots, histograms, and scatter plots.
Plots of repeated measurements on the same individual, e.g., growth curves.
Summary statistics.
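As a rough illustration of the range and consistency checks listed above, the following Python sketch flags records that fail either kind of check. The field names, the limits in `RANGE_CHECKS`, and the sample records are hypothetical; real limits would come from the study protocol.

```python
from datetime import date

# Hypothetical soft screening ranges; real limits come from the study protocol.
RANGE_CHECKS = {
    "age_years": (18, 90),
    "systolic_bp": (80, 200),
}


def screen_record(record):
    """Return a list of human-readable flags for one record."""
    flags = []
    for field_name, (low, high) in RANGE_CHECKS.items():
        value = record.get(field_name)
        if value is None:
            flags.append(f"{field_name}: missing")
        elif not low <= value <= high:
            flags.append(f"{field_name}: {value} outside {low}-{high}")
    # Consistency check: a visit cannot precede enrolment.
    enrolled, visit = record.get("enrolled_on"), record.get("visit_on")
    if enrolled and visit and visit < enrolled:
        flags.append("visit_on: earlier than enrolled_on")
    return flags


# Made-up sample records for illustration only.
records = [
    {"id": "S001", "age_years": 42, "systolic_bp": 128,
     "enrolled_on": date(2023, 1, 10), "visit_on": date(2023, 2, 1)},
    {"id": "S002", "age_years": 17, "systolic_bp": None,
     "enrolled_on": date(2023, 3, 5), "visit_on": date(2023, 3, 1)},
]

# Keep only the records that raised at least one flag.
flagged = {}
for rec in records:
    rec_flags = screen_record(rec)
    if rec_flags:
        flagged[rec["id"]] = rec_flags

for subject_id, rec_flags in flagged.items():
    print(subject_id, rec_flags)
```

A flag produced here only marks a value as suspect. Whether it is an error is decided in the diagnosis phase.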
DIAGNOSIS PHASE
In this phase, the purpose is to clarify the true nature of the worrisome data points, patterns, and statistics. Possible diagnoses for each data point are as follows: erroneous, true extreme, true normal (i.e., the prior expectation was incorrect), or idiopathic (i.e., no explanation found, but still suspect). Some data points are logically or biologically impossible. Hence, one may predefine not only screening cut-offs as described (soft cut-offs) but also cut-offs for immediate diagnosis of error (hard cut-offs). Sometimes, suspected errors will fall in between the soft and hard cut-offs, and diagnosis will be less straightforward. In these cases, it is necessary to apply a combination of diagnostic procedures.
One procedure is to go to previous stages of the data flow to see whether a value is consistently the same. This requires access to well-archived and documented data with justifications for any changes made at any stage. A second procedure is to look for information that could confirm the true extreme status of an outlying data point.
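The distinction between soft and hard cut-offs can be sketched in Python as follows. The function name `classify`, the labels it returns, and the temperature limits are assumptions made for illustration, not values from any guideline.

```python
def classify(value, soft, hard):
    """Classify a value against soft (screening) and hard (error) cut-offs.

    soft and hard are (low, high) tuples; the hard range is assumed to
    enclose the soft range.
    """
    soft_low, soft_high = soft
    hard_low, hard_high = hard
    if value < hard_low or value > hard_high:
        return "error"    # outside hard cut-offs: diagnosed as erroneous
    if value < soft_low or value > soft_high:
        return "suspect"  # between soft and hard cut-offs: needs diagnosis
    return "ok"           # inside soft cut-offs: not flagged


# Hypothetical cut-offs for adult body temperature in degrees Celsius.
SOFT = (35.5, 38.5)
HARD = (25.0, 45.0)

results = {t: classify(t, SOFT, HARD) for t in (36.8, 39.4, 52.0)}
print(results)
```

Values labeled "suspect" are the ones that call for the further diagnostic procedures described above, such as tracing the value back through earlier stages of the data flow.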
TREATMENT PHASE
The treatment phase starts once errors, missing values, and true (extreme or normal) values have been identified. The researcher must then decide what to do with the problematic observations. The options are limited to correcting, deleting, or leaving the value unchanged, and there are some general rules for choosing among them. Impossible values are never left unchanged: they should be corrected if a correct value can be found, and deleted otherwise. For biological continuous variables, some within-subject variation and small measurement variation are present in every measurement. If a remeasurement is done very soon after the initial one and the two values are close enough for the difference to be explained by these small variations alone, accuracy may be improved by taking the average of both as the final value.
Data cleaning often leads to deeper insight into the nature and severity of error-generating processes. The researcher should give methodological feedback to the operational staff to improve study validity and precision. It may be necessary to amend the study protocol with regard to design, timing, observer training, data collection, and quality control procedures. In extreme cases, it may be necessary to restart the study.
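The averaging rule for remeasurements can be written as a small Python sketch. The function name `final_value` and the `tolerance` parameter are hypothetical; what counts as "close enough" has to be defined per variable in advance.

```python
def final_value(initial, remeasured, tolerance):
    """Average two measurements if they differ by no more than tolerance.

    Returns None when the difference is too large to be explained by
    small within-subject and measurement variation alone, so the pair
    can be sent for further diagnosis instead of being averaged.
    """
    if abs(initial - remeasured) <= tolerance:
        return (initial + remeasured) / 2
    return None


# Hypothetical body weights in kilograms with a 0.5 kg tolerance.
close_pair = final_value(72.0, 72.5, tolerance=0.5)
distant_pair = final_value(72.0, 75.0, tolerance=0.5)
print(close_pair, distant_pair)
```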
WHAT IS CLINICAL DATA VALIDATION?
Data validation is a series of documented tests of the data to ensure the quality and integrity of the data. More specifically, validation is usually concerned with checking four of the eight characteristics of good clinical data – these characteristics are from the first guidance and the first other reference listed below. The eight characteristics are:
Attributable: The sources of the data are known and recorded.
Legible: The data are human-readable.
Contemporaneous: The source data are recorded when they are generated.
Original: All data come from the source. Copies and transformations of the data are accurate and complete, do not overwrite original data, and are traceable back to the original data.
Accurate: The data are correct.
Enduring: The data are available for the entire time they are required to be kept.
Complete: All available data are included.
Consistent: All of the data use consistent terms and are non-contradictory.
Data validation tests usually check the original, accurate, complete, and consistent aspects of the data.
WHY DOES CLINICAL DATA NEED VALIDATION?
From a business perspective, the data are how the FDA, other regulators, and business partners evaluate the worth of the product. From an ethical perspective, clinical data affect treatment decisions, which affect patient health, and the patient population in question is virtually all of the United States and a significant fraction of the rest of the world. For both of these reasons, clinical data quality and integrity are critical.
Despite this, few regulations talk about data validation directly. Instead, the regulations and guidance focus on requirements that the data handling systems must meet to ensure data quality and integrity. Regulations and guidance that do mention clinical data validation, or a part of the process, are listed below.
WHAT IS THE VALIDATION PROCESS?
The general outline for data validation is listed below. However, the validation process is complex and dependent on the data captured, business and regulatory concerns, the data management software used, and several other factors, so there are many possible variations and options.
Planning
The Sponsor decides what checks should be used, what code lists are appropriate, and what procedures will be used for invalid results.
The checks, code lists, and procedures are documented.
Implementation and Testing
The checks and code lists are implemented in the clinical database management system.
Test procedures and test data for the checks are created, usually as part of database validation.
The test procedures are performed.
Data Entry and Validation
The checks are run during data entry, either as the data are entered or at intervals.
Invalid results are fixed or allowed using the planned procedures.
The final set of checks is usually referred to as data cleaning.
Database Lock
When no more updates or changes to the data are expected, the database is locked.
Even after the database lock, analysts may run further checks to determine if any changes are necessary to produce the analysis datasets.
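The outline above can be sketched in Python as a toy model: checks and a code list are defined as data, run as each record is entered, and the database is locked once failures are resolved. Every name here (`Check`, `Database`, `SEX_CODES`, the check IDs) is hypothetical and not taken from any clinical database management system. Refusing to lock while checks still fail is also an assumption of this sketch; in practice, invalid results may instead be allowed under the planned procedures.

```python
from dataclasses import dataclass, field
from typing import Callable

SEX_CODES = {"M", "F", "U"}  # hypothetical code list


@dataclass
class Check:
    check_id: str
    description: str
    passes: Callable[[dict], bool]


# Planning: the checks are documented as data.
CHECKS = [
    Check("CHK001", "sex must be in the code list",
          lambda r: r.get("sex") in SEX_CODES),
    Check("CHK002", "weight_kg must be present",
          lambda r: r.get("weight_kg") is not None),
]


@dataclass
class Database:
    records: list = field(default_factory=list)
    locked: bool = False

    def enter(self, record):
        """Store a record and return the IDs of the checks it fails."""
        if self.locked:
            raise RuntimeError("database is locked")
        self.records.append(record)
        return [c.check_id for c in CHECKS if not c.passes(record)]

    def open_issues(self):
        """List (record id, check id) pairs for every failing check."""
        return [(r["id"], c.check_id)
                for r in self.records for c in CHECKS if not c.passes(r)]

    def lock(self):
        # Assumption of this sketch: locking requires zero open issues.
        if self.open_issues():
            raise RuntimeError("unresolved check failures")
        self.locked = True


db = Database()
failed = db.enter({"id": "S001", "sex": "X", "weight_kg": 70.2})
print(failed)                 # checks failed at data entry
db.records[0]["sex"] = "F"    # invalid result fixed per planned procedure
db.lock()
print(db.locked)
```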
REFERENCES
https://www.pharmaceuticalprocessingworld.com/clinical-data-cleaning-and-validation-steps/
Gautam Jephthah
B. Pharmacy
171/0922