Insights and lessons learned about operationalising the use of real world data.

Advances in technology, from the massive storage and computing options now available through cloud computing to advances in algorithmic processing, have empowered organisations to consider secondary real world data (RWD) sources as viable options for evidence generation across a wide spectrum of use cases.

Within the last few years, the exploration and practical use of secondary data has grown rapidly and is now fuelling a new wave of digital disruption. Insights once thought out of reach are now becoming achievable. However, operationalising this data for research carries challenges and risks that must be addressed carefully to achieve success.

From my own research as a senior technologist, here are ten lessons learned:

  1. Ensure an overarching data strategy is defined with insight from your clinical stakeholders. Consider forming a Data Center of Excellence (CoE) to provide governance around alignment and general use, policy definition, accountabilities and decision trees, so that the data is managed well. Start small and mature over time; refer to the Information Systems Audit and Control Association (ISACA) for classification, if required.
  2. Select data that provides differentiation. There are many different types of secondary data available today, most commonly claims and electronic health record (EHR) data. Be sure to select data that is “fit-for-purpose” for the intended research needs.
  3. Look closely at data quality. Review the schema in detail to confirm that the data fields match the intended research need.
  4. Perform preliminary analysis. To confirm the robustness of the data, check that the demographic “counts” match the requirements/protocol of the research. Also, if you are planning to utilise Natural Language Processing (NLP) for “free-text” analysis, for example extracting physician notes from EMR systems, ask your data provider for a “data density” report to confirm that the investment in NLP is warranted.
  5. Understand the power of linking datasets. Each data set has its strengths and weaknesses, and linked together they commonly present a more holistic longitudinal view. When planning data linking, a Venn diagram works well to show the logical relation of records between linked data sets.
  6. Consider the details when linking data. Consider the Observational Medical Outcomes Partnership (OMOP) standardisation: the more standardised the schema, the more flexibly it can serve many use cases within the enterprise. Ensure the data has not been tokenised beforehand, as linking may then not be possible. For proper linking, the linking fields need to be available from both sources; the appropriate hashing algorithm can then be used to perform the tokenisation across all sources of data. If you are considering connecting to external systems, such as EMR systems, agree on an appropriate standard, such as the Fast Healthcare Interoperability Resources (FHIR) standard, so that the data formats and elements are commonly understood for exchanging data.
  7. Understand the cost model. Cost is usually measured across three areas: how many therapeutic areas are contained in the dataset, the “refresh/updates” (assume more from prospective studies) and any links that need to be established to other data sources. Ensure that there is no hidden development cost for data extraction. Lastly, check that the targeted data is not already present within your organisation, as duplicative data spend within the enterprise has been rising.
  8. Maintain compliance with GDPR and applicable data protection. Most often the data in scope is anonymised and falls outside the scope of the GDPR. However, it can be difficult to fully anonymise data, and de-identified data (pseudonymised data) falls within the scope of the GDPR. The entities involved need to ensure they comply with relevant data protection obligations and that data controller and data processor roles are clearly assigned. A data controller determines the purposes and means of the processing of personal data; simply put, “who is in charge and leading the way”. A data processor is responsible for processing the data on behalf of the data controller; simply put, “following the instructions of the data controller”. It is important to ensure that the appropriate agreements, including data processing agreements, are in place between the relevant parties and that there is an appropriate mechanism for the transfer of personal data, if applicable.
  9. Plan for secure transmission. From a data extraction perspective, ensure appropriate secure transmission is in place to satisfy compliance. Most often, data providers prefer “cloud to cloud” secure transmission; however, Secure File Transfer Protocol (SFTP) can still be used, and most data providers now have their own secure portal from which to download data. As always, ensure appropriate safeguards are in place for encryption in transit and at rest.
  10. Ensure appropriate training is set up and conducted. From a clinical perspective, standard operating procedures (SOPs) work well to define how to interact with the data, that is, the operational standards for using it. These SOPs should be part of your overall governance process.
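
As a rough illustration of lessons 5 and 6, the sketch below tokenises records from two sources with the same keyed hash and then counts the regions of a two-set Venn diagram. It is only a minimal sketch under stated assumptions: the linking fields (`first_name`, `last_name`, `date_of_birth`), the helper names `tokenise` and `overlap_counts`, the choice of HMAC-SHA256 and the sample records are all hypothetical, not a description of any particular vendor's tokenisation method.

```python
import hashlib
import hmac

# Hypothetical linking fields; both sources must supply all of them.
LINK_FIELDS = ("first_name", "last_name", "date_of_birth")


def tokenise(record, secret_key):
    """Return a keyed SHA-256 token built from the normalised linking fields."""
    # Normalise so that formatting differences between sources do not break the link.
    normalised = "|".join(str(record[f]).strip().lower() for f in LINK_FIELDS)
    return hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def overlap_counts(source_a, source_b, secret_key):
    """Count distinct tokens in each region of a two-set Venn diagram."""
    tokens_a = {tokenise(r, secret_key) for r in source_a}
    tokens_b = {tokenise(r, secret_key) for r in source_b}
    return {
        "only_a": len(tokens_a - tokens_b),
        "both": len(tokens_a & tokens_b),
        "only_b": len(tokens_b - tokens_a),
    }


# Illustrative, made-up records (not real patient data).
claims = [
    {"first_name": "Ana", "last_name": "Silva", "date_of_birth": "1980-02-14"},
    {"first_name": "Ben", "last_name": "Okoro", "date_of_birth": "1975-09-30"},
]
ehr = [
    {"first_name": " ana ", "last_name": "SILVA", "date_of_birth": "1980-02-14"},
    {"first_name": "Cara", "last_name": "Lind", "date_of_birth": "1990-06-01"},
]

counts = overlap_counts(claims, ehr, secret_key=b"example-key-not-for-production")
print(counts)
```

The same key and the same normalisation must be applied to every source, which is why data that a provider has already tokenised with a different scheme may no longer be linkable.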

Learn more

To learn more visit or contact our Real World Evidence Strategy and Analytics team.


Bruce Capobianco

Senior Director, RWE IT

Bruce has over 15 years’ experience in the architecture, development and implementation of complex big data solutions. He leads a team to develop, enhance, and maintain RWE technology solutions for ICON clients. He is experienced in identifying and implementing secure, usable and enduring technologies that augment business processes and optimise productivity.
