Metadata Driven - Yet Another Cliche in Our Industry
- 1 ABSTRACT
- 2 INTRODUCTION
- 3 CURRENT STATE: THE WORLD IS FLAT
- 4 WHAT DOES METADATA DRIVEN MEAN?
- 5 FUTURE STATE: THE WORLD IS ROUND
- 6 CONCLUSION
- 7 CONTACT INFORMATION
- 8 RECOMMENDED READING
- 9 REFERENCES
Over the last five years the concept of metadata driven has been thrown about more often than a hot potato. In our experience, the reality is that people who use the phrase either can’t provide any substance for what it means, have very different concepts of what it represents, and/or don’t really know how to implement an approach that drives them forward. Vendors and industry experts tout how a robust metadata driven approach will dramatically change the industry by providing the ability to drive from protocol to submission, but they don’t seem to know how to build, or actually build, innovative solutions which support this concept.
This paper will first provide a bit of history and the current state which has led to the challenges of implementing a true metadata driven approach. Then we will present our concept of what metadata driven means and what ‘data about the data’ needs to be implemented to support this approach. Then we’ll describe how emerging metadata concepts and solutions from other industries might be used to realize a future state approach.
CURRENT STATE: THE WORLD IS FLAT
Many centuries ago, there was the myth that the world was flat, yet most scholars of that time did not actually believe the myth and realized there was overwhelming science that proved there was a spherical shape to the world. Over the last 30 years, we have described clinical data in flat two dimensional tables and have even built a whole set of industry standards around this flat concept. In addition to the ‘flat’ metadata our industry continues to collect, the siloed stages and people within the clinical trial process have led to disconnected data, metadata, and processes.
The reality is that clinical trial information is multi-dimensional and needs to be defined much more robustly to support the relationships within clinical information. In addition to the data, the industry needs to break down the silos across the functional areas and stop writing protocols in text documents, collecting data in electronic systems that were developed around the concept of paper, and creating two dimensional data sets.
LIVING IN A TWO DIMENSIONAL WORLD
Since the industry first started submitting clinical data to regulatory agencies, we have collected and provided data in two dimensional tables – the SAS V5 transport file. Twenty years ago this was a viable choice and a first generation option that allowed reviewers to use computers to organize and analyze the data. This was a significant leap forward in the review process providing reviewers the ability to more quickly make decisions on the safety and efficacy of the data.
However, as time progressed, tools and methodologies became more sophisticated within other industries, yet the clinical trial industry continued to collect, analyze, and submit data in two dimensions. Regulatory pressures and the associated red tape stifle innovation within our industry and do not allow us to be ‘disruptive’ while other industries accelerate forward at a rapid pace.
CDISC was initiated in 2000 with the goal of standardizing metadata across clinical data, and it has provided an excellent forum for addressing the critical need to standardize how data is collected and exchanged across our industry. However, CDISC was also severely limited by the tools and requirements it was working with as it began its development. As the old saying goes, “if the only tool you have is a hammer, then everything looks like a nail”. CDISC was limited to developing isolated domains, variable names limited to 8 characters, and content (the most important thing) with length limitations.
The SDTM AE domain below is something familiar to everyone within our industry and provides a ‘standard’ structure that is familiar to people who are either trained in the standards or have extensive experience implementing those same standards.
However, as you review the content above, what and where is the information that tells you how these values are connected, and what is their relationship to each other? From years of experience, we know that AEBODSYS is a concept that groups together values of AEDECOD, but this information exists only as implicit knowledge in our heads. When we process this data we have to know this relationship in advance and then write code to organize the data and conduct the analysis. Unfortunately, we have developed standards and data that do not provide this relationship as part of the metadata, and the metadata is therefore not very valuable. There are some rudimentary attempts at defining these relationships throughout the CDISC standards (e.g. SUPPQUAL, RELREC, ODM linkages), but those attempts are limited and still require substantial knowledge to understand. Even if we had a mechanism for defining this metadata, in the current world it is detached from the data, which means it is not tightly integrated as you move this information around the clinical trial ecosystem.
This lack of a robust definition of the relationship between two values severely limits the use of the data, and while CDISC provides standard structures, it does not provide the ability to really implement automation because of these gaps.
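To make this gap concrete, here is a minimal Python sketch. It assumes a handful of invented adverse event records that use only the SDTM variable names mentioned above (USUBJID, AEDECOD, AEBODSYS); the values and the summarize helper are hypothetical and are not taken from any real study or tool. The point is that the grouping relationship is supplied by the programmer, not by the data.

```python
from collections import defaultdict

# Hypothetical flat AE records: illustrative values only, not real trial data.
ae_rows = [
    {"USUBJID": "001", "AEDECOD": "Headache", "AEBODSYS": "Nervous system disorders"},
    {"USUBJID": "001", "AEDECOD": "Nausea", "AEBODSYS": "Gastrointestinal disorders"},
    {"USUBJID": "002", "AEDECOD": "Dizziness", "AEBODSYS": "Nervous system disorders"},
]


def summarize(rows, group_var, term_var):
    """Count distinct subjects for each (group, term) pair."""
    subjects = defaultdict(set)
    for row in rows:
        subjects[(row[group_var], row[term_var])].add(row["USUBJID"])
    return {key: len(ids) for key, ids in subjects.items()}


# The programmer, not the data, supplies the fact that AEBODSYS groups AEDECOD.
summary = summarize(ae_rows, group_var="AEBODSYS", term_var="AEDECOD")

# Nothing in the flat table prevents the reverse grouping, which is clinically meaningless.
reversed_summary = summarize(ae_rows, group_var="AEDECOD", term_var="AEBODSYS")
```

Both calls run without complaint, because the table itself carries no statement that one variable is the parent of the other.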
Above, we discussed the challenges and limitations of the standards and associated data that we collect in clinical trials. The other major issue we face as we try to build a metadata driven approach from beginning to end is the siloed nature of our industry. The picture below is not new to those who have had any experience within clinical trials, and while it says ‘in a nut shell’, the boxes outlined in the workflow are a symbol of the silos that each of the steps creates.
Each of these steps has its own isolated processes, people, and deliverables. The two most critical components of this process are the design of the clinical study (e.g. protocol) and the analysis at the end that provides the positive or negative results on whether the drug will work for patients. The irony is that those are the two components that are least addressed by standards.
We continue to write our protocols on virtual paper, the content of which provides no automation downstream. Therefore, a plethora of disconnected stakeholders end up transcribing the content into many systems to implement the clinical trial. We then proceed to take many steps to get from our protocol to our analysis handing off our information multiple times which is time consuming, prone to error, and needs specialized knowledge. These multiple steps, multiple users, and multiple disconnected systems leave us with a broken industry.
WHAT DOES METADATA DRIVEN MEAN?
If you were to search for the concept of metadata driven on the internet, you would most likely find references within software application development in the form of metadata-driven architecture or metadata-driven development. The idea is to move away from building software that is hardcoded to follow just a few paths, and instead to separate the logic from the metadata, giving end users the flexibility to configure the system as they need.
To help you visualize this, imagine an XML file which contains metadata describing everything from entities, their corresponding business logic and validation rules, users and how different user levels offer different possibilities or actions to perform, to a rough layout of your screens and a description of the user processes, wizards, steps to take, and any other metadata required to drive the workflow. At a high level, a metadata-driven design allows you to have a generic "logical" representation of your process that becomes a "physical instantiation" at the actual run time of the process.
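As a rough illustration of that separation, the Python sketch below keeps the validation rules in a small XML description and lets a generic engine apply them at run time. The element names, attribute names, and field names are all invented for this example and do not come from any standard or product; it is a sketch of the idea, not a reference design.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata: element and attribute names are invented for this sketch.
FORM_METADATA = """
<form name="vitals">
  <field name="SYSBP" type="int" min="60" max="250" required="true"/>
  <field name="POSITION" type="str" allowed="SITTING,STANDING,SUPINE" required="false"/>
</form>
"""

CASTS = {"int": int, "str": str}


def load_rules(xml_text):
    """Turn the logical description into a list of rule dictionaries."""
    root = ET.fromstring(xml_text)
    return [dict(field.attrib) for field in root.findall("field")]


def validate(record, rules):
    """Generic engine: it knows nothing about vitals, only how to apply rules."""
    errors = []
    for rule in rules:
        name = rule["name"]
        raw = record.get(name)
        if raw is None or raw == "":
            if rule.get("required") == "true":
                errors.append(f"{name}: missing required value")
            continue
        try:
            value = CASTS[rule["type"]](raw)
        except ValueError:
            errors.append(f"{name}: expected {rule['type']}")
            continue
        if "min" in rule and value < int(rule["min"]):
            errors.append(f"{name}: below minimum {rule['min']}")
        if "max" in rule and value > int(rule["max"]):
            errors.append(f"{name}: above maximum {rule['max']}")
        if "allowed" in rule and value not in rule["allowed"].split(","):
            errors.append(f"{name}: not an allowed value")
    return errors


rules = load_rules(FORM_METADATA)
problems = validate({"SYSBP": "300", "POSITION": "KNEELING"}, rules)
```

Changing a limit or adding a field means editing the metadata only; the validate function stays untouched, which is the "logical" versus "physical" split described above.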
While this might sound easy, building the flexibility to support all possible future workflows can be a challenging, if not impossible, task. If you look at something like SignUpGenius.com, an online tool for collecting sign ups from people, you will find this model layered throughout the solution. Users are walked through a metadata driven interface where they define specific information which, at the end, creates a nice interface for their stakeholders to enter their sign up information.
The main focus of the application above is to capture sign-ups for users based on date, time, and events; a fairly simple process. The processes within clinical trials are significantly more complicated, and trying to define metadata that will drive both content and process is extremely challenging. The current limitations in the standards and the silos described above make this even more difficult.
The industry, both within pharmaceutical companies and clinical system vendors, will make the claim that they are building metadata driven solutions; however, in our experience, examples of metadata really driving downstream processes are few and far between. It’s a bit of a conundrum, since you need robust integrated data standards and processes across functions to build metadata driven methods that can really drive automation.
What shall we do? Can we fix this?
FUTURE STATE: THE WORLD IS ROUND
In the first sections of this paper, we attempted to describe some of the challenges we face in the limitations of the standards, the current clinical trial workflow, and the technologies that support both. Within the next few sections, we'll discuss how other industries are leveraging much more innovative technologies by building robust metadata models that allow them to create a truly metadata driven approach. We'll also provide a glimpse into how we might be able to do this for our industry.
TECHNOLOGIES IN OTHER INDUSTRIES
The methods other industries use to capture, model, and use metadata to drive information are light years ahead of our industry. When you enter the name of one of our keynote speakers, Simon Weston, into Google you will receive this snapshot on the right side of your search results.
- Google and the Google logo are registered trademarks of Google Inc., used with permission.
Does anyone honestly believe the information that is pulled together in the above profile is stored in some two dimensional relational table sitting in a database? Google uses a highly semantic, multi-dimensional model built on relationships and maps to pull information together at warp speed. Technologies like Google’s BigTable (https://en.wikipedia.org/wiki/BigTable), relationship modeling, and indexing, while academic to some, are real models being used to optimize data and provide the richness you need to find and analyze the information.
This information can be pulled together by having underlying values and the relationships about those values. There is no longer the conversation about data, metadata, value level etc. but instead there are just values and relationships. In the simple picture below the values and their relationships are all integrated into one model.
In the example above, you are storing individual values floating in space and tying them together with relationships. The purpose of this flexibility is to help find answers to very complex questions, such as, "Which Presidents who lived in the White House had at least one child who did not live with him in the White House?" A semantic model excels at answering such complex questions involving multiple types of data from multiple sources in multiple formats.
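The idea of "just values and relationships" can be sketched in a few lines of Python by storing every fact as a (subject, relationship, object) triple. The names below are deliberate placeholders rather than historical claims, and the helper functions are hypothetical; a production semantic store would use a dedicated triple store and query language rather than in-memory sets.

```python
# Each fact is a (subject, relationship, object) triple.
# Names are placeholders, not historical claims.
triples = {
    ("President A", "is_a", "President"),
    ("President A", "lived_in", "White House"),
    ("President A", "has_child", "Child A1"),
    ("Child A1", "lived_in", "White House"),
    ("President B", "is_a", "President"),
    ("President B", "lived_in", "White House"),
    ("President B", "has_child", "Child B1"),
    ("President B", "has_child", "Child B2"),
    ("Child B1", "lived_in", "White House"),
    ("Child B2", "lived_in", "Boarding School"),
}


def objects(subject, relationship):
    """All values reached from a subject through one relationship."""
    return {o for s, r, o in triples if s == subject and r == relationship}


def subjects(relationship, obj):
    """All values that point at obj through one relationship."""
    return {s for s, r, o in triples if r == relationship and o == obj}


def presidents_with_child_elsewhere():
    """Presidents who lived in the White House with a child who did not."""
    residents = subjects("lived_in", "White House")
    presidents = subjects("is_a", "President") & residents
    return sorted(
        p for p in presidents
        if any(child not in residents for child in objects(p, "has_child"))
    )


answer = presidents_with_child_elsewhere()
```

The question is answered by walking relationships, and a new kind of fact is added by inserting another triple rather than by redesigning a table.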
MULTI DIMENSIONS OF CLINICAL DATA
First, as an industry we all must realize and align on the fact that clinical data is multi-dimensional as the relationships between the clinical data for a patient are complex. Subject 345, who is female and had a history of heart disease is taking dose X of drug Y and during the clinical trial they had adverse event Z which was serious and happened two days after they started taking over the counter medication W and their doctor reported an elevated laboratory V.
The relationships within this information for a patient are real; however, the two dimensional world where we currently collect, organize, and report data completely fails at capturing these relationships that would make the data meaningful to the person trying to analyze and reach conclusions. Recently the FDA has released guidance requiring data standards; however, the guidance itself does not require CDISC standards. Instead, it references additional non-guidance documents which communicate that the current requirement is CDISC. At the same time, we continue to hear that while the standards provide a consistent structure, the FDA is still not sure the content is in a format that meets their analysis needs.
Below is a simple picture of how a relationship could be developed within a semantic model. The key here is that the values and relationships are integrated together in one model and additional relationships and values can be added easily to the model.
The picture above provides an example of how you might map an adverse event ontology. The first question you might ask is "What does the word ontology mean?". In computer science an ontology is a structured naming and definition of the object types, properties, and the relationships of the types that define what they mean to each other. Other industries and technologies create ontologies to limit complexity and organize information which can then be applied to problem solving.
The ontology example above shows how every piece of information about an adverse event can be linked together. Combining the 'variables' with relationships in one model closely integrates the content. The subject has an Adverse Event which has a Reported Term which is a subclass of Body System. This simple example ties together the adverse event with body system, something we could not do with the example earlier in this paper.
Earlier we discussed the isolated silos within the overall clinical trial workflow and how they cause challenges in trying to build a metadata driven framework and really leverage tools for automation. The hand offs from group to group are prone to error and create those silos of different tools, disconnected metadata, and completely different processes.
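The chain described above (subject, adverse event, reported term, body system) can be sketched in Python as a small set of triples with a transitive "subclass of" lookup. The identifiers, relationship names, and terms are invented for illustration and are not drawn from any published ontology or coding dictionary.

```python
# Hypothetical triples following the chain described above.
# Identifiers, relationship names, and terms are invented for illustration.
triples = [
    ("Subject345", "has_adverse_event", "AE1"),
    ("AE1", "has_reported_term", "Headache"),
    ("Headache", "subclass_of", "Nervous system disorders"),
    ("Nervous system disorders", "subclass_of", "Body System"),
]


def related(subject, relationship):
    """Values reached from a subject through one relationship."""
    return [o for s, r, o in triples if s == subject and r == relationship]


def ancestors(term):
    """Follow subclass_of links upward until none remain."""
    found = []
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for parent in related(current, "subclass_of"):
            if parent not in found:
                found.append(parent)
                frontier.append(parent)
    return found


def event_hierarchy(subject):
    """Map each reported term for a subject to the classes above it."""
    result = {}
    for event in related(subject, "has_adverse_event"):
        for term in related(event, "has_reported_term"):
            result[term] = ancestors(term)
    return result


hierarchy = event_hierarchy("Subject345")
```

Here the link between the reported term and its body system is part of the model itself, so generic code can discover it instead of relying on what the programmer already knows.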
This overall issue is going to be the most challenging to fix and it's not because of technology but more about the culture and existing state of the types of users in each of these silos. We need to first break down these silos and redesign organizational structures to integrate people and processes from protocol design through submission.
The first step in this process is to develop a robust protocol model that contains the critical components for downstream processes and is easy for users to understand. There have been a number of initiatives over the last decade that attempted to tackle this challenging problem. CDISC developed the Protocol Representation Model (PRM) over 5 years ago, but it only addressed a very small piece and was not implementable. HL7 attempted to develop minimal components of the protocol visit schedule, but again it was only a model and did not provide end users with something implementable. There is hope though! At the recent CDISC Interchange, CDISC, PhUSE, and TransCelerate met for a day to discuss how they can combine their different initiatives to produce a truly usable and machine readable protocol. TransCelerate is working on a protocol template that provides a formal structure with underlying metadata, CDISC is improving and aligning its PRM, and PhUSE is leveraging the work being done in its Semantics Model to represent the CDISC PRM in RDF.
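To suggest what a machine readable protocol could enable, the Python sketch below defines a visit schedule once as structured data and derives two downstream views from it. This is not the CDISC PRM, the TransCelerate template, or any RDF representation; the Visit class, the visit names, the study days, and the assessments are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Visit:
    name: str
    study_day: int
    assessments: list = field(default_factory=list)


# Hypothetical schedule; visit names, days, and assessments are invented.
schedule = [
    Visit("Screening", -14, ["Demographics", "Vital Signs", "Laboratory"]),
    Visit("Baseline", 1, ["Vital Signs", "Dosing"]),
    Visit("Week 4", 29, ["Vital Signs", "Laboratory", "Adverse Events"]),
]


def forms_by_visit(visits):
    """What a data collection system would need: forms expected at each visit."""
    return {v.name: list(v.assessments) for v in visits}


def visits_by_assessment(visits):
    """What an analysis plan would need: time points available per assessment."""
    index = {}
    for v in sorted(visits, key=lambda v: v.study_day):
        for a in v.assessments:
            index.setdefault(a, []).append(v.name)
    return index


collection_plan = forms_by_visit(schedule)
analysis_timepoints = visits_by_assessment(schedule)
```

Because both views are derived from the same definition, a change to the schedule flows to data collection and analysis without anyone transcribing it by hand.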
CONCLUSION
Within this wiki, we first attempted to describe the current state we are stuck in as an industry with severe limitations in our standards, tools, and the ability to really use metadata to drive our process. We believe it's a situation that is prohibiting us as an industry from moving forward and being innovative in improving the drug development process. At the end of the day, this inability to move forward is preventing us from helping patients.
While people might read this and believe it's an academic exercise in technology, or find it overwhelming to consider changing the monolithic processes and models we now use, the reality is that if we don't change, we will continue to be crippled and siloed. It might not be tomorrow or next year, but it has to change. If we don't change, other industries will force us to and find better ways forward.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at:
Chris Decker, d-Wise, Email: firstname.lastname@example.org, Web: www.d-wise.com
Ian Fleming, d-Wise, Email: email@example.com, Web: www.d-wise.com