The Technology of Context
The dictionary defines context as "the circumstances that form the setting for an event, statement, or idea, and in terms of which it can be fully understood and assessed." Context is an essential part of knowledge and understanding. Why is this important to us? Because context carries the meaning or purpose of the data, which enables us to properly communicate, analyze, draw conclusions, and make decisions. Technology systems that do not provide context require the human brain to bring context to the data, exposing the potential for human error and rendering a lower-value technological solution. In our industry, we do not take full advantage of technologies that exploit the context of our data. This paper discusses various technologies that assist in the task of delivering data with context, such as NoSQL databases, semantic technology, machine learning, visualization, and user experience design. Examples are given for each of these technologies and the role they play in deriving important insights.
Do you remember the internet as it existed 10-15 years ago? When you searched for things, how hard was it to find something relevant to what you were actually looking for? Pages and pages of returned links required a much more in-depth investigation, and considerable time, to separate the pages that were helpful from the ones that weren't. The technologies of the time could determine what you were searching for, but not why you were searching for it. They returned the same results to everyone, regardless of who you were, where you were located, or the time of day or year you were searching. The results were also "dumb" in that they carried no inherent knowledge of the search topic itself. The technologies couldn't distinguish between searches for my local weather and the causes of the French Revolution, for example.
Compare that to the internet today. Searches return highly relevant results that take into account who and where you are as well as things that are going on in your life to return results that are far more contextually relevant. Searching for “weather” now returns forecasts for my current location, forecasts for anywhere I may be traveling in the near future, as well as links to relevant sites from the internet that would allow me to further generally investigate the topic of “weather”.
This relevance is only made possible by recent advancements in technology and design. Unstructured databases facilitate dynamic storage and retrieval of information in a fast-paced world while also offering near-infinite scalability. Data modeling with directed-graph-based technologies allows information to be richly modeled and the nature of relationships to be described in a robust way. Machine learning techniques and algorithms are making sense of all of the data and cutting through the noise to make predictions and surface more relevant information. Advancements in technologies for visualizing information allow people to view information in ways that were not as accessible in the past. Lastly, user experience has become an integral part of application development, which helps to optimize applications for individual users while removing extraneous effort in interacting with the data.
At some point over the last decade, prominent Internet-based companies like Google, Amazon, Facebook, and others discovered that relational databases were not the best technology fit for the massive amounts of data that they had to store and use as part of their daily business activities. The directed investment of time and energy from these powerful players has accelerated the rise of a range of technologies including unstructured databases. These data storage systems and servers are often referred to as NoSQL (Not only SQL) databases or other names depending on the context of the discussion.
While unstructured databases have served an important role in allowing big data companies to scale beyond levels previously seen, they have also opened up a different approach to the capture, storage, and evolution of data context. This is best understood by contrasting how data context is typically realized within a relational database application with how it is realized in an unstructured database.
Relational databases are excellent at storing data that has a predefined shape, in much the same way as an egg carton is terrific at storing things that are egg-shaped... chicken eggs, that is. A standard egg carton is not very good at storing a stick of butter, nor an ostrich egg. Unstructured databases, in contrast, do not have predefined containers in which you must store your data. They start with the assumption of supporting any structure of data you may want to place into them.
But how does this distinction relate to data context? Relational databases all but demand that you understand the full context of the data that you will be storing before you begin to store it. Unstructured databases, on the other hand, allow you to start storing data now without knowing the full context. Furthermore, this technology supports emerging contextuality where the system storing and exploiting the unstructured data can evolve organically, over time. This simple capability of deferred context has amazing potential.
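To make the idea of deferred context concrete, here is a minimal sketch in Python: a toy in-memory document store, not any particular NoSQL product, and all of the record shapes and key names are invented for illustration. Records of different shapes are accepted as-is, and the context (which keys mean "heart rate") is applied only at read time.

```python
import json

# A toy document store: no predefined columns, any shape of record is accepted.
store = []

# Store data now, without agreeing on a full schema up front.
store.append({"subject": "001", "hr": 72})                 # early study format
store.append({"patient_id": "002", "heart_rate_bpm": 68})  # later, richer format

# Context is applied at read time: a mapping (a primitive stand-in for an
# ontology) records which keys mean "heart rate" in each historical shape.
HEART_RATE_KEYS = ("hr", "heart_rate_bpm")

def heart_rates(docs):
    """Yield heart-rate values regardless of which record shape stored them."""
    for doc in docs:
        for key in HEART_RATE_KEYS:
            if key in doc:
                yield doc[key]

print(list(heart_rates(store)))  # -> [72, 68]
print(json.dumps(store[1]))      # records round-trip unchanged
```

The point of the sketch is that the second, richer record shape required no migration of the first: the new context was layered on top of data stored before that context existed.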
Some people reading this may think, “But you can always add more data tables to your relational database or change the data structures you have to accommodate new contextual understanding.” This is true, but theory and reality are quite divergent on this point. Most organizations and technology teams struggle to make changes to the database structure for a variety of reasons mostly related to application integration and testing. As a result, data structures that embody an older, often inferior model of data context remain in place and are contorted to fake the new model by stuffing data into tables and columns where it doesn’t belong. This leads to information system “scar tissue” that further encumbers the ability to model the next iteration of data structures and data context. In contrast, unstructured databases have a design that inherently supports reforming the database schema without moving or copying data. These technologies often have support for multiple simultaneous schemas or ontologies that give meaning to the underlying data stored in the unstructured data store.
The flexibility of unstructured data stores also provides agility in the application development process. With relational databases, the required dependencies between the database and applications mean that any change to either of those layers inevitably impacts the other. This results in very large, monolithic, and inflexible systems that are incapable of dealing with the dynamic nature of clinical development conduct. Unstructured databases allow application developers to easily store and retrieve information in whatever way makes sense for their application's purpose, while maintaining a stable database layer with no changes required. This facilitates an Agile SDLC approach and continuous improvement methodologies for software built on these databases. Through this approach, unstructured database technologies can often accommodate tomorrow's ontologies and applications applied to yesterday's data in a snap, giving your database agility, adaptability, and value in a future context.
The word future is an important word to focus on for a moment. The companies that have helped to accelerate unstructured database technology are very forward-looking and heavily invested in predictive analytics. To have the best view into trends, forecasts, and predictions their databases must remain relevant. If they get stale then they will have trouble competing in a future landscape. This is often not of highest concern for the common models of retrospective statistical analysis that we see inside of clinical trials. As a result, we might be tempted to conclude that unstructured databases are not valuable to areas involved in the clinical development workflow, but that would be throwing the flexibility benefit out with the future-proof bath water. We believe that unstructured database technologies bring profound benefits to the clinical development workflow because of the flexibility they bring to the variable nature of data schemas across therapeutic areas, studies, analyses and changing data standards.
In the context of this talk, directed graph refers to the modern-day technical implementations of graph theory (https://en.wikipedia.org/wiki/Graph_theory). A directed graph is a way of describing knowledge: it allows concepts and relationships to be described in a mathematically consistent way. Being consistent, these descriptions can also easily be consumed by a computer.
So what is a directed graph, actually? In its simplest form, it is the definition of a relationship from one "thing" to another "thing". In semantic terms, this is called a "triple". A triple is composed of three parts:
- The subject - a “thing”
- The predicate - the relationship
- The object - another “thing”
The term “directed” in directed graph means that the relationship between the subject and the object has a direction.
An example of a triple would be: "Steve has son Dave", where "Steve" is the subject, "has son" is the predicate, and "Dave" is the object. This simple statement defines the relationship ("has son") between two things ("Steve" and "Dave"). You could also define another triple, "Dave has father Steve", which further defines the relationship between "Steve" and "Dave". These two predicates could even be related through reasoning by declaring "has son" and "has father" to be inverses, meaning that once you define one of the triples, the other is implied and doesn't need to be explicitly stated. More complex reasoning can also be introduced, so that if we have another triple that says, for example, "Dave has brother Larry", the triples "Steve has son Larry", "Larry has father Steve", and "Larry has brother Dave" can be automatically deduced. This is a hugely powerful concept that allows an extensive number of relationships to be derived from some very basic initial input and a few logical rules.
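The reasoning described above can be mechanized with a small forward-chaining loop. The following Python sketch is a toy, not a real RDF/OWL reasoner; the two rules (inverse predicates, and brothers sharing a father) are hard-coded stand-ins for what an ontology would declare. From the two defined triples it derives the four implied ones.

```python
# Rule 1: some predicates have inverses ("has brother" is its own inverse).
INVERSES = {"has son": "has father", "has father": "has son",
            "has brother": "has brother"}

def infer(triples):
    """Apply the rules repeatedly until no new triples appear."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in facts:
            if p in INVERSES:
                new.add((o, INVERSES[p], s))            # Rule 1: inverse predicate
            if p == "has brother":
                for s2, p2, o2 in facts:
                    if s2 == s and p2 == "has father":
                        new.add((o, "has father", o2))  # Rule 2: brothers share a father
        if not new <= facts:
            facts |= new
            changed = True
    return facts

# The two triples we explicitly defined:
defined = {("Steve", "has son", "Dave"), ("Dave", "has brother", "Larry")}
facts = infer(defined)
for triple in sorted(facts - defined):
    print(triple)   # the four implied relationships
```

Running this prints the four deduced triples, giving the six relationships in total that the graph below depicts.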
The following picture shows the previously defined example as a directed graph. Note how there are six relationships in the graph, even though we only defined two. The other four relationships are implied through reasoning. Note that in graph terminology, the circles in the following graph are called “nodes” and the arrows are called “edges”.
Building on these very simple concepts, vast knowledge can be built about topic areas. Modeling with a directed graph is inherently dimensionless, meaning that context can be defined in whatever way is necessary to aid understanding of the underlying knowledge. Multiple contexts can also be applied to the same underlying "things", so that different people can look at the same underlying subject but view it in a way that is more contextually relevant to them. One example is doctors and patients looking at adverse events: a patient would be more likely to call something a heart attack, whereas a doctor may call it a myocardial infarction. Another is localization through language: myocardial infarction can be presented in Kanji to a Japanese doctor. With a directed graph, all of these presentations refer to the same physical node, which can be shown in different ways to different people depending on their particular context.
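The idea of one node with many context-dependent presentations can be sketched in a few lines of Python. The node id and label structure below are invented for illustration; 心筋梗塞 is the Japanese rendering of myocardial infarction.

```python
# One underlying node with several context-dependent labels (a sketch; the
# id, audiences, and label values are illustrative, not from a real system).
node = {
    "id": "event-42",
    "labels": {
        ("en", "patient"): "heart attack",
        ("en", "clinician"): "myocardial infarction",
        ("ja", "clinician"): "心筋梗塞",
    },
}

def present(node, lang, audience):
    """Return the label suited to the viewer's context, falling back to the id."""
    return node["labels"].get((lang, audience), node["id"])

print(present(node, "en", "patient"))    # -> heart attack
print(present(node, "ja", "clinician"))  # -> 心筋梗塞
```

The same underlying node is never duplicated; only its presentation changes with the viewer's context.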
Given that these triples can be consumed by a computer, computers can start to use this information to make sense of the underlying data. In the past, context about data was stored separately from the data, in the form of database keys, computer code, and the insides of people's heads. This made it difficult for people to easily access the data and understand its content. Directed graph and semantic technologies store the context with the data in the form of ontologies. These ontologies contain all of the higher-order relationship definitions and reasoning described previously, and are themselves represented using a directed graph. This, in essence, makes the data self-describing, and democratizes the data through the included ontology, broadening understanding of the content to a wider user base.
Machine learning (ML) has actually been around for a long time. In essence, it is merely using computers to assist in determining something based on previous input. The most widely used example of machine learning in the world today is something everyone is familiar with: email spam filtering. ML algorithms look at incoming email messages to determine whether or not a message is spam. They can do this because developers have "trained" the algorithms on a huge set of previously received email messages that are already classified as spam or not spam. By looking at the characteristics of the messages marked as spam in contrast to the ones marked as not spam, the algorithms can then predict whether future messages should be classified as spam.
The most basic machine learning tasks merely try to classify something you give them into known categories, whether it be email spam, photos, videos, or spoken words. These types of algorithms are often referred to as supervised learning algorithms because they require humans to inform the algorithm of the correctness of its classifications. For example, if an ML algorithm looks at an email and marks it as spam, there is a chance the email is not spam, and a human must go in after the fact and correctly classify the email as not spam. The algorithm then takes this information as input to refine itself and subsequently "learns" over time.
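The spam example can be sketched as a tiny naive Bayes classifier. This is a from-scratch Python toy with an invented four-message training set; production spam filters use far larger corpora and much richer features.

```python
import math
from collections import Counter

# A toy training set (invented), already labeled by humans.
train = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch tomorrow?", "ham"),
]

# Count how often each word appears in each class.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.lower().split())

vocab = set(word_counts["spam"]) | set(word_counts["ham"])

def classify(text):
    """Pick the class with the highest log-probability (Laplace smoothing)."""
    best_label, best_score = None, -math.inf
    for label in ("spam", "ham"):
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("money now"))         # -> spam
print(classify("agenda for lunch"))  # -> ham
```

Correcting a misclassified message simply means adding it to the training set with the right label and recounting, which is exactly the human feedback loop described above.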
At the opposite end of the complexity spectrum, there are highly sophisticated unsupervised machine learning algorithms such as deep learning neural networks. These algorithms typically have very few initial constraints and are given simplified tasks such as "maximize (or minimize) this metric". The algorithms then self-adjust over time, without human intervention, in order to achieve the stated goal. Recently, a company called DeepMind, a subsidiary of Google, used a deep learning neural network to play old Atari games such as Space Invaders and Breakout. The algorithm designers gave the system an input (the pixels on the screen) and an output (the controls), with the simple task of maximizing the score. The algorithm consistently learned to play each game to the point where it was developing sophisticated strategies to get the highest score in a minimal amount of time, effectively becoming a better player at these games than any human.
Here is a video of the algorithm learning the game Breakout:
Wired UK has an interesting article on Deep Mind and the boundaries that they are pushing with their artificial intelligence research:
What does this have to do with context though? We would argue that the ability for ML to provide context is profound. When people in clinical research sit down to do some task, such as design a study, or analyze a study, what is the process that they typically go through to do that? In every case, it will involve looking at what has been done before in contextually similar situations to guide them. Researchers will look at historical studies of similar diseases that have been successful and model their designs or analyses off of those studies. These types of tasks are precisely what ML algorithms are designed to do. ML algorithms can look at historical studies with contextually similar attributes and can suggest designs or analyses which are appropriate. There are numerous examples of where ML can provide suggestions or additional context to decision points along the clinical development process by looking at the past.
Beyond the clinical development process itself, ML algorithms can also go a long way toward using data to better understand disease and health on a broader spectrum. Typically, the shelf life of clinical data is limited. For programs that are not successful, data is usually archived and ignored. Even for successful programs, most of the clinical data is only useful to the point of getting a drug approved, and has little commercial use after that. Many scientists realize that there is untold value in that data to help us better understand the diseases we are trying to treat, as well as to provide insight into the overall topic of human health, but tapping into that data is made difficult by many logistical hurdles:
- Where is all of the data?
  - Just finding out where all the data is and getting access to it has stymied many researchers at the beginning stages. This data tends to live on many clinical systems in many different formats, and just figuring out where it all resides is a massive undertaking.
- What is the data?
  - OK, so you found all the data. Good job! Now you have to figure out what it all is. This task makes finding the data seem trivial. If you have terabytes of data from hundreds of different clinical studies, you need to look into each of those studies to find content that can be classified similarly. For example, the simple task of just identifying heart rate data in each of those studies can take months on its own, due to varying data structures and naming conventions based on the whims of the original researchers. Now multiply that task many times over to accommodate all of the nuances of each of those studies, and you begin to see why this problem is one we tend to label as too difficult compared to the potential benefit.
- How do you make sense of it all?
  - The task now is to actually figure out interesting things from the data. Ideally, you would want to put this data into some kind of ontology (discussed earlier) that would allow you to define medically related concepts and provide relevant context to the information you are trying to catalog. Structuring the data in this way would allow it to be presented in numerous ways to users of different specialties, who would be able to derive insight from it.
Machine learning can aid greatly in this process. ML clustering algorithms could look at the data and identify concepts that are similar in nature that can be the basis for creating a clinical ontology which is then reviewed and confirmed by bioinformaticians and medical professionals. Once the clinical ontology has been established, ML algorithms could then be used to classify new data into concepts in that ontology with very little human intervention. All of this effort would, in effect, smooth much of the pain around data integration to facilitate data mining and statistical meta-analyses.
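A rough sketch of that clustering step is shown below in Python, using simple string similarity as a crude stand-in for a real clustering algorithm over richer features; every variable name here is invented for illustration.

```python
from difflib import SequenceMatcher

# Invented study-specific variable names that a curator would want grouped
# into shared clinical concepts for an ontology.
names = ["HEART_RATE", "HRT_RATE", "PULSE_BPM", "SYS_BP", "SYSTOLIC_BP"]

def similar(a, b, threshold=0.6):
    """Crude name similarity; a real pipeline would use far richer features."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

# Greedy single-pass grouping: join the first cluster with a similar member.
clusters = []
for name in names:
    for cluster in clusters:
        if any(similar(name, member) for member in cluster):
            cluster.append(name)
            break
    else:
        clusters.append([name])

print(clusters)
# -> [['HEART_RATE', 'HRT_RATE'], ['PULSE_BPM'], ['SYS_BP', 'SYSTOLIC_BP']]
```

Each resulting cluster is a candidate concept for a bioinformatician or medical professional to review and name; note that the sketch correctly leaves "PULSE_BPM" for a human to merge, since lexical similarity alone cannot know it means heart rate.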
The following site gives a good visual overview of basic machine learning concepts:
When was the last time you did a home improvement or home repair project with little to no prior experience? Last weekend? How did you get it done? YouTube has rapidly become the go-to resource for this. You don't have to phone a friend anymore just to learn that he "kind of" did that before or he "thinks you do X" to get the job done. Now you just watch a few videos on YouTube, and in minutes you are locked and loaded to try something new.
Part of the reason for the potency of this approach is the ability to search for the information you need, but an even more powerful reason is that the medium of presentation is extremely rich in context. A picture is worth a thousand words, right? Indeed, but a video must be worth millions. By watching a how-to video on YouTube you can easily consume relevant facts, best-practice procedures, alternative approaches, and risk factors, not to mention reverse camera angles, motion, mechanical interactions, and sound. YouTube videos provide an abundance of context for gaining new knowledge about D-I-Y projects and many other topics.
So what does this realization mean when we turn our focus to visualizing data and analytics? Humans are inherently visual creatures. Information displayed in a visual medium along with its relevant context is far more consumable by humans; this is why instruction manuals have pictures. In our world, that means we should be on the lookout for tools and technologies that contain visual components that aid the conveyance of information and context in a rich and consumable way. They will help us provide much deeper and more effective communication of the science and methods surrounding the data we deal with every day.
One example of improved context in data visualization can be seen in many of the clinical visualization tools currently on the market. These tools can produce animated plots that provide a richness of context far beyond that of a data table or even a static graphic. For example, an animated bubble plot can be used to examine liver toxicity levels in various patient cohorts within a study over time. It is fascinating to watch the bubbles move across the screen, and even accelerate, as trends in toxicity emerge, prompting targeted questions about safety and further investigation of the study data.
To see some really great examples of how visualization of data can drive insight, see Hans Rosling’s TED talk: http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen
User experience is often an overlooked component in the application stack in our industry and this is a problem. How we serve up information from the clinical development chain to end users should be near the top of the list when developing applications. Current applications tend to allow users to fend for themselves when consuming the information. This is problematic because the clinical development process is complex and the data that is created tends to be esoteric and only interpretable by people with special training in understanding clinical data structures and nuances. User experience in our industry should have a heavy focus on democratizing the data to allow people who may not be familiar with clinical data structures to utilize the data in a way that is easy, transparent, and provides value.
UX designers also tend to be marginalized by software companies in our industry, often involved only at the very end stages of a software development project to "work on the interface". This characterization would infuriate most true UX designers, because UX professionals are not pixel pushers whose job is to make pretty buttons on a screen. They are among the primary contributors to any meaningful application project, and should be involved in every step, from conception all the way through delivery and beyond. It is not surprising that many venture capital firms in the tech industry now expect a user experience designer to be one of the first hires made at any startup. They have come to realize that user experience drives products, not technology. Designers' decisions about the product are what drive developers toward one technical solution or another, not the other way around. For full disclosure, Ian's wife is the Product Design Lead at Tumblr, and he hears about this on a daily basis, so he knows a little bit about this.
So what does this mean for context? UX designers are the ones tasked with finding out about users and what they want to do. Our industry has been too busy listening to users tell us how they want to do things and then building systems accordingly, systems that are debatably successful at helping users do what they want. UX designers create the how that lets users do what they want to do, and then work with users directly to improve it. They are very good at looking at problems from different angles and coming up with innovative solutions that users might not see, because users tend to be very focused on the how and not the what. In the end this makes the tools contextually much better for each user, because someone is sitting down and spending a significant amount of time thinking about what each user is trying to do with the system and making sure the system is optimized to let them do it.
Context is important. Nay, context is vital. If we, as an industry, hope to progress in any significant way, context needs to be heavily integrated into our daily workflow through our processes and toolsets. Integrated context will add value to the data we collect and analyze while broadening understanding among everyone who interacts with it. The technologies discussed in this paper are by no means the superset of all available technologies that can bring context to our work, but they are a good starting point. We encourage everyone to look at these and see how they may be of use to you, as well as to explore other technologies that can help provide context to the work we do every day. It is our belief that more context means more efficiency, broader understanding, ease of use, and ultimately better outcomes for patients.