WG5 Project 09
- Dante Di Tommaso (contact via LinkedIn)
Project Members, 2019:
- Peter Schaefer, VCA-Plus
- Nancy Brucken, CSG
- Cynthia Stroupe, UCB
- Jessica Dai, Vertex
Additional team members are very welcome! Some thoughts about the required expertise:
- Familiarity with study data and CDISC datasets
- Programming knowledge, preferably R, but any SAS programmer willing to work with R programmers is welcome
- Enthusiasm for helping others test systems, and a basic understanding of system testing and qualification
If you want to participate, please contact Dante (via LinkedIn) or the PhUSE office at firstname.lastname@example.org
Name: Test Data Factory
TDF is one of the projects under WG5 Standard Scripts.
Several PHUSE Working Groups and projects develop and specify medical research methods, features, or processes, and some even create software components or subsystems for common tasks in drug development. These efforts require a variety of test datasets, often as CDISC datasets. Project teams typically fall back on data from the CDISC pilot project and/or anonymized study data provided by project team members. The Test Data Factory project aims to provide test data in SDTM and/or ADaM format that supports more systematic and comprehensive testing of these concepts and scripts.
The main idea behind the TDF project is to let users enter a test data specification and to use this input to generate test datasets. The test data would not necessarily reflect real study data, i.e., we would not attempt anything like 'trial simulation'. However, the data structure and the data itself will comply with CDISC standards, and the datasets will be consistent across the whole package. Many data values will be produced by random generators, while others will be based directly on the user input.
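As an illustration of this idea, the following is a minimal sketch in Python, not the project's actual implementation (the team prefers R and SAS). The specification format, its field names (`seed`, `n_subjects`, `age_min`, `age_max`, `arms`, `lab_tests`), and the functions `generate_dm` and `generate_lb` are all hypothetical. The variable names (STUDYID, USUBJID, LBTESTCD, and so on) follow SDTM naming, but the output is not claimed to be fully SDTM-compliant.

```python
import random


def generate_dm(spec):
    """Build DM-like records from a user specification (hypothetical format)."""
    rng = random.Random(spec["seed"])  # seeded so the same spec gives the same data
    records = []
    for i in range(1, spec["n_subjects"] + 1):
        subjid = f"{i:04d}"
        records.append({
            "STUDYID": spec["studyid"],                # taken directly from user input
            "DOMAIN": "DM",
            "USUBJID": f"{spec['studyid']}-{subjid}",  # derived from user input
            "SUBJID": subjid,
            "AGE": rng.randint(spec["age_min"], spec["age_max"]),  # random value
            "AGEU": "YEARS",
            "SEX": rng.choice(["M", "F"]),             # random value
            "ARM": rng.choice(spec["arms"]),           # random pick from user input
        })
    return records


def generate_lb(spec, dm_records):
    """Build LB-like records that reuse the subjects generated for DM."""
    rng = random.Random(spec["seed"] + 1)
    records = []
    for dm in dm_records:
        for seq, test in enumerate(spec["lab_tests"], start=1):
            low, high = test["range"]
            records.append({
                "STUDYID": dm["STUDYID"],
                "DOMAIN": "LB",
                "USUBJID": dm["USUBJID"],  # same key as DM keeps the package consistent
                "LBSEQ": seq,
                "LBTESTCD": test["testcd"],
                "LBORRES": str(round(rng.uniform(low, high), 1)),
                "LBORRESU": test["unit"],
            })
    return records


# Hypothetical user specification
example_spec = {
    "seed": 2019,
    "studyid": "TDF01",
    "n_subjects": 3,
    "age_min": 18,
    "age_max": 65,
    "arms": ["Placebo", "Active"],
    "lab_tests": [
        {"testcd": "ALT", "unit": "U/L", "range": (10.0, 40.0)},
        {"testcd": "GLUC", "unit": "mg/dL", "range": (70.0, 110.0)},
    ],
}

dm = generate_dm(example_spec)
lb = generate_lb(example_spec, dm)
```

Because LB is generated from the DM records rather than independently, every LB record refers to a subject that exists in DM, which is one simple way to keep datasets consistent across a package.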
- GitHub Test Data Factory Repository: https://github.com/phuse-org/TestDataFactory
- PhUSE TeamWork Notebook: https://phuse.teamworkpm.net/index.cfm#/projects/336388/notebooks?catid=437738
|Generic concept for generating test datasets and specific details for the LB and DM domains. Examples of how user input could be collected||June 2019|
|Working proof of concept for generating LB and DM domain datasets||September 2019|
|Public release of "TDF Tool" for LB and DM domain datasets (it still needs to be decided whether this "TDF Tool" will be just a detailed concept and examples, shared scripts, or working software)||December 2019|
Work in Progress and Notes
This section shows meeting notes, what the project team is working on, and/or the current status.
The team initially decided to work on the CDISC Pilot data and to update the datasets to meet newer standards. This was considered low-hanging fruit, and the goal was to produce useful results quickly. It took a little longer than expected, but we now have TDF packages with test datasets based on the CDISC pilot.
- Completed the update of the SDTM and ADaM datasets from the CDISC Pilot. These datasets are part of packages (including descriptions of the packages) that can be found here: PhUSE Working Group References (scroll down to the Scripts working group section)
Work in Progress:
The team is now working on the second phase of the project, whose goal is to use a systematic approach to create test datasets based on user input. The team is following these steps:
- Define the scope of the test data, i.e., which datasets are generated (prioritized list), what user input is required, and which characteristics the datasets should have (for example, incomplete, missing, or wrong data)
- Create a formal list of required user input that can be processed automatically
- Define and implement a process to generate the user-specified test datasets. We aim for scripts, but might initially settle for a less ambitious result.
- As a bonus, provide an infrastructure that lets users easily run the scripts to create test datasets.
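To make the first two steps more concrete, here is a rough sketch in Python of what a machine-checkable list of user input and one "desired characteristic" (missing data) might look like. Everything in it is an assumption for illustration only: the required fields in `REQUIRED_FIELDS`, the function names `validate_spec` and `inject_missing`, and the choice of an empty string to represent a missing value are hypothetical and not decisions of the project team.

```python
import random

# Hypothetical formal list of required user input: field name -> expected type
REQUIRED_FIELDS = {
    "studyid": str,
    "n_subjects": int,
    "domains": list,
    "missing_fraction": float,
}


def validate_spec(spec):
    """Return a list of problems found in the specification (empty if valid)."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in spec:
            problems.append(f"missing field: {name}")
        elif not isinstance(spec[name], expected):
            problems.append(f"{name} must be of type {expected.__name__}")
    # Range check only makes sense once all fields are present and well typed
    if not problems and not 0.0 <= spec["missing_fraction"] <= 1.0:
        problems.append("missing_fraction must be between 0 and 1")
    return problems


def inject_missing(records, variable, fraction, seed=0):
    """Return copies of the records with `variable` blanked in a share of rows."""
    rng = random.Random(seed)  # seeded so the same rows are blanked on every run
    n_blank = round(len(records) * fraction)
    blank_rows = set(rng.sample(range(len(records)), n_blank))
    return [
        {**rec, variable: ""} if i in blank_rows else dict(rec)
        for i, rec in enumerate(records)
    ]
```

Returning a list of problems instead of raising on the first one lets a user fix the whole specification in one pass. `inject_missing` returns new records and leaves its input untouched, so a clean and a deliberately defective version of the same dataset can be kept side by side.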