WG5 Project 09

From PhUSE Wiki

Project Team

Project Lead(s):

  • Peter Schaefer (pschaefer (at) vca-plus.com)

Project Members, 2018:

  • Nancy Brucken, Syneos Health
  • Cynthia Stroupe, UCB
  • Jessica Dai, Vertex
  • JJ Hantsch, Biorasi
  • Tania Walton

Additional team members are very welcome! Some thoughts about the required expertise:

  • Familiarity with study data and CDISC datasets
  • Programming knowledge, preferably R, but any SAS programmer willing to deal with R programmers is welcome
  • Enthusiasm about helping others test systems, and a basic understanding of system testing and qualification

If you want to participate, please simply contact Peter at pschaefer (at) vca-plus.com or the PhUSE office.

Project Description

Name: Test Data Factory

This is one of the projects under WG5 Standard Scripts.

Several CS Projects develop and specify medical research methods, features, or processes, and some even create software components or subsystems for common tasks in drug development. These efforts require a variety of SDTM or ADaM test datasets. Project teams typically fall back on data from the CDISC pilot project and/or anonymized study data provided by team members. The Test Data Factory project aims to provide test data in SDTM and ADaM format that supports more systematic and comprehensive testing of these concepts and scripts.

What we are currently working on:
In its initial meetings, the team decided to work on the CDISC Pilot Data first and to update the data to meet newer standards. This is considered "low-hanging fruit" and will quickly produce useful results. However, we won't lose sight of the more generic approach of using simulation to generate test data.

General Approach:
The proposed approach is to create test datasets systematically through the following activities:

  1. Define an initial scope of test data, i.e., which types of datasets are required (as a prioritized list) and which characteristics these datasets should have (for example, incomplete, missing, or wrong data)
  2. Collect ‘real’ study data and identify existing test datasets
  3. Two options:
    1. If datasets already exist (like the CDISC Pilot project data): Update the datasets to comply with newer standards
    2. If datasets do not exist: Create scripts (preferably R scripts) that will create new test datasets with specified features through a simulation-based approach
  4. Define and implement a process to store and publish test datasets and the appropriate metadata through the existing channels (for example, the PhUSE Wiki and the GitHub repository defined in the Standard Analyses and Code Sharing working group)
  5. Identify an infrastructure that enables users to easily use scripts to create test datasets.
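
To make the simulation-based option in step 3 more concrete, the following is a minimal sketch of what a generator for a small demographics-like (DM) dataset could look like. It is not the project's script: the team prefers R, and Python is used here only for illustration. The function name simulate_dm, the study identifier "TDF001", the arm names, and the missing_rate parameter are all hypothetical; only the variable names (STUDYID, DOMAIN, USUBJID, SUBJID, AGE, AGEU, SEX, ARM) follow the SDTM DM domain.

```python
import random


def simulate_dm(n_subjects, missing_rate=0.0, seed=0, study_id="TDF001"):
    """Simulate a small DM-like table as a list of dicts (one per subject).

    missing_rate is the probability that AGE is blanked out for a subject,
    as one example of a deliberately injected data defect.
    All names and values here are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the test data is reproducible
    arms = ["Placebo", "Low Dose", "High Dose"]  # hypothetical treatment arms
    rows = []
    for i in range(1, n_subjects + 1):
        subjid = f"{i:04d}"
        row = {
            "STUDYID": study_id,
            "DOMAIN": "DM",
            "USUBJID": f"{study_id}-{subjid}",
            "SUBJID": subjid,
            "AGE": rng.randint(18, 85),
            "AGEU": "YEARS",
            "SEX": rng.choice(["M", "F"]),
            "ARM": rng.choice(arms),
        }
        # Inject the requested defect: remove AGE for a share of subjects.
        if rng.random() < missing_rate:
            row["AGE"] = None
        rows.append(row)
    return rows
```

A call such as simulate_dm(100, missing_rate=0.1, seed=42) would return 100 subject records in which roughly one in ten has no AGE value. Using a fixed seed is a design choice that lets a published script regenerate the same test dataset on demand instead of requiring the dataset itself to be stored.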

Work in Progress and Meeting Notes

This section shows meeting notes, what the project team is working on, and/or the status.

Current achievements:

  • Completed update of SDTM datasets from the CDISC Pilot. These datasets can be found here: Updated SDTM Data sets
  • Nearly completed update of the ADaM datasets from the CDISC Pilot