Data Sizing Best Practices Recommendation
The SDTM Validation Rules Project in the FDA/PhUSE CSS Data Quality Working Group has initiated this Best Practices document to address the following challenge faced by industry when implementing CDISC SDTM models.
The project team will be collecting feedback on this document through the end of 2013. Add comments to the discussion section of this page or send email to CSS-DataQuality@phusewiki.org.
When submitting datasets to the FDA, sponsors can find guidance from FDA and CDISC that addresses the size limitations of datasets and variables. Some of this guidance is contradictory or unclear. This Best Practices document recommends a process for limiting data set sizes for submission. It covers the following:
- Process flow for managing the recommended solutions
- How to manage the length of character values to avoid wasted space within datasets?
- How to handle SAS xpt files when they exceed the maximum size allowed?
- What to report in define documentation?
While the CDISC Implementation Guidelines (IG) provide clear requirements for data set-level metadata properties such as names and labels, very little has been said about data set size requirements. As a result, large data sets have been submitted to the FDA, and FDA staff have had difficulty opening and analyzing them. To address this issue, this working group has identified two main factors that contribute to large data sets: the number of observations in a data set and the space allocated to individual variables.
CDISC makes clear that sponsors may “split” domains for any of a number of reasons. Splitting a domain means defining the domain in terms of sub-components. From a technical standpoint, rather than submitting a data set that represents an entire domain, sponsors may submit multiple data sets that each represent one of the sub-components. This document provides recommendations for limiting the number of observations in a data set by defining a threshold beyond which sponsors must establish criteria for domain splitting.
In addition to data set-level metadata properties, CDISC is clear on certain variable-level metadata properties such as names, labels, and types, but not on character variable length. Maximum lengths are occasionally mentioned for certain classes of variables in order to accommodate SAS Version 5 transport file (.xpt) restrictions (e.g. --TESTCD, --TEST, QNAM, QLABEL), but in general, CDISC has historically left variable sizing up to sponsors. The effect is that sponsors often allocate more space (sometimes much more space) than is necessary. This document provides recommendations for filling in this “variable length gap” left by the IGs in an effort to limit character variable size, and ultimately, data set size.
Process flow for managing solutions
The process flow for optimizing data set size involves both factors mentioned above. The goal is to keep together data related to a given domain in accordance with the guidelines laid out by CDISC wherever possible. For this reason, it is the recommendation of this working group that all data sets planned for submission are first subjected to a variable sizing process. This step has the effect of keeping domain data together while removing unnecessary and unused space in the data set. Once the domain is at its minimum size through this approach, then data set size should be examined. Data sets that still exceed the threshold should then be split.
It is recognized that the optimization of data set size introduces an unnatural step to an otherwise natural process of transforming raw data to submission data. Keeping in mind that these recommendations are strictly for the purpose of submission, it is the recommendation of this working group that sponsors minimize disruption to the process by implementing these practices as a final step in preparing submission data. In doing so, sponsors should take special care to verify that these procedures continue to support downstream processes such as the development of ADaM data, tables, listings, and figures, as well as the submission of metadata (i.e. define documentation).
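The order of operations described above (resize first, then check size, then split only if still needed) can be sketched as follows. This is a minimal illustration, not a prescribed implementation. It assumes that the bulk of an xpt file is roughly the number of observations multiplied by the sum of the variable lengths (header records are ignored), and that 1 GB means 1024^3 bytes. The function names, variable lengths, and observation count are hypothetical.

```python
def estimate_data_bytes(n_obs, var_lengths):
    """Approximate observation storage: rows times the sum of variable lengths.

    Assumption: header records are ignored, so this slightly underestimates
    the real file size.
    """
    return n_obs * sum(var_lengths.values())


def plan_submission(n_obs, padded_lengths, used_lengths, split_threshold_bytes):
    """Resize variables first, then decide whether a split is still needed."""
    padded = estimate_data_bytes(n_obs, padded_lengths)
    resized = estimate_data_bytes(n_obs, used_lengths)
    return {
        "padded_bytes": padded,
        "resized_bytes": resized,
        # The split decision is made only on the resized data set.
        "split_needed": resized >= split_threshold_bytes,
    }


# Hypothetical lab domain with three character variables.
plan = plan_submission(
    n_obs=6_000_000,
    padded_lengths={"LBTESTCD": 8, "LBTEST": 40, "LBORRES": 200},
    used_lengths={"LBTESTCD": 5, "LBTEST": 22, "LBORRES": 12},
    split_threshold_bytes=int(1.25 * 1024**3),  # assumed meaning of 1.25 GB
)
print(plan)
```

In this made-up case the padded data set would exceed the threshold, while the resized data set stays well below it and would not need to be split.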
Managing character variable length to avoid wasted space in data sets
Minimizing the space allocated to a character variable requires knowing how much space the current data needs, so optimization of variable size must follow an initial analysis of the data. Depending on their processes, sponsors may find it useful to maintain operational SDTM data sets with padded variables to support analysis, alongside corresponding submission SDTM data sets with optimal variable lengths based on the recommendations below. The following steps are recommended.
1. Develop operational SDTM data using padded character variable sizes so that no data is truncated.
2. Using the operational SDTM data, develop the submission data as follows:
   a. Within a given domain, determine the maximum number of characters used for each character variable in the operational data.
   b. For each character variable in the submission data, assign the quantity determined in step 2a above to the length of the variable.
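The length determination in the steps above can be sketched in a few lines. This is only an illustration: it assumes the domain has been read into a list of dictionaries (one per observation), that character values are single-byte so that character count equals storage length, and that trailing blanks do not count toward the used length. The function name and sample records are hypothetical.

```python
def max_used_lengths(records):
    """Return {variable: longest used length} for the character variables.

    A minimum length of 1 is kept because a character variable cannot
    have length 0 in SAS.
    """
    lengths = {}
    for row in records:
        for name, value in row.items():
            if isinstance(value, str):
                # Assumption: trailing blanks are padding, not data.
                used = len(value.rstrip(" "))
                lengths[name] = max(lengths.get(name, 1), used)
    return lengths


# Hypothetical operational records; VISITNUM is numeric and is skipped.
operational = [
    {"USUBJID": "S1-001", "VISIT": "SCREENING", "VISITNUM": 1},
    {"USUBJID": "S1-002", "VISIT": "WEEK 12   ", "VISITNUM": 5},
]
submission_lengths = max_used_lengths(operational)
print(submission_lengths)
```

The resulting mapping would then be used to set the variable lengths when the submission data set is written.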
NOTE: The effect of these recommendations is that all character variable lengths are data-driven. Consequences of these recommendations include the following:
- Variables that are subject to controlled terminology should only allow for lengths found in data, not the length of the longest term in the codelist.
- Because this occurs before any domain splitting, when a domain is split, each variable will maintain the same length in every partition of the domain.
- Because this procedure is implemented within each domain, any variable common to multiple domains has the potential of having different lengths in different domains (e.g. VISIT).
- Variables whose maximum lengths are mentioned in the IG (e.g. --TESTCD) should still reflect lengths found in data, not the maximum lengths. Note that this recommendation of the working group is in conflict with text such as that stated in section 22.214.171.124 of SDTM IG 3.1.3 which says that --TESTCD variables should have lengths set at 8.
- Any programming transformations that manipulate submission data must take these lengths into account (e.g. appending data for integrated summaries, key variables used for merging data sets). This is of particular importance to sponsor companies that plan to submit functional programs to the FDA.
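As an illustration of the last point, when submission data sets with data-driven lengths are appended (for example, for an integrated summary), one way to avoid truncation is to give each variable the longest length found in any of the inputs before combining them. The sketch below shows only that length calculation. The function name and the per-study lengths are hypothetical.

```python
def harmonized_lengths(*length_maps):
    """Return, for each variable, the largest length found in any input."""
    merged = {}
    for lengths in length_maps:
        for name, length in lengths.items():
            merged[name] = max(merged.get(name, 0), length)
    return merged


# Hypothetical lengths of the same variables in two studies.
study_a = {"USUBJID": 10, "VISIT": 9}
study_b = {"USUBJID": 12, "VISIT": 7}
combined = harmonized_lengths(study_a, study_b)
print(combined)
```

In SAS, the length of a variable in an appended data set is typically taken from the first input data set, so a step like this would need to run before the data sets are combined.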
Handling SAS xpt files when they exceed the maximum size allowed
One GB is the target for maximum xpt file size, with some flexibility. In general, data sets smaller than 1.25 GB would not need to be split, although data sets between 1 and 1.25 GB should be cleared by a reviewer. Any data set (including a split data set) that is 1.25 GB or larger must be split. Variable presentation and variable attributes (including variable length) should be the same across all data sets of a split domain.
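The size thresholds can be expressed as a small classification function. This is a sketch under two assumptions that the text does not settle: that 1 GB means 1024^3 bytes, and that a file of exactly 1 GB meets the target and needs no clearance. The function name and return labels are hypothetical.

```python
GB = 1024**3  # assumption: binary gigabyte


def classify_xpt_size(size_bytes):
    """Map a file size to the action suggested by the thresholds."""
    if size_bytes >= 1.25 * GB:
        return "must split"
    if size_bytes > GB:
        return "clear with reviewer"
    return "ok"


for size in (GB // 2, int(1.1 * GB), int(1.25 * GB)):
    print(size, classify_xpt_size(size))
```

Because a split data set is subject to the same rule, the function would be applied again to each partition after splitting.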
For a domain based on a general observation class, whenever possible, data sets should be split based on categorical variables – first, by --CAT, and then, if necessary, by --SCAT. The value of the variable or variables used to perform the split is not allowed to be null.
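A split on a categorical variable, including the rule that the splitting variable may not be null, might look like the following sketch. It assumes the domain is held as a list of dictionaries, and the function name and sample records are hypothetical.

```python
from collections import defaultdict


def split_by_variable(records, split_var):
    """Group records by the value of split_var, rejecting null values."""
    partitions = defaultdict(list)
    for row in records:
        value = row.get(split_var)
        if value is None or str(value).strip() == "":
            raise ValueError(f"{split_var} is null; cannot split on it")
        partitions[value].append(row)
    return dict(partitions)


# Hypothetical LB records split by LBCAT.
lb = [
    {"USUBJID": "S1-001", "LBSEQ": 1, "LBCAT": "CHEMISTRY"},
    {"USUBJID": "S1-001", "LBSEQ": 2, "LBCAT": "HEMATOLOGY"},
    {"USUBJID": "S1-002", "LBSEQ": 1, "LBCAT": "CHEMISTRY"},
]
lb_parts = split_by_variable(lb, "LBCAT")
print({category: len(rows) for category, rows in lb_parts.items()})
```

If a partition produced this way is still too large, the same function could be applied to that partition using --SCAT.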
The Findings About (FA) domain should be split either by categorical variables (as noted above) or relative to the parent domain of the value in --OBJ. For example, FACM would store Findings About CM records.
Data sets which are divided should be clearly named and labeled to aid the reviewer in differentiating the split domains. For example, if LB is split by LBCAT, the split domain containing chemistry labs could be named lbc.xpt with a dataset label of “Laboratory Results - Chemistry”.
Much has been published regarding which data sets must be submitted for a split domain. Some claim that only the split partitions of the domain should be submitted, while others claim that a single large data set containing every observation of the domain should be submitted as well. In the current version of this document, this group recommends that when a domain is split for data set size purposes, only the split partitions of the domain be submitted, and that the full-domain data set whose size exceeds the recommended threshold not be submitted.
What to report in define documentation
The purpose of the define documentation is to provide metadata that accurately describes the submitted data. According to Study Data Tabulation Model Metadata Submission Guidelines (SDTM-MSG) and CDISC Define-XML Specification, length is required if data type is text, integer or float. Sponsors must reflect in the define documentation the actual variable and value level lengths in the submitted files, or in other words, the reduced lengths. For split domains, metadata should be provided for each data set separately.
NOTE: This approach might not be supported by all systems that process metadata.
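A simple consistency check between the lengths declared in the define documentation and the lengths in a submitted data set could be sketched as below. It assumes both sets of lengths have already been extracted into dictionaries. How they are read from Define-XML or from the xpt file is outside this sketch, and the function name and values are hypothetical.

```python
def define_length_mismatches(define_lengths, dataset_lengths):
    """Return {variable: (define_length, dataset_length)} where they differ.

    A variable missing from the define metadata is reported with None.
    """
    return {
        name: (define_lengths.get(name), actual)
        for name, actual in dataset_lengths.items()
        if define_lengths.get(name) != actual
    }


# Hypothetical case: define still carries the padded length for LBTESTCD.
declared = {"LBTESTCD": 8, "LBTEST": 22}
submitted = {"LBTESTCD": 5, "LBTEST": 22}
mismatches = define_length_mismatches(declared, submitted)
print(mismatches)
```

For a split domain, the check would be run once per split data set, since metadata is provided for each data set separately.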
Q. How should a general observation class domain be split if it is not possible to split by categorical variables (e.g., all labs in an LB dataset are of the exact same lab test, and, therefore, no differentiated categorical variables)?
A. In such cases, splitting by other mechanisms (e.g., subject grouping, timepoint) is at the discretion of the sponsor.
Q. If a domain which needs to be split has an associated supplemental qualifier domain, does the supplemental qualifier domain need to be split too even if it is within size limits?
A. It is recommended that an associated supplemental qualifier domain always be split, regardless of its size, according to the criteria by which the parent domain is split. The split should be done so that the records in the split supplemental qualifier domain correspond to the matching records in the split parent domain. Dataset naming of the split supplemental qualifier domain should follow the same convention as that used for the split parent domain. For example, the split supplemental qualifier domain associated with the split domain named lbc.xpt should be supplbc.xpt. Along the same lines, it is also recommended that any FA domain that records findings about a split domain be split by the same criteria used to split SUPPQUAL.
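The correspondence between split supplemental qualifier records and split parent records can be sketched as follows. The sketch relies on the standard SUPPQUAL linking variables (USUBJID, IDVAR, IDVARVAL) and assumes that the parent key values convert cleanly to the text stored in IDVARVAL (for example, integer sequence numbers). The function name, the partition names, and the QNAM value are hypothetical.

```python
def split_supp(supp_rows, parent_partitions):
    """Route each supplemental record to the partition holding its parent.

    parent_partitions maps a split data set name (e.g. "lbc") to its rows.
    The result maps "supp" + name to the matching supplemental rows.
    """
    id_vars = {row["IDVAR"] for row in supp_rows}
    owner = {}
    for name, rows in parent_partitions.items():
        for row in rows:
            for var in id_vars:
                if var in row:
                    owner[(row["USUBJID"], var, str(row[var]))] = name
    result = {"supp" + name: [] for name in parent_partitions}
    for row in supp_rows:
        key = (row["USUBJID"], row["IDVAR"], row["IDVARVAL"])
        # A KeyError here means a supplemental record has no parent record.
        result["supp" + owner[key]].append(row)
    return result


parents = {
    "lbc": [{"USUBJID": "S1-001", "LBSEQ": 1}],
    "lbh": [{"USUBJID": "S1-001", "LBSEQ": 2}],
}
supplb = [
    {"RDOMAIN": "LB", "USUBJID": "S1-001", "IDVAR": "LBSEQ",
     "IDVARVAL": "2", "QNAM": "LBREAS", "QVAL": "REPEAT SAMPLE"},
]
supp_parts = split_supp(supplb, parents)
print({name: len(rows) for name, rows in supp_parts.items()})
```

Note that an empty supplemental partition (supplbc in this example) would normally not be written out as a data set.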
Q. At what point during data processing do variable sizes need to be reduced?
A. The point of variable, and ultimately, data set size reduction is to facilitate data processing upon submission. For this reason, variable re-sizing can wait until the end of the process when packaging of data for submission takes place. Prior to this point, variable sizes can be kept at the lengths necessary for operational purposes.
Q. Is there an issue if the variable lengths of the datasets used to produce the analysis differ from the submitted datasets?
A. Depending on sponsor process, the variable lengths in the submitted files may differ from those in the datasets used to produce the analysis. For transparency, the define documentation should note the method used to process variable lengths for submission purposes.
Q. The CDER Common Data Standards Issues document mentions a subdirectory called SPLIT, for storing data sets that are the result of splitting a domain. Where does this subdirectory fit in with these recommendations?
A. When a domain is split, the absence of a large data set that combines all of the split partitions removes the need for a distinction between split and un-split domains, and therefore removes the need for separate storage areas. For that reason, no subdirectory is needed.
- (OpenCDISC) An Information note should be issued for data sets between 1 and 1.25 GB in a future version.
- (OpenCDISC) When a domain has been split and the sponsor has decided to submit the parent domain along with the splits, the parent should not be identified as a violation.
- (OpenCDISC) Variables whose length in one data set is longer than the longest value of the variable should not be identified as a violation if the data set in which the variable exists is one partition of a split domain, and the variable’s length does match the length of the longest value in another partition of the same domain.
- (CDISC) Change 126.96.36.199 and any other suggestions of mandated variable lengths.