Talk:Data Sizing Best Practices Recommendation

From PHUSE Wiki
Revision as of 10:56, 13 January 2014 by Lexjansen (talk | contribs) (Re: Split datasets subdirectory -- Lex Jansen (talk) 10:55, 13 January 2014 (CST))

Split datasets subdirectory -- Gstoner (talk) 16:20, 3 January 2014 (CST)

The CDER Common Data Standards Issues Document clearly indicates that a large lab dataset (> 1 GB) should be split, that both the split and full datasets should be submitted, and that a /SPLIT subdirectory should be used. I submitted a question to cder-edata in 2012 and received confirmation that this guidance applies to all large datasets. Has the CSS workgroup's recommendation to submit only the split files been vetted with CDER? Will the CDER guidance be updated?

Re: Split datasets subdirectory -- Lex Jansen (talk) 10:55, 13 January 2014 (CST)

In September 2013, a member of the CDISC XML Technologies team sent the following question to the CDER eData Team:

I noticed that in the CDER Common Data Standards Issues Document it states that for split domains you should submit both the split datasets and also the combined dataset (see quote below). This does not match what I have seen being done in practice - I have only ever seen the split domains provided. It also does not match what CDISC state in the SDTM and Define-XML standards or any example files from CDISC. It seems that the CDER Common Data Standards Issues Document is not consistent with CDISC's guidance. It also seems strange that it states that one of the points of splitting domains is to deal with file size issues, yet says to provide the combined dataset anyway. Finally, if it is correct that you have to submit the combined dataset does that mean the Define-XML needs to have definitions for both the combined dataset and each of the splits? I've never seen that in a Define file, either as part of a specification, an example or in a real submission.

Quote from CDER Common Data Standards Issues Document: LB Domain (Laboratory): The size of the LB domain is often quite large and can exceed the reviewers’ ability to open the file using standard-issue computers. This size issue can be addressed by splitting the large LB dataset into smaller data sets according to LBCAT and LBSCAT, using LBCAT for initial splitting. If the size is still too large, then use LBSCAT for further splitting. For example: use the dataset name lbc.xpt for chemistry lbh.xpt for hematology and lbu.xpt for urinalysis. Splitting it other ways (by subject or file size, etc) makes the data less useable. Sponsors should submit these smaller files in addition to the larger non-split standard LB domain file. The smaller split files should be submitted in a separate Sub-directory/SPLIT which is clearly documented in addition to the larger non-split standard LB domain file in the CRT directory. Please see File Size section for information about file size limits.

The answer from the CDER eData Team was:

"Yes, the agency is aware of inconsistency of the different documents regarding to the split domain and file size. Going forward, the agency will make update to have these documents to be updated. At this point, you can just submit the split datasets and the Define-XML for split datasets. For additional questions, please feel free to send an email to"
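For anyone who wants to see the LBCAT-based split from the quoted CDER text in concrete terms, here is a minimal sketch. It only groups records in memory and does not write .xpt files. The LBCAT spellings in the mapping, the function name and the sample records are assumptions for illustration; the quoted text gives only the split names lbc, lbh and lbu.

```python
from collections import defaultdict

# Hypothetical mapping from LBCAT values to split dataset names. The quoted
# CDER text names lbc, lbh and lbu; the LBCAT spellings here are assumed.
SPLIT_NAMES = {"CHEMISTRY": "lbc", "HEMATOLOGY": "lbh", "URINALYSIS": "lbu"}


def split_lb_by_category(records, split_names=SPLIT_NAMES):
    """Group LB records by LBCAT into named splits (illustrative helper)."""
    splits = defaultdict(list)
    for rec in records:
        name = split_names.get(rec["LBCAT"])
        if name is None:
            # Fail loudly rather than silently dropping an unmapped category.
            raise ValueError(f"no split name defined for LBCAT={rec['LBCAT']!r}")
        splits[name].append(rec)
    return dict(splits)


# Made-up sample records, not real study data.
lb = [
    {"USUBJID": "001", "LBCAT": "CHEMISTRY", "LBTESTCD": "ALT"},
    {"USUBJID": "001", "LBCAT": "HEMATOLOGY", "LBTESTCD": "HGB"},
    {"USUBJID": "002", "LBCAT": "CHEMISTRY", "LBTESTCD": "AST"},
    {"USUBJID": "002", "LBCAT": "URINALYSIS", "LBTESTCD": "PH"},
]

splits = split_lb_by_category(lb)
for name in sorted(splits):
    print(name, len(splits[name]))
```

If a split were still too large, the same grouping could presumably be repeated on LBSCAT within that split, as the quoted text suggests.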