SEND Implementation Wiki - Define Fundamentals
This page covers the basics of the define file (e.g., define.xml): what it is, where to find the specifications, and how to prepare one.
What Is the Define File?
The "define" file describes the SEND datasets: which domains are represented, which fields are present in each domain, how controlled terminology (CT) is used, and any special calculations or notes on how fields are populated or calculated.
The define file has two primary benefits:
- It provides human-readable information on the contents of a SEND package.
- It gives electronic systems consuming SEND packages an electronic "explanation" of the datasets' location and contents.
Where Do I Find Specifications?
- Case Report Tabulation Data Definition Specification (define.xml) - This is the define-XML 1.0 specification.
- http://www.cdisc.org/define-xml - Site containing resources for the define-XML standard, such as the file above, example files, xsl document, etc.
The majority of the content of the define.xml file consists of the specifications for domains and variables.
Variables are defined through ItemDef elements, usually 1 per variable per domain, although shared columns like STUDYID and USUBJID require only 1 definition regardless of how many domains use them.
Domains are specified as ItemGroupDef elements, which are in turn collections of ItemRef elements, or references to the variables' ItemDef elements.
Another key part of the define.xml is CT, which is included as CodeList elements. These are referenced by the ItemDef (variable) elements through a CodeListRef subelement. A CodeList can be either a single reference to an external code list (e.g., SEND CT) or an itemized list of terms for a sponsor-specific list.
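As a rough illustration of how these references chain together, the following Python sketch parses a tiny hand-written fragment and resolves each ItemRef in a domain to its ItemDef and, where present, its CodeList. The OIDs, names, and the fragment itself are made up for illustration, and namespaces are omitted for brevity (a real define.xml declares the ODM and def namespaces), so treat it as a sketch rather than a reference implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free fragment; all OIDs and values are illustrative.
FRAGMENT = """
<MetaDataVersion OID="MDV.1" Name="Example">
  <ItemGroupDef OID="IG.BW" Name="BW">
    <ItemRef ItemOID="IT.STUDYID" OrderNumber="1" Mandatory="Yes"/>
    <ItemRef ItemOID="IT.BW.BWTESTCD" OrderNumber="2" Mandatory="Yes"/>
  </ItemGroupDef>
  <ItemDef OID="IT.STUDYID" Name="STUDYID" DataType="text" Length="10"/>
  <ItemDef OID="IT.BW.BWTESTCD" Name="BWTESTCD" DataType="text" Length="8">
    <CodeListRef CodeListOID="CL.BWTESTCD"/>
  </ItemDef>
  <CodeList OID="CL.BWTESTCD" Name="Body Weight Test Code" DataType="text"/>
</MetaDataVersion>
"""


def resolve_domain(mdv, domain_name):
    """Return (variable name, codelist name or None) for each ItemRef in a domain."""
    # Index the definitions by OID so references can be followed directly.
    item_defs = {d.get("OID"): d for d in mdv.findall("ItemDef")}
    codelists = {c.get("OID"): c for c in mdv.findall("CodeList")}
    group = next(g for g in mdv.findall("ItemGroupDef") if g.get("Name") == domain_name)

    resolved = []
    refs = sorted(group.findall("ItemRef"), key=lambda r: int(r.get("OrderNumber")))
    for ref in refs:
        item = item_defs[ref.get("ItemOID")]   # ItemRef -> ItemDef
        cl_ref = item.find("CodeListRef")      # ItemDef -> CodeList (optional)
        cl_name = None
        if cl_ref is not None:
            cl_name = codelists[cl_ref.get("CodeListOID")].get("Name")
        resolved.append((item.get("Name"), cl_name))
    return resolved


mdv = ET.fromstring(FRAGMENT)
variables = resolve_domain(mdv, "BW")
print(variables)
```

Note that the STUDYID ItemDef carries no domain prefix in its OID, reflecting the point above that shared columns can be defined once and referenced from several domains.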
The following is a summary of the primary elements contained within a define.xml file.
First, header/framing elements (each consecutive element is a child of the one before it):
- ODM element, containing the datetime of file creation (also, some static references)
- Study element, which will frame the rest of the content, including a GlobalVariables element which provides the study ID, title, etc.
- MetaDataVersion element, which describes the standards used, etc.
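The nesting of these framing elements can be sketched in Python as follows. This is a minimal skeleton only: the OIDs are hypothetical, namespaces and the other static attributes a real file needs are left out, and the study values are simply the example values used elsewhere on this page.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def build_header(study_id, description, created=None):
    """Build the ODM > Study > MetaDataVersion framing (namespaces omitted)."""
    created = created or datetime.now(timezone.utc)
    odm = ET.Element("ODM", {"CreationDateTime": created.isoformat(timespec="seconds")})

    # Study is a child of ODM; the OID value here is hypothetical.
    study = ET.SubElement(odm, "Study", {"OID": study_id})

    # GlobalVariables carries the study ID, title, and protocol name.
    global_vars = ET.SubElement(study, "GlobalVariables")
    ET.SubElement(global_vars, "StudyName").text = study_id
    ET.SubElement(global_vars, "StudyDescription").text = description
    ET.SubElement(global_vars, "ProtocolName").text = study_id

    # MetaDataVersion is a child of Study and will hold the rest of the content.
    ET.SubElement(
        study,
        "MetaDataVersion",
        {"OID": "MDV.1", "Name": f"Study {study_id}, Data Definitions"},
    )
    return odm


odm = build_header(
    "ABC123",
    "28-Day Oral Toxicology Study in Rats",
    created=datetime(2018, 1, 5, tzinfo=timezone.utc),
)
header_xml = ET.tostring(odm, encoding="unicode")
print(header_xml)
```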
Next, under the header elements, the following elements are used to describe the majority of the define content:
- ValueListDef elements, 1 for each custom list of values, such as enumerating the columns in a SUPP-- file
  - ItemRef subelements, 1 for each item attributed to the ValueListDef, in turn referencing an ItemDef (defined later)
- ItemGroupDef elements, 1 for each domain
  - ItemRef subelements, 1 for each variable in the domain and specifying internal-to-the-domain attributes, such as the ordering in the domain and so on. These elements in turn reference an ItemDef (defined later)
  - def:leaf element, 1 for the domain, describing the file to which the domain is associated.
- ItemDef elements, 1 for each variable used in any of the domains. Common columns, such as STUDYID, can be defined once and referenced within each domain, as they are used identically across domains, but outside of these cases, there is typically 1 ItemDef per column per domain. The ItemDef element's attributes describe the variable, including type, name, length, comments, and so on.
  - CodeListRef subelement, 0 or 1, describing the codelist (e.g., CT list) to which the ItemDef adheres (if applicable). This element in turn references the corresponding CodeList element (defined later)
- CodeList elements, 1 for each codelist used across any of the variables.
  - ExternalCodeList subelement, 0 or 1, used to reference the corresponding SEND CT list (most common case)
  - CodeListItem subelements, 0 to many, used for sponsor-specific lists of terms. These elements in turn have subelements for decodes and so on; reference the define specifications for details.
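To make the two kinds of CodeList concrete, here is a small Python sketch that reads a hand-written fragment and reports whether each codelist is an external reference or an itemized sponsor list. The codelist names, OIDs, terms, and the Version value are invented for illustration, and namespaces are omitted; check the define specification for the exact attributes your version requires.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free fragment; names, OIDs, and terms are illustrative.
FRAGMENT = """
<MetaDataVersion OID="MDV.1" Name="Example">
  <CodeList OID="CL.SEX" Name="Sex" DataType="text">
    <ExternalCodeList Dictionary="SEND CT" Version="example"/>
  </CodeList>
  <CodeList OID="CL.GRADE" Name="Sponsor Severity Grade" DataType="text">
    <CodeListItem CodedValue="1">
      <Decode><TranslatedText>Minimal</TranslatedText></Decode>
    </CodeListItem>
    <CodeListItem CodedValue="2">
      <Decode><TranslatedText>Mild</TranslatedText></Decode>
    </CodeListItem>
  </CodeList>
</MetaDataVersion>
"""


def summarize_codelists(mdv):
    """Map each codelist name to ('external', dictionary) or ('enumerated', terms)."""
    summary = {}
    for codelist in mdv.findall("CodeList"):
        external = codelist.find("ExternalCodeList")
        if external is not None:
            # A single reference to an external list such as SEND CT.
            summary[codelist.get("Name")] = ("external", external.get("Dictionary"))
        else:
            # An itemized, sponsor-specific list: coded value -> decode text.
            terms = {
                item.get("CodedValue"): item.findtext("Decode/TranslatedText")
                for item in codelist.findall("CodeListItem")
            }
            summary[codelist.get("Name")] = ("enumerated", terms)
    return summary


summary = summarize_codelists(ET.fromstring(FRAGMENT))
print(summary)
```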
Please see the Case Report Tabulation Data Definition Specification (define.xml) for details on any of the elements noted.
Getting a Base File
If you are using a vendor solution to create SEND files, it typically will come bundled with functionality to output a define.xml file.
If you need to create one yourself, then the OpenCDISC Validator tool can be used to generate the basis of a define.xml file from a set of SEND XPT files (select "Generate Define.xml" and then your SEND XPT files). This is a good starting point for your define.xml, as it will create all of the structural basics for you; however, it cannot populate company-specific information such as comments, desired data types, custom controlled terminology, and so on.
Another option is to use an example define file provided on the define-xml site. These are more realistic examples, although they are not SEND-based.
The raw define file needs several additions and refinements:
- File/study metadata:
  - Creation datetime in the ODM element's CreationDateTime attribute
  - Study Name, Description, and Protocol Name in the Study element's GlobalVariables subelement (e.g., ABC123, 28-Day Oral Toxicology Study in Rats, and ABC123, respectively)
  - Name under the MetaDataVersion element's Name attribute (e.g., "Study ABC123, Data Definitions")
- Domain (ItemGroupDef) keys under the def:DomainKeys attribute
- Domain variable references' (ItemRef) attributes, including a review of the Mandatory and Role attributes for domain variables (these can be incorrect in the templates)
- Variable definitions' (ItemDef) attributes, including:
  - Populating the Origin attribute
  - Populating the Comments attribute
  - Revising the Type and/or Length attributes
  - Adding a CodeListRef subelement when CT applies to the variable
- CodeList elements for each CT list used (internal and external)
- Any value lists used (ValueListDef), including SUPP-- variable descriptions.
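A simple script can help flag some of the items above that still need attention. The sketch below is only an illustration: it assumes un-namespaced Origin and Comment attributes on ItemDef and a Mandatory attribute of "Yes" or "No" on ItemRef, and it runs against a made-up, namespace-free fragment. Adjust the attribute names and the list of required attributes to match your define version and your own conventions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free fragment; OIDs and values are illustrative.
FRAGMENT = """
<MetaDataVersion OID="MDV.1" Name="Study ABC123, Data Definitions">
  <ItemGroupDef OID="IG.BW" Name="BW">
    <ItemRef ItemOID="IT.STUDYID" OrderNumber="1" Mandatory="Yes"/>
    <ItemRef ItemOID="IT.BW.BWORRES" OrderNumber="2"/>
  </ItemGroupDef>
  <ItemDef OID="IT.STUDYID" Name="STUDYID" DataType="text" Length="10"
           Origin="Protocol" Comment="Study identifier"/>
  <ItemDef OID="IT.BW.BWORRES" Name="BWORRES" DataType="text" Length="8"/>
</MetaDataVersion>
"""


def review(mdv, required=("Origin", "Comment")):
    """List ItemDefs missing required attributes and ItemRefs without Mandatory."""
    findings = []
    for item in mdv.findall("ItemDef"):
        for attr in required:
            if not item.get(attr):  # absent or empty
                findings.append(f"ItemDef {item.get('OID')}: missing {attr}")
    for group in mdv.findall("ItemGroupDef"):
        for ref in group.findall("ItemRef"):
            if ref.get("Mandatory") not in ("Yes", "No"):
                findings.append(
                    f"ItemRef {ref.get('ItemOID')} in {group.get('Name')}: Mandatory not set"
                )
    return findings


findings = review(ET.fromstring(FRAGMENT))
for line in findings:
    print(line)
```

A check like this only finds gaps; whether a populated Mandatory or Role value is actually correct still needs a manual review against the SEND Implementation Guide.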
Advanced Define Concepts
Value-level metadata needs to be defined when data in all rows of a variable cannot be described by a single collection of metadata.
Using the LB domain as an example, the LBORRES variable contains both qualitative and quantitative test results. The quantitative results may be integers or floating-point values, and the floating-point values may have different precisions. Some data may be collected and some derived. The qualitative results may use different result coding schemes that need documenting in different codelists. Thus, LBORRES cannot be described with a single collection of metadata at the variable level, and value-level metadata is required.
All of the attributes and child elements (data type, length, significant digits, codelist, origin, derivation method, comments) available for variable-level metadata are also available for value-level metadata. Additionally, value-level metadata needs a qualifier that identifies the subset of data being described. Continuing the example of LBORRES, the entry in LBTEST or LBTESTCD might be a good way to break up the values in the dataset into subsets, which allows LBORRES to be defined with one set of metadata per test. In other words, the values in LBORRES could be described separately for each test (LBORRES values for RETI could be described separately from LBORRES values for GLUC, and so on).
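The idea of one set of metadata per test can be sketched in Python as a lookup keyed by LBTESTCD. The class, the field names, and every metadata value below are hypothetical and chosen only to mirror the attributes listed above; this is not how the define file itself stores the information.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ValueMetadata:
    """One collection of metadata, usable at variable level or value level."""
    data_type: str
    length: int
    significant_digits: Optional[int] = None
    codelist: Optional[str] = None
    origin: Optional[str] = None


# Hypothetical value-level entries for LBORRES, one per test code.
LBORRES_BY_TESTCD = {
    "GLUC": ValueMetadata("float", 8, significant_digits=1, origin="Collected"),
    "RETI": ValueMetadata("float", 8, significant_digits=2, origin="Derived"),
    "COLOR": ValueMetadata("text", 20, codelist="CL.COLOR", origin="Collected"),
}

# Fallback used when no value-level entry exists for a test.
VARIABLE_LEVEL = ValueMetadata("text", 20)


def metadata_for(row):
    """Pick the LBORRES metadata that applies to a single LB record."""
    return LBORRES_BY_TESTCD.get(row["LBTESTCD"], VARIABLE_LEVEL)


reti = metadata_for({"LBTESTCD": "RETI", "LBORRES": "1.25"})
print(reti)
```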
In define-XML 1.0, only one variable can be used to break up dataset values into subsets. In define-XML 2.0, multiple variables can be used, with different comparators, to create a WhereClause that identifies subsets of the dataset. For example, in define-XML 1.0 you could say that you want to define metadata for LBORRES when LBTEST=PROT. In define-XML 2.0, you can create a WhereClause that allows you to define metadata for LBORRES when LBTEST=PROT AND LBSPEC=URINE.
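The difference between the two approaches can be illustrated with a small Python sketch that evaluates a where clause against a record. The names RangeCheck and COMPARATORS are loosely modeled on define-XML 2.0 terminology, but the structure here is a simplification made up for this page, and only a few comparators are shown.

```python
import operator
from dataclasses import dataclass

# A few comparators for illustration; the real specification defines more.
COMPARATORS = {
    "EQ": operator.eq,
    "NE": operator.ne,
    "IN": lambda value, allowed: value in allowed,
    "NOTIN": lambda value, allowed: value not in allowed,
}


@dataclass(frozen=True)
class RangeCheck:
    """One condition: compare a variable's value against a check value."""
    variable: str
    comparator: str
    value: object


def matches(where_clause, row):
    """A where clause matches when all of its checks hold (logical AND)."""
    return all(
        COMPARATORS[check.comparator](row.get(check.variable), check.value)
        for check in where_clause
    )


# define-XML 1.0 style: a single variable identifies the subset.
single = [RangeCheck("LBTEST", "EQ", "PROT")]
# define-XML 2.0 style: several variables combined with AND.
combined = [RangeCheck("LBTEST", "EQ", "PROT"), RangeCheck("LBSPEC", "EQ", "URINE")]

urine_row = {"LBTEST": "PROT", "LBSPEC": "URINE", "LBORRES": "NEGATIVE"}
serum_row = {"LBTEST": "PROT", "LBSPEC": "SERUM", "LBORRES": "6.4"}

print(matches(single, urine_row), matches(single, serum_row))
print(matches(combined, urine_row), matches(combined, serum_row))
```

With the single-variable clause, the urine and serum protein results fall into the same subset and must share one set of metadata; the combined clause lets the qualitative urine result and the quantitative serum result be described separately.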
TBD - include how to specify, when/how to reference, etc.
Last revision by Jennifer.feldmann, 2018-01-5