WG5 P02 Programming Guidelines
WG5 Project 02 Programming Conventions and Guidelines
Central Tendency White Paper
- Good Programming Practice Guidance
- also published as GPP Guidance Document v1.1.docx
- "The guidance aims to show how to produce well-structured and well-documented programs so that they are easy to read and maintain over time. It is meant to be applicable to all programs, and hence all programmers regardless of experience." (emphasis added)
In general, the PhUSE CS scripts should adhere to 3 general principles:
- Create learning opportunities for our community. This drives how we write and organize documents, specifications, test data, scripts, etc.
- highly accessible and readable components allow us all to improve our expertise, technical capabilities and understanding of standards
- This includes ensuring that these documents reference relevant language specifications and industry standards
- e.g., to R or SAS online documentation pages for sophisticated or subtle techniques
- e.g., to CDISC SDTM or ADaM guidelines to explain test data structures
- And keep all components as simple as possible, but not simpler :-)
- e.g., provide basic functionality as required or suggested by white papers
- e.g., provide basic but not full user configuration
- e.g., keep all code relevant to statistical analysis and presentation visible, accessible for additional user customization
- ... but OK to move basic data-driven discovery into functions or macros. For example, SAS record-based processing makes it somewhat cumbersome to identify unique values in a variable. It's OK to move this into a utility macro, to replicate R-style data discovery.
The PhUSE CS repository contains 4 types of programs
- Standard Scripts (WPCT folder)
- Produce a specified data display.
- Clearly present core statistical steps relevant to specified analyses.
- Explicitly assert assumptions about the data and environment, via %ASSERT* macros.
- Hide generic discovery and processing irrelevant to specified analyses, via %UTIL* macros.
- naming convention: <white-paper-id>_<display-id>.<lang>
- Example: WPCT-F.07.01.sas
- Test scripts (qualification folder, named test_<program-name>)
- Establish expected results for intended functionality in standard, assertion and utility scripts
- naming convention: test_<program-name>.<lang>
- Example: test_assert_dset_exist.sas
- Assertion macros (utilities folder, prefix assert_)
- [ SAS focus ]
- Test conditions in the data and environment, and
- inform the end user in case of unexpected conditions, invalid states
- assertion naming convention: assert_<assertion-description>.<lang>
- Example: assert_dset_exist.sas
- Utility macros (utilities folder, prefix util_)
- [ SAS focus ]
- Accomplish discovery and processing tasks that are needed,
- but that are not particularly relevant to the analyses.
- Implementation of these tasks has no impact on the interpretation of results
- utility naming convention: util_<utility-description>.<lang>
- Example: util_access_test_data.sas
TO DO for these PhUSE CS project guidelines
- Create standard program header, as recommended in GPP Guidance, above
- Better clarity concerning language. These guidelines are currently SAS-focused, but we want to deliver R scripts, as well.
- Keep it simple. Aggressively.
- before you add in complexity: stop, assess whether this is really needed, and
- justify the gain in functionality vs. the costs of complexity.
- before you finish your code: stop, review and assess whether you can make it simpler without meaningful loss
- But not too simple.
- all variable names, symbol names, macro names must be meaningful
- long, descriptive names are better for readability than short, cryptic names
- never use one-letter variables to loop (e.g., i j k ...)
- code often loops through values, or parses a delimited string and processes each piece
- e.g., Process each parameter in a list of lab parameters, or each var in a list of variables.
- Our programs should uniformly use -IDX and -NXT suffixes for such processing.
- -IDX suffix for the indexing variable (or macro symbol)
- e.g., See %assert_var_exist() for an example of looping through data sets and var names.
- DIDX indexes data set names, and VIDX indexes variable names
- -NXT suffix for the variable (or symbol) that holds the value to process next from a delimited list
- e.g., See %assert_var_exist() for an example of looping through data sets and variable names.
- DNXT holds the next data set name, and VNXT holds the next variable name
- This makes the code easy to read!
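The -IDX and -NXT convention above might look like the following minimal sketch. The macro name NULL, the parameter VARS and the sample list are placeholders for illustration only; see %assert_var_exist() in the repository for the project's actual usage.

```sas
%macro null(vars);
  %local vidx vnxt;

  %*--- VIDX indexes the list of variable names and VNXT holds the next name to process ---*;
  %let vidx = 1;
  %let vnxt = %qscan(&vars, &vidx, %str( ));

  %do %while (%length(&vnxt) > 0);
    %put NOTE: (NULL) Processing variable &vidx: &vnxt;

    %let vidx = %eval(&vidx + 1);
    %let vnxt = %qscan(&vars, &vidx, %str( ));
  %end;
%mend null;

%null(aval chg base);
```

The loop stops when %qscan() returns an empty string, so no separate count of list items is needed.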
- CSS_ prefix for all WORK data sets
- DO NOT overwrite data sets that could help the user debug their data & changes (GPP Guidance)
- DO delete other WORK data sets as soon as they are obsolete
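As a rough sketch of the CSS_ conventions above, assuming SASHELP.CLASS as stand-in input (the data set names CSS_ANADATA, CSS_SORTED and CSS_STATS are illustrative, not taken from any PhUSE CS script):

```sas
*--- Keep the analysis data under a CSS_ name so the user can inspect it later ---*;
  data work.css_anadata;
    set sashelp.class;
  run;

*--- Intermediate data set that is needed only for the next step ---*;
  proc sort data=work.css_anadata out=work.css_sorted;
    by sex;
  run;

  proc means data=work.css_sorted noprint;
    by sex;
    var height;
    output out=work.css_stats mean=height_mean;
  run;

*--- CSS_SORTED is now obsolete and is deleted right away ---*;
  proc datasets library=work nolist;
    delete css_sorted;
  quit;
```

Sorting to a new data set rather than in place leaves CSS_ANADATA untouched for debugging.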
- headers contain a TO DO list, to facilitate contribution
- TO DO placeholders within the script can also help contributors incorporate new code
- Header: see notes on "Comments", below
- Spacing and alignment
- align code with space characters, never tabs. set your editor to replace tabs with spaces. (GPP Guidance)
- use consistent number of spaces to indent within a single program
- 2-space indents are preferred (not more). set your editor to 2-space indenting, replacing tabs with spaces.
- see Explanations (a.k.a. Comments), below.
- indenting helps group related blocks of code, so 2-space indenting allows more indenting
- maintain spacing in a program.
- e.g., if you edit a program with 2-space alignment, stick with 2-space alignment
- SAS is not a case-sensitive language
- prefer lower case, unless upper or mixed case is necessary (titles, labels) or helpful for clarity (comments)
- use casing functions explicitly in algorithms: lowcase(), upcase(), %lowcase(), %upcase()
- do not abbreviate SAS keywords anywhere
- use the full keyword to support clarity and readability
- create a good experience for end-users of all skill levels
- (similar to GPP Guidance to always use "data=dataset" option in SAS programs)
- explicit parentheses in algorithms for readability (GPP Guidance)
- do not force reviewers to check order of operations; demonstrate that you are in control
- NO: var + 1 / 10
- YES: var + (1/10)
- macro names should be meaningful, even if long
- prefix indicates "type", e.g., assert_*, util_*, etc.
- when reading the macro name in a calling script, the purpose should be clear
- adhere to NAMING CONVENTIONS that SAS already establishes, whenever possible
- NO: %assert_dse()
- NO: %assert_dset_exists()
- YES: %assert_dset_exist(), to match the grammar of SAS elements exist(), fexist(), symexist(), etc.
- use temporary macro NULL to wrap macro logic in open code, such as an %IF block
%macro null;
  %if not %symexist(init_sasautos) %then %let init_sasautos = %sysfunc(getoption(sasautos));
%mend null;
%null;
- see "Conventions for macro parameter names", below
- OK to assume that one-level data sets are in WORK
- without checking for the USER libname & related system option
- but keep in mind as potential bug
- macro messages to the log follow this style and format:
- NOTE: (MACRO-NAME-UPCASE) Clear informational message to user.
- WARNING: (MACRO-NAME-UPCASE) Warning message to user, but processing continues.
- ERROR: (MACRO-NAME-UPCASE) Error detected in current context. Processing should stop as soon as possible.
- this makes it easy to
- extract messages from logs
- separate SAS and PhUSE CS messages
- for PhUSE CS ASSERT and UTILITY macros, see additional details, below
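The message format above could be produced as in this sketch. The macro UTIL_EXAMPLE_NOBS is hypothetical and exists only to show one message of each severity; it is not part of the PhUSE CS utilities.

```sas
%macro util_example_nobs(ds);
  %local dsid nobs rc;

  %*--- Open the data set to read its attributes. A zero identifier means the open failed ---*;
  %let dsid = %sysfunc(open(&ds));

  %if &dsid = 0 %then %do;
    %put ERROR: (UTIL_EXAMPLE_NOBS) Unable to open data set &ds for reading.;
    %return;
  %end;

  %let nobs = %sysfunc(attrn(&dsid, nlobs));
  %let rc   = %sysfunc(close(&dsid));

  %if &nobs = 0 %then %put WARNING: (UTIL_EXAMPLE_NOBS) Data set &ds has no observations, but processing continues.;
  %else %put NOTE: (UTIL_EXAMPLE_NOBS) Data set &ds has &nobs observations.;
%mend util_example_nobs;

%util_example_nobs(sashelp.class);
```

Because every message carries the macro name in parentheses, a simple text search of the log for "(UTIL_EXAMPLE_NOBS)" separates these messages from those that SAS itself writes.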
- macros use Quoting carefully and intentionally
- use q- versions of macro functions whenever processing unknown text.
- e.g., the following macro FAILS for some values of &vars, unless you use the %qscan() function
%macro null(vars);
  %if %scan(&vars, 1) = STDDEV %then %put Note: Calculating Standard Deviation.;
  %else %put Note: Calculating something else.;
%mend null;
%null(OR);
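For comparison, here is the same macro with %qscan() in place of %scan(), as a minimal sketch. The value OR is a mnemonic logical operator, so the unquoted result of %scan() leaves the %IF condition with an operator that has no left operand. %qscan() masks the result so that it is compared as plain text.

```sas
%macro null(vars);
  %*--- QSCAN masks mnemonic operators such as OR, AND, NOT, EQ in the returned word ---*;
  %if %qscan(&vars, 1) = STDDEV %then %put Note: Calculating Standard Deviation.;
  %else %put Note: Calculating something else.;
%mend null;

%null(OR);
```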
- macros clean up after themselves
- delete temp data sets before exiting
- reset any modifications before exiting
- system options,
- graphics options,
- ODS destinations
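One way a macro might clean up after itself is sketched below. The macro UTIL_EXAMPLE_QUIET, the symbols INIT_NOTES and INIT_LINESIZE, and the data set CSS_TEMP are assumptions made for this illustration.

```sas
%macro util_example_quiet(ds);
  %local init_notes init_linesize;

  %*--- Remember the current settings before changing anything ---*;
  %let init_notes    = %sysfunc(getoption(notes));
  %let init_linesize = %sysfunc(getoption(linesize, keyword));

  options nonotes linesize=132;

  proc sort data=&ds out=work.css_temp;
    by _all_;
  run;

  %*--- Clean up: delete the temp data set and then restore the original options ---*;
  proc datasets library=work nolist;
    delete css_temp;
  quit;

  options &init_notes &init_linesize;
%mend util_example_quiet;

%util_example_quiet(sashelp.class);
```

getoption() returns NOTES or NONOTES for the boolean option, and the KEYWORD argument returns the valued option in LINESIZE=value form, so both results can go straight back into an OPTIONS statement.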
Explanations (a.k.a. Comments)
- Comments must be meaningful and easy to maintain
- No extra characters to draw boxes around comments (see header note, below)
- Explain what the code needs to achieve
- Explain decisions in the code
- why keep or drop certain vars?
- why are the merge variables or by variables correct?
- why is a particular algorithm correct? what do the elements represent?
- Comment types must be used intentionally
- Header block between starting line (/***) and ending line (***/)
- /*** ***/ style comments for blocks of explanation, like with the header
- %*--- ---*; style comments to explain macro statements
- *--- ---*; comment statements as single-line explanations
- Comments declare what program expects from macro call, such as data sets, macro vars, etc. See also "STANDARD scripts", below.
- Comments visually group blocks of related code, which are indented one additional step (GPP Guidance extended)
- Examples (consistent 2-space indentation)
*--- Single-line comment to explain the next, related steps ---*;
  all code that accomplishes this objective
  is indented to this level

  /***
    Optional title for comment
    This next bit is more complicated, so requires a bit more explanation.
    But not too much.
  ***/
    all code to accomplish this complex task
    still working on it down here

  %*--- OK, now I am prepared to call my utility macro ---*;
    %util_generic_processing(ds=my_data)
- Use PhUSE CS test data
- Access PhUSE CS test data via %UTIL_ACCESS_TEST_DATA
- Use global symbol &CONTINUE with values 0 (No, there's a problem) and 1 (Yes, continue) to monitor success of processing
- see also ASSERT macros, below
- Use assertion macro %ASSERT_CONTINUE to interrupt processing if a problem occurs (force syntax-checking mode if error indicated)
- Declare the symbols that utility programs create. E.g., see these macro calls in template program WPCT-F.07.01.sas
%*--- Return macro vars: Number of parameters (&PARAMCD_N), their Names (&PARAMCD_NAM1 ...) and Labels (&PARAMCD_LAB1 ...) ---*;
  %util_labels_from_var(css_anadata, paramcd, param)

%*--- Return macro var: Number of planned treatments (&TRTN) ---*;
  %util_count_unique_values(css_anadata, trtp, trtn)
- script naming convention: test_<program-name-without-extension>.sas
- every test explicitly uses specific data
- this can be test data created specifically within the test program for specific tests, or
- centralized PhUSE CS test data available for multiple tests. see: https://github.com/phuse-org/phuse-scripts/tree/master/scriptathon2014/data
- centralized PhUSE CS data sets must include a QLTSTID variable that identifies specific test data
- QLTSTID has label "PhUSE CS Qualification Test ID", and length sufficient for all current test IDs
- see: https://github.com/phuse-org/phuse-scripts/blob/master/scriptathon2014/data/advs.xpt
- QLTSTID values should not change, once assigned.
- e.g., if some test relies on records with QLTSTID = "TEST-01-01",
- those obs should not change, individually or as a set, and
- any new obs added to the same central data set must have a new value for QLTSTID
- use %assert_depend to test conditions (e.g., valid data set and variable names, etc.), for consistency of messaging. This applies to UTIL macros, as well.
- return a 0/1 result in-line whenever possible: 0 = FAIL, 1 = PASS
- IN-LINE macros: use and return a %local OK symbol to return pass/fail result
- Base SAS macros: use the global symbol &CONTINUE to return any failure that should stop processing
- see also TEMPLATE programs, above
- declare %local and %global symbols explicitly
- always return at least one message to the log, either
NOTE: (MACRO-NAME-UPCASE) Result is PASS. Optional confirmation of the successful assertion.
ERROR: (MACRO-NAME-UPCASE) Result is FAIL. Clear explanation of failed assertion.
Depending on severity of the failed condition, a log WARNING may suffice rather than an ERROR.
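An in-line assertion that follows these conventions might look like the sketch below. The macro ASSERT_EXAMPLE_EXIST is hypothetical and is named differently from the repository's own assertion macros on purpose; it only illustrates the %local OK symbol, the 0/1 in-line result and the PASS/FAIL messages.

```sas
%macro assert_example_exist(ds);
  %local OK;

  %*--- EXIST returns 1 when the data set exists and 0 otherwise ---*;
  %let OK = %sysfunc(exist(&ds));

  %if &OK %then %put NOTE: (ASSERT_EXAMPLE_EXIST) Result is PASS. Data set &ds exists.;
  %else %put ERROR: (ASSERT_EXAMPLE_EXIST) Result is FAIL. Data set &ds does not exist.;

  %*--- Return the 0/1 result in-line, as the only text this macro generates ---*;
  &OK
%mend assert_example_exist;

%macro null;
  %if %assert_example_exist(sashelp.class) %then %put NOTE: (NULL) Safe to continue.;
  %else %put ERROR: (NULL) Stop here.;
%mend null;
%null;
```

Because the macro contains only macro statements and generates nothing but the value of OK, a caller can use it directly inside an %IF condition.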
- use %assert_depend to test conditions (e.g., valid data set and variable names, etc.), for consistency of messaging. This applies to ASSERT macros, as well.
- perform a specific task
- are never hijacked to perform a related task
- are never hijacked to create a convenient side-effect
Conventions for macro parameter names
|Name||Description||Comments||Programs that use|
|DS||SAS data set, one or two levels||positional, when usage is obvious|
|DSOUT||Resulting SAS data set to create, one or two levels||keyword, unless usage is obvious|
|VAR||Valid SAS var, no special chars expected||positional, when usage is obvious||assert_unique_keys|
|KEYS||Valid SAS vars that compose unique keys for a data set||positional, when usage is obvious|
|INCL||Valid SAS vars to include in an output data set||always keyword||assert_unique_keys|
|ORD||name of an ORDER variable such as AVISITN||always keyword|
|WHR||where clause||always keyword|
|SQLWHR||complete SQL where clause, quoted as needed||always keyword, does NOT include semi-colon||util_count_unique_values|
|FMT||SAS format name WITH punctuation (@$.), as necessary||always keyword|
|SYM||name of a symbol (macro variable)||positional, when usage is obvious|
|CLEANUP||0/1 boolean whether to clean up intermediate data sets. 1 = YES, 0 = NO.||always keyword|
Other macro parameters
|Other parameter||Program that uses||Comment|
|TABLE||util_freq2format.sas||a 2-var PROC FREQ table spec like var1*var2, can include extra spacing|
|FMTNAME||util_freq2format.sas||macro determines fmt type, so value does NOT include punctuation (@$.)|
|MACNAME||util_autocallpath.sas||a macro name, without any special chars|
|DSETS||assert_complete_refds.sas||list of data sets, where order has a specific meaning|
Team Review and Comments
- prep-work for WG5 P02 meeting 2015-05-26
- add comments, suggestions below