WG5 P02 Programming Guidelines

From PhUSE Wiki
Revision as of 07:40, 7 February 2016 by DanteDT (talk | contribs) (TEST programs)


WG5 Project 02 Programming Conventions and Guidelines

Programming project:
Central Tendency White Paper

Starting Point:


OVERVIEW

In general, the CSS/PhUSE scripts should adhere to 3 general principles:

  • Create learning opportunities for our community. This drives how we write and organize documents, specifications, test data, scripts, etc.
    • Highly accessible and readable components allow us all to improve our expertise, technical capabilities and understanding of standards
  • Ensure that these documents reference relevant language specifications and industry standards
    • e.g., link to R or SAS online documentation pages for sophisticated or subtle techniques
    • e.g., link to CDISC SDTM or ADaM guidelines to explain test data structures
  • Keep all components as simple as possible, but not simpler :-)
    • e.g., provide basic functionality as required or suggested by white papers
    • e.g., provide basic but not full user configuration
    • e.g., keep all code relevant to statistical analysis and presentation visible and accessible for additional user customization
      • ... but it is OK to move basic data-driven discovery into functions or macros. For example, SAS record-based processing makes it somewhat cumbersome to identify unique values in a variable. It's OK to move this into a utility macro, to replicate R-style data discovery.


The PhUSE/CSS library contains 4 types of programs

  • Standard Scripts (WPCT folder)
    • Produce a specified data display.
    • Clearly present core statistical steps relevant to specified analyses.
    • Explicitly assert assumptions about the data and environment, via %ASSERT* macros.
    • Hide generic discovery and processing irrelevant to specified analyses, via %UTIL* macros.
    • naming convention: <white-paper-id>_<display-id>.<lang>
  • Assertion macros (utilities folder, prefix assert_)
    • [ SAS focus ]
    • Test conditions in the data and environment, and
    • inform the end user in case of unexpected conditions, invalid states
    • assertion naming convention: assert_<assertion-description>.<lang>
  • Utility macros (utilities folder, prefix util_)
    • [ SAS focus ]
    • Accomplish discovery and processing tasks that are needed,
    • but that are not particularly relevant to the analyses.
    • Implementation of these tasks has no impact on the interpretation of results
    • utility naming convention: util_<utility-description>.<lang>
  • Test scripts (prefix test_)
    • Test a specific program, explicitly using specific test data.
    • test naming convention: test_<program-name-without-extension>.sas


TO DO for these PhUSE CS project guidelines

  • Create standard program header, as recommended in GPP Guidance, above
  • Better clarity concerning language. These guidelines are currently SAS-focused, but we want to deliver R scripts, as well.


GENERAL

  • Keep it simple. aggressively.
    • before you add in complexity: stop, assess whether this is really needed, and
    • justify the gain in functionality vs. the costs of complexity.
    • before you finish your code: stop, review and assess whether you can make it simpler without meaningful loss
  • But not too simple.
    • all variable names, symbol names, macro names must be meaningful
    • long, descriptive names are better for readability than short, cryptic names
    • looping:
      • never use one-letter variables to loop (e.g., i j k ...)
      • code often loops through values, or parses a delimited string and processes each piece
        • e.g., Process each parameter in a list of lab parameters, or each var in a list of variables.
      • Our programs should uniformly use -IDX and -NXT suffixes for such processing.
      • -IDX suffix for the indexing variable (or macro symbol)
        • e.g., See %assert_var_exist() for an example of looping through data sets and var names.
        • DIDX indexes data set names, and VIDX indexes variable names
      • -NXT suffix for the variable (or symbol) that holds the value to process next from a delimited list
        • e.g., See %assert_var_exist() for an example of looping through data sets and variable names.
        • DNXT holds the next data set name, and VNXT holds the next variable name
      • This makes the code easy to read!
  • CSS_ prefix for all WORK data sets
    • DO NOT overwrite data sets that could help the user debug their data & changes (GPP Guidance)
    • DO delete other WORK data sets as soon as they are obsolete
  • headers contain a TO DO list, to facilitate contribution
    • TO DO placeholders within the script can also help contributors incorporate new code
  • Header: see notes on "Comments", below
  • Spacing and alignment
    • align code with space characters, never tabs. set your editor to replace tabs with spaces. (GPP Guidance)
    • use consistent number of spaces to indent within a single program
    • 2-space indents are preferred (not more). set your editor to 2-space indenting, replacing tabs with spaces.
      • see Explanations (a.k.a. Comments), below.
      • indenting helps group related blocks of code, so 2-space indenting allows more indenting
    • maintain spacing in a program.
      • e.g., if you edit a program with 2-space alignment, stick with 2-space alignment
  • capitalization
    • SAS is not a case-sensitive language
    • prefer lower case, unless necessary (title, labels) or helpful for clarity (comments)
    • use casing functions explicitly in algorithms: lowcase(), upcase(), %lowcase(), %upcase()
  • do not abbreviate SAS keywords anywhere
    • use the full keyword to support clarity and readability
    • create a good experience for end-users of all skill levels
    • (similar to GPP Guidance to always use "data=dataset" option in SAS programs)
  • explicit parentheses in algorithms for readability (GPP Guidance)
    • do not force reviewers to check order of operations, demonstrate that you are in control
    • NO: var + 1 / 10
    • YES: var + (1/10)
  • macro names should be meaningful, even if long
    • prefix indicates "type", e.g., assert_*, util_*, etc.
    • when reading the macro name in a calling script, the purpose should be clear
    • adhere to NAMING CONVENTIONS that SAS already establishes, whenever possible
    • NO:  %assert_dse()
    • NO:  %assert_dset_exists()
    • YES: %assert_dset_exist(), to match the grammar of SAS elements exist(), fexist(), symexist(), etc.
  • use temporary macro NULL to wrap macro logic in open code, such as an %IF block
    • Example:
 %macro null;
   %if not %symexist(init_sasautos) %then %let init_sasautos = %sysfunc(getoption(sasautos));
 %mend null;
 %null;
  • see "Conventions for macro parameter names", below
  • OK to assume that one-level data sets are in WORK
    • without checking for the USER libname & related system option
    • but keep in mind as potential bug
  • macro messages to the log follow this style and format:
    • NOTE: (MACRO-NAME-UPCASE) Clear informational message to user.
    • WARNING: (MACRO-NAME-UPCASE) Warning message to user, but processing continues.
    • ERROR: (MACRO-NAME-UPCASE) Error detected in current context. Processing should stop as soon as possible.
    • this makes it easy to
      • extract messages from logs
      • separate SAS and PhUSE/CSS messages
    • for PhUSE/CSS ASSERT and UTILITY macros, see additional details, below
  • macros use Quoting carefully and intentionally
    • use q- versions of macro functions whenever processing unknown text.
    • e.g., the following macro FAILS for some values of &vars, unless you use the %qscan() function
 %macro null(vars);
   %if %scan(&vars, 1) = STDDEV %then %put Note: Calculating Standard Deviation.;
   %else %put Note: Calculating something else.;
 %mend null;
 %null(OR);
  • macros clean up after themselves
    • delete temp data sets before exiting
    • reset any modifications before exiting
      • system options,
      • graphics options,
      • ODS destinations
      • etc.
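
The following is a minimal sketch of how several of the conventions above might combine in one macro: the -IDX and -NXT suffixes for looping, q- versions of macro functions for unknown text, explicit casing, and the log message format. The macro name %util_demo_list_vars and its messages are hypothetical, invented for illustration; they are not part of the PhUSE/CSS library.

```sas
%macro util_demo_list_vars(vars);
  %*--- Hypothetical utility: report each item in a space-delimited list ---*;
  %local vidx vnxt;

  %let vidx = 1;
  %do %while (%length(%qscan(&vars, &vidx, %str( ))) > 0);
    %*--- VNXT holds the next item. The q- functions mask values such as OR ---*;
    %let vnxt = %qscan(&vars, &vidx, %str( ));

    %if %qupcase(&vnxt) = STDDEV %then
      %put NOTE: (UTIL_DEMO_LIST_VARS) Item &vidx is Standard Deviation.;
    %else
      %put NOTE: (UTIL_DEMO_LIST_VARS) Item &vidx is &vnxt, handled as something else.;

    %let vidx = %eval(&vidx + 1);
  %end;
%mend util_demo_list_vars;

%util_demo_list_vars(MEAN OR STDDEV)
```

Because %qscan() and %qupcase() return masked values, an item such as OR is compared as text in the %IF condition rather than being evaluated as a logical operator.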


Explanations (a.k.a. Comments)

  • Comments must be meaningful and easy to maintain
    • No extra characters to draw boxes around comments (see header note, below)
    • Explain what the code needs to achieve
    • Explain decisions in the code
      • why keep or drop certain vars?
      • why are the merge variables or by variables correct?
      • why is a particular algorithm correct? what do the elements represent?
  • Comment types must be used intentionally
    • Header block between starting line (/***) and ending line (***/)
    • /*** ***/ style comments for blocks of explanation, like with the header
    • %*--- ---*; style comments to explain macro statements
    • *--- ---*; comment statements as single-line explanations
  • Comments declare what program expects from macro call, such as data sets, macro vars, etc. See also "STANDARD scripts", below.
  • Comments visually group blocks of related code, which are indented one additional step (GPP Guidance extended)
    • Examples (consistent 2-space indentation)
 *--- Single-line comment to explain the next, related steps ---*;
   all code that accomplishes this objective is indented to this level
 
 /*** Optional title for comment
   This next bit is more complicated, so requires a bit more explanation.
   But not too much.
 ***/
   all code to accomplish this complex task
 
   still working on it down here
 
 %*--- OK, now I am prepared to call my utility macro ---*;
   %util_generic_processing(ds=my_data)


STANDARD scripts

  • Use PhUSE/CSS test data
  • Access PhUSE/CSS test data via %UTIL_ACCESS_TEST_DATA
  • Use global symbol &CONTINUE with values 0 (No, there's a problem) and 1 (Yes, continue) to monitor success of processing
  • see also ASSERT macros, below
  • Use assertion macro %ASSERT_CONTINUE to interrupt processing if a problem occurs (force syntax-checking mode if error indicated)
  • Declare the symbols that utility programs create. E.g., see these macro calls in template program WPCT-F.07.01.sas
 %*--- Return macro vars: Number of parameters (&PARAMCD_N), their Names (&PARAMCD_NAM1 ...) and Labels (&PARAMCD_LAB1 ...) ---*;
   %util_labels_from_var(css_anadata, paramcd, param)
 
 %*--- Return macro var: Number of planned treatments (&TRTN) ---*;
   %util_count_unique_values(css_anadata, trtp, trtn)


TEST scripts

  • script naming convention: test_<program-name-without-extension>.sas
  • every test explicitly uses specific data
  • centralized PhUSE/CSS data sets must include a QLTSTID variable that identifies specific test data
    • QLTSTID has label "CSS/PhUSE Qualification Test ID", and length sufficient for all current test IDs
    • see: https://github.com/phuse-org/phuse-scripts/blob/master/scriptathon2014/data/advs.xpt
    • QLTSTID values should not change, once assigned.
    • e.g., if some test relies on records with QLTSTID = "TEST-01-01",
      • those obs should not change, individually or as a set, and
      • any new obs added to the same central data set must have a new value for QLTSTID


ASSERT macros

  • use %assert_depend to test conditions (e.g., valid data set and variable names, etc.), for consistency of messaging. This applies to UTIL macros, as well.
  • return a 0/1 result in-line whenever possible: 0 = FAIL, 1 = PASS
  • IN-LINE macros: use and return a %local OK symbol to return pass/fail result
  • Base SAS macros: use the global symbol &CONTINUE to return any failure that should stop processing
  • see also "STANDARD scripts", above
  • declare %local and %global symbols explicitly
  • always return at least one message to the log, either
 NOTE: (MACRO-NAME-UPCASE) Result is PASS. Optional confirmation of the successful assertion.
or
 ERROR: (MACRO-NAME-UPCASE) Result is FAIL. Clear explanation of failed assertion.

Depending on severity of the failed condition, a log WARNING may suffice rather than an ERROR.
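
As an illustration of the points above, here is a minimal sketch of an in-line assertion that returns 0/1 through a %local OK symbol and always writes one message to the log. The macro %assert_demo_sym_exist is a hypothetical name used only for this example; it is an assumption, not a macro from the PhUSE/CSS utilities folder.

```sas
%macro assert_demo_sym_exist(sym);
  %*--- Hypothetical in-line assertion. Returns 1 for PASS and 0 for FAIL ---*;
  %local OK;
  %let OK = %symexist(&sym);

  %if &OK = 1 %then
    %put NOTE: (ASSERT_DEMO_SYM_EXIST) Result is PASS. Macro variable %upcase(&sym) exists.;
  %else
    %put ERROR: (ASSERT_DEMO_SYM_EXIST) Result is FAIL. Macro variable %upcase(&sym) does not exist.;

  %*--- Return the result in-line. There is no semi-colon after the symbol ---*;
  &OK
%mend assert_demo_sym_exist;

%*--- Example call, wrapped in temporary macro NULL for the open-code IF ---*;
%macro null;
  %if %assert_demo_sym_exist(continue) %then %put NOTE: (NULL) Global symbol CONTINUE is available.;
%mend null;
%null;
```

The macro contains only macro statements, so it can be called inside a %IF condition or a %LET assignment without generating any SAS code.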


UTIL macros

  • use %assert_depend to test conditions (e.g., valid data set and variable names, etc.), for consistency of messaging. This applies to ASSERT macros, as well.
  • perform a specific task
  • are never hijacked to perform a related task
  • are never hijacked to create a convenient side-effect


Conventions for macro parameter names

Name | Description | Comments | Programs that use
DS | SAS data set, one or two levels | positional, when usage is obvious |
VAR | Valid SAS var, no special chars expected | positional, when usage is obvious | assert_unique_keys
KEYS | Valid SAS vars that compose unique keys for a data set | positional, when usage is obvious |
INCL | Valid SAS vars to include in an output data set | always named parameter | assert_unique_keys
ORD | name of an ORDER variable such as AVISITN | always named parameter |
WHR | complete where statement, %str()-quoted | always named parameter, includes semi-colon (;) for the statement |
SQLWHR | complete SQL where clause, quoted as needed | always named parameter, does NOT include semi-colon | util_count_unique_values
FMT | SAS format name WITH punctuation (@$.), as necessary | always named parameter |
SYM | name of a symbol (macro variable) | positional, when usage is obvious |
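
To make the difference between WHR and SQLWHR concrete, here is a minimal sketch of a macro that accepts both. The macro %util_demo_count_obs, the data set CSS_DEMO_SUBSET and the symbols N_DATA and N_SQL are hypothetical names chosen for this example, and SASHELP.CLASS stands in for real analysis data.

```sas
%macro util_demo_count_obs(ds, whr=, sqlwhr=);
  %*--- Hypothetical utility, for illustration only ---*;
  %local n_data n_sql;

  *--- WHR is a complete statement, so it already ends with a semi-colon ---*;
    data css_demo_subset;
      set &ds;
      &whr
    run;

  *--- SQLWHR is a clause, so the calling code supplies the semi-colon ---*;
    proc sql noprint;
      select count(*) into :n_data trimmed from css_demo_subset;
      select count(*) into :n_sql  trimmed from &ds &sqlwhr;
    quit;

  %put NOTE: (UTIL_DEMO_COUNT_OBS) Data step kept &n_data obs, SQL counted &n_sql obs.;

  *--- Clean up the temporary WORK data set before exiting ---*;
    proc datasets library=work nolist;
      delete css_demo_subset;
    quit;
%mend util_demo_count_obs;

%util_demo_count_obs(sashelp.class,
                     whr=%str(where age > 12;),
                     sqlwhr=%str(where age > 12))
```

Both parameters are passed by name, and the WHR value is %str()-quoted so that its semi-colon does not end the macro call early.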

Other macro parameters

Other parameter | Program that uses | Comment
TABLE | util_freq2format.sas | a 2-var PROC FREQ table spec like var1*var2, can include extra spacing
FMTNAME | util_freq2format.sas | macro determines fmt type, so value does NOT include punctuation (@$.)
MACNAME | util_autocallpath.sas | a macro name, without any special chars
DSETS | assert_complete_refds.sas | list of data sets, where order has a specific meaning


Team Review and Comments