Good Programming Practice Guidance

From PHUSE Wiki
Revision as of 08:56, 26 April 2013 by Mfoxwell


This document provides guidance on good programming practice (GPP) for the analysis, reporting and manipulation of clinical data in health and life sciences organizations. The guidance is aimed primarily at SAS programmers; however, the principles of GPP also apply to other languages such as R and Stata. Although it was not written with SAS macros in mind, the same principles apply to macros too.

We often have to update existing programs to add new rules, copy programs from one study to another, and take over programs written by others. This guideline aims to show how to produce well-structured and well-documented programs that are easy to read and maintain over time. It is meant to apply to all programs, and hence to all programmers regardless of experience. Specific rules may be of more use to novice programmers, but experienced programmers and mentors should keep the principles in mind as well.

Getting Started With a New Project

When starting work on any new study, it is important to familiarize yourself with the study. Review the study documents and try to understand the following:

  • The objectives of the study.
  • How many patients will be enrolled, randomized and treated.
  • The schedule of events, i.e. screening, run-in, treatment periods, washouts, how many treatments there are and when they are taken.
  • What the primary endpoint is and how, when and where its data are collected.
  • The timelines for the trial: when the database lock is, when the top-line results should be ready, and when all the reporting should be finalized.
  • The current status of the project.

Study documents include:

  • Protocol (CSP) - study outline and statistical sections are usually of relevance.
  • CRF/annotated CRF (annotated with the dataset name and variable name) - to understand where the data comes from and how it was collected and where it is stored.
  • Statistical Analysis Plan (SAP) - to see what data is reported and how.
  • Analysis Datasets (ADS) specifications - describe which derived datasets should be created and what will be stored within them, including detailed definitions of endpoints. Used for ADS programming and validation.
  • Table shells – used for tables, listings and graphs (TLG) programming or validation.
  • Publications, if available (to check against already available results).
  • Previous Clinical Study Reports (CSR), if available (to check against already available results).

Before you start programming, it is important that you familiarize yourself with all the relevant company IT systems, standards, SOPs and guidelines on both programming and program validation. All of these should be adhered to.

  • Familiarize yourself with the system you are working on.
  • Check for company-specific programming standards.
  • Check for study- and project-specific standards.
  • Check for industry standards, such as CDISC, which must or can be applied.
  • Check whether a similar project or study has been worked on, i.e. whether available SAS code can be reused.
  • Check for project-independent macros that can be applied.

Now that you are ready to start programming, keep in mind some basic standards:

  • Data flows from raw data → analysis datasets → outputs.
  • Do not derive anything in more than one place.
  • All derivations should be implemented at ADS level, not during output programming.
  • Structure your program to read in all external data at the top, do the processing, then produce any outputs or permanent analysis datasets.
  • Keep in mind that you or somebody else might need to change your program in the future.
  • Keep in mind that somebody will validate your program.
  • Add comments as you program; do not plan to add them afterwards.
  • Avoid data driven programming.
  • Try to simplify your code to make it more readable and easier to review (e.g. do not make too many derivations in one data step; consider creating multiple data steps).

Language

The language used in programming code and within headers and comments is English.

Program header

A standard header should be used for every program. The purpose of the header is to identify the program and provide documentation, including revision history. It gives a code reviewer the information needed to identify and understand the program and its development life cycle. Standardizing the header allows the information it contains to be used programmatically for purposes such as auditing, project documentation, macro and dataset use tracking, consistency checking, and revision history reporting. The elements included in a header will vary from organization to organization, but some of the most common elements are discussed below.

Required elements

The following should be included in all program headers:

  • Identification of the project of which the program is a part.
  • Program name.
  • Author identification, which should be human readable and unique.
  • Short description of the program's purpose.
  • List of macros used in the program.
  • Date the program was first put into production, was finalized, or first passed validation.
      o This date will be chosen based on the operational procedures used within the company or organization creating the program. The date should indicate the first date when the program was released for final use.
  • Revision history.
      o This is discussed further below.

Recommended elements

The following are not required but are highly recommended in all program headers:

  • All outputs generated by the program, including both file creation and modification.
  • External files, such as datasets or databases, that are used as data inputs to the program or to the macros it uses.
  • Platform and operating system on which the program was developed to run.
  • Software or programming language, and its version, in which the program was written.

Revision history

The revision history section is critical for documenting the revisions made to the program once it is put into production. A well-designed revision history section should include the author of the change, the release date of the change, and a short description of the change. Revision history may also include a version number for each change, which can be used as a reference in the code.

Comments

Comments are important to help anyone reviewing, modifying or using a program to understand the code quickly. All major data or proc steps should be commented, especially data-specific and complex code. Ideally comments should be comprehensive, and should describe the rationale and not simply the action. For example, instead of simply typing "Access demography data", describe which data elements you are accessing and why they are needed: "Bringing in DM to get gender and age and subset to include only the intent to treat population". Comments can also include links to external documentation (requirement specifications, design documents).

Programs can also be split into sections using a distinct style of comment, e.g. rows of asterisks. This helps to structure the program and makes it easier for others to get an overview of it.

Naming conventions

All organizations should have standard naming conventions. Program naming conventions should make it possible to identify groups of related programs such as adverse events tables. Dataset and variable names should describe their content as well as possible, but of course datasets following Clinical Data Interchange Standards Consortium (CDISC) standards will have pre-defined names.

Coding conventions

In order to be efficient and to streamline the sharing of program code between programmers, with regulatory agencies, and with external partners or vendors, it is vital for code structure to follow standard conventions. SAS code which follows these conventions is much easier to read, modify, maintain, and correct. These conventions are divided into those which should be considered required, and those which are recommendations to be followed as applicable.

Required conventions

  • Do not overwrite existing datasets; use a different, meaningful name for each temporary dataset.
  • Use lowercase.
  • Separate data steps and procedures with at least one blank line.
  • Use the ‘data=dataset’ option in procedure statements so that the dataset being used is explicitly stated; this ensures that the statement will still work if it is moved to another location.
  • End data steps and procedures with run or quit to provide a boundary and allow for independent execution.
  • Split data steps into logical parts.
  • Put each statement on a separate line.
  • Left justify global statements, data and procedure statements, and their corresponding run and quit statements.
  • Indent statements belonging to a level by 2 to 5 columns (use the same number of spaces throughout the program), i.e. every nesting level should be visibly indented from the previous level.
  • Do not use tabs for indentation, because they display differently depending on the platform and text editor being used; use blanks instead.
  • For do loops, place the end statement in the same position as the do statement so that they can be easily matched.
  • Insert parentheses in meaningful places in order to clarify the sequence in which mathematical or logical operations are performed.
  • When converting character variables to numeric or vice versa, use the put and input functions to convert the variable explicitly. This ensures that the conversion is done in the way intended and avoids errors, warnings, and notes in the program log.
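Two of the conventions above, clarifying parentheses and explicit type conversion, are not specific to SAS. The following is a minimal sketch in Python, used here only as a neutral illustration; the record and its field names are hypothetical and do not come from any standard.

```python
# Hypothetical subject record; the field names are illustrative only.
record = {"age": "67", "sex": "F", "randomized": False}

# Explicit character-to-numeric conversion: the intent is stated in the
# code, and bad input fails loudly instead of being converted silently.
age = int(record["age"])

is_female = record["sex"] == "F"
is_elderly = age >= 65

# Without parentheses, `and` binds more tightly than `or`, so this is
# evaluated as: is_female or (is_elderly and record["randomized"]).
implicit = is_female or is_elderly and record["randomized"]

# Parentheses make the intended grouping visible to a reviewer.
explicit = (is_female or is_elderly) and record["randomized"]

print(implicit, explicit)
```

The two expressions give different results for this record, which is the kind of discrepancy that is hard to spot in source code review when the grouping is left implicit.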

Recommended conventions

  • Perform only one task per module or macro.
  • Use logical groupings to separate code into blocks.
  • Double space between sections.
  • Group similar statements together.
  • Define new variables with the attrib statement to ensure that variable properties such as length, format, and label are correct, instead of allowing them to be determined implicitly by the circumstances in which they are initialized in the code.

Portability

Most organizations now work across multiple platforms, commonly combining Windows and Unix environments. There can be many occasions where code works on one platform and not on another. Portability is about more than working across multiplatform environments; it is also about making programs easier to use across projects. Below are some suggestions to address some of the most common impediments to portability.

  • Use rounding in newly created variables (if applicable) to avoid results that differ between platforms, e.g. between 64-bit and 32-bit operating systems.
  • Avoid explicitly defining file paths in libname, filename, and %include statements, since these require platform-specific syntax such as forward slashes or backslashes.
  • Avoid the use of X commands to execute statements directly on the operating system.
  • Avoid explicit project- or data-specific code by using macro variables where possible. An example of this is using macro variables to describe dosing groups in table headers instead of typing them out in the report section.
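The portability suggestions above can be sketched in a language-neutral way. The Python example below is an illustration under stated assumptions, not a prescribed implementation: the environment variable STUDY_ROOT, the function names, the directory layout and the treatment labels are all hypothetical.

```python
import os
from pathlib import Path

# Hypothetical environment variable holding the project location, so no
# platform-specific path is typed into the program itself.
PROJECT_ROOT = Path(os.environ.get("STUDY_ROOT", "."))

# Project-specific text kept in one place (the analogue of macro
# variables) instead of being typed out in the report section.
TREATMENT_LABELS = {"A": "Placebo", "B": "Drug X 10 mg"}


def dataset_path(library: str, name: str) -> Path:
    """Build a dataset location without hard-coding path separators."""
    return PROJECT_ROOT / library / f"{name}.csv"


def derive_bmi(weight_kg: float, height_cm: float) -> float:
    """Round the derived value so that tiny floating-point differences
    between platforms do not surface in the stored result."""
    height_m = height_cm / 100
    return round(weight_kg / (height_m ** 2), 1)


def column_header(arm: str, n: int) -> str:
    """Assemble a table column header from the shared label lookup."""
    return f"{TREATMENT_LABELS[arm]} (N={n})"
```

With this layout, moving the program to another platform or project means changing the configured root and the label lookup, not the processing code.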

Hard coding

A hardcode is program code that assigns a value to an informative variable based on a non-informative variable, for example overriding a data value for one specific subject identifier. Hardcoding may be done temporarily to get a program to run despite dirty data, or to correct for database incompatibilities. Permanent hardcoding to fix incorrect data values in a final database is strongly discouraged. If it is unavoidable, it must be approved by management and clearly documented, using standard comments and PUT statements to the log to show what has been hard coded.

Defensive programming

Defensive programming is an approach to programming intended to anticipate future changes in the data that might influence the coding algorithms. Ideally, programs should be written so that they continue to work correctly when new or unexpected data values appear which did not exist at the time the code was developed. Analysis dataset and table programs are often developed in the early stages of a project, or even when the only available data is test data. In these situations the data often does not contain all possible values of data points such as visits or time points, race values, and questionnaire responses, but the program must be able to handle those values when they appear in the data at a later point.
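One way to apply this is to give every lookup an explicit branch for values that were not present when the program was written. The Python sketch below is only an illustration: the list of race values, the logger name and the fallback sort order of 99 are assumptions made for the example.

```python
import logging

logger = logging.getLogger("ads_programming")  # hypothetical logger name

# Race values known when the program was written (illustrative list).
RACE_ORDER = {"WHITE": 1, "BLACK OR AFRICAN AMERICAN": 2, "ASIAN": 3}

# Assumed fallback: unknown values sort last so they remain visible.
UNEXPECTED_ORDER = 99


def race_sort_key(race: str) -> int:
    """Return the display order for a race value.

    Values that were not anticipated are neither dropped nor merged into
    an existing category; they are reported to the log and sorted last.
    """
    key = RACE_ORDER.get(race.strip().upper())
    if key is None:
        logger.warning("Unexpected race value: %r", race)
        return UNEXPECTED_ORDER
    return key


print(race_sort_key("Asian"))  # known value
print(race_sort_key("OTHER"))  # value that appeared later in the data
```

The design choice is that an unexpected value still reaches the output and leaves a message in the log, so a reviewer can decide how to handle it rather than discovering later that records were silently lost.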