SAS macro %progress

From PHUSE Wiki
Jump to: navigation, search

Macro %progress - Building Real-Time Feedback in the Data Step

Motivation

Developing a process to read back into SAS dozens of Summary Tables extracted as .txt files, and identify the various blocks of text present on each page - titles, headers (distinguishing between single column headers and multi-column spanning headers or "supra-headers"), table body, footnotes.
The identification of text lines corresponding to column headers can be done according to the number and maximum length of “blank fields” ( ≥2 spaces) found between non-space characters.
CALL PRXNEXT is used to search a line of text for matches against a regular expression (RegEx) pattern '/(?<!\s)\s{2,}(?!\s)/' that could be described as: 2 or more spaces, not immediately following a space, nor immediately followed by a space. It loops over all matches, until no further match is found, counting the number of matches and retaining their maximum length.
To avoid repetitive and somewhat bulky code in a data step, the CALL PRXNEXT loops are encapsulated into 2 PROC FCMP custom functions.
The program has to process a growing number of pages.

PhUSE EU Connect 2019 - CT03

This topic was presented at the PhUSE EU Connect 2019 conference in the stream Coding Tips & Tricks (CT).
Links:

Minimum reproducible Example

Creating a dummy page dataset

The code on this page is shared under the MIT license: http://opensource.org/licenses/MIT

*- Create a dataset with one dummy table page for demonstration, indexed by line number (l) -*;
data page(index = (l));
      infile cards truncover;
      input @1 line $char200.;
      l = _n_;
      output;
      put l @8 line $char.;
   cards;
   SPONSOR                                                                                                       CONFIDENTIAL
   INVESTIGATIONAL NEW DRUG / Indication                                                                                Final
   ST0012
                                                           Table 2.1.1    
                                 Demographics and Baseline Characteristics by Period 2 Treatment Group
                                                    Analysis Set: Safety Set 

                                                                      Period 2 Treatment Group
                                             ___________________________________________________________________
                                                          IND 2.5mg Q4H/   IND 4.5mg    IND 2.5mg    IND 2.5mg
                                                          IND 2.5mg Q4H +     Q4H/         Q4H/         Q4H/
                                                          IND 2.5mg Q4H/   IND 4.5mg    IND 2.5mg    IND 4.5mg
                                               PBO/PBO     IND 4.5mg Q8H      Q4H          Q4H          Q8H      All Subjects
    Variable                     Statistic       N=40          N=40           N=49         N=20         N=20         N=169

    Weight (kg)
      Min - 33.3pctile           n (%)        10 (25.0)   16  (40.0)      16  (32.7)    6  (30.0)    6  (30.0)   54  (32.0)
      (39.7 - 64.8)
      >33.3pctile - 66.6pctile   n (%)        30 (75.0)   12  (30.0)      14  (28.6)    6  (30.0)   10  (50.0)   72  (42.6)
      (>64.8 - 74.1)
      >66.6pctile - max          n (%)         0          12  (30.0)      19  (38.8)    8  (40.0)    4  (20.0)   43  (25.4)
      (>74.1 - 123.0)

    Height (cm)                  n               40           40              49           20           20          169
                                 Mean           163.13       167.31          165.74       166.78       167.84       166.30
                                 SD               7.15         6.65            7.40         6.63         6.79         7.07
                                 Median         162.50       167.30          165.50       166.25       168.60       166.30
                                 Min            157.0        155.0           149.2        155.0        155.4        149.2
                                 Max            170.5        186.0           180.6        181.5        186.0        186.0

   _______________________________________________________________________________________________________________________
   BMI=Body Mass Index, BSA=Body Surface Area, IND=Investigational New Drug, IA=Interim Analysis, F=Female, M=Male, 
   Max=Maximum, Min=Minimum, PBO=Placebo, Q4H=Every 4 Hours, Q8H=Every 8 Hours, SD=Standard Deviation, pctile=Percentile.
   Note: Only data assessed during blinded treatment is presented in this table.
   References: Listing 2.1.1, Listing 2.3.1

   Program: t_dm211, 2019-09-01 at 14:24                                                                          Page x of y
;
run;

Expanding to 10,000 pages

*- duplicate the page data 10,000 times by re-reading the page dataset repeatedly -*;
data page10000(drop = start elapsed);
   label p = 'page number' l = "line number";
   put nobs=;
   start = time();
   do p = 1 to 10000; *- loop over the desired number of pages -*;
      elapsed = time()-start;
      if mod(p, 1000) = 0 then put p= elapsed= time12.3;
      do l = 1 to nobs; *- loop over each line of the page -*; 
         set page key = l nobs = nobs; *- the key option allows reading the matching observation using the dataset index -*;
         if l = nobs then substr(line, 105) = put(catx(' ', 'Page', p, 'of', 10000), $20.-r); *- replace "Page x of y" by the actual numbers -*;
         output;
      end;
   end;
   stop;
run;


PROC FCMP functions


proc fcmp outlib=work.functions.prx; 

   *- Function to return the number of times a PRX pattern has matches in a given string-*;

   function PRXNMATCH(pattern $, text $) ;
      prx=prxparse(pattern);
      start = 1;
      stop = lengthn(text);
      matchnum = 0;
      pos = 0;
      len = 0;
      if (stop > 0) then do until(pos = 0); 
         ini=start;
         call prxnext(prx, start, stop, text, pos, len); 
         *- CALL PRXNEXT searches a string 'text' for a pattern match (PERL regular Expression parsed as 'prx')
            multiple times in succession, between the 'start' and the 'stop' positions (initially set to first 
            and last positions in the string).  When a match is found, the starting position 'pos' and length 'len'
            of the matched substring are updated, and the 'start' for the next search is positionned at the first
            character after the end of matched substring.  When no match is found, the position 'pos' and length 'len'
            are set to 0, and the value of 'start' is left unchanged. -*;
         if (len>0) then matchnum+1; *- count the number of matches -*;
      end;  
      return(matchnum); *- return the number of matches -*;
   endsub;

      *- Function to return the maximum length of all matches a PRX pattern has in a given string-*;
   function PRXMAXMATCHLEN(pattern $, text $) ;
      prx = prxparse(pattern);
      start = 1;
      stop = lengthn(text);
      pos = 0;
      len = 0;
      maxlen = 0;
      if (stop > 0) then do until(pos = 0); 
         ini = start;
         call prxnext(prx, start, stop, text, pos, len);
         if (len>maxlen) then maxlen = len;  *- store the max length of all matches found in the current string -*;
      end;  
      return(maxlen); *- return the maximum length of all matches -*;
   endsub;

run;
quit;

option cmplib = work.functions;


Processing various number of pages


option nofullstimer nomprint;
option mcompilenote = all;

*******************************************************************************************************************;

*- Define a macro to process a certain number of pages, 
   and report the time and number of pages processed per second when a condition is met
   (by default: when the last page has been processed) -*;

%macro process(pages = 1, report_when = last);
   data pages&pages(drop = starttime elapsed pps status)
        status&pages(keep = p elapsed pps status);
      retain starttime;
      length status $100;
      if _n_ = 1 then starttime = time();

      set page10000 (where = ( p <= &pages )) end = last;
      by p l;
      *- To help separate titles, table header, body and footnotes lines,
         we will identify the number and largest size of embedded blank fields 
        (at least 2 consecutive spaces not preceded nor followed by any other space) 
         per line (excluding leading and trailing blanks) -*;
      blankFieldsNum  =      PRXNMATCH('/(?<!\s)\s{2,}(?!\s)/', cats(line));  *- number of 'blank' fields of at least 2 consecutive spaces in current LINE -*;
      maxBlFieldWidth = PRXMAXMATCHLEN('/(?<!\s)\s{2,}(?!\s)/', cats(line));  *- max width of all 'blank' fields of at least 2 consecutive spaces in current LINE -*;
      output pages&pages;

      if last.p then do; *- calculate status and save it at the end of every page -*;
         elapsed = time() - starttime;
         if elapsed > 0 then pps = p / elapsed;
         status = "Processed " || put(p, comma6.) || " pages in " || put(elapsed, time12.3) || ", i.e. " || put(pps, 8.2) || " pages per second";
         output status&pages;
         if &report_when 
            then put status;
      end;
      format starttime elapsed time12.3 pps 8.2;
   run;
%mend process;

option fullstimer;
%process(pages =    10);
%process(pages =   100);
%process(pages =  1000);

*- report status every 100 pages until 1000, then every 500 pages -*;
*
%process(pages = 10000
        ,report_when = (p <= 1000 and mod(p, 100) = 0) or (p > 1000 and mod(p, 500) = 0) );
 *- not run -*;

Processing Problem

Run times increase unexpectedly

Initial run Intermediate run Final run
150 pages 1000 pages 8000 pages
< 1 sec < 1 min >1h, still running!

Solution

Diagnosis: Real-time Monitoring

*- Simplified macro for real-time monitoring -*;
%macro progress(counter=_n_, check_cond=1, every_sec = 10, total=., ref=.);
  retain _start_time _check_time _check_count _check_speed _remain_time;
  if _start_time = . then do;
     _start_time  = time();    _check_time  = time() - &every_sec +1;
     _check_count = &counter;  _check_speed = .X; 
  end;
  _elapsed_time = time() - _start_time;
  if (time() > _check_time + &every_sec) and (&check_cond) then do;
    _check_speed = round((&counter - _check_count)/(time() - _check_time), .1);
    _check_time = time(); _check_count = &counter; _msg_now = "Y";
  end;
  if &total>0 and _check_speed>0 
     then _remain_time = (&total - &counter) / _check_speed;
  _msg_ = catx(" ", "&counter:", &counter, "/", &total, "- Elapsed:",
               put(_elapsed_time, time8.), "- Remains:", put(_remain_time,
               time8.), "- Speed:", _check_speed, "/s", "- Ref:", &ref);
  if getoption("DMS") = "DMS" then do; 
     window progresswin rows=1 columns=100 #1 @3 _msg_;
     display progresswin noinput; 
  end; else if _msg_now="Y" then put _msg_;
%mend progress;

*- demo on 1000 pages -*;
%let pages=1000;
data text2;
  set page10000 (where = ( p <= &pages ));
  by p l;
  blankFieldsNum = PRXNMATCH('/(?<!\s)\s{2,}(?!\s)/', cats(line));
  maxBlFieldWidth = PRXMAXMATCHLEN('/(?<!\s)\s{2,}(?!\s)/', cats(line));
  %progress(counter = p, check_cond = last.p, total = &pages );  
run; 

Treatment: Real-time Monitoring + Stop-and-Restart Loop



*- Simplified macro for real-time monitoring + stop-and-restart loop -*;
%macro progress(counter=_n_, check_cond=1, every_sec = 10, total=., ref=., 
  stop_pct_speed = 10, stop_statement = %bquote(stop;), stop_counter = stop_n); /*- new line -*/
  retain _start_time _check_time _check_count _check_speed _remain_time 
         _max_speed _pct_max_speed;                                             /*- new line -*/
  if _start_time = . then do;
     _start_time  = time();    _check_time  = time() - &every_sec +1;
     _check_count = &counter;  _check_speed = .X; 
  end; 
  _elapsed_time = time() - _start_time;
  if (time() > _check_time + &every_sec) and (&check_cond) then do;
    _check_speed = round((&counter - _check_count)/(time() - _check_time), .1);
    _check_time = time(); _check_count = &counter; _msg_now = "Y";
    if _check_speed > _max_speed then _max_speed = _check_speed;                /*- new line -*/
    _pct_max_speed = 100 * _check_speed / _max_speed;                           /*- new line -*/
  end;
  if &total>0 and _check_speed>0 
     then _remain_time = (&total - &counter) / _check_speed;
  _msg_ = catx(" ", "&counter:", &counter, "/", &total, "- Elapsed:",
               put(_elapsed_time, time8.), "- Remains:", put(_remain_time,
               time8.), "- Speed:", _check_speed, "/s", "- Ref:", &ref);
  if getoption("DMS") = "DMS" then do; 
     window progresswin rows=1 columns=100 #1 @3 _msg_;
     display progresswin noinput; 
  end; else if _msg_now="Y" then put _msg_;
  if (&check_cond)                                                              /*- new line -*/
    then call symputx("&stop_counter", &counter.);                              /*- new line -*/
  if .Z < _pct_max_speed <= &stop_pct_speed then do;                            /*- new line -*/
    put "Not" "ice: Stopping at " _msg_;                                        /*- new line -*/
    %unquote(%superq(stop_statement));                                          /*- new line -*/
  end;                                                                          /*- new line -*/
%mend progress;

*- demo on 10000 pages -*;
%macro stop_and_restart(pg = 10000);
  %let compl_pages=0; %let iter=0; %let start_t=%sysfunc(time());
  proc datasets nolist; delete text2; quit; *- first append to empty dataset -*; 
  %do %until(&compl_pages >= &pg);
    %let iter = %eval(&iter + 1);
    data __text2;
      set page10000(where=(&compl_pages< p<=&pg));
      by p l;
      blankFieldsNum = PRXNMATCH('/(?<!\s)\s{2,}(?!\s)/', cats(line));
      maxBlFieldWidth = PRXMAXMATCHLEN('/(?<!\s)\s{2,}(?!\s)/', cats(line));
      output; *- ouput last obs before stop -*;
      %progress(counter=p, check_cond=last.p, total=&pg, stop_pct_speed=30 
               ,stop_counter=compl_pages, ref=&iter);  
    run; 
    proc append base=text2 data=__text2; run; 
  %end;
  %let eltm=%sysevalf(%sysfunc(time())-&start_t); %let spd=%sysevalf(&pg/&eltm);
  %put Processed &pg pages in &iter iterations, &eltm sec, speed: &spd p/sec.;
%mend stop_and_restart;
%stop_and_restart;

Performance


*- Computing Elapsed Times and Speed by number of pages -*;
data stats(keep=p elapsed _check_speed _check_time start_time rename=(_check_speed=speed));
   set text2(where=(_check_speed > 0 and l=37));
   retain start_time;
   if _n_=1 then start_time = _start_time;
   elapsed = _check_time - start_time;
   format elapsed time8.;
run;

data stats;
   set stats;
   by elapsed;
   if first.elapsed;
run;

%let avg_speed = .;
data _null_;
   set stats end=last;
   if last;
   avg_speed = round(p / elapsed, 0.01);
   call symputx('avg_speed', avg_speed);
run;

*- plotting the results -*;
proc sgplot data = stats;
   series x = p y = speed / lineattrs=(color=blue); 
   refline &avg_speed / axis = y lineattrs=(pattern=dot thickness=0.5% color='light blue') 
                        label=("Avg speed: &avg_speed") labelattrs=(Family=Arial Size=9 Weight=Bold color='light blue') labelloc=inside labelpos = min;
   series x = p y = elapsed/ y2axis lineattrs=(thickness=0.8% color=red); 
   label p = "Pages" speed = "Pages per second" elapsed = "Elapsed time";
   xaxis   valueattrs=(Family=Arial Size=9 Weight=Bold) labelattrs=(Family=Arial Size=10 Weight=Bold);
   yaxis   valueattrs=(Family=Arial Size=9 Weight=Bold color=blue) labelattrs=(Family=Arial Size=10 Weight=Bold color=blue);
   y2axis   valueattrs=(Family=Arial Size=9 Weight=Bold color = red) labelattrs=(Family=Arial Size=10 Weight=Bold color = red) valuesformat=time8.;
   keylegend  / noborder  valueattrs=(Family=Arial Size=9 Weight=Bold);
run;


proc sgplot data = stats;
   series x = elapsed y = speed ; 
   refline &avg_speed / axis = y lineattrs=(pattern=dot thickness=0.5% color='light blue') 
                        label=("Avg speed: &avg_speed") labelattrs=(Family=Arial Size=9 Weight=Bold color='light blue') labelloc=inside labelpos = min;
   series x = elapsed y = p / y2axis lineattrs=(thickness=0.8% color=red); 
   label p = "Pages" speed = "Pages per second" elapsed = "Elapsed time";
   xaxis   valueattrs=(Family=Arial Size=9 Weight=Bold) labelattrs=(Family=Arial Size=10 Weight=Bold) valuesformat=time8.;
   yaxis   valueattrs=(Family=Arial Size=9 Weight=Bold color=blue) labelattrs=(Family=Arial Size=10 Weight=Bold color = blue);
   y2axis   valueattrs=(Family=Arial Size=9 Weight=Bold color = red) labelattrs=(Family=Arial Size=10 Weight=Bold color = red);
   keylegend  / noborder  valueattrs=(Family=Arial Size=9 Weight=Bold);
run;


  • 10000 pages processed in 00:06:48
  • 14 iterations
  • Avg: 25 p./sec
  • 18.7 x faster than initially
  • Avg speed (per iteration) stable over time


Root Cause

  • Compiling Perl Regular Expressions (RegEx) eats up memory
  • When short of memory, the system starts swapping memory blocks to disk – a much slower way of allocating memory

RegEx in the Data Step

  • A Constant RegEx is compiled only once
  • The memory used by the compiled RegEx is freed automatically at the end of the Data Step.

Note: in SAS Help, the example for PRXPARSE() does not include a call to PRXFREE() in the Data Step.

RegEx in our FCMP functions

  • The RegEx is passed as a variable (though its value remains constant)
  • It is recompiled at each call (twice per line x 37 lines/page x 10000 pages i.e. 740000 times)
  • The memory used by each compilation is never freed up until end of SAS session (since CALL PRXFREE is not called)

Optimized FCMP Functions

principles

  • The previously used RegEx (character) and the corresponding compiled RegEx IDs (numeric) are stored in 2 STATIC* arrays (*: i.e. their values are retained across calls)
  • At each call, the character RegEx passed as argument is compared to the stored values in the array; if found, the corresponding compiled RegEx ID is used; otherwise, the new RegEx is compiled and added to both STATIC arrays for reuse. When arrays are full, old REgEx entries are discarded and memory is released with CALL PRXFREE.
  • The max number of RegEx stored in memory = arrays size (e.g. n = 10)

code


*- Use PROC FCMP to create user-defined functions
   that we may easily call multiple times in our data steps.
   This reduces the amount of code and makes it more readable and maintainable. -*;

proc fcmp outlib=work.functions.prx; 

   *- OPTIMIZED Function to return the number of times a PRX pattern has matches in a given string-*;

   function PRXNMATCH(pattern $, text $) ;
     *prx=prxparse(pattern);
      array _pattern[10]  $400 / nosymbols ;
      array _prx[10] / nosymbols ;
      static _pattern _prx;
      i=0;
      do until(_pattern[i]=pattern);
         i+1;
         if i > dim(_pattern) then do;
            put "WAR" "NING:(PRXNMATCH): Reached max dim of arrays: " dim(_pattern);
            /*
            if i >= 100 then do;
               call dynamic_array(_pattern, i);  *- Argument number 1 to the subroutine DYNAMIC_ARRAY must be a numeric array. // fails with character arrays. -*;
               _pattern[i]=" ";
               call dynamic_array(_prx, i);
               _prx[i]=.;
            end;
            */
            i=dim(_pattern);
            prx=_prx[i];
            call prxfree(prx);
            _prx[i]=.;
            _pattern[i]=' ';
         end;
         if _prx[i]<=0 then do;
            _pattern[i]=pattern;
            prx=prxparse(pattern);
            _prx[i]=prx;            
         end;
      end;
      prx=_prx[i];
      start = 1;
      stop=lengthn(text);
      matchnum=0;
      pos=0;
      len=0;
      if (stop > 0) then do until(pos=0); 
         ini=start;
         call prxnext(prx, start, stop, text, pos, len);
         if (len>0) then matchnum+1;
      end;  
     *call prxfree(prx);
      return(matchnum);
   endsub;

   *- OPTIMIZED Function to return the maximum length of all matches a PRX pattern has in a given string-*;
   function PRXMAXMATCHLEN(pattern $, text $) ;
     *prx=prxparse(pattern);
      array _pattern[10]  $400 / nosymbols ;
      array _prx[10] / nosymbols ;
      static _pattern _prx;
      i=0;
      do until(_pattern[i]=pattern);
         i+1;
         if i > dim(_pattern) then do;
            put "WAR" "NING:(PRXNMATCH): Reached max dim of arrays: " dim(_pattern);
            /*
            if i >= 100 then do;
               call dynamic_array(_pattern, i);  *- Argument number 1 to the subroutine DYNAMIC_ARRAY must be a numeric array. // fails with character arrays. -*;
               _pattern[i]=" ";
               call dynamic_array(_prx, i);
               _prx[i]=.;
            end;
            */
            i=dim(_pattern);
            prx=_prx[i];
            call prxfree(prx);
            _prx[i]=.;
            _pattern[i]=' ';
         end;
         if _prx[i]<=0 then do;
            _pattern[i]=pattern;
            prx=prxparse(pattern);
            _prx[i]=prx;            
         end;
      end;
      prx=_prx[i];
      start = 1;
      stop=lengthn(text);
      matchnum=0;
      pos=0;
      len=0;
      maxlen=0;
      if (stop > 0) then do until(pos=0); 
         ini=start;
         call prxnext(prx, start, stop, text, pos, len);
         if (len>maxlen) then maxlen=len;
      end;  
     *call prxfree(prx);
      return(maxlen);
   endsub;


run;
quit;

option cmplib = work.functions;

*- report status every 100 pages until 1000, then every 500 pages -*;
%process(pages = 10000, prefix=opt
        ,report_when = (p <= 1000 and mod(p, 100) = 0) or (p > 1000 and mod(p, 500) = 0) );

*- plot the results -*;
proc sgplot data = optstatus10000;
   series x = p y = pps /  lineattrs=(color=blue thickness=0.8%); 
   series x = p y = elapsed/ y2axis lineattrs=(color=red thickness=0.8%); 
   label p = "Pages" pps = "Pages per second" elapsed = "Elapsed time";
   format elapsed time8.;
   xaxis   valueattrs=(Family=Arial Size=9 Weight=Bold) labelattrs=(Family=Arial Size=10 Weight=Bold);
   yaxis   valueattrs=(Family=Arial Size=9 Weight=Bold color=blue) labelattrs=(Family=Arial Size=10 Weight=Bold color=blue);
   y2axis   valueattrs=(Family=Arial Size=9 Weight=Bold color = red) labelattrs=(Family=Arial Size=10 Weight=Bold color = red);
   keylegend  / noborder  valueattrs=(Family=Arial Size=9 Weight=Bold);
run;

*- Plot number of pages processed  and speed by elapsed time -*;
proc sgplot data = optstatus10000 ;
   xaxis valuesformat=time8. valueattrs=(Family=Arial Size=9 Weight=Bold) labelattrs=(Family=Arial Size=10 Weight=Bold);
   yaxis   valueattrs=(Family=Arial Size=9 Weight=Bold) labelattrs=(Family=Arial Size=10 Weight=Bold);
   y2axis   valueattrs=(Family=Arial Size=9 Weight=Bold) labelattrs=(Family=Arial Size=10 Weight=Bold);

   series x = elapsed y = pps ; 
   series x = elapsed y = p / y2axis; 
   format elapsed time8.;
   label p = "Pages" pps = "Pages per second" elapsed = "Elapsed time";
   keylegend  / noborder  valueattrs=(Family=Arial Size=9 Weight=Bold);
run;

performance

  • 10000 pages processed in 00:00:02.06
  • Avg: 4854 p./sec
  • 198 x faster than Stop-and-Restart loop
  • 3700 x faster than originally
  • Avg speed stable over time


Conclusions

Monitoring the progress of SAS execution can be done:

  • In Data Step, Macro (loop), and a few Procedures supporting FCMP functions.
  • Real-time feed-back can be provided, especially with the interactive Display Manager.


Monitoring the execution speed in a data step within a Stop-and-Restart loop can lead to a large gain in performance:

  • when the original processing speed decreases over time
  • without identifying the root cause nor optimizing the code.


Identifying the Root Cause and Optimizing the code are still worth doing:

  • when the process has to be repeated time and again.


PROC FCMP functions, though powerful, are NOT as much optimized for ease of use as the data step equivalent code

  • they have scarcely documented limitations
  • some improvements by the SAS Institute would be much appreciated.


Author

Jean-Michel Bodart
Business & Decision Life Sciences
Rue Saint-Lambert 141
1200 Brussels
Belgium
http://www.businessdecision-lifesciences.com
--Jmbodart (talk) 10:11, 20 September 2019 (EDT)