Subsetting and Resampling in SAS

This usage note describes the SAS code needed to subset or resample a SAS dataset. Such code can form the basis of jackknife or bootstrapping macros. This document is designed for those who have some SAS experience and who are also familiar with basic regression methods and assumptions.

Overview

Described in this document is the code necessary to perform the following tasks:

* selection of successive observations

* selection of specific observations

* random selection of an approximate proportion of observations

* random selection of a specific number or proportion of observations

* jackknife ("leave-one-out") selection

* repeated split-sample selection

* random selection with replacement (bootstrapping)

This document explains the code sequentially: you may need to refer to an earlier technique to understand a later one. Each set of code assumes you are using a SAS dataset alldata as input and are creating two new datasets (analysis and holdout), each consisting of subsets of alldata.

Selecting Successive Observations

The following code divides alldata into analysis, which will consist of the first 67 observations, and holdout, which will consist of the remaining observations.
DATA analysis holdout;
SET alldata;
IF 1 <= _N_ <= 67 THEN OUTPUT analysis;
ELSE OUTPUT holdout;
RUN; 
_N_ is a SAS automatic variable; in this DATA step it keeps a count of the observations being read from alldata.
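If you only need contiguous blocks of observations, the same split can also be written with the OBS= and FIRSTOBS= dataset options; the following is a minimal sketch of that variation using two DATA steps.
DATA analysis;
SET alldata (OBS = 67);        /* read observations 1 through 67 only */
RUN;
DATA holdout;
SET alldata (FIRSTOBS = 68);   /* start reading at observation 68 and continue to the end */
RUN;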

Selecting Specific Observations

The following code demonstrates how to create a dataset whose observations meet a specific condition. In this example, analysis consists only of those observations in alldata having the value f for the variable sex, and holdout consists only of those observations having the value m. Note that observations in alldata that have values other than f or m will not be present in either of the resulting datasets.
DATA analysis holdout;
SET alldata;
IF sex = 'f' THEN OUTPUT analysis;
ELSE IF sex = 'm' THEN OUTPUT holdout;
RUN;
The single quotes (apostrophes) around the values are required when you are matching values of a character variable. SAS comparisons of character values are case-sensitive, so be sure that the case (upper or lower) used in your code matches the case of the values stored in your data.
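If the case of the stored values might vary (for example, F as well as f), one simple variation is to standardize the case before comparing, using the UPCASE function; a minimal sketch follows.
DATA analysis holdout;
SET alldata;
IF UPCASE(sex) = 'F' THEN OUTPUT analysis;       /* matches f or F */
ELSE IF UPCASE(sex) = 'M' THEN OUTPUT holdout;   /* matches m or M */
RUN;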

Randomly Selecting an Approximate Proportion

The following code randomly draws an approximate proportion (here two-thirds) of the observations from alldata to make up the dataset analysis, with the remaining observations being placed in holdout.
DATA analysis holdout;
SET alldata;
IF RANUNI(0) <= 2/3 THEN OUTPUT analysis;
ELSE OUTPUT holdout;
RUN;
The RANUNI function generates a different random number from 0 to 1 (exclusive) for each observation. In the code above, if this number is less than or equal to 2/3, the observation is placed in analysis; otherwise, it is placed in holdout. The actual proportion of observations falling into the analysis dataset will be only approximately equal to the value indicated; in small datasets the actual proportion can be quite different from the coded proportion.

The RANUNI function's argument (here 0) is called the seed and can be any integer less than 2**31 - 1 (about 2.1 billion). A seed of 0 or less tells SAS to initialize the random-number stream from the computer's internal clock, so the selection will differ each time the program is run. If you want to be able to repeat a selection process exactly, assign the same positive integer to the seed in each replication.
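For example, the following sketch uses an arbitrary positive seed (20407) so that the same approximately two-thirds selection is drawn every time the step is run.
DATA analysis holdout;
SET alldata;
IF RANUNI(20407) <= 2/3 THEN OUTPUT analysis;   /* 20407 is an arbitrary positive seed */
ELSE OUTPUT holdout;
RUN;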

Randomly Selecting an Exact Proportion

The following code demonstrates how to select an exact proportion (here 67 of 100 observations).
DATA analysis holdout;
SET alldata;
RETAIN k 67 n 100;
IF RANUNI(358798) <= k/n THEN DO;
  k = k-1;
  OUTPUT analysis;
END;
ELSE OUTPUT holdout;
n = n-1;
DROP k n;
RUN;
The RETAIN statement initializes the variables k and n to the specified values (here 67 and 100) and then retains the modified values of k and n between the processing of observations. The variable n must be initialized to the number of observations in the original dataset (alldata). The variable k must be initialized to the integer that captures the desired proportion of the original dataset. For instance, if the original dataset contains 231 observations and you want two-thirds of that, then k=INT(2*231/3)=154. The ratio k/n changes as observations are written to the output datasets, forcing the correct proportions to be assigned. The DROP statement drops k and n from all datasets since they serve no further purpose.
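As an illustration of that arithmetic, the following sketch adapts the code to a hypothetical 231-observation dataset (here called alldata231) from which exactly two-thirds of the observations are wanted.
DATA analysis holdout;
SET alldata231;                  /* hypothetical dataset with 231 observations */
RETAIN k 154 n 231;              /* k = INT(2*231/3) = 154 */
IF RANUNI(358798) <= k/n THEN DO;
  k = k-1;
  OUTPUT analysis;
END;
ELSE OUTPUT holdout;
n = n-1;
DROP k n;
RUN;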

Repeated Selection: Jackknife, Split-Sample, and Bootstrap

In these three methods of intensive resampling, many subsamples are drawn from the original dataset. The jackknife method is also known as the "leave-one-out" method because it uses all but one observation in each subsample. The left-out (or, more commonly, held-out) observation changes with each subsample so that every observation is held out exactly once. In the split-sample technique, the original sample is repeatedly split into two or more subsets, with subset sizes depending on the researcher's needs. The bootstrap method creates subsets "with replacement". While each iteration of the jackknife and split-sample methods creates mutually exclusive and exhaustive subsets, each iteration of the bootstrap creates a subset that can contain multiple copies of an observation, while other observations from the original dataset may not occur in that subset at all.

The theory of these techniques is rapidly developing. The following examples show how to create such samples using SAS, but are not intended to show the best use of the samples.

A Jackknife Example

Model selection and validation can be accomplished with jackknifing by building a model based on all-but-one observation and then using this model to predict the dependent value for the held-out observation. Thus there are n models built, each using n-1 observations for model construction and the remaining observation for model validation.

The SAS macro shown below will make 100 iterations (one for each observation in the original dataset). In each iteration, it will:

1) hold out a different single observation as the only observation in the dataset holdout, and put the rest of the observations in the dataset analysis.

2) estimate the parameters of a given model by using the analysis dataset.

3) create a new dataset (parms) containing these parameter estimates.

Stop here if you are interested only in the distribution of the estimated parameters; continue if you are also interested in the distribution of the validity measures for these models (which differ in their parameter values). In each iteration, the remaining code:

4) uses the current parameter estimates to calculate a predicted score on the single observation in holdout (which was not used in the estimation of the parameters).

5) saves the predicted (yhat) and actual score (y) of the held-out observation in a new dataset (valcheck).

%MACRO jackknif;
%DO i = 1 %to 100;
  DATA analysis holdout; 
  SET alldata;
  IF _N_ = &i THEN OUTPUT holdout;
  ELSE OUTPUT analysis;
  RUN;
  PROC REG DATA = analysis  NOPRINT  OUTEST= outests 
        (KEEP = intercep x1  x2  x3  x4);
  MODEL y = x1 x2 x3 x4;
  RUN;
  PROC APPEND BASE= parms DATA=outests  FORCE;
  RUN;
  DATA check  (KEEP = y yhat);
  MERGE holdout outests
     (RENAME = (x1=x1h  x2=x2h  x3=x3h  x4=x4h));
  yhat = SUM(intercep,x1*x1h,x2*x2h,x3*x3h,x4*x4h);
  RUN;
  PROC APPEND BASE=valcheck DATA=check  FORCE;
  RUN;
%END;
%MEND;
The % symbol indicates that the SAS macro facility is involved. This macro, called jackknif, will perform 100 iterations of the code enclosed between the %DO and %END statements. The end of the macro is defined by the %MEND statement.

The first step in the macro loop creates a dataset containing all but one observation and another dataset containing only the remaining observation. The held-out observation is identified by the loop-counting variable &i.

The next step estimates the parameters of a specified model using PROC REG, and outputs only these estimated parameters to a dataset (outests). PROC APPEND then adds this dataset's single line of data to a new dataset (parms) containing the regression output produced by previous iterations.

The second DATA step computes the predicted value for the held-out observation. To do this, the parameter estimates, which PROC REG stores in outests under the same names as the model variables (x1-x4), must be renamed to distinguish them from the variables themselves in holdout. The dataset option RENAME applies only to the dataset that immediately precedes it (outests). Again, PROC APPEND is used to append the output from this iteration (the single line of data in the dataset check) to a dataset (valcheck) containing the output from all iterations.

When this code is submitted to SAS, the macro jackknif is defined. You will need to modify this code to fit your circumstances:

1) Replace alldata with the name of your dataset. Remember that you will need to have defined its libref in a LIBNAME statement if you are accessing a stored dataset.

2) Replace the loop-terminating value of 100 with the number of observations in your dataset.

3) Replace the model variables y and x1-x4 with the appropriate variables from your dataset.

Once the macro is defined for your situation, you will want to run it. Notice that the macro produces no output: you must decide how you want to use the datasets parms and valcheck. The following code runs the macro and then merely prints the two datasets.

%jackknif;
PROC PRINT DATA=parms; 
PROC PRINT DATA=valcheck;
RUN;
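The code below is one hedged sketch of a further step you might take: it computes the squared prediction error for each held-out observation in valcheck and summarizes it with PROC MEANS. The names valcheck2 and sqerr are illustrative choices, not part of the macro above.
DATA valcheck2;
SET valcheck;
sqerr = (y - yhat)**2;          /* squared prediction error for one held-out observation */
RUN;
PROC MEANS DATA = valcheck2 N MEAN;
VAR sqerr;                      /* the mean of sqerr summarizes the model's prediction error */
RUN;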
Note: The PRESS statistic is available as a MODEL statement option in the GLM, REG, and RSREG procedures. This "predicted residual" sum of squares is the sum of the squared residuals computed for the held-out observations. Because each residual comes from a fit that did not use the corresponding observation, PRESS gives a less optimistic estimate of the model's fit in the population than the ordinary residual sum of squares, which is computed from the same sample used to estimate the parameters. A brief example follows the references. For further information, see:

SAS/STAT User's Guide, Volume 1, Version 6, Fourth Edition. SAS Institute Inc., 1991.

Allen, D.M. "Mean Squared Error of Prediction as a Criterion for Selecting Variables", Technometrics, 13 (1971), 469-475.
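As a minimal illustration of the note above, the following sketch requests the PRESS statistic from PROC REG on the full dataset; it assumes the PRESS option of the MODEL statement is available in your release, so check your documentation for the exact form and where the statistic is displayed.
PROC REG DATA = alldata;
MODEL y = x1 x2 x3 x4 / PRESS;  /* requests the predicted residual sum of squares */
RUN;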

A Split-Sample Validation Example

The following code shows how to modify the jackknife example to create holdout datasets containing more than one observation. Here holdout contains a randomly selected 33 of the 100 observations in alldata, while analysis contains the remaining 67. Since 20 such pairs of samples are created, the summary dataset parms will contain 20 lines (one per iteration), and valcheck will contain one line for each held-out observation in each iteration (20 x 33 = 660 lines).
%MACRO sampler;
%DO i = 1 %to 20;
  DATA analysis holdout;
  RETAIN k 67 n 100;
  SET alldata;
  IF RANUNI(&i+234334) <= k/n THEN DO;
     OUTPUT analysis;
     k = k-1;
  END;
  ELSE OUTPUT holdout;
  n = n-1;
  RUN;
  
  PROC REG DATA = analysis NOPRINT OUTEST= outests 
     (KEEP= intercep  x1  x2  x3  x4);
  MODEL y = x1 x2 x3 x4  ;
  RUN;
  PROC APPEND BASE = parms DATA = outests;
  RUN;
  
  DATA check  (KEEP= yhat y sample);
  /* read the parameter estimates once and retain them for every held-out observation */
  IF _N_ = 1 THEN SET outests
     (RENAME = (x1=x1h x2=x2h x3=x3h x4=x4h));
  SET holdout;
  sample = &i;                  /* records which iteration produced this line */
  yhat = SUM(intercep,x1*x1h,x2*x2h,x3*x3h,x4*x4h);
  RUN;
  PROC APPEND BASE = valcheck DATA = check;
  RUN;
%END;
%MEND;
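Like jackknif, the sampler macro produces no printed output of its own. One way to run it and inspect the two summary datasets is sketched below.
%sampler;
PROC PRINT DATA = parms;
PROC PRINT DATA = valcheck;
RUN;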

A Bootstrap Example

The bootstrap calls for an arbitrary number of random samples to be drawn from the original dataset, with replacement: each sample unit is "replaced" in the dataset so that it may be chosen again in the next random selection.

The following example code will:

1) take a random sample, with replacement, of the same size as the original dataset.

2) estimate the parameters of a specified model using this resample.

3) save the parameter estimates from the resample model in a new dataset (outests).

4) append the single line in outests to the summary dataset parms.

%MACRO boot;
%DO i = 1 %to 20;
  DATA analysis;
  choice = INT(RANUNI(23456+&i)*n)+1;
  SET alldata POINT = choice NOBS = n;
  j+1;
  IF j > n THEN STOP;
  RUN;
  PROC REG DATA = analysis NOPRINT  OUTEST = outests        
             (KEEP= intercep  x1 x2  x3  x4 );
  MODEL y = x1 x2 x3 x4;
  RUN;
  PROC APPEND BASE = parms DATA = outests;
  RUN;
%END;
%MEND;
There is a danger in this code: because the POINT= option reads observations by direct access, SAS cannot detect an end-of-file condition, so the DATA step will loop indefinitely (running up CPU charges) unless an explicit and appropriate STOP statement is executed.

The value of choice is set to a different random number from 1 to n (the size of the alldata dataset) with each pass through the DATA step. The POINT= option selects the corresponding observation for output to the analysis dataset.

This process of adding observations from alldata to analysis stops when the counter j exceeds the coded limit. Here, this limit has been set to the number of observations in alldata by use of the NOBS= option.

When the analysis dataset is complete, the parameters of the specified model are estimated from it, and these estimates are appended to a summary dataset (parms).

This process is repeated with each iteration. Since the number of iterations is set to 20 in this example, the dataset parms will contain 20 observations.
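One common use of parms, sketched below under the assumption that the spread of the estimates across resamples approximates their sampling variability, is to compute the mean and standard deviation of each bootstrapped parameter estimate (with only 20 resamples this is, of course, a rough illustration).
PROC MEANS DATA = parms N MEAN STD;
VAR intercep x1 x2 x3 x4;       /* STD approximates each estimate's bootstrap standard error */
RUN;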

Note: If your purpose in using bootstrap resampling is only to adjust p-values to account for multiple hypothesis testing on a single sample, SAS offers a BOOTSTRAP option in the MULTTEST procedure. This subject, however, is beyond the scope of this usage note. See SAS Technical Report P-229, SAS/STAT Software: Changes and Enhancements, Release 6.07. SAS Institute Inc., 1992.

Additional Information

Efron, Bradley. The Jackknife, the Bootstrap, and Other Resampling Plans. Society for Industrial and Applied Mathematics, 1982.

Efron, Bradley and Tibshirani, Robert J. An Introduction to the Bootstrap. Chapman and Hall, 1993.

Hjorth, J.S. Urban. Computer Intensive Statistical Methods. Clays Ltd., 1994.

SAS Language: Reference, Version 6, First Edition. SAS Institute Inc., 1990.

SAS Guide to Macro Processing, Version 6, Second Edition. SAS Institute Inc., 1990.

SAS Language and Procedures, Usage 2, Version 6, First Edition. SAS Institute Inc., 1991.

Online Help

Some SAS examples that illustrate the bootstrap are also available in the SAS online help.