Are you looking for a more efficient way to manage your SAS datasets? Would you like to learn a more modern SAS Procedure that can replace 3 older SAS Procedures?
PROC DATASETS is not only a very useful tool to manage, manipulate and modify your SAS datasets, but it is often much more efficient than preforming the same tasks with a Data Step. The efficiency of PROC DATASETS comes from the fact that it does not not need to read in or write observations of a dataset in order to make modifications to it.
In this article, we will show you 5 ways to manipulate your SAS datasets with PROC DATASETS. We will also show you how you can use a single procedure, PROC DATASETS, to complete a variety of tasks that you would otherwise need to use PROC CONTENTS, PROC APPEND and PROC COPY to do. The result is that you only need to learn the syntax for one procedure (PROC DATASETS) but can take advantage of the functionality of all three (PROC CONTENTS, PROC APPEND and PROC COPY).
Before we continue, make sure you have SAS Studio or SAS 9.4 installed. Don't have the software? Download SAS Studio now. It's free!
The examples used in this article are based on the CLASS, CLASSFIT and BWEIGHT datasets from the SASHelp library.
To start, we will first use SAS data step to add 2 datasets to the WORK library for use in some of the examples that following this article. To create the temporary datasets and save them to your WORK library, submit the following code:
After running the code above, the CLASS and CLASSFIT dataset will remain in your WORK directory for the duration of your SAS session. If you need to close and re-open your SAS session for any reason, or would like a clean, modified version of these datasets in your session simply re-run the code above to re-create these 2 datasets in your WORK library.
[Don't have the software yet? Download SAS Studio here for free.]
1. View Contents of a SAS Library
One of the simplest tasks that you can perform with PROC DATASETS is to output a list of datasets found in a SAS library. While some of the following tasks can be performed with PROC CONTENTS, all of the functionality of PROC CONTENTS is built directly into PROC DATASETS, alleviating the need for you to learn both procedures.
After running the setup code above, you should have 2 datasets in your SAS WORK library, CLASS and CLASSFIT
Using Proc Datasets, you can verify the existence of those 2 datasets in the WORK library with the following code:
Running this code will produce the output shown here in the results window:
You will notice that in addition to the two highlighted datasets we were expecting to see, there are other Member Types listed in the output window. To reduce the output to show only datasets, you can add the memtype option and specify data.
Below, you can see the output now only includes datasets:
To view a list of variables and their attributes within a SAS dataset, you can add the CONTENTS statement within your call to PROC DATASETS and specify the dataset for which you would like to see the attributes for:
Similar to the output you would get from PROC CONTENTS, this contents statement will, among other details about your dataset, produce a list of the variables and their attributes:
Unlike most SAS Procedures, you have likely noticed that PROC DATASETS ultimately ends with a quit statement. This is because PROC DATASETS supports what is known as run-group processing. This allows you to submit multiple run groups within the procedure without ending the procedure.
To demonstrate this with an example, you can combine two contents statements into a single call of PROC DATASETS. The following code will produce a list of the datasets found in the WORK library, along with the list of variables and their attributes for both the CLASS and CLASSFIT datasets:
To stop the procedure and move on to the next part of your SAS program, it is common practice to submit a quit statement at the end. Alternatively, your PROC DATASETS call will also end the next time a RUN CANCEL, new PROC statement or new DATA statement are found in your program.
2. Combine SAS Datasets
The most common way to combine two datasets together in SAS is using the SET statement within a Data Step. Using PROC DATASETS, you can accomplish the same task as the SET statement but in a more efficient way.
By incorporating the abilities of PROC APPEND into PROC DATASETS, two datasets can be combined more efficiently because SAS only needs to read in the observations from the dataset being appended. When working with many records or “Big Data”, this can be huge time saver.
As a simple example, let’s say you wanted to append a copy of the CLASS dataset to itself, so that you have 2 records for every student in the class.
This can be done by using the APPEND statement as shown below. The out option specifies the name of the dataset that you are planning to append to. The data option specifies the dataset which you would like to add. If the dataset specified in out does not yet exist, SAS will create it for you.
To verify the results, you can use PROC DATASETS again to check that you now have double the number of observations (38) in the WORK.CLASS dataset:
Which will produce the following:
An important note when using the APPEND statement is that unlike when you combine datasets using a Data Step SET statement, PROC DATASETS will stop running and produce an error if there is a mismatch in the variables found the two datasets you are trying to combine.
To show this with an example, first re-create your WORK.CLASS and WORK.CLASSFIT datasets by running the following:
Now, try to append WORK.CLASSFIT to WORK.CLASS with this code:
You will notice that you get the following error message in your SAS Log:
If you want to ignore this error and append the datasets anyway (do so with caution!), you simply need to add the force option.
By adding a contents statement to the same PROC DATASETS call, we can also verify the results of our forced append before including the quit statement.
By looking at the SAS log (shown below) or the results (not shown), you can verify that you now have 38 observations in your CLASS dataset:
3. Copying, Moving, and Deleting Datasets
As mentioned earlier, PROC DATASETS also includes all the functionality of PROC COPY, allowing you to easily copy and move your datasets between libraries. In addition to copying and moving datasets, you can also use PROC DATASETS to delete datasets within a library.
First, let’s make sure you have the same two datasets in your work directory as described earlier and also create the library MYDATA. Note that the path to MYDATA will need to be modified to point to a location on your computer.
Next, you can use the COPY statement within PROC DATASETS to copy all the datasets from your temporary WORK library to the permanent library called MYDATA.
You can quickly verify that CLASS and CLASSFIT datasets were successfully copied by using PROC DATASETS as described earlier:
You should see now both datasets listed in the Results window:
If you only wanted to copy a selection of datasets from one library to another, you can add the SELECT statement to specify which datasets you would like to copy:
Again, you can verify the copy by running PROC DATASETS to see if the CARS and BASEBALL datasets are now found in the MYDATA library:
You also have the option to move datasets from one library to the other.
For this example, first ensure you have the CLASS and CLASSFIT datasets in your WORK directory by running the following code:
To move datasets (i.e. copy a dataset from one location to another and then delete it from the original location), you simply add the move option. In this example, you can use the move option in the COPY statement to move the CLASS dataset to the MYDATA library:
You can then verify with PROC DATASETS again that CLASS is no longer found in the WORK library:
You should now see that it is found in the MYDATA library (along with CLASSFIT, CARS and BASEBALL if you did not yet delete them from the earlier example):
In addition to copying and moving datasets, PROC DATASETS also allows you to programmatically delete datasets without having to use Windows Explorer or other external commands.
To delete a selection of one or more datasets, simply specify which library the datasets reside in that you would like to delete with the lib option, and then use the DELETE statement to list the dataset name(s).
In this example, again first ensure that you have the CLASS and CLASSFIT datasets in your WORK directory by running the following:
To delete the CLASS dataset, simply use the lib=work option to indicate that the dataset you would like to delete is in the WORK library along with a SELECT statement to specify which dataset you would like deleted. The memtype=data option is again included to ensure only datasets are listed in the Results window:
You will see in the output produced that the CLASS dataset is no longer listed:
If you would like to delete all datasets found in a library, you can simply use the kill option. This option should of course be used with extreme caution, as it will remove every dataset in that library without any prompting!
To remove all the datasets in the WORK library, simply use a PROC DATASETS statement pointing to the WORK library with the kill option included:
If you now try to list all the datasets in WORK with PROC DATASETS again, you will get a WARNING in your log indicating that no datasets can be found:
4. Modifying Dataset and Variable Attributes
While the previous examples demonstrated how a single procedure, PROC DATASETS, can be used in place of PROC CONTENTS, PROC APPEND and PROC COPY, the advantage is mostly around the fact that you only need to remember to use 1 procedure rather than 3.
When it comes to modifying datasets, PROC DATASETS can not only perform many of the dataset modification tasks that a SAS Data Step can, it can do them much faster and more computationally efficient. The reason is that no observations are read in or written out during processing, making PROC DATASETS an excellent tool for handing “Big Data”.
There are many ways in which you can modify your datasets using PROC DATASETS. Some of the most common and useful include labeling datasets and variables or changing variable informats and formats.
Using the CLASS dataset as an example, the following code demonstrates how you would add a dataset label, variable labels to the Name and Sex variables, and add a format to the height and weight variables.
First, ensure that you have an unmodified version of the CLASS and CLASSFIT datasets in your WORK library by running the following code:
Similar to the earlier examples, the first line of the PROC DATASETS code below starts with the lib=work option pointing PROC DATASETS to the WORK directory and also a memtype=data option so that your PROC DATASETS output only includes information about datasets.
To modify individual attributes of a dataset within that library, you start with the modify statement followed by the dataset name. To add a dataset label to the selected dataset (CLASS in this example) you simply add the option in parentheses after the dataset name.
On the third line of code in this example, the format statement is used to ensure that both height and weight always include two decimal points.
To modify individual variable labels, a label statement is included starting from the fourth line to add labels to both the name and sex variables. After modifying the desired variable attributes, simply add a run statement to conclude that section of the PROC DATASETS call.
To verify the results and ensure the modifications were successful, a contents statement can be added to the same PROC DATASETS call to list out the attributes of the CLASS dataset in the WORK directory.
After submitting the above code, you can see in the partial output shown below that the dataset label has been added, the variable labels to name and sex have been added, and the height and weight variables now have a format:
To check that the formats are working correctly with the CLASS data, you can open the CLASS dataset from the WORK library to confirm that both height and weight have 2 decimal points for all records:
Another notable feature about PROC DATASETS is that you can continue to expand a single PROC DATASETS call to modify multiple datasets at a time.
Building upon the previous example, you can make modifications to the attributes of the CLASSFIT dataset after making modifications to the CLASS dataset within the same PROC DATASETS call.
By adding a second modify statement, you can modify the CLASSFIT dataset after you are done modifying the CLASS dataset. Here, the rename statement is used to rename the name variable to name_fit. The format statement is then used to add the number format with 3 decimal places to the predict and lowermean variables in CLASSFIT.
Finally, a second contents statement can be used to verify the attributes were modified as expected in CLASSFIT:
From the partial output shown below, you can see that the name variable was correctly renamed to name_fit and the number format was applied to the lowermean and predict variables:
You can also open the WORK.CLASSFIT dataset to verify the predict and lowermean variables are now only shown with 3 decimal places:
5. Creating an Index
SAS indexes can be an extremely valuable tool when you often need to subset large datasets. By creating an index with a SAS dataset, the run time of your SAS programs can often be greatly reduced when your program involves subsetting or updating large datasets.
The extent to which your performance will improve varies widely depending on your environment, the structure of your SAS datasets, and the tasks you are trying to complete. Because there are so many factors involved with getting the greatest performance out of your SAS indexes, this topic will not be covered in depth in this article but the steps to create an index are outlined in the following section.
The following examples will show you how to create an index for your datasets using PROC DATASETS so that you can start testing and see what might work for you.
The variable(s) that you choose for your index will depend on the use case for your data. The SASHELP.BWEIGHT dataset contains birth weight and other demographic information for 50,000 people. If, for example, you often subset this data on different birth weights before conducting your analysis, it may be helpful to set up an index for this dataset on the WEIGHT variable if the subsetting is taking a long time to run.
First, create a copy of the WEIGHT dataset in your WORK library using the following code for this example:
Next, to create your simple index using PROC DATASETS, you use the modify statement to let SAS know which dataset you would like to create the index for.
To create a simple index with 1 variable, use the INDEX CREATE statement followed by the name of the variable you would like to create the index on. In this case, the index is created on the variable weight.
To check that the index was properly created, you can also include a contents statement for the BWEIGHT dataset.
By looking at the partial output shown below, you can see that the BWEIGHT dataset in WORK now has 1 index associated with it:
To take advantage of the indexed dataset, you can then use the variable(s) specified in your index when using WHERE statements, match merging or doing BY statement processing in DATA or PROC steps.
For example, if you wanted to create a new birth weight dataset for only those people with birth weights over 4000, you can use the following code which will take advantage of the index:
If you often subset your data on more than 1 variable at a time, you can create a composite index on 2 more variables.
Using the BWEIGHT dataset as an example, you can create an index on the WEIGHT (birth weight) and MARRIED (married status indicator) variables.
The main difference in this example compared with the simple 1 variable index above is that the index itself is now given a name, bwindx, and then two variables are listed after the name of the index, weight and married, to indicate that both variables should be included in the index.
By looking at the output, shown partially below, you should now see that the index called bwindx exists on the variables Weight and Married: