Are you looking for an efficient way to compare two datasets without having to merge them together? Would you like a simple way to summarize differences between two variables? Are you looking for a method to compare the metadata and structure of a dataset without comparing individual values? If so, then PROC COMPARE can be a very useful tool for you.
In this article, we will discuss the many different ways you can compare datasets and variables using PROC COMPARE. A variety of examples will be presented to highlight the different options available with PROC COMPARE that allow you to compare, contrast and report on the differences between datasets and the variables within them.
In particular, this article will cover:
Basic comparison of two datasets
Comparing two datasets with a common ID variable
Comparing Metadata (dataset attributes) between two datasets
Comparing datasets by Observations
Comparing variables within a Dataset
Comparing specific values in an output dataset
Before we continue, make sure you have SAS Studio or SAS 9.4 installed. Don't have the software? Download SAS Studio now. It's free!
In this article, the CLASS and CLASSFIT datasets from the SASHELP library are compared. PRDSALE and PRDSAL2 from SASHELP are also used as example comparison datasets.
For more in-depth comparison examples, you will need to run the following code to create two new datasets, CUSTOMERS1 and CUSTOMERS2. These datasets contain a list of customers, their unique customer IDs, first names, last names and middle initials.
data customers1; infile datalines delimiter = ','; input customerid $ firstname $ lastname $ middleinitial $; datalines; 001, Bob, Smith, A 002, Alex, Jones, B 003, Tom, Cook, C 004, Wendy, Jefferies, D 005, Gail, Phillips, E 006, Samantha, Williams, L 007, Bill, Taylor, K 008, Diane, Miller, Q 009, Pat, White, P 010, Donna, Summers, A ; run; data customers2; infile datalines delimiter = ','; input customerid $ firstname $ lastname $; datalines; 001, Bobby, Smith 002, Alex, jones 004, Wendy, Jeffery 005, Gail, Phillips 006, Samantha, Williams 007, George, Taylor 012, Diane, Miller 009, Pat, White 010, Donna, Summer ; run;
[Don't have the software yet? Download SAS Studio here for free.]
Basic Comparison of Two Datasets
First, let’s look at a straight forward comparison of two datasets. For this example, we are going to compare the SASHELP.CLASS dataset with the SASHELP.CLASSFIT dataset.
The most basic default usage of PROC COMPARE is to simply specify one dataset as your “base” dataset using the BASE argument and then specify the dataset you would like to compare it against using the COMPARE argument:
proc compare base = sashelp.class compare = sashelp.classfit; run;
After running the above code, you will notice the Results are divided into 5 summaries. Let’s walk through each section in more detail
First, you will see the dataset summary which compares the structure or “metadata” of a dataset including the names, created dates, modified dates, number of variables, number of observations and any labels:
The next section is the Variable Summary, which summarizes how many variables are found in both datasets (i.e. the variables in common) and how many variables are in one dataset but not the other. In this case, there are 5 variables in SASHELP.CLASSFIT which are not in SASHELP.CLASS and no variables that are in CLASS but not CLASSFIT:
The third section is the Observation Summary, which summarizes how many observations are in each dataset and how many have equal or unequal values in some of the variables:
The fourth section is the Values Comparison summary, which lets you know how many different variables were found that had either all values exactly equal or had or some non-matching values:
Finally, the details of the value comparisons are shown, which displays those observations, one variable at a time, with unequal values between the two datasets. In the partial output below, you can see that all 19 observations have different values for NAME between the two datasets but SEX is only unequal in 6 of the 19 observations:
Limiting the Comparison to Certain variables
With large datasets, the Value Comparison Results (the final summary) can become quite lengthy when you are comparing many variables. Fortunately, you can limit the comparison to a list of variables that you specify.
For example, if you would only like the details of the age variable, you can use the VAR option to specify the variable AGE as follows:
proc compare base = sashelp.class compare = sashelp.classfit; var age; run;
Although the initial summaries about the metadata are the same as before, you will notice the details in the Value Comparison Summary are now limited to only the age variable:
Comparing Two Datasets with a Common ID variable
In many cases when you are comparing two datasets, there will be a common ID variable or multiple ID variables between the two datasets. If both datasets have a unique ID number you can merge the files together and conduct a more in-depth record level comparison. By running a PROC COMPARE with ID variable, you can compare those records with a matching ID without having to explicitly merge your data together or create a new dataset.
Using the same CLASS and CLASSFIT datasets from the SASHELP library, you may have noticed both datasets contain a NAME variable. When you review the results from the initial PROC COMPARE, notice the summary indicates that none of the values for NAME match:
However, when you look at the list of NAME values, you can see that many of the names would actually match:
When you run a standard PROC COMPARE with no options, SAS will just compare observation 1 from the BASE dataset with observation 1 from the COMPARE dataset followed by comparing observation 2 from the BASE dataset compared with observation 2 from the COMPARE dataset and so on. Thus, if the observations are not sorted and do not line up exactly, it will appear as though nothing matches between your two datasets.
To make the compare more useful, you can include the ID option. To demonstrate this with an example, we can compare the SASHELP.CLASS dataset to SASHELP.CLASSFIT, using the NAME variable as the ID variable. In this manner, SAS will only compare records from CLASS with records from CLASSFIT that have identical values for ID.
Before using the ID variable, it is best practice to sort each dataset by the ID variable unless you are certain there is a perfect one-to-one match. Since we are not yet certain, we will sort both CLASS and CLASSFIT by NAME and create 2 new datasets, WORK.CLASS and WORK.CLASSFIT. After sorting, simply add the ID option and specify the NAME variable as the ID variable in your call to PROC COMPARE:
proc sort data=sashelp.class out=class; by name; run;
proc sort data=sashelp.classfit out=classfit; by name; run;
proc compare base=class compare=classfit; id name; run;
Although we are comparing the same two datasets as in the first example, the results are much different now with the ID option included.
When the NAME variable is used with the ID option, the Observation Summary now indicates that of the 19 observations in each dataset, all 19 ID values are common between the 2 datasets. Furthermore, for those common variables, there are no unequal values in any variable now that we have matched the datasets on NAME:
Comparing Metadata (Dataset Attributes)
In many scenarios, comparing the values between two datasets may not be useful if you already know the datasets will have different values. It can however still be useful to compare the structure of two datasets to see if there are any differences in the variables or their attributes.
In the SASHELP library, the PRDSALE and PRDSAL2 datasets both contain information about predicted and actual furniture sales. If for example you know that the records differ widely between the two datasets but you would like to know how the structure of the datasets compare, you can add a few different options to the PROC COMPARE.
First, you can add the NOVALUES option, which suppresses any value comparisons in the PROC COMPARE output:
proc compare base = sashelp.prdsale compare = sashelp.prdsal2 novalues; run;
The Results output will be similar to before, listing the Data Set Summary, Variables Summary, Common variables with differing attributes, Observation Summary, Values Comparison Summary and a list of all variables with Unequal values (without actually reporting those observations with differences).
As you can see in the output shown partially below, there are 2 variables in PRDSALE that are not in PRDSAL2 and conversely there are 3 variables in PRDSAL2 that are not in PRDSALE:
To expand the comparison and see which variables are in one dataset but not the other, simply add the listvar option:
In addition to determining what are the differences in variables between two datasets, it can also be helpful to understand what are the differences in observations between two datasets. To see observation differences, there are a variety of other LIST options that can be used.
Some of the most common and useful LIST options that you can use include:
LISTBASEOBS – list observations found in the BASE dataset but not the COMPARE dataset
LISTCOMPOBS – list observations found in the COMPARE dataset but not the BASE dataset
LISTOBS – list observations that are found in either the BASE or COMPARE dataset but not both
LISTALL – list all observations and variables that are found in either the BASE or COMPARE dataset but not both
Using the WORK.CUSTOMERS1 and WORK.CUSTOMERS2 datasets created at the beginning of this article, let’s look at a couple examples.
First, say for example you would like to know which observations are in WORK.CUSTOMERS1 but not WORK.CUSTOMERS2. Here, we will select the CUSTOMERS1 dataset as the BASE dataset and CUSTOMERS2 as the COMPARE dataset. Since we would like to know which observations are in CUSTOMERS1 and not CUSTOMERS2, and CUSTOMERS1 is our BASE dataset, we will add the LISTBASEOBS option.
Recall that since we are comparing observations, we will use the CUSTOMERID variable as our ID variable in PROC COMPARE to ensure the correct records are being compared. Both input datasets are also sorted by ID prior to running the comparison.
proc sort data = customers1; by customerid; run;
proc sort data = customers2; by customerid; run;
proc compare base=customers1 compare=customers2 listbaseobs; id customerid; run;
As you can see from the output, there is now a new section that lists out the CUSTOMERID’s which are found in the BASE dataset (CUSTOMERS1) but not the COMPARE dataset (CUSTOMERS2):
The LISTCOMPOBS works in exactly the same way, and you would just need to replace LISTCOMPBASE with LISTCOMPOBS in your PROC COMPARE to produce similar results, except now those observations in CUSTOMERS2 which are not in CUSTOMERS1 will be listed:
proc compare base = customers1 compare = customers2 listcompareobs; id customerid; run;
To avoid having to run PROC COMPARE many times, you can take advantage of the LISTALL option, which will essentially combine the output of LISTVARS, LISTBASEOBS and LISTCOMPOBS into a single output.
proc compare base = customers1 compare = customers2 listall; id customerid; run;
Now in the Results output you will see a list of any variables found in one dataset but not the other (equivalent to what LISTVARS produces) and a list of any observation found in one dataset but not the other (equivalent to what LISTBASEOBS, LISTCOMPOBS or also LISTOBS produces):
Comparing Values within a Dataset
Not only can PROC COMPARE be used for comparisons between datasets, but it can also be used to compare variables within a single dataset.
The PRDSALE dataset from the SASHELP library contains furniture sales data over time. One variable, PREDICT, contains the predicted sale amount of each item while another variable, ACTUAL, contains the actual amount the item was sold for. If for example you wanted to examine how many times the predicted value was different from the actual value, the largest difference between values as well as the specific differences and percentage differences between the two you can use a single call to PROC COMPARE.
The syntax for within dataset comparisons is quite simple. As before, you specify a BASE dataset (PRDSALE in this example) but this time you do not specify a COMPARE dataset. Next, you use the VAR statement to list any variables you want to compare. After specifying the variables you want to compare, the WITH statement is used to specify the corresponding variables you want to compare the first set of variables with. So, to compare ACTUAL with PREDICT in the PRDSALE data, you would use the following syntax:
proc compare base = sashelp.prdsale; var actual; with predict; run;
In the first part of the output, you will see the summary of the comparison between the variables. Notably, 1439 out of 1439 records are unequal, so the predicted sale value was never equal to the actual sale value for any observation. Also, the maximum difference was found to be 978 ($978.00) between Actual and Predicted Sales:
Examining the output further, you can see the observation level comparison where the difference and percentage difference for the first 50 observations are shown, as you can see in the partial output below:
proc freq data = sashelp.heart; tables deathcause /missing; run;
Now, the output table will have a row just for the missing values, and the proportion of missing values will be factored in to all the percentage calculations:
Note that by default, the observation level comparisons are limited to 50 records, which is true both for within dataset comparisons and comparisons between two datasets. To expand the comparison to 200 records for example, simply use the MAXPRINT option with PROC COMPARE, as shown here:
proc compare base = sashelp.prdsale maxprint = 200; var actual; with predict; run;
Note that MAXPRINT can be used for both within dataset comparisons and between dataset comparisons. As you can see the output shown partially below, the comparison results are now printed for up to 200 observations:
Need help studying for the new SAS Certified Specialist Exam?
PROC COMPARE is also a useful tool for conducting more in-depth comparisons between two datasets. If there are differences in values between the two datasets you are comparing and you would like to drill down to find out exactly how the values of two matching variables are different, PROC COMPARE has a number of options for reporting on these differences.
Similar to the LIST options, there are a number of different options that can be used in different combinations with the OUT argument. Here are the most common and useful options that can be used with the OUT argument:
OUTCOMP – include the values from the COMPARE dataset
OUTBASE – includes the values from the BASE dataset
OUTDIF – includes indicators to show what position the differences are found
OUTNOEQUAL – restrict output observations to only those observations with differences
Using the CUSTOMERS1 and CUSTOMERS2 datasets created at the beginning of this article, the following example illustrates how to create both a convenient Output dataset and report on differences in values.
As before, we will be using the ID option with PROC COMPARE so we need to first ensure that the CUSTOMERS1 and CUSTOMERS2 datasets are sorted prior to running the compare:
proc sort data = customers1; by customerid; run;
proc sort data = customers2; by customerid; run; proc format; value $missing_char ' ' = 'Missing' other = 'Present' ; value missing_num . = 'Missing' other = 'Present' ; run;
To create an output dataset of differences, you simply need to add the OUT argument to the PROC COMPARE statement. Immediately following the OUT argument is the name of the output dataset that will be created. In the following example, the output dataset will be named CUSTOMER_COMPARE.
After the dataset name, the desired options OUTCOMPARE, OUTBASE, OUTDIFF and OUTNOEQUAL as described above are added.
Since we are only interested in the output dataset in this example, the NOPRINT option is also included which suppresses the printing of any Results:
proc compare base = customers1 compare = customers2 out = customer_compare outcomp outbase outdif outnoequal noprint; id customerid; run;
As you can see in the output dataset shown below, we now have groups of records where there is one record for each BASE observation, one for each COMPARE observation, as well as a DIF record which highlights which position the differences occur between the two values. When there is no matching observation (based on CUSTOMERID) between the BASE and COMPARE datasets, there will be only the respective BASE or COMPARE observation that is found in one dataset but not the other.
While comparing differences within a dataset is certainly useful, it can often be easier to review differences within CUSTOMERIDs by further summarizing the results with PROC PRINT. Using the following PROC PRINT syntax, the output dataset CUSTOMER_COMPARE we created above with PROC COMPARE can be easily grouped by CUSTOMERID for further review:
proc print data = customer_compare; by customerid; id customerid; run;
In the resulting output shown partially below, you can now easily compare the values from the BASE and COMPARE datasets within each by group (CUSTOMERID in this example):