Proc Freq: 7 Ways to Compute Frequency Statistics in SAS
In this article, we will show you 7 different ways to analyze your data using the FREQ procedure. You will learn how to see frequencies of different variables, find the most/least commonly occurring values in your data, check for missing values,… Let's get started!
Before we continue, make sure you have SAS Studio or SAS 9.4 installed. Don't have the software? Download SAS Studio now. It's free!
The examples used in this article are based on the CLASS and CLASSFIT data sets from the SASHelp library.
You can find the CARS and HEART data sets from the sashelp library:
[Don't have the software yet? Download SAS Studio here for free.]
1. Basic Usage
The most basic usage of Proc Freq is to determine the frequency (number of occurrences) for all values found within each variable of your dataset.
Using the CARS dataset as an example, you can determine the frequencies of all variables within your dataset with the following code:
Proc freq data = sashelp.cars; Run;
The code above creates a frequency table for each of the variable in the data set.
For example, below is a frequency table for the variable MAKE.
If you scroll down, you will also see the frequency tables for the variable ORIGIN and DRIVETRAIN:
By default, the TABLES statement used with Proc Freq will output a table which lists the values found within the variable(s) specified, the frequency of each value, the percentage of that value relative to all other value as well as the cumulative frequencies and cumulative percentages.
The cumulative frequencies and percentages are rolling totals determined by adding the number from each row to the row above it.
However, using Proc Freq in this manner without any options is usually not recommended, particularly if you have a large dataset which contains variables that have many unique values (levels).
A variable such as Model with a large number of unique (distinct) values will produce a very long output which will be difficult to read and not very useful:
A more efficient and effective use of Proc Freq is to use the TABLES statement to limit the variables that are reported on.
Here, the TABLES statement is used to only output the frequencies and percentages of the Origin variable to determine how many cars originate from which continent:
Proc freq data = sashelp.cars; Tables Origin; Run;
The resulting table from this code is shown here:
2. Sort output to determine the most/least commonly occurring values
You can use proc freq to determine the most or least commonly occurring values within a variable or multiple variable(s).
Using the order option, you can easily see the most or least commonly occurring values of both Type and Origin variables:
Proc freq data = sashelp.cars order=freq; Tables type origin; Run;
The resulting tables shows the frequency of each variable sorted with the most common variable on top and the least common on the bottom:
3. Check for Missing Values
Proc freq is an excellent tool to check for missing values in your dataset.
For this example, the SASHELP.HEART dataset is used. The SASHELP.HEART dataset can be accessed in the same way as the CARS dataset described above.
To check for the frequency of missing values in the DeathCause variable from the HEART dataset, you would use the following code:
With the MISSING option, you can see the frequencies and the percentage of missing values within the table:
Finally, there is also a way to include the frequency of missing values within your output table without factoring the percentage of missing values into the calculation of the percent and cumulative percentage of the other values.
Notice that in this table, the percentages of each value are lower than the percentages in the previous table, as the missing values are not factored into this calculation.
Using the Unknown value as an example, the percentage of records that have an Unknown value for Cause of Death is 5.63% with MISSPRINT, compared to only 2.15% in the previous table with the MISSING option:
Frequencies and percentages calculated using Proc Freq can also be saved to an output dataset using the OUT option combined with the TABLES statement.
The OUTCUM option can also be added to include the cumulative frequencies in the output dataset if desired:
Proc freq data = sashelp.cars order=freq; Tables type /out=cars_freq outcum; Run;
5. Use the FORMAT statement to categorize and analyze data
When combined with Proc Format and a FORMAT statement, Proc Freq also becomes a powerful tool to categorize and subsequently analyze continuous variables (or variables with a large number of unique values).
Using the MSRP (Manufacturer’s Suggested Retail Price) variable in the Cars dataset as an example, you can see that the standard Proc Freq output shown below does not produce very useful information for a variable such as MSRP:
Proc freq data=sashelp.cars; Tables msrp; Run
However, buy using Proc Format you can create categories (or groups) of MSRPs to see, for example, how many cars fall within a particular price range.
In this example, both Origin and DriveTrain each have 3 possible values. As a result, the cross-tabulation produces a 3x3 table which includes a total of 9 combinations (i.e. 3x3 = 9):
While this table may seem overwhelming at first, let’s walk through it step-by-step to understand what each component refers to.
As shown in the legend, the first row corresponds to the frequencies. For example, the 34 in the top left box indicates that there are 34 cars from Asia that have an “All” for DriveTrain.
Moving from left to right, the 99 in the top middle box indicates that there are 99 cars from Asia that have a “Front” drivetrain, and so on.
The second row contains the percentages relative to the other 8 combinations. Using the top left box again as an example, the 7.94% indicates that out of the 9 possible combinations of Origin and DriveTrain, 7.94% of records have Origin=Asia and DriveTrain=All.
The third row contains what is known as the row percentages. Starting with the top left box as an example, the 21.52 indicates that of those records with Origin=Asia, 21.52% have a DriveTrain=All. Moving across the row from left to right, you can see that for Origin=Asian cars, 62.66% have DriveTrain=Front, and 15.82% have a DriveTrain=Rear. Notice that these 3 percentages total 100% when summed (added together) across the row.
The fourth row contains what is known as the column percentages. Starting with the top left box as an example, the 36.96 indicates that of those records with DriveTrain=All, 36.96% have Origin=Asia. Moving down the column from left to right, you can see that for DriveTrain=All cars, 39.13% have Origin=Europe and 23.91% have Origin=USA. Notice that these 3 percentages total 100% when summed (added together) down the column.
Depending on the desired results, you can choose to suppress some of these numbers from the output. The NOCOL, NOROW, NOFREQ and NOPERCENT options can be used to suppress the column percentages, row percentages, frequencies and overall percentages from your output. These options can be used independently or in different combinations together.
For example, if you wanted to suppress the row and column percentages, but keep the frequencies and overall percentages, you would use the following code:
This produces the following table, which contains only the frequencies and overall percentages:
Two-way or multi-way tables can also be displayed in more of a list format for improved readability. This is especially useful when there are many possible combinations between the two variables. To display a cross tabulation in the long form “list” format, you can simply use the LIST option:
The results are identical to those produced without the LIST option, the only change is in how the information is displayed:
7. Produce dot and bar plots
Another useful feature of Proc Freq is the ability to create graphical representations of the frequencies and percentages.
Within Proc Freq, you have the ability to create either dot or bar plots, which can be created based on either the frequencies or the overall percentages.
In the following example, the TABLES statement is used to create both a 1-way frequency table for the Origin variable, and a 3x3 frequency table for the DriveTrain variable crossed with Origin.
To produce a dot plot for these variables, the plots=freqplot (type=dot) option is added. In order to produce these graphs, ODS graphics must also be turned ON (and subsequently turned OFF) as shown below:
Along with the frequency tables, the following 2 graphs are produced with the code above. The first graph shows a dot plot of the frequencies for each continent of Origin in the Cars dataset:
Because the “DriveTrain*Origin” portion of the TABLES statement in this example was also included in the code above, this second graph is produced which shows the frequency distribution of DriveTrain by each Origin in a single graph:
Alternatively, similar code can be used to produce bar plots based on the percentages instead of the frequencies. Of course, you can also mix and match combinations to produce a dot plot of percentages or a bar plot of frequencies if desired.
Using some of the code discussed earlier on this page to group and report on the MSRPs, the type=bar and scale=percent options are added to produce a bar plot that graphically represents the corresponding percentages with bars.