Would you like to better understand how to work with missing values in SAS? Would you like to know how to easily and quickly remove missing values from your data? Are you looking for tips to quickly filter, identify, count or report on missing values?
In this article, we will discuss different methods for handle missing values with both character and numeric variables. Ways to remove records with missing values will be discussed and we will provide some tips and tricks on how to easily count or flag missing values in your data. Finally we will demonstrate different strategies you can use to summarize and report on the presence of missing values.
Along the way, examples will be provided to illustrate the different methods for handling missing values and show how different SAS Procedures deal with missing values. Some of the associated options which affect how SAS procedures behavior with missing values will also be discussed.
In particular, the main topics covered include:
Character vs. Numeric Missing Values
Removing Records with Missing Values
The Missing Function
Counting and Reporting on the Frequency and Proportion of Missing Values
Using PROC FORMAT to Report on Missing Values
The NMISS and CMISS Functions
Before we continue, make sure you have SAS Studio or SAS 9.4 installed. Don't have the software? Download SAS Studio now. It's free!
The examples used in this article are based on the CLASS and HEART datasets found in the SASHELP library.
The HEART dataset in particular contains a variety of character and numeric variables which contain missing values.
This datasets which serve as great practice for gaining a better understanding on how to handle missing values in your SAS datasets
[Don't have the software yet? Download SAS Studio here for free.]
Character vs. Numeric Missing Values
All variables in SAS fall under one of two variable types, character or numeric. Within these two variable types, missing values are handled slightly differently.
With character variables, missing variables are denoted with a blank. In SAS code, a blank is typically specified by enclosing a space between two quotation marks, such as “ “ or ‘ ‘. With numeric variables, a missing value is denoted with a period. In SAS code, the period [.] is specified by simply using the period with no quotation marks or parentheses around it.
Let’s look a simple example where we create both a new character and numeric variable which contain missing values.
For this example, we use the SASHELP.CLASS dataset, which contains the list of students in a class and some basic information about them.
In the code below, we create a new dataset, WORK.CLASS, with two new variables: CHAR_MISS (which will be a character variable) and NUM_MISS (which will be a numeric variable). When SEX is “M” (male) in the CLASS dataset, we will set both CHAR_MISS and NUM_MISS to missing. For all other values of SEX, CHAR_MISS and NUM_MISS will be equal to 1:
data class; set sashelp.class;
if sex = 'M' then char_miss = ' '; else if sex = 'F' then char_miss = '1';
if sex = 'M' then num_miss = .; else if sex = 'F' then num_miss = 1;
As you can see the Output Data created, when SEX=”M” the variable CHAR_MISS has missing values denoted as blanks and the variable NUM_MISS has missing values denoted as periods.
Removing Records with Missing Values
Sometimes, you simply want to remove records which contain missing values for a particular variable in your dataset. Let’s look at a few simple examples where we remove records based on both missing character and numeric values.
The following examples are based off the SASHELP.HEART dataset, which contains various health and demographic variables about different people.
Example 1 – Removing Records with Missing Character Values
To remove records that have a missing value for a particular character variable, you simply need to use an IF statement to check for blanks, followed by a THEN DELETE statement. In this example, we are going to remove all records with a missing value for the DeathCause variable.
data heart_char; set sashelp.heart;
if DeathCause = ' ' then delete; run;
As you can see from the SAS Log, the SASHELP.HEART dataset started with 5209 observations, but the output dataset HEART_CHAR only has 1991 observations after running the code above:
Looking at the Output Data shown partially below, you can see that any records with a blank for DEATHCAUSE have been removed, and the total number of rows is now only 1991:
Example 2 – Removing Records with Missing Numeric Values Similar to the previous example, to remove records that have a missing value for a particular numeric variable, you simply need to use an IF statement to check for a period, followed by a THEN DELETE statement. In this example, we are going to remove all records with a missing value for the AgeCHDdiag variable using the following code:
data heart_num; set sashelp.heart;
if AgeCHDdiag = . then delete; run;
As you can see from the SAS Log, the SASHELP.HEART dataset started with 5209 observations, but now the output dataset HEART_NUM has 1991 observations after running this code:
Looking at the Output Data shown partially below, you can see that any records with a blank for AgeCHDdiag have been removed, and the total number of rows is now only 1449:
The Missing Function
In large datasets with many variables you are not very familiar with, it is easy to lose track of which variables are numeric and which are character. To avoid having to look up the variable type before trying to perform an operation on missing values, you can use the MISSING function. The missing function is helpful in this regard because it works the same way for both character and numeric variables.
The MISSING function is utilized within SAS Data Step code. There is only one input to the MISSING function, which is the name of the variable you would like to scan for missing values. The greatest advantage of MISSING is that it works with both character and numeric variables, so you do not need to do anything to indicate what the type of the variable is that you would like to check for missing values.
Let’s look at an example using the SASHELP.HEART dataset and the DeathCause variable.
To use the MISSING function, simply enclose the variable you would like to check for missing values inside the parentheses such as “missing(DeathCause)”. If SAS finds a missing value for the variable specified, it will return a value of 1. If it does not find a missing value (i.e. there is a value present for a given record) it will return a value of 0.
To illustrate this, the following syntax creates a new variable, MISSING_FLAG, which contains the result of “missing(DeathCause)”, either a 1 or 0, for each record:
data missing_func; set sashelp.heart;
missing_flag = missing(deathcause);
keep missing_flag deathcause; run
As you can see the output shown partially below, the new variable MISSING_FLAG equals 1 when DeathCause is missing but equals 0 when there is a value for DeathCause present:
The MISSING function can be used in place of filtering values on blanks or periods for character and numeric variables, respectively.
Going back to the previous example of deleting records when there is a missing value, here is how it could be done using the MISSING function on the variable DeathCause:
data missing_delete; set sashelp.heart;
if missing(deathcause) = 1 then delete; run;
As you can see from the SAS Log, the results are the same as the previous example where the missing function was not used and we are left with 1991 observations after removing those records with a missing value for DeathCause:
Alternatively, this code can be simplified even further but omitting the the “=1” portion of the code. Since this is a true/false binary condition, simply using “if missing(deathcasuse) then delete;” has the same effect:
data missing_delete; set sashelp.heart;
if missing(deathcause) then delete; run;
The result again can be quickly confirmed by checking the SAS log to verify that only 1991 observations are remaining:
Are you totally new to SAS?
Take our Practical SAS Training Course for Beginners and learn how to code your first SAS program!
On very large datasets, it is more efficient to use a sub-setting WHERE statement instead of an IF statement to remove records with missing values. Since the WHERE statement applies to the incoming dataset on input, the entire dataset does not need to be read-in before filtering, reducing processing time and the amount of system memory required to complete the task.
Let’s look at an example where the SASHELP.HEART dataset is read-in to only include those records with missing values for DeathCause. Using a WHERE statement and the MISSING function, this can be easily accomplished with a single line of code within a SAS data step:
data missing_where; set sashelp.heart; where missing(deathcause); run;
As you can see the output shown partially below, we now only have those 3218 records with a blank for DeathCause:
To do the inverse, i.e. exclude those variables with missing values for DeathCause, we just need to use a WHERE NOT MISSING instead of WHERE MISSING:
data notmissing_where; set sashelp.heart; where not missing(deathcause); run;
As you can see in the Output Data shown partially below, we now only have those 1991 records with a non-missing value for DeathCause as before:
Counting and Reporting on Frequency of Missing Values
In many cases, simply removing those records with missing values is not an option. When analyzing data, it is important to have a good understanding of the frequency and percentage of missing values within your data.
Building upon some of the strategies discussed earlier to flag and subsequently delete records with missing values, we can also use SAS to flag and report on the proportion of missing values.
Let’s explore two examples which demonstrate different ways to report on the proportion of missing values.
Example 1 – Using PROC FREQ to check for Missing Values Using PROC FREQ to check for and report on missing values is one of the simplest methods. For this first example, you would like to determine the frequency and proportion of missing values for the DeathCause variable in SASHELP.HEART.
To determine the frequencies and percentages of each value in DeathCause, we run a simple PROC FREQ with the TABLES statement using the following code:
proc freq data = sashelp.heart; tables deathcause; run;
By default, when you run a PROC FREQ on a particular variable, only the frequency of the missing values will be shown and the proportion of missing values will not be included in the percentage calculations:
To include the missing values as part of the calculated proportions, you need to add the missing option to the TABLES statement as shown here:
proc freq data = sashelp.heart; tables deathcause /missing; run;
Now, the output table will have a row just for the missing values, and the proportion of missing values will be factored in to all the percentage calculations:
Example 2 – Using IF-THEN Statements and PROC FREQ to Report on Missing Values
While reporting on missing values can be done with a simple PROC FREQ as discussed in the previous example, it is often helpful to flag those records with missing values and then report on them. After examining the proportion of missing values, you can then use the flagged variables for further filtering, calculations and analysis.
In this example, you would like to determine the proportion of missing values in the variable AgeCHDdiag found in the SASHELP.HEART dataset. The AgeCHDdiag variable is a numeric variable, so we know that we can use an IF statement with a period to look for missing values.
To calculate the proportion and create an easy to read output table, we will create a new variable, MISS_FLAG and set it equal to “missing” when AgeCHDdiag is missing. We will thenset MISS_FLAG equal to “present” when AgeCHDdiag is not missing. To simplify the output, only the MISS_FLAG and AgeCHDdiag variables are kept in the MISS_AGE_REPORT dataset.
Once the new variable is created, a PROC FREQ can be used to summarize the results:
data miss_age_report; length miss_flag $8; set sashelp.heart;
if AgeCHDdiag = . then miss_flag = 'missing'; else miss_flag = 'present';
keep miss_flag AgeCHDdiag; run;
proc freq data = miss_age_report; tables miss_flag; run;
Examining the Output Data shown partially below, we can see that the miss_flag is being created correctly:
The output from the PROC FREQ can then be used to see both the frequency and percentage of missing values for the AgeCHDdiag variable:
Example 3 – Using the Missing Function and PROC FREQ to Report on Missing Values
The missing function can also be used to count the number of missing values and ultimately determine the proportion of missing values. In this example, we create the MISS_FLAG variable by setting it equal to the output of the MISSING function used on AgeCHDdiag. In this way, the MISS_FLAG will take the value of 1 when AgeCHDdiag is missing, and 0 when it is not.
First, we check the Output Dataset shown partially below to verify the miss_flag variable is being created correctly:
Next, we can look at the PROC FREQ Results to determine the frequency of missing values for AgeCHDdiag in our data.
It is important to note that while this method requires a bit less coding than the previous example, the PROC FREQ output is not as easy to interpret. Although the results are the same as the previous example, here we must remember that 1 is missing and 0 is not missing, which is less obvious than our previous example where the PROC FREQ Output had “missing” and “present” shown in the place of 1 and 0 in the Results table:
Example 4 – Using PROC MEANS to Count Missing Values
For numeric variables, PROC MEANS provides a simple way to count the number of missing values. Although PROC MEANS will only work on numeric variables, it provides a quick and easy way to count the number of missing values for multiple variables with a single run.
By using the NMISS and N options, PROC MEANS will output the number of missing values and the total number of values for all numeric variables in your dataset. This is demonstrated using the following code on the SASHELP.HEART dataset:
proc means data=sashelp.heart nmiss n; run;
As you can see in the output shown below, we now have a summary of all numeric variables and the number of missing values found in each one:
Of course, you can also restrict the PROC MEANS output to only report on specific variables. If you would like for example to only report on Height and Weight, you can add the VAR statement followed by the variables of interest:
proc means data=sashelp.heart nmiss n; var height weight; run;
Now, the Results are limited only to the height and weight variables:
Need help studying for the new SAS Certified Specialist Exam?
Using PROC FORMAT and PROC FREQ to Report on Missing Values
Another useful tool for summarizing and reporting on missing values is PROC FORMAT. By creating your own custom format to categorize missing vs. non-missing values, you can quickly get a sense of the proportion of missing vs. non-missing values in each variable of your dataset.
Once the formats have been created, you can continue to use them throughout your SAS session, making the format a very efficient and powerful tool.
The first step is to create both a character and numeric format so that you will have a format that you can use for each variable type.
Using the code below, you will create a format that can be used with character variables called $MISSING_CHAR, which assigns the value of ‘Missing’ when there is a missing value and ‘Present’ when there is a non-missing value. You will also create a format called MISSING_NUM which will similarly assign the value of ‘Missing’ where there is a missing value in a numeric variable and ‘Present’ when there is a non-missing value in a numeric variable:
proc format; value $missing_char ' ' = 'Missing' other = 'Present' ; value missing_num . = 'Missing' other = 'Present' ; run;
Once the format has been created, you can then apply this format anywhere you would normally apply formats in your SAS code. Looking at the SASHELP.HEART dataset and using PROC FREQ, you can for example apply the $MISSING_CHAR format to the character variable DeathCause and apply the MISSING_NUM format to the AgeAtDeath variable:
proc freq data = sashelp.heart; tables DeathCause AgeAtDeath /missing; format DeathCause $missing_char. AgeAtDeath missing_num.; run;
As you can see in the output shown below, the results are now group into 2 levels – Missing and Present:
These formats can also be expanded to generate reports on the entire dataset. By using the following keywords together with your newly created format, you can quickly create a missing value report on your entire dataset. Here are the SAS keywords we will use:
_all_ - SAS keyword equivalent to listing all variables found in your dataset
_character_ - SAS keyword equivalent to listing all character variables found in your dataset
_numeric_ - SAS keyword equivalent to listing all numeric variables in your dataset
By utilizing these keywords, we can expand the PROC FREQ above to report on all variables, and then apply our character and numeric missing formats to the appropriate variables of the corresponding types.
First, in the tables statement we use _all_, to tell SAS that we would like frequency tables generated for all variables. Next, in the format statement, we apply the $MISSING_CHAR format to all character variables using the _character_ keyword and apply the MISSING_NUM format to all numeric variables using the _numeric_ keyword.
proc freq data = sashelp.heart; tables _all_ /missing; format _character_ $missing_char. _numeric_ missing_num.; run;
As you can see in the output shown partially below, we now have frequency tables showing the proportion of missing vs. non-missing values in all variables found in the SASHELP.HEART dataset:
NMISS and CMISS Functions
The NMISS and CMISS functions work similarly to the MISSING function described earlier in this article, except they can be used to indicate the presence of missing values in more than one variable at time. Another key difference is that there is a function specific for each variable type: NMISS is used for numeric variables and CMISS is used for character variables.
The inputs for NMISS and CMISS are a list of variables. The List of variables can be provided in a variety of ways, including:
NMISS(var1,var2,var3) or NMISS(of var1-var3)
The output of NMISS is number of missing numeric values found within the range of variables specified. For example, if you provide 3 variables as arguments to NMISS and 2 of those 3 variables contains a missing value in a given record the output of NMISS will be 2. If only 1 variable of the 3 contains a missing value for a given record, the output will be 1. If no variables contain missing values for a given record the output will be zero, and so on.
The arguments, output and behavior for NMISS are identical to that of CMISS, except CMISS is used for character variables as mentioned earlier.
Let’s look at a couple examples where NMISS and CMISS can be used.
Example 1 – Using NMISS to Count Missing Values in Numeric Variables
Using the SASHELP.HEART dataset, you would like to determine the number of missing values in the 3 age variables – AgeCHDdiag, AgeAtStart and AgeAtDeath. To do this, we simply create a new variable, NUM_MISS to store the result of using NMISS on these 3 age variables. To simplify the output, we then keep only the newly created NUM_MISS and the 3 age variables.
Finally, to summarize how often we have either 0, 1, 2 or 3 missing values among these 3 variables, we use PROC FREQ to determine the frequency of NUM_MISS.
First, by looking at the Output Data, you can verify that NUM_MISS has correctly counted when 0, 1, 2 or 3 missing values occur:
By examining the PROC FREQ output, you can see how many records fall into each grouping. Notice that there are no records where all 3 variables are missing:
Example 2 – Using CMISS to Count Missing Values in Character Variables
Using the SASHELP.HEART dataset, you would like to determine the number of missing values in the 4 status variables – Chol_Status, BP_Status, Weight_Status and Smoking_Status. To do this, we simply create a new variable, CHAR_MISS to store the result of using CMISS on these 4 status variables. To simplify the output, we then keep only the newly created CHAR_MISS and the 4 Status age variables.
Finally, to summarize how often we have either 0, 1, 2, 3 or 4 missing values among these 4 variables, we use PROC FREQ to determine the frequency of CHAR_MISS.