SASCRUNCH TRAINING
Predicting Fish Species Using K-nearest Neighbor in SAS

Can you tell that the two fish below are different?
[Pictures: a bream and a perch]

For a human, it is very easy to spot the difference between the two fish.

Bream have a rounded body and are greyish in color. Perch, on the other hand, have a long body and are dark gold in color.

To a machine, distinguishing the two fish species can be a challenging job.
In this article, we will explain how to train a machine to identify the fish species.

We will build a K-nearest neighbor (KNN) model that enables the machine to correctly identify the fish species.
Let’s get started!

Software
Before we continue, make sure you have SAS Studio or SAS 9.4 installed. Don't have the software? SAS Studio is available as a free download.

Data Sets
In this article, we will use the Fish data set from the SASHELP library to illustrate the machine learning model.

The Fish data set contains many species. In this exercise, we will look at only bream and perch. 

In terms of the variables (or features), only Weight and Height will be used.

You can copy and run the code below to create the Fish data set in SAS Studio.
data fish;
  set sashelp.fish;
  /* Keep only the two species and the two features used in this article */
  where species in ('Bream', 'Perch');
  keep species weight height;
run;

Understand the Data

Before we start building the model, we must first understand the data at hand.

Let's run a quick Proc Means to look at the distribution of the data.
proc means data=fish nmiss n mean std min max;
  var weight height;
  class species;
run;

The Proc Means step above computes the summary statistics of the two features that we have:
  • Weight
  • Height

In the Fish data set, we have 35 bream and 56 perch.

Bream, on average, are bigger, with an average weight of 626 grams and an average height of 15.1 cm.

Perch are smaller. Their average weight and height are 382 grams and 7.8 cm, respectively.

We also noticed a missing weight for a bream. This will need to be taken care of.
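
To see exactly which observation is affected, one option is to list the rows where Weight is missing. This is a minimal sketch that assumes the FISH data set created earlier in this article.

```sas
/* List the observations in FISH that have a missing weight */
proc print data=fish;
  where missing(weight);
run;
```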

Visualizing the Data
Visualizing the data usually helps us understand it better.

Let's plot a scatterplot for the fish.
ods graphics on / attrpriority=none;
proc sgplot data=fish;
  scatter y=height x=weight / group=species;
  styleattrs datasymbols=(circlefilled Triangle);
run;

The SGPLOT procedure above plots the fish on a scatterplot. The blue dots represent bream, and the red triangles represent perch.

You can see quite a distinct distribution between the two types of fish.

Cleaning up the data
As mentioned earlier, there is a missing weight for a bream. This could happen in practice because of measurement errors.

The K-nearest neighbor model cannot handle observations with missing values. We will replace the missing value with the average weight of bream.
data fish2;
  set fish;
  /* Impute the missing bream weight with the bream average (626 grams) */
  if species = 'Bream' and missing(weight) then weight = 626;
  keep species weight height;
run;

As discussed earlier, the average weight of a bream is 626 grams, so the missing weight has been replaced with 626.
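
Hard-coding 626 works here, but the replacement value can also be computed from the data. The sketch below is one possible alternative, not the method used in the rest of this article. It assumes PROC STDIZE (part of SAS/STAT) is available, and it writes to hypothetical data sets named FISH_SORTED and FISH2_ALT so that FISH2 is left untouched.

```sas
/* BY-group processing requires the data to be sorted by species */
proc sort data=fish out=fish_sorted;
  by species;
run;

/* REPONLY replaces only the missing values and leaves the others unchanged. */
/* METHOD=MEAN uses the mean as the replacement value, here per species.     */
proc stdize data=fish_sorted out=fish2_alt method=mean reponly;
  by species;
  var weight;
run;
```

If the averages reported above hold, the missing bream weight in FISH2_ALT should come out close to 626.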
We are now ready to build the model.


What is the K-Nearest Neighbor (KNN) Model?
The K-nearest neighbor (KNN) model is a machine learning model that is commonly used to solve classification problems.

It classifies each data point based on the classes of its k nearest neighbors.

Let's run the entire code below in SAS Studio:
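** Illustration **;
/* Keep the mid-weight fish and relabel one fish as 'Unknown' per data set */
data illu1 illu2;
  set fish2;
  where 500 < weight < 800;

  /* The 9th fish is the unknown fish in Graph 1 only */
  if _n_ = 9 then do;
    Species = 'Unknown';
    output illu1;
  end;
  /* The 19th fish is the unknown fish in Graph 2 only */
  else if _n_ = 19 then do;
    Species = 'Unknown';
    output illu2;
  end;
  /* Every other fish appears in both graphs */
  else do;
    output illu1;
    output illu2;
  end;
run;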

proc sort data=illu1; by species; run;
proc sort data=illu2; by species; run;

ods graphics on / attrpriority=none;
proc sgplot data=illu1;
scatter y=height x=weight / group=species;
styleattrs datasymbols=(circlefilled Triangle square );
xaxis min=400 max=850;
yaxis min=8 max=18;
run;

ods graphics on / attrpriority=none;
proc sgplot data=illu2;
scatter y=height x=weight / group=species;
styleattrs datasymbols=(circlefilled Triangle square );
xaxis min=400 max=850;
yaxis min=8 max=18;
run;

The code above creates two graphs.

Graph 1

In this graph, the blue dots and the red triangles represent the bream and perch, respectively. The green square represents an unknown fish.

The unknown fish is either a bream or a perch.

If you were to guess, what would it be?

It would be a bream, right?

Most of its closest points are bream. We would classify this unknown fish as a bream simply because of its proximity to other bream.
Graph 2

In Graph 2, the unknown fish is instead surrounded by perch.

If we were to guess, the fish would be a perch.

But here we have a problem.

How do you determine how many neighbors to look at?

For our example in Graph 2, if we look at only the three closest neighbors, we would conclude that the unknown fish is a perch.

However, when looking at the 20 closest fish, there are more bream (14) than perch (6).

We would then conclude that the unknown fish is a bream instead!

In practice, selecting the optimal "k" (i.e. number of nearest neighbors) is not an easy task.

In this example, we will simplify the process and build the KNN model with k = 3.

This means we will use the three closest neighbors to determine which fish it is.
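
For readers who want to compare several values of k rather than fix k = 3, the sketch below shows one possible approach: run PROC DISCRIM with leave-one-out cross-validation for a few odd values of k and compare the reported error rates. It is an illustration only and is not used in the rest of this article. The macro name TRY_K and its KMAX parameter are hypothetical, and the FISH2 data set is assumed to exist as created earlier.

```sas
%macro try_k(kmax=7);
  %local k;
  /* Odd values of k avoid tied votes between the two species */
  %do k = 1 %to &kmax %by 2;
    title "Leave-one-out cross-validation with k = &k";
    proc discrim data=fish2 method=npar k=&k crossvalidate noclassify short;
      class species;
      var weight height;
    run;
  %end;
  title;
%mend try_k;

%try_k(kmax=7)
```

The value of k with the lowest cross-validation error rate would be a reasonable candidate, keeping in mind that this data set is small.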

Splitting the data into Training and Test Set

The first step in building a machine learning model is to split the data into training and test sets.

The training set is used to build the machine learning model, and the test set is used to check how the model performs.

We will aim for an 80-20 split between the training set and the test set. Each fish is assigned to the training set with a probability of 0.8, so the actual split is only approximately 80-20:
data fish_train fish_test_temp1;
  set fish2;
  rand = ranuni(100);  /* uniform random number between 0 and 1, seed = 100 */
  if rand <= 0.8 then output fish_train;
  else output fish_test_temp1;
run;

/* Number the test fish so that each one can be tracked by NUM */
data fish_test;
  set fish_test_temp1;
  num = _n_;
run;

We have created the FISH_TRAIN data set, which has 66 observations.

The FISH_TEST data set has 25 observations.
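
If an exact 80% sample within each species is preferred over a random draw per fish, PROC SURVEYSELECT is one option. The sketch below is an alternative and is not what the rest of this article uses. It assumes SAS/STAT is available, and the data set names FISH2_SORTED, FISH_SPLIT, FISH_TRAIN_ALT and FISH_TEST_ALT are hypothetical names chosen so that nothing above is overwritten.

```sas
/* The STRATA statement requires the data to be sorted by species */
proc sort data=fish2 out=fish2_sorted;
  by species;
run;

/* OUTALL keeps every fish and adds a 0/1 variable named Selected */
proc surveyselect data=fish2_sorted out=fish_split
                  method=srs samprate=0.8 seed=100 outall;
  strata species;
run;

/* Selected = 1 goes to the training set, everything else to the test set */
data fish_train_alt fish_test_alt;
  set fish_split;
  if selected = 1 then output fish_train_alt;
  else output fish_test_alt;
run;
```

Sampling within each species keeps the proportion of bream and perch roughly the same in both sets.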


Manual Method: Predicting the Fish in the Test Set
To illustrate how the KNN model works, we will first build the model manually.

As discussed earlier, we will test the model using k = 3.

For each of the 25 fish in the test set, we will first find the Euclidean distance to every fish in the training set.

The three closest fish will be used to predict the species.

Let's run the code below in SAS Studio:
proc sql;
  /* Pair every test fish with every training fish and compute the distance */
  create table combine_temp1 as
  select a.num, a.species as species_true,
         b.species as species_neighbor,
         sqrt((a.weight - b.weight)**2 +
              (a.height - b.height)**2
              ) as distance
  from fish_test a, fish_train b
  order by a.num, distance;
quit;

/* Keep the three nearest training fish for each test fish */
data combine;
  set combine_temp1;
  by num distance;
  if first.num then i = 0;
  i + 1;
  if i <= 3;
run;

The code above finds the three closest training fish for each fish in the test set.

If the three closest fish consist of two or more perch, then we predict the fish to be a perch. Otherwise, it would be a bream.

Let's run the code below:
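/* Count the neighbor species for each test fish */
proc freq data=combine noprint;
  table species_neighbor / out = fish_freq;
  by num species_true;
run;

/* Sort so that the most frequent neighbor species comes last within each NUM */
proc sort data=fish_freq; by num count; run;

data fish_freq2;
  set fish_freq;
  by num count;
  if last.num;  /* keep the majority vote */

  if species_true = species_neighbor then match = "Y";
  else match = "N";
run;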

The SPECIES_NEIGHBOR column represents the prediction that we make.

The MATCH column represents whether our prediction matches the correct species of the fish.

Let's look at how our model performs:
proc freq data=fish_freq2;
table species_true*match / norow nocol nopercent;
run;
Out of the 25 fish in the test set, we predicted 22 of them correctly.

We predicted 3 fish incorrectly.
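
The same counts can be turned into a single accuracy figure. This is a minimal sketch that assumes the FISH_FREQ2 data set created above; the column names N_TEST, N_CORRECT and ACCURACY are introduced here for illustration.

```sas
proc sql;
  /* A comparison such as match = 'Y' evaluates to 1 (true) or 0 (false) */
  select count(*)          as n_test,
         sum(match = 'Y')  as n_correct,
         mean(match = 'Y') as accuracy format=percent8.1
  from fish_freq2;
quit;
```

With 22 correct predictions out of 25, the accuracy works out to 88%.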
​

Is this a good model? Not really. We can refine it further.

Feature Scaling
When building a KNN model, it is common to apply feature scaling.

Feature scaling is used to standardize the features so that the Euclidean distance calculation will not be dominated by features with a larger scale.

Let's look at the formula for the Euclidean distance:
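
Written out to match the distance expression used in the Proc SQL code in this article, the Euclidean distance between a test fish a and a training fish b is as follows. The symbols w and h are shorthand introduced here for Weight and Height.

```latex
d(a, b) = \sqrt{(w_a - w_b)^2 + (h_a - h_b)^2}
```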

In our example, the Euclidean distance consists of two components:
  1. The weight difference and
  2. The height difference.

The weight difference is the difference in weight between two fish.

In our example, the fish weights range from 5.9 to 1100 grams, so the difference could be more than 1000 grams.

The heights, on the other hand, range from 2.112 to 19 cm across both species of fish. The maximum height difference is no more than 17 cm.

Because of the wider scale, the weight will dominate the distance calculation.
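
A quick calculation makes this concrete. The sketch below uses made-up differences (100 grams and 5 cm) between two hypothetical fish and writes two distances to the SAS log: one using both features and one using the weight alone.

```sas
data _null_;
  weight_diff = 100;  /* grams, made-up value for illustration */
  height_diff = 5;    /* cm, made-up value for illustration */
  distance = sqrt(weight_diff**2 + height_diff**2);
  put "Distance using both features: " distance 8.2;
  put "Distance using weight only:   " weight_diff 8.2;
run;
```

The two distances are about 100.12 and 100.00, so a 5 cm height difference, which is large for these fish, barely changes the result.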

We will need to rescale the features so that each feature has an equal or similar influence on the Euclidean distance.
proc standard data=fish2 out=fish3 mean=0 std=1;
var weight height;
run;


** Split into train and test **;
data fish_train fish_test_temp1;
set fish3;
rand = ranuni(100);
if rand <= 0.8 then output fish_train;
else output fish_test_temp1;
run;

data fish_test;
set fish_test_temp1;
num = _n_;
run;

The Proc Standard step above rescales the Weight and Height columns so that each feature has a mean of zero and a standard deviation of one.
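
One way to confirm the rescaling is to rerun Proc Means on the standardized data. This is a minimal sketch that assumes the FISH3 data set created above.

```sas
/* Both features should now show a mean near 0 and a standard deviation of 1 */
proc means data=fish3 n mean std maxdec=3;
  var weight height;
run;
```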

With the rescaled data split into training and test sets as shown above, we repeat the steps from the manual method:
proc sql;
  create table combine_temp1 as
  select a.num, a.species as species_true,
         b.species as species_neighbor,
         sqrt((a.weight - b.weight)**2 +
              (a.height - b.height)**2
              ) as distance
  from fish_test a, fish_train b
  order by a.num, distance;
quit;

data combine;
  set combine_temp1;
  by num distance;
  if first.num then i = 0;
  i + 1;
  if i <= 3;
run;

proc freq data=combine noprint;
  table species_neighbor / out = fish_freq;
  by num species_true;
run;

proc sort data=fish_freq; by num count; run;

data fish_freq2;
  set fish_freq;
  by num count;
  if last.num;

  if species_true = species_neighbor then match = "Y";
  else match = "N";
run;

proc freq data=fish_freq2;
  table species_true*match;
run;
We now have 100% accuracy!

Our prediction matches the actual fish species for all 25 fish in the test set.

Of course, when solving real-life machine learning problems, it is very rare to get 100% accuracy.

Our fish species have very distinct features that allow the machine to predict accurately.

Real-life data is usually messier to deal with.

Proc Discrim Method: Predicting the Fish in the Test Set

In practice, you don't have to write the code manually just to build the model.

The built-in Proc Discrim can be used to run a KNN model with just a few lines of code.

Let's look at the example below:
proc discrim data = fish_train test = fish_test
  testout = _score1 method = npar k = 3 testlist
;
  class species;
  var weight height;
run; 

The Proc Discrim step does the prediction automatically.

It also computes the error rate to give you a sense of how the model performs.
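
One detail worth knowing when comparing Proc Discrim with the manual method: to the best of our understanding, Proc Discrim measures distance with the Mahalanobis metric by default (METRIC=FULL), not plain Euclidean distance, so the neighbors it picks can differ. The sketch below requests METRIC=IDENTITY, which corresponds to Euclidean distance, and then cross-tabulates the actual species against the predicted species. The output data set name _SCORE2 is a hypothetical name chosen so that _SCORE1 is not overwritten.

```sas
/* Rerun the KNN model with Euclidean distance and suppress the printed output */
proc discrim data=fish_train test=fish_test
  testout=_score2 method=npar k=3 metric=identity noprint;
  class species;
  var weight height;
run;

/* _INTO_ holds the species predicted for each test fish */
proc freq data=_score2;
  tables species*_into_ / norow nocol nopercent;
run;
```

Fish on the diagonal of the resulting table are predicted correctly; any off-diagonal counts are misclassifications.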

The KNN model is great at solving simple classification problems.

In the next machine learning article, we will look at how to use the KNN model to predict the Titanic survivors.


 

Copyright © 2012-2019 SASCrunch.com All rights reserved.