K Nearest Neighbors

[Image: Fred Rogers, famous for asking people to be his neighbor]

K Nearest Neighbors (KNN) is a supervised machine learning method that "memorizes" (stores) an entire dataset, then relies on the concepts of proximity and similarity to make predictions about new data. The basic idea is that if a new data point is in some sense "close" to existing data points, its value is likely to be similar to the values of its neighbors. In the Earth Systems Sciences, such techniques can be useful for small- to moderate-scale classification and regression problems; one example uses KNN techniques to derive local-scale information about precipitation and temperature from regional- or global-scale numerical weather prediction model output.

When using a KNN algorithm, you select the number of "neighbors" to consider (K) and, optionally, a metric for calculating the "distance" between data points. KNN algorithms can be used for both classification and regression problems. For regression problems, KNN predicts the target variable by averaging the values of the nearest neighbors. For classification problems, it takes the mode (majority vote) of the nearest neighbors' labels; to avoid ties in binary classification, it is generally recommended that K be an odd number. Effective use of KNN often requires some experimentation to determine the best value for K.
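As a minimal sketch of both modes, here is KNN classification and regression with scikit-learn on synthetic data (the data, and the choice of K=5, are arbitrary illustrations, not recommendations):

```python
# Sketch: KNN classification (mode of neighbors) and regression
# (mean of neighbors) with scikit-learn, on synthetic 2-D data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))                  # 100 samples, 2 features
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)  # binary labels
y_reg = 2.0 * X[:, 0] + X[:, 1]                # continuous target

# Classification: predict the majority label among the 5 nearest points
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y_class)
class_pred = clf.predict([[1.0, 1.0]])

# Regression: predict the mean target value of the 5 nearest points
reg = KNeighborsRegressor(n_neighbors=5).fit(X, y_reg)
reg_pred = reg.predict([[1.0, 1.0]])
```

In practice you would tune `n_neighbors` (for example with cross-validation) rather than fixing it at 5, and `KNeighborsClassifier`/`KNeighborsRegressor` both accept a `metric` argument if Euclidean distance is not appropriate for your features.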

[Image: Comparing the decision boundary between using 1 neighbor vs. 20, from Kevin Zakka's blog]

KNN is sometimes called a "lazy learning" method because it does not build an explicit model, but rather memorizes the dataset in its entirety. While scikit-learn's KNN estimators expose a .fit() method, it does little more than store the training data; it exists largely for consistency with the rest of the scikit-learn API.
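A toy from-scratch version makes the "lazy" part concrete: fit() only stores the data, and all the distance computation and voting happens at prediction time. (This `LazyKNNClassifier` is a hypothetical illustration, not scikit-learn's implementation, which uses faster neighbor-search structures than this brute-force loop.)

```python
import numpy as np

class LazyKNNClassifier:
    """Toy classifier illustrating why KNN is a 'lazy learner'."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" builds no model at all -- it only stores the dataset.
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        # All the real work happens here, at prediction time.
        preds = []
        for x in np.asarray(X, dtype=float):
            dists = np.linalg.norm(self.X_ - x, axis=1)     # distance to every stored point
            nearest = self.y_[np.argsort(dists)[: self.k]]  # labels of the k closest
            values, counts = np.unique(nearest, return_counts=True)
            preds.append(values[np.argmax(counts)])         # majority vote
        return np.array(preds)

# Two well-separated clusters: class 0 near the origin, class 1 near (5, 5)
model = LazyKNNClassifier(k=3).fit(
    [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]],
    [0, 0, 0, 1, 1, 1],
)
preds = model.predict([[0.2, 0.2], [5.1, 5.1]])
```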

Why you might use KNN for your ML project

  1. It's simple. Because KNN is a lazy learner, there is no complex model and only limited math is needed to understand the inner workings.
  2. It's adaptable to different data distributions. KNN makes no assumptions about how the data are distributed, so it can handle oddly shaped or irregular distributions.
  3. It's good for smaller datasets. Because no model is being constructed, KNNs can be a good choice for smaller datasets.

Some Downsides to KNN

  1. It's sensitive to outliers and poor feature selection. Unlike decision tree models, KNN performs no automatic feature selection, and distance-based models can struggle in high-dimensional spaces, both with a large number of input features and with outliers within those features.
  2. It has a relatively high computational cost. While the analog/sample-matching behavior of KNNs is great from an explainability point of view (model-free ML is great!), for large datasets the cost of storing the entire dataset and computing distances to every stored point at prediction time can be enormous.
  3. It needs a complete dataset. Like many other ML models, KNNs do not handle missing data or NaN (Not a Number) values. If your dataset is not complete, you'll need to impute the missing values before using a KNN.
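On the last point, one common approach is to chain an imputer in front of the KNN estimator. Here is a small sketch using scikit-learn's SimpleImputer with mean imputation inside a Pipeline (the tiny dataset and K=3 are illustrative assumptions only):

```python
# Sketch: mean-impute missing values before fitting KNN.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Training data with a NaN that KNN cannot handle directly
X = np.array([[1.0, 2.0], [2.0, np.nan], [8.0, 9.0], [9.0, 8.0]])
y = np.array([0, 0, 1, 1])

# Step 1 fills each NaN with its column mean; step 2 fits KNN as usual
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    KNeighborsClassifier(n_neighbors=3),
)
model.fit(X, y)
pred = model.predict([[8.5, 8.5]])
```

Mean imputation is only one option; the right strategy depends on why values are missing in your dataset.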

KNNs have been discussed previously on MetPy Mondays here: MetPy Mondays #183 - Predicting Rain with Machine Learning - Using KNN

KNNs are a great supervised ML model to try out if your dataset is on the smaller side. Happy modeling! What ML model should I cover in an upcoming blog?

More reading and resources

Thomas Martin is an AI/ML Software Engineer at the NSF Unidata Program Center. Have questions? Contact support-ml@unidata.ucar.edu or book an office hours meeting with Thomas on his Calendar.

News@Unidata
News and information from the Unidata Program Center