# Memory Error Numpy/Python Euclidean Distance

## Issue

I’m trying to run a K-Means clustering algorithm with NumPy and Python, but I keep running into a memory error for larger values of K (anything greater than 10 seems to trigger it). I have two NumPy arrays of shape (42000, 784) (the dataset) and (K, 784) (the centroids). The memory error occurs when computing the Euclidean distance between each centroid and each data point. This is the function I have been using:

```python
import numpy as np

def dist(a, b):
    # broadcast to shape (K, 42000, 784), then reduce over the feature axis
    a = a[np.newaxis, :, :]
    b = b[:, np.newaxis, :]
    return np.linalg.norm(a - b, axis=2)
```

Is this a memory leak or do I legitimately not have enough memory (I have 8GB)? How can I fix this?

## Solution

`scipy` has built-in functions for distance computations, which are lightning fast compared to home-made implementations.
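For context on why the broadcast version runs out of memory: the intermediate `a - b` array has shape (K, 42000, 784) in float64 (8 bytes per element), so its size grows linearly with K. A quick back-of-the-envelope calculation, using the array sizes from the question:

```python
# Rough size of the broadcast intermediate (a - b),
# which has shape (K, 42000, 784) in float64 (8 bytes per element).
def broadcast_bytes(k, n=42000, d=784, itemsize=8):
    return k * n * d * itemsize

print(broadcast_bytes(11) / 1e9)   # ~2.9 GB just for the difference array
print(broadcast_bytes(256) / 1e9)  # ~67 GB -- far beyond 8 GB of RAM
```

With the output of `np.linalg.norm` and other temporaries on top of that, K values just above 10 already push an 8 GB machine over the edge.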

So, the first idea is to replace your whole `dist` function by the following expression:

```python
import numpy as np
from scipy.spatial import distance

# sample data
a = np.random.randn(42000, 784)
b = np.random.randn(256, 784)

# distance computation (about 8.02 s on my 8 GB RAM machine)
dist = distance.cdist(a, b, metric='euclidean')
```

Note that `dist` here is transposed relative to your example: `cdist(a, b)` has shape (42000, K), whereas your function returns (K, 42000). If you want to match the shape of your example, just do `dist = distance.cdist(a, b).T`.

It is further possible to speed up the computation a little by omitting the square root operation. You may accomplish this with `dist = distance.cdist(a, b, metric='sqeuclidean')`.
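The square root can be dropped because squaring is monotonic for non-negative values, so nearest-centroid assignments (the argmin over centroids) are unchanged. A tiny sketch with made-up points to illustrate:

```python
import numpy as np
from scipy.spatial import distance

# 'sqeuclidean' is exactly the square of 'euclidean', so the
# argmin along the centroid axis picks the same centroid either way.
a = np.array([[0.0, 0.0], [3.0, 4.0]])
b = np.array([[0.0, 0.0], [6.0, 8.0]])

eu = distance.cdist(a, b, metric='euclidean')
sq = distance.cdist(a, b, metric='sqeuclidean')

assert np.allclose(sq, eu ** 2)
assert (sq.argmin(axis=1) == eu.argmin(axis=1)).all()
```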

This approach does not greatly reduce peak memory consumption, but it only holds the memory for a few seconds.

The second idea is to not use home-made implementations at all, but a reliable third-party package such as the well-known `scikit-learn`:

```python
import numpy as np
from sklearn.cluster import KMeans

a = np.random.randn(42000, 784)

km = KMeans(n_clusters=256)
km.fit(a)
```
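After fitting, the quantities you were computing by hand are exposed as estimator attributes: `cluster_centers_` holds the centroid array and `labels_` the cluster index of each sample. A minimal sketch on a smaller stand-in dataset (the sizes here are illustrative, not from the question):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
a = rng.standard_normal((1000, 20))  # small stand-in for the 42000x784 dataset

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(a)

print(km.cluster_centers_.shape)  # (8, 20) -- one centroid per cluster
print(km.labels_.shape)           # (1000,) -- cluster index per sample
```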