I’m trying to run a K-Means Clustering algorithm with numpy and python, but keep running into a memory error if I use larger values of K (anything greater than 10 seems to result in the error). I have two numpy arrays of size [42000,784] (the dataset) and [K,784] (the centroids). The memory error occurs when calculating the euclidean distance between each of the centroids and each of the data points. This is the function I have been using:
def dist(a, b):
    a = a[np.newaxis, :, :]
    b = b[:, np.newaxis, :]
    dist = np.linalg.norm((a - b), axis=2)
    return dist
Is this a memory leak or do I legitimately not have enough memory (I have 8GB)? How can I fix this?
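For context, the broadcasted difference (a - b) in that function has shape [K, 42000, 784], and its size alone explains the error. A quick back-of-the-envelope check (the helper name is just for illustration):

```python
# Size of the broadcasted intermediate (a - b) of shape [K, 42000, 784]
# in float64 (8 bytes per element).
def broadcast_bytes(k, n=42000, d=784, itemsize=8):
    return k * n * d * itemsize

for k in (5, 10, 20):
    print(k, broadcast_bytes(k) / 1e9, "GB")
```

At K = 10 this single temporary is already about 2.6 GB, before np.linalg.norm allocates anything further.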
This is not a memory leak: your broadcasted difference (a - b) materializes a [K, 42000, 784] float64 array, which is about 2.6 GB for K = 10 alone, so you genuinely run out of memory. scipy has built-in functions for distance computations, which are lightning fast compared to home-made implementations. So the first idea is to replace your whole distance function with the following expression:
from numpy.random import randn
from scipy.spatial import distance

# sample data
a = randn(42000, 784)
b = randn(256, 784)

# distance computation
dist = distance.cdist(a, b, metric='euclidean')  # about 8 s on my 8 GB RAM machine
dist in this example is the transpose of your function's result (shape [42000, K] instead of [K, 42000]). If you want the shape from your example, just do

dist = distance.cdist(a, b).T
It is further possible to speed up the computation a little by omitting the square root operation. You may accomplish this with
dist = distance.cdist(a, b, metric='sqeuclidean').
This whole approach does not greatly reduce memory consumption, but it occupies the memory only for a few seconds.
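For completeness, the same squared distances can also be computed in plain numpy without the huge [K, 42000, 784] intermediate, using the expansion ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2. A minimal sketch (function name is my own):

```python
import numpy as np

def sqdist(a, b):
    # ||a_i - b_j||^2 = ||a_i||^2 - 2 a_i.b_j + ||b_j||^2
    # Peak memory is just the [n, k] result plus two small vectors,
    # instead of the full [k, n, d] broadcasted difference.
    aa = (a * a).sum(axis=1)[:, np.newaxis]  # shape [n, 1]
    bb = (b * b).sum(axis=1)[np.newaxis, :]  # shape [1, k]
    d2 = aa - 2.0 * (a @ b.T) + bb           # shape [n, k]
    return np.maximum(d2, 0.0)               # clip tiny negatives from round-off
```

Note the np.maximum guard: floating-point cancellation in this formula can produce slightly negative values, which matters if you later take a square root.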
The second idea is to not use a home-made implementation at all, but a reliable third-party package, like the well-known scikit-learn:
from numpy.random import randn
from sklearn.cluster import KMeans

a = randn(4200, 200)
km = KMeans(n_clusters=256)
km.fit(a)  # about 10 s
One of several advantages of this implementation is that it automatically decides how to compute the distances so that it does not exhaust your memory.
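After fitting, the cluster assignments and centroids are available as attributes. A minimal usage sketch (the array sizes here are arbitrary, and n_init is pinned explicitly since its default has changed across scikit-learn versions):

```python
from numpy.random import randn
from sklearn.cluster import KMeans

a = randn(1000, 50)
km = KMeans(n_clusters=8, n_init=10).fit(a)

labels = km.labels_            # cluster index per sample, shape (1000,)
centers = km.cluster_centers_  # centroid array, shape (8, 50)
```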
Answered By – MaxPowers