NumPy matrix multiplication issue with 20 elements

Issue

I am using a matrix multiplication to encode the positions of True and False values in an array into a single integer; this is necessary because I cannot use a for loop (I have thousands of records).
The procedure is the following:

import numpy as np
# Create a test array
test_array = np.array([[False, True, False, False, False, True]])
# Create a set of unique "tens", each one identifying a position
uniq_tens = [10 ** (i) for i in range(0, test_array.shape[1])]
# Multiply the matrix
print(int(np.dot(test_array, uniq_tens)[0]))
100010

The 100010 must be read from right to left (0=False, 1=True, 0=False, 0=False, 0=False, 1=True). Everything works fine except when test_array has 20 elements.
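For reference, the encoded number can be turned back into the boolean flags by reading its decimal digits from right to left (a small illustrative sketch; decode_flags is just a hypothetical helper, not part of the original code):

def decode_flags(encoded, n):
    # Digit i (counting from the right) is 1 if position i was True.
    return [bool((encoded // 10 ** i) % 10) for i in range(n)]

print(decode_flags(100010, 6))
# [False, True, False, False, False, True]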

# This works fine - Test with 21 elements
test_array = np.array([[False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True]])
print(test_array.shape[1])
uniq_tens = [10 ** (i) for i in range(0, test_array.shape[1])]
print(int(np.dot(test_array, uniq_tens)[0]))
21
111000000000000000010

# This works fine - Test with 19 elements
test_array = np.array([[False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True]])
print(test_array.shape[1])
uniq_tens = [10 ** (i) for i in range(0, test_array.shape[1])]
print(int(np.dot(test_array, uniq_tens)[0]))
19
1000000000000000010

# This does not work - Test with 20 elements
test_array = np.array([[False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True]])
print(test_array.shape[1])
uniq_tens = [10 ** (i) for i in range(0, test_array.shape[1])]
print(int(np.dot(test_array, uniq_tens)[0]))
20
10000000000000000000

I tested with NumPy versions 1.16.4, 1.19.4 and 1.19.5. Could you please help me understand why? I am worried it may happen with other numbers too, not only 20.

Thanks a lot for your help!

Solution

Explanation

You are hitting the int64 limit:

print(len(str(2 ** (64 - 1))))
# 19

when computing uniq_tens, which causes data type problems in conjunction with np.dot() when the inputs have mixed data types.
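A quick way to see why 20 elements is special: with 20 powers of ten, the largest one, 10 ** 19, no longer fits into int64 (a minimal check, using np as imported above):

# 10 ** 19 is the largest power used with 20 elements
print(10 ** 19 > np.iinfo(np.int64).max)    # True: it exceeds int64
print(10 ** 19 <= np.iinfo(np.uint64).max)  # True: it still fits in uint64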


More precisely, what happens here is that:

  1. the content of uniq_tens consists of Python ints, which have arbitrary precision
  2. when you call np.dot(), the uniq_tens list is converted to a NumPy array with an unspecified data type (see the quick check after this list)
    • as long as the maximum value does not exceed np.iinfo(np.int64).max, the dtype is inferred to be int64
    • when the maximum value is between np.iinfo(np.int64).max and np.iinfo(np.uint64).max, the dtype is inferred to be uint64
    • when the maximum value is above that, the array retains the Python objects and falls back to arbitrary precision
  3. there may be an extra cast in np.dot() if the inputs are of mixed dtype; in the case of np.bool_ and np.uint64 the inferred common type is np.float64
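As a quick check of point 2, one can inspect the dtype NumPy infers for uniq_tens at each of the three lengths (a small sketch; the printed dtypes assume the NumPy 1.x default inference rules described above):

for n in (19, 20, 21):
    uniq_tens = [10 ** i for i in range(n)]
    print(n, np.array(uniq_tens).dtype)
# 19 int64   (all values fit in int64)
# 20 uint64  (10 ** 19 only fits in uint64)
# 21 object  (10 ** 20 exceeds uint64, falls back to Python ints)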

Now:

max_int64 = np.iinfo(np.int64).max
print(max_int64, len(str(max_int64)))
# 9223372036854775807 19

max_uint64 = np.iinfo(np.uint64).max
print(max_uint64, len(str(max_uint64)))
# 18446744073709551615 20

print(repr(np.array([max_int64])))
# array([9223372036854775807])
print(repr(np.array([max_uint64])))
# array([18446744073709551615], dtype=uint64)
print(repr(np.array([max_uint64 + 1])))
# array([18446744073709551616], dtype=object)

So, with up to 19 elements, and with 21 or more, everything works well.
With exactly 20 elements, the list does get converted to uint64.
However, when np.dot() then combines this uint64 input with the other operand's dtype (int64 in the examples below, np.bool_ in the original code), the common type it settles on is np.float64, so everything is cast to that:

print(np.dot([1], [max_int64]))
# 9223372036854775807
print(np.dot([1], [max_uint64]))
# 1.8446744073709552e+19
print(np.dot([1], [max_uint64 + 1]))
# 18446744073709551616
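This matches NumPy's promotion rule for mixed 64-bit signed and unsigned integers, which can be inspected directly (a quick check, assuming NumPy's default promotion rules):

# no integer dtype can hold both int64 and uint64, so the pair promotes to float64
print(np.result_type(np.int64, np.uint64))
# float64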

Instead, when you start with something that is already uint64, it keeps using that dtype:

print(np.dot(np.array([1], dtype=np.uint64), [max_uint64]))
# 18446744073709551615
print(np.dot(np.array([4321], dtype=np.uint64), [max_uint64]))
# 18446744073709547295  # wrong result (uint64 overflow wrap-around)

which has its own overflow problems.
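For reference, the wrong result above is exactly the uint64 wrap-around (arithmetic modulo 2 ** 64), which can be reproduced in plain Python:

max_uint64 = 18446744073709551615  # np.iinfo(np.uint64).max

print((4321 * max_uint64) % 2 ** 64)
# 18446744073709547295  (the same wrong result as np.dot() above)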


Mitigation

One way to make sure the above code works all the time is to force the dtype of uniq_tens to object:

import numpy as np


test_array = np.array([[0, 1] + [0] * 17 + [1]])
uniq_tens = np.array([(10 ** i) for i in range(test_array.shape[1])], dtype=object)

print(test_array.shape[1], int(np.dot(test_array, uniq_tens)[0]))
# 20 10000000000000000010
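The same fix should also work when test_array holds actual booleans, as in the original question, since the object-dtype powers keep the computation in Python's arbitrary-precision integers (a small check under the same assumptions):

test_array = np.array([[False, True] + [False] * 17 + [True]])  # 20 booleans
uniq_tens = np.array([(10 ** i) for i in range(test_array.shape[1])], dtype=object)

print(int(np.dot(test_array, uniq_tens)[0]))
# 10000000000000000010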

Other approaches

If we are after the fastest way of computing the integer with a specific base, a number of approaches can be devised:

import numpy as np
import numba as nb


def bools_to_int(arr, base=2):
    return sum(base ** i for i, x in enumerate(arr.tolist()) if x)


def bools_to_int_dot(arr, base=2):
    pows = np.array([base ** i for i in range(len(arr))], dtype=object)
    return np.dot(arr, pows)


def bools_to_int_mul_sum(arr, base=2):
    pows = np.array([base ** i for i in range(len(arr))], dtype=object)
    return np.sum(arr * pows)


@nb.njit
def bools_to_int_nb(arr, base=2):
    n = arr.size
    result = 0
    for i in range(n):
        if arr[i]:
            result += base ** i
    return result
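For instance, these can be applied to the problematic 20-element case from the question by passing base=10 (a usage sketch relying on the functions defined above):

arr = np.array([False, True] + [False] * 17 + [True])  # 20 booleans

print(bools_to_int(arr, base=10))
# 10000000000000000010
print(bools_to_int_dot(arr, base=10))
# 10000000000000000010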

The looped approach can also be accelerated with Cython:

%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True

# cimport numpy as cnp
# cimport cython as ccy

# import numpy as np
# import cython as cy


cpdef bools_to_int_cy(arr, base=2):
    cdef long n = arr.size
    result = 0
    for i in range(n):
        if arr[i]:
            result += base ** i
    return result

Note that the bools_to_int_nb() approach will only work up to the int64 limit.
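Concretely, for base 2 the result stays exact only while it fits in int64, i.e. for at most 63 positions; beyond that the additions silently wrap around (a small check relying on bools_to_int_nb() from above):

arr_63 = np.ones(63, dtype=np.bool_)
arr_64 = np.ones(64, dtype=np.bool_)

print(bools_to_int_nb(arr_63))  # 9223372036854775807 == np.iinfo(np.int64).max, still exact
print(bools_to_int_nb(arr_64))  # silently wraps around int64, no longer correct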

Since the power operation is one of the most expensive parts of this computation, the powers can be precomputed to further speed up repeated calls:

MAX_PRE_VAL = 256
BASES = list(range(2, 16))
POWS = {
    b: np.array([b ** i for i in range(MAX_PRE_VAL)])
    for b in BASES}


def bools_to_int_pre(arr, base=2, pows=POWS):
    return sum(pows[base][i] for i, x in enumerate(arr.tolist()) if x)


def bools_to_int_dot_pre(arr, base=2, pows=POWS):
    return np.dot(arr, pows[base][:len(arr)])


def bools_to_int_mul_sum_pre(arr, base=2, pows=POWS):
    return np.sum(arr * pows[base][:len(arr)])
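Note that, because the precomputed powers grow far beyond uint64 (up to base ** 255), NumPy stores them with dtype object, which is what keeps the *_pre variants exact (a quick check):

print(POWS[2].dtype, POWS[10].dtype)
# object object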

It is easy to see that all methods produce the same result (except for bools_to_int_nb(), with the limitations already noted):

funcs = (
    bools_to_int, bools_to_int_pre,
    bools_to_int_dot, bools_to_int_dot_pre,
    bools_to_int_mul_sum, bools_to_int_mul_sum_pre,
    bools_to_int_cy, bools_to_int_nb)


rng = np.random.default_rng(42)
arr = rng.integers(0, 2, 112)
for func in funcs:
    print(f"{func.__name__!s:>32}  {func(arr)}")
                    bools_to_int  3556263535586786347937292461931686
                bools_to_int_pre  3556263535586786347937292461931686
                bools_to_int_dot  3556263535586786347937292461931686
            bools_to_int_dot_pre  3556263535586786347937292461931686
            bools_to_int_mul_sum  3556263535586786347937292461931686
        bools_to_int_mul_sum_pre  3556263535586786347937292461931686
                 bools_to_int_cy  3556263535586786347937292461931686
                 bools_to_int_nb  -4825705174627124058

With the following code to produce the timings:

rng = np.random.default_rng(42)


timings = {}
k = 16
for n in range(1, 128, 3):
    arrs = rng.integers(0, 2, (k, n))
    print(f"n = {n}")
    timings[n] = []
    base = [funcs[0](arr) for arr in arrs]
    for func in funcs:
        res = [func(arr) for arr in arrs]
        is_good = base == res
        timed = %timeit -r 8 -n 16 -q -o [func(arr) for arr in arrs]
        timing = timed.best * 1e6 / k
        timings[n].append(timing if is_good else None)
        print(f"{func.__name__:>24}  {is_good}  {timing:10.3f} µs")

to be plotted with:

import pandas as pd


df = pd.DataFrame(data=timings, index=[func.__name__ for func in funcs]).transpose()
df.plot(marker='o', xlabel='Input size / #', ylabel='Best timing / µs', figsize=(10, 8))

[Benchmark plot: best timing per call (µs) vs. input size for each approach]

This shows that, before hitting the int64 limit, the bools_to_int_nb() approach is by far the fastest.
Conversely, for larger inputs, np.dot() with pre-computed powers is the fastest.
Without pre-computation, simple manual looping is the fastest, and the Cython acceleration provides a small but appreciable speed-up.

Beware that the power-of-2 case can probably be optimized further.

Answered By – norok2

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
