# Numpy matrix multiplication issue with 20 elements

## Issue

I am using a matrix multiplication method to retrieve the positions of True and False values in an array; this is necessary because I cannot use a for loop (I have thousands of records).
The procedure is the following:

``````
import numpy as np
# Create a test array
test_array = np.array([[False, True, False, False, False, True]])
# Create a set of unique "tens", each one identifying a position
uniq_tens = [10 ** (i) for i in range(0, test_array.shape[1])]
# Multiply the matrix
print(int(np.dot(test_array, uniq_tens)[0]))
100010
``````

The 100010 must be read from right to left (0=False, 1=True, 0=False, 0=False, 0=False, 1=True). Everything works fine except when test_array has 20 elements.

``````
# This works fine - Test with 21 elements
test_array = np.array([[False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True]])
print(test_array.shape[1])
uniq_tens = [10 ** (i) for i in range(0, test_array.shape[1])]
print(int(np.dot(test_array, uniq_tens)[0]))
21
111000000000000000010

# This works fine - Test with 19 elements
test_array = np.array([[False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True]])
print(test_array.shape[1])
uniq_tens = [10 ** (i) for i in range(0, test_array.shape[1])]
print(int(np.dot(test_array, uniq_tens)[0]))
19
1000000000000000010

# This does not work - Test with 20 elements
test_array = np.array([[False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True]])
print(test_array.shape[1])
uniq_tens = [10 ** (i) for i in range(0, test_array.shape[1])]
print(int(np.dot(test_array, uniq_tens)[0]))
20
10000000000000000000
``````

I tested with numpy versions 1.16.4, 1.19.4 and 1.19.5. Could you please help me understand why? I am worried it can also happen with numbers other than 20.

Thanks a lot for your help!

# Explanation

You are hitting the `int64` limit:

``````
print(len(str(2 ** (64 - 1))))
# 19
``````

when computing `uniq_tens`, which causes data type problems in conjunction with `np.dot()` on mixed data type inputs.

More precisely, what happens here is that:

1. the content of `uniq_tens` consists of Python `int`s, which have arbitrary precision;
2. when you call `np.dot()`, the `uniq_tens` list is converted to a NumPy array with an unspecified data type:
   - when the maximum value is at most `np.iinfo(np.int64).max`, the dtype is inferred to be `int64`;
   - when the maximum value is between `np.iinfo(np.int64).max` and `np.iinfo(np.uint64).max`, the dtype is inferred to be `uint64`;
   - when the maximum value is above that, the array retains Python objects and falls back to arbitrary precision;
3. there may be an extra cast in `np.dot()` if the inputs have mixed dtypes; in the case of `np.bool_` and `np.uint64`, the inferred common type is `np.float64`.
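The surprising promotion to floating point can be checked directly with `np.result_type()`; for instance, no 64-bit integer dtype can represent both ranges of `int64` and `uint64`, so NumPy promotes that pair to `float64`:

```python
import numpy as np

# No integer dtype can hold both int64 negatives and uint64 values
# above 2 ** 63 - 1, so NumPy promotes the pair to float64.
print(np.result_type(np.int64, np.uint64))
# float64
```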

Now:

``````
max_int64 = np.iinfo(np.int64).max
print(max_int64, len(str(max_int64)))
# 9223372036854775807 19

max_uint64 = np.iinfo(np.uint64).max
print(max_uint64, len(str(max_uint64)))
# 18446744073709551615 20

print(repr(np.array([max_int64])))
# array([9223372036854775807])
print(repr(np.array([max_uint64])))
# array([18446744073709551615], dtype=uint64)
print(repr(np.array([max_uint64 + 1])))
# array([18446744073709551616], dtype=object)
``````

So, with up to 19 elements (largest power `10 ** 18`) and with 21 or more (largest power at least `10 ** 20`, which forces `object` dtype), everything works well.
With exactly 20 elements, the largest power `10 ** 19` only fits in `uint64`, so the list converts to `uint64`.
However, `np.dot()` then has to combine the boolean input with the `uint64` powers, and since neither `int64` nor `uint64` can safely hold such a mix, it casts everything to `np.float64`:

``````
print(np.dot([1], [max_int64]))
# 9223372036854775807
print(np.dot([1], [max_uint64]))
# 1.8446744073709552e+19
print(np.dot([1], [max_uint64 + 1]))
# 18446744073709551616
``````

``````
print(np.dot(np.array([1], dtype=np.uint64), [max_uint64]))
# 18446744073709551615
print(np.dot(np.array([4321], dtype=np.uint64), [max_uint64]))
# 18446744073709547295  # wrong result: wrapped around modulo 2 ** 64
``````

because pure `uint64` arithmetic has its own overflow problem: results silently wrap around modulo `2 ** 64`.
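To see why the trailing `10` disappears in the original 20-element example: `float64` has a 53-bit mantissa, and near `10 ** 19` the gap between consecutive representable values is 2048, so adding 10 is simply lost:

```python
# Near 1e19 a float64 can only resolve steps of 2 ** 11 == 2048,
# so 10 ** 19 + 10 rounds back down to 10 ** 19.
print(float(10 ** 19 + 10) == float(10 ** 19))
# True
print(int(float(10 ** 19 + 10)))
# 10000000000000000000
```

This is exactly the wrong value observed in the question.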

# Mitigation

One way to make sure the above code works all the time is to force the `dtype` of `uniq_tens` to `object`:

``````
import numpy as np

test_array = np.array([[0, 1] + [0] * 17 + [1]])
uniq_tens = np.array([(10 ** i) for i in range(test_array.shape[1])], dtype=object)

print(test_array.shape[1], int(np.dot(test_array, uniq_tens)[0]))
# 20 10000000000000000010
``````
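Since the computed number is really just a positional encoding with digits 0 and 1, an alternative mitigation (a sketch, not the method from the question) is to build the digit string directly from the boolean mask, avoiding numeric dtypes altogether:

```python
import numpy as np

test_array = np.array([[False, True] + [False] * 17 + [True]])
# Reverse so that the first element becomes the rightmost digit,
# then let Python's arbitrary-precision int parse the digit string.
digits = ''.join('1' if x else '0' for x in test_array[0][::-1])
print(test_array.shape[1], int(digits))
# 20 10000000000000000010
```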

# Other approaches

If we are after the fastest way of computing the integer with a specific base, a number of approaches can be devised:

``````
import numpy as np
import numba as nb

def bools_to_int(arr, base=2):
    return sum(base ** i for i, x in enumerate(arr.tolist()) if x)

def bools_to_int_dot(arr, base=2):
    pows = np.array([base ** i for i in range(len(arr))], dtype=object)
    return np.dot(arr, pows)

def bools_to_int_mul_sum(arr, base=2):
    pows = np.array([base ** i for i in range(len(arr))], dtype=object)
    return np.sum(arr * pows)

@nb.njit
def bools_to_int_nb(arr, base=2):
    n = arr.size
    result = 0
    for i in range(n):
        if arr[i]:
            result += base ** i
    return result
``````
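For example, with `base=10` the pure-Python variant reproduces the failing 20-element case from the question exactly (the definition is repeated here to keep the snippet self-contained):

```python
import numpy as np

# Same as bools_to_int() above, repeated for a runnable snippet.
def bools_to_int(arr, base=2):
    return sum(base ** i for i, x in enumerate(arr.tolist()) if x)

arr = np.array([False, True] + [False] * 17 + [True])
print(bools_to_int(arr, base=10))
# 10000000000000000010
```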

The looped approach can also be accelerated with Cython:

``````
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True

# cimport numpy as cnp
# cimport cython as ccy

# import numpy as np
# import cython as cy

cpdef bools_to_int_cy(arr, base=2):
    cdef long n = arr.size
    result = 0
    for i in range(n):
        if arr[i]:
            result += base ** i
    return result
``````

Note that the `bools_to_int_nb()` approach only works up to the `int64` limit, because Numba uses fixed-size integers.

Since the power operation is one of the most expensive in this computation, the powers can be precomputed to further speed up multiple calls:

``````
MAX_PRE_VAL = 256
BASES = list(range(2, 16))
POWS = {
    b: np.array([b ** i for i in range(MAX_PRE_VAL)])
    for b in BASES}

def bools_to_int_pre(arr, base=2, pows=POWS):
    return sum(pows[base][i] for i, x in enumerate(arr.tolist()) if x)

def bools_to_int_dot_pre(arr, base=2, pows=POWS):
    return np.dot(arr, pows[base][:len(arr)])

def bools_to_int_mul_sum_pre(arr, base=2, pows=POWS):
    return np.sum(arr * pows[base][:len(arr)])
``````

It is easy to see that all methods produce the same result (except for `bools_to_int_nb()`, with the limitations already noted):

``````
funcs = (
    bools_to_int, bools_to_int_pre,
    bools_to_int_dot, bools_to_int_dot_pre,
    bools_to_int_mul_sum, bools_to_int_mul_sum_pre,
    bools_to_int_cy, bools_to_int_nb)

rng = np.random.default_rng(42)
arr = rng.integers(0, 2, 112)
for func in funcs:
    print(f"{func.__name__!s:>32}  {func(arr)}")
``````
``````
                    bools_to_int  3556263535586786347937292461931686
                bools_to_int_pre  3556263535586786347937292461931686
                bools_to_int_dot  3556263535586786347937292461931686
            bools_to_int_dot_pre  3556263535586786347937292461931686
            bools_to_int_mul_sum  3556263535586786347937292461931686
        bools_to_int_mul_sum_pre  3556263535586786347937292461931686
                 bools_to_int_cy  3556263535586786347937292461931686
                 bools_to_int_nb  -4825705174627124058
``````
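The Numba result is not random garbage: it is the exact value reduced modulo `2 ** 64` and reinterpreted as a signed 64-bit integer, because `int64` additions and powers silently wrap around inside the JIT-compiled loop. This can be verified with a short computation (the exact value is the reference result printed above):

```python
# Reduce the exact result modulo 2 ** 64 and map it into the
# signed int64 range, mimicking the wraparound in the JIT loop.
exact = 3556263535586786347937292461931686
wrapped = exact % 2 ** 64
if wrapped >= 2 ** 63:
    wrapped -= 2 ** 64
print(wrapped)
# -4825705174627124058
```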

With the following code to produce the timings:

``````
rng = np.random.default_rng(42)

timings = {}
k = 16
for n in range(1, 128, 3):
    arrs = rng.integers(0, 2, (k, n))
    print(f"n = {n}")
    timings[n] = []
    base = [funcs[0](arr) for arr in arrs]
    for func in funcs:
        res = [func(arr) for arr in arrs]
        is_good = base == res
        timed = %timeit -r 8 -n 16 -q -o [func(arr) for arr in arrs]
        timing = timed.best * 1e6 / k
        timings[n].append(timing if is_good else None)
        print(f"{func.__name__:>24}  {is_good}  {timing:10.3f} µs")
``````

to be plotted with:

``````
import pandas as pd

df = pd.DataFrame(data=timings, index=[func.__name__ for func in funcs]).transpose()
df.plot(marker='o', xlabel='Input size / #', ylabel='Best timing / µs', figsize=(10, 8))
``````

Showing that, before hitting the `int64` limit, the `bools_to_int_nb()` approach is by far the fastest.
Conversely, for larger inputs, `np.dot()` with pre-computed powers is the fastest.
Without pre-computation, simple manual looping is the fastest, and the Cython acceleration provides a small but appreciable speed-up.

Beware that the base-2 case can probably be optimized further.
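For instance, for base 2 one can avoid computing powers entirely by packing the booleans into bytes (a sketch under that assumption; `bools_to_int_bits()` is a name made up here):

```python
import numpy as np

def bools_to_int_bits(arr):
    # Reverse so the first element ends up as the least significant
    # bit, pack 8 booleans per byte (MSB first), then let Python
    # assemble the arbitrary-precision integer from the raw bytes.
    packed = np.packbits(arr[::-1].astype(np.uint8))
    # packbits zero-pads the last byte, so shift the padding away.
    return int.from_bytes(packed.tobytes(), 'big') >> (-arr.size % 8)

arr = np.array([True, False, True])  # 2 ** 0 + 2 ** 2
print(bools_to_int_bits(arr))
# 5
```

Unlike the fixed-width Numba loop, this stays exact for arbitrarily long inputs, since the final integer is assembled with Python's arbitrary precision.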