This semester I’m taking a machine learning course, in which I do my assignments in Python. The main package I’ve used so far is NumPy, so I decided to write this post to record the mistakes I’ve made and the things I’ve learnt.

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Examples:

a = np.arange(0, 5)               # np.arange takes start/stop arguments, not a list
b1 = (a > 1) & (a < 4)            # correct
b2 = (a > 1) and (a < 4)          # would throw the error
b3 = np.logical_and(a > 1, a < 4) # correct

According to this StackOverflow post, when `and` is used to evaluate two boolean arrays (it still works fine for two boolean scalars), there is no single sensible result to return, so NumPy chooses to throw a ValueError in this case.

The operators & and |, on the other hand, are defined as bitwise operators on two arrays, so they work as expected here.

NumPy also provides element-wise functions for this purpose: logical_and (computes the truth value of x1 AND x2 element-wise) and logical_or (computes the truth value of x1 OR x2 element-wise).

A similar set of functions are the element-wise bitwise functions, such as bitwise_and (computes the bit-wise AND of two arrays element-wise, operating on the underlying binary representation of the integers in the input arrays) and bitwise_or.

So for boolean arrays the effects of logical_and and bitwise_and are the same, while bitwise_and can also be applied to integer arrays.
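As a quick illustration of that last point (the toy arrays here are made up for this post): on boolean inputs the two functions agree, but on integer inputs logical_and treats any nonzero value as True, while bitwise_and works on the binary representation.

```python
import numpy as np

# On boolean arrays, logical_and and bitwise_and give the same result
b1 = np.array([True, True, False])
b2 = np.array([True, False, False])
print(np.logical_and(b1, b2))  # [ True False False]
print(np.bitwise_and(b1, b2))  # [ True False False]

# On integer arrays they differ: logical_and treats any nonzero value
# as True, while bitwise_and ANDs the bits of each pair of integers
i1 = np.array([1, 2, 4])
i2 = np.array([1, 1, 4])
print(np.logical_and(i1, i2))  # [ True  True  True]
print(np.bitwise_and(i1, i2))  # [1 0 4]  (2 & 1 == 0 in binary)
```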

Boolean Indexing

An example of predicting yhat in a classification problem:

yhat = np.dot(S[:, 0: -1], w)
yhat[yhat <= 0] = -1
yhat[yhat > 0] = 1
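The same thresholding can also be written without in-place masking; a small sketch using np.where (the scores array here is made up, and this is a stylistic alternative rather than the original code):

```python
import numpy as np

scores = np.array([-0.5, 0.0, 1.2, 2.0])  # hypothetical raw scores
# np.where picks -1 where the condition holds and +1 elsewhere;
# unlike np.sign, it maps an exact 0 to -1, matching yhat[yhat <= 0] = -1
yhat = np.where(scores <= 0, -1, 1)
print(yhat)  # [-1 -1  1  1]
```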

An example of calculating a confusion matrix:

tp = y_yhat[(y_yhat[:, 0] == 1) & (y_yhat[:, 1] == 1)].shape[0]
fp = y_yhat[(y_yhat[:, 0] == -1) & (y_yhat[:, 1] == 1)].shape[0]
fn = y_yhat[(y_yhat[:, 0] == 1) & (y_yhat[:, 1] == -1)].shape[0]
tn = y_yhat[(y_yhat[:, 0] == -1) & (y_yhat[:, 1] == -1)].shape[0]
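Since a True counts as 1, the same counts can also be taken with np.sum on the boolean masks directly, without materializing the filtered rows. A sketch with a made-up y_yhat (true label in column 0, prediction in column 1, as in the snippet above):

```python
import numpy as np

# hypothetical array: column 0 = true label y, column 1 = prediction yhat
y_yhat = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1], [1, 1]])
tp = np.sum((y_yhat[:, 0] == 1) & (y_yhat[:, 1] == 1))    # 2
fp = np.sum((y_yhat[:, 0] == -1) & (y_yhat[:, 1] == 1))   # 1
fn = np.sum((y_yhat[:, 0] == 1) & (y_yhat[:, 1] == -1))   # 1
tn = np.sum((y_yhat[:, 0] == -1) & (y_yhat[:, 1] == -1))  # 1
print(tp, fp, fn, tn)  # the four counts sum to the number of rows
```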

The results of yhat <= 0 and yhat > 0 are arrays of boolean values.

From the manual:

Boolean arrays used as indices are treated in a different manner entirely than index arrays. Boolean arrays must be of the same shape as the initial dimensions of the array being indexed.

The result is a 1-D array containing all the elements in the indexed array corresponding to all the true elements in the boolean array. The elements in the indexed array are always iterated and returned in row-major (C-style) order.

The result will be multidimensional if y has more dimensions than b. When the boolean array has fewer dimensions than the array being indexed, this is equivalent to y[b, …], which means y is indexed by b followed by as many : as are needed to fill out the rank of y. Thus the shape of the result is one dimension containing the number of True elements of the boolean array, followed by the remaining dimensions of the array being indexed.
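A minimal sketch of that last rule (array contents chosen for this post): indexing a 2-D array with a 1-D boolean mask selects whole rows, and the remaining dimension is appended to the result's shape.

```python
import numpy as np

y = np.arange(12).reshape(4, 3)
b = np.array([True, False, True, False])  # shape (4,), matches y's first axis
print(y[b])        # rows 0 and 2: [[0 1 2] [6 7 8]]
print(y[b].shape)  # (2, 3): number of True elements, then remaining dims of y
```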

numpy.dot()

numpy.dot() performs the dot product of two NumPy arrays.

If both arrays are 1-D, it is the inner product: the two inputs must have the same length, and the output is a scalar.

a1 = np.array([0, 1, 2])
a2 = np.array([1, 2, 3])
a = np.dot(a1, a2) # 8

If both arrays are 2-D, it is matrix multiplication: dimension 1 of the first array must equal dimension 0 of the second array.

a1 = np.array([[0, 1, 2], [2, 3, 4]]) # shape (2, 3)
a2 = np.array([[1, 2, 3], [3, 4, 5]]) # shape (2, 3)
a = np.dot(a1, a2.T)                  # (2, 3) x (3, 2) -> (2, 2)
print(a)                              # [[ 8 14] [20 38]]

Python function

Python functions can return multiple values; parameters are passed by reference.

To calculate the ROC curve, I needed to shift the initial weight vector w in parallel (by adding different offsets to it). At first I found that after I ran the function, the w vector passed in was also changed. I then modified the code as follows, which solved the problem:

w_para = np.array(w)  # makes a copy, so the caller's w is untouched
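A minimal sketch of what was going on (the function and variable names here are made up): NumPy arrays are passed by reference, so in-place changes inside a function are visible to the caller unless a copy is made first; np.array(w) and w.copy() both create one.

```python
import numpy as np

def shift_inplace(w, offset):
    w += offset           # in-place: mutates the caller's array

def shift_copy(w, offset):
    w_para = np.array(w)  # independent copy (w.copy() works too)
    w_para += offset
    return w_para

w = np.zeros(3)
shift_inplace(w, 1.0)
print(w)                  # [1. 1. 1.] -- the caller's w changed

w2 = np.zeros(3)
shifted = shift_copy(w2, 1.0)
print(w2)                 # [0. 0. 0.] -- unchanged
print(shifted)            # [1. 1. 1.]
```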

numpy way in Winnow algorithm

The following two code snippets implement the same computation, but their running speeds are very different:

# numpy way
exponent_p = eta * S[i, -1] * S[i, 0: p]
wp = np.multiply(wp, np.exp(exponent_p))
exponent_n = (-1) * eta * S[i, -1] * S[i, 0: p]
wn = np.multiply(wn, np.exp(exponent_n))
s = np.sum(wp) + np.sum(wn)
wp = wp / s
wn = wn / s
# normal python loop
s = 0
for j in range(0, p):
    wp[0, j] = wp[0, j] * math.exp(eta * S[i, -1] * S[i, j])
    wn[0, j] = wn[0, j] * math.exp(-1 * eta * S[i, -1] * S[i, j])
    s += wp[0, j]
    s += wn[0, j]
wp[:] = wp[:] / s
wn[:] = wn[:] / s

The function numpy.multiply() multiplies two input arrays element-wise.

The function numpy.exp() calculates the exponential of all input elements (also element-wise).
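The speed difference can be checked with timeit; a rough sketch on made-up data (the array sizes and the simplified update rule here are assumptions, and exact timings depend on the machine, but the vectorized version is typically orders of magnitude faster):

```python
import math
import timeit
import numpy as np

# made-up data standing in for a Winnow-style weight update
w = np.random.rand(10000)
x = np.random.rand(10000)

def vectorized():
    # whole-array multiply and exp, all done in C
    return w * np.exp(0.1 * x)

def python_loop():
    # same arithmetic, one element at a time in interpreted Python
    out = np.empty_like(w)
    for j in range(w.shape[0]):
        out[j] = w[j] * math.exp(0.1 * x[j])
    return out

print("numpy way:  ", timeit.timeit(vectorized, number=100))
print("python loop:", timeit.timeit(python_loop, number=100))
```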

numpy 1D and 2D arrays

a = np.ones(5)
b = np.ones([5, 1])
print(a.ndim) # 1
print(b.ndim) # 2
print(a) # [1. 1. 1. 1. 1.]
print(b)
# [[1.]
# [1.]
# [1.]
# [1.]
# [1.]]
print(a.shape) # (5,)
print(b.shape) # (5, 1)

Both a and b look exactly the same (both are 5×1 vectors) in terms of real-world linear algebra, but to NumPy they are actually different.

And this difference in shape can lead to different calculation results. Whenever we want to define a "vector" in NumPy, we need to think carefully about its dimension and shape.

One bug I encountered related to this issue was in computing the model prediction error (mean squared error), and it took me a really long time to fix!

# inconsistent dimensions between 2 arrays
yhat = np.ones(100)
y = np.zeros([100, 1])
error = np.sum((yhat - y) ** 2) / 100 # 100, which is WRONG!

yhat = np.ones(100)
y = np.zeros(100)
error = np.sum((yhat - y) ** 2) / 100 # 1, correct

yhat = np.ones([100, 1])
y = np.zeros([100, 1])
error = np.sum((yhat - y) ** 2) / 100 # 1, correct
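The wrong answer in the first case comes from broadcasting: subtracting a (100,) array from a (100, 1) array broadcasts both to (100, 100), so the sum of squares has 100×100 terms instead of 100. A smaller sketch of the same effect:

```python
import numpy as np

yhat = np.ones(5)     # shape (5,)
y = np.zeros([5, 1])  # shape (5, 1)
diff = yhat - y
# (5,) is broadcast against (5, 1), giving a (5, 5) matrix of ones,
# so the sum of squares counts 25 terms instead of 5
print(diff.shape)             # (5, 5)
print(np.sum(diff ** 2) / 5)  # 5.0, not the expected 1.0
```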

numpy bootstrap sampling

Definition of bootstrap sampling: random sampling with replacement

np.random.seed(111) # set seed for each forest
for i in range(M): # each tree
    # generate index, which can be applied to both X and y later
    n_idx = np.random.choice(n, B, replace = False) # right now without replacement
    y_sample = y[n_idx, :]
    X_sample = X[n_idx, :]
    n_oobidx = list(set(range(n)) - set(n_idx))
    y_oob = y[n_oobidx, :]
    X_oob = X[n_oobidx, :]
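For a true bootstrap sample, replace=False would become replace=True, and the out-of-bag set is then whatever rows the draw happened to miss. A minimal sketch with made-up sizes n and B and toy data:

```python
import numpy as np

np.random.seed(111)
n, B = 10, 10                       # hypothetical sizes: n rows, B draws
X = np.arange(n * 2).reshape(n, 2)  # toy data

idx = np.random.choice(n, B, replace=True)  # sampling WITH replacement
oob_idx = np.setdiff1d(np.arange(n), idx)   # rows never drawn = out-of-bag
X_sample = X[idx, :]                        # may contain duplicate rows
X_oob = X[oob_idx, :]
print(X_sample.shape)                       # (10, 2)
print(len(np.unique(idx)) + len(oob_idx))   # 10: drawn + OOB cover all rows
```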

numpy OOB permuted error

X_oob = X[n_oobidx, :]
X_oob = X_oob[:, best]
np.random.shuffle(X_oob) # X_oob[:, best] is 1-D here, so all its elements are shuffled

The np.random.shuffle(x) function modifies a sequence in-place by shuffling its contents. It has no return value.

This function only shuffles the array along the first axis of a multi-dimensional array. The order of the sub-arrays is changed, but their contents remain the same.
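A small sketch of both behaviors on a toy array: shuffling a 2-D array only reorders its rows, and a single column can be permuted by assigning np.random.permutation of that column back (the latter is an alternative approach for permuting one feature, assumed here rather than taken from the code above):

```python
import numpy as np

np.random.seed(0)
x = np.arange(9).reshape(3, 3)
np.random.shuffle(x)  # in-place, returns None; rows are reordered,
print(x)              # but each row (e.g. [0 1 2]) survives intact

# to permute one column independently of the others, assign a
# permutation of that column back into the array
x[:, 1] = np.random.permutation(x[:, 1])
```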

Docker

  • Why is it so popular?
    It makes shipping code to servers easier. With multiple pieces of software on multiple machines, it is a pain to set up the environment and dependencies at every intersection. It is like the traditional shipping industry: goods used to be sent by different means of transportation, and at the beginning each transportation company needed its own experts to deal with the packing and shipping details. Then the standardized shipping container became a solution that all shipping companies agreed on, separating what goes into a container from how containers are moved. Shippers only need to care about how to put goods into containers, while a variety of infrastructure providers can handle the containers themselves. New infrastructure tools can later be added or updated without repackaging the containers. This standardization makes shipping easier and cheaper.
    For software, it is the same: developers have a standard way to pack their applications into a container with standard properties. They then hand containers to tool makers/Ops teams/infrastructure providers, and these parties all know how to handle the containers in a uniform way.