Vectors and Matrices

Explaining the concepts of vectors and matrices, and how they help in Data Science.
Published February 9, 2022

Vectors

Vectors are ordered arrays of numbers. The elements of a vector are all the same type. A vector does not, for example, contain both characters and numbers. The number of elements in the array is often referred to as the dimension (some texts say rank, although in linear algebra rank usually names a different property, of matrices). The elements of a vector can be referenced with an index. In math settings, indexes typically run from 1 to n. In computer science, indexing typically runs from 0 to n-1.

import numpy as np
import time

NumPy Arrays

NumPy’s basic data structure is an indexable, n-dimensional array containing elements of the same type (dtype). Above, dimension was the number of elements in the vector; here, dimension refers to the number of indexes of an array. A one-dimensional or 1-D array has one index.

  • 1-D array, shape (n,): n elements indexed [0] through [n-1]
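
To make the two meanings of "dimension" concrete, the short sketch below inspects a 1-D array. The variable name v is only illustrative.

```python
import numpy as np

v = np.zeros(4)   # illustrative 1-D array with four elements

print(v.ndim)     # number of indexes (axes) needed to address one element
print(v.shape)    # length along each axis
print(v.size)     # total number of elements
```

Here ndim is the NumPy sense of dimension (one index), while size is the vector sense (four elements).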

Vector Creation

Data creation routines in NumPy will generally have a first parameter which is the shape of the object. This can either be a single value for a 1-D result or a tuple (n,m,…) specifying the shape of the result.

# NumPy routines which allocate memory and fill arrays with values
a = np.zeros(4);                print(f"np.zeros(4) :   a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
a = np.zeros((4,));             print(f"np.zeros((4,)): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
a = np.random.random_sample(4); print(f"np.random.random_sample(4): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
np.zeros(4) :   a = [0. 0. 0. 0.], a shape = (4,), a data type = float64
np.zeros((4,)): a = [0. 0. 0. 0.], a shape = (4,), a data type = float64
np.random.random_sample(4): a = [0.60191145 0.65435818 0.27028017 0.69083142], a shape = (4,), a data type = float64

Some data creation routines do not take a shape tuple.

# NumPy routines which allocate memory and fill arrays with values but do not accept a shape tuple as an input argument
a = np.arange(4.);              print(f"np.arange(4.):     a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
a = np.random.rand(4);          print(f"np.random.rand(4): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
np.arange(4.):     a = [0. 1. 2. 3.], a shape = (4,), a data type = float64
np.random.rand(4): a = [0.88573378 0.88693937 0.55170127 0.23140494], a shape = (4,), a data type = float64
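
Routines of this kind describe the result by its values rather than by a shape. As a small sketch, np.arange also accepts start, stop and step, and np.linspace takes a start, a stop and a count. The names evens and grid are illustrative.

```python
import numpy as np

# np.arange(start, stop, step): stop itself is excluded
evens = np.arange(2, 10, 2)

# np.linspace(start, stop, num): stop is included by default
grid = np.linspace(0.0, 1.0, 5)

print(evens, evens.shape)
print(grid, grid.shape)
```

In both cases the resulting shape follows from the arguments instead of being passed in directly.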

Values can be specified manually as well.

# NumPy routines which allocate memory and fill with user specified values
a = np.array([5,4,3,2]);  print(f"np.array([5,4,3,2]):  a = {a},     a shape = {a.shape}, a data type = {a.dtype}")
a = np.array([5.,4,3,2]); print(f"np.array([5.,4,3,2]): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
np.array([5,4,3,2]):  a = [5 4 3 2],     a shape = (4,), a data type = int32
np.array([5.,4,3,2]): a = [5. 4. 3. 2.], a shape = (4,), a data type = float64

These have all created a one-dimensional vector a with four elements. a.shape returns the dimensions. Here we see a.shape = (4,), indicating a 1-D array with 4 elements.
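
The default integer dtype can differ between platforms and NumPy versions (the int32 above is one possibility, int64 is another common result). When the type matters, it can be requested explicitly with the dtype argument, as in this minimal sketch. The names ints and floats are illustrative.

```python
import numpy as np

# Request the element type explicitly instead of relying on the default
ints = np.array([5, 4, 3, 2], dtype=np.int64)
floats = np.array([5, 4, 3, 2], dtype=np.float64)

print(ints, ints.dtype)
print(floats, floats.dtype)
```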

Operations on Vectors

Elements of vectors can be accessed via indexing and slicing.
Indexing means referring to an element of an array by its position within the array.
Slicing means getting a subset of elements from an array based on their indices.
NumPy starts indexing at zero so the 3rd element of a vector a is a[2].

Indexing

# vector indexing operations on 1-D vectors
a = np.arange(10)
print(a)

# access an element
print(f"a[2].shape: {a[2].shape} a[2]  = {a[2]}, Accessing an element returns a scalar")

# access the last element, negative indexes count from the end
print(f"a[-1] = {a[-1]}")

# indexes must be within the range of the vector or they will produce an error
try:
    c = a[10]
except Exception as e:
    print("The error message you'll see is:")
    print(e)
[0 1 2 3 4 5 6 7 8 9]
a[2].shape: () a[2]  = 2, Accessing an element returns a scalar
a[-1] = 9
The error message you'll see is:
index 10 is out of bounds for axis 0 with size 10

Slicing

Slicing selects a range of indices using up to three values (start:stop:step); the element at index stop is not included. Any of the three values may be omitted.

# vector slicing operations
a = np.arange(10)
print(f"a        =  {a}")

# access 5 consecutive elements (start:stop:step)
c = a[2:7:1];     print("a[2:7:1] = ", c)

# access 3 elements separated by two 
c = a[2:7:2];     print("a[2:7:2] = ", c)

# access all elements index 3 and above
c = a[3:];        print("a[3:]    = ", c)

# access all elements below index 3
c = a[:3];        print("a[:3]    = ", c)

# access all elements
c = a[:];         print("a[:]     = ", c)
a        =  [0 1 2 3 4 5 6 7 8 9]
a[2:7:1] =  [2 3 4 5 6]
a[2:7:2] =  [2 4 6]
a[3:]    =  [3 4 5 6 7 8 9]
a[:3]    =  [0 1 2]
a[:]     =  [0 1 2 3 4 5 6 7 8 9]
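
One detail worth knowing when slicing: a basic slice of a NumPy array is a view that shares memory with the original, so writing through the slice changes the original too. The sketch below illustrates this; the names view and independent are illustrative.

```python
import numpy as np

a = np.arange(10)

view = a[2:5]                 # basic slice: shares memory with a
view[0] = 99                  # also changes a[2]

independent = a[2:5].copy()   # an explicit copy owns its own data
independent[1] = -1           # a is unaffected by this write

print(a)
print(np.shares_memory(a, view), np.shares_memory(a, independent))
```

Calling .copy() is the usual way to get a subset that can be modified without touching the source array.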

Single Vector Operations

A number of useful operations act on a single vector.

a = np.array([1,2,3,4])
print(f"a             : {a}")
# negate elements of a
b = -a 
print(f"b = -a        : {b}")

# sum all elements of a, returns a scalar
b = np.sum(a) 
print(f"b = np.sum(a) : {b}")

# get the average of elements of a
b = np.mean(a)
print(f"b = np.mean(a): {b}")

# square elements of a
b = a**2
print(f"b = a**2      : {b}")
a             : [1 2 3 4]
b = -a        : [-1 -2 -3 -4]
b = np.sum(a) : 10
b = np.mean(a): 2.5
b = a**2      : [ 1  4  9 16]

Vector-Vector Element-wise Operations

Most of the NumPy arithmetic, logical and comparison operations apply to vectors. These operators work on an element-by-element basis. For example \[ c_i = a_i + b_i \]

a = np.array([ 1, 2, 3, 4])
b = np.array([-1,-2, 3, 4])
print(f"Binary operators work element wise: {a + b}")
Binary operators work element wise: [0 0 6 8]

Of course, for this to work correctly, the vectors must be of the same size.

# try a mismatched vector operation
c = np.array([1, 2])
try:
    d = a + c
except Exception as e:
    print("The error message you'll see is:")
    print(e)
The error message you'll see is:
operands could not be broadcast together with shapes (4,) (2,) 
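
Addition is only one of the element-wise operators. As a brief sketch with the same two vectors, multiplication, comparison and logical operations follow the same element-by-element rule. The result names are illustrative.

```python
import numpy as np

a = np.array([ 1,  2, 3, 4])
b = np.array([-1, -2, 3, 4])

product = a * b                                # element-wise multiplication
equal = a == b                                 # element-wise comparison, boolean array
both_positive = np.logical_and(a > 0, b > 0)   # element-wise logical AND

print(product)
print(equal)
print(both_positive)
```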

Scalar-Vector Operations

Vectors can be ‘scaled’ by scalar values. A scalar value is just a number. The scalar multiplies all the elements of the vector.

a = np.array([1, 2, 3, 4])

# multiply a by a scalar
b = 5 * a
print(f"b = 5 * a : {b}")
b = 5 * a : [ 5 10 15 20]

Vector-Vector Dot Product

The dot product multiplies the values in two vectors element-wise and then sums the result. Vector dot product requires the dimensions of the two vectors to be the same.

Let’s implement our own version of the dot product below:

Using a for loop, implement a function which returns the dot product of two vectors. Given inputs \(a\) and \(b\), the function should return: \[ x = \sum_{i=0}^{n-1} a_i b_i \] Assume both a and b are the same shape.

def my_dot(a, b):
    """
    Compute the dot product of two vectors

    Args:
      a (ndarray (n,)):  input vector
      b (ndarray (n,)):  input vector with same dimension as a

    Returns:
      x (scalar): dot product of a and b
    """
    x = 0
    for i in range(a.shape[0]):
        x = x + a[i] * b[i]
    return x
# test 1-D
a = np.array([1, 2, 3, 4])
print(f"a : {a}")
b = np.array([-1, 4, 3, 2])
print(f"b : {b}")
print(f"my_dot(a, b) = {my_dot(a, b)}")
a : [1 2 3 4]
b : [-1  4  3  2]
my_dot(a, b) = 24

Note, the dot product is expected to return a scalar value.

Let’s try the same operations using np.dot.

# test 1-D
a = np.array([1, 2, 3, 4])
b = np.array([-1, 4, 3, 2])
c = np.dot(a, b)
print(f"NumPy 1-D np.dot(a, b) = {c}, np.dot(a, b).shape = {c.shape} ") 
c = np.dot(b, a)
print(f"NumPy 1-D np.dot(b, a) = {c}, np.dot(b, a).shape = {c.shape} ")
NumPy 1-D np.dot(a, b) = 24, np.dot(a, b).shape = () 
NumPy 1-D np.dot(b, a) = 24, np.dot(b, a).shape = () 

Above, the results for 1-D matched our implementation.
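
For these vectors the sum works out as 1·(-1) + 2·4 + 3·3 + 4·2 = -1 + 8 + 9 + 8 = 24. As a small sketch, the same "multiply element-wise, then sum" definition can also be written with the operations shown earlier, or with the @ operator. The names via_sum and via_matmul are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([-1, 4, 3, 2])

via_sum = np.sum(a * b)   # multiply element-wise, then sum the products
via_matmul = a @ b        # @ on two 1-D arrays gives their inner product

print(via_sum, via_matmul)
```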

Vectors vs. For Loop

The NumPy library improves speed and memory efficiency compared to using a for loop for the same task. The comparison below times np.dot against the my_dot loop on two very large arrays.

np.random.seed(1)
a = np.random.rand(10000000)  # very large arrays
b = np.random.rand(10000000)

tic = time.time()  # capture start time
c = np.dot(a, b)
toc = time.time()  # capture end time

print(f"np.dot(a, b) =  {c:.4f}")
print(f"Vectorized version duration: {1000*(toc-tic):.4f} ms ")

tic = time.time()  # capture start time
c = my_dot(a,b)
toc = time.time()  # capture end time

print(f"my_dot(a, b) =  {c:.4f}")
print(f"loop version duration: {1000*(toc-tic):.4f} ms ")

del a, b  # remove these big arrays from memory
np.dot(a, b) =  2501072.5817
Vectorized version duration: 69.2592 ms 
my_dot(a, b) =  2501072.5817
loop version duration: 5942.2548 ms 

So, vectorization provides a large speedup in this example (roughly 85 times faster in the run shown). This is partly because np.dot runs in compiled code rather than pushing each multiply-add through the Python interpreter, and partly because NumPy makes better use of the data parallelism available in the underlying hardware. GPUs and modern CPUs implement Single Instruction, Multiple Data (SIMD) pipelines, allowing multiple operations to be issued in parallel. This is critical in machine learning, where data sets are often very large.
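
Timings like these vary between machines and runs. A lighter variant of the experiment, sketched below, uses time.perf_counter (a clock intended for measuring short intervals) and smaller arrays so it finishes quickly. The helper loop_dot and the array size are assumptions made for this sketch, not part of the comparison above.

```python
import time
import numpy as np

rng = np.random.default_rng(1)   # seeded generator, for repeatable inputs
a = rng.random(100_000)          # much smaller than the arrays used above
b = rng.random(100_000)

def loop_dot(x, y):
    """Hypothetical helper: dot product with an explicit Python loop."""
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i] * y[i]
    return total

start = time.perf_counter()
fast = np.dot(a, b)
vectorized_ms = 1000 * (time.perf_counter() - start)

start = time.perf_counter()
slow = loop_dot(a, b)
loop_ms = 1000 * (time.perf_counter() - start)

print(f"vectorized: {vectorized_ms:.4f} ms, loop: {loop_ms:.4f} ms")
print(f"results agree: {np.isclose(fast, slow)}")
```

The two results are compared with np.isclose because the loop and np.dot may add the products in a different order, which can change the last few floating-point digits.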


Matrices

Matrices are two-dimensional arrays. The elements of a matrix are all of the same type. In notation, matrices are denoted with a capital, bold letter such as \(\mathbf{X}\). By convention, m is the number of rows and n the number of columns. The elements of a matrix can be referenced with a two-dimensional index. In math settings, the row index typically runs from 1 to m and the column index from 1 to n. In computer science, and in this tutorial, they run from 0 to m-1 and from 0 to n-1.

NumPy Arrays

NumPy’s basic data structure is an indexable, n-dimensional array containing elements of the same type (dtype). Matrices have a two-dimensional (2-D) index [m,n].

In machine learning, 2-D matrices are used to hold training data. Training data is \(m\) examples by \(n\) features creating an (m,n) array.
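
As a small illustration of that layout, the sketch below builds a made-up training matrix with m = 3 examples and n = 2 features. The numbers are invented purely for the example.

```python
import numpy as np

# Made-up data: each row is one example, each column is one feature
X = np.array([[2104.0, 3.0],
              [1416.0, 2.0],
              [ 852.0, 1.0]])

m, n = X.shape              # m examples, n features
first_example = X[0]        # all features of example 0, shape (n,)
second_feature = X[:, 1]    # feature 1 across all examples, shape (m,)

print(m, n)
print(first_example, second_feature)
```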

Matrix Creation

The same function that created 1-D vectors will create 2-D or n-D arrays.
Below, the shape tuple is provided to achieve a 2-D result. Notice how NumPy uses brackets to denote each dimension. Notice further that NumPy, when printing, will print one row per line.

a = np.zeros((1, 5))                                       
print(f"a shape = {a.shape}, a = {a}")                     

a = np.zeros((2, 1))                                                                   
print(f"a shape = {a.shape}, a = {a}") 

a = np.random.random_sample((1, 1))  
print(f"a shape = {a.shape}, a = {a}") 
a shape = (1, 5), a = [[0. 0. 0. 0. 0.]]
a shape = (2, 1), a = [[0.]
 [0.]]
a shape = (1, 1), a = [[0.44236513]]

One can also manually specify data. Dimensions are specified with additional brackets matching the format in the printing above.

# NumPy routines which allocate memory and fill with user specified values
a = np.array([[5], [4], [3]]);   
print(f" a shape = {a.shape}, np.array: a = {a}")

a = np.array([[5],   # One can also
              [4],   # separate values
              [3]]); #into separate rows
print(f" a shape = {a.shape}, np.array: a = {a}")
 a shape = (3, 1), np.array: a = [[5]
 [4]
 [3]]
 a shape = (3, 1), np.array: a = [[5]
 [4]
 [3]]

Operations on Matrices

Indexing

Matrices include a second index. The two indexes describe [row, column]. Access can either return an element or a row/column.

# vector indexing operations on matrices
a = np.arange(6).reshape(-1, 2)   #reshape is a convenient way to create matrices
print(f"a.shape: {a.shape}, \na= {a}")

# access an element
print(f"\na[2,0].shape: {a[2, 0].shape},   a[2,0] = {a[2, 0]},     type(a[2,0]) = {type(a[2, 0])} \nAccessing an element returns a scalar\n")

# access a row
print(f"a[2].shape: {a[2].shape},   a[2]   = {a[2]}, type(a[2])   = {type(a[2])} \nAccessing a matrix by just specifying the row will return a 1-D vector")
a.shape: (3, 2), 
a= [[0 1]
 [2 3]
 [4 5]]

a[2,0].shape: (),   a[2,0] = 4,     type(a[2,0]) = <class 'numpy.int32'> 
Accessing an element returns a scalar

a[2].shape: (2,),   a[2]   = [4 5], type(a[2])   = <class 'numpy.ndarray'> 
Accessing a matrix by just specifying the row will return a 1-D vector
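
The example above returns an element and a row. Column access, the remaining case, can be sketched with a full slice on the row axis. The name col is illustrative.

```python
import numpy as np

a = np.arange(6).reshape(-1, 2)   # -1 lets NumPy infer the row count (3 here)

col = a[:, 0]   # every row, column 0 -> 1-D array
print(f"a[:, 0].shape: {col.shape}, a[:, 0] = {col}")
```

As with a row, selecting a single column with an integer index returns a 1-D vector.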

Slicing

As with vectors, slicing selects a range of indices using up to three values (start:stop:step), and any of the three may be omitted. For a matrix, one slice can be given per axis.

# vector 2-D slicing operations
a = np.arange(20).reshape(-1, 10)
print(f"a = \n{a}")

# access 5 consecutive elements (start:stop:step)
print("a[0, 2:7:1] = ", a[0, 2:7:1], ",  a[0, 2:7:1].shape =", a[0, 2:7:1].shape, "a 1-D array")

# access 5 consecutive elements (start:stop:step) in two rows
print("a[:, 2:7:1] = \n", a[:, 2:7:1], ",  a[:, 2:7:1].shape =", a[:, 2:7:1].shape, "a 2-D array")

# access all elements
print("a[:,:] = \n", a[:,:], ",  a[:,:].shape =", a[:,:].shape)

# access all elements in one row (very common usage)
print("a[1,:] = ", a[1,:], ",  a[1,:].shape =", a[1,:].shape, "a 1-D array")
# same as
print("a[1]   = ", a[1],   ",  a[1].shape   =", a[1].shape, "a 1-D array")
a = 
[[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]]
a[0, 2:7:1] =  [2 3 4 5 6] ,  a[0, 2:7:1].shape = (5,) a 1-D array
a[:, 2:7:1] = 
 [[ 2  3  4  5  6]
 [12 13 14 15 16]] ,  a[:, 2:7:1].shape = (2, 5) a 2-D array
a[:,:] = 
 [[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]] ,  a[:,:].shape = (2, 10)
a[1,:] =  [10 11 12 13 14 15 16 17 18 19] ,  a[1,:].shape = (10,) a 1-D array
a[1]   =  [10 11 12 13 14 15 16 17 18 19] ,  a[1].shape   = (10,) a 1-D array
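
The shapes above hint at a rule that is easy to miss: an integer index removes its axis, while a slice keeps it. A minimal sketch on the same array follows; the result names are illustrative.

```python
import numpy as np

a = np.arange(20).reshape(-1, 10)

row_1d = a[1, :]          # integer row index drops the row axis
row_2d = a[1:2, :]        # a one-row slice keeps the array 2-D
every_other = a[:, ::2]   # every second column, from both rows

print(row_1d.shape, row_2d.shape, every_other.shape)
print(every_other)
```

Using a slice such as 1:2 is handy when later code expects a 2-D array even for a single row.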