by Erich Owens

*Optimizing the memory footprint of a classifier used here at Newsle set us down a rabbit hole of rewriting a basic Scipy function with Cython, something that only became a problem when our high-dimensional text spaces grew to a cartoonish size, thanks to the hashing trick. Here I motivate the use of the hashing trick, how we use sparse matrix-vector multiplication for text classification, and how we derived and wrote the new implementation.*

**The Hashing Trick: Why, How, What**

In many information retrieval, natural language processing, or machine learning contexts, it is standard to work with a large spaces of words/n-grams, sometimes on the order of a few million unique observations. This bag-of-words model treats a document as a point in a high-dimensional space; a sparse vector in which the non-negative entries line up with the terms that document has.

Creating this corpus model is typically done in a stateful, incremental way, where terms are assigned a coordinate in the order in which that word is seen. For instance, the sentence “The dog is a nice dog, but cats aren’t so nice.” might be tokenized, filtered for stopwords, counted, and embedded into its own space as the vector where the ordered dimensions correspond to {dog, nice, cat}.

Keeping this ordering around in a production environment (usually in the form of a hash table or dictionary with constant lookup time) can have a huge memory footprint. This also requires that we process the corpus in serial, or share state across processes, complicating parallelization. Enter the hash trick.

Rather than derive the feature coordinates from the order seen, we can generate a column index for a given string with a hash function, mapping to enough (prime-numbered) buckets to avoid collision for a corpus with even a million unique terms. For this exercise, we’ll use MurmurHash3 (MMH3.) Choosing a 32-bit hash, our domain of possible column incides is now up to 2^{31} – 1 (modding out the sign to ensure we have sensible column indices. This is also, incidentally, the eighth Mersenne prime!) On a corpus of nearly 1M Newsle industry bigrams (explained below), we encounter just 17 collisions, so we’ll call this “good enough.”

**Hashing Trick in Practice: Scipy Sparse and Text Classifiers **

Let’s try a simple classification task. We have a corpus of bigrams used across news articles flagged as representing one of nearly two hundred industries (e.g., banking, software, management consulting, fisheries), with each industry compactly represented by the centroid of its many representative articles. We could then assert that a new query document is described by whichever centroid is closest to it. We’ll define closeness by use of cosine similarity. The underlying space can then be thought of as a Voronoi tessellation shattered into a few hundred pieces (each corresponding to a pane centered on a centroid vector), and this method may often be called a “nearest-centroid” or “1NN” classifier.

Classification is then as simple as erecting a matrix representation of the corpus, called , (with rows as industry centroid vectors), l_{2}-normalizing its rows, and constructing a vector of the new document, called , with the same method of (TF-IDF) weighting and with unit l_{2}-norm. Computing cosine similarity against every category is then just computing the matrix-vector multiplication

.

Arg-maxing on will yield the closest industry, and our prediction. Let’s begin!

Let’s take our article to be a recent New York Times piece on BlackBerry’s announcement of its new phones and operating system. Without feature-hashing, we’d construct our vector by computing TF-IDF weights on tokenized terms by referencing a dictionary for both the index and inverse-document frequency (IDF) lookup. Here we use Python’s popular Scipy Sparse module, a library allows us to efficiently store matrices with mostly zero entries across a variety of sparse data representations, while supporting many important linear algebraic methods (e.g., back-substitutions, factorizations, and eigendecompositions by the Lanczos algorithm).

from scipy import sparse from math import sqrt tfs = nlp.tokenize_and_count(article) seen_tfs = filter(lambda t:t in feature_lookup, tfs) rows = [feature_lookup[term] for term in seen_tfs.iterkeys()] columns = [0 for _ in xrange(len(rows))] tfidfs = [tf * idf[term] for term, tf in seen_tfs.iteritems()] norm = sqrt(sum(w**2 for w in tfidfs)) normed_tfidfs = [w / norm for w in tfidfs] v = sparse.csc_matrix((normed_tfidfs, (rows, columns)), shape = (M.shape[0], 1)) r = M * v classification = industries[r.toarray().argmax()]

Here our classifier labels the Times article to be about “Wireless.” The next closest guesses were “Telecommunications” and “Consumer Electronics”. Pretty good, classifier! Let’s now assume we wanted this script to run across many child processes, the memory footprint as small as possible on each, and we’d like to **use the hashing trick instead of a dictionary for the feature lookup**.

Sparse CSR (compressed sparse row) data structures will play nice if the column indices can fit into a numpy.int32 data type, so a signed 32-bit hash function will work well here. We merely need to replace the matrix with , for which the column indices are now in instead of The data structure is the same size in memory as before, but now we need only have a hash function on hand instead of an additional hash table**.** Let’s make the new vector the same as before, this time drawing on a word hash, and pulling the IDFs from a C++-implemented MARISA trie. So our naive implementation of this procedure might be:

import mmh3 from math import sqrt, log MAX_INT = 2147483647 def word_hash(s): return (mmh3.hash(s) % MAX_INT) def calc_tfidf(tf, df, N = 200): return tf * log(N / (1.0 + df)) tfs = newsle_nlp.tokenize_and_count(article) hashed_tfs = {word_hash(term):tf for term, tf in tfs.iteritems()} rows = hashed_tfs.keys() columns = [0 for _ in xrange(len(rows))] tfidfs = [calc_tfidf(tf, df = df_trie.get(h_term,[(0]])[0][0]) for h_term, tf in hashed_tfs.iteritems()] norm = sqrt(sum(w**2 for w in tfidfs)) normed_tfidfs = [w / norm for w in tfidfs] v_hashed = sparse.csc_matrix((normed_tfidfs, (rows, columns)), shape = (MAX_INT, 1)) r = M_hashed * v_hashed classification = industries[r.toarray().argmax()]

**This code will fail!** The problem is in the penultimate line. It was natural for us to structure the matrix `M_hashed`

as a “CSR” data structure (there is sparse data on every row, but not for every column) and the vector v_hashed as a “CSC” (Compressed Sparse Column, it’s a single column vector, but nearly 2^{31} of its rows are zero.) The problem is in how Scipy Sparse does multiplication– it requires both matrices in the operation `A*B`

to be either CSR or CSC, and will convert the second to the format of the first if this is not the case. **Forcing our sparse vector v_hashed to become CSR, however, will blow up our memory**.

To see why, let’s try a toy example. Consider the following vector in which has nonzero entries in only four places. One way of constructing this sparse vector is just by just passing in the ordered coordinates:

I, J, V = [2, 5, 6, 12], [0, 0, 0, 0], [0.1, 0.2, 0.3, 0.9] v = sparse.csc_matrix((V, (I, J)), shape = (15, 1))

The CSC vector we constructed is now uniquely determined by three core numpy arrays– `v.data`

, the nonzero values; `v.indices`

, the rows corresponding to the values in the previous array; and `v.indptr`

, a pointer to the values in the previous two arrays, telling us where the values and row-indices for the contents of the i^{th} column lie.

`>>> print v.todense().T`

[[ 0. 0. 0.1 0. 0. 0.2 0.3 0. 0. 0. 0. 0. 0.9 0. 0. 0. 0. ]]

>>> print v.data, v.indices, v.indptr

[0.1 0.2 0.3 0.9] [2 5 6 12] [0 4]

CSR is a very similar data structure as CSC, but its `indices`

array points to the nonzero value’s columns, and its `indptr`

has an entry whenever our data skips to the next array. For a column vector, you can imagine this to be redundant:

`>>>v_csr = v.tocsr()`

>>> print v_csr.todense().T

[[ 0. 0. 0.1 0. 0. 0.2 0.3 0. 0. 0. 0. 0. 0.9 0. 0. ]]

>>> print v_csr.data, v_csr.indices, v_csr.indptr

[ 0.1 0.2 0.3 0.9] [0 0 0 0] [0 0 0 1 1 1 2 3 3 3 3 3 3 4 4 4]

And so it is! Though this vector is the same mathematical object as it would be in a CSC representation, its row pointer is forced to have an entry for every (nominally) new row in the vector. This requires an entry for the length of the array, defeating the point of a sparse structure. Now you can imagine why our classification code above failed– coercing `v_hashed`

to CSR will construct a structure with an `indptr`

array that has nearly 2^{31} numpy long integers, and that’s going to cause a segfault.

So, we clearly need a CSR times CSC multiplication method. Not finding one out there on the internet, we will write one of our own.

**Sparse CSR/CSC Multiplication in Python**

Computing this sparse matrix/vector product means we want to scan rows of our matrix and whenever it shares nonzero columns with the vector factor, we’ll multiply the result, and the sum of all such overlaps is the inner product on that dimension. Trying out this naive logic as Pythonically as possible:

from scipy import sparse import numpy as np def pythonic_mult(M, v): assert isinstance(M, sparse.csr.csr_matrix), "matrix M must be CSR format." assert isinstance(v, sparse.csc.csc_matrix), "vector v must be CSC format." assert M.shape[1] == v.shape[0], "Inner dimensions must Agree." assert v.shape[1] == 1, "v must be a column vector." v_indices = set(v.indices) v_values = dict(zip(v.indices, v.data)) checker = lambda j: j[0] in v_indices num_rows = M.shape[0] res = np.zeros(num_rows) for i in xrange(num_rows): a, b = M.indptr[i], M.indptr[i+1] M_indices_data = zip(M.indices[a:b], M.data[a:b]) matching = dict(filter(checker, M_indices_data)) ip = sum(val*v_values.get(k, 0.0) for k, val in M_indices_data) res[i] = ip return res

Testing this out…

`>>> res_1 = pythonic_mult(M_hash, v_hash)`

1 loops, best of 3: 5.48 s per loop

>>> print industries[res_1.argmax()]

Wireless

Well, **we** **got the right answer and didn’t segfault, but the running time it is just terrible**. Let’s start over. We know we’re only going to use the columns of `M_hash`

that are nonzero in the rows of `v_hash`

, so why don’t we just scan through the structure of `M_hash`

, throw away all that we don’t need, and multiply the result by the dense array left over in `v_hash`

? Let’s give that a try:

def pythonic_mult_2(M, v): assert isinstance(M, sparse.csr.csr_matrix), "matrix M must be CSR format." assert isinstance(v, sparse.csc.csc_matrix), "vector v must be CSC format." assert M.shape[1] == v.shape[0], "Inner dimensions must agree." assert v.shape[1] == 1, "v must be a column vector." kept_columns = {x:i for i,x in enumerate(v.indices)} x_values = dict(zip(v.indices, v.data)) checker = lambda j: j[0] in kept_columns num_rows = M.shape[0] indices, data, indptr = [], [], [0] for i in xrange(num_rows): a, b = M.indptr[i], M.indptr[i+1] for index, d in izip(M.indices[a:b], M.data[a:b]): if index in kept_columns: indices.append(kept_columns[index]) data.append(d) indptr.append(len(data)) red_mat = sparse.csr_matrix((np.array(data), np.array(indices), np.array(indptr)), shape = (M.shape[0], len(kept_columns))) return red_mat.dot(v.data)

>>> res_2 = pythonic_mult_2(M_hash, v_hash)

>>> %timeit res_2 = pythonic_mult_2(M_hash, v_hash)

1 loops, best of 3: 778 ms per loop

>>> print industries[res.argmax()]

Wireless

>>> print np.allclose(res_1, res_2)

True

Our answers agree, and this was substantially faster than the first try. But Scipy’s implementation in the unhashed space takes only 28 milliseconds on the same machine. Time to get closer to the metal.

**Sparse CSR/CSC Multiplication in Cython**

Cython’s a superset of Python that can compile to high-performance C from almost-standard Python with very little effort. This is my first attempt optimizing code with Cython, so I first wrote a matrix-vector multiplication as I might’ve in MATLAB or C– using pointers and flags. For a given row of a matrix, I’ll iterate over its `indices`

array, as well as the indices array of the vector. Assuming they’re sorted (something you’ll have to check when you pull these arrays from the Scipy Sparse structure), I’ll iterate one forward whenever the other’s ahead, and if their index ever agrees, I know the two arrays have an entry in common, and I’ll increment the inner product by that value. Simple logic flow, if a bit convoluted in its book-keeping:

import numpy as np from scipy import sparse def py_ptr_multiply(m_indptr, m_indices, m_data, v_indices, v_data): """ ASSUMPTION: CSR structure of input matrix has sorted indices. m_indptr, matrix's pointer to row start in indices/data m_indices, non-negative column indices for matrix m_data, non-negative data values for matrix v_indices, non-negative column indices for vector v_data, non-negative data values for vector """ M = m_indptr.shape[0] - 1 v_nnz = v_indices.shape[0] output_vector = np.empty(M) for count in range(M): inner_product = 0.0 v_pointer = 0 increase_v = 0 exhausted_v = 0 v_index = v_indices[v_pointer] row_start = m_indptr[count] row_end = m_indptr[count+1] for m_pointer in range(row_start, row_end): if exhausted_v == 1: exhausted_v = 0 break increase_m = 0 while increase_m == 0: if increase_v == 1: v_pointer = v_pointer + 1 if v_pointer >= v_nnz: exhausted_v = 1 break v_index = v_indices[v_pointer] increase_v = 0 col_index = m_indices[m_pointer] if col_index < v_index: increase_m = 1 continue elif col_index == v_index: inner_product = inner_product + m_data[m_pointer]*v_data[v_pointer] increase_v = 1 increase_m = 1 elif col_index > v_index: increase_v = 1 output_vector[count] = inner_product return output_vector

Testing this out, we see it’s a bit slower than our Pythonic attempts before:

>>> assert M_hash.has_sorted_indices, "M must have sorted indices along its rows."

>>> assert v_hash.has_sorted_indices, "v must have sorted indices along its column."

>>> m_indptr, m_indices, m_data = M_hash.indptr, M_hash.indices, M_hash.data

>>> v_indices, v_data = v_hash.indices, v_hash.data

>>> res_3 = py_ptr_multiply(m_indptr, m_indices, m_data, v_indices, v_data)

>>> print industries[res_3.argmax()]

Wireless

>>> %timeit res_3 = py_ptr_multiply(m_indptr, m_indices, m_data, v_indices, v_data)

1 loops, best of 3: 1.12 s per loop

Saving this as a separate file in the .pyx format, we can compile it without any changes, and then import it directly into a Python script or within the shell:

>>> from cy_ptr_multiply_1 import ptr_multiply_1

>>> res_4 = ptr_multiply_1(m_indptr, m_indices, m_data, v_indices, v_data)

>>> print industries[res_4.argmax()]

Wireless

>>> %timeit res_4 = ptr_multiply_1(m_indptr, m_indices, m_data, v_indices, v_data)

1 loops, best of 3: 512 ms per loop

**It’s twice as fast with no changes!** But let’s be smart. Cython has special support for NumPy arrays (by way of a ‘cimport numpy’), and we can declare types on our variables and arrays. We can additionally tell Cython to avoid checking for improper indices on arrays by way of a boundscheck decorator (which means you’ll need to check your input!):

import numpy as np from scipy import sparse cimport numpy as np cimport cython DTYPE_INT = np.int32 DTYPE_FLT = np.float64 ctypedef np.int32_t DTYPE_INT_t ctypedef np.float64_t DTYPE_FLT_t @cython.boundscheck(False) def sp_matrix_vector_rmult(np.ndarray[DTYPE_INT_t] m_indptr, np.ndarray[DTYPE_INT_t] m_indices, np.ndarray[DTYPE_FLT_t] m_data, np.ndarray[DTYPE_INT_t] v_indices, np.ndarray[DTYPE_FLT_t] v_data): """ ASSUMPTION: CSR structure of input matrix has sorted indices. m_indptr, matrix's pointer to row start in indices/data m_indices, non-negative column indices for matrix m_data, non-negative data values for matrix v_indices, non-negative column indices for vector v_data, non-negative data values for vector """ assert m_indptr.dtype == DTYPE_INT assert m_indices.dtype == DTYPE_INT assert m_data.dtype == DTYPE_FLT assert v_indices.dtype == DTYPE_INT assert v_data.dtype == DTYPE_FLT cdef int M = m_indptr.shape[0] - 1 cdef int v_nnz = v_indices.shape[0] cdef np.ndarray[DTYPE_FLT_t] output_vector = np.empty(M, dtype=DTYPE_FLT) cdef int count, v_pointer, increase_v, exhausted_v, v_index, row_start cdef int row_end, m_pointer, increase_m, col_index cdef DTYPE_FLT_t inner_product for count in range(M): inner_product = 0.0 v_pointer = 0 increase_v = 0 exhausted_v = 0 v_index = v_indices[v_pointer] row_start = m_indptr[count] row_end = m_indptr[count+1] for m_pointer in range(row_start, row_end): if exhausted_v == 1: exhausted_v = 0 break increase_m = 0 while increase_m == 0: if increase_v == 1: v_pointer = v_pointer + 1 if v_pointer >= v_nnz: exhausted_v = 1 break v_index = v_indices[v_pointer] increase_v = 0 col_index = m_indices[m_pointer] if col_index < v_index: increase_m = 1 continue elif col_index == v_index: inner_product = inner_product + m_data[m_pointer]*v_data[v_pointer] increase_v = 1 increase_m = 1 elif col_index > v_index: increase_v = 1 output_vector[count] = inner_product return output_vector

Compiling and running this code, we see **our Cython-aided implementation of sparse-matrix multiplication is actually twice as fast as the Scipy-computed multiplication** of the same matrices in the unhashed space! And there’s no memory-bloating inefficient CSRCSC conversions in the process.

# testing out type-declared Cython method on hash-tricked data structures

>>> from newsle.nlp.linalg.sparse import sp_matrix_vector_rmult

>>> print M_hash.shape, v_hash.shape

(144, 2147483647) (2147483647, 1)

>>> print M_hash.nnz, v_hash.nnz

3211379 464

>>> res_5 = sp_matrix_vector_rmult(m_indptr, m_indices, m_data, v_indices, v_data)

>>> print industries[res_5.argmax()]

Wireless

>>> %timeit res_5 = sp_matrix_vector_rmult(m_indptr, m_indices, m_data, v_indices, v_data)

100 loops, best of 3: 11.1 ms per loop

# the following is the method using multiplication within Scipy on unhashed space

>>> print M.shape, v.shape

(144, 995887) (995887, 1)

>>> print M.nnz, v.nnz

3211379 352

>>> res_0 = M * v

>>> np.allclose(res_0.T.toarray()[0], res_5)

True

In [37]: %timeit res_0 = M*v

10 loops, best of 3: 27.4 ms per loop

(Notice that are actually more nonzero entries in `v_hash`

than in `v`

. This is because we no longer have a dictionary for seeing if a word’s been seen before or not. If it’s not in the corpus, though, we needn’t worry about computing the inner product, as those terms will vanish.)

*Erich is Newsle’s sole Machine Learning Engineer. He works on fun problems of entity disambiguation, story clustering, and topic modeling. Follow him on Twitter: @erich_owens*