Text Classification and Feature Hashing: Sparse Matrix-Vector Multiplication with Cython

by Erich Owens


Optimizing the memory footprint of a classifier used here at Newsle sent us down a rabbit hole of rewriting a basic Scipy function with Cython, something that only became a problem when our high-dimensional text spaces grew to a cartoonish size, thanks to the hashing trick. Here I motivate the use of the hashing trick, explain how we use sparse matrix-vector multiplication for text classification, and describe how we derived and wrote the new implementation.

The Hashing Trick: Why, How, What

In many information retrieval, natural language processing, or machine learning contexts, it is standard to work with a large space of words/n-grams, sometimes on the order of a few million unique observations. This bag-of-words model treats a document as a point in a high-dimensional space: a sparse vector whose nonzero entries line up with the terms that document contains.

Creating this corpus model is typically done in a stateful, incremental way, where terms are assigned a coordinate in the order in which that word is seen. For instance, the sentence “The dog is a nice dog, but cats aren’t so nice.” might be tokenized, filtered for stopwords, counted, and embedded into its own space as the vector [2, 2, 1], where the ordered dimensions correspond to {dog, nice, cat}.
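
As a toy sketch of that pipeline (with a crude regex tokenizer and an invented stopword list; presumably a real tokenizer would also stem “cats” down onto the “cat” dimension):

import re
from collections import Counter

STOPWORDS = {"the", "is", "a", "but", "so", "aren", "t"}  # toy stopword list

def bag_of_words(text):
    # lowercase, keep alphabetic runs, drop stopwords, and count
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

print bag_of_words("The dog is a nice dog, but cats aren't so nice.")
# counts: {dog: 2, nice: 2, cats: 1}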

Keeping this ordering around in a production environment (usually in the form of a hash table or dictionary with constant lookup time) can have a huge memory footprint. It also requires that we process the corpus serially, or share state across processes, complicating parallelization. Enter the hashing trick.

Rather than derive the feature coordinates from the order seen, we can generate a column index for a given string with a hash function, mapping to enough (prime-numbered) buckets to avoid collisions even for a corpus with a million unique terms. For this exercise, we’ll use MurmurHash3 (MMH3). Choosing a 32-bit hash and modding out the sign to ensure sensible column indices, our domain of possible column indices now extends up to 2^{31} - 1 (which is also, incidentally, the eighth Mersenne prime!). On a corpus of nearly 1M Newsle industry bigrams (explained below), we encounter just 17 collisions, so we’ll call this “good enough.”
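
If you want to sanity-check the collision rate on a vocabulary of your own, a quick sketch (here vocabulary is a hypothetical list of unique terms):

import mmh3

MERSENNE_8 = 2 ** 31 - 1  # 2147483647

# bucket every term, then count how many buckets are shared
buckets = [mmh3.hash(term) % MERSENNE_8 for term in vocabulary]
collisions = len(buckets) - len(set(buckets))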

Hashing Trick in Practice: Scipy Sparse and Text Classifiers

Let’s try a simple classification task. We have a corpus of bigrams used across news articles flagged as representing one of nearly two hundred industries (e.g., banking, software, management consulting, fisheries), with each industry compactly represented by the centroid of its many representative articles. We can then assert that a new query document is described by whichever centroid is closest to it, where we define closeness by cosine similarity. The underlying space can then be thought of as a Voronoi tessellation shattered into nearly two hundred pieces (each a cell centered on a centroid vector), and this method is often called a “nearest-centroid” or “1NN” classifier.

Classification is then as simple as erecting a matrix representation of the corpus, called M (with rows as industry centroid vectors), l2-normalizing its rows, and constructing a vector for the new document, called v, with the same method of (TF-IDF) weighting and with unit l2-norm. Computing the cosine similarity against every category is then just the matrix-vector multiplication

r = M \cdot v.
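
On a dense toy example (three made-up centroids over a four-term space; real spaces are sparse and enormous, as below), the whole computation is just:

import numpy as np

# three toy industry "centroids" over a four-term space
M = np.array([[1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 1., 0., 0.]])
row_norms = np.sqrt((M ** 2).sum(axis=1))
M = M / row_norms[:, np.newaxis]   # l2-normalize each centroid row

v = np.array([1., 0., 2., 0.])
v = v / np.sqrt((v ** 2).sum())    # unit l2-norm query document

r = M.dot(v)        # r[i] is the cosine similarity to centroid i
print r.argmax()    # index of the winning centroid: 0 here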

Arg-maxing on r will yield the closest industry, and our prediction. Let’s begin!

Let’s take our article to be a recent New York Times piece on BlackBerry’s announcement of its new phones and operating system. Without feature hashing, we’d construct our vector by computing TF-IDF weights on tokenized terms, referencing a dictionary for both the index and the inverse-document-frequency (IDF) lookup. Here we use Python’s popular Scipy Sparse module, a library that allows us to efficiently store matrices with mostly zero entries across a variety of sparse data representations, while supporting many important linear-algebraic methods (e.g., back-substitutions, factorizations, and eigendecompositions by the Lanczos algorithm).

from scipy import sparse
from math import sqrt

tfs = nlp.tokenize_and_count(article)
# keep only terms with a known feature index (filter() would drop the counts)
seen_tfs = {term: tf for term, tf in tfs.iteritems() if term in feature_lookup}

rows = [feature_lookup[term] for term in seen_tfs.iterkeys()]
columns = [0 for _ in xrange(len(rows))]
tfidfs = [tf * idf[term] for term, tf in seen_tfs.iteritems()]

norm = sqrt(sum(w**2 for w in tfidfs))
normed_tfidfs = [w / norm for w in tfidfs]

# v needs as many rows as M has columns
v = sparse.csc_matrix((normed_tfidfs, (rows, columns)), shape = (M.shape[1], 1))
r = M * v
classification = industries[r.toarray().argmax()]

Here our classifier labels the Times article as being about “Wireless.” The next closest guesses were “Telecommunications” and “Consumer Electronics.” Pretty good, classifier! Let’s now assume we want this script to run across many child processes, with the memory footprint on each as small as possible, and that we’d like to use the hashing trick instead of a dictionary for the feature lookup.

Sparse CSR (compressed sparse row) data structures will play nicely as long as the column indices fit into a numpy.int32 data type, so a signed 32-bit hash function will work well here. We merely need to replace the matrix M with {\hat M}, whose column indices now live in \{0,1,\dots,2^{31} - 1\} instead of \{0,1,\dots,900000\}. The data structure is the same size in memory as before, but now we need only keep a hash function on hand instead of an additional hash table. Let’s make the new vector {\hat v} the same way as before, this time drawing on a word hash, and pulling the IDFs from a C++-implemented MARISA trie. A naive implementation of this procedure might be:

import mmh3
from math import sqrt, log
from scipy import sparse

MAX_INT = 2147483647

def word_hash(s):
    return (mmh3.hash(s) % MAX_INT)

def calc_tfidf(tf, df, N = 200):
    return tf * log(N / (1.0 + df))

tfs        = nlp.tokenize_and_count(article)
hashed_tfs = {word_hash(term):tf for term, tf in tfs.iteritems()}

rows    = hashed_tfs.keys()
columns = [0 for _ in xrange(len(rows))]
tfidfs  = [calc_tfidf(tf, df = df_trie.get(h_term, [(0,)])[0][0]) for h_term, tf in hashed_tfs.iteritems()]

norm    = sqrt(sum(w**2 for w in tfidfs))
normed_tfidfs = [w / norm for w in tfidfs]

v_hashed = sparse.csc_matrix((normed_tfidfs, (rows, columns)), shape = (MAX_INT, 1))

r = M_hashed * v_hashed
classification = industries[r.toarray().argmax()]

This code will fail! The problem is in the penultimate line. It was natural for us to structure the matrix M_hashed as a CSR data structure (there is sparse data on every row, but not for every column) and the vector v_hashed as a CSC (Compressed Sparse Column: it’s a single column vector, and nearly 2^{31} of its rows are zero). The problem is in how Scipy Sparse does multiplication: it requires both operands in the product A*B to be either CSR or CSC, and will convert the second to the format of the first if this is not the case. Forcing our sparse vector v_hashed to become CSR, however, will blow up our memory.

To see why, let’s try a toy example. Consider the following vector in {\mathbf R}^{15}, which has nonzero entries in only four places. One way of constructing this sparse vector is by passing in the ordered coordinates:

I, J, V = [2, 5, 6, 12], [0, 0, 0, 0], [0.1, 0.2, 0.3, 0.9]
v = sparse.csc_matrix((V, (I, J)), shape = (15, 1))

The CSC vector we constructed is now uniquely determined by three core numpy arrays: v.data, the nonzero values; v.indices, the rows corresponding to the values in the previous array; and v.indptr, a pointer into the previous two arrays, telling us where the values and row indices for the contents of the ith column lie.

>>> print v.todense().T
[[ 0. 0. 0.1 0. 0. 0.2 0.3 0. 0. 0. 0. 0. 0.9 0. 0. ]]
>>> print v.data, v.indices, v.indptr
[0.1 0.2 0.3 0.9] [2 5 6 12] [0 4]

CSR is a very similar data structure to CSC, but its indices array points to the nonzero values’ columns, and its indptr has an entry for every row, marking where that row’s data begins. For a column vector, you can imagine this to be redundant:

>>> v_csr = v.tocsr()
>>> print v_csr.todense().T
[[ 0. 0. 0.1 0. 0. 0.2 0.3 0. 0. 0. 0. 0. 0.9 0. 0. ]]
>>> print v_csr.data, v_csr.indices, v_csr.indptr
[ 0.1 0.2 0.3 0.9] [0 0 0 0] [0 0 0 1 1 1 2 3 3 3 3 3 3 4 4 4]

And so it is! Though this vector is the same mathematical object as its CSC representation, its row pointer is forced to have an entry for every (nominally) new row in the vector. That array’s length is the full dimension of the space, defeating the point of a sparse structure. Now you can see why our classification code above failed: coercing v_hashed to CSR constructs a structure whose indptr array holds nearly 2^{31} numpy long integers (at 8 bytes apiece, roughly 17 GB for the pointer array alone), and that’s going to cause a segfault.

So, we clearly need a CSR times CSC multiplication method. Not finding one out there on the internet, we will write one of our own.

Sparse CSR/CSC Multiplication in Python

Computing this sparse matrix-vector product means scanning the rows of our matrix {\hat M}: wherever a row shares nonzero columns with the vector factor, we multiply the corresponding entries, and the sum of all such products is the inner product for that row. Trying out this naive logic as Pythonically as possible:

from scipy import sparse
import numpy as np

def pythonic_mult(M, v):
    assert isinstance(M, sparse.csr.csr_matrix), "matrix M must be CSR format."
    assert isinstance(v, sparse.csc.csc_matrix), "vector v must be CSC format."
    assert M.shape[1] == v.shape[0], "Inner dimensions must agree."
    assert v.shape[1] == 1, "v must be a column vector."

    v_values = dict(zip(v.indices, v.data))

    num_rows = M.shape[0]
    res      = np.zeros(num_rows)

    for i in xrange(num_rows):
        # slice out row i's column indices and data values
        a, b = M.indptr[i], M.indptr[i+1]
        M_indices_data = zip(M.indices[a:b], M.data[a:b])
        # sum products over the columns this row shares with v
        ip = sum(val * v_values.get(k, 0.0) for k, val in M_indices_data)
        res[i] = ip

    return res

Testing this out…

>>> res_1 = pythonic_mult(M_hash, v_hash)
>>> %timeit res_1 = pythonic_mult(M_hash, v_hash)
1 loops, best of 3: 5.48 s per loop
>>> print industries[res_1.argmax()]
Wireless

Well, we got the right answer and didn’t segfault, but the running time is just terrible. Let’s start over. We know we’re only going to use the columns of M_hash that correspond to nonzero rows of v_hash, so why don’t we scan through the structure of M_hash, throw away everything we don’t need, and multiply the reduced matrix by the dense array left over in v_hash? Let’s give that a try:

from itertools import izip

def pythonic_mult_2(M, v):
    assert isinstance(M, sparse.csr.csr_matrix), "matrix M must be CSR format."
    assert isinstance(v, sparse.csc.csc_matrix), "vector v must be CSC format."
    assert M.shape[1] == v.shape[0], "Inner dimensions must agree."
    assert v.shape[1] == 1, "v must be a column vector."

    # map each surviving column index to its position in v's data array
    kept_columns = {x: i for i, x in enumerate(v.indices)}
    num_rows = M.shape[0]

    indices, data, indptr = [], [], [0]
    for i in xrange(num_rows):
        a, b = M.indptr[i], M.indptr[i+1]
        for index, d in izip(M.indices[a:b], M.data[a:b]):
            if index in kept_columns:
                indices.append(kept_columns[index])
                data.append(d)
        indptr.append(len(data))

    red_mat = sparse.csr_matrix((np.array(data), np.array(indices),
                                 np.array(indptr)),
                                 shape = (M.shape[0], len(kept_columns)))
    return red_mat.dot(v.data)


>>> res_2 = pythonic_mult_2(M_hash, v_hash)
>>> %timeit res_2 = pythonic_mult_2(M_hash, v_hash)
1 loops, best of 3: 778 ms per loop
>>> print industries[res_2.argmax()]
Wireless
>>> print np.allclose(res_1, res_2)
True

Our answers agree, and this was substantially faster than the first try. But Scipy’s implementation in the unhashed space takes only 28 milliseconds on the same machine. Time to get closer to the metal.

Sparse CSR/CSC Multiplication in Cython

Cython is a superset of Python that can compile to high-performance C from almost-standard Python with very little effort. This is my first attempt at optimizing code with Cython, so I first wrote a matrix-vector multiplication as I might have in MATLAB or C: using pointers and flags. For a given row of the matrix, I iterate over its indices array in step with the vector’s indices array. Assuming both are sorted (something you’ll have to check when you pull these arrays from the Scipy Sparse structure), I advance whichever pointer is behind, and whenever the two indices agree, the arrays have an entry in common and I increment the inner product by the product of the corresponding values. Simple logic flow, if a bit convoluted in its book-keeping:

import numpy as np
from scipy import sparse

def py_ptr_multiply(m_indptr, m_indices, m_data, v_indices, v_data):
    """
    ASSUMPTION: CSR structure of input matrix has sorted indices.

    m_indptr,       matrix's pointer to each row's start in indices/data
    m_indices,      non-negative column indices for matrix entries
    m_data,         data values for matrix entries
    v_indices,      non-negative row indices for vector entries
    v_data,         data values for vector entries
    """

    M = m_indptr.shape[0] - 1
    v_nnz = v_indices.shape[0]
    output_vector = np.empty(M)

    for count in range(M):
        inner_product = 0.0

        v_pointer = 0
        increase_v = 0
        exhausted_v = 0

        v_index   = v_indices[v_pointer]
        row_start = m_indptr[count]
        row_end = m_indptr[count+1]

        for m_pointer in range(row_start, row_end):
            if exhausted_v == 1:
                exhausted_v = 0
                break

            increase_m = 0
            while increase_m == 0:
                if increase_v == 1:
                    v_pointer = v_pointer + 1
                    if v_pointer >= v_nnz:
                        exhausted_v = 1
                        break
                    v_index = v_indices[v_pointer]
                    increase_v = 0

                col_index = m_indices[m_pointer]

                if col_index < v_index:
                    increase_m = 1
                    continue

                elif col_index == v_index:
                    inner_product = inner_product + m_data[m_pointer]*v_data[v_pointer]
                    increase_v = 1
                    increase_m = 1

                elif col_index > v_index:
                    increase_v = 1

        output_vector[count] = inner_product

    return output_vector

Testing this out, we see it’s faster than our first Pythonic attempt, but slower than the second:

>>> assert M_hash.has_sorted_indices, "M must have sorted indices along its rows."
>>> assert v_hash.has_sorted_indices, "v must have sorted indices along its column."
>>> m_indptr, m_indices, m_data = M_hash.indptr, M_hash.indices, M_hash.data
>>> v_indices, v_data = v_hash.indices, v_hash.data
>>> res_3 = py_ptr_multiply(m_indptr, m_indices, m_data, v_indices, v_data)
>>> print industries[res_3.argmax()]
Wireless
>>> %timeit res_3 = py_ptr_multiply(m_indptr, m_indices, m_data, v_indices, v_data)
1 loops, best of 3: 1.12 s per loop

Saving this as a separate file in the .pyx format, we can compile it without any changes and import it directly into a Python script or within the shell.
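
A minimal build script for that compilation step might look like the following (a sketch assuming a reasonably recent Cython; this file is mine, not part of the original pipeline):

# setup.py -- minimal sketch of a build script for the .pyx module
from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules = cythonize("cy_ptr_multiply_1.pyx"))
# (the NumPy-typed version further below would also need numpy's headers,
#  e.g. include_dirs = [numpy.get_include()])

Running python setup.py build_ext --inplace drops the compiled extension next to the source, after which the import works as usual: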


>>> from cy_ptr_multiply_1 import ptr_multiply_1
>>> res_4 = ptr_multiply_1(m_indptr, m_indices, m_data, v_indices, v_data)
>>> print industries[res_4.argmax()]
Wireless
>>> %timeit res_4 = ptr_multiply_1(m_indptr, m_indices, m_data, v_indices, v_data)
1 loops, best of 3: 512 ms per loop

It’s twice as fast with no changes! But let’s be smarter. Cython has special support for NumPy arrays (by way of a ‘cimport numpy’), and we can declare types on our variables and arrays. We can additionally tell Cython to skip checking for improper indices on arrays by way of a boundscheck decorator (which means you’ll need to check your input!):

import numpy as np
from scipy import sparse
cimport numpy as np
cimport cython

DTYPE_INT = np.int32
DTYPE_FLT = np.float64

ctypedef np.int32_t DTYPE_INT_t
ctypedef np.float64_t DTYPE_FLT_t

@cython.boundscheck(False)
def sp_matrix_vector_rmult(np.ndarray[DTYPE_INT_t] m_indptr, np.ndarray[DTYPE_INT_t] m_indices,
                   np.ndarray[DTYPE_FLT_t] m_data, np.ndarray[DTYPE_INT_t] v_indices,
                   np.ndarray[DTYPE_FLT_t] v_data):
    """
    ASSUMPTION: CSR structure of input matrix has sorted indices.

    m_indptr,       matrix's pointer to each row's start in indices/data
    m_indices,      non-negative column indices for matrix entries
    m_data,         data values for matrix entries
    v_indices,      non-negative row indices for vector entries
    v_data,         data values for vector entries
    """

    assert m_indptr.dtype == DTYPE_INT
    assert m_indices.dtype == DTYPE_INT
    assert m_data.dtype == DTYPE_FLT
    assert v_indices.dtype == DTYPE_INT
    assert v_data.dtype == DTYPE_FLT

    cdef int M = m_indptr.shape[0] - 1
    cdef int v_nnz = v_indices.shape[0]
    cdef np.ndarray[DTYPE_FLT_t] output_vector = np.empty(M, dtype=DTYPE_FLT)

    cdef int count, v_pointer, increase_v, exhausted_v, v_index, row_start
    cdef int row_end, m_pointer, increase_m, col_index

    cdef DTYPE_FLT_t inner_product

    for count in range(M):
        inner_product = 0.0

        v_pointer = 0
        increase_v = 0
        exhausted_v = 0

        v_index   = v_indices[v_pointer]
        row_start = m_indptr[count]
        row_end = m_indptr[count+1]

        for m_pointer in range(row_start, row_end):
            if exhausted_v == 1:
                exhausted_v = 0
                break

            increase_m = 0
            while increase_m == 0:
                if increase_v == 1:
                    v_pointer = v_pointer + 1
                    if v_pointer >= v_nnz:
                        exhausted_v = 1
                        break
                    v_index = v_indices[v_pointer]
                    increase_v = 0

                col_index = m_indices[m_pointer]

                if col_index < v_index:
                    increase_m = 1
                    continue

                elif col_index == v_index:
                    inner_product = inner_product + m_data[m_pointer]*v_data[v_pointer]
                    increase_v = 1
                    increase_m = 1

                elif col_index > v_index:
                    increase_v = 1

        output_vector[count] = inner_product

    return output_vector

Compiling and running this code, we see our Cython-aided implementation of sparse matrix-vector multiplication is actually more than twice as fast as the Scipy-computed multiplication of the same matrices in the unhashed space! And there are no memory-bloating CSR-to-CSC conversions in the process.


# testing out type-declared Cython method on hash-tricked data structures
>>> from newsle.nlp.linalg.sparse import sp_matrix_vector_rmult
>>> print M_hash.shape, v_hash.shape
(144, 2147483647) (2147483647, 1)
>>> print M_hash.nnz, v_hash.nnz
3211379 464
>>> res_5 = sp_matrix_vector_rmult(m_indptr, m_indices, m_data, v_indices, v_data)
>>> print industries[res_5.argmax()]
Wireless
>>> %timeit res_5 = sp_matrix_vector_rmult(m_indptr, m_indices, m_data, v_indices, v_data)
100 loops, best of 3: 11.1 ms per loop

# the following is the method using multiplication within Scipy on unhashed space
>>> print M.shape, v.shape
(144, 995887) (995887, 1)
>>> print M.nnz, v.nnz
3211379 352
>>> res_0 = M * v
>>> np.allclose(res_0.T.toarray()[0], res_5)
True
>>> %timeit res_0 = M * v
10 loops, best of 3: 27.4 ms per loop

(Notice that there are actually more nonzero entries in v_hash than in v. This is because we no longer have a dictionary telling us whether a word has been seen in the corpus before. If a term isn’t in the corpus, though, we needn’t worry about it: its contributions to the inner products simply vanish.)

Erich is Newsle’s sole Machine Learning Engineer. He works on fun problems of entity disambiguation, story clustering, and topic modeling. Follow him on Twitter: @erich_owens