How to Make NLP 100× Faster with Cython and spaCy

This article explains how Cython—a Python superset that compiles to C—can accelerate natural‑language‑processing tasks, showcases the NeuralCoref v3.0 project, provides profiling tips, demonstrates Python‑to‑Cython code transformations, and presents benchmark results showing up to a hundred‑fold speedup.

MaGe Linux Operations

Jun 6, 2019

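Cython is essentially a superset of Python that compiles to C: you write Python-like code, optionally add C type declarations, and the Cython compiler turns it into a native extension module. Performance-critical parts of libraries such as pandas and spaCy are written this way.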

In this article we introduce the GitHub project NeuralCoref v3.0, which combines spaCy and Cython to achieve roughly a hundred‑fold speed increase for coreference resolution while preserving accuracy.

Typical scenarios that benefit from such acceleration include building production NLP modules in Python, analyzing large NLP datasets, and preprocessing massive training corpora for deep‑learning frameworks such as PyTorch or TensorFlow.

First, profile your Python code to locate bottlenecks; a simple way is using cProfile:

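import cProfile
import pstats
import myslowmodule

# Run the slow entry point under the profiler and save the raw stats to a file
cProfile.run('myslowmodule.run()', 'restats')

# Load the stats and print the 30 entries with the highest cumulative time
p = pstats.Stats('restats')
p.sort_stats('cumulative').print_stats(30)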
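
If the slow code is a function rather than a whole module, the same profile can be collected in-process. The following is a minimal sketch using only the standard library; `slow_sum` and `profile_call` are hypothetical names introduced here for illustration, standing in for your own slow function and a small helper around it.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical stand-in for a slow function: a plain Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args)
    finally:
        profiler.disable()
    # Write the report into a string instead of stdout so it can be logged or saved
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


result, report = profile_call(slow_sum, 100_000)
print(report)
```

Functions that appear near the top of the cumulative-time listing and consist mostly of a Python-level loop are the natural candidates for a Cython rewrite.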

When the bottleneck is a tight loop over many Python objects, Cython can dramatically speed it up. Cython distinguishes between regular Python objects, which are handled through the interpreter's object machinery, and C-level values (e.g., int, float, struct) declared with cdef. Code that touches only C-level values compiles to fast native code.
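
One way to get a feel for the difference between a Python object and a raw C value, without compiling anything, is to compare their sizes in memory. The sketch below is only an illustration in plain Python: it uses the standard-library `array` module as a stand-in for a C float buffer, and it does not involve Cython itself.

```python
import sys
from array import array

n = 1000
py_floats = [float(i) for i in range(n)]  # list of boxed Python float objects
c_floats = array("f", py_floats)          # contiguous buffer of C floats

# Each Python float is a full object carrying a reference count and a type pointer
bytes_per_py_float = sys.getsizeof(py_floats[0])
# Each array slot is a raw C float with no per-item object header
bytes_per_c_float = c_floats.itemsize

print(bytes_per_py_float, bytes_per_c_float)
```

The per-item overhead is one reason a loop over Python objects is slow: every access involves pointer chasing, type checks, and reference counting that a loop over C values avoids.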

Consider a simple example that counts rectangles whose area exceeds a threshold:

from random import random

class Rectangle:
    def __init__(self, w, h):
        self.w = w
        self.h = h
    def area(self):
        return self.w * self.h

def check_rectangles(rectangles, threshold):
    n_out = 0
    for rectangle in rectangles:
        if rectangle.area() > threshold:
            n_out += 1
    return n_out

def main():
    n_rectangles = 10000000
    rectangles = [Rectangle(random(), random()) for i in range(n_rectangles)]
    n_out = check_rectangles(rectangles, threshold=0.25)
    print(n_out)
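
Before rewriting anything, it helps to measure the loop in isolation. The following sketch repeats the definitions above so that it runs on its own and times only the call to `check_rectangles`. The helper name `time_check`, the reduced size of 100,000 rectangles, and the fixed random seed are assumptions made here to keep the run short and repeatable.

```python
import random
import time


class Rectangle:
    def __init__(self, w, h):
        self.w = w
        self.h = h

    def area(self):
        return self.w * self.h


def check_rectangles(rectangles, threshold):
    n_out = 0
    for rectangle in rectangles:
        if rectangle.area() > threshold:
            n_out += 1
    return n_out


def time_check(n_rectangles=100_000, threshold=0.25, seed=0):
    """Return (count, seconds) for one run of the pure-Python loop."""
    rng = random.Random(seed)  # fixed seed so repeated runs see the same data
    rectangles = [Rectangle(rng.random(), rng.random()) for _ in range(n_rectangles)]
    start = time.perf_counter()  # time only the loop, not the list construction
    n_out = check_rectangles(rectangles, threshold)
    elapsed = time.perf_counter() - start
    return n_out, elapsed


n_out, elapsed = time_check()
print(f"{n_out} rectangles above threshold in {elapsed * 1000:.1f} ms")
```

Running the same measurement against a compiled version gives a like-for-like comparison, provided both versions see the same number of rectangles.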

The check_rectangles function is the performance bottleneck because each iteration incurs Python‑level overhead. Re‑implement it in Cython using a C struct and a cdef function:

from random import random
from cymem.cymem cimport Pool

# A plain C struct replaces the Python class: no per-object overhead
cdef struct Rectangle:
    float w
    float h

# cdef function: callable only from Cython/C, operates purely on C types
cdef int check_rectangles(Rectangle* rectangles, int n_rectangles, float threshold):
    cdef int n_out = 0
    cdef int i
    for i in range(n_rectangles):
        if rectangles[i].w * rectangles[i].h > threshold:
            n_out += 1
    return n_out

def main():
    cdef int n_rectangles = 10000000
    cdef float threshold = 0.25
    # The Pool owns the allocation and frees it when it is garbage-collected
    cdef Pool mem = Pool()
    cdef Rectangle* rectangles = <Rectangle*>mem.alloc(n_rectangles, sizeof(Rectangle))
    cdef int i
    for i in range(n_rectangles):
        rectangles[i].w = random()
        rectangles[i].h = random()
    n_out = check_rectangles(rectangles, n_rectangles, threshold)
    print(n_out)

spaCy’s internal data structures make it easy to apply the same technique to NLP. All Unicode strings (tokens, lemmas, POS tags, etc.) are stored in a StringStore and accessed via 64‑bit hash codes, allowing C‑level loops to work with integers instead of Python strings.
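
The idea of replacing string comparisons with integer comparisons can be sketched in plain Python. The example below is a toy illustration only: `ToyStringStore` and `hash_string` are hypothetical names, and the hash function (BLAKE2b truncated to 8 bytes) is an assumption chosen because it ships with the standard library. It is not spaCy's StringStore or its hashing scheme.

```python
import hashlib


def hash_string(text):
    """Map a string to a 64-bit unsigned integer (illustrative, not spaCy's hash)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class ToyStringStore:
    """Hypothetical stand-in: maps strings to 64-bit keys and back."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text):
        key = hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        return self._by_hash[key]


store = ToyStringStore()
tokens = ["I", "run", "every", "day", "after", "a", "long", "run"]
# Store each lowercased token once and keep only its integer key
token_ids = [store.add(tok.lower()) for tok in tokens]

run_id = store.add("run")
# The inner comparison is now integer == integer, never string == string
n_run = sum(1 for tok_id in token_ids if tok_id == run_id)
print(n_run, store[run_id])
```

Because the keys are fixed-width integers, they can be stored in C structs and compared in a C loop, while the original strings remain recoverable through the store.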

Example of a fast NLP task: counting occurrences of the word “run” tagged as a noun (NN) in a large corpus.

def slow_loop(doc_list, word, tag):
    n_out = 0
    for doc in doc_list:
        for tok in doc:
            if tok.lower_ == word and tok.tag_ == tag:
                n_out += 1
    return n_out

def main_nlp_slow(doc_list):
    n_out = slow_loop(doc_list, 'run', 'NN')
    print(n_out)

The pure‑Python version takes about 1.4 seconds on a modest laptop for ten documents (~170 k characters each, about 1.7 million characters in total). The Cython version below runs in roughly 20 ms, a speedup of roughly 70 to 80×, which works out to about 80 million characters per second.

%%cython -+
import numpy
from cymem.cymem cimport Pool
from spacy.tokens.doc cimport Doc
from spacy.typedefs cimport hash_t
from spacy.structs cimport TokenC

cdef struct DocElement:
    TokenC* c
    int length

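cdef int fast_loop(DocElement* docs, int n_docs, hash_t word, hash_t tag):
    # Declare all C variables up front: cdef is not allowed inside a loop body
    cdef int n_out = 0
    cdef int i, j
    cdef TokenC* token_arr
    for i in range(n_docs):
        token_arr = docs[i].c
        for j in range(docs[i].length):
            # Both tests compare integer hashes, not Python strings
            if token_arr[j].lex.lower == word and token_arr[j].tag == tag:
                n_out += 1
    return n_out

def main_nlp_fast(doc_list):
    cdef int n_docs = len(doc_list)
    cdef int i, n_out
    cdef Pool mem = Pool()
    cdef DocElement* docs = <DocElement*>mem.alloc(n_docs, sizeof(DocElement))
    cdef Doc doc  # typed so the C-level attributes .c and .length are accessible
    for i, doc in enumerate(doc_list):
        # Copy each Doc's token-array pointer and length into the C array
        docs[i].c = doc.c
        docs[i].length = doc.length
    # Look up the hashes once, outside the hot loop. This assumes doc_list is
    # non-empty, because the vocabulary is taken from the last Doc seen above.
    word_hash = doc.vocab.strings.add('run')
    tag_hash = doc.vocab.strings.add('NN')
    n_out = fast_loop(docs, n_docs, word_hash, tag_hash)
    print(n_out)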

For further reading, see the official Cython tutorials and spaCy’s Cython API documentation.
