Matthew Honnibal's spaCy bills itself as an "industrial-strength" tool for natural language processing.

Thus far I have used spaCy (henceforth "Spacy") only for named entity recognition (NER), and for this purpose Spacy definitely keeps its promise. This page will help you get the tool up and running and give you some basic code for extracting entities from your documents.

Setting Spacy Up

The folks at Spacy have worked hard to make their tools easy to install, so much so that I think you'll need very little help at all, at least on Ubuntu. If you have problems, see my old notes, and good luck.

Create a Virtual Environment

Serious work in Python requires a virtual environment. See my "Using virtualenv" page for an easy way to manage this.

Install Spacy

Assuming you have a working virtual environment, activate it before you install Spacy.

$ source venv/bin/activate

(venv) $ 
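If you want to confirm from inside Python that the environment really is active before installing anything, a check along these lines may help. This is only a sketch: the function name in_virtualenv is my own, and it relies on the sys.prefix and sys.base_prefix attributes that the standard venv module sets (older virtualenv releases set sys.real_prefix instead).

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter belongs to a virtual environment."""
    # venv (and recent virtualenv) leave sys.base_prefix pointing at the
    # system Python, while sys.prefix points at the environment itself.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    # Older virtualenv releases set sys.real_prefix instead.
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

If this prints False, pip would install Spacy into the system Python rather than into venv.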

Go to https://spacy.io/usage/ and follow the directions. Here are the steps I took on both Mac and Ubuntu.

(venv) $ pip install -U spacy

(venv) $ python -m spacy download en

Troubleshooting

Validation

Run validate to check for issues.

The following example shows a successful validation. Note the checkmarks on the right.

(venv) $ python -m spacy validate

    Installed models (spaCy v2.0.18)
    /Users/jkurlandski/workspace/DeepLearning/venv/lib/python3.7/site-packages/spacy

    TYPE        NAME                  MODEL                 VERSION                                   
    package     en-core-web-sm        en_core_web_sm        2.0.0    ✔      
    link        en_core_web_sm        en_core_web_sm        2.0.0    ✔      
    link        en                    en_core_web_sm        2.0.0    ✔      

This next example shows an issue discovered by the validate command.

(venv) $ python -m spacy validate

    Installed models (spaCy v2.0.18)
    /Users/jkurlandski/workspace/DeepLearning/venv/lib/python3.7/site-packages/spacy

    TYPE        NAME                  MODEL                 VERSION                                   
    package     en-core-web-sm        en_core_web_sm        2.1.0    --> 2.0.0           
    link        en                    en_core_web_sm        2.1.0    --> 2.0.0           

    Use the following commands to update the model packages:
    python -m spacy download en_core_web_sm

    You may also want to overwrite the incompatible links using the `python
    -m spacy link` command with `--force`, or remove them from the data
    directory. Data path:
    /Users/jkurlandski/workspace/DeepLearning/venv/lib/python3.7/site-packages/spacy/data

The output tells us that spaCy version 2.0.18 is installed and expects en-core-web-sm version 2.0.0, but the installed model is version 2.1.0. The output also gives the command that fixes this:

(venv) $ python -m spacy download en_core_web_sm
Collecting en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
     |████████████████████████████████| 37.4MB 944kB/s 
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.0.0-cp37-none-any.whl size=37405977 sha256=b5e28721ea14ee32ff078193066dfbf0060782cfb6b6c1bfc09da0e49e03fe60
  Stored in directory: /private/var/folders/c5/t56f2m1s6rq90qlsfk6wnjl80000gn/T/pip-ephem-wheel-cache-2yx9unop/wheels/54/7c/d8/f86364af8fbba7258e14adae115f18dd2c91552406edc3fdaa
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
  Found existing installation: en-core-web-sm 2.1.0
    Uninstalling en-core-web-sm-2.1.0:
      Successfully uninstalled en-core-web-sm-2.1.0
Successfully installed en-core-web-sm-2.0.0

    Linking successful
    /Users/jkurlandski/workspace/DeepLearning/venv/lib/python3.7/site-packages/en_core_web_sm
    -->
    /Users/jkurlandski/workspace/DeepLearning/venv/lib/python3.7/site-packages/spacy/data/en_core_web_sm

    You can now load the model via spacy.load('en_core_web_sm')
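To see the two version numbers that validate compares without reading through its table, you can ask Python's package metadata directly. The following is a minimal sketch, not a replacement for validate: it only reports what is installed and does not judge compatibility. It assumes Python 3.8 or later for importlib.metadata (on the 3.7 interpreter shown above you would need the importlib_metadata backport), and the helper name installed_version is my own.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names as pip reports them in the output above.
    for name in ("spacy", "en-core-web-sm"):
        print(name, installed_version(name) or "not installed")
```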

An error message such as the following indicates a language model link issue:

~/workspace/DeepLearning/NLP/Spacy/SpacyUtils.py in <module>
     12 Load the spaCy English language model one time for the entire application.
     13 """
---> 14 spacyEnglishModel = spacy.load('en')
     15 lightSpacyEnglishModel = spacy.load("en", disable=["tagger", "parser", "ner", "textcat"])
     16 

~/workspace/DeepLearning/venv/lib/python3.7/site-packages/spacy/__init__.py in load(name, **overrides)
     19     if depr_path not in (True, False, None):
     20         deprecation_warning(Warnings.W001.format(path=depr_path))
---> 21     return util.load_model(name, **overrides)
     22 
     23 

~/workspace/DeepLearning/venv/lib/python3.7/site-packages/spacy/util.py in load_model(name, **overrides)
    117     elif hasattr(name, 'exists'):  # Path or Path-like to model data
    118         return load_model_from_path(name, **overrides)
--> 119     raise IOError(Errors.E050.format(name=name))
    120 
    121 

OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

You can fix it with the link command.

(venv) $ python -m spacy link en_core_web_sm en

    Linking successful
    /Users/jkurlandski/workspace/DeepLearning/venv/lib/python3.7/site-packages/en_core_web_sm
    -->
    /Users/jkurlandski/workspace/DeepLearning/venv/lib/python3.7/site-packages/spacy/data/en

    You can now load the model via spacy.load('en')
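If you would rather have your code survive a missing link than crash on startup, one option is to try several model names in turn. The sketch below is an assumption-laden illustration: load_first_available and fake_loader are names I made up, and the loader is passed in as a parameter so the example runs without Spacy installed. The only Spacy behavior it relies on is the one visible in the traceback above, namely that a missing model raises OSError.

```python
def load_first_available(names, loader):
    """Return loader(name) for the first name that loads.

    `loader` is expected to raise OSError for a missing model, as
    spacy.load does in the traceback above (error E050).
    """
    errors = []
    for name in names:
        try:
            return loader(name)
        except OSError as err:
            errors.append(f"{name}: {err}")
    raise OSError("No model could be loaded. Tried: " + "; ".join(errors))


def fake_loader(name):
    """Stand-in for spacy.load that only knows one model name."""
    if name != "en_core_web_sm":
        raise OSError(f"[E050] Can't find model '{name}'.")
    return f"<model {name}>"


if __name__ == "__main__":
    # The shortcut 'en' fails, so the full package name is used instead.
    print(load_first_available(["en", "en_core_web_sm"], fake_loader))
```

With Spacy installed, the equivalent call would be load_first_available(["en", "en_core_web_sm"], spacy.load).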

Sample Code

Here's some code to get you started with exploring named entity recognition with Spacy.

NerUtils

We'll start with a utility class offering functionality frequently used in named entity recognition in general.

from collections import namedtuple
from enum import Enum

NerEntity = namedtuple("NerEntity", "eType, offset, content")

# Person, Organization, Location
EntityType = Enum('EntityType', ['Person', 'Organization', 'Location'])

NerEntity creates a Python namedtuple having the three properties most named entity recognizers assign to an entity: an entity type, an offset, and the string from the text that refers to the named entity (the "content").
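As a quick illustration of how the three fields fit together, here is a small sketch that builds two entities by hand and checks that each offset really points at its content. The entities are constructed manually rather than produced by a recognizer, the eType values are plain strings for brevity, and the helper matches_text is hypothetical.

```python
from collections import namedtuple

NerEntity = namedtuple("NerEntity", "eType, offset, content")

text = "John Smith wrote to Mary Jones."

# Hand-built entities; each offset is a character position in `text`.
entities = [
    NerEntity("Person", text.index("John Smith"), "John Smith"),
    NerEntity("Person", text.index("Mary Jones"), "Mary Jones"),
]


def matches_text(entity, source):
    """Check that the entity's content really sits at its offset."""
    end = entity.offset + len(entity.content)
    return source[entity.offset:end] == entity.content


if __name__ == "__main__":
    for entity in entities:
        # Fields are available by name or by tuple unpacking.
        e_type, offset, content = entity
        print(e_type, offset, content, matches_text(entity, text))
```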

The NerUtils file also creates an EntityType enum for the three most commonly identified entity types: Person, Organization and Location. One advantage of the enum in our code is that we'll never make the mistake of using a different string to refer to the same entity type, for example "PERSON" instead of "Person".
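The difference is easy to see in a few lines. In this sketch (the function parse_entity_type is a hypothetical helper, not part of NerUtils), a mismatched string comparison quietly evaluates to False, while a misspelled enum member fails loudly.

```python
from enum import Enum

EntityType = Enum("EntityType", ["Person", "Organization", "Location"])


def parse_entity_type(name):
    """Look up a member by exact name; raises KeyError for unknown names."""
    return EntityType[name]


if __name__ == "__main__":
    # With plain strings, a differently cased label is simply a different
    # value, and the comparison silently returns False.
    print("PERSON" == "Person")

    # With the enum, a misspelled member name fails immediately.
    try:
        EntityType.PERSON
    except AttributeError as err:
        print("caught:", err)

    print(parse_entity_type("Person"))
```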

SpacyEntityExtractor

Here's the workhorse of Spacy entity extraction.

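import spacy

from ner import NerUtils
from ner.spacy.SpacyEntityTypeMapper import SpacyEntityTypeMapper


class SpacyEntityExtractor(object):
    """Extract named entities from text with Spacy, grouped by entity type."""

    # One-time initialization of Spacy. The spaCy 2.x releases installed above
    # no longer provide spacy.en.English, so load the linked 'en' model instead.
    nlp = spacy.load('en')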

    def __init__(self):
        """"""
        self.reinitialize()


    def reinitialize(self):
        # Final output.
        self.entities = {}
        for entity in SpacyEntityTypeMapper.allEntityTypes:
            self.entities[entity] = []


    def readInput(self, text):
        self.reinitialize()

        self.text = text

        doc = SpacyEntityExtractor.nlp(text)

        self.extractEntityNames(doc)


    def extractEntityNames(self, doc):

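        for entity in doc.ents:
            # Append a (content, offset) pair to the list for this entity type.
            # start_char is the character offset where the entity begins;
            # root.idx would point at the head token (e.g. "Smith") instead.
            self.entities[entity.label_].append((entity.text, entity.start_char))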


    def getEntitiesDict(self, filter=False):
        """
        Return a dictionary of entities where the keys are the entity types.

        :param filter: if True, return only the targeted entity types
        :return:
        """
        if not filter:
            return self.entities

        result = {}
        for eType in SpacyEntityTypeMapper.targetedEntityTypes:
            result[eType] = self.entities[eType]

        return result


    def getNerEntities(self, filter=True):
        """
        Return a list of NerEntity objects extracted.
        :param filter: if True, return only the targeted entity types
        :return:
        """
        entityDict = self.getEntitiesDict(filter)

        result = []

        for key in sorted(entityDict.keys()):
            for (content, offset) in entityDict[key]:
                mappedKey = SpacyEntityTypeMapper.mapEntity(key)
                entity = NerUtils.NerEntity(mappedKey, offset, content)
                result.append(entity)


        return result


if __name__ == '__main__':
    extractor = SpacyEntityExtractor()
    extractor.readInput("John Smith wrote to Mary Jones.")

    # Print an intermediate result.
    print(str(extractor.entities))

    print(str(extractor.getNerEntities()))

I'm the author, so of course the code seems pretty clear and well-documented to me. One thing worth noting, however, is that nlp is declared as a class attribute of SpacyEntityExtractor rather than as an instance attribute. This means that your application incurs the cost of Spacy initialization only once, when the class is defined (in practice, when its module is first imported), and not once per object. After that, every instance shares the same Spacy resources for every document fed in.
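The load-once behavior can be demonstrated without Spacy at all. In the sketch below, expensive_load is a hypothetical stand-in for the Spacy initialization and Extractor is a stripped-down stand-in for SpacyEntityExtractor. Note that Python evaluates the class body when the class is defined, so the counter is already 1 before any instance exists.

```python
load_count = 0


def expensive_load():
    """Hypothetical stand-in for an expensive call such as loading a model."""
    global load_count
    load_count += 1
    return object()


class Extractor(object):
    # Evaluated once, while the class body runs (normally at import time).
    nlp = expensive_load()

    def __init__(self):
        self.entities = {}


# The load has already happened before any instance is created.
count_before_instances = load_count

first = Extractor()
second = Extractor()

if __name__ == "__main__":
    # Both instances share the single object created at class definition.
    print(count_before_instances, load_count, first.nlp is second.nlp)
```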

SpacyEntityTypeMapper

You see from the import statement in the SpacyEntityExtractor that it expects a SpacyEntityTypeMapper class, which is designed to let you control specifically which entity types you want to extract. This class also maps the different LOC and GPE (geo-political entity) types into a single Location type.

'''
Map SpaCy entity types to the standard Person/Organization/Location types.
Use static methods for default mapping. To extend with additional mappings,
allow instantiation of an object and provide a means of creating a separate
mapEntity() method for the object.
'''
from ner import NerUtils


class SpacyEntityTypeMapper(object):
    allEntityTypes = ['PERSON', 'NORP', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART',
                      'LAW', 'LANGUAGE', 'DATE', 'TIME', 'PERCENT', 'MONEY', 'QUANTITY', 'ORDINAL', 'CARDINAL']

    targetedEntityTypes = ['ORG', 'PERSON', 'LOC', 'GPE']


    personTypes = ['PERSON']

    organizationTypes = ['ORG']

    locationTypes = ['LOC', 'GPE']


    @staticmethod
    def mapEntity(eType):
        if eType in SpacyEntityTypeMapper.personTypes:
            return NerUtils.EntityType.Person
        elif eType in SpacyEntityTypeMapper.organizationTypes:
            return NerUtils.EntityType.Organization
        elif eType in SpacyEntityTypeMapper.locationTypes:
            return NerUtils.EntityType.Location
        elif eType in SpacyEntityTypeMapper.allEntityTypes:
            return eType
        else:
            raise ValueError("Input param not mappable. Input param: " + eType)

The getNerEntities( ) method is where SpacyEntityExtractor invokes the mapper's mapEntity( ) method, in the line reading
mappedKey = SpacyEntityTypeMapper.mapEntity(key).

In other words, to go from a GPE Spacy entity type to a standard Location entity type, your call looks like this: SpacyEntityTypeMapper.mapEntity('GPE').
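If you plan to extend the mapping, a lookup table may be easier to maintain than the if/elif chain. The following is a sketch of that alternative design, not the class shown above: the names ALL_LABELS, STANDARD_TYPES and map_entity are my own, and the enum is redefined here only so that the example is self-contained.

```python
from enum import Enum

EntityType = Enum("EntityType", ["Person", "Organization", "Location"])

# Labels copied from allEntityTypes in the class above.
ALL_LABELS = frozenset([
    "PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT",
    "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT", "MONEY",
    "QUANTITY", "ORDINAL", "CARDINAL",
])

# Hypothetical lookup table standing in for the if/elif chain.
STANDARD_TYPES = {
    "PERSON": EntityType.Person,
    "ORG": EntityType.Organization,
    "LOC": EntityType.Location,
    "GPE": EntityType.Location,
}


def map_entity(label):
    """Map a Spacy label to a standard type; pass other known labels through."""
    if label not in ALL_LABELS:
        raise ValueError("Input param not mappable. Input param: " + label)
    return STANDARD_TYPES.get(label, label)


if __name__ == "__main__":
    for label in ("GPE", "LOC", "PERSON", "DATE"):
        print(label, "->", map_entity(label))
```

Adding a new mapping then means adding one entry to the table rather than another branch.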

External Resources

Some useful links: