Some notes for troubleshooting the Spacy install.

The folks at spaCy have made their install a lot easier than it was in early 2017. At least on Ubuntu. Here are my old notes on various install issues from that time, in case they still exist on Mac or other versions of Linux.

Setting Spacy Up

So far I have installed Spacy four times: on Ubuntu v. 12, on CentOS 6.4, on Ubuntu 14, and on Mac 10.11.6. Not one install was problem-free, and not one install was the same as another.

  • The difference between the two Ubuntu installs might have been due to the version of Ubuntu; more likely it was because the first install was for a Python 2.7 project, while the second was for a Python 3.5 project. See the "One Ubuntu Install" sub-section below.
  • The Mac installation was the easiest, though even that gave me problems when it came to the language model download and load. See the "One Mac Install" sub-section below.

Even though the creators of Spacy have worked hard to make the installation simple, the problem seems to be that Spacy requires many different resources to run. But if you're committed to getting good-quality NER done, I think you'll find the installation worth the hour or so it could take you.

Create a Virtual Environment

Serious work in Python requires a virtual environment. See my "Using virtualenv" page for an easy way to manage this.

Install Spacy

Assuming you have a working virtual environment, activate it before you install Spacy.

$ source venv/bin/activate

(venv) $ 
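If you are unsure whether the activation took effect, one way to check is to ask the Python interpreter itself. The following is a minimal sketch, not something Spacy provides: the function name in_virtual_environment is made up for this illustration, and the check assumes an environment created either by the standard venv module (which sets sys.base_prefix) or by an older virtualenv release (which sets sys.real_prefix).

```python
import sys


def in_virtual_environment():
    """Return True if this interpreter appears to run inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; the venv module sets
    # sys.base_prefix. Outside any environment both equal sys.prefix or are absent.
    base = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base != sys.prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtual_environment())
```

Run it with the same python command you plan to use for the install; if it prints False, pip will install Spacy somewhere other than your project's environment.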

At this point your best bet is to begin with Spacy's own Getting Started page. With any luck, you'll have no problems and be able to skip down to the Sample Code section of this page. Read on if you do have issues.

In my experience, the algorithm for a Spacy install goes like this:

- try Spacy install
- if it doesn't work:
    notWorking = True

while notWorking:
    - read error message
    - google error message
    - follow advice
    - if it works:
        notWorking = False

As mentioned above, it generally seems to take about an hour to get out of the while loop. Not a big deal, but worth mentioning.

It's possible that your error messages will be the same as the ones in the two installations described below—one on Ubuntu, the other on a Mac. In that case, follow my steps. Otherwise you'll have to look elsewhere on the Web for a solution to your particular problem.

One Ubuntu Install

Here are some details on what happened the last time I installed Spacy on Ubuntu, which happened to be v. 14. The hope is that, if your own installation has problems, some of the solutions I discovered may speed up your resolution of them.

I began by trying the install command.

(venv) $ pip install -U spacy
[...]
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

I searched the Internet and came up with one solution, which didn't work. A second solution also didn't work. By "didn't work," I mean that the solution, whatever it was, didn't let me successfully run the command to install Spacy. A third solution, however, did work.

Outside of the virtual environment, I installed this set of Python 3.x development tools.

$ sudo apt-get install python3-dev

Then, back in the virtual environment for the project:

(venv) $ pip install -U spacy
[... First error:]
File "/home/jk/workspaceWebsite/PythonCode/venv/lib/python3.5/site-packages/wheel/bdist_wheel.py", line 161, in get_archive_basename
  impl_tag, abi_tag, plat_tag = self.get_tag()
File "/home/jk/workspaceWebsite/PythonCode/venv/lib/python3.5/site-packages/wheel/bdist_wheel.py", line 155, in get_tag
  assert tag == supported_tags[0]
AssertionError
[...]
----------------------------------------
Failed building wheel for spacy
Running setup.py clean for spacy

In other words, the python3-dev installation got me past the first Spacy install error, but the installation still failed because of a second, unrelated problem. Upgrading the "wheel" package with pip fixed this second problem:

(venv)$ wheel version
wheel 0.24.0

(venv)$ pip install wheel --upgrade
Collecting wheel
[...]
Successfully installed wheel-0.29.0

(venv)$ pip install -U spacy
Collecting spacy
[...]
Installing collected packages: thinc, cloudpickle, pathlib, semver, sputnik, ujson, spacy
Successfully installed cloudpickle-0.2.2 pathlib-1.0.1 semver-2.7.6 spacy-1.6.0 sputnik-0.9.3 thinc-6.2.0 ujson-1.35

Successfully installed! Alas, when I tried to run the code, I got a runtime error:

Model 'en>=1.1.0,<1.2.0' not installed. 
Please run 'python -m spacy.en.download' to install latest compatible model.

I needed the English language models, which I downloaded with the command the message instructed me to use—python -m spacy.en.download.

One Mac Install

Here are some details on what happened when I installed Spacy on a Mac (version 10.11.6, El Capitan).

In my first attempt, the installation itself was problem-free, but I had a runtime error when I tried to run the Spacy tools, resulting in this message:

Model 'en>=1.1.0,<1.2.0' not installed. Please run 'python -m spacy.en.download' to install latest compatible model.

This was a wacky claim on Spacy's part, because I could clearly see the English language model in the virtual environment. So rather than try to work that out, I just deleted everything and started from scratch. From this point on, the installation proceeded swimmingly.

Here are the steps I took. For the installation, I simply followed Spacy's Getting Started page. The important thing is to remember to first activate your virtual environment.

(venv) $ pip install -U spacy
Collecting spacy
[...]
Installing collected packages: murmurhash, cymem, preshed, wrapt, tqdm, toolz, cytoolz, plac, dill, pathlib, thinc, ujson, requests, spacy
Successfully installed cymem-1.31.2 cytoolz-0.8.2 dill-0.2.6 murmurhash-0.26.4 pathlib-1.0.1 plac-0.9.6 preshed-1.0.0 requests-2.13.0 spacy-1.7.0 thinc-6.5.0 toolz-0.8.2 tqdm-4.11.2 ujson-1.35 wrapt-1.10.10

Next I had to install the English language model.

(venv) $ python -m spacy download en

    Downloading en_core_web_sm-1.2.0/en_core_web_sm-1.2.0.tar.gz
    [...]
Successfully installed en-core-web-sm-1.2.0

Linking successful

/Users/jk/workspace/PythonCode/venv/lib/python3.4/site-packages/en_core_web_sm/en_core_web_sm-1.2.0
-->
/Users/jk/workspace/PythonCode/venv/lib/python3.4/site-packages/spacy/data/en

You can now load the model via spacy.load('en').

(venv) $ 
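A quick smoke test can confirm that the model really loads before you run any larger program. This is only a sketch under assumptions: try_load_model is a hypothetical helper name, 'en' is the model name taken from the output above (other Spacy versions may expect a different name), and the exception handling is deliberately broad because the sketch does not assume which exception Spacy raises for a missing model.

```python
def try_load_model(name="en"):
    """Return a loaded Spacy pipeline, or None if Spacy or the model is unavailable."""
    try:
        import spacy  # imported here so a missing package is reported, not fatal
        return spacy.load(name)
    except Exception as error:  # broad on purpose; see the note above
        print("Could not load model %r: %s" % (name, error))
        return None


if __name__ == "__main__":
    nlp = try_load_model("en")
    print("model loaded" if nlp is not None else "model not available")
```

If this prints "model not available" right after a successful download, that points to the same kind of mismatch described in the first Mac attempt above.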

Spacy's Language Models

In the course of working with Spacy, you may want to find where your language models are installed. The way to do this is a little tricky: the command runs Python, imports Spacy, and builds the path to the data from the location of the imported spacy module.

The command to use goes like this: python -c "import spacy; import os; print(os.path.join(os.path.dirname(spacy.__file__), 'en', 'data'))"

The next listing shows where this command found the language models to be located on an Ubuntu install for Python 2.7, and then in a virtual environment for a Python 3.5 project, also on Ubuntu.

$ python -c "import spacy; import os; print(os.path.join(os.path.dirname(spacy.__file__), 'en', 'data'))"
/usr/local/lib/python2.7/dist-packages/spacy/en/data

$ source venv/bin/activate

(venv)$ python -c "import spacy; import os; print(os.path.join(os.path.dirname(spacy.__file__), 'en', 'data'))"
/home/jk/workspaceWebsite/PythonCode/venv/lib/python3.5/site-packages/spacy/en/data
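The same lookup can be written as a small script that also checks whether the directory exists, since the one-liner only builds a path and prints it. This is a sketch under assumptions: the function names are hypothetical, and the spacy/<lang>/data layout is the one shown in the listing above for these installs; other Spacy versions may store models elsewhere.

```python
import importlib.util
import os


def model_data_dir(package_file, lang="en"):
    """Build <package dir>/<lang>/data, mirroring the one-liner above."""
    return os.path.join(os.path.dirname(package_file), lang, "data")


def find_spacy_model_dir(lang="en"):
    """Return the model directory if it exists, else None."""
    # find_spec locates the package without importing it.
    spec = importlib.util.find_spec("spacy")
    if spec is None or spec.origin is None:
        return None
    path = model_data_dir(spec.origin, lang)
    return path if os.path.isdir(path) else None


if __name__ == "__main__":
    print(find_spacy_model_dir() or "no model directory found in the assumed layout")
```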

Sample Code

Here's some code to get you started with exploring named entity recognition with Spacy.

NerUtils

We'll start with a utility class offering functionality frequently used in named entity recognition in general.

from collections import namedtuple
from enum import Enum

NerEntity = namedtuple("NerEntity", "eType, offset, content")

# Person, Organization, Location
EntityType = Enum('EntityType', ['Person', 'Organization', 'Location'])

NerEntity creates a Python namedtuple having the three properties most named entity recognizers assign to an entity: an entity type, an offset, and the string from the text that refers to the named entity (the "content").

The NerUtils file also creates an EntityType enum of the three most commonly identified entity types: Person, Organization and Location. One advantage of the enum in our code is that we'll never make the mistake of using a different string to refer to the same entity type, for example "PERSON" instead of "Person".
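Here is a short usage sketch of the two definitions, repeated so the example runs on its own. The sample entity ("John Smith" at offset 0) is made up for illustration.

```python
from collections import namedtuple
from enum import Enum

NerEntity = namedtuple("NerEntity", "eType, offset, content")
EntityType = Enum('EntityType', ['Person', 'Organization', 'Location'])

# Build an entity and read its fields by name rather than by position.
entity = NerEntity(EntityType.Person, 0, "John Smith")
print(entity.eType.name, entity.offset, entity.content)

# Looking up a member by name only succeeds for the exact spelling,
# so a stray "PERSON" is caught immediately instead of silently
# creating a second label for the same entity type.
try:
    EntityType['PERSON']
except KeyError:
    print("'PERSON' is not a member of EntityType")
```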

SpacyEntityExtractor

Here's the workhorse of Spacy entity extraction.

from spacy.en import English

from ner import NerUtils
from ner.spacy.SpacyEntityTypeMapper import SpacyEntityTypeMapper


class SpacyEntityExtractor(object):
    """Extract named entities from text with Spacy, grouped by entity type."""

    # One-time initialization of Spacy.
    nlp = English()

    def __init__(self):
        """Start with an empty entity list for every Spacy entity type."""
        self.reinitialize()


    def reinitialize(self):
        # Final output.
        self.entities = {}
        for entity in SpacyEntityTypeMapper.allEntityTypes:
            self.entities[entity] = []


    def readInput(self, text):
        self.reinitialize()

        self.text = text

        doc = SpacyEntityExtractor.nlp(text)

        self.extractEntityNames(doc)


    def extractEntityNames(self, doc):

        for entity in doc.ents:
            # Append a (content, offset) pair to the list for this entity type.
            # The offset is the character index of the entity's root token.
            self.entities[entity.label_].append((str(entity), entity.root.idx))


    def getEntitiesDict(self, filter=False):
        """
        Return a dictionary of entities where the keys are the entity types.

        :param filter: if True, return only the targeted entity types
        :return:
        """
        if not filter:
            return self.entities

        result = {}
        for eType in SpacyEntityTypeMapper.targetedEntityTypes:
            result[eType] = self.entities[eType]

        return result


    def getNerEntities(self, filter=True):
        """
        Return a list of NerEntity objects extracted.
        :param filter: if True, return only the targeted entity types
        :return:
        """
        entityDict = self.getEntitiesDict(filter)

        result = []

        for key in sorted(entityDict.keys()):
            for (content, offset) in entityDict[key]:
                mappedKey = SpacyEntityTypeMapper.mapEntity(key)
                entity = NerUtils.NerEntity(mappedKey, offset, content)
                result.append(entity)


        return result


if __name__ == '__main__':
    extractor = SpacyEntityExtractor()
    extractor.readInput("John Smith wrote to Mary Jones.")

    # Print an intermediate result.
    print(str(extractor.entities))

    print(str(extractor.getNerEntities()))

I'm the author, so of course the code seems pretty clear and well-documented to me. One thing worth noting, however, is that nlp is declared as a class attribute of SpacyEntityExtractor rather than being created in __init__. Python evaluates the class body once, when the module defining the class is first imported, so your application incurs the cost of Spacy initialization only once, no matter how many extractor objects it creates. After that, the same Spacy resources are used for every document fed in.
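The sharing behavior of a class-level attribute is easy to verify without Spacy. In this sketch FakePipeline is a hypothetical stand-in for an expensive resource such as English(), and Extractor is a stripped-down stand-in for SpacyEntityExtractor; neither is part of the real code.

```python
class FakePipeline(object):
    """Hypothetical stand-in for an expensive resource; counts its own creations."""
    load_count = 0

    def __init__(self):
        FakePipeline.load_count += 1


class Extractor(object):
    # Evaluated once, while Python executes the class body.
    nlp = FakePipeline()

    def __init__(self):
        self.entities = {}


first = Extractor()
second = Extractor()

print(FakePipeline.load_count)   # 1: creating two extractors did not reload anything
print(first.nlp is second.nlp)   # True: both objects share the one pipeline
```

One consequence of this design is that the cost is paid as soon as the class is defined, even if no extractor object is ever created.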

SpacyEntityTypeMapper

You see from the import statement in the SpacyEntityExtractor that it expects a SpacyEntityTypeMapper class, which is designed to let you control specifically which entity types you want to extract. This class also maps the different LOC and GPE (geo-political entity) types into a single Location type.

'''
Map SpaCy entity types to the standard Person/Organization/Location types.
Use static methods for default mapping. To extend with additional mappings,
allow instantiation of an object and provide a means of creating a separate
mapEntity() method for the object.
'''
from ner import NerUtils


class SpacyEntityTypeMapper(object):
    allEntityTypes = ['PERSON', 'NORP', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT', 'WORK_OF_ART',
                      'LAW', 'LANGUAGE', 'DATE', 'TIME', 'PERCENT', 'MONEY', 'QUANTITY', 'ORDINAL', 'CARDINAL']

    targetedEntityTypes = ['ORG', 'PERSON', 'LOC', 'GPE']


    personTypes = ['PERSON']

    organizationTypes = ['ORG']

    locationTypes = ['LOC', 'GPE']


    @staticmethod
    def mapEntity(eType):
        if eType in SpacyEntityTypeMapper.personTypes:
            return NerUtils.EntityType.Person
        elif eType in SpacyEntityTypeMapper.organizationTypes:
            return NerUtils.EntityType.Organization
        elif eType in SpacyEntityTypeMapper.locationTypes:
            return NerUtils.EntityType.Location
        elif eType in SpacyEntityTypeMapper.allEntityTypes:
            return eType
        else:
            raise ValueError("Input param not mappable. Input param: " + eType)

The getNerEntities() method is where SpacyEntityExtractor invokes the mapper's mapEntity() method, in the line reading mappedKey = SpacyEntityTypeMapper.mapEntity(key).

In other words, to go from a GPE Spacy entity type to a standard Location entity type, your call looks like this: SpacyEntityTypeMapper.mapEntity('GPE').
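To try the mapping on its own, here is a self-contained sketch of the same logic written as a dictionary lookup. It is an alternative formulation, not the class above: the names ALL_ENTITY_TYPES, STANDARD_TYPES and map_entity are hypothetical, and EntityType is repeated from NerUtils so the example runs by itself.

```python
from enum import Enum

EntityType = Enum('EntityType', ['Person', 'Organization', 'Location'])

ALL_ENTITY_TYPES = ['PERSON', 'NORP', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT',
                    'EVENT', 'WORK_OF_ART', 'LAW', 'LANGUAGE', 'DATE', 'TIME',
                    'PERCENT', 'MONEY', 'QUANTITY', 'ORDINAL', 'CARDINAL']

# Spacy labels that collapse into one of the three standard types.
STANDARD_TYPES = {
    'PERSON': EntityType.Person,
    'ORG': EntityType.Organization,
    'LOC': EntityType.Location,
    'GPE': EntityType.Location,
}


def map_entity(e_type):
    """Map a Spacy label to a standard type, pass other known labels through."""
    if e_type in STANDARD_TYPES:
        return STANDARD_TYPES[e_type]
    if e_type in ALL_ENTITY_TYPES:
        return e_type
    raise ValueError("Input param not mappable. Input param: " + str(e_type))


print(map_entity('GPE'))    # EntityType.Location
print(map_entity('DATE'))   # DATE
```

Adding another label to a standard type then means adding one dictionary entry rather than editing an if/elif chain.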

External Resources

Some useful links: