Matthew Honnibal's spaCy bills itself as an "industrial-strength" tool for natural language processing.
Thus far I have used spaCy (henceforth "Spacy") only for named entity recognition (NER), and for this purpose Spacy definitely keeps its promise. This page will help you get the tool up and running and give you some basic code for extracting entities from your documents.
Setting Spacy Up
The folks at Spacy have worked hard on making their tools easier to install, so much so that I think you'll need very little help at all, at least on Ubuntu. If you have problems, see my old notes, and good luck.
Create a Virtual Environment
Serious work in Python requires a virtual environment. See my Using virtualenv for an easy way to manage this.
Install Spacy
Assuming you have a working virtual environment, activate it before you install Spacy.
$ source venv/bin/activate
(venv) $
Go to https://spacy.io/usage/ and follow the directions. Here are the steps I took on both Mac and Ubuntu.
(venv) $ pip install -U spacy
(venv) $ python -m spacy download en
Troubleshooting
Validation
Run the validate command to check for issues.
The following example shows a successful validation. Note the checkmarks on the right.
(venv) $ python -m spacy validate

    Installed models (spaCy v2.0.18)
    /Users/jkurlandski/workspace/DeepLearning/venv/lib/python3.7/site-packages/spacy

    TYPE      NAME             MODEL            VERSION
    package   en-core-web-sm   en_core_web_sm   2.0.0   ✔
    link      en_core_web_sm   en_core_web_sm   2.0.0   ✔
    link      en               en_core_web_sm   2.0.0   ✔
This next example shows an issue discovered by the validate command.
(venv) $ python -m spacy validate

    Installed models (spaCy v2.0.18)
    /Users/jkurlandski/workspace/DeepLearning/venv/lib/python3.7/site-packages/spacy

    TYPE      NAME             MODEL            VERSION
    package   en-core-web-sm   en_core_web_sm   2.1.0   --> 2.0.0
    link      en               en_core_web_sm   2.1.0   --> 2.0.0

    Use the following commands to update the model packages:
    python -m spacy download en_core_web_sm

    You may also want to overwrite the incompatible links using the
    `python -m spacy link` command with `--force`, or remove them from the
    data directory.
    Data path: /Users/jkurlandski/workspace/DeepLearning/venv/lib/python3.7/site-packages/spacy/data
The output is telling us that we have spaCy version 2.0.18 installed, and it should have en-core-web-sm v. 2.0.0—but it actually has v. 2.1.0. The output says we can fix this with the following:
(venv) $ python -m spacy download en_core_web_sm
Collecting en_core_web_sm==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm==2.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)
     |████████████████████████████████| 37.4MB 944kB/s
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.0.0-cp37-none-any.whl size=37405977 sha256=b5e28721ea14ee32ff078193066dfbf0060782cfb6b6c1bfc09da0e49e03fe60
  Stored in directory: /private/var/folders/c5/t56f2m1s6rq90qlsfk6wnjl80000gn/T/pip-ephem-wheel-cache-2yx9unop/wheels/54/7c/d8/f86364af8fbba7258e14adae115f18dd2c91552406edc3fdaa
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
  Found existing installation: en-core-web-sm 2.1.0
    Uninstalling en-core-web-sm-2.1.0:
      Successfully uninstalled en-core-web-sm-2.1.0
Successfully installed en-core-web-sm-2.0.0

    Linking successful
    /Users/jkurlandski/workspace/DeepLearning/venv/lib/python3.7/site-packages/en_core_web_sm
    -->
    /Users/jkurlandski/workspace/DeepLearning/venv/lib/python3.7/site-packages/spacy/data/en_core_web_sm

    You can now load the model via spacy.load('en_core_web_sm')
Language Model Link Issue
An error message such as the following indicates a language model link issue:
~/workspace/DeepLearning/NLP/Spacy/SpacyUtils.py in <module>
     12 Load the spaCy English language model one time for the entire application.
     13 """
---> 14 spacyEnglishModel = spacy.load('en')
     15 lightSpacyEnglishModel = spacy.load("en", disable=["tagger", "parser", "ner", "textcat"])
     16

~/workspace/DeepLearning/venv/lib/python3.7/site-packages/spacy/__init__.py in load(name, **overrides)
     19     if depr_path not in (True, False, None):
     20         deprecation_warning(Warnings.W001.format(path=depr_path))
---> 21     return util.load_model(name, **overrides)
     22
     23

~/workspace/DeepLearning/venv/lib/python3.7/site-packages/spacy/util.py in load_model(name, **overrides)
    117     elif hasattr(name, 'exists'):  # Path or Path-like to model data
    118         return load_model_from_path(name, **overrides)
--> 119     raise IOError(Errors.E050.format(name=name))
    120
    121

OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
You can fix it with the link command.
(venv) $ python -m spacy link en_core_web_sm en

    Linking successful
    /Users/jkurlandski/workspace/DeepLearning/venv/lib/python3.7/site-packages/en_core_web_sm
    -->
    /Users/jkurlandski/workspace/DeepLearning/venv/lib/python3.7/site-packages/spacy/data/en

    You can now load the model via spacy.load('en')
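If you would rather have your code cope with a missing link than crash, one option is to try several model names in turn. The following is only a sketch of that idea: load_first_available and fake_loader are hypothetical names of my own, and the loader is passed in as a parameter so the example runs even without Spacy installed. It relies only on the fact, visible in the traceback above, that a missing model surfaces as an OSError.

```python
def load_first_available(names, loader):
    """Return (name, model) for the first name that loader can load.

    `loader` is any callable that raises OSError when a model is missing,
    which is how the E050 error in the traceback above is reported.
    """
    for name in names:
        try:
            return name, loader(name)
        except OSError:
            continue
    raise OSError("None of these models could be loaded: " + ", ".join(names))


# With Spacy installed, usage might look like this (not run here):
#   import spacy
#   name, nlp = load_first_available(['en', 'en_core_web_sm'], spacy.load)


# Hypothetical fake loader so the sketch runs without Spacy.
def fake_loader(name):
    if name == 'en':
        raise OSError("[E050] Can't find model 'en'.")
    return "model:" + name


name, model = load_first_available(['en', 'en_core_web_sm'], fake_loader)
print(name, model)
```

This does not repair the link; it only lets the application fall back to the package name while you fix the installation.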
Sample Code
Here's some code to get you started with exploring named entity recognition with Spacy.
NerUtils
We'll start with a utility class offering functionality frequently used in named entity recognition in general.
from collections import namedtuple
from enum import Enum

NerEntity = namedtuple("NerEntity", "eType, offset, content")

# Person, Organization, Location
EntityType = Enum('EntityType', ['Person', 'Organization', 'Location'])
NerEntity creates a Python namedtuple having the three properties most named entity recognizers assign to an entity: an entity type, an offset, and the string from the text that refers to the named entity (the "content").
The NerUtils file also creates an EntityType enum on the three most-commonly identified entity types—Person, Organization and Location. One advantage of the enum in our code is that we'll never make the mistake of using a different string to refer to the same entity type, for example "PERSON" instead of "Person".
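Here is a small sketch of how these two definitions might be used together. The entities are built by hand for illustration (a real recognizer would supply them), and the variable names are my own.

```python
from collections import namedtuple
from enum import Enum

# Same definitions as in NerUtils.
NerEntity = namedtuple("NerEntity", "eType, offset, content")
EntityType = Enum('EntityType', ['Person', 'Organization', 'Location'])

text = "John Smith wrote to Mary Jones."

# Hand-built entities for illustration only.
john = NerEntity(EntityType.Person, text.index("John Smith"), "John Smith")
mary = NerEntity(EntityType.Person, text.index("Mary Jones"), "Mary Jones")

# Fields are available by name, and the offset points back into the text.
recovered = text[john.offset:john.offset + len(john.content)]

# An enum member is never equal to a plain string, so a stray "PERSON"
# cannot be mistaken for EntityType.Person.
is_person = john.eType is EntityType.Person
matches_string = john.eType == "PERSON"

print(john)
print(recovered, is_person, matches_string)
```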
SpacyEntityExtractor
Here's the workhorse of Spacy entity extraction.
import spacy

from ner import NerUtils
from ner.spacy.SpacyEntityTypeMapper import SpacyEntityTypeMapper


class SpacyEntityExtractor(object):
    """Extract named entities from text using Spacy."""

    # One-time initialization of Spacy. The spacy.en module was removed in
    # spaCy v2; spacy.load('en') loads the English model, including NER,
    # through the 'en' shortcut link created during setup.
    nlp = spacy.load('en')

    def __init__(self):
        """Start with an empty result list for every entity type."""
        self.reinitialize()

    def reinitialize(self):
        # Final output.
        self.entities = {}
        for entity in SpacyEntityTypeMapper.allEntityTypes:
            self.entities[entity] = []

    def readInput(self, text):
        self.reinitialize()
        self.text = text
        doc = SpacyEntityExtractor.nlp(text)
        self.extractEntityNames(doc)

    def extractEntityNames(self, doc):
        for entity in doc.ents:
            # Append to the entity list a (content, offset) pair. start_char is
            # the character offset where the entity begins; root.idx would be
            # the offset of the span's head token instead.
            self.entities[entity.label_].append((entity.text, entity.start_char))

    def getEntitiesDict(self, filter=False):
        """
        Return a dictionary of entities where the keys are the entity types.
        :param filter: if True, return only the targeted entity types
        :return: dict mapping each entity type to a list of (content, offset) pairs
        """
        if not filter:
            return self.entities

        result = {}
        for eType in SpacyEntityTypeMapper.targetedEntityTypes:
            result[eType] = self.entities[eType]
        return result

    def getNerEntities(self, filter=True):
        """
        Return a list of NerEntity objects extracted.
        :param filter: if True, return only the targeted entity types
        :return: list of NerUtils.NerEntity objects
        """
        entityDict = self.getEntitiesDict(filter)
        result = []
        for key in sorted(entityDict.keys()):
            for (content, offset) in entityDict[key]:
                mappedKey = SpacyEntityTypeMapper.mapEntity(key)
                entity = NerUtils.NerEntity(mappedKey, offset, content)
                result.append(entity)
        return result


if __name__ == '__main__':
    extractor = SpacyEntityExtractor()
    extractor.readInput("John Smith wrote to Mary Jones.")
    # Print an intermediate result.
    print(str(extractor.entities))
    print(str(extractor.getNerEntities()))
I'm the author, so of course the code seems pretty clear and well-documented to me. One thing worth noting, however, is that nlp is declared as a class-level attribute of SpacyEntityExtractor. This means that your application will incur the cost of Spacy initialization only once, when the class definition is first executed (that is, when the module is imported), rather than each time an object is instantiated. After that, every instance shares the same Spacy resources for every document fed in.
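The class-level attribute pattern can be seen in isolation in the following sketch. ExpensiveResource is a hypothetical stand-in for the Spacy model so that the example runs without Spacy installed; it simply counts how many times it gets built.

```python
class ExpensiveResource:
    """Hypothetical stand-in for a slow-to-load language model."""
    load_count = 0

    def __init__(self):
        ExpensiveResource.load_count += 1


class Extractor:
    # Class-level attribute: evaluated once, when the class body runs.
    resource = ExpensiveResource()

    def __init__(self):
        # Nothing is loaded here; every instance reuses Extractor.resource.
        self.entities = {}


first = Extractor()
second = Extractor()

# Both instances see the very same resource object.
shared = first.resource is second.resource
print(ExpensiveResource.load_count, shared)
```

The trade-off is that the load happens at import time, so merely importing the module is slow even if no extractor is ever created.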
SpacyEntityTypeMapper
You see from the import statement in the SpacyEntityExtractor that it expects a SpacyEntityTypeMapper class, which is designed to let you control specifically which entity types you want to extract. This class also maps the different LOC and GPE (geo-political entity) types into a single Location type.
'''
Map SpaCy entity types to the standard Person/Organization/Location types.

Use static methods for default mapping. To extend with additional mappings,
allow instantiation of an object and provide a means of creating a separate
mapEntity() method for the object.
'''

from ner import NerUtils


class SpacyEntityTypeMapper(object):

    allEntityTypes = ['PERSON', 'NORP', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT',
                      'EVENT', 'WORK_OF_ART', 'LAW', 'LANGUAGE', 'DATE', 'TIME',
                      'PERCENT', 'MONEY', 'QUANTITY', 'ORDINAL', 'CARDINAL']

    targetedEntityTypes = ['ORG', 'PERSON', 'LOC', 'GPE']

    personTypes = ['PERSON']
    organizationTypes = ['ORG']
    locationTypes = ['LOC', 'GPE']

    @staticmethod
    def mapEntity(eType):
        if eType in SpacyEntityTypeMapper.personTypes:
            return NerUtils.EntityType.Person
        elif eType in SpacyEntityTypeMapper.organizationTypes:
            return NerUtils.EntityType.Organization
        elif eType in SpacyEntityTypeMapper.locationTypes:
            return NerUtils.EntityType.Location
        elif eType in SpacyEntityTypeMapper.allEntityTypes:
            return eType
        else:
            raise ValueError("Input param not mappable. Input param: " + eType)
SpacyEntityExtractor invokes the mapper's mapEntity() method from its getNerEntities() method, in this line:
mappedKey = SpacyEntityTypeMapper.mapEntity(key)
In other words, to go from a GPE Spacy entity type to a standard Location entity type, your call looks like this:
SpacyEntityTypeMapper.mapEntity('GPE')
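To see the LOC/GPE merge without importing the ner package, here is a standalone sketch that swaps the if/elif chain for a dictionary lookup. SPACY_TO_STANDARD and map_label are names I made up for this illustration, and unlike mapEntity() this version passes unknown labels through instead of raising a ValueError.

```python
from enum import Enum

EntityType = Enum('EntityType', ['Person', 'Organization', 'Location'])

# Dictionary version of the mapping; labels not listed pass through unchanged.
SPACY_TO_STANDARD = {
    'PERSON': EntityType.Person,
    'ORG': EntityType.Organization,
    'LOC': EntityType.Location,
    'GPE': EntityType.Location,
}


def map_label(label):
    """Map a Spacy label to a standard EntityType, or return it as-is."""
    return SPACY_TO_STANDARD.get(label, label)


# GPE and LOC both collapse into Location; DATE is left alone.
mapped = [map_label(label) for label in ['GPE', 'LOC', 'PERSON', 'DATE']]
print(mapped)
```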
External Resources
Some useful links:
- A blog post entitled "How to create custom NER model in Spacy"
- Spacy's own Entity Recognition page
- The Spacy Home Page
- The Spacy Lightning Tour