Two Jupyter notebooks to serve as a starting point for all your preprocessing needs.
In a Nutshell
In my jkurlandski01/preprocessing project on GitHub I have two notebooks which provide solutions for two different kinds of preprocessing tasks. Use SimplePreprocessor for standard tokenization and lemmatization. Use EnhancedPreprocessor if you're looking for bells and whistles. If you need something really non-standard, EnhancedPreprocessor is a great starting point—follow the implementation of one of its features in order to add your own functionality.
Details
Not long ago I took a deep dive into NLP preprocessing—in other words, tokenization and lemmatization. I wanted a single Python class that could serve as a one-stop shop for all the different kinds of preprocessing I have needed in the past.
This meant, first of all, that the tokenizer had to handle the following cases, which often trip up tokenizers (a rough sketch of some of these cases follows the list):
- runtime: first and foremost, the preprocessor's runtime had to be comparable to that of other preprocessors.
- abbreviations: periods in abbreviations such as "U.S.A." should not cause the tokenizer to split them apart.
- digit strings and numbers: the tokenizer should not split certain common digit strings, such as numbers like "96.8" or times of day like "8:00."
- punctuation: punctuation had to be retained rather than discarded, because for some tasks it can be important.
- stop words: the preprocessor had to come with a stop word list whose entries match the tokenizer's output.
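Here is a minimal, self-contained sketch of the kind of tokenization rules described above, using a plain regular expression rather than the notebooks' actual code; the pattern and its ordering are illustrative assumptions only.

```python
import re

# Illustrative only; not the implementation in SimplePreprocessor or
# EnhancedPreprocessor. The alternation order matters: times of day must be
# tried before plain decimals, or "8:00" would break into "8", ":", "00".
TOKEN_PATTERN = re.compile(
    r"""(?:[A-Za-z]\.){2,}      # abbreviations such as U.S.A.
      | \d{1,2}:\d{2}           # times of day such as 8:00
      | \d+(?:\.\d+)?           # integers and decimals such as 96.8
      | \w+                     # ordinary words
      | [^\w\s]                 # punctuation, kept as its own token
    """,
    re.VERBOSE,
)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("The U.S.A. recorded 96.8 degrees at 8:00."))
# ['The', 'U.S.A.', 'recorded', '96.8', 'degrees', 'at', '8:00', '.']
```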
But the need for a class to serve as a one-stop preprocessing shop also meant the class had to offer additional functionality, all of which had to be optional:
- contractions and the possessive apostrophe 's': in English, the apostrophe conveys significant information. Depending on the letters it attaches to, the apostrophe can mark a verb (is, has, had and so on); it can indicate a negative (not); and it can mark possession (as in John's). Because of its importance, sometimes you don't want a tokenizer that splits the apostrophe from the letters around it.
- optional handling of stop words, punctuation and digit strings: Some users may want punctuation as part of tokenized output; some users may want it removed; and for others it may be very useful to replace all punctuation with the same dummy token. Ditto for stop words and digit strings.
- extensibility for other needs: Suppose that, in a particular corpus of documents, IP addresses play an important role. EnhancedPreprocessor does not offer an option to keep an IP address intact as a single token, but it does offer exactly that option for telephone numbers. It would be easy for a programmer to see how telephone numbers are handled and mimic that behavior for IP addresses, as sketched below.
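Here is one way that extension might look. This is a hypothetical sketch, not EnhancedPreprocessor's actual code; the regular expressions and the placeholder names PHONE_TOKEN and IP_TOKEN are assumptions made for illustration.

```python
import re

# Hypothetical sketch: protect special strings before tokenization by
# replacing them with a single placeholder token. The telephone-number rule
# stands in for the feature EnhancedPreprocessor already offers; the
# IP-address rule shows how a programmer might mimic it.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")     # e.g. 800-555-1212
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")  # e.g. 192.168.0.1

def protect_special_strings(text):
    text = PHONE_RE.sub("PHONE_TOKEN", text)
    text = IP_RE.sub("IP_TOKEN", text)  # the new behavior, added by analogy
    return text

print(protect_special_strings("Call 800-555-1212 from 192.168.0.1"))
# Call PHONE_TOKEN from IP_TOKEN
```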
Lemmatization
If the two preprocessors have a weakness, it is lemmatization. Both use NLTK's WordNetLemmatizer, which does an acceptable job on most nouns but a very poor job on verbs. spaCy offers much better lemmatization, but at a significant runtime cost. The SimplePreprocessor notebook includes a brief discussion of how useful lemmatization really is.
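To see where WordNetLemmatizer falls short, note that it assumes every word is a noun unless you pass an explicit part-of-speech tag, so verbs often come through unchanged:

```python
from nltk.stem import WordNetLemmatizer

# Requires the WordNet corpus: nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("geese"))             # 'goose'   (nouns work well)
print(lemmatizer.lemmatize("running"))           # 'running' (treated as a noun by default)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'     (only with an explicit POS tag)
print(lemmatizer.lemmatize("was", pos="v"))      # 'be'
```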