NLP and data science news and resources.
Quick Summaries
"Your AI skills are worth less than you think"
by Ric Szopa
Thanks to open-source libraries and the "open culture of the AI community," machine learning skills are becoming less valuable as time goes by, not more. Freely available libraries and shared knowledge mean that a developer's background knowledge doesn't need to be as deep as it had to be just a couple of years ago.
Taking the best performing architecture currently described in the literature and retraining it on your own data is a battle tested strategy if your goal is to solve a problem (as opposed to making an original contribution to science). If there’s nothing really good available right now, it’s often a matter of waiting a quarter or two until someone comes up with a solution. Especially that you can do things like host a Kaggle competition to incentivize researchers to look into your particular problem.
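The "take the best published architecture and retrain it on your own data" strategy is essentially transfer learning, and the freely available libraries Szopa refers to reduce it to a few lines of code. Here is a minimal sketch using Keras; the image-classification task, directory layout, and class count are illustrative assumptions on my part, not details from the article:

```python
# A minimal transfer-learning sketch in Keras (tf.keras).
# Assumptions (mine, not the article's): an image-classification task,
# training images under "data/train/" with one subdirectory per class,
# and 10 target classes.
import tensorflow as tf

# Start from a published, pretrained architecture (ResNet50 on ImageNet)
# and drop its classification head.
base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False,
    pooling="avg", input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained weights at first

# Add a new head for our own classes and train only that part.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Our own data; proper ResNet input preprocessing is omitted for brevity.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(224, 224), batch_size=32)
model.fit(train_ds, epochs=5)
```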
Szopa goes on to explain where he feels the lasting competitive advantage in the field will eventually come from:
So, how would you go about building a maintainable competitive advantage for an AI product? Some time ago I had the pleasure of talking to Antonio Criminisi from Microsoft Research. His idea is that the project’s secret sauce shouldn’t consist only of AI. For example, his InnerEye project uses AI and classical (not ML based) computer vision to analyze radiological images. To some extent, this may be at odds with why you are doing an AI startup in the first place. The ability to just throw data at a model and see it work is incredibly attractive. However, a traditional software component, the sort of which requires programmers to think about algorithms and utilize some hard to gain domain knowledge, is much more difficult to reproduce.
In other words, from a long-term perspective, this new approach to machine learning—just "throwing data at a problem" and getting good results—may be a passing fad. To excel at what you're doing, you'll need domain expertise. Which is pretty much where we were before the deep learning craze.
"Will NLP Change Search as We Know It?"
by Paul Nelson
Maybe a better title would have been "Automatically Generating Technical Documentation."
Nelson discusses representing technical documentation (software documentation, for example) as automatically generated knowledge graphs along the lines of DBpedia and ConceptNet. Here I've abridged one section of his article:
There are many good reasons why technical documentation may be the first to completely break away from paper-based documentation:
Technical documents are narrow in scope
Lots of technical documentation is already generated automatically — this is already (mostly) the case for Javadoc and similar sorts of library documentation. It’s a small step from this to creating machine-readable knowledge
People reading technical documentation don’t want long narratives — they want short paragraphs and lots of lists and examples
The last item is what particularly grabbed my attention. I've come to believe that one of the benefits of unit tests is that they can provide clients (that is, other programmers using the method or class) with real-world examples of how to use the code. A unit test could easily be transformed into different kinds of lists: input-output pairs (for this input, you get this output), or a set of instructions (you need to take these steps in order to properly call this method). With @-marked metadata at the start of, or within, a unit test, doing this automatically is not such a far-fetched idea.
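To make that concrete, here is a small sketch in Python; the @doc-* tags and the normalize_whitespace function are hypothetical illustrations of the idea, not an existing convention or tool:

```python
import re
import unittest


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


class TestNormalizeWhitespace(unittest.TestCase):
    def test_collapses_runs_of_spaces(self):
        """@doc-example: Collapsing repeated whitespace.
        @doc-input:  "hello   world\\n"
        @doc-output: "hello world"
        """
        # The assertion below is exactly the input/output pair a
        # documentation generator could lift from the @doc-* tags above.
        self.assertEqual(normalize_whitespace("hello   world\n"),
                         "hello world")


if __name__ == "__main__":
    unittest.main()
```

A documentation generator could walk the test module, read each test's docstring, and emit the tagged input/output pairs as usage examples in the generated docs for the function under test.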
Data Science/ML Links
- Glossary of common Machine Learning, Statistics and Data Science terms
- The 10 Statistical Techniques Data Scientists Need to Master
- Stephan Raaijmakers: Deep Learning for Natural Language Processing
- Keras Examples
- Is There a Smarter Path to Artificial Intelligence?: Don't let the hype over Deep Learning raise our expectations and lead to another round of AI disillusionment.
- Canada's AI Ecosystem: Montreal
- From big data to fast data
- NFL Salaries for AI Talent
- NY Times: The Great AI Awakening
- Google Translate and Deep Learning, Pt. 1
- Google Translate and Deep Learning, Pt. 2
- "The 5 Basic Statistics Concepts Data Scientists Need to Know"
- "Best Deals in Deep Learning Cloud Providers"
- "The Most in Demand Skills for Data Scientists": As of Oct. 23, 2018
NLP Links
- "The Big Bad NLP Database: Access Nearly 300 Datasets"
- Top 10 Books on NLP and Text Analysis
- Faster NLP in Python: Combining spaCy and Cython to speed up processing.
- A Framework for Approaching Textual Data Science Tasks: Includes a discussion of the differences between text mining and NLP.
- Chatbots: Theory and Practice
- "Natural Language Processing (NLP) Techniques for Extracting Information" by Paul Nelson
- Text Analytics: A Primer
- "What NLP & Text Analytics Companies Got Funded or Acquired in 2016?" by Seth Grimes
- Siri and autism: a mother's notes
- Types of Bots
NLP Resources
Parsing, Events, Etc.
- Penn Treebank Tag Set. As far as I can tell, the University of Pennsylvania doesn't maintain a permanent web page dedicated to its own tag set. Try this, though I can't guarantee the link will remain valid: Open American National Corpus
- Universal Dependencies
- VerbNet
- FrameNet
- EVITA - Events In Text Analyzer
- TimeML: Markup Language for Temporal and Event Expressions
- HeidelTime
Multi-Purpose Toolkits
Statistical Semantics
World Knowledge
Other
- The Big Bad NLP Database. See also the article linked above.
- Pattern. A Python web mining module that bundles tools for web scraping, NLP, and machine learning.
- Natural Language Software Registry
- Natural Language Processing Group on LinkedIn