NLP and data science news and resources.

Quick Summaries

"Your AI skills are worth less than you think"

by Ric Szopa

Thanks to open-source libraries and the "open culture of the AI community," machine learning skills are becoming less valuable as time goes by, not more. Freely available libraries and shared knowledge mean that a developer's background knowledge no longer needs to be as deep as it had to be even a couple of years ago.

Taking the best performing architecture currently described in the literature and retraining it on your own data is a battle-tested strategy if your goal is to solve a problem (as opposed to making an original contribution to science). If there’s nothing really good available right now, it’s often a matter of waiting a quarter or two until someone comes up with a solution. Especially since you can do things like hosting a Kaggle competition to incentivize researchers to look into your particular problem.
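
To make that strategy concrete, here's a minimal transfer-learning sketch (my own illustration; Szopa's article doesn't prescribe a framework or model): it loads a pretrained ResNet-18 from torchvision, freezes the backbone, and retrains only a new classification head on your own data. NUM_CLASSES and train_step are placeholders.

    import torch
    import torch.nn as nn
    from torchvision import models

    # Take an architecture that performs well in the literature,
    # with weights pretrained on ImageNet.
    model = models.resnet18(pretrained=True)

    # Freeze the pretrained backbone so only the new head is trained.
    for param in model.parameters():
        param.requires_grad = False

    # Swap in a classification head sized for your own task
    # (NUM_CLASSES is a placeholder for your dataset's label count).
    NUM_CLASSES = 5
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

    # Optimize only the new head's parameters on your own data.
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def train_step(images, labels):
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
        return loss.item()

The same pattern carries over to NLP: start from a pretrained language model and fine-tune a task-specific head on your own corpus.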

Szopa goes on to explain where he feels the durable competitive advantage in the field will eventually lie:

So, how would you go about building a maintainable competitive advantage for an AI product? Some time ago I had the pleasure of talking to Antonio Criminisi from Microsoft Research. His idea is that the project’s secret sauce shouldn’t consist only of AI. For example, his InnerEye project uses AI and classical (not ML based) computer vision to analyze radiological images. To some extent, this may be at odds with why you are doing an AI startup in the first place. The ability to just throw data at a model and see it work is incredibly attractive. However, a traditional software component, the sort of which requires programmers to think about algorithms and utilize some hard to gain domain knowledge, is much more difficult to reproduce.

In other words, from a long-term perspective, this new approach to machine learning (simply "throwing data at a problem" and getting good results) may be a passing fad. To excel at what you're doing, you'll still need domain expertise, which is pretty much where we were before the deep learning craze.

"Will NLP Change Search as We Know It?"

by Paul Nelson

Maybe a better title would have been "Automatically Generating Technical Documentation."

Nelson discusses representing technical documentation (software documentation, for example) as knowledge graphs, similar to DBpedia and ConceptNet, that can be generated automatically; a toy sketch of that format follows the list below. Here I've abridged one section of his article.

There are many good reasons why technical documentation may be the first to completely break away from paper-based documentation:

  • Technical documents are narrow in scope

  • Lots of technical documentation is already generated automatically — this is already (mostly) the case for Javadoc and similar sorts of library documentation. It’s a small step from this to creating machine-readable knowledge

  • People reading technical documentation don’t want long narratives — they want short paragraphs and lots of lists and examples
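
To show what "documentation as a knowledge graph" might look like mechanically (my own sketch; Nelson's article doesn't prescribe a format), here's a toy graph of subject-predicate-object triples describing a small API, with a simple pattern-matching lookup. The facts and predicate names are invented.

    # A toy documentation knowledge graph: (subject, predicate, object)
    # triples, the same shape used by resources like DBpedia and ConceptNet.
    TRIPLES = [
        ("slugify", "is_a",    "function"),
        ("slugify", "takes",   "str"),
        ("slugify", "returns", "str"),
        ("slugify", "used_by", "make_permalink"),
    ]

    def query(subject=None, predicate=None, obj=None):
        """Return every triple matching the given (possibly partial) pattern."""
        return [
            (s, p, o)
            for s, p, o in TRIPLES
            if (subject is None or s == subject)
            and (predicate is None or p == predicate)
            and (obj is None or o == obj)
        ]

    # "What do we know about slugify?" -- the kind of question a
    # machine-readable manual could answer directly.
    print(query(subject="slugify"))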

The last item in Nelson's list is what particularly grabbed my attention. I've come to believe that one of the benefits of unit tests is that they can provide clients (that is, other programmers using the method or class) with real-world examples of how to use the code. A unit test could easily be transformed into different kinds of lists: input-output pairs (for this input, you get this output), or a set of instructions (you need to take these steps in order to properly call this method). With @-marked metadata at the start of or within a unit test, doing this automatically is not such a far-fetched idea; a rough sketch follows.
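
As an illustration of that idea (my own sketch, not from Nelson's article), here's how @-marked metadata could work in Python: a hypothetical doc_example decorator registers a test as a usage example for a named function, and a toy generator prints each registered test as documentation. All of the names (doc_example, DOC_EXAMPLES, slugify) are invented for the example.

    import inspect

    # Registry mapping a documented function's name to its example tests.
    DOC_EXAMPLES = {}

    def doc_example(describes):
        # Hypothetical @-marker: tags a test as a usage example for `describes`.
        def register(test_fn):
            DOC_EXAMPLES.setdefault(describes, []).append(test_fn)
            return test_fn
        return register

    def slugify(text):
        # The function under test, kept trivial so the sketch is self-contained.
        return "-".join(text.lower().split())

    @doc_example(describes="slugify")
    def test_slugify_lowercases_and_hyphenates():
        # Each assertion doubles as a machine-readable input-output pair.
        assert slugify("Hello World") == "hello-world"
        assert slugify("  NLP  News ") == "nlp-news"

    # A toy "doc generator": emit each tagged test's source as usage examples.
    if __name__ == "__main__":
        for name, tests in DOC_EXAMPLES.items():
            print(f"Examples for {name}:")
            for test in tests:
                print(inspect.getsource(test))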

NLP Resources

Parsing, Events, Etc.

Multi-Purpose Toolkits

Statistical Semantics

World Knowledge

Other