The podcast about Python and the people who make it great
Polyglot: Multi-Lingual Natural Language Processing with Rami Al-Rfou
Summary
Using computers to analyze text can produce useful and inspirational insights. However, when working with multiple languages, the capabilities of existing models are severely limited. To help overcome this limitation, Rami Al-Rfou built Polyglot. In this episode he explains his motivation for creating a natural language processing library with support for a vast array of languages, how it works, and how you can start using it for your own projects. He also discusses current research on multi-lingual text analytics, how he plans to improve Polyglot in the future, and how it fits in the Python ecosystem.
Preface
- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API, you’ve got everything you need to scale up. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system designed by software engineers, for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. Podcast.__init__ listeners get 2 months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a trial.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com.
- To help other people find the show, please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
- Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
- Your host as usual is Tobias Macey, and today I’m interviewing Rami Al-Rfou about Polyglot, a natural language pipeline with support for an impressive number of languages.
Interview
- Introductions
- How did you get introduced to Python?
- Can you start by describing what Polyglot is and your reasons for starting the project?
- What are the types of use cases that Polyglot enables which would be impractical with something such as NLTK or SpaCy?
- A majority of NLP libraries have a limited set of languages that they support. What is involved in adding support for a given language to a natural language tool?
- What is involved in adding a new language to Polyglot?
- Which families of languages are the most challenging to support?
- What types of operations are supported and how consistently are they supported across languages?
- How is Polyglot implemented?
- Is there any capacity for integrating Polyglot with other tools such as SpaCy or Gensim?
- How much domain knowledge is required to be able to effectively use Polyglot within an application?
- What are some of the most interesting or unique uses of Polyglot that you have seen?
- What have been some of the most complex or challenging aspects of building Polyglot?
- What do you have planned for the future of Polyglot?
- What are some areas of NLP research that you are excited for?
Keep In Touch
Picks
- Tobias
- Rami
- The Wizard and the Prophet: Two Remarkable Scientists and Their Dueling Visions to Shape Tomorrow’s World by Charles C. Mann
Links
- Polyglot
- Polyglot-NER
- Jordan
- NLP (Natural Language Processing)
- Stony Brook University
- Arabic
- Sentiment Analysis
- Assembly Language
- C
- .NET
- Stack Overflow
- Deep Learning
- Word Embedding
- Wikipedia
- Word2Vec
- NLTK (Python Natural Language Toolkit)
- SpaCy
- Gensim
- Morphology
- Morpheme
- Transfer Learning
- Read The Docs
- BERT (Bidirectional Encoder Representations from Transformers)
- FastText
- data.world
- Quilt package management for data
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA