Vsevolod Dyomkin, "Natural Language Processing in Practice"

"AI&BigData Lab" conference, April 12, 2014
Published on: Mar 3, 2016
Published in: Data & Analytics, Technology, Education
Source: www.slideshare.net


Transcript - Vsevolod Dyomkin, "Natural Language Processing in Practice"

  • 1. Natural Language Processing in practice
  • 2. Topics * Overview of NLP * Getting Data * Models & Algorithms * Building an NLP system * A practical example
  • 3. A bit about me * Lisp programmer * Architect and research lead at Grammarly (3+ years of NLP work) * Teacher at KPI: Operating Systems * Links: http://lisp-univ-etc.blogspot.com http://github.com/vseloved http://twitter.com/vseloved
  • 4. A bit about Grammarly (c) xkcd The best English language writing enhancement app: Spellcheck - Grammar check - Style improvement - Synonyms and word choice - Plagiarism check
  • 5. What is NLP? Transforming free-form text into structured data and back Intersection of Comp Sci & Linguistics & Software Eng Based on Algorithms, Machine Learning, and Statistics
  • 6. Popular NLP problems * Spam Filtering * Spelling Correction * Sentiment Analysis * Question Answering * Machine Translation * Text Summarization * Search (also IR) http://www.paulgraham.com/spam.html http://norvig.com/spell-correct.html (c) gettyimages
  • 7. Levels of NLP * data & tools * models * production-ready systems
  • 8. Role of Linguistics
  • 9. NLP Data structured - semi-structured - unstructured “Data is ten times more powerful than algorithms.” -- Peter Norvig The Unreasonable Effectiveness of Data. http://youtu.be/yvDCzhbjYWs
  • 10. Kinds of data * Dictionaries * Corpora * User Data
  • 11. Where to get data? * Linguistic Data Consortium http://www.ldc.upenn.edu/ * Google ngrams, book ngrams, syntactic ngrams * Wikimedia * Wordnet * APIs: Twitter, Wordnik, ... * University sites: Stanford, Oxford, CMU, ...
  • 12. Create your own! * Linguists * Crowdsourcing * By-product -- Jonathan Zittrain http://goo.gl/hs4qB
  • 13. Tools * analysis tools * processing tools * Unix command line * XML processing * Map-reduce systems * R, Python, Lisp (c) O'Reilly Media
  • 14. Algorithms * Dynamic Programming * Search Algorithms * Tree Algorithms
  • 15. Beyond Algorithms * CKY constituency parsing * Noisy channel spelling correction * TF-IDF document classification * Bayesian filtering
  • 16. Models * generative vs discriminative * statistical vs rule-based
  • 17. Language Models Ngrams Generative ML models: * Bayesian inference (bag-of-words model) * Hidden Markov model (sequence model) * Neural networks (holistic model) LM + Domain Model
  • 18. Discriminative Models * Heuristic * Maximum Entropy * “Advanced” LM Models
  • 19. Going Into Prod * Translate real-world requirements into a measurable goal * Pre- and post- processing * Don't trust research results * Gather user feedback
  • 20. Practical Example: Language Detection
  • 21. Idea Standard approach: character LM Let's try an alternative: word LM Data - from Wiktionary Test data - from Wikipedia
  • 22. Practical ML System * Training
  • 23. ML System * Training * Evaluation
  • 24. ML System * Training * Evaluation * Production
  • 25. Thanks! Questions? Vsevolod Dyomkin @vseloved
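
Slides 20-24 outline language detection with a word-level language model and the three stages of an ML system: training, evaluation, production. The slides do not show the speaker's code, so the following is only a minimal sketch of that idea under stated assumptions: a unigram word LM with add-one smoothing, and tiny made-up word lists standing in for Wiktionary and Wikipedia data. The names `train`, `log_prob`, `detect`, and `evaluate` are hypothetical.

```python
import math
from collections import Counter

# Toy training text per language (hypothetical data; the talk uses
# word lists from Wiktionary instead).
TRAIN = {
    "en": "the cat is on the table and the dog is in the house".split(),
    "de": "die katze ist auf dem tisch und der hund ist im haus".split(),
    "fr": "le chat est sur la table et le chien est dans la maison".split(),
}

def train(corpora):
    """Training: count word frequencies per language (a unigram word LM)."""
    vocab = {w for words in corpora.values() for w in words}
    counts_by_lang = {lang: Counter(words) for lang, words in corpora.items()}
    return counts_by_lang, len(vocab)

def log_prob(text, counts, vocab_size):
    """Log-probability of text under one language's add-one-smoothed LM."""
    total = sum(counts.values())
    # The extra +1 in the denominator reserves mass for unseen words.
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size + 1))
        for w in text.lower().split()
    )

def detect(text, model):
    """Production: return the language whose LM scores the text highest."""
    counts_by_lang, vocab_size = model
    return max(
        counts_by_lang,
        key=lambda lang: log_prob(text, counts_by_lang[lang], vocab_size),
    )

def evaluate(model, labeled):
    """Evaluation: accuracy over (text, expected_language) pairs."""
    hits = sum(detect(text, model) == lang for text, lang in labeled)
    return hits / len(labeled)

model = train(TRAIN)

# Toy test set (hypothetical; the talk tests on Wikipedia text).
TEST = [
    ("the dog is on the table", "en"),
    ("der hund ist auf dem tisch", "de"),
    ("le chien est sur la table", "fr"),
]
accuracy = evaluate(model, TEST)
print(f"accuracy: {accuracy:.2f}")
```

A word LM of this kind depends on how well the dictionary covers the input: text made up mostly of unseen words gets nearly the same score in every language. That is one reason character LMs are the standard approach named on slide 21.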