Natural Language Processing made easy

Published on: Mar 3, 2016
Published in: Technology      
Source: www.slideshare.net


Transcripts - Natural Language Processing made easy

  • 1. NLTK: Natural Language Processing made easy. Elvis Joel D'Souza, Gopi Krishnan Nambiar, Ashutosh Pandey
  • 2. WHAT: Session Objective
      • To introduce the Natural Language Toolkit (NLTK), an open source library which simplifies the implementation of Natural Language Processing (NLP) in Python.
  • 3. HOW: Session Layout
      • This session is divided into 3 parts:
          • Python – The programming language
          • Natural Language Processing (NLP) – The concept
          • Natural Language Toolkit (NLTK) – The tool for NLP implementation in Python
  • 5. Why Python?
  • 6. Data Structures
      • Python has 4 built-in data structures:
          • List
          • Tuple
          • Dictionary
          • Set
  • 7. List
      • A list in Python is an ordered group of items (or elements).
      • It is a very general structure, and list elements don't have to be of the same type.
        listOfWords = ['this', 'is', 'a', 'list', 'of', 'words']
        listOfRandomStuff = [1, 'pen', 'costs', 'Rs.', 6.50]
  • 8. Tuple
      • A tuple in Python is much like a list, except that it is immutable (unchangeable) once created.
      • Tuples are generally used for data which should not be edited.
      • Example: (100, 10, 0.01, 'hundred'), where the fields are: number, square root, reciprocal, number in words.
  • 9. Return a tuple
      • One very useful situation is returning multiple values from a function. Returning multiple values in many other languages requires creating an object or container of some type.
        def func(x, y):
            # code to compute a and b
            return (a, b)
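The slide's func leaves the computation of a and b open. A minimal runnable sketch of the same idea follows; the function name describe and its returned fields are illustrative assumptions chosen to match the tuple example on the previous slide, not part of the original deck.

```python
import math

def describe(number):
    """Return several facts about a positive number as one tuple."""
    return (number, math.sqrt(number), 1 / number)

# Unpack the returned tuple into separate names.
value, root, reciprocal = describe(100)
print(value, root, reciprocal)

# Tuples are immutable: item assignment raises TypeError.
facts = describe(100)
try:
    facts[0] = 5
except TypeError as error:
    print("cannot modify a tuple:", error)
```

Unpacking on the left-hand side is what makes a tuple return feel like "multiple return values" to the caller.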
  • 10. Dictionary
      • A dictionary in Python is a collection of values which are accessed by key rather than by position.
      • Example: {1: 'one', 2: 'two', 3: 'three'}
      • Here, each key is a number and each value is that number in words.
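As a small sketch of key-based access, the dictionary from the slide can be used like this (the variable name number_words is an assumption made for illustration):

```python
number_words = {1: 'one', 2: 'two', 3: 'three'}

# Look up a value by its key.
print(number_words[2])

# Add a new key-value pair.
number_words[4] = 'four'

# .get() returns a default instead of raising KeyError for a missing key.
print(number_words.get(5, 'unknown'))

# Iterate over all key-value pairs.
for number, word in number_words.items():
    print(number, word)
```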
  • 11. Sets
      • Python also has an implementation of the mathematical set.
      • Unlike sequence objects such as lists and tuples, in which each element is indexed, a set is an unordered collection of objects.
      • Sets also cannot have duplicate members: a given object appears in a set 0 or 1 times.
        SetOfBrowsers = set(['IE', 'Firefox', 'Opera', 'Chrome'])
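The "0 or 1 times" rule is easy to see in a short sketch. The word list below is made-up sample data, not taken from the slides.

```python
browsers = set(['IE', 'Firefox', 'Opera', 'Chrome'])

# Adding a member that is already present leaves the set unchanged.
browsers.add('Firefox')
print(len(browsers))

# Membership tests do not depend on any position or index.
print('Opera' in browsers)

# Converting a list to a set removes duplicate words.
words = ['the', 'cat', 'saw', 'the', 'dog']
unique_words = set(words)
print(sorted(unique_words))
```

Because a set has no order, the example sorts the members before printing to get a stable display.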
  • 12. Control Statements
  • 13. Decision Control - If num = 3
  • 14. Loop Control - While number = 10
  • 15. Loop Control - For
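The code for these three slides did not survive in the transcript beyond the fragments num = 3 and number = 10. The following is a sketch of what each statement looks like, reusing those two starting values; everything else (the conditions, the step size, the word list) is an assumption made for illustration.

```python
# Decision control: if / elif / else picks one branch.
num = 3
if num > 0:
    sign = 'positive'
elif num == 0:
    sign = 'zero'
else:
    sign = 'negative'
print(num, 'is', sign)

# Loop control: while repeats until its condition becomes false.
number = 10
countdown = []
while number > 0:
    countdown.append(number)
    number -= 2
print(countdown)

# Loop control: for visits each item of a sequence in turn.
lengths = []
for word in ['this', 'is', 'a', 'list']:
    lengths.append(len(word))
print(lengths)
```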
  • 16. Functions - Syntax
        def functionname(arg1, arg2, ...):
            statement1
            statement2
            return variable
  • 17. Functions - Example
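The example shown on this slide is not in the transcript. As a stand-in, here is a small function that follows the syntax from the previous slide; the name average and its behaviour are assumptions, not the presenters' example.

```python
def average(numbers):
    """Return the arithmetic mean of a non-empty list of numbers."""
    total = 0
    for n in numbers:
        total += n
    return total / len(numbers)

print(average([2, 4, 6]))
```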
  • 18. Modules
      • A module is a file containing Python definitions and statements.
      • The file name is the module name with the suffix .py appended.
      • A module can be imported by another program to make use of its functionality.
  • 19. Import
        import math
      • The import keyword tells Python that we need the 'math' module. This statement makes all the functions in this module accessible in the program through the math. prefix.
  • 20. Using Modules – An Example
        print(math.sqrt(100))
      • math is a module and sqrt is a function in that module.
      • math.sqrt(100) returns 10.0, which is printed to the standard output.
  • 21. Natural Language Processing (NLP)
  • 22. Natural Language Processing
      • The term natural language processing encompasses a broad set of techniques for automated generation, manipulation, and analysis of natural or human languages.
  • 23. Why NLP
      • Applications that process large amounts of text require NLP expertise:
          • Index and search large texts
          • Speech understanding
          • Information extraction
          • Automatic summarization
  • 24. Stemming
      • Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form – generally a written word form.
      • The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.
      • When you apply stemming to 'cats', the result is 'cat'.
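To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm or any stemmer shipped with NLTK; the suffix list, the minimum stem length, and the name toy_stem are all assumptions chosen to keep the example short.

```python
# Suffixes are tried in this order; the first match is stripped.
SUFFIXES = ['ing', 'ed', 'es', 's']

def toy_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ['cats', 'walked', 'walking', 'walks', 'horses', 'is']:
    print(w, '->', toy_stem(w))
```

Note that 'horses' becomes 'hors': related forms still map to one stem, even though that stem is not a valid word, which is the point made on the slide.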
  • 25. Part-of-speech tagging (POS Tagging)
      • Part-of-speech (POS) tag: A word can be classified into one or more lexical or part-of-speech categories such as nouns, verbs, adjectives, and articles, to name a few. A POS tag is a symbol representing such a lexical category, e.g., NN (noun), VB (verb), JJ (adjective), AT (article).
  • 26. POS tagging - continued
      • Given a sentence and a set of POS tags, a common language processing task is to automatically assign a POS tag to each word in the sentence.
      • State-of-the-art POS taggers can achieve accuracy as high as 96%.
  • 27. POS Tagging – An Example
        The/ARTICLE  ball/NOUN  is/VERB  red/ADJECTIVE
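The simplest possible tagger just looks each word up in a table. The sketch below does that for the example sentence; the lexicon, the default tag, and the name lookup_tag are assumptions for illustration, and real taggers (including those in NLTK) also use context to choose between candidate tags.

```python
# Hypothetical hand-written lexicon covering only the example sentence.
LEXICON = {
    'the': 'ARTICLE',
    'ball': 'NOUN',
    'is': 'VERB',
    'red': 'ADJECTIVE',
}

def lookup_tag(sentence, default='NOUN'):
    """Tag each whitespace-separated word, falling back to a default tag."""
    return [(word, LEXICON.get(word.lower(), default))
            for word in sentence.split()]

print(lookup_tag('The ball is red'))
```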
  • 28. Parsing
      • Parsing a sentence involves the use of linguistic knowledge of a language to discover the way in which a sentence is structured.
  • 29. Parsing – An Example
        Sentence: The boy went home
        Tags:     The/ARTICLE  boy/NOUN  went/VERB  home/NOUN
        Phrases:  [NP The boy] [VP went home]
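A parse result is a tree, and a tree can be written down as nested tuples. The sketch below assumes the slide's diagram groups "The boy" under NP and "went home" under VP, with an S node on top; the S label and the helper names bracketed and leaves are assumptions, not taken from the slides.

```python
# Each node is (label, child, child, ...); a leaf is a plain string.
tree = ('S',
        ('NP', ('ARTICLE', 'The'), ('NOUN', 'boy')),
        ('VP', ('VERB', 'went'), ('NOUN', 'home')))

def bracketed(node):
    """Render a tree in the usual bracketed notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return '(' + label + ' ' + ' '.join(bracketed(c) for c in children) + ')'

def leaves(node):
    """Return the words of the tree from left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

print(bracketed(tree))
print(leaves(tree))
```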
  • 30. Challenges
      • We often imply additional information in spoken language by the way we place stress on words.
      • The sentence "I never said she stole my money" demonstrates the importance stress can play in a sentence, and thus the inherent difficulty a natural language processor can have in parsing it.
  • 31. Depending on which word the speaker stresses, a sentence can have several distinct meanings. Here is an example:
  • 32.
      • "I never said she stole my money" (stress on "I"): Someone else said it, but I didn't.
      • "I never said she stole my money" (stress on "never"): I simply didn't ever say it.
      • "I never said she stole my money" (stress on "said"): I might have implied it in some way, but I never explicitly said it.
      • "I never said she stole my money" (stress on "she"): I said someone took it; I didn't say it was she.
  • 33.
      • "I never said she stole my money" (stress on "stole"): I just said she probably borrowed it.
      • "I never said she stole my money" (stress on "my"): I said she stole someone else's money.
      • "I never said she stole my money" (stress on "money"): I said she stole something, but not my money.
  • 34. NLTK Natural Language Toolkit
  • 35. Design Goals
  • 36. Exploring Corpora
      • A corpus is a large collection of text which is used either to train an NLP program or as input to an NLP program.
      • In NLTK, a corpus can be loaded using the PlaintextCorpusReader class.
  • 38. Loading your own corpus
        >>> from nltk.corpus import PlaintextCorpusReader
        >>> corpus_root = 'C:text'
        >>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
        >>> wordlists.fileids()
        ['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
        >>> wordlists.words('connectives')
        ['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
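To show roughly what a plain-text corpus reader does, here is a standard-library sketch that maps every file in a directory to its list of words. It is a simplified stand-in, not NLTK's implementation: the function name read_corpus, the \w+ tokenization, and the two sample files are assumptions, and a temporary directory is used so the example runs anywhere.

```python
import os
import re
import tempfile

def read_corpus(root, pattern='.*'):
    """Map each file name under root that matches pattern to its word tokens."""
    corpus = {}
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.isfile(path) and re.fullmatch(pattern, name):
            with open(path, encoding='utf-8') as handle:
                corpus[name] = re.findall(r"\w+", handle.read())
    return corpus

# Build a throwaway two-file corpus for the demonstration.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, 'connectives'), 'w', encoding='utf-8') as f:
        f.write('the of and to a in that is')
    with open(os.path.join(root, 'README'), 'w', encoding='utf-8') as f:
        f.write('Word lists for testing.')
    corpus = read_corpus(root)

print(sorted(corpus))
print(corpus['connectives'])
```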
  • 39. NLTK Corpora
      • Gutenberg corpus
      • Brown corpus
      • WordNet
      • Stopwords
      • Shakespeare corpus
      • Treebank
      • And many more…
  • 40. Computing with Language: Simple Statistics
      • Frequency Distributions
        >>> from nltk.book import *       # provides text1 (Moby Dick) and FreqDist
        >>> fdist1 = FreqDist(text1)
        >>> fdist1
        <FreqDist with 260819 outcomes>
        >>> vocabulary1 = fdist1.keys()   # sorted by decreasing frequency in NLTK 2; NLTK 3 offers fdist1.most_common(50)
        >>> vocabulary1[:50]
        [',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-',
        'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for',
        'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on',
        'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were',
        'now', 'which', '?', 'me', 'like']
        >>> fdist1['whale']
        906
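The same kind of counting can be tried without any corpus installed. The sketch below uses collections.Counter from the standard library as a stand-in for FreqDist; the sample sentence is made up.

```python
from collections import Counter

# A tiny made-up token list in place of text1.
text = 'the whale and the sea and the ship'.split()
fdist = Counter(text)

# Total number of tokens counted (the "outcomes").
print(sum(fdist.values()))

# Count for one word; a word that never occurs counts as 0.
print(fdist['whale'])
print(fdist['shark'])

# The most frequent words with their counts.
print(fdist.most_common(2))
```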
  • 41. Cumulative Frequency Plot for the 50 Most Frequent Words in Moby Dick
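The plot itself is not in the transcript. As a sketch of what a cumulative frequency plot displays, the running totals can be computed with the standard library; the sample sentence and variable names are assumptions, and no plotting is attempted here.

```python
from collections import Counter
from itertools import accumulate

tokens = 'the whale and the sea and the ship'.split()

# Words ranked from most to least frequent.
ranked = Counter(tokens).most_common()
counts = [count for _, count in ranked]

# Running total of counts: the y-values of a cumulative frequency plot.
cumulative = list(accumulate(counts))
total = len(tokens)

for (word, _), running in zip(ranked, cumulative):
    print(word, running, round(100 * running / total))
```

The last running total always equals the number of tokens, so the curve ends at 100%.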
  • 42. POS tagging
  • 43. WordNet Lemmatizer
  • 44. Parsing
        >>> from nltk.parse import ShiftReduceParser
        >>> sr = ShiftReduceParser(grammar)   # grammar: a context-free grammar defined beforehand (not shown on the slide)
        >>> sentence1 = 'the cat chased the dog'.split()
        >>> sentence2 = 'the cat chased the dog on the rug'.split()
        >>> for t in sr.nbest_parse(sentence1):   # NLTK 3 uses sr.parse(sentence1) instead
        ...     print(t)
        (S (NP (DT the) (N cat)) (VP (V chased) (NP (DT the) (N dog))))
  • 45. Authorship Attribution An Example
  • 46. Find nltk @ <python-installation>\Lib\site-packages\nltk
  • 47. The Road Ahead
      • Python:
          • http://www.python.org
          • A Byte of Python, Swaroop CH, http://www.swaroopch.com/notes/python
      • Natural Language Processing:
          • Speech and Language Processing, Jurafsky and Martin
          • Foundations of Statistical Natural Language Processing, Manning and Schütze
      • Natural Language Toolkit:
          • http://www.nltk.org (for NLTK Book, Documentation)
          • Upcoming book by O'Reilly Publishers
