Is that Dothraki or Valyrian?
and other NLP tasks with Python and NLTK
Charlie Redmon | SupStat, Inc.
August 18, 2014

Natural Language Processing (SupStat Inc)

SupStat Inc, Natural Language Processing, NYC data science academy
Published on: Mar 3, 2016
Published in: Engineering      
Source: www.slideshare.net


Transcripts - Natural Language Processing (SupStat Inc)

  • 1. Is that Dothraki or Valyrian? and other NLP tasks with Python and NLTK
        Charlie Redmon | SupStat, Inc.
        August 18, 2014
  • 2. Dothraki
  • 3. Astapori Valyrian
  • 4. High Valyrian
  • 5. Importing raw text
        import codecs

        dothraki_f = codecs.open(
            "/home/cr/Python/westeros/dothraki.txt",
            encoding='utf-8')
        dothraki_raw = dothraki_f.read()
        print dothraki_raw

        Athchomar chomakaan, [zhey] khal vezhven. Azha anhaan asshilat ... Itte oakah!
        Jadi, zhey Jora Andahli. Khal vezhven. Ajjalan anha zalat vitiherat yer hatif.
        Kash qoy qoyi thira disse. Hash shafka zali addrivat mae, zhey Khaleesi?
        Ishish chare ...
  • 6. Text processing: Cleaning
        import re

        # Punctuation to strip: ASCII marks, U+2014, U+2019, U+2026, square brackets
        punct_re = re.compile(
            ur'[.,;:?!\u2014\u2019\u2026\[\]]',
            re.UNICODE)
        dothraki_proc = punct_re.sub('', dothraki_raw)
        dothraki_proc = dothraki_proc.lower()
        print dothraki_proc

        athchomar chomakaan zhey khal vezhven azha anhaan asshilat itte oakah jadi zhey
        jora andahli khal vezhven ajjalan anha zalat vitiherat yer hatif kash qoy qoyi
        thira disse ...
  • 7. Text processing: Tokenizing
        dothraki_tokens = re.split(ur'\s+', dothraki_proc)
        dothraki_types = set(dothraki_tokens)
        print dothraki_types

        set([u'izzi', u'ale', u'morea', u'vesazhao', u'yeri', u'ishish',
        u'dalen', u'vesazhae', u'yera', u'afisi', u'rhae', u'mawizzi',
        u'vee', u'arrisse', u'ti', u'ven', u'rizh', u'afichak', u'gache',
        u'zigerek', u'zigereo', u'drivoe', u'maaz', u'zigeree', u'ayyeyoon',
        u'maan', u'mahrazhi', u'ma', u'vos', u'movekkhi', u'mahrazhis',
        u'meshafka', u'qisi', u'sani', u'ville', u'vikeesi', u'ifak',
        u'javrathi', u'zisa', u'chek', u'nem', ... ])
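
The cleaning and tokenizing steps on slides 6 and 7 can be sketched in Python 3 as follows. This is a port under stated assumptions, not the code from the talk: the function name `tokenize` and the sample sentence are illustrative, and `str.split()` stands in for `re.split` so that empty strings at the edges are dropped.

```python
import re

# Same punctuation class as the slides: ASCII marks plus U+2014 (em dash),
# U+2019 (right single quote), U+2026 (ellipsis), and square brackets.
PUNCT_RE = re.compile("[.,;:?!\u2014\u2019\u2026\\[\\]]")

def tokenize(raw_text):
    """Strip punctuation, lowercase, and split on whitespace."""
    cleaned = PUNCT_RE.sub("", raw_text).lower()
    # str.split() with no argument splits on any whitespace run and
    # never returns empty strings.
    return cleaned.split()

# Illustrative sample assembled from words shown on slide 5.
sample = "Athchomar chomakaan, [zhey] khal vezhven. Itte oakah!"
tokens = tokenize(sample)
types = set(tokens)
print(tokens)
```

In Python 3 every `str` is Unicode, so neither the `ur''` prefix nor `re.UNICODE` is needed.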
  • 8. Inspecting the lexical distribution in a text
        from nltk import FreqDist

        dothraki_freqdist = FreqDist(dothraki_tokens)
        print dothraki_freqdist

        <FreqDist: u'anha': 50, u'vos': 40, u'me': 39, u'ma': 38, u'zhey': 29,
        u'mae': 27, u'anni': 26, u'hash': 23, u'yer': 23, u'khal': 16,
        u'khaleesi': 16, u'mori': 15, u'jin': 13, u'kisha': 12, u'nem': 11,
        u'vo': 11, u'che': 10, u'jini': 10, u'she': 10, ... >

        dothraki_freqdist.plot(20, cumulative=True)
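
For readers without NLTK installed, the same kind of frequency table can be approximated with the standard library. The sketch below swaps in `collections.Counter` for `FreqDist` and reports the cumulative share of tokens covered by the top words instead of drawing a plot. The function name `cumulative_coverage` and the toy token list are assumptions for illustration, not data from the talk.

```python
from collections import Counter

def cumulative_coverage(tokens, n):
    """Return [(word, cumulative share of all tokens)] for the n most common words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    running = 0
    rows = []
    for word, count in counts.most_common(n):
        running += count
        rows.append((word, running / total))
    return rows

# Toy tokens only; the real input would be the tokenized corpus.
toy_tokens = ["anha", "vos", "anha", "me", "anha", "vos"]
rows = cumulative_coverage(toy_tokens, 3)
for word, share in rows:
    print(word, round(share, 2))
```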
  • 9. CFD of Dothraki words
        Top 10: anha, vos, me, ma, zhey, mae, anni, hash, yer, khal
  • 10. Valyrian vocabulary distribution
        Astapori Valyrian (Top 10): ji, me, do, espo, si, mysa, eji, ez, ivetrá, sa
        High Valyrian (Top 10): daor, se, issa, syt, ziry, hen, jemēle, lue, yne, avy
  • 11. Feature 1: Consonant proportion
        from __future__ import division  # true division, so 3 / 6 gives 0.5 in Python 2

        def c_prop(word):
            c_num = 0
            for letter in u'bcdfgjklmnpqrstvxz\u00f1':
                c_num += word.count(letter)
            return c_num / len(word)

        c_prop(u'z\u016bgusy')
        0.5
  • 12. Word-internal consonant proportions across languages
  • 13. Feature 2: Obstruent proportion
        # Relies on true division (from __future__ import division in Python 2)
        def obstruent_prop(word):
            obstruent_num = 0
            for letter in u'bcdfgjkpqstvxz':
                obstruent_num += word.count(letter)
            return obstruent_num / len(word)

        obstruent_prop(u'\u012blvi')
        0.25
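
Features 1 and 2 differ only in the letter set, so both can be expressed through one helper. The following is a minimal Python 3 sketch: the helper name `letter_prop` and the choice to return 0.0 for an empty string are assumptions, while the two letter sets are copied from the slides.

```python
CONSONANTS = "bcdfgjklmnpqrstvxz\u00f1"  # letter set from slide 11
OBSTRUENTS = "bcdfgjkpqstvxz"            # letter set from slide 13

def letter_prop(word, letters):
    """Share of characters in `word` that belong to `letters`."""
    if not word:
        return 0.0  # assumption: an empty word scores 0 instead of raising
    return sum(ch in letters for ch in word) / len(word)

def c_prop(word):
    return letter_prop(word, CONSONANTS)

def obstruent_prop(word):
    return letter_prop(word, OBSTRUENTS)

print(c_prop("z\u016bgusy"))        # z, g, s out of 6 characters
print(obstruent_prop("\u012blvi"))  # v out of 4 characters
```

In Python 3 the `/` operator always performs true division, so no `__future__` import is required.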
  • 14. Word-internal obstruent proportions across languages
  • 15. Feature 3: Coda presence
        def c_coda(word):
            if word[-1] in u'bcdfgjklmnpqrstvxz\u00f1':
                return 1
            else:
                return 0

        def obstruent_coda(word):
            if word[-1] in u'bcdfgjkpqstvxz':
                return 1
            else:
                return 0

        c_coda(u'lysoon')
        1
        obstruent_coda(u'lysoon')
        0
  • 16. Mean coda consonant presence across languages
  • 17. Mean coda obstruent presence across languages
  • 18. Feature 4: Consonant clusters
        regex = (ur'[bcdfghjklmnpqrstvxz\u00f1]'
                 ur'[bcdfghjklmnpqrstvxz\u00f1]+')

        def c_cluster(word):
            cc_set = re.findall(regex, word, re.UNICODE)
            return len(cc_set)

        c_cluster(u'avvirsosh')
        3
  • 19. Mean consonant cluster frequency across languages
  • 20. Feature 5: Obstruent clusters
        regex1 = ur'[bcdfghjkpqstvxz][bcdfghjkpqstvxz]+'

        def obs_cluster(word):
            oo_set = re.findall(regex1, word, re.UNICODE)
            return len(oo_set)

        obs_cluster(u'avvirsosh')
        2
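
The two cluster counts on slides 18 and 20 share one pattern shape: a run of two or more letters from a given set. A Python 3 sketch with a single parametrized helper is shown below; the name `count_clusters` is an assumption, and the letter sets are the ones on the slides (note that these include `h`, unlike the proportion features).

```python
import re

CONSONANTS = "bcdfghjklmnpqrstvxz\u00f1"  # cluster set from slide 18
OBSTRUENTS = "bcdfghjkpqstvxz"            # cluster set from slide 20

def count_clusters(word, letters):
    """Count maximal runs of two or more characters drawn from `letters`."""
    pattern = "[{0}]{{2,}}".format(letters)
    return len(re.findall(pattern, word))

print(count_clusters("avvirsosh", CONSONANTS))  # vv, rs, sh
print(count_clusters("avvirsosh", OBSTRUENTS))  # vv, sh
```

The quantifier `{2,}` matches the same strings as repeating the character class followed by `+`.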
  • 21. Mean obstruent cluster frequency across languages
  • 22. Feature 6: Vowel clusters
        regex2 = ur'[bcdfghjklmnpqrstvxz\u00f1]+'

        def v_cluster(word):
            # flags must be passed by keyword; the third positional argument is maxsplit
            v_set = re.split(regex2, word, flags=re.UNICODE)
            vv_set = [v for v in v_set if len(v) > 1]
            return len(vv_set)

        v_cluster(u'haeshi')
        1
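
A Python 3 sketch of the vowel-cluster feature follows. It keeps the slide's approach of treating every character outside the consonant set as a vowel; the precompiled pattern name `CONSONANT_RUN` is an assumption. Because `re.split` takes `maxsplit` as its third positional parameter, the sketch avoids positional flags altogether by calling `split` on a compiled pattern.

```python
import re

# Consonant set from slide 22; anything else counts as a vowel here.
CONSONANT_RUN = re.compile("[bcdfghjklmnpqrstvxz\u00f1]+")

def v_cluster(word):
    """Count runs of two or more non-consonant characters."""
    vowel_runs = CONSONANT_RUN.split(word)
    return sum(1 for run in vowel_runs if len(run) > 1)

print(v_cluster("haeshi"))    # the run "ae"
print(v_cluster("khaleesi"))  # the run "ee"
```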
  • 23. Mean vowel cluster frequency across languages
  • 24. Data from real languages
  • 25. TDIL Assamese Corpus
  • 26. TDIL Assamese Corpus
  • 27. Assamese corpus files
        import os

        directory = "/home/cr/Documents/NLPwP_pres/TDIL_assamese_corpus_data"
        os.listdir(directory)

        ['subj_art2.txt', 'subj_politics1.txt', 'lit3.txt',
        'drama.txt', 'religion2.txt', 'criticism2.txt',
        'criticism1.txt', 'subj_science3.txt',
        'ref_encyclopaedia-entry.txt', 'subj_science2.txt',
        'subj_social-studies.txt', 'music.txt', 'subj_art1.txt
        'subj_science1.txt', 'subj_art3.txt', 'news.txt',
        'subj_sociology.txt', 'criticism3.txt', 'lit8.txt',
        'subj_history1.txt', 'lit4.txt', 'lit6.txt', 'religion
        'subj_law.txt', 'lit7.txt', 'religion1.txt', 'criticis
        'lit5.txt', 'subj_math.txt', 'lit1.txt', 'subj_science
        'subj_science_5.txt', 'subj_history2.txt', 'lit2.txt',
        'subj_science4.txt', 'letter.txt']
  • 28. Assamese sample: ‘lit5.txt’
  • 29. Frequency of the sound /x/ in 'lit5.txt'
        len(re.findall(ur'[\u09b6\u09b7\u09b8]',
                       assamese_sample_raw, re.UNICODE))
        1313
        len(re.findall(ur'\u09b6',
                       assamese_sample_raw, re.UNICODE))
        298
        len(re.findall(ur'\u09b7',
                       assamese_sample_raw, re.UNICODE))
        195
        len(re.findall(ur'\u09b8',
                       assamese_sample_raw, re.UNICODE))
        820
  • 30. Positional restrictions
        Beginning a word:
        len(re.findall(ur'\b[\u09b6\u09b7\u09b8]',
                       assamese_sample_raw, re.UNICODE))
        1129
        Ending a word:
        len(re.findall(ur'[\u09b6\u09b7\u09b8]\b',
                       assamese_sample_raw, re.UNICODE))
        895
  • 31. Positional restrictions
        Following /a/:
        len(re.findall(ur'\u09be[\u09b6\u09b7\u09b8]',
                       assamese_sample_raw, re.UNICODE))
        57
        Following /i/:
        len(re.findall(ur'[\u09bf\u09c0][\u09b6\u09b7\u09b8]',
                       assamese_sample_raw, re.UNICODE))
        70
        Following /u/:
        len(re.findall(ur'[\u09c1\u09c2][\u09b6\u09b7\u09b8]',
                       assamese_sample_raw, re.UNICODE))
        10
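
The positional counts on slides 30 and 31 can also be computed from whitespace-delimited tokens, sketched here in Python 3. One reason to consider this: `\b` in Python's `re` is defined through `\w`, and combining vowel signs such as U+09BE fall outside `\w`, so `\b` may report a boundary between a consonant and its vowel sign inside a word. The function names, the punctuation set, and the toy string are assumptions for illustration; the counts below come from the toy string, not from 'lit5.txt'.

```python
import re

X_LETTERS = "\u09b6\u09b7\u09b8"  # the three letters counted on the slides
PUNCT = "\u0964.,;:?!\"'()-"      # danda plus common ASCII marks (an assumption)

def positional_counts(text):
    """Count total, word-initial, and word-final X_LETTERS using whitespace tokens."""
    words = [w.strip(PUNCT) for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were pure punctuation
    return {
        "total": len(re.findall("[%s]" % X_LETTERS, text)),
        "initial": sum(w[0] in X_LETTERS for w in words),
        "final": sum(w[-1] in X_LETTERS for w in words),
    }

def following_counts(text, vowel_signs):
    """Count X_LETTERS occurring directly after any of the given vowel signs."""
    return len(re.findall("[%s][%s]" % (vowel_signs, X_LETTERS), text))

# Toy input built from code points: three words followed by a danda.
toy = ("\u09b8\u09be\u0997\u09f0 \u09ae\u09be\u09a8\u09c1\u09b9 "
       "\u0986\u0995\u09be\u09b6\u0964")
print(positional_counts(toy))
print(following_counts(toy, "\u09be"))
```

A word that ends in a vowel sign is not counted as ending in the consonant here, which may differ from what the `\b` pattern reports.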
  • 32. Further work
      • Incorporate segmental parameters into classifier (fix Unicode issues with NLTK's classify module)
      • Use classifier to predict assignment of random words from Westeros to Dothraki, Astapori Valyrian, and High Valyrian languages
      • Isolate most important word-internal parameters in classification model (log-likelihood ranking in Naive Bayes model)
      • Use full distributional account of select Assamese consonants as priors in acoustic classification model
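
As a possible starting point for the classifier step, the word-internal features from the earlier slides can be bundled into one feature dictionary per word. This is a sketch under assumptions, not the author's implementation: `word_features` is a hypothetical name, and the three labelled words are simply taken from the top-10 lists on slides 9 and 10. NLTK's `NaiveBayesClassifier.train` accepts a list of (feature dict, label) pairs in this shape, although it treats feature values as discrete, so the proportions would probably need binning first.

```python
import re

CONS_P = "bcdfgjklmnpqrstvxz\u00f1"   # proportion/coda set from the slides
OBS_P = "bcdfgjkpqstvxz"
CONS_C = "bcdfghjklmnpqrstvxz\u00f1"  # cluster set from the slides (adds h)
OBS_C = "bcdfghjkpqstvxz"

def word_features(word):
    """Bundle the word-internal features into one dict (assumes a non-empty word)."""
    n = len(word)

    def runs(letters):
        # Number of runs of two or more characters from `letters`.
        return len(re.findall("[%s]{2,}" % letters, word))

    vowel_runs = re.split("[%s]+" % CONS_C, word)
    return {
        "c_prop": sum(ch in CONS_P for ch in word) / n,
        "obstruent_prop": sum(ch in OBS_P for ch in word) / n,
        "c_coda": int(word[-1] in CONS_P),
        "obstruent_coda": int(word[-1] in OBS_P),
        "c_cluster": runs(CONS_C),
        "obs_cluster": runs(OBS_C),
        "v_cluster": sum(1 for run in vowel_runs if len(run) > 1),
    }

# Illustrative labelled pairs; a real training set would use the full word lists.
labelled = [(word_features(w), lang) for w, lang in [
    ("anha", "Dothraki"),
    ("mysa", "Astapori Valyrian"),
    ("daor", "High Valyrian"),
]]
print(labelled[0])
```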
  • 33. Thank you
