Natural Language Processing Tools for the Digital Humanities
Christopher Manning, Stanford University
Digital Humanities 2011

Transcript
  • 1. Natural Language Processing Tools for the Digital Humanities
       Christopher Manning, Stanford University
       Digital Humanities 2011
       http://nlp.stanford.edu/~manning/courses/DigitalHumanities/
  • 2. Commencement 2010
  • 3. My humanities qualifications
       • B.A. (Hons), Australian National University
       • Ph.D. Linguistics, Stanford University
       • But: I'm not sure I've ever taken a real humanities class (if you discount linguistics classes and high school English…)
  • 4. SO, FEEL FREE TO ASK QUESTIONS!
  • 5. Text
  • 6. The promise
       Phrase Net visualization of Pride & Prejudice (* (in|at) *)
       http://www-958.ibm.com/software/data/cognos/manyeyes/
  • 7. "How I write" [code]
       • I think you tend to get too much of people showing the glitzy output of something
       • So, for this tutorial, at least in the slides, I'm trying to include the low-level hacking and plumbing
       • It's a standard truism of data mining that more time goes into "data preparation" than anything else. Definitely goes for text processing.
  • 8. Outline
       1. Introduction
       2. Getting some text
       3. Words
       4. Collocations, etc.
       5. NLP frameworks and tools
       6. Part-of-speech tagging
       7. Named entity recognition
       8. Parsing
       9. Coreference resolution
       10. The rest of the languages of the world
       11. Parting words
  • 9. 2. GETTING SOME TEXT
  • 10. First step: Text
       • To do anything, you need some texts!
         – Many sites give you various sorts of search-and-display interfaces
         – But, normally you just can't do what you want in NLP for the Digital Humanities unless you have a copy of the texts sitting on your computer
         – This may well change in the future: there is increasing use of cloud computing models where you might be able to upload code to run it on data on a server
           • or, conversely, upload data to be processed by code on a server
  • 11. First step: Text
       • People in the audience are probably more familiar with the state of play here than me, but my impression is:
         1. There are increasingly good supplies of critical texts in well-marked-up XML available commercially for license to university libraries
         2. There are various more community efforts to produce good digitized collections, but most of those seem to be available to "friends" rather than to anybody with a web browser
         3. There's Project Gutenberg
            • Plain text, or very simple HTML, which may or may not be automatically generated
            • Unicode UTF-8 if you're lucky, US-ASCII if you're not
  • 12. 1. Early English Books Online
       • TEI-compliant XML texts
       • http://eebo.chadwyck.com/
  • 13. 2. Old Bailey Online
  • 14. 3. Project Gutenberg
  • 15. Running example: H. Rider Haggard
       • The hugely popular King Solomon's Mines (1885) by H. Rider Haggard is sometimes considered the first of the "Lost World" or "Imperialist Romance" genres
       • Allan Quatermain (1887)
       • She (1887)
       • Nada the Lily (1892)
       • Ayesha: The Return of She (1905)
       • She and Allan (1921)
       • Zip file at: http://nlp.stanford.edu/~manning/courses/DigitalHumanities/
  • 16. Interfaces to tools
       Web applications • GUI applications • Command-line applications • Programming APIs
  • 17. You'll need to program
       • Lisa Spiro, TAMU Digital Scholarship 2009:
         "I'm a digital humanist with only limited programming skills (Perl & XSLT). Enhancing my programming skills would allow me to:
         • Avoid so much tedious, manual work
         • Do citation analysis
         • Pre-process texts (remove the junk)
         • Automatically download web pages
         • And much more…"
  • 18. You'll need to program
       • Program in what?
         – Perl
           • Traditional seat-of-the-pants scripting language for text processing (it nailed flexible regexes). I use it some below…
         – Python
           • Cleaner, more modern scripting language with a lot of energy, and the best-documented NLP framework, NLTK.
         – Java
           • There are more NLP tools for Java than any other language. And it's one of the most popular languages in general. Good regular expressions, Unicode, etc.
  • 19. You'll need to program
       • Program with what?
         – There are some general skills you'll want that cut across programming languages
           • Regular expressions
           • XML, especially XPath and XSLT
           • Unicode
       • But I'm wisely not going to try to teach programming or these skills in this tutorial
  • 20. Grabbing files from websites
       • wget (Linux) or curl (Mac OS X, BSD)
         – wget http://www.gutenberg.org/browse/authors/h
         – curl -O http://www.gutenberg.org/browse/authors/h
       • If you really want to use your browser, there are things you can get like this Firefox plug-in
         – DownThemAll http://www.downthemall.net/
         – but then you just can't do things as flexibly
  • 21. Grabbing files from websites
       #!/usr/bin/perl
       while (<>) { last if (m/Haggard/); }
       while (<>) {
           last if (m/Hague/);
           if (m!pgdbetext"><a href="/ebooks/(\d+)">(.*)</a> \(English\)!) {
               $title = $2;
               $num = $1;
               $title =~ s/<br>/ /g;
               $title =~ s/\r//g;
               print "curl -o \"$title $num.txt\" http://www.gutenberg.org/cache/epub/$num/pg$num.txt\n";
               # Expect only one of the html to exist
               print "curl -o \"$title $num.html\" http://www.gutenberg.org/files/$num/$num-h/$num-h.htm\n";
               print "curl -o \"$title $num-g.html\" http://www.gutenberg.org/cache/epub/$num/pg$num.htm\n";
           }
       }
  • 22. Grabbing files from websites
       wget http://www.gutenberg.org/browse/authors/h
       perl getHaggard.pl < h > h.sh
       chmod 755 h.sh
       ./h.sh
       # and a bit of futzing by hand that I will leave out…
       • Often you want the 90% solution: automating nothing would be slow and painful, but automating everything is more trouble than it's worth for a one-off process
  • 23. Typical text problems
       "Devilish strange!" thought he, chuckling to himself; "queer business! Capital trick of the cull in the cloak to make another persons brat stand the brunt for his own---capital! ha! ha! Wont do, though. He must be a sly fox to get out of the Mint without my [Page 59 ] knowledge. Ive a shrewd guess where hes taken refuge; but Ill ferret him out. These bloods will pay well for his capture; if not, hell pay well to get out of their hands; so Im safe either way---ha! ha! Blueskin," he added aloud, and motioning that worthy, "follow me." Upon which, he set off in the direction of the entry. His progress, however, was checked by loud acclamations, announcing the arrival of the Master of the Mint and his train. Baptist Kettleby (for so was the Master named) was a "goodly portly man, and a corpulent," whose fair round paunch bespoke the affection he entertained for good liquor and good living. He had a quick, shrewd, merry eye, and a look in which duplicity was agreeably veiled by good humour. It was easy to discover that he was a knave, but equally easy to perceive that he was a pleasant fellow; a combination of qualities by no means of rare occurrence. So far as regards his attire, Baptist was not seen to advantage. No great lover of state or state costume at any time, he was [Page 60 ] generally, towards the close of an evening, completely in dishabille, and in this condition he now presented himself to his subjects. His shirt was unfastened, his vest unbuttoned, his hose ungartered; his feet were stuck into a pair of pantoufles, his arms into a greasy flannel dressing-gown, his head into a thrum-cap, the cap into a tie-periwig, and the wig into a gold-edged hat. A white apron was tied round his waist, and into the apron was thrust a short thick truncheon, which looked very much like a rolling-pin. The Master of the Mint was accompanied by another gentleman almost as portly as himself, and quite as deliberate in his movements. The costume of this personage was somewhat singular, and might have passed for a masquerading habit, had not the imperturbable gravity of his demeanour forbidden any such supposition. It consisted of a close jerkin of brown frieze, ornamented with a triple row of brass buttons; loose Dutch slops, made very wide in the seat and very tight at the knees; red stockings with black clocks, and [Page 61 ] a fur cap. The owner of this dress had a broad weather-beaten face, small twinkling eyes, and a bushy, grizzled beard. Though he walked by the side of the governor, he seldom exchanged a word with him, but appeared wholly absorbed in the contemplations inspired by a broad-bowled Dutch pipe.
  • 24. There are always text-processing gotchas …
       • … and not dealing with them can badly degrade the quality of subsequent NLP processing.
         1. The Gutenberg *.txt files frequently represent italics with _underscores_.
         2. There may be file headers and footers.
         3. Elements like headings may be run together with following sentences if not demarcated or eliminated (example later).
  • 25. There are always text-processing gotchas …
       #!/usr/bin/perl
       $finishedHeader = 0;
       $startedFooter = 0;
       while ($line = <>) {
           if ($line =~ /^\*\*\*\s*END/ && $finishedHeader) {
               $startedFooter = 1;
           }
           if ($finishedHeader && ! $startedFooter) {
               $line =~ s/_//g;    # minor cleanup of italics
               print $line;
           }
           if ($line =~ /^\*\*\*\s*START/ && ! $finishedHeader) {
               $finishedHeader = 1;
           }
       }
       if ( ! ($finishedHeader && $startedFooter)) {
           print STDERR "**** Probable book format problem!\n";
       }
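       A hypothetical invocation, assuming the script above is saved as stripGutenberg.pl (the script and file names here are mine, not from the talk):

           perl stripGutenberg.pl < "She 3155-raw.txt" > "She 3155.txt"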
  • 26. 3. WORDS
  • 27. In the beginning was the word
       • Word counts
       • Word counts are the basis of all the simple, first-order methods of text analysis
         – tag clouds, collocations, topic models
       • Sometimes you can get a fair distance with word counts
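       In the spirit of the Perl scripts above, a minimal sketch of how such counts can be computed (the script name and the deliberately crude tokenization are mine):

           #!/usr/bin/perl
           # wordcount.pl -- crude word-frequency list
           # usage: perl wordcount.pl < book.txt
           my %count;
           while (<>) {
               $_ = lc $_;                       # case-fold
               for my $w (m/[a-z']+/g) {         # very rough tokenization
                   $count{$w}++;
               }
           }
           # print the 20 most frequent words
           my $n = 0;
           for my $w (sort { $count{$b} <=> $count{$a} } keys %count) {
               print "$count{$w}\t$w\n";
               last if ++$n >= 20;
           }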
  • 28. She (1887)
       http://wordle.net/ (by Jonathan Feinberg)
  • 29. Ayesha: The Return of She (1905)
  • 30. She and Allan (1921)
  • 31. Wisdom's Daughter: The Life and Love Story of She-Who-Must-Be-Obeyed (1923)
  • 32. Wisdom's Daughter: The Life and Love Story of She-Who-Must-Be-Obeyed (1923)
  • 33. Google Books Ngram Viewer
       http://ngrams.googlelabs.com/
  • 34. Google Books Ngram Viewer
       • … you have to be the most jaded or cynical scholar not to be excited by the release of the Google Books Ngram Viewer … Digital humanities needs gateway drugs. … "Culturomics" sounds like an 80s new wave band. If we're going to coin neologisms, let's at least go with Sean Gillies' satirical alternative: Freakumanities. … For me, the biggest problem with the viewer and the data is that you cannot seamlessly move from distant reading to close reading
  • 35. Language change: as least as
       C. D. Manning. 2003. Probabilistic Syntax.
       • I found this example in Russo R., 2001, Empire Falls (on p. 3!):
         – By the time their son was born, though, Honus Whiting was beginning to understand and privately share his wife's opinion, as least as it pertained to Empire Falls.
       • What's interesting about it?
  • 36. Language change: as least as
       • A language change in progress? I found a bunch of other examples:
         – Indeed, the will and the means to follow through are as least as important as the initial commitment to deficit reduction.
         – As many of you know he had his boat built at the same time as mine and it's as least as well maintained and equipped.
       • Apparently not a "dialect"
         – Second, if the required disclosures are made by on-screen notice, the disclosure of the vendor's legal name and address must appear on one of several specified screens on the vendor's electronic site and must be at least as legible and set in a font as least as large as the text of the offer itself.
  • 37. Language change: as least as
  • 38. Language change: as least as
  • 39. 4. COLLOCATIONS, ETC.
  • 40. Using a text editor
       • You can get a fair distance with a text editor that allows multi-file searches, regular expressions, etc.
         – It's like a little concordancer that's good for close reading
       • jEdit http://www.jedit.org/
       • BBEdit on Mac OS X
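       A few lines of Perl make the "little concordancer" idea concrete: a keyword-in-context sketch (script name and window width are mine; the real concordancers on the next slide do far more):

           #!/usr/bin/perl
           # kwic.pl -- usage: perl kwic.pl honour *.txt
           my $target = shift @ARGV;
           while (<>) {
               chomp;
               # print a fixed-width window around each hit
               while (m/\b\Q$target\E\b/gi) {
                   my $start = pos($_) - length($target);
                   my $left  = substr($_, 0, $start);
                   my $right = substr($_, pos($_));
                   $left  = substr($left, -30)   if length($left)  > 30;
                   $right = substr($right, 0, 30) if length($right) > 30;
                   printf "%30s %s %s\n", $left, $target, $right;
               }
           }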
  • 41. Traditional Concordancers
       • WordSmith Tools (commercial; Windows)
         – http://www.lexically.net/wordsmith/
       • Concordance (commercial; Windows)
         – http://www.concordancesoftware.co.uk/
       • AntConc (free; Windows, Mac OS X (only under X11), Linux)
         – http://www.antlab.sci.waseda.ac.jp/antconc_index.html
       • CasualConc (free; Mac OS X)
         – http://sites.google.com/site/casualconc/
         – by Yasu Imao
  • 42. The decline of honour
  • 43. 5. NLP FRAMEWORKS AND TOOLS
  • 44. The Big 3 NLP Frameworks
       • GATE – General Architecture for Text Engineering (U. Sheffield)
         • http://gate.ac.uk/
         • Java, quite well maintained (now)
         • Includes tons of components
       • UIMA – Unstructured Information Management Architecture. Originally IBM; now an Apache project
         • http://uima.apache.org/
         • Professional, scalable, etc.
         • But, unless you're comfortable with XML, Eclipse, Java or C++, etc., I think it's a non-starter
       • NLTK – Natural Language Toolkit (started by Steven Bird)
         • http://www.nltk.org/
         • Big community; large Python package; corpora and books about it
         • But it's code modules and an API, no GUI or command-line tools
         • Like R for NLP. But, hey, R's becoming very successful…
  • 45. The main NLP Packages
       • NLTK (Python)
         – http://www.nltk.org/
       • OpenNLP
         – http://incubator.apache.org/opennlp/
       • Stanford NLP
         – http://nlp.stanford.edu/software/
       • LingPipe
         – http://alias-i.com/lingpipe/
       • More one-off packages than I can fit on this slide
         – http://nlp.stanford.edu/links/statnlp.html
  • 46. NLP tools: Rules of thumb for 2011
       1. Unless you're unlucky, the tool you want to use will work with Unicode (at least the BMP), so most any characters are okay
       2. Unless you're lucky, the tool you want to use will work only on completely plain text, or extremely simple XML-style mark-up (e.g., <s> … </s> around sentences, recognized by regexp)
       3. By default, you should assume that any tool for English was trained on American newswire
  • 47. GATE
  • 48. Rule-based NLP and Statistical/Machine Learning NLP
       • Most work on NLP in the 1960s, 70s and 80s was with hand-built grammars and morphological analyzers (finite state transducers), etc.
         – ANNIE in GATE is still in this space
       • Most academic research work in NLP in the 1990s and 2000s uses probabilistic or, more generally, machine learning methods ("Statistical NLP")
         – The Stanford NLP tools and MorphAdorner, which we will come to soon, are in this space
  • 49. Rule-based NLP and Statistical/Machine Learning NLP
       • Hand-built grammars are fine for tasks in a closed space which do not involve reasoning about contexts
         – E.g., finding the possible morphological parses of a word
       • In the old days they worked really badly on "real text"
         – They were always insufficiently tolerant of the variability of real language
         – But, built with modern, empirical approaches, they can do reasonably well
           • ANNIE is an example of this
  • 50. Rule-based NLP and Statistical/Machine Learning NLP
       • In Statistical NLP:
         – You gather corpus data, and usually hand-annotate it with the kind of information you want to provide, such as part-of-speech
         – You then train (or "learn") a model that learns to try to predict annotations based on features of words and their contexts via numeric feature weights
         – You then apply the trained model to new text
       • This tends to work much better on real text
         – It more flexibly handles contextual and other evidence
       • But the technology is still far from perfect: it requires annotated data, and degrades (sometimes very badly) when there are mismatches between the training data and the runtime data
  • 51. How much hardware do you need?
       • NLP software often needs plenty of RAM (especially) and processing power
       • But these days we have really powerful laptops!
       • Some of the software I show you could run on a machine with 256 MB of RAM (e.g., the Stanford Parser), but much of it requires more
       • Stanford CoreNLP requires a machine with 4 GB of RAM
       • I ran everything in this tutorial on the laptop I'm presenting on … 4 GB RAM, 2.8 GHz Core 2 Duo
       • But it wasn't always pleasant writing the slides while software was running…
  • 52. How much hardware do you need?
       • Why do you need more hardware?
         – More speed
           • It took me 95 minutes to run Ayesha, the Return of She through Stanford CoreNLP on my laptop…
         – More scale
           • You'd like to be able to analyze 1 million books
       • Order-of-magnitude rules of thumb (see the estimate below):
         – POS tagging, NER, etc.: 5–10,000 words/second
         – Parsing: 1–10 sentences per second
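       To make those rules of thumb concrete, a back-of-the-envelope estimate in Perl (the book size and the mid-range rates chosen here are my assumptions, not figures from the talk):

           #!/usr/bin/perl
           # Rough processing-time estimates from the rules of thumb above.
           my $words_per_book = 130_000;   # roughly the size of She
           my $tag_rate   = 7_500;         # words/sec for tagging or NER
           my $parse_rate = 5 * 20;        # ~5 sents/sec at ~20 words/sent
           printf "Tagging one book: ~%.0f seconds\n", $words_per_book / $tag_rate;
           printf "Parsing one book: ~%.1f hours\n",
               $words_per_book / $parse_rate / 3600;
           printf "Parsing 1M books: ~%.0f CPU-years\n",
               1_000_000 * $words_per_book / $parse_rate / (3600 * 24 * 365);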
  • 53. How much hardware do you need?
       • Luckily, most of our problems are trivially parallelizable
         – Each book/chapter can be run separately, perhaps on a separate machine (see the sketch below)
       • What do we actually use?
         – We do most of our computing on rack-mounted Linux servers
           • Currently 4 x quad-core Xeon processors with 24 GB of RAM seem about the sweet spot
           • About $3500 per machine … not like the old days
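       In the same spirit as getHaggard.pl, a driver script can write one tagging command per book, and the resulting job list can then be split across cores or machines (a sketch; it assumes the stanfordtag alias from slide 57, and the script name is mine):

           #!/usr/bin/perl
           # maketagjobs.pl -- usage: perl maketagjobs.pl RiderHaggard/*.txt > jobs.sh
           foreach my $file (@ARGV) {
               (my $out = $file) =~ s/\.txt$/.tsv/;   # name the output file
               $out =~ s!^.*/!tagged/!;               # put it under tagged/
               print "stanfordtag \"$file\" > \"$out\"\n";
           }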
  • 54. 6. PART-OF-SPEECH TAGGING
  • 55. Part-of-Speech Tagging
       • Part-of-speech tagging is normally done by a sequence model (acronyms: HMM, CRF, MEMM/CMM)
         – A POS tag is to be placed above each word
         – The model considers a local context of possible previous and following POS tags, the current word, neighboring words, and features of them (capitalized? ends in -ing?)
         – Each such feature has a weight, the evidence is combined, and the most likely sequence of tags (according to the model) is chosen

           RB    NNP   NNP    RB    VBD    ,   JJ     NNS
           When  Mr.   Holly  last  wrote  ,   many   years
  • 56. Stanford POS tagger
       http://nlp.stanford.edu/software/tagger.shtml
       $ java -mx1g -cp ../Software/stanford-postagger-full-2011-06-19/stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model ../Software/stanford-postagger-full-2011-06-19/models/left3words-distsim-wsj-0-18.tagger -outputFormat tsv -tokenizerOptions untokenizable=allKeep -textFile She 3155.txt > She 3155.tsv
       Loading default properties from trained tagger ../Software/stanford-postagger-full-2011-06-19/models/left3words-distsim-wsj-0-18.tagger
       Reading POS tagger model from ../Software/stanford-postagger-full-2011-06-19/models/left3words-distsim-wsj-0-18.tagger ... done [2.2 sec].
       Jun 15, 2011 8:17:15 PM edu.stanford.nlp.process.PTBLexer next
       WARNING: Untokenizable: ? (U+1FBD, decimal: 8125)
       Tagged 132377 words at 5559.72 words per second.
       (The untokenizable character is a stand-alone Greek koronis – a little obscure?)
  • 57. Stanford POS tagger
       • For the second time you do it…
       $ alias stanfordtag "java -mx1g -cp /Users/manning/Software/stanford-postagger-full-2011-06-19/stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model /Users/manning/Software/stanford-postagger-full-2011-06-19/models/left3words-distsim-wsj-0-18.tagger -outputFormat tsv -tokenizerOptions untokenizable=allKeep -textFile"
       $ stanfordtag RiderHaggard/King Solomons Mines 2166.txt > tagged/King Solomons Mines 2166.tsv
       Reading POS tagger model from /Users/manning/Software/stanford-postagger-full-2011-06-19/models/left3words-distsim-wsj-0-18.tagger ... done [2.1 sec].
       Tagged 98178 words at 9807.99 words per second.
  • 58. MorphAdorner
       http://morphadorner.northwestern.edu/
       • MorphAdorner is a set of NLP tools developed at Northwestern by Martin Mueller and colleagues specifically for English-language fiction, over a long historical period from Early Modern English onwards
         – lemmatizer, named entity recognizer, POS tagger, spelling standardizer, etc.
       • Aims to deal with variation in word breaking and spelling over this period
       • Includes its own POS tag set: NUPOS
  • 59. MorphAdorner
       $ ./adornplaintext temp temp/3155.txt
       2011-06-15 20:30:52,111 INFO - MorphAdorner version 1.0
       2011-06-15 20:30:52,111 INFO - Initializing, please wait...
       2011-06-15 20:30:52,318 INFO - Using Trigram tagger.
       2011-06-15 20:30:52,319 INFO - Using I retagger.
       2011-06-15 20:30:53,578 INFO - Loaded word lexicon with 151,922 entries in 2 seconds.
       2011-06-15 20:30:55,920 INFO - Loaded suffix lexicon with 214,503 entries in 3 seconds.
       2011-06-15 20:30:57,927 INFO - Loaded transition matrix in 3 seconds.
       2011-06-15 20:30:58,137 INFO - Loaded 162,248 standard spellings in 1 second.
       2011-06-15 20:30:58,697 INFO - Loaded 5,434 alternative spellings in 1 second.
       2011-06-15 20:30:58,703 INFO - Loaded 349 more alternative spellings in 14 word classes in 1 second.
       2011-06-15 20:30:58,713 INFO - Loaded 0 names into name standardizer in < 1 second.
       2011-06-15 20:30:58,779 INFO - 1 file to process.
       2011-06-15 20:30:58,789 INFO - Before processing input texts: Free memory: 105,741,696, total memory: 480,694,272
       2011-06-15 20:30:58,789 INFO - Processing file temp/3155.txt .
       2011-06-15 20:30:58,789 INFO - Adorning temp/3155.txt with parts of speech.
       2011-06-15 20:30:58,832 INFO - Loaded text from temp/3155.txt in 1 second.
       2011-06-15 20:31:01,498 INFO - Extracted 131,875 words in 4,556 sentences in 3 seconds.
       2011-06-15 20:31:03,860 INFO - lines: 1,000; words: 27,756
       2011-06-15 20:31:04,364 INFO - lines: 2,000; words: 58,728
       2011-06-15 20:31:04,676 INFO - lines: 3,000; words: 84,735
       2011-06-15 20:31:04,990 INFO - lines: 4,000; words: 115,396
       2011-06-15 20:31:05,152 INFO - lines: 4,556; words: 131,875
       2011-06-15 20:31:05,152 INFO - Part of speech adornment completed in 4 seconds. 36,100 words adorned per second.
       2011-06-15 20:31:05,152 INFO - Generating other adornments.
       2011-06-15 20:31:13,840 INFO - Adornments written to temp/3155-005.txt in 9 seconds.
       2011-06-15 20:31:13,840 INFO - All files adorned in 16 seconds.
  • 60. Ah, the old days!
       $ ./adornplaintext temp temp/Hunter Quartermain.txt
       2011-06-15 17:18:15,551 INFO - MorphAdorner version 1.0
       2011-06-15 17:18:15,552 INFO - Initializing, please wait...
       2011-06-15 17:18:15,730 INFO - Using Trigram tagger.
       2011-06-15 17:18:15,731 INFO - Using I retagger.
       2011-06-15 17:18:16,972 INFO - Loaded word lexicon with 151,922 entries in 2 seconds.
       2011-06-15 17:18:18,684 INFO - Loaded suffix lexicon with 214,503 entries in 2 seconds.
       2011-06-15 17:18:20,662 INFO - Loaded transition matrix in 2 seconds.
       2011-06-15 17:18:20,887 INFO - Loaded 162,248 standard spellings in 1 second.
       2011-06-15 17:18:21,300 INFO - Loaded 5,434 alternative spellings in 1 second.
       2011-06-15 17:18:21,303 INFO - Loaded 349 more alternative spellings in 14 word classes in 1 second.
       2011-06-15 17:18:21,312 INFO - Loaded 0 names into name standardizer in 1 second.
       2011-06-15 17:18:21,381 INFO - No files found to process.
       • But it works better if you make sure the filename has no spaces in it
  • 61-62. Comparing taggers: Penn Treebank vs. NUPOS
       Penn Treebank          NUPOS
       Holly     NNP          Holly     n1
       ,         ,            ,         ,
       if        IN           if        cs
       you       PRP          you       pn22
       will      MD           will      vmb
       accept    VB           accept    vvi
       the       DT           the       dt
       trust     NN           trust     n1
       ,         ,            ,         ,
       I         PRP          I         pns11
       am        VBP          am        vbm
       going     VBG          going     vvg
       to        TO           to        pc-acp
       leave     VB           leave     vvi
       you       PRP          you       pn22
       that      IN           that      d
       boy       NN           boys      ng1
       s         POS
       sole      JJ           sole      j
       guardian  NN           guardian  n1
       .         .            .         .
       (Note the tokenization difference: NUPOS keeps the genitive "boys" as one token, ng1, where the Penn tokenization splits boy/NN and s/POS.)
  • 63. Stylistic factors from POS
       [Bar chart: counts of JJ (adjective), MD (modal), and DT (determiner) tags, on a 0–14,000 scale, for She, Ayesha, She and Allan, and Wisdom's Daughter]
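       Counts like those in the chart can be read straight off the tagger's tsv output (one word-TAB-tag pair per line); a sketch, with the script name mine:

           #!/usr/bin/perl
           # tagcounts.pl -- usage: perl tagcounts.pl tagged/*.tsv
           my %count;
           while (<>) {
               chomp;
               my ($word, $tag) = split /\t/;
               $count{$tag}++ if defined $tag;
           }
           for my $tag (qw(JJ MD DT)) {
               printf "%-3s %d\n", $tag, $count{$tag} || 0;
           }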
  • 64. 7. NAMED ENTITY RECOGNITION (NER)
  • 65. Named Entity Recognition
       – "the Chad problem"
       Germany's representative to the European Union's veterinary committee Werner Zwingman said on Wednesday consumers should …
       IL-2 gene expression and NF-kappa B activation through CD28 requires reactive oxygen production by 5-lipoxygenase.
  • 66. Conditional Random Fields (CRFs)

           O     PER   PER    O     O      O   O      O
           When  Mr.   Holly  last  wrote  ,   many   years

       • We again use a sequence model – different problem, but same technology
         – Indeed, sequence models are used for lots of tasks that can be construed as labeling tasks that require only local context (to do quite well)
       • There is a background label – O – and labels for each class
       • Entities are both segmented and categorized
  • 67. Stanford NER Features
       • Word features: current word, previous word, next word, a word is anywhere in a +/– 4 word window
       • Orthographic features:
         – Jenny → Xxxx
         – IL-2 → XX-#
       • Prefixes and suffixes:
         – Jenny → <J, <Je, <Jen, …, nny>, ny>, y>
       • Label sequences
       • Lots of feature conjunctions
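       The orthographic features are easy to picture in code; a minimal sketch of the word-shape idea (my simplification: unlike the slide's Xxxx for Jenny, it does not collapse repeated letters):

           #!/usr/bin/perl
           # Collapse a word to a coarse orthographic shape:
           # capitals -> X, lowercase -> x, digits -> #.
           sub word_shape {
               my ($w) = @_;
               $w =~ s/[A-Z]/X/g;
               $w =~ s/[a-z]/x/g;
               $w =~ s/[0-9]/#/g;
               return $w;
           }
           print word_shape("Jenny"), "\n";   # prints Xxxxx
           print word_shape("IL-2"), "\n";    # prints XX-#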
  • 68. Stanford NER
       http://nlp.stanford.edu/software/CRF-NER.shtml
       $ java -mx500m -Dfile.encoding=utf-8 -cp Software/stanford-ner-2011-06-19/stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier Software/stanford-ner-2011-06-19/classifiers/all.3class.distsim.crf.ser.gz -textFile RiderHaggard/She 3155.txt > ner/She 3155.ner

       For thou shalt rule this <LOCATION>England</LOCATION>----"
       "But we have a queen already," broke in <LOCATION>Leo</LOCATION>, hastily.
       "It is naught, it is naught," said <PERSON>Ayesha</PERSON>; "she can be overthrown."
       At this we both broke out into an exclamation of dismay, and explained that we should as soon think of overthrowing ourselves.
       "But here is a strange thing," said <PERSON>Ayesha</PERSON>, in astonishment; "a queen whom her people love! Surely the world must have changed since I dwelt in <LOCATION>Kôr</LOCATION>."
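       The inline-XML output is easy to post-process; a sketch that tallies the tagged mentions (the script name is mine):

           #!/usr/bin/perl
           # nercounts.pl -- usage: perl nercounts.pl < "She 3155.ner"
           my %mentions;
           while (<>) {
               while (m!<(PERSON|LOCATION|ORGANIZATION)>(.*?)</\1>!g) {
                   $mentions{$1}{$2}++;
               }
           }
           for my $class (sort keys %mentions) {
               for my $name (sort { $mentions{$class}{$b} <=> $mentions{$class}{$a} }
                             keys %{ $mentions{$class} }) {
                   print "$class\t$name\t$mentions{$class}{$name}\n";
               }
           }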
  • 69. 8. PARSING
  • 70. Statistical parsing
       • One of the big successes of 1990s statistical NLP was the development of statistical parsers
       • These are trained from hand-parsed sentences ("treebanks"), and know statistics about phrase structure and word relationships, and use them to assign the most likely structure to a new sentence
       • They will return a sentence parse for any sequence of words. And it will usually be mostly right
       • There are many opportunities for exploiting this richer level of analysis, which have only been partly realized.
  • 71. Phrase structure parsing
       • Phrase structure representations have dominated American linguistics since the 1930s
       • They focus on showing words that go together to form natural groups (constituents) that behave alike
       • They are good for showing and querying details of sentence structure and embedding

       [Parse tree for "Bills on ports and immigration were submitted by Senator Brownback":]
       (S (NP (NP (NNS Bills))
              (PP (IN on) (NP (NNS ports) (CC and) (NN immigration))))
          (VP (VBD were)
              (VP (VBN submitted)
                  (PP (IN by) (NP (NNP Senator) (NNP Brownback))))))
  • 72. Dependency parsing
       • A dependency parse shows which words in a sentence modify other words
       • The key notion is that of governors with dependents
       • Widespread use: Pāṇini, early Arabic grammarians, diagramming sentences, …

       [Dependency tree for "Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas":]
       nsubjpass(submitted, Bills)   auxpass(submitted, were)
       prep(submitted, by)           pobj(by, Brownback)
       prep(Bills, on)               pobj(on, ports)
       cc(ports, and)                conj(ports, immigration)
       nn(Brownback, Senator)        appos(Brownback, Republican)
       prep(Republican, of)          pobj(of, Kansas)
  • 73. Stanford Dependencies
       • SD is a particular dependency representation designed for easy extraction of meaning relationships [de Marneffe & Manning, 2008]
         – Its basic form, on the last slide, has each word as-is
         – A "collapsed" form focuses on relations between main words

       [Collapsed dependencies for the same sentence:]
       nsubjpass(submitted, Bills)    auxpass(submitted, were)
       agent(submitted, Brownback)    prep_on(Bills, ports)
       prep_on(Bills, immigration)    conj_and(ports, immigration)
       nn(Brownback, Senator)         appos(Brownback, Republican)
       prep_of(Republican, Kansas)
  • 74. Statistical Parsers
       • There are now many good statistical parsers that are freely downloadable
         – Constituency parsers
           • Collins/Bikel Parser
           • Berkeley Parser
           • BLLIP Parser = Charniak/Johnson Parser
         – Dependency parsers
           • MaltParser
           • MST Parser
       • But I'll show the Stanford Parser
  • 75. Tregex/Tgrep2 – Tools for searching over syntax
  • 76. dreadful things
       She:                               Ayesha:
       amod(day-18, dreadful-17)          amod(clouds-5, dreadful-2)
       amod(day-45, dreadful-44)          amod(debt-26, dreadful-25)
       amod(feast-33, dreadful-32)        amod(doom-21, dreadful-20)
       amod(fits-51, dreadful-50)         amod(fashion-50, dreadful-47)
       amod(form-59, dreadful-58)         amod(form-10, dreadful-7)
       amod(laugh-9, dreadful-8)          amod(oath-42, dreadful-41)
       amod(manifestation-9, dreadful-8)  amod(road-23, dreadful-22)
       amod(manner-29, dreadful-28)       amod(silence-5, dreadful-4)
       amod(marshes-17, dreadful-16)      amod(threat-19, dreadful-18)
       amod(people-12, dreadful-11)
       amod(people-46, dreadful-45)
       amod(place-16, dreadful-15)
       amod(place-6, dreadful-5)
       amod(sight-5, dreadful-4)
       amod(spot-13, dreadful-12)
       amod(thing-41, dreadful-40)
       amod(thing-5, dreadful-4)
       amod(tragedy-22, dreadful-21)
       amod(wilderness-43, dreadful-42)
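       Lists like these can be pulled out of the parser's typed-dependency output with a few lines of Perl (a sketch; the script name is mine):

           #!/usr/bin/perl
           # dreadful.pl -- usage: perl dreadful.pl < She.sd
           # Collect the heads that a given adjective modifies, from amod(...) lines.
           my %heads;
           while (<>) {
               if (m/^amod\((\w+)-\d+,\s*dreadful-\d+\)/) {
                   $heads{lc $1}++;
               }
           }
           for my $h (sort { $heads{$b} <=> $heads{$a} } keys %heads) {
               print "$heads{$h}\tdreadful $h\n";
           }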
  • 77. Making use of dependency structure
       J. Engelberg, Costly Information Processing (AFA, 2009):
       • An efficient market should immediately incorporate all publicly available information.
       • But many studies have shown there is a lag
         – And the lag is greater on Fridays (!)
       • An explanation for this is that there is a cost to information processing
       • Engelberg tests and shows that "soft" (textual) information takes longer to be absorbed than "hard" (numeric) information … it's higher-cost information processing
       • But "soft" information has value beyond "hard" information
         – It's especially valuable for predicting further out in time
  • 78. Evidence from earnings announcements [Engelberg AFA 2009]
       • But how do you use the "soft" information?
       • Simply using the proportion of negative words (from the Harvard General Inquirer lexicon) is a useful predictive feature of future stock behavior
         "Although sales remained steady, the firm continues to suffer from rising oil prices."
       • But this [or text categorization] is not enough. In order to refine my analysis, I need to know that the negative sentiment is about oil prices.
       • He thus turns to use of the typed dependencies representation of the Stanford Parser.
         – Words that negative words relate to are grouped into 1 of 6 categories [5 word lists or "other"]
  • 79. Evidence from earnings announcements [Engelberg 2009]
       • In a regression model with many standard quantitative predictors…
         – Just the negative word fraction is a significant predictor of 3-day or 80-day post-earnings-announcement abnormal returns (CAR)
           • Coefficient −0.173, p < 0.05 for 80-day CAR
         – Negative sentiment about different things has differential effects
           • Fundamentals: −0.198, p < 0.01 for 80-day CAR
           • Future: −0.356, p < 0.05 for 80-day CAR
           • Other: −0.023, p < 0.01 for 80-day CAR
         – Only some of which analysts pay attention to
           • Analyst forecast-for-quarter-ahead earnings is predicted by negative sentiment on Environment and Other but not Fundamentals or Future!
  • 80. Syntactic Packaging and Implicit Sentiment [Greene 2007; Greene and Resnik 2009]
       • Positive or negative sentiment can be carried by words (e.g., adjectives), but often it isn't…
         – These sentences differ in sentiment, even though the words aren't so different:
           • A soldier veered his jeep into a crowded market and killed three civilians
           • A soldier's jeep veered into a crowded market and three civilians were killed
       • As a measurable version of such issues of linguistic perspective, they define OPUS features
         – For domain-relevant terms, OPUS features pair the word with a syntactic Stanford Dependency:
           • killed:DOBJ   NSUBJ:soldier   killed:NSUBJ
  • 81. Predicting Opinions of the Death Penalty [Greene 2007; Greene and Resnik 2009]
       • Collected pro- and anti-death-penalty texts from websites with manual checking
       • Training is cross-validation: training on some pro- and anti- sites and testing on documents from others [can't use site-specific nuances]
       • Baseline is word and word-bigram features in a support vector machine [SVM = good classifier]

           Condition            SVM accuracy
           Baseline             72.0%
           With OPUS features   88.1%

       • 58% error reduction!
  • 82. 9. COREFERENCE RESOLUTION
  • 83. Coreference resolution
       • The goal is to work out which (noun) phrases refer to the same entities in the world
         – Sarah asked her father to look at her. He appreciated that his eldest daughter wanted to speak frankly.
       • ≈ anaphora resolution ≈ pronoun resolution ≈ entity resolution
  • 84. Coreference resolution warnings
       • Warning: the tools we have looked at so far work one sentence at a time – or use the whole document but ignore all structure and just count – but coreference uses the whole document
       • The resources used will grow with the document size – you might want to try a chapter, not a novel
       • Coreference systems normally require processing with parsers, NER, etc. first, and use of lexicons
  • 85. Coreference resolution warnings
       • English-only for the moment…
       • While there are some papers on coreference resolution in other languages, I am aware of no downloadable coreference systems for any language other than English
       • For English, there are a good number of downloadable systems, but their performance remains modest. It's just not like POS tagging, NER or parsing
  • 86. Coreference resolution warnings
       Nevertheless, it's not yet known to the State of California to cause cancer, so let's continue…
  • 87. Stanford CoreNLP
       http://nlp.stanford.edu/software/corenlp.shtml
       • Stanford CoreNLP is our new package that ties together a bunch of NLP tools
         – POS tagging
         – Named Entity Recognition
         – Parsing
         – and Coreference Resolution
       • Output is an XML representation [the only choice at present]
       • Contains a state-of-the-art coreference system!
  • 88. Stanford CoreNLP
       $ java -mx3g -Dfile.encoding=utf-8 -cp "Software/stanford-corenlp-2011-06-08/stanford-corenlp-2011-06-08.jar:Software/stanford-corenlp-2011-06-08/stanford-corenlp-models-2011-06-08.jar:Software/stanford-corenlp-2011-06-08/xom.jar:Software/stanford-corenlp-2011-06-08/jgrapht.jar" edu.stanford.nlp.pipeline.StanfordCoreNLP -file RiderHaggard/Hunter Quatermains Story 2728.txt -outputDirectory corenlp
  • 89-90. What Stanford CoreNLP gives
       – Sarah asked her father to look at her.
       – He appreciated that his eldest daughter wanted to speak frankly.
       • Coreference resolution graph
         – sentence 1, headword 1 (gov)
         – sentence 1, headword 3
         – sentence 1, headword 4 (gov)
         – sentence 2, headword 1
         – sentence 2, headword 4
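       A sketch of pulling the coreference chains back out of that XML. I am assuming, for this CoreNLP release, <coreference> blocks whose <mention> children carry <sentence> and <head> elements; check the element names against a real output file before relying on this:

           #!/usr/bin/perl
           # corefchains.pl -- usage: perl corefchains.pl < corenlp/book.txt.xml
           # Quick and dirty: regexes, not a real XML parser.
           local $/;                          # slurp the whole file
           my $xml = <>;
           my $chain = 0;
           for my $piece (split m!</coreference>!, $xml) {
               next unless $piece =~ m!<mention!;
               $chain++;
               while ($piece =~ m!<mention[^>]*>(.*?)</mention>!gs) {
                   my $m = $1;
                   my ($sent) = $m =~ m!<sentence>(\d+)</sentence>!;
                   my ($head) = $m =~ m!<head>(\d+)</head>!;
                   print "chain $chain: sentence $sent, head word $head\n";
               }
           }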
  • 91. THE REST OF THE LANGUAGES OF THE WORLD
  • 92. English-only?
       • There are a lot of languages out there in the world!
       • But there are a lot more NLP tools for English than anything else
       • However, there is starting to be fairly reasonable support (or the ability to build it) for most of the top 50 or so languages…
       • I'll say a little about that, since some people are definitely interested, even if I've covered mainly English
  • 93. POS taggers for many languages?
       • Two choices:
         1. Find a tagger with an existing model for the language (and period) of interest
         2. Find POS-tagged training data for the language (and period) of interest and train your own tagger
            • Most downloadable taggers allow you to train new models – e.g., the Stanford POS tagger
              – But it may involve considerable data preparation work and understanding, and not be for the faint-hearted
  • 94. POS taggers for many languages?
       • One tagger with good existing multilingual support
         – TreeTagger (Helmut Schmid)
           • http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
           • Bulgarian, Chinese, Dutch, English, Estonian, French, Old French, Galician, German, Greek, Italian, Latin, Portuguese, Russian, Spanish, Swahili
           • Free for non-commercial use, not open source; Linux, Mac, Sparc (not Windows)
         – The Stanford POS Tagger presently comes with:
           • English, Arabic, Chinese, German
       • One place to look for more resources:
         – http://nlp.stanford.edu/links/statnlp.html
           • But it's always out of date, so also try a Google search
  • 95. Chinese example
       • Chinese doesn't put spaces between words
         – Nor did Ancient Greek
       • So almost all tools first require word segmentation
         • I demonstrate the Stanford Chinese Word Segmenter
         • http://nlp.stanford.edu/software/segmenter.shtml
       • Even in English, words need some segmentation – often called tokenization
         • It was being implicitly done before further processing in the examples till now:
           "I'll go."  →  " I 'll go . "
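       A toy version of that English tokenization step (real tokenizers, like the PTBLexer inside the Stanford tools, handle many more cases; this sketch only shows the idea):

           #!/usr/bin/perl
           # toytok.pl -- crude PTB-style tokenization: split punctuation and clitics.
           while (<>) {
               s/([,.;:?!()"])/ $1 /g;           # set off punctuation
               s/'(ll|re|ve|s|d|m)\b/ '$1/gi;    # split clitics: I'll -> I 'll
               s/\bcan't\b/ca n't/gi;            # a couple of special cases
               s/n't\b/ n't/gi;
               s/\s+/ /g;                        # squeeze whitespace
               print "$_\n";
           }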
  • 96. Chinese example
       $ ../Software/stanford-chinese-segmenter-2010-03-08/segment.sh ctb Xinhua.txt utf-8 0 > Xinhua.seg
       $ java -mx300m -cp ../Software/stanford-postagger-full-2011-05-18/stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model ../Software/stanford-postagger-full-2011-05-18/models/chinese.tagger -textFile Xinhua.seg > Xinhua.tag
  • 97. Chinese example
       # space before 。 below!
       $ perl -pe 'if ( ! m/^\s*$/ && ! m/^.{100}/) { s/$/ 。/; }' < Xinhua.seg > Xinhua.seg.fixed
       $ java -mx600m -cp ../Software/stanford-parser-2011-06-15/stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -encoding utf-8 ../Software/stanford-parser-2011-04-17/chineseFactored.ser.gz Xinhua.seg.fixed > Xinhua.parsed
       $ java -mx1g -cp ../Software/stanford-parser-2011-06-15/stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -encoding utf-8 -outputFormat typedDependencies ../Software/stanford-parser-2011-04-17/chineseFactored.ser.gz Xinhua.seg.fixed > Xinhua.sd
  • 98. Other tools
       • Dependency parsers are now available for many languages, especially via MaltParser:
         – http://maltparser.org/
       • For instance, it's used to provide a Russian parser among the resources here:
         – http://corpus.leeds.ac.uk/mocky/
       • The OPUS (Open Parallel Corpus) project collects tools for various languages:
         – http://opus.lingfil.uu.se/trac/wiki/Tagging%20and%20Parsing
       • Look around!
  • 99. Data sources
       • Parsers depend on annotated data (treebanks)
       • You can use a parser trained on news articles, but better resources for humanities scholars will depend on community efforts to produce better data
       • One effort is the construction of Greek and Latin dependency treebanks by the Perseus Project:
         – http://nlp.perseus.tufts.edu/syntax/treebank/
  • 100. PARTING WORDS
  • 101. Applications? (beyond word counts)
       • There are starting to be a few applications in the humanities using richer NLP methods:
       • But only a few…
  • 102. Applications? (beyond word counts)
       – Cameron Blevins. 2011. Topic Modeling Historical Sources: Analyzing the Diary of Martha Ballard. DH 2011.
         • Uses (latent variable) topic models (LDA and friends)
           – Topic models are primarily used to find themes or topics running through a group of texts
           – But, here, also helpful for dealing with spelling variation (!)
           – Uses MALLET (http://mallet.cs.umass.edu/), a toolkit with a fair amount of stuff for text classification, sequence tagging and topic models
             » We also have the Stanford Topic Modeling Toolbox
               • http://nlp.stanford.edu/software/tmt/tmt-0.3/
         • Examines change in diary entry topics over time
  • 103. Applications? (beyond word counts)
       – David K. Elson, Nicholas Dames, Kathleen R. McKeown. 2010. Extracting Social Networks from Literary Fiction. ACL 2010.
         • How size of community in a novel or world relates to amount of conversation
           – (Stanford) NER tagger to identify people and organizations
           – Heuristic matching to name variants/shortenings
           – System for speech attribution (Elson & McKeown 2010)
           – Social network construction
         • Results showing that urban-novel social networks are not richer than those in rural settings, etc.
  • 104. Applications? (beyond word counts)
       – Aditi Muralidharan. 2011. A Visual Interface for Exploring Language Use in Slave Narratives. DH 2011. http://bebop.berkeley.edu/wordseer
         • A visualization and reading interface to American Slave Narratives
           – (Stanford) Parser used to allow searching of particular grammatical relationships: grammatical search
           – Visualization tools to show a word's distribution in a text and to provide a "collapsed concordance" view – and for close reading
         • Example application is exploring the relationship with God
  • 105. Parting words
       This talk has been about tools –
       they're what I know.
       But you should focus on disciplinary insight –
       not on building corpora and tools, but on using them as tools for producing disciplinary research.
