
# Naive Bayesian Text Classifier Event Models

To download slides: http://www.intelligentmining.com/category/knowledge-base/

These are my notes for a presentation I gave internally at IM. It covers both the multinomial and multi-variate Bernoulli event models in Naïve Bayes text classification.
Published on: Mar 3, 2016
Source: www.slideshare.net

#### Transcript - Naive Bayesian Text Classifier Event Models

• 1. Naïve Bayesian Text Classifier Event Models: Notes and How to Use Them. Alex Lin, Senior Architect, Intelligent Mining. alin@intelligentmining.com
• 2. Outline
  - Quick refresher on the Naïve Bayesian text classifier
  - Event models: the multinomial event model and the multi-variate Bernoulli event model
  - Performance characteristics
  - Why are they important?
• 3. Naïve Bayes Text Classifier
  - Supervised learning: a labeled data set is used for training to classify unlabeled data
  - Easy to implement and highly scalable; it is often the first thing to try
  - Successful cases: email spam filtering, news article categorization, product classification, social content categorization
• 4. N.B. Text Classifier - Training
  [Diagram: labeled training documents d1 ... dn, each a sequence of words with label y = 0 or y = 1, are used to estimate the per-class word probabilities P(w_i | y = 1) and P(w_i | y = 0) for every word w_1 ... w_n in the vocabulary.]
• 5. N.B. Text Classifier - Classifying
  To classify a document d = w1 w3 with unknown label, pick the class with "maximum posterior probability":

  argmax_c P(Y = c | w1, w3)
  = argmax_c P(w1, w3 | Y = c) P(Y = c)
  = argmax_c P(Y = c) ∏_n P(w_n | Y = c)

  [Diagram: the stored estimates P(w1 | y=0), P(w1 | y=1), P(w3 | y=0), P(w3 | y=1) are combined as P(Y=0) P(w1 | Y=0) P(w3 | Y=0) versus P(Y=1) P(w1 | Y=1) P(w3 | Y=1).]
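To make the decision rule concrete, here is a minimal sketch in Python. All probabilities below are hypothetical toy numbers, not taken from the slides; `classify` simply picks the class with the larger value of P(Y = c) times the product of per-word likelihoods:

```python
# Hypothetical toy parameters for a two-class model (numbers invented for illustration).
priors = {0: 0.5, 1: 0.5}                  # P(Y = c)
likelihood = {                             # P(w | Y = c)
    0: {"w1": 0.10, "w3": 0.02},
    1: {"w1": 0.05, "w3": 0.20},
}

def classify(words):
    """Pick the class with maximum posterior: argmax_c P(Y=c) * prod_i P(w_i | Y=c)."""
    def score(c):
        p = priors[c]
        for w in words:
            p *= likelihood[c][w]
        return p
    return max(priors, key=score)

print(classify(["w1", "w3"]))  # 1, since 0.5*0.05*0.20 > 0.5*0.10*0.02
```

Note that the evidence P(w1, w3) is never computed; it is the same for both classes, so it cannot change the argmax.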
• 6. Bayes' Theorem

  P(Y | X) = P(X | Y) P(Y) / P(X)

  where P(Y | X) is the posterior, P(X | Y) the likelihood, P(Y) the class prior, and P(X) the evidence.
  - Generative learning algorithms model the likelihood P(X | Y) and the class prior P(Y), then make predictions based on Bayes' theorem.
• 7. Naïve Bayes Text Classifier
  With Y ∈ {0, 1} and a document d = w1, w6, ..., wn, apply Bayes' theorem:

  P(Y = 0 | w1, w6, ..., wn) = P(w1, w6, ..., wn | Y = 0) P(Y = 0) / P(w1, w6, ..., wn)
  P(Y = 1 | w1, w6, ..., wn) = P(w1, w6, ..., wn | Y = 1) P(Y = 1) / P(w1, w6, ..., wn)
• 8. Naïve Bayes Text Classifier
  To find the most likely class label for d:

  argmax_c P(Y = c | w1, w6, ..., wn)
  = argmax_c P(w1, w6, ..., wn | Y = c) P(Y = c) / P(w1, w6, ..., wn)
  = argmax_c P(w1, w6, ..., wn | Y = c) P(Y = c)

  (the evidence in the denominator is the same for every class, so it can be dropped)

  How do we estimate the likelihood?
• 9. Estimating the Likelihood P(X | Y)
  How do we estimate the likelihood P(w1, w6, ..., wn | Y = c)? Use the Naïve Bayes assumption: the words w_i are conditionally independent given Y.

  P(w1, w6, ..., wn | Y = c)
  = P(w1 | Y = c) * P(w6 | Y = c) * ... * P(wn | Y = c)
  = ∏_{i ∈ n} P(w_i | Y = c)

  For the argmax this is equivalent to maximizing Σ_{i ∈ n} log P(w_i | Y = c), which avoids floating-point underflow. The factors P(w_i | Y = c) are estimated differently under the different event models.
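The switch from the product to a sum of logs is not cosmetic: for a long document the raw product of many small probabilities underflows to zero in floating point, while the log sum stays finite. A small sketch (the per-word probability 1e-5 and the document length 400 are made-up values):

```python
import math

probs = [1e-5] * 400          # hypothetical per-word likelihoods for a 400-word document

product = 1.0
for p in probs:
    product *= p              # (1e-5)^400 = 1e-2000: underflows to 0.0 in a float64

log_sum = sum(math.log(p) for p in probs)   # stays a perfectly ordinary finite number

print(product)   # 0.0 -- the naive product is useless for comparing classes
print(log_sum)   # a large negative number, still comparable across classes
```

Since log is monotonic, the class that maximizes the log sum is the same class that would maximize the product if it were computable.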
• 10. Differences Between the Two Models
  Both models estimate the factors in ∏_{i ∈ n} P(w_i | Y = c), but differently.

  Multinomial event model:
  P(w_i | Y = c) = tf_{wi,c} / |c|
  (term frequency of w_i in class c, over the sum of all term frequencies in class c)

  Multi-variate Bernoulli event model:
  P(w_i | Y = c) = df_{wi,c} / N_c
  (document frequency of w_i in class c, over the number of docs in class c)
  When w_i does not occur in d, use the complement 1 − P(w_i | Y = c).
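The two estimators can be written side by side in a few lines. This is a minimal sketch on a hypothetical toy corpus (the documents, words, and labels are invented for illustration); no smoothing is applied here, that comes later in the appendix:

```python
# Hypothetical toy training set: each item is (list of words, class label).
docs = [(["buy", "cheap", "cheap"], 1),
        (["meeting", "agenda"], 0),
        (["cheap", "pills"], 1)]

def multinomial_likelihood(word, c):
    """tf_{w,c} / |c|: term frequency of the word in class c over total term count in c."""
    tf = sum(d.count(word) for d, y in docs if y == c)
    total = sum(len(d) for d, y in docs if y == c)
    return tf / total

def bernoulli_likelihood(word, c):
    """df_{w,c} / N_c: fraction of class-c documents that contain the word at all."""
    df = sum(1 for d, y in docs if y == c and word in d)
    n_c = sum(1 for _, y in docs if y == c)
    return df / n_c

print(multinomial_likelihood("cheap", 1))  # 3 occurrences / 5 class-1 tokens = 0.6
print(bernoulli_likelihood("cheap", 1))    # appears in 2 of 2 class-1 docs = 1.0
```

The contrast is visible immediately: the multinomial estimate counts every occurrence of "cheap", while the Bernoulli estimate only asks whether each document contains it.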
• 11. Comparison of the Two Models
  Multinomial: P(w_i | Y = c) = tf_{wi,c} / |c|

  | Label | w1 | w2 | w3 | w4 | … | wn | Σ |
  |-------|----|----|----|----|---|----|---|
  | Y=0 | tf_{w1,0} | tf_{w2,0} | tf_{w3,0} | tf_{w4,0} | … | tf_{wn,0} | \|c\| |
  | Y=1 | tf_{w1,1} | tf_{w2,1} | tf_{w3,1} | tf_{w4,1} | … | tf_{wn,1} | \|c\| |

  Multi-variate Bernoulli: P(w_i | Y = c) = df_{wi,c} / N_c

  | Label | w1 | w2 | w3 | w4 | … | wn | Σ |
  |-------|----|----|----|----|---|----|---|
  | Y=0 | df_{w1,0} | df_{w2,0} | df_{w3,0} | df_{w4,0} | … | df_{wn,0} | N_0 |
  | Y=0 (absent) | !df_{w1,0} | !df_{w2,0} | !df_{w3,0} | !df_{w4,0} | … | !df_{wn,0} | N_0 |
  | Y=1 | df_{w1,1} | df_{w2,1} | df_{w3,1} | df_{w4,1} | … | df_{wn,1} | N_1 |
  | Y=1 (absent) | !df_{w1,1} | !df_{w2,1} | !df_{w3,1} | !df_{w4,1} | … | !df_{wn,1} | N_1 |
• 12. Comparison of the Two Models
  Multinomial: P(w_i | Y = c) = tf_{wi,c} / |c|
  - d = {w1, w2, ..., wn}, where n is the number of words in d and each w_n ∈ {1, 2, 3, ..., |V|}
  - Decision rule: argmax_c Σ_{i ∈ n} log P(w_i | Y = c) + log(|c| / |D|)

  Multi-variate Bernoulli: P(w_i | Y = c) = df_{wi,c} / N_c
  - d = {w1, w2, ..., wn}, where n is the number of words in the vocabulary |V| and each w_n ∈ {0, 1}
  - Decision rule: argmax_c Σ_{i ∈ n} log( P(w_i | Y = c)^{w_i} (1 − P(w_i | Y = c))^{1 − w_i} ) + log(|c| / |D|)
• 13. Multinomial
  [Diagram: for each word in the document, look up its probability under each class's word distribution: Y=0 with probability A%, Y=1 with probability B%.]
• 14. Multi-variate Bernoulli
  [Diagram: for each word W in the vocabulary, when W exists in the doc use A% for Y=1 and 1−A% for Y=0; when W does not exist in the doc, use B% for Y=1 and 1−B% for Y=0.]
• 15. Performance Characteristics
  - Multinomial outperforms multi-variate Bernoulli in most text classification tasks, especially when the vocabulary size |V| >= 1K
  - Multinomial performs better on data sets with large variance in document length
  - Multi-variate Bernoulli can have an advantage on dense data sets
  - Non-text features can be added as additional Bernoulli variables; however, they should not be added to the vocabulary in the multinomial model
• 16. Why Are They Interesting?
  - Certain data points are more suitable for one event model than the other
  - Examples: web page text + "Domain" + "Author"; social content text + "Who"; product name/description + "Brand"
  - We can create a Naïve Bayes classifier that combines event models
  - Most importantly, try both on your data set
• 17. Thank You
  Any questions or comments?
• 18. Appendix
  - Laplace Smoothing
  - Generative vs. Discriminative Learning
  - Multinomial Event Model
  - Multi-variate Bernoulli Event Model
  - Notation
• 19. Laplace Smoothing
  Multinomial: P(w_i | Y = c) = (tf_{wi,c} + 1) / (|c| + |V|)
  Multi-variate Bernoulli: P(w_i | Y = c) = (df_{wi,c} + 1) / (N_c + 2)
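Without smoothing, a word never seen in class c gets probability zero, and one zero factor wipes out the whole product (or sends the log sum to minus infinity). The two smoothed estimators above can be sketched directly; the example argument values are made up:

```python
def multinomial_smoothed(tf, class_total, vocab_size):
    """(tf_{w,c} + 1) / (|c| + |V|): add one pseudo-count per vocabulary term."""
    return (tf + 1) / (class_total + vocab_size)

def bernoulli_smoothed(df, n_c):
    """(df_{w,c} + 1) / (N_c + 2): +2 because the Bernoulli variable has two outcomes."""
    return (df + 1) / (n_c + 2)

# An unseen word (tf = 0, df = 0) now gets a small but non-zero probability:
print(multinomial_smoothed(0, 100, 50))  # 1/150
print(bernoulli_smoothed(0, 10))         # 1/12
```

The smoothed estimates also stay strictly below 1, so the Bernoulli complement term 1 − P(w_i | Y = c) can never hit log(0) either.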
• 20. Generative vs. Discriminative
  Discriminative learning algorithms try to learn P(Y | X) directly, or try to map input X to labels Y directly.
  Generative learning algorithms try to model P(X | Y) and P(Y) first, then use Bayes' theorem to find P(Y | X) = P(X | Y) P(Y) / P(X).
• 21. Multinomial Event Model
  P(w_i | Y = c) = tf_{wi,c} / |c|
  d = {w1, w2, ..., wn}, where n is the number of words in d and each w_n ∈ {1, 2, 3, ..., |V|}

  argmax_c P(w1, w6, ..., wn | Y = c) P(Y = c)
  = argmax_c Σ_{i ∈ n} log(tf_{wi,c} / |c|) + log(|c| / |D|)
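Putting the multinomial model together end to end (training counts, Laplace smoothing from the previous appendix slide, and the log-space decision rule) gives a short classifier. The training corpus below is a hypothetical four-document toy set invented for illustration:

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (list of words, class label).
train = [(["buy", "cheap", "cheap"], 1),
         (["meeting", "agenda"], 0),
         (["cheap", "pills"], 1),
         (["project", "meeting", "notes"], 0)]

vocab = {w for d, _ in train for w in d}
classes = {y for _, y in train}

tf = defaultdict(Counter)   # tf[c][w]: term frequency of w in class c
class_total = Counter()     # total term count per class (the multinomial |c|)
doc_count = Counter()       # number of docs per class (for the prior)
for d, y in train:
    tf[y].update(d)
    class_total[y] += len(d)
    doc_count[y] += 1

def classify_multinomial(doc):
    """argmax_c  sum_i log((tf_{wi,c}+1)/(|c|+|V|)) + log(docs in c / |D|)."""
    def score(c):
        s = math.log(doc_count[c] / len(train))
        for w in doc:
            s += math.log((tf[c][w] + 1) / (class_total[c] + len(vocab)))
        return s
    return max(classes, key=score)

print(classify_multinomial(["cheap", "pills"]))   # 1
print(classify_multinomial(["meeting", "notes"])) # 0
```

Note that the sum runs only over the words that actually occur in the document, repeated words contributing once per occurrence; that is the defining trait of the multinomial event model.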
• 22. Multi-variate Bernoulli Event Model
  P(w_i | Y = c) = df_{wi,c} / N_c
  d = {w1, w2, ..., wn}, where n is the number of words in the vocabulary |V| and each w_n ∈ {0, 1}

  argmax_c P(w1, w6, ..., wn | Y = c) P(Y = c)
  = argmax_c Σ_{i ∈ n} log( (df_{wi,c} / N_c)^{w_i} (1 − df_{wi,c} / N_c)^{1 − w_i} ) + log(|c| / |D|)
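The Bernoulli model can be sketched the same way, on the same hypothetical toy corpus as above. The key structural difference is that the score sums over the entire vocabulary: words absent from the document contribute the complement term log(1 − P(w_i | Y = c)). Laplace smoothing with N_c + 2 keeps both log arguments strictly positive:

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (list of words, class label).
train = [(["buy", "cheap", "cheap"], 1),
         (["meeting", "agenda"], 0),
         (["cheap", "pills"], 1),
         (["project", "meeting", "notes"], 0)]

vocab = sorted({w for d, _ in train for w in d})
classes = {y for _, y in train}

df = defaultdict(Counter)   # df[c][w]: number of class-c docs containing w
n_c = Counter()             # number of docs per class
for d, y in train:
    df[y].update(set(d))    # set(): each word counted once per document
    n_c[y] += 1

def classify_bernoulli(doc):
    """Sum over the WHOLE vocabulary; absent words contribute log(1 - P(w|c))."""
    present = set(doc)
    def score(c):
        s = math.log(n_c[c] / len(train))
        for w in vocab:
            p = (df[c][w] + 1) / (n_c[c] + 2)   # Laplace-smoothed df_{w,c}/N_c
            s += math.log(p) if w in present else math.log(1 - p)
        return s
    return max(classes, key=score)

print(classify_bernoulli(["cheap"]))             # 1
print(classify_bernoulli(["meeting", "agenda"])) # 0
```

Because only presence matters, repeating a word in the document has no effect here, and scoring a document costs O(|V|) rather than O(length of d).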
• 23. Notation
  - d: a document
  - w: a word in a document
  - X: observed data attributes
  - Y: class label
  - |V|: number of terms in the vocabulary
  - |D|: number of docs in the training set
  - |c|: number of docs in class c (in the multinomial likelihood, |c| denotes the total term count in class c)
  - tf: term frequency
  - df: document frequency
  - !df: number of class-c docs that do not contain the word (N_c − df)
• 24. References
  - McCallum, A., Nigam, K.: A Comparison of Event Models for Naive Bayes Text Classification. In: Learning for Text Categorization: Papers from the AAAI Workshop, AAAI Press (1998) 41–48. Technical Report WS-98-05
  - Metsis, V., Androutsopoulos, I., Paliouras, G.: Spam Filtering with Naive Bayes – Which Naive Bayes? In: Third Conference on Email and Anti-Spam (CEAS) (2006)
  - Schneider, K.: On Word Frequency Information and Negative Evidence in Naive Bayes Text Classification. In: España for Natural Language Processing (EsTAL) (2004)