
NAACL HLT 2010 d-Confidence

Published on: Mar 3, 2016
Published in: Technology      
Source: www.slideshare.net


Transcripts - NAACL HLT 2010 d-Confidence

  • 1. <ul><li>D-Confidence: an active learning strategy which efficiently identifies small classes </li></ul><ul><li>Learning from Incomplete Specifications </li></ul>Nuno Filipe Escudeiro [email_address] Alípio Mário Jorge [email_address]
  • 2. <ul><li>Outline </li></ul><ul><ul><li>Motivations </li></ul></ul><ul><ul><li>D-Confidence </li></ul></ul><ul><ul><li>Evaluation </li></ul></ul><ul><ul><li>Conclusions </li></ul></ul><ul><ul><li>Future Work </li></ul></ul>NAACL HLT, June 6, 2010
  • 3. <ul><li>Fraud detection </li></ul><ul><li>Medical data, disease detection </li></ul><ul><li>Web page classification </li></ul><ul><li>Mail categorization </li></ul><ul><li>… </li></ul><ul><li>Automatic resource organization </li></ul><ul><li>Large corpora </li></ul><ul><li>Unlabeled text documents </li></ul><ul><li>Labeling is expensive </li></ul><ul><li>Need to identify exemplary cases for all labels to learn </li></ul><ul><li>… fast (with few labels) </li></ul>
  • 4. <ul><li>Collecting and annotating exemplary cases </li></ul><ul><ul><ul><li>Critical </li></ul></ul></ul><ul><ul><ul><li>Costly </li></ul></ul></ul><ul><li>Labeling effort related to: </li></ul><ul><ul><ul><li>Number of labels to learn </li></ul></ul></ul><ul><ul><ul><li>Class distribution in the working set </li></ul></ul></ul><ul><ul><ul><li>Sample representativeness </li></ul></ul></ul>
  • 5. <ul><li>Learning settings </li></ul><ul><ul><li>Supervised: high labeling effort </li></ul></ul><ul><ul><li>Unsupervised: low expressiveness </li></ul></ul><ul><ul><li>Semi-supervised: unable to deal with incomplete specifications </li></ul></ul><ul><ul><li>Active learning: judicious selection of cases to label </li></ul></ul><ul><ul><ul><li>Minimize error </li></ul></ul></ul><ul><ul><ul><li>Availability of pre-labeled examples for all classes </li></ul></ul></ul>
  • 6. <ul><li>Active Learning </li></ul><ul><ul><li>Accuracy at low cost </li></ul></ul><ul><ul><li>from a complete specification </li></ul></ul><ul><li>D-Confidence </li></ul><ul><ul><li>Accuracy and Representativeness at low cost </li></ul></ul><ul><ul><li>from incomplete specification </li></ul></ul>
  • 7. <ul><li>D-Confidence </li></ul><ul><ul><li>Active learning strategy selecting queries with: </li></ul></ul><ul><ul><ul><li>Low confidence </li></ul></ul></ul><ul><ul><ul><ul><li>exploitation / accuracy </li></ul></ul></ul></ul><ul><ul><ul><li>High distance to known classes </li></ul></ul></ul><ul><ul><ul><ul><li>exploration / representativeness </li></ul></ul></ul></ul>
  • 8. <ul><li>Intuition </li></ul>
  • 9. <ul><ul><li>Combines low confidence with high distance to bias query selection towards cases from unknown classes located in unexplored regions of the case space </li></ul></ul>
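The slides do not give a formula for combining confidence and distance. As a minimal sketch, assume each unlabeled case is scored per known class by its confidence divided by its median Euclidean distance to the labeled cases of that class, and that the case with the lowest best score is queried next. The function name `d_confidence_query`, the median aggregation and the distance metric are illustrative assumptions, not details taken from the talk.

```python
import numpy as np

def d_confidence_query(conf, X_unlabeled, X_labeled, y_labeled):
    """Return the index of the unlabeled case to query next.

    conf: array (n_unlabeled, n_known_classes) of classifier confidences,
          columns ordered like np.unique(y_labeled).
    Assumption: score = confidence / median distance to the known class.
    """
    classes = np.unique(y_labeled)
    scores = np.empty((len(X_unlabeled), len(classes)))
    for j, c in enumerate(classes):
        members = X_labeled[y_labeled == c]
        # Pairwise Euclidean distances: unlabeled cases x labeled members.
        d = np.linalg.norm(X_unlabeled[:, None, :] - members[None, :, :], axis=2)
        median_dist = np.median(d, axis=1)
        # Small constant avoids division by zero for coincident points.
        scores[:, j] = conf[:, j] / (median_dist + 1e-12)
    # Low confidence and high distance both push the score down.
    return int(np.argmin(scores.max(axis=1)))
```

Under these assumptions, a case that the classifier labels confidently can still be selected when it lies far from every known class, which is the exploration bias described on the slide.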
  • 10. <ul><li>Effect on (SVM) confidence </li></ul>
  • 11. <ul><li>D-Confidence </li></ul><ul><ul><li>Repository (UCI) datasets </li></ul></ul><ul><ul><li>Text corpora </li></ul></ul>
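  • 12.
<table>
<tr><th rowspan="2">Dataset</th><th rowspan="2">#</th><th colspan="11">Class distribution</th></tr>
<tr><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th><th>11</th></tr>
<tr><td>Iris</td><td>150</td><td>50</td><td>50</td><td>50</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td>Cleveland</td><td>298</td><td>161</td><td>53</td><td>36</td><td>35</td><td>13</td><td></td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td>Vowels</td><td>330</td><td>30</td><td>30</td><td>30</td><td>30</td><td>30</td><td>30</td><td>30</td><td>30</td><td>30</td><td>30</td><td>30</td></tr>
<tr><td>SatImg</td><td>500</td><td>125</td><td>48</td><td>96</td><td>46</td><td>67</td><td>118</td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td>Poker</td><td>500</td><td>270</td><td>170</td><td>34</td><td>12</td><td>4</td><td>3</td><td>3</td><td>2</td><td>1</td><td>1</td><td></td></tr>
</table>
<table>
<tr><th rowspan="2">Dataset</th><th rowspan="2">ActiveLearn</th><th colspan="11">1st hit</th></tr>
<tr><th>Class 1</th><th>Class 2</th><th>Class 3</th><th>Class 4</th><th>Class 5</th><th>Class 6</th><th>Class 7</th><th>Class 8</th><th>Class 9</th><th>Class 10</th><th>Class 11</th></tr>
<tr><td rowspan="2">iris</td><td>Conf</td><td>1</td><td>7</td><td>3</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td>dConf</td><td>1</td><td>3</td><td>1</td><td></td><td></td><td></td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td rowspan="2">cleveland</td><td>Conf</td><td>3</td><td>7</td><td>8</td><td>19</td><td>40</td><td></td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td>dConf</td><td>3</td><td>15</td><td>8</td><td>5</td><td>8</td><td></td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td rowspan="2">vowels</td><td>Conf</td><td>3</td><td>10</td><td>14</td><td>31</td><td>12</td><td>27</td><td>29</td><td>15</td><td>31</td><td>18</td><td>24</td></tr>
<tr><td>dConf</td><td>2</td><td>12</td><td>19</td><td>16</td><td>24</td><td>26</td><td>23</td><td>2</td><td>26</td><td>3</td><td>23</td></tr>
<tr><td rowspan="2">satimg</td><td>Conf</td><td>12</td><td>28</td><td>34</td><td>23</td><td>32</td><td>5</td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td>dConf</td><td>9</td><td>1</td><td>4</td><td>10</td><td>3</td><td>10</td><td></td><td></td><td></td><td></td><td></td></tr>
<tr><td rowspan="2">poker</td><td>Conf</td><td>1</td><td>3</td><td>20</td><td>43</td><td>113</td><td>112</td><td>147</td><td>223</td><td>279</td><td>277</td><td></td></tr>
<tr><td>dConf</td><td>3</td><td>2</td><td>5</td><td>9</td><td>45</td><td>97</td><td>98</td><td>68</td><td>100</td><td>65</td><td></td></tr>
</table>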
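The "1st hit" columns are not defined on the slide. Assuming they report the number of queries issued until the first case of each class is retrieved, such values could be computed from the sequence of labels returned by the oracle, as in this sketch. The function name `first_hits` and the example label sequence are hypothetical.

```python
def first_hits(queried_labels):
    """Map each class to the 1-based query number at which it first appeared."""
    hits = {}
    for query_number, label in enumerate(queried_labels, start=1):
        # setdefault keeps the earliest query number for each class.
        hits.setdefault(label, query_number)
    return hits

# Hypothetical sequence of labels returned by successive queries.
print(first_hits(["a", "a", "b", "a", "c"]))
```

Lower values mean a class was discovered with fewer labeled cases, which is the cost the table compares between Conf and dConf.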
  • 13. <ul><li>D-Confidence </li></ul><ul><ul><li>Repository (UCI) datasets </li></ul></ul><ul><ul><li>Text corpora </li></ul></ul>
  • 14. <ul><li>Text corpora </li></ul><ul><ul><li>20 Newsgroups </li></ul></ul><ul><ul><ul><li>500 cases, 20 classes </li></ul></ul></ul><ul><ul><ul><li>most frequent class: 35 cases </li></ul></ul></ul><ul><ul><ul><li>least frequent class: 20 cases </li></ul></ul></ul><ul><ul><li>Reuters-21578 </li></ul></ul><ul><ul><ul><li>1000 cases, 52 classes </li></ul></ul></ul><ul><ul><ul><li>most frequent class: 435 cases </li></ul></ul></ul><ul><ul><ul><li>least frequent class: 2 cases </li></ul></ul></ul><ul><ul><ul><li>42 out of 52 classes with frequency below 10 </li></ul></ul></ul>
  • 15. [Figure: two panels, each with the series Confidence, FarthestFirst and dConfidence]
  • 16. <ul><ul><li>D-Confidence identifies classes faster (lower cost) </li></ul></ul><ul><ul><li>This gain is larger for minority classes </li></ul></ul><ul><ul><li>D-Confidence performs better on imbalanced data </li></ul></ul><ul><ul><li>Error may increase </li></ul></ul><ul><ul><ul><li>Exploration / exploitation </li></ul></ul></ul><ul><ul><ul><li>Representativeness / accuracy </li></ul></ul></ul>
  • 17. <ul><ul><li>Semi-supervised D-Confidence </li></ul></ul><ul><ul><li>Retrieve cases when representativeness assumption fails </li></ul></ul><ul><ul><li>Scalability </li></ul></ul>
  • 18. <ul><li>Thank you! </li></ul>Nuno Filipe Escudeiro [email_address] Alípio Mário Jorge [email_address]
  • 19. <ul><li>D-Confidence </li></ul><ul><ul><li>Simulated datasets </li></ul></ul><ul><ul><li>Repository (UCI) datasets </li></ul></ul><ul><ul><li>Text corpora </li></ul></ul>
  • 20. Simulated datasets
<table>
<tr><th rowspan="2">Factor</th><th colspan="2">Levels (refer to training set properties)</th></tr>
<tr><th>1 (+)</th><th>0 (-)</th></tr>
<tr><td>Colinearity</td><td>colinear centroids</td><td>non-colinear centroids</td></tr>
<tr><td>Balancing</td><td>imbalanced class distribution</td><td>balanced class distribution</td></tr>
<tr><td>Cohesion</td><td>isomorphic classes</td><td>polymorphic classes</td></tr>
<tr><td>Overlapping</td><td>overlapping</td><td>separable</td></tr>
</table>
Response: ErrorGain = gen.error(dConfidence) - gen.error(Confidence)
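  • 21.
<table>
<tr><th>Colinear</th><th>Imbalanced</th><th>Isomorphic</th><th>Overlapping</th></tr>
<tr><td>1 (+)</td><td>1 (+)</td><td>1 (+)</td><td>1 (+)</td></tr>
<tr><td>1 (+)</td><td>1 (+)</td><td>1 (+)</td><td>0 (-)</td></tr>
<tr><td>1 (+)</td><td>1 (+)</td><td>0 (-)</td><td>1 (+)</td></tr>
<tr><td>1 (+)</td><td>1 (+)</td><td>0 (-)</td><td>0 (-)</td></tr>
<tr><td>1 (+)</td><td>0 (-)</td><td>1 (+)</td><td>1 (+)</td></tr>
<tr><td>1 (+)</td><td>0 (-)</td><td>1 (+)</td><td>0 (-)</td></tr>
<tr><td>1 (+)</td><td>0 (-)</td><td>0 (-)</td><td>1 (+)</td></tr>
<tr><td>1 (+)</td><td>0 (-)</td><td>0 (-)</td><td>0 (-)</td></tr>
<tr><td>0 (-)</td><td>1 (+)</td><td>1 (+)</td><td>1 (+)</td></tr>
<tr><td>0 (-)</td><td>1 (+)</td><td>1 (+)</td><td>0 (-)</td></tr>
<tr><td>0 (-)</td><td>1 (+)</td><td>0 (-)</td><td>1 (+)</td></tr>
<tr><td>0 (-)</td><td>1 (+)</td><td>0 (-)</td><td>0 (-)</td></tr>
<tr><td>0 (-)</td><td>0 (-)</td><td>1 (+)</td><td>1 (+)</td></tr>
<tr><td>0 (-)</td><td>0 (-)</td><td>1 (+)</td><td>0 (-)</td></tr>
<tr><td>0 (-)</td><td>0 (-)</td><td>0 (-)</td><td>1 (+)</td></tr>
<tr><td>0 (-)</td><td>0 (-)</td><td>0 (-)</td><td>0 (-)</td></tr>
</table>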
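The sixteen rows on this slide are every combination of the four two-level factors, that is, a full 2^4 factorial design. The sketch below regenerates the rows in the same order and shows one common way to summarize such a design, the main effect of a factor (mean response at level 1 minus mean response at level 0). The slides do not say which analysis was used, so `main_effect` is an assumption, and the response values are made up for illustration only.

```python
from itertools import product

FACTORS = ("Colinear", "Imbalanced", "Isomorphic", "Overlapping")

def full_factorial(n_factors=len(FACTORS)):
    """All 2**n_factors level combinations, listing level 1 (+) before 0 (-)."""
    return list(product((1, 0), repeat=n_factors))

def main_effect(design, responses, factor_index):
    """Mean response at level 1 minus mean response at level 0 for one factor."""
    high = [r for row, r in zip(design, responses) if row[factor_index] == 1]
    low = [r for row, r in zip(design, responses) if row[factor_index] == 0]
    return sum(high) / len(high) - sum(low) / len(low)

design = full_factorial()
# Made-up responses, NOT the ErrorGain values from the talk:
# they depend only on the first two factors.
fake_error_gain = [2.0 * c - 3.0 * i for c, i, _, _ in design]
for index, name in enumerate(FACTORS):
    print(name, main_effect(design, fake_error_gain, index))
```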
  • 22. <ul><li>Error </li></ul>
<table>
<tr><th>Colinearity</th><th>Imbalanced</th><th>Isomorphic</th><th>Overlapping</th></tr>
<tr><td>4,241</td><td>-3,835</td><td>-15,459</td><td>1,296</td></tr>
</table>
  • 23. <ul><li>Finding cases from all classes </li></ul>
  • 24. <ul><li>Meta-Learning </li></ul><ul><ul><li>Colinearity: correlation coefficient, r, among cluster centroids; colinear when |r| ~ 1 </li></ul></ul><ul><ul><li>Balancing: variance of n_k; balanced when var(n_k) ~ 0 </li></ul></ul><ul><ul><li>Cohesion: #classes divided by #clusters; cohesive when ~ 1; representativeness fails (or highly overlapping clusters) when > 1 </li></ul></ul><ul><ul><li>Overlapping: inter-cluster inertia divided by intra-cluster inertia; separable when >> 1 </li></ul></ul>
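The four measures above can be sketched in code under a few stated assumptions: data are two-dimensional so that r is the Pearson correlation between the x and y coordinates of the cluster centroids, n_k is taken to be the size of cluster k, intra-cluster inertia is the sum of squared distances of cases to their cluster centroid, and inter-cluster inertia is the size-weighted sum of squared distances of centroids to the global centroid. The slide does not spell out these definitions, and the function name `meta_features` is hypothetical.

```python
import numpy as np

def meta_features(X, clusters, n_classes):
    """Compute the four meta-learning measures for 2-D data (see assumptions)."""
    X = np.asarray(X, dtype=float)
    clusters = np.asarray(clusters)
    ids, counts = np.unique(clusters, return_counts=True)
    centroids = np.array([X[clusters == k].mean(axis=0) for k in ids])
    global_centroid = X.mean(axis=0)
    # Intra-cluster inertia: squared distances of cases to their own centroid.
    intra = sum(((X[clusters == k] - c) ** 2).sum() for k, c in zip(ids, centroids))
    # Inter-cluster inertia: size-weighted squared distances between centroids
    # and the global centroid.
    inter = (counts * ((centroids - global_centroid) ** 2).sum(axis=1)).sum()
    r = np.corrcoef(centroids[:, 0], centroids[:, 1])[0, 1]
    return {
        "colinearity": abs(r),            # colinear when close to 1
        "balancing": counts.var(),        # balanced when close to 0
        "cohesion": n_classes / len(ids), # cohesive when close to 1
        "overlapping": inter / intra,     # separable when much larger than 1
    }
```

The sketch does not guard against degenerate inputs such as a single cluster or zero intra-cluster inertia.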
