
Named Entity Recognition for Romanian

Published on: Mar 3, 2016
Published in: Technology, Business
Source: www.slideshare.net


Transcripts - Named Entity Recognition for Romanian

  • 1. Adrian Iftene (1), Diana Trandabăț (1), Mihai Toader (2), Marius Corîci (2). KEPT Conference, 4-6 July, Cluj-Napoca, Romania, Babes-Bolyai University. (1) “Al. I. Cuza” University of Iasi, Romania, Faculty of Computer Science; (2) Intelligentics, Cluj-Napoca, Romania
  • 2. <ul><li>The problem that we address </li></ul><ul><li>Proposed Solution </li></ul><ul><li>Named Entity Recognition </li></ul><ul><ul><li>Named Entity Identification </li></ul></ul><ul><ul><li>Named Entity Classification </li></ul></ul><ul><ul><li>Evaluation (Upper Bound, Real Context) </li></ul></ul><ul><li>Examples </li></ul><ul><li>Conclusions </li></ul>
  • 3. <ul><li>We want to find out users’ opinions on various products, events or persons: </li></ul><ul><ul><li>I want to buy a certain product (e.g. iPhone). What are its strengths and weaknesses? What are the opinions of people who have already used it? </li></ul></ul><ul><ul><li>I am the manager of a big company. I am interested in what people say about my company. Which products have a good or bad impact? What policy should I adopt next? </li></ul></ul><ul><ul><li>I am a candidate in the next elections. What should I change in my political discourse? Why do some voters not like me? </li></ul></ul>
  • 4. LiSS Conference, 3-5 May, Iasi
  • 5. <ul><li>NER: the task of finding textual expressions such as the names of persons, organizations, locations, places, etc. </li></ul><ul><li>Existing work: </li></ul><ul><ul><li>Statistical models (Nadeau and Sekine, 2007) - require a large amount of manually annotated training data </li></ul></ul><ul><ul><li>Machine learning techniques (Scurtu et al., 2009), (Nadeau, 2007) - require large training data </li></ul></ul><ul><ul><li>For Romanian: (Cucerzan and Yarowsky, 1999), (Ion, 2007) and (Machison, 2009) - NER gazetteer for Romanian included in GATE (Cunningham et al., 2002) </li></ul></ul>
  • 6. <ul><li>We consider the following categories: Person, Organization, Company, Region, Place, City, Country, Product, Brand, Model, and Publication </li></ul><ul><li>Named Entity Identification: based on segmentation, tokenizer and lemmatizer components </li></ul><ul><li>Named Entity Classification: based on lists, rules and trigger words </li></ul>
  • 7. <ul><li>Every token that starts with a capital letter is considered a candidate named entity </li></ul><ul><li>When a candidate is the first token in a sentence: </li></ul><ul><li>If it is in our stop word list, we eliminate it from the named entity candidates; </li></ul><ul><li>If it is in our common word list, we check two cases: </li></ul><ul><ul><li>the common word is followed by lowercase words (we check them against a list of trigger words). Examples: Universitatea din Cluj-Napoca (En: University of Cluj-Napoca), Țara de Jos (En: Low Country) </li></ul></ul><ul><ul><li>the common word is followed by uppercase words. Example: Doctor Stomatolog (En: Doctor Dentist) </li></ul></ul>
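The identification rules on this slide can be sketched in Python. This is a minimal illustration, not the authors' implementation: the word lists `STOP_WORDS`, `COMMON_WORDS` and `TRIGGER_WORDS` are tiny hypothetical stand-ins for the system's resources, and the sketch assumes that a sentence-initial common word is kept as a candidate when it is followed by a trigger word or by an uppercase word.

```python
# Hypothetical word lists; the real resources are not shown in the slides.
STOP_WORDS = {"la", "pe", "un", "o"}
COMMON_WORDS = {"universitatea", "doctor"}
TRIGGER_WORDS = {"din", "de"}  # lowercase words allowed after a common word


def is_candidate(tokens, i):
    """Return True if tokens[i] is a named-entity candidate."""
    tok = tokens[i]
    if not tok[:1].isupper():
        return False  # only capitalized tokens are candidates
    if i > 0:
        return True  # capitalized and not sentence-initial
    low = tok.lower()
    if low in STOP_WORDS:
        return False  # sentence-initial stop word: eliminated
    if low in COMMON_WORDS:
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        # Assumed reading: keep the common word only when a trigger word
        # or an uppercase word follows it.
        return nxt in TRIGGER_WORDS or nxt[:1].isupper()
    return True


print(is_candidate(["Universitatea", "din", "Cluj-Napoca"], 0))
print(is_candidate(["Doctor", "Stomatolog"], 0))
```

Both calls report a candidate under the assumed reading; a sentence such as "Doctor este aici" would not, because the common word is followed by a lowercase word that is not a trigger.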
  • 8. <ul><li>On the identified candidates we apply rules that unify adjacent candidates, in order to obtain composed named entity candidates: </li></ul><ul><li>Rules related to person titles: Doctorul Popescu (En: Doctor Popescu), Președintele Băsescu (En: President Băsescu) </li></ul><ul><li>Rules related to organization types: Universitatea Cuza (En: Cuza University) </li></ul><ul><li>Rules related to abbreviations: S.C. Travis </li></ul><ul><li>Rules related to special punctuation signs: Ana-Maria </li></ul><ul><li>Rules related to named entity candidates separated by stop words: BCR Banca pentru Locuințe (En: BCR Housing Bank), Direcția pentru Sănătate (En: Department of Health) </li></ul><ul><li>Rules for a specific model/product: Qosmio X500-Q930, Portege R835 </li></ul>
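A rough Python sketch of the unification step follows. It is an illustration under simplifying assumptions, not the system's rule set: capitalization stands in for "is a candidate", and `CONNECTORS` is a hypothetical list of stop words that may sit between two candidates. Only the adjacency rule and the stop-word rule are modelled.

```python
# Hypothetical stop words that may join two candidates into one entity.
CONNECTORS = {"pentru", "de", "din"}


def unify(tokens):
    """Merge adjacent capitalized tokens into composed candidates,
    bridging a connector word when another candidate follows it."""
    entities, current = [], []
    for i, tok in enumerate(tokens):
        next_is_candidate = i + 1 < len(tokens) and tokens[i + 1][:1].isupper()
        if tok[:1].isupper():
            current.append(tok)
        elif current and tok.lower() in CONNECTORS and next_is_candidate:
            current.append(tok)  # connector between two candidates
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


print(unify(["Doctorul", "Popescu", "vine", "la", "BCR", "Banca", "pentru", "Locuinte"]))
```

In this sketch a title such as Doctorul is merged with the following name simply because both tokens are capitalized; a fuller version would consult a title list, as the slide describes.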
  • 9. <ul><li>We consider the following major categories: City, Organization, Company, Country, Person; additionally, we consider categories such as Brand, Product and Publication </li></ul><ul><li>For almost all major categories we consider subcategories: </li></ul><ul><ul><li>For Cities we consider Romanian, European, American and Other Cities </li></ul></ul><ul><ul><li>For Organizations we consider Parties, Faculties, Universities, Ministries, etc. </li></ul></ul><ul><ul><li>For Persons we consider Sportsmen, Politicians, Males, Females, etc. </li></ul></ul><ul><li>A total of 14 main categories with 98 sub-categories </li></ul>
  • 11. <ul><li>Classification Rules: </li></ul><ul><li>Rules used in the unification of NE candidates </li></ul><ul><li>Pure resource-based rules: for the Title type </li></ul><ul><li>Contextual rules: we combine regular expressions with the entities available in our files, for the Organization, Company, Person, City and Country types </li></ul><ul><ul><li>For example, oraș, capitală, târg, localitate (En: city, capital, town, locality) are triggers for the City type, </li></ul></ul><ul><ul><li>companie, corporație (En: company, corporation) are triggers for the Company type, </li></ul></ul><ul><ul><li>partid, bancă, universitate (En: party, bank, university) are triggers for the Organization type </li></ul></ul><ul><ul><li>All titles identified during named entity identification are triggers for the Person type </li></ul></ul>
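The trigger-based part of the contextual rules can be illustrated with a small Python sketch. The trigger words come from the slide; everything else is an assumption made for illustration: the dictionary layout, the function name `classify_by_context`, and the simplification that the rule looks only at the lemma of the word immediately preceding the entity (the slide describes a mix of regular expressions and resource lookups, which is not reproduced here).

```python
# Trigger lemmas from the slide; the data layout is an assumption.
TRIGGERS = {
    "City": {"oraș", "capitală", "târg", "localitate"},
    "Company": {"companie", "corporație"},
    "Organization": {"partid", "bancă", "universitate"},
}


def classify_by_context(prev_lemma):
    """Return the NE type triggered by the lemma that precedes
    the entity, or None when no trigger matches."""
    lemma = prev_lemma.lower()
    for ne_type, lemmas in TRIGGERS.items():
        if lemma in lemmas:
            return ne_type
    return None


print(classify_by_context("partid"))
print(classify_by_context("companie"))
```

Working on lemmas rather than surface forms means inflected triggers would need the lemmatizer mentioned on the earlier slide before this lookup applies.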
  • 12. <ul><li>Gold: 48 files with 24,244 words and 1,638 NEs </li></ul>
  • 13. <ul><li>Agreement between annotators: “PDL Cluj-Napoca” (organization), or “PDL” (organization) and “Cluj-Napoca” (city)? </li></ul><ul><li>When the first word of a sentence is a common word: Ana (a Romanian female name, or a rope used on boats)? </li></ul><ul><li>Special characters at the beginning of a line: segmentation problems </li></ul>
  • 14. <ul><li>The percentage of matched and partially matched entities that have been properly categorized is 95.71% </li></ul><ul><li>The main problems in NE classification come from NEs that appear in more than one list of NEs </li></ul>
  • 15. <ul><li>38 files with a total of 19,509 words and 1,215 NEs </li></ul><ul><li>Upper bound: 95.12% (P), 96.40% (R), 95.76% (F), 1.81% (PP), 1.83% (PR) </li></ul>
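As a quick consistency check on the upper-bound figures, the reported F value matches the balanced F-measure (harmonic mean) of the reported P and R. The slide does not define F, so reading it as the balanced F-measure is an assumption; the short Python sketch below only reproduces the arithmetic.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# P and R as reported on the slide, in percent.
print(round(f_measure(95.12, 96.40), 2))  # 95.76
```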
  • 16. <ul><li>The problems from the upper bound evaluation remain the same </li></ul><ul><li>Additionally, new problems appear related to the extraction of entities of type Title (which are written in lowercase) </li></ul><ul><li>The problems related to Title represent 3.70% </li></ul><ul><li>Error cases are represented by the following words: “călugăriță, soră, colonel, viceprimar, co-președinți” (En: nun, nurse, colonel, vice mayor, co-presidents) </li></ul>
  • 17. <ul><li>The percentage of matched and partially matched entities that have been properly categorized is 66.73% </li></ul><ul><li>Error distribution across the named entity types </li></ul>
  • 18. <ul><li>For the Company, Organization and Person types, the NEs were not found in our resources and the contextual rules could not be applied </li></ul><ul><li>The Publication and Product types are frequently confused with each other </li></ul><ul><li>For the Region type, the major cause of errors is that the respective NE also exists in the resources for other types, such as City, Place, Country </li></ul><ul><li>An interesting example is the case of PNL (which does not exist in our resources): when it is preceded by the word partid (En: party), it is correctly classified as Organization, but in all other cases the system does not identify any type for it </li></ul>
  • 22. <ul><li>In this paper we present a system based on rules and on a list of resources, used in the identification and classification of Romanian named entities </li></ul><ul><li>The system is able to distinguish between 14 main NE types </li></ul><ul><li>Future work will address the problems caused by common words at the beginning of sentences </li></ul><ul><li>Another future direction is related to anaphora, in order to transfer the type of a classified entity to all expressions that refer to it </li></ul>
  • 23. <ul><li>The research presented in this paper was funded by the Sectoral Operational Program for Human Resources Development through the project “Development of the innovation capacity and increasing of the research impact through post-doctoral programs” POSDRU/89/1.5/S/49944 </li></ul><ul><li>The authors of this paper thank their colleagues Alexandru Ginsca, Emanuela Boros, Augusto Perez and Dan Cristea from the Faculty of Computer Science, Iasi </li></ul>
  • 24. <ul><li>David Nadeau and Satoshi Sekine, A survey of named entity recognition and classification, Linguisticae Investigationes 30 (2007), no. 1, pp. 3-26, John Benjamins Publishing Company. </li></ul><ul><li>Silviu Cucerzan and David Yarowsky, Language independent named entity recognition combining morphological and contextual evidence, 1999, pp. 90-99. </li></ul><ul><li>H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, GATE: A framework and graphical development environment for robust NLP tools and applications, Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002. </li></ul><ul><li>Radu Ion, Word sense disambiguation methods applied to English and Romanian, PhD Thesis, 2007. </li></ul><ul><li>Lucian Mihai Machison, Named entity recognition for Romanian (roner), Proceedings of the International Conference on Knowledge Engineering, Principles and Techniques, KEPT2009, 2009, pp. 53-56. </li></ul><ul><li>V. Scurtu, E. Stepanov, and Y. Mehdad, Italian named entity recognizer participation in NER task @ EVALITA 09, 2009. </li></ul><ul><li>David Nadeau, Semi-supervised named entity recognition: Learning to recognize 100 entity types with little supervision, PhD Thesis, 2007. </li></ul>
