Named entity extraction tools for raw OCR text

Text Analysis Seminar at the Göttingen Centre for Digital Humanities, 04.07.2012. In this lecture I present an experiment comparing how well several named entity extraction (NEE) tools extract entities directly from the output of an optical character recognition (OCR) workflow. The presentation discusses the creation of a test set consisting of raw and manually corrected OCR output, and compares the precision and recall of the extraction of entities of type PERSON, LOCATION and ORGANIZATION against manually annotated test data.


Transcripts - Named entity extraction tools for raw OCR text

  • 1. Named entity extraction tools for raw OCR text
       Kepa J. Rodriguez, GCDH colloquium, 04.07.2012
  • 2. Outline
       • Context of the experiments at the EHRI project
       • Description of the experiment
       • Corpus data
       • Creation and composition of the corpus
       • Results of the NE extraction
       • Conclusions
  • 3. Context in the EHRI project
       • Archival institutions have big amounts of non-digitized documents and descriptions
       • EHRI will provide its partners with an OCR service that:
         – Extracts text from image files of the documents
         – Produces text that can be used to index the documents and improve the quality of search
         – Produces indexes that can later be validated and improved by collection and archive specialists
       • What kind of indexes can be obtained from this noisy text?
       • The quality of OCR transcripts is very low for humans, but is it useful for machines?
  • 4. Experiment
       • Evaluation of four existing NE extraction tools:
         – Stanford NER
         – OpenCalais
         – OpenNLP
         – Alchemy
       • Extracted entity types: PER, LOC, ORG
         – Good coverage by the selected tools.
         – Highly relevant for Shoah research and contemporary historical research in general.
  • 5. Experiment
       • Different tools use different annotation tagsets, so the output has to be normalized (see the sketch below).
       • Stanford NER and OpenNLP use Person, Location and Organization as annotation categories.
         – Direct mapping to PER, LOC and ORG
       • OpenCalais:
         – Country, City and NaturalFeature merged into LOC
         – Organization and Facility merged into ORG
       • Alchemy:
         – Organization, Facility and Company merged into ORG
         – City and Continent merged into LOC
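To make the normalization concrete, here is a minimal Python sketch. It assumes each tool's raw output has already been parsed into (text, label) pairs; the mapping tables follow the slide, while the function and the sample entities are illustrative.

```python
# Minimal sketch of the tagset normalization described on slide 5.
# Assumes tool output is already parsed into (text, label) pairs.

# Per-tool label maps, mirroring the merges listed above.
# (Person -> PER is assumed; the slide lists only LOC/ORG merges.)
OPENCALAIS_MAP = {
    "Person": "PER",
    "Country": "LOC", "City": "LOC", "NaturalFeature": "LOC",
    "Organization": "ORG", "Facility": "ORG",
}
ALCHEMY_MAP = {
    "Person": "PER",
    "City": "LOC", "Continent": "LOC",
    "Organization": "ORG", "Facility": "ORG", "Company": "ORG",
}
# Stanford NER and OpenNLP map one-to-one.
DIRECT_MAP = {"Person": "PER", "Location": "LOC", "Organization": "ORG"}

def normalize(entities, label_map):
    """Map tool-specific labels to PER/LOC/ORG; drop anything else."""
    return [(text, label_map[label])
            for text, label in entities
            if label in label_map]

# Illustrative entities, not real tool output:
print(normalize([("Portsmouth", "City"), ("H.M.S. Kelly", "Facility")],
                OPENCALAIS_MAP))
# -> [('Portsmouth', 'LOC'), ('H.M.S. Kelly', 'ORG')]
```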
  • 6. Corpus data
       • Two datasets of typewritten monospaced text
       • Wiener Library
         – 17 pages of testimonies of Shoah survivors
         – OCR word accuracy 93%
       • King's College London Serving Soldier Archive
         – 33 newsletters written for the crew of the warship H.M.S. Kelly
         – OCR word accuracy 92.5% (the metric is sketched below)
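Word accuracy can be read as the fraction of ground-truth words that survive OCR intact. Below is a sketch of one plausible way to compute it, aligning OCR tokens against a corrected transcript with Python's standard sequence matcher; how the figures above were actually measured is not stated on the slide.

```python
# Sketch of a word-accuracy computation (the ~93% figures above).
# Assumption: accuracy = words matching a corrected transcript after
# sequence alignment, divided by the transcript's word count.
from difflib import SequenceMatcher

def word_accuracy(ocr_text: str, gold_text: str) -> float:
    ocr, gold = ocr_text.split(), gold_text.split()
    matcher = SequenceMatcher(a=gold, b=ocr, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(gold)

# Toy strings loosely modeled on the WL excerpt on the next slides:
gold = "the landlord Mr. and Mrs. Wolkewitz who had always gone"
ocr  = "the landlord Mr.and Mrs.Wolkewitz who had a1ways gone"
print(f"{word_accuracy(ocr, gold):.1%}")  # 50.0%
```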
  • 7. Corpus data (WL)
  • 8. Corpus data (WL) [raw OCR output, reproduced verbatim]
       ¢3ohad been sold, and we dependedgxhe last night of our stay on thefriendliness of this neighbour. III!! The landlord Mr.and Mrs.Wolkewitz, who had always gone out of their way to be kind to us,had a collection arranged to us, and_wn finally left - on thenight of July 4-5, 1939 - all the tenqnts or the house hadassembled, and we all cried.All people mentioned so for have either been friends oracqndintanoes. There were others e.g. the grocer and the laundrywho refused payment before our departure, end there are twoindidente with German officials which I would like to tell:
  • 9. Corpus data (KCL)
  • 10. Corpus data (KCL) [raw OCR output, reproduced verbatim]
        :_» I |“- _li; A 1 U g _:__ L, £g!g;“ »“K” D. F. NEws.,pNo. 24,~ "Monday, 18th September, 1959.KELLY at Sea. _ PKINGSTQN at portsmouth, Remainder of "K" Flotilla building.THE "K" D.E. NEwS IS NCT To EE TAKEN ASHCRE NCR ARE ANY or ITSCONTENTS To EE CCNRUNICATED CUTSIEE THE SHIP UNTIL THE MAR ISOVER, wHEN ARRANGEMENTS CAN EE MADE To SUPPLY BACE CCPIES PCRTHE PRICE CR THE PAPER oN WHICH THEY ARE PRINTED.`________________________as--sauna-__-as-_un-_._-»_.__--.`¢___.-_-n__________..¢.__THE KELLYS HUNT - SEPTENEER Ietn/Ivtn,
  • 11. Corpus data (KCL) [raw OCR output, reproduced verbatim]
        Although the events of Saturday night and Sundaymorning are Weil known to the KELLY shipis Company. they areincluded here as being of interest to the rest of the Flotilla. `Shortly after dark information was received which enabledCourse to be altered to close a German submarine on the surface.Before the KELLY could arrive the submarine had dived, but aPemarkably good contact was obtained, and an attC0ntact was maintained all night in order that the final attackSh0uld be carried out by daylight- Unfortunately no Oil, wreckageOP Survivors came to the surface, but air bUbb1€S appeared after the1&St attack, which makes it possible, although by no means certain,that the submarine was destroyed. - _THE KINGSTON’S PROGRAIME. ~ -Today the KINGSTON will be inspected by the Commander-in~Chief, Portsmouth, and will then proceed to sea for acceptance
  • 12. Construction of the corpus
        • Generate two copies of each dataset
        • Manual correction of one of the copies
          – Used to evaluate the impact of the noise on the NE extraction
        • Tokenization and POS tagging using TreeTagger
        • Conversion of the TreeTagger output into standard stand-off XML
        • Import of the data into the MMAX2 annotation tool
        • Manual annotation of the named entities
        • Control of the reliability of the annotation using the Kappa coefficient (a sketch follows)
          – K = 0.93
          – K > 0.8 is considered reliable
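A short sketch of the agreement check: Cohen's kappa discounts the agreement two annotators would reach by chance. The token-level labels below are invented for illustration; the slide does not say how the two annotations were aligned.

```python
# Cohen's kappa over two annotators' token labels (illustrative data).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lbl] * freq_b[lbl] for lbl in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

ann1 = ["PER", "O", "LOC", "O", "ORG", "O", "PER", "O"]
ann2 = ["PER", "O", "LOC", "O", "O",   "O", "PER", "O"]
print(f"kappa = {cohens_kappa(ann1, ann2):.2f}")  # kappa = 0.79
```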
  • 13. Corpus data

                      Wiener Library          KCL
                      Raw     Corrected       Raw      Corrected
        Files         17      17              33       33
        Words         4415    4398            16982    15693
        PER           75      83              82       80
        LOC           60      63              170      178
        ORG           13      13              52       60
        Total         148     159             305      319
  • 14. Results of the NE extraction
  • 15. Results of the NE extraction

                            Raw                     Corrected
                            P      R      F1        P      R      F1
        Alchemy (AL)        0.61   0.38   0.47      0.63   0.38   0.48
        OpenCalais (OC)     0.75   0.29   0.41      0.69   0.30   0.42
        OpenNLP (ON)        0.42   0.12   0.19      0.53   0.13   0.21
        Stanford NER (ST)   0.57   0.52   0.54      0.60   0.61   0.60
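For reference, the scores in the table follow the standard definitions, sketched below over exact (text, type) matches. Whether the evaluation counted exact or partial span matches is not stated, so exact matching is an assumption here.

```python
# Precision, recall and F1 over exact (entity text, type) matches.
def prf1(gold: set, predicted: set):
    tp = len(gold & predicted)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented gold and system entities for illustration:
gold = {("Wolkewitz", "PER"), ("Portsmouth", "LOC"), ("SS", "ORG")}
pred = {("Wolkewitz", "PER"), ("Portsmouth", "LOC"), ("Kelly", "PER")}
print("P=%.2f R=%.2f F1=%.2f" % prf1(gold, pred))  # P=0.67 R=0.67 F1=0.67
```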
  • 16. Results of the NE extraction
        • Low performance of the tools on both corrected and raw text
        • Our data and the data used for training and evaluating the tools are quite different.
        • PER: non-standard forms such as
          – [Last name, First name]: "Wa1ter, Klaus"
          – Parentheses together with the initials of the name: "Captain (D)"
          – Some cases can be resolved using easy heuristics in preprocessing (see the sketch below)
        • Names of persons and locations are used for other kinds of entities:
          – Warships have been annotated as PER
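One plausible shape for the "easy heuristics in preprocessing" mentioned above, as a sketch: a rule that undoes the common OCR confusion of digit 1 for letter l inside a word (as in "Wa1ter"), and one that rewrites [Last name, First name] pairs into natural order. Both patterns are illustrative, not the project's actual pipeline, and the name-swapping rule would need guards against false matches such as comma-separated place names.

```python
import re

def preprocess(text: str) -> str:
    # "Wa1ter" -> "Walter": a digit 1 flanked by lowercase letters.
    text = re.sub(r"(?<=[a-z])1(?=[a-z])", "l", text)
    # "Walter, Klaus" -> "Klaus Walter" for capitalized name pairs.
    text = re.sub(r"\b([A-Z][a-z]+), ([A-Z][a-z]+)\b", r"\2 \1", text)
    return text

print(preprocess("Wa1ter, Klaus arrived in Ber1in."))
# -> "Klaus Walter arrived in Berlin."
```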
  • 17. Results of the NE extraction
        • Performance of extraction of entities of type ORG is very low
          – F1 between 0.11 and 0.32
          – Names of organizations appear in non-standard forms
          – Some of the organizations no longer exist and are not part of the knowledge used to train the systems.
            • The SS and other relevant Nazi organizations have not been detected.
        • Spelling errors and typos in the original files:
          – OpenCalais uses general world knowledge to resolve this problem
          – Use of general knowledge may be problematic:
            • "Klan, Walter" → "Ku Klux Klan"
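The next slide proposes domain-specific authority files as the longer-term remedy. As a sketch of how that could avoid mis-resolutions like the one above, here is fuzzy matching against a small authority list; the entries and the similarity cut-off are invented for illustration.

```python
# Validate ORG candidates against a domain authority file instead of
# general world knowledge. The list and cut-off are assumptions.
from difflib import get_close_matches

AUTHORITY_ORGS = ["Schutzstaffel (SS)", "Gestapo", "Wehrmacht",
                  "Hitlerjugend", "NSDAP"]

def validate_org(candidate: str, cutoff: float = 0.8):
    """Return the closest authority entry, or None if unsupported."""
    matches = get_close_matches(candidate, AUTHORITY_ORGS, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(validate_org("Gestapo"))  # 'Gestapo'
print(validate_org("Klan"))     # None: no domain evidence, so do not
                                # expand it to "Ku Klux Klan"
```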
  • 18. Conclusions
        • Manual correction of the OCR output does not significantly improve performance.
          – Raw output is enough to obtain provisional index candidates
        • Focus in the near term:
          – Identify the most common error patterns
          – Implement a preprocessing pipeline using simple heuristics and pattern-matching tools
        • Focus in the longer term:
          – Use domain-specific knowledge in the form of authority files to validate and correct the output of NE extraction tools
          – Explore the possibility of combining different NE extraction tools and selecting output using a voting algorithm (sketched below)
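A minimal sketch of the proposed voting combination, assuming each tool's normalized output is a set of (entity text, type) pairs and that an entity is kept when a majority of tools propose it. The tool outputs below are invented; a real implementation would vote over character offsets rather than bare strings.

```python
from collections import Counter

def majority_vote(tool_outputs, threshold=None):
    """tool_outputs: one set of (entity_text, type) pairs per tool."""
    if threshold is None:
        threshold = len(tool_outputs) // 2 + 1   # simple majority
    votes = Counter(ent for output in tool_outputs for ent in output)
    return {ent for ent, count in votes.items() if count >= threshold}

stanford   = {("Kelly", "ORG"), ("Portsmouth", "LOC")}
opencalais = {("Kelly", "PER"), ("Portsmouth", "LOC")}
opennlp    = {("Portsmouth", "LOC")}
alchemy    = {("Kelly", "ORG")}
print(majority_vote([stanford, opencalais, opennlp, alchemy]))
# -> {('Portsmouth', 'LOC')}: only 2 of 4 tools agree on Kelly/ORG
```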
  • 19. Thanks