Populating your Search Index
NEST Meetup, 2016-01
Presented at a Meetup by New England Search Technology, 2016-01-14.
Published on: Mar 4, 2016
Published in: Technology      
Source: www.slideshare.net


Transcripts - Populate your Search index, NEST 2016-01

  • 1. Populating your Search Index
    NEST Meetup, 2016-01
  • 2. 5 Presentations
     Indexing Considerations, Pipelines, and Apache NiFi
     A Proposal for a Document Pipeline
     How we do it at TIAA-CREF with Solr
     How we do it at DRG with Solr
     Logstash and Beats with ElasticSearch
  • 3. Indexing Considerations
    Indexing considerations to think about when building out a search platform
  • 4. What do I mean?
     How do you plan to get data into the index (Solr/ES/…)?
     Backups?
     Schedule & Monitor?
     Realtime search requirements?
     What software? (pipelines, crawlers, …)
  • 5. Crawling?
     Common in the “enterprise search” space
     What crawler will you use?
     Nutch is well-known but too complex for smaller-scale jobs
     Many more exist.
     Security access control metadata to federate?
     Try ManifoldCF, which excels at this.
  • 6. Bulk indexing
     Plan for a “bulk reindex” use case
     When changing schemas / ingestion extraction rules
     Or recovering when there’s no backup
     Not having a backup is typical, especially if re-indexing is fast
     Optimize settings for this to be fast
     May need to toggle back to “normal” settings after ingestion
     Use multiple machines during indexing (e.g. via Hadoop)?
     “Optimize” (merge) Lucene segments at the end?
    (A minimal sketch of this bulk path follows below.)
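To make the fast bulk path concrete, here is a minimal sketch (not from the talk) that batches documents to a standalone Solr core's JSON update endpoint and commits/merges only once at the end. The core name `docs`, the `requests` dependency, and `load_source_docs()` are assumptions for illustration.

```python
import requests  # assumes the third-party 'requests' package is available

SOLR_UPDATE = "http://localhost:8983/solr/docs/update"  # the 'docs' core name is an assumption

def bulk_reindex(source_docs, batch_size=1000):
    """Post documents in batches; commit (and optionally merge segments) only once at the end."""
    batch = []
    for doc in source_docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            requests.post(SOLR_UPDATE, json=batch).raise_for_status()  # no per-batch commit
            batch = []
    if batch:
        requests.post(SOLR_UPDATE, json=batch).raise_for_status()
    # one explicit commit at the end; "optimize" merges Lucene segments (the slide's last bullet)
    requests.post(SOLR_UPDATE, json={"commit": {}}).raise_for_status()
    requests.post(SOLR_UPDATE, json={"optimize": {}}).raise_for_status()

# bulk_reindex(load_source_docs())  # load_source_docs() is a hypothetical source reader
```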
  • 7. Incremental indexing (adding new/updated content)
     Detect deletes how?
     A: Flag for removal upstream before eventually removing
     B: Track all IDs somewhere; find the ones that went missing (see the sketch below)
     Maybe deletes don’t need to be synchronized until off-hours?
     Realtime indexing: handle it separately?
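A sketch of option B (track all IDs, find the ones that went missing), again against an assumed standalone Solr core named `docs`; how you obtain `source_ids` and `indexed_ids` is left hypothetical.

```python
import requests

SOLR_UPDATE = "http://localhost:8983/solr/docs/update"  # same assumed core as above

def sync_deletes(source_ids, indexed_ids):
    """Option B: anything still in the index that the source no longer has was deleted upstream."""
    missing = set(indexed_ids) - set(source_ids)
    if missing:
        # Solr JSON update commands: delete a list of document IDs, then commit
        requests.post(SOLR_UPDATE, json={"delete": sorted(missing)}).raise_for_status()
        requests.post(SOLR_UPDATE, json={"commit": {}}).raise_for_status()
    return missing

# source_ids might come from e.g. "SELECT id FROM documents" in the system of record,
# indexed_ids from an fl=id export/cursor query against the index (both hypothetical here).
```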
  • 8. Backups (DR: Disaster Recovery)
     Scenario: an admin accidentally deleted 30k random docs; oh %#?!
     Not solved by replication/redundancy
     Also useful in other scenarios, like testing
     Might not be needed, especially if bulk re-indexing is fast
     Take snapshots (e.g. via AWS, via the search system, or …)
     Recovery: deploy the snapshot, then sync it back up to date
     Solr: see BloomReach’s “HAFT” project
    (A snapshot-triggering sketch follows below.)
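One way to take snapshots "via the search system": a sketch that triggers Solr's replication-handler backup command. It assumes a standalone (non-SolrCloud) core named `docs` with the replication handler enabled and a backup location Solr is allowed to write to.

```python
import requests

# Assumed standalone core 'docs' with the replication handler enabled.
REPLICATION = "http://localhost:8983/solr/docs/replication"

def take_snapshot(name, location="/backups/solr"):
    """Ask the replication handler to write a named snapshot of the index (asynchronous)."""
    resp = requests.get(REPLICATION, params={
        "command": "backup", "name": name, "location": location, "wt": "json"})
    resp.raise_for_status()
    return resp.json()

def last_backup_status():
    """Check the 'backup' section of the replication details to see how the last snapshot went."""
    resp = requests.get(REPLICATION, params={"command": "details", "wt": "json"})
    resp.raise_for_status()
    return resp.json().get("details", {}).get("backup")
```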
  • 9. Document Transformations
     Mapping source data (e.g. an HTML doc or database record) to a search document
     Examples:
     Text extraction from PDFs
     Enrichment (e.g. Named Entity Recognition)
     Text pre-processing before the search platform gets it
     Merging multiple data sources; joining
     Home-grown, or an existing ETL / “pipeline” tool?
     Do some of this directly on the search platform?
    (A toy mapping function follows below.)
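A toy mapping function in the spirit of this slide. `extract_pdf_text` and `detect_entities` are injected stand-ins for real extraction and enrichment steps (e.g. Apache Tika, an NER model), and the `_txt`/`_ss` suffixes assume Solr-style dynamic fields; all of these names are illustrative, not from the talk.

```python
def to_search_doc(record, extract_pdf_text, detect_entities):
    """Map one source record (e.g. a database row as a dict) to a flat search document.

    extract_pdf_text and detect_entities are hypothetical callables standing in for
    real extraction/enrichment; _txt and _ss follow Solr dynamic-field conventions
    (text and multi-valued strings respectively).
    """
    body = extract_pdf_text(record["pdf_path"]) if record.get("pdf_path") else record.get("body", "")
    return {
        "id": str(record["id"]),
        "title": record.get("title", ""),
        "body_txt": body,
        "tag_ss": record.get("tags", []),      # multi-valued, unlike most table-shaped ETL models
        "entity_ss": detect_entities(body),    # enrichment output becomes a facetable field
    }
```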
  • 10. Schedule, Monitor
     How will a bulk index be triggered? An incremental index?
     Unix cron? Basic but crude.
     A web UI to control this is great.
     A CI server (e.g. Jenkins) can work! (web UI, logs, alerting)
     Monitor/alert for problems?
     Perhaps via general log monitoring (e.g. ELK)
    (A thin cron/Jenkins entry-point sketch follows below.)
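For the cron or Jenkins route, a thin entry point like this sketch keeps either one workable: it logs and returns a non-zero exit code on failure, so cron mail, a Jenkins build status, or log monitoring can raise the alert. `run_incremental_index()` is hypothetical, standing in for the actual ingestion entry point.

```python
import logging
import sys

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def main():
    try:
        # run_incremental_index() is hypothetical: whatever your ingestion entry point is
        updated, deleted = run_incremental_index()
        logging.info("incremental index ok: %d updated, %d deleted", updated, deleted)
        return 0
    except Exception:
        logging.exception("incremental index FAILED")
        return 1  # non-zero exit lets the scheduler or monitoring system alert

if __name__ == "__main__":
    sys.exit(main())
```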
  • 11. Open-Source ETL Software
    A summary of an investigation I did on open-source options in 2013.
  • 12. ETL Software
     Extract Transform Load – a general idea
     Software that calls itself ETL tends to be very similar:
     Clover ETL
     Pentaho Data Integration, AKA Kettle
     Talend Open Studio for Data Integration
  • 13. Common features
     Two are GPL/LGPL; Talend is Apache-licensed
     Freemium model – pay for “enterprise” features
     The Good (in a word: mature):
     GUI wire-diagram builder
     Books / resources
     The Bad:
     Text-editing the pipeline is not recommended, so you need the GUI
     Poor community
     Data model is table-like; no native multi-valued fields
  • 14. Talend screenshot
  • 15. Apache NiFi “is an easy to use, powerful, and reliable system to process and distribute data.”
  • 16. Apache NiFi overview
     Web-based UI
     Runtime modification of flow control
     Data provenance features
     Extensible (of course)
     Security, role-based access control
