NASSCOM Big Data and Analytics Summit 2013: Keynote 2: Gurinder Grewal

Keynote II: Big Data: Connecting the dots – Delivering split-second decisions by integrating offline and online systems. Gurinder Grewal, Leader of Risk Big Data Platform, PayPal
Published on: Mar 3, 2016
Published in: Technology, Business
Source: www.slideshare.net


Transcripts - NASSCOM Big Data and Analytics Summit 2013: Keynote 2: Gurinder Grewal

  • 1. BIG DATA – CONNECTING THE DOTS…
    Delivering split-second decisions by integrating offline and online systems
    NASSCOM Big Data & Analytics Summit 2013
    GURINDER S. GREWAL
  • 2. ART OF DECISION MAKING – SPEED & ACCURACY
    11:01AM, 11:05AM, 11:06AM
    • Credit card used from three distant locations in a short time
      Result based on real-time analysis: block the card, or undecided?
    • According to past purchasing behavior:
      • Card holder lives in the US; wife paid a bill online from the home PC
      • Card holder’s kid studies in Europe; used the card to purchase books
      • Card holder travels to Japan; paid for lunch
      Result based on historical analysis: it is legitimate usage
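The scenario on slide 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's implementation: the transaction layout, the function names `realtime_flag` and `decide`, the 10-minute window, and the idea of a per-card set of "known places" are all hypothetical. It shows a fast real-time rule that raises a flag and a historical profile that can explain the flagged activity.

```python
from datetime import timedelta

def realtime_flag(txns, window=timedelta(minutes=10), max_places=2):
    """Real-time rule (hypothetical): flag the card when it appears in more
    than `max_places` distinct places within `window`."""
    txns = sorted(txns, key=lambda t: t["time"])
    for i, first in enumerate(txns):
        # Distinct places seen from this transaction up to `window` later.
        places = {t["place"] for t in txns[i:]
                  if t["time"] - first["time"] <= window}
        if len(places) > max_places:
            return True
    return False

def decide(txns, known_places):
    """Combine the real-time flag with offline, historical knowledge.

    `known_places` stands in for a profile computed offline from past
    purchasing behavior (an assumption made for this sketch)."""
    if not realtime_flag(txns):
        return "allow"
    unexplained = {t["place"] for t in txns} - set(known_places)
    # Flagged activity is allowed only when history explains every place.
    return "block" if unexplained else "allow"
```

With three transactions a few minutes apart in the US, Europe and Japan, `realtime_flag` fires on its own, while `decide` allows the activity once the historical profile contains all three places.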
  • 3. TIERED BIG DATA STRATEGY
    • real time, e.g. filters
    • near real time, e.g. correlations
    • offline, e.g. behavioral analysis
    effective decision = fn(accuracy, speed, cost)
    data age: seconds, hours, years
    Diagram labels: cost, speed; data volume, accuracy; Data in-motion; Data in-use
  • 4. BIG DATA - COMPUTATION STRATEGY
    Realtime (in-flow processing)
    • fast, very stringent availability and performance SLAs
    • computations are simple and eventually accurate
    • computations are transient, short lived (user sessions)
    Near Real-time (complex event processing)
    • event-driven, incremental processing
    • high efficiency and scalability
    • data for short time windows (hours)
    Offline (map-reduce, batch)
    • optimized for throughput
    • computations are slow and accurate
    • data captured as events for historical analysis
    Diagram labels: Offline variables, Online variables
  • 5. BIG DATA IN USE - OFFLINE ECOSYSTEM
    Hadoop Technology Stack
    • Data Storage: HDFS, HBase
    • Data Processing: Map Reduce Framework
    • Data Integration: ETL, Flume, Sqoop
    • Programming Languages: Pig, Hive QL
    • Scheduling, Coordination: ZooKeeper, Oozie
    • UI Framework/SDK: Hue, Hue SDK
    Diagram labels: Structured Data, Unstructured Data, MPP DW, RDBMS
  • 6. BIG DATA IN MOTION – ONLINE ECOSYSTEM
    Diagram labels: Events stream, Message Bus, Complex Event Processing (correlations, filtering, aggregations, pattern matching), In-memory data store, Offline, Decision Service, Monitoring
    CEP enables continuous analytics on data in motion
    • Solution for velocity of big data
    • Well suited for detection, decisioning, alerting and taking actions
    • Relies on an in-memory data grid to provide low latency
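To make the CEP operations named on slide 6 concrete, here is a minimal, hedged sketch in plain Python rather than a real CEP engine. The class name `VelocityRule`, its parameters, and the event fields are assumptions for illustration. It filters events, keeps a per-key sliding window in memory, aggregates over that window, and emits an alert when a simple pattern (too many events in the window) matches. It assumes events for a key arrive in timestamp order.

```python
from collections import defaultdict, deque

class VelocityRule:
    """Toy sliding-window rule: alert when a key produces at least
    `threshold` qualifying events within `window_seconds`."""

    def __init__(self, window_seconds, threshold, min_amount=0.0):
        self.window = window_seconds
        self.threshold = threshold
        self.min_amount = min_amount
        self.events = defaultdict(deque)  # key -> deque of (timestamp, amount)

    def on_event(self, key, timestamp, amount):
        # Filtering: ignore events below the amount of interest.
        if amount < self.min_amount:
            return None
        q = self.events[key]
        q.append((timestamp, amount))
        # Incremental processing: evict events that fell out of the window.
        while q and timestamp - q[0][0] > self.window:
            q.popleft()
        # Aggregation over the current window.
        count = len(q)
        total = sum(a for _, a in q)
        # Pattern match: too many events for one key in a short time.
        if count >= self.threshold:
            return {"key": key, "count": count, "total": total}
        return None
```

Each call does a small, bounded amount of work per event and keeps only the current window in memory, which is the property the slide attributes to processing data in motion.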
  • 7. BIG DATA MOVEMENT EVOLUTION
    Two-tier architecture: Offline, In-memory data store
    Multi-tier architecture: Offline, NoSQL (persistent backing store), In-memory data store
    Initial state
    • 500 GB in 16 hours
    Optimization – Phase 1
    • 2 TB in 16 hours
    • Split data files prepared offline
    • Maximize data load parallelism
    • Maximum data compression
    • Optimize data format
    • Validation before data movement
    Scale – Phase 2
    • 10 TB in 6 hours
    • Add persistent NoSQL behind in-memory store
    • Blast bulk load into NoSQL store
    • Batch process will warm the cache
    • Lazy warm-up as needed, while serving r/w
    • Refresh cache contents via time-based evictions
    Diagram labels: Data Cloud, Batch
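The Phase 2 design on slide 7 can be illustrated with a small sketch. It is a toy model under assumptions, not the platform described in the talk: the class `TieredStore`, its method names, and the use of plain dictionaries for both the persistent NoSQL store and the in-memory store are hypothetical. It shows bulk loads going to the backing store, a batch warm-up, lazy warm-up on a read miss, and time-based eviction.

```python
import time

class TieredStore:
    """In-memory cache in front of a persistent backing store (toy model)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable so the sketch can be tested
        self.backing = {}    # stand-in for the persistent NoSQL store
        self.cache = {}      # key -> (value, loaded_at)

    def bulk_load(self, records):
        # Bulk loads bypass the cache and land in the backing store only.
        self.backing.update(records)

    def warm(self, keys):
        # Batch warm-up: copy selected keys into the in-memory tier.
        now = self.clock()
        for key in keys:
            if key in self.backing:
                self.cache[key] = (self.backing[key], now)

    def get(self, key):
        now = self.clock()
        entry = self.cache.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]
        # Expired or missing: evict, then lazily reload from the backing store.
        self.cache.pop(key, None)
        if key not in self.backing:
            return None
        value = self.backing[key]
        self.cache[key] = (value, now)
        return value

    def put(self, key, value):
        # Write-through, so both tiers agree on online writes.
        self.backing[key] = value
        self.cache[key] = (value, self.clock())
```

One consequence of time-based eviction is visible here: a value that was bulk loaded after its key was cached only becomes readable once the cached entry expires, so the TTL bounds how stale online reads can be.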
  • 8. USE CASE: GRAPH BASED DECISIONING
    Confidential and Proprietary
    Diagram labels: offline, online, Map/Reduce Graph builder, Daily incremental updates, In-memory graph store, Online Graph Server, Continuous graph updates and rollup, Events stream, Decision Service
    • Generate graph and associated complex variables on Hadoop on a daily basis
    • Move the incremental changes to the online in-memory graph store
    • Based on the event stream, keep the graph and offline variables up to date
    • In-memory store provides fast read-only access to Decision services
    Avg. read time: 2 ms; 95th percentile: 6 ms
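The flow on slide 8 can be sketched as an in-memory graph that accepts both daily deltas from an offline build and single edges from an event stream. Everything named here is an assumption for illustration: the class `InMemoryGraph`, the event fields `account` and `asset`, and the `linked_accounts` variable (accounts that share an asset such as a device) are hypothetical and are not taken from the talk.

```python
from collections import defaultdict

class InMemoryGraph:
    """Undirected graph held in memory as adjacency sets (toy model)."""

    def __init__(self):
        self.adj = defaultdict(set)

    def add_edge(self, a, b):
        self.adj[a].add(b)
        self.adj[b].add(a)

    def remove_edge(self, a, b):
        self.adj.get(a, set()).discard(b)
        self.adj.get(b, set()).discard(a)

    def apply_daily_delta(self, added, removed):
        # Incremental changes produced by the offline (batch) graph build.
        for a, b in removed:
            self.remove_edge(a, b)
        for a, b in added:
            self.add_edge(a, b)

    def on_event(self, event):
        # Continuous update from the event stream between daily builds.
        self.add_edge(event["account"], event["asset"])

    def linked_accounts(self, account):
        # Read-only query for a decision service: nodes two hops away,
        # i.e. accounts that share at least one asset with `account`.
        linked = set()
        for asset in self.adj.get(account, ()):
            linked.update(self.adj.get(asset, ()))
        linked.discard(account)
        return linked
```

Reads touch only in-memory sets, which is the kind of access pattern that makes millisecond-level read times plausible; the sketch makes no claim about reproducing the latencies quoted on the slide.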
  • 9. CONCLUSION
    • Hadoop is best for offline processing of variety and volume data, not for real time
    • CEP is a solution for online big data in motion (velocity) and complements Hadoop
    • Harness the true power of big data by combining offline and online data
    • Data integration is key; careful planning and optimization are needed
    • Online data stores are not optimized for highly parallel writes and bulk loads
    • Big data can solve complex problems while delivering speed and accuracy
  • 10. THANK YOU!
