NATC 2013 - Big Data as a Service

NASSCOM Annual Technology Conference 2013. Speaker: Joydeep Sen Sarma, Co-Founder, Qubole
Published on: Mar 3, 2016
Published in: Technology, Business
Source: www.slideshare.net


Transcripts - NATC 2013 - Big Data as a Service

  • 1. The Big Data SaaS Company: Big Data as a Service. Joydeep Sen Sarma
  • 2. Who’s Qubole
      • Founded 10/2011:
          – Ashish Thusoo & Joydeep Sen Sarma, Apache Hive, Facebook
          – Plus alumni of Oracle, GreenPlum, Vertica, Aster, Karmasphere, TerraCotta, Microsoft
      • Rapidly growing:
          – Engineering: Palo Alto (5), Bangalore (16)
          – Business: Palo Alto (4)
      • Series-A from LightSpeed and Charles River
  • 3. Thesis: Managed Big Data as a Service in the Cloud
      • SaaS will displace shipped software
      • Cloud will displace bare-metal
      • Big Data is already displacing RDBMS
  • 4. Big Data Puzzle
      [Diagram labels: GUI (Hue); Interfaces (ODBC/JDBC); Operations Dashboard; Data Connectors (MongoAdaptor..); Scheduler (Oozie); Cloud Orchestration (Whirr) or Compute + Storage; Hadoop; Hive/PIG; Mahout/Weka]
  • 5. Meet “Qubole”
      [Diagram labels: Operations Dashboard; Cloud Orchestration (Whirr) or Compute + Storage; GUI (Hue); Hadoop; Interfaces (ODBC/JDBC); Data Connectors (MongoAdaptor..); Scheduler (Oozie); Hive/PIG; Mahout/Weka]
      • Fully integrated Big Data service
      • Users focus on analyzing data and building data-driven apps
      • Qubole manages infrastructure and cloud provisioning
  • 6. Customers
  • 7. Use Cases
      • Summarizing Logs and Reporting
      • Data Integration
      • Ad-Hoc analysis of Historical Data
      • Preparing Data for Data Mining
      • Indexing Data for Search
      • Users
          – Developers (of end-products): Java/C++/Python
          – ETL and Data Engineers: SQL/Java/Python
          – Analysts: SQL / R
  • 8. Qubole Data Service: Integrate – Analyze – Schedule – Visualize
      [Diagram labels: Vertica; MySQL; Oozie; Hive; Sqoop; Presto; Pig; Hadoop; AWS EC2; AWS S3 (S3://adco/logs)]
  • 9. Now on GCE!
  • 10. What Users Like
      • Simplicity
          – Great Visual User Interface
          – Zero Operations
          – Accessible to Analysts (i.e., non-Engineers)
      • Efficiency
          – Significantly faster than competition (in most cases)
          – Cluster Consolidation is a game changer
          – Spot Instance integration
  • 11. What Users Like
      • Managed Service Model
          – Constantly upgrading software
          – Support when needed
          – Dealing with AWS issues
      • Nine-Course Meal
          – Seamless integration of Hadoop/Hive/Pig/..
          – Unified Command/Workflow model (also Simplicity)
          – Fewer things to learn/manage: “Please help us avoid Pentaho, Tableau, …”
  • 12. Core Technology
      • Auto-Scaling Hadoop Clusters in Cloud
          – Including OpenStack, Rackspace, GCE, etc.
      • Fastest Hive SaaS
          – Numerous Optimizations for Cloud Storage
          – 5x faster than EMR
      • Connectors
          – RDBMS, MongoDB/NoSQL, GA
          – Incremental Data Scrapes
      • Job Scheduler
          – Dependencies, Workflows, Incremental Jobs
  • 13. Auto-Scaling
      [Diagram label: AdCo Hadoop. Three example workloads are shown:]
      hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…
      select t.county, count(1) from (select transform(a.zip) using 'geo.py' as county from SMALL_TABLE a) t group by t.county;
      insert overwrite table dest select a.id, a.zip, count(distinct b.uid) from ads a join LARGE_TABLE b on (a.id=b.ad_id) group by a.id, a.zip;
  • 14. Scaling Up
      [Diagram labels: Master; Slaves; Job Tracker; Map Tasks; ReduceTasks; Progress; Supply; Demand; AWS; StarCluster]
      insert overwrite table dest select … from ads join campaigns on … group by …;
  • 15. Scaling Down
      1. On hour boundary, check if node is required:
          – Can’t remove nodes with map-outputs (today)
          – Don’t go below minimum cluster size
      2. Remove node from Map-Reduce Cluster
      3. Request HDFS Decommissioning (fast!)
          – Delete affected cache files instead of re-replicating
          – One surviving replica and we are done
      4. Delete Instance
  • 16. Fastest Hive SaaS
      • Works with Small Files!
          – Faster Split Computation (8x)
          – Prefetching S3 files (30%)
      • Direct writes to S3
          – HIVE-1620
      • Multi-Tenant Hive Server
      • Stable JVM Reuse!
          – Fix re-entrancy issues
          – 1.2-2x speedup
      • Columnar Cache
          – Use HDFS as cache for S3
          – Up to 5x faster for JSON data
          – HIVE-4226
      • 5x faster than EMR in TPC-H against S3
  • 17. Spot Instance Integration: up to 90% off
  • 18. Spot Instance Integration
      • Can lose Spot nodes anytime
          – Disastrous for HDFS
          – Hybrid Mode: Use mix of On-Demand and Spot
          – Hybrid Mode: Keep one replica in On-Demand nodes
      • Spot Instances may not be available
          – Timeout and use On-Demand nodes as fallback
  • 19. Closing Thoughts
      • AWS (/Cloud) is the new BIOS
      • Large multi-tenant [I/S]aaS is the new mainframe
          – Feedback loop is not available to average developers
          – Will be dominated by a few large companies
      • Open Source is the ocean that lifts the SaaS boat
          – But the boat has proprietary stuff
          – SaaS requires software innovation at a different pace
      • SaaS has network effects
          – Static software cannot keep up with rapidly evolving SaaS
  • 20. Questions?
      Me: joydeep@qubole.com
      Us: @Qubole
      Free Trial: www.qubole.com
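
The scale-down check from slide 15 can be illustrated with a small sketch. This is not Qubole's implementation: the `Node` fields, the `window_minutes` parameter, and the reading of "node is required" as "node is busy" are all assumptions made for illustration. It only shows the three stated rules working together: consider a node near its hour boundary, skip nodes holding map outputs, and never shrink below the minimum cluster size.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    minutes_since_launch: int  # hypothetical field: instance age in minutes
    has_map_outputs: bool      # node still holds map outputs
    busy: bool                 # hypothetical stand-in for "node is required"


def near_hour_boundary(node, window_minutes=5):
    """True when the node is within window_minutes of completing an hour."""
    return node.minutes_since_launch % 60 >= 60 - window_minutes


def pick_removable(nodes, min_cluster_size, window_minutes=5):
    """Return names of nodes that the slide's rules would allow removing."""
    removable = []
    remaining = len(nodes)
    for node in nodes:
        if remaining <= min_cluster_size:
            break  # don't go below minimum cluster size
        if not near_hour_boundary(node, window_minutes):
            continue  # only decide on the hour boundary
        if node.has_map_outputs or node.busy:
            continue  # can't remove nodes with map outputs, or nodes in use
        removable.append(node.name)
        remaining -= 1
    return removable


cluster = [
    Node("a", 58, False, False),
    Node("b", 30, False, False),
    Node("c", 119, True, False),
    Node("d", 177, False, False),
]
print(pick_removable(cluster, min_cluster_size=2))
```

Each selected node would then go through the remaining steps on the slide (removal from the Map-Reduce cluster, HDFS decommissioning, instance deletion), which are not modeled here.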

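The hybrid mode from slide 18 can be sketched in the same spirit. The function names, the replication factor of 3, the timeout value, and the `request_spot` / `request_on_demand` callables are hypothetical placeholders, not a real cloud API. The sketch shows two ideas from the slide: keep one replica of every block on an On-Demand node, and fall back to On-Demand nodes when Spot requests are not fulfilled in time.

```python
def place_replicas(on_demand, spot, replication=3):
    """Pick nodes for one block: first replica on On-Demand, the rest prefer Spot."""
    if not on_demand:
        raise ValueError("hybrid mode needs at least one On-Demand node")
    placement = [on_demand[0]]            # the replica that survives Spot loss
    candidates = spot + on_demand[1:]     # prefer Spot for remaining replicas
    placement.extend(candidates[:replication - 1])
    return placement


def provision(wanted, request_spot, request_on_demand, timeout_s=600):
    """Try Spot first; cover any shortfall with On-Demand nodes.

    request_spot(count, timeout_s) and request_on_demand(count) are
    hypothetical callables that return lists of node names.
    """
    got_spot = request_spot(wanted, timeout_s)
    shortfall = wanted - len(got_spot)
    got_on_demand = request_on_demand(shortfall) if shortfall > 0 else []
    return got_spot + got_on_demand


# Usage with stubbed requests: only one Spot node becomes available.
nodes = provision(
    3,
    request_spot=lambda count, timeout_s: ["spot-0"],
    request_on_demand=lambda count: [f"od-{i}" for i in range(count)],
)
print(nodes)
print(place_replicas(["od-0", "od-1"], ["spot-0", "spot-1", "spot-2"]))
```

Pinning the first replica to On-Demand capacity means that losing every Spot node at once still leaves one copy of each block.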