Presto
Past, Present and Future
Martin Traverso
June 5, 2014

Presto @ Facebook: Past, Present and Future

Published on: Mar 4, 2016
Published in: Technology, Education
Source: www.slideshare.net


Transcripts - Presto @ Facebook: Past, Present and Future

  • 1. Presto: Past, Present and Future
      • Martin Traverso
      • June 5, 2014
  • 2. Why build Presto?
  • 3. “A good day is when I can run 6 Hive queries” — a Facebook data scientist
  • 4. What is Presto?
      • Distributed SQL analytics engine
      • Optimized for low-latency, interactive analysis
      • ANSI SQL
      • Extensible
  • 5. Architecture
  • 6. Architecture (diagram)
      • Client
      • Coordinator: Parser/Analyzer, Planner, Scheduler, Metadata API, Data Location API
      • Workers (three shown): Data Stream API
  • 7. Connectors (diagram)
      • Coordinator: Parser/Analyzer, Planner, Scheduler
      • Worker
      • Metadata API: Cassandra, Internal, MySQL, JMX, Hive
      • Data Location API: Cassandra, Internal, MySQL, JMX, Hive
      • Data Stream API: Cassandra, Internal, MySQL, JMX, Hive
  • 8. Connectors
      • Hadoop 1.x
      • Hadoop 2.x
      • CDH 4
      • CDH 5
      • Custom S3 integration for Hadoop
      • Cassandra
      • TPC-H
  • 9. Other extension points
      • Types
      • Functions
      • Operators
  • 10. What makes Presto fast?
      • Data in memory during execution
      • Pipelining and streaming
      • Very careful coding of inner loops
      • Efficient flat-memory data structures
      • Bytecode generation
  • 11. What’s next?
  • 12. More SQL features
      • Structs, Maps and Lists
      • Views
      • Scalar subqueries
      • Features required to run all TPC-DS
  • 13. Execution engine
      • Huge joins and aggregations
          • Hash distributed
          • Co-distributed and co-partitioned
          • Spill to disk (flash)
      • Work stealing
      • Basic task recovery
  • 14. ODBC driver
      • Targeting major BI tools
          • Tableau, MicroStrategy and Excel
      • Support for Windows, Mac and Linux
      • Entirely open source (ASL2)
  • 15. Native store
      • Stores data directly on worker nodes
      • Custom data format
      • Initial use cases
          • ‘Hot’ data
          • ‘Live’ data
  • 16. Open source
      • Apache License 2.0
      • Open development
      • Releases every 1-2 weeks
      • External contributions welcome!
  • 17. Presto
      • http://prestodb.io
      • github.com/facebook/presto
      • Martin Traverso
      • @mtraverso
      • github.com/martint
  • 18. Bytecode generation

        while (in.advanceNextPosition()) {
            if (in.getLong(3) >= 100 &&
                    in.getLong(3) <= 200 &&
                    in.getLong(4) < in.getLong(5)) {

                out.advance();
                in.appendStringTo(0, out);
                out.appendLong(in.getLong(1) * in.getLong(2) / 10);
            }
        }

        SELECT
            k AS c1,
            (a * b) / 10 AS c2
        FROM T
        WHERE
            c BETWEEN 100 AND 200
            AND d < e

        T:
            k varchar,
            a bigint,
            b bigint,
            c bigint,
            d bigint,
            e bigint
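
The loop on slide 18 can be tried out with a small self-contained sketch. The `InputCursor` and `OutputBuilder` classes below are hypothetical stand-ins written for this illustration, not Presto's actual classes; only the method names `advanceNextPosition`, `getLong`, `appendStringTo`, `advance` and `appendLong` come from the slide. Field indexes follow the column order of table T (0 = k, 1 = a, 2 = b, 3 = c, 4 = d, 5 = e).

```java
import java.util.ArrayList;
import java.util.List;

class FilterProjectSketch {
    // Hypothetical row-at-a-time cursor over table T (an assumption for this sketch).
    static final class InputCursor {
        private final Object[][] rows;
        private int position = -1;

        InputCursor(Object[][] rows) {
            this.rows = rows;
        }

        boolean advanceNextPosition() {
            return ++position < rows.length;
        }

        long getLong(int field) {
            return (Long) rows[position][field];
        }

        void appendStringTo(int field, OutputBuilder out) {
            out.appendString((String) rows[position][field]);
        }
    }

    // Hypothetical output sink that collects one list of values per output row.
    static final class OutputBuilder {
        final List<List<Object>> rows = new ArrayList<>();

        void advance() {
            rows.add(new ArrayList<>());
        }

        void appendString(String value) {
            rows.get(rows.size() - 1).add(value);
        }

        void appendLong(long value) {
            rows.get(rows.size() - 1).add(value);
        }
    }

    // Same shape as the loop on the slide: the filter and the projection
    // are fused into a single pass over the input.
    static List<List<Object>> run(Object[][] table) {
        InputCursor in = new InputCursor(table);
        OutputBuilder out = new OutputBuilder();
        while (in.advanceNextPosition()) {
            if (in.getLong(3) >= 100 &&            // c BETWEEN 100 AND 200
                    in.getLong(3) <= 200 &&
                    in.getLong(4) < in.getLong(5)) {   // d < e
                out.advance();
                in.appendStringTo(0, out);                           // k AS c1
                out.appendLong(in.getLong(1) * in.getLong(2) / 10);  // (a * b) / 10 AS c2
            }
        }
        return out.rows;
    }

    public static void main(String[] args) {
        Object[][] table = {
            {"x", 6L, 5L, 150L, 1L, 2L},  // passes both predicates
            {"y", 6L, 5L, 250L, 1L, 2L},  // c is out of range
            {"z", 6L, 5L, 150L, 3L, 2L},  // d is not less than e
        };
        System.out.println(run(table));   // prints [[x, 3]]
    }
}
```

The point of the slide's example is that the predicate and the projection for one specific query are compiled into a single tight loop, rather than being evaluated by a general-purpose expression interpreter for each row. The sketch only mirrors that loop by hand; it does not generate bytecode.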