Netflix running Presto in the AWS Cloud
Zhenxiao Luo
Senior Software Engineer @ Netflix

Published on: Mar 4, 2016
Published in: Technology      
Source: www.slideshare.net


Transcripts - Netflix running Presto in the AWS Cloud

  • 1. Netflix running Presto in the AWS Cloud
    Zhenxiao Luo
    Senior Software Engineer @ Netflix
  • 2. Outline
    ● BigDataPlatform @ Netflix
    ● Use cases & requirements
    ● What we did
      ○ Reading/Writing from/to Amazon S3
      ○ Operations
      ○ Deployment
      ○ Performance
    ● What’s next?
  • 3. BigDataPlatform @ Netflix
  • 4. Use Cases
    ● Big Batch Jobs
      ○ high throughput, fault tolerant, ETL
      ○ data spills to disk
      ○ Hive on Tez, Pig on Tez
    ● Ad hoc Queries
      ○ low latency, interactive, data exploration
      ○ in-memory, but limited data size
      ○ Impala, Redshift, Spark, Presto
  • 5. Netflix Requirements
    ● SQL-like language
    ● Low latency for ad hoc queries
    ● Work well on the AWS cloud
    ● Good integration with the Hadoop stack
    ● Scale to 1000+ node clusters
    ● Open source with community support
  • 6. What did Netflix do?
  • 7. Reading/Writing to/from S3
    ● Option 1: Apache Hadoop NativeS3FileSystem
    ● Option 2: PrestoS3FileSystem
      ○ retry logic for read timeouts
      ○ writes directly to the final S3 path
    ● Option 3: EmrFileSystem
      ○ disable Hadoop logging
      ○ disable the Hadoop FileSystem cache
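    Whichever filesystem option is used, Presto reads S3-resident data through Hive metastore tables whose LOCATION points at S3. A minimal sketch, with hypothetical bucket, table, and column names:

      -- Hive DDL (hypothetical names): an external table whose data lives in S3
      CREATE EXTERNAL TABLE default.play_events (
        account_id BIGINT,
        title_id   BIGINT,
        event_ts   STRING
      )
      PARTITIONED BY (dateint INT)
      STORED AS SEQUENCEFILE
      LOCATION 's3://example-bucket/warehouse/play_events/';

      -- Presto reads the same table through the Hive connector; the chosen
      -- S3 FileSystem implementation determines how the bytes are fetched
      SELECT dateint, count(*) AS events
      FROM hive.default.play_events
      WHERE dateint = 20140301
      GROUP BY dateint;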
  • 8. Bug Fixes
    ● https://github.com/facebook/presto/commit/cf0b2d66f4050fb1959c832809fa76e323d6d46e
    ● https://github.com/facebook/presto/commit/594b06c3e93a482dc162d2c49c9bd265795efb86
    ● https://github.com/facebook/presto/pull/1147
    ● https://github.com/facebook/presto/pull/1300
    ● https://github.com/facebook/presto/issues/1285
    ● https://github.com/facebook/presto/issues/1264
  • 9. Our Operations Environment
    ● Launch script on top of EMR
    ● Ganglia integration
    ● Usage graphs: concurrent queries & tasks
  • 10. Current Deployment
    ● Presto in production @ Netflix
    ● 100+ node Presto cluster
    ● 1000+ queries per day
    ● Presto queries run against the same petabyte-scale S3 data warehouse as Hive and Pig
  • 11. Observed Performance @ Netflix
    ● Data in SequenceFile format
    ● Small table scan (one MapReduce job)
      ○ MapReduce overhead dominates query execution time
      ○ Presto is always ~10X faster than Hive
    ● Big table scan (one MapReduce job)
      ○ MapReduce overhead is marginal compared with the scan time
      ○ Presto performs similarly to Hive
    ● Aggregation (multiple MapReduce jobs)
      ○ Presto is always > 10X faster than Hive
    ● Joins
      ○ Presto is always > 2X faster than Hive
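    Illustrative query shapes behind these categories (table and column names are hypothetical, not from the deck):

      -- Table scan: a single map-only MapReduce job in Hive
      SELECT * FROM hive.default.play_events WHERE dateint = 20140301;

      -- Aggregation that Hive runs as multiple MapReduce jobs
      SELECT title_id, count(DISTINCT account_id) AS viewers
      FROM hive.default.play_events
      GROUP BY title_id
      ORDER BY viewers DESC
      LIMIT 100;

      -- Join
      SELECT d.title_name, count(*) AS plays
      FROM hive.default.play_events e
      JOIN hive.default.titles d ON e.title_id = d.title_id
      GROUP BY d.title_name;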
  • 12. What we are working on
    ● Support for the Parquet file format
      ○ https://github.com/facebook/presto/pull/1147
      ○ Parquet performs similarly to SequenceFile, but not as fast as RCFile
    ● ODBC/JDBC driver for Presto
      ○ Support MicroStrategy running on Presto
  • 13. Some inconveniences ...
    ● Support server-side “Use Schema”
      ○ Workaround: client-side “Use Schema” or “Schema.Table”
    ● Recurse into the partition directory
      ○ behavior differs from Hive
    ● Metadata caching
      ○ a query has to be rerun a number of times before a metadata change becomes visible
    ● Extend the JSON extract functions to allow . notation
      ○ json_extract_scalar(mapColumn, '$.namePart1.namePart2')
      ○ Workaround: regexp_extract
    ● WebUI running slow
      ○ load query task info on demand
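    The regexp_extract workaround for pulling a nested JSON field out of a string column might look roughly like this (column and field names are hypothetical):

      -- Desired: dotted-path access into a nested JSON value
      -- SELECT json_extract_scalar(payload, '$.device.type') FROM hive.default.events;

      -- Workaround sketch with regexp_extract in Presto
      SELECT regexp_extract(payload, '"type"\s*:\s*"([^"]*)"', 1) AS device_type
      FROM hive.default.events;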
  • 14. Features we would like
    ● Big table joins
    ● User Defined Functions
    ● Break down one column value into several tuples
      ○ In Hive: lateral view explode, json_tuple
    ● Decimal type
    ● Scheduler
    ● Writes
      ○ Insert overwrite
      ○ Alter table add partition
      ○ Parallel writes from workers (not client only)
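    For reference, the Hive constructs mentioned above for turning one column value into several rows (table and column names hypothetical):

      -- Hive: explode an array-valued column into one row per element
      SELECT account_id, genre
      FROM profiles
      LATERAL VIEW explode(favorite_genres) g AS genre;

      -- Hive: pull several fields out of one JSON column in a single pass
      SELECT account_id, t.device, t.country
      FROM events
      LATERAL VIEW json_tuple(payload, 'device', 'country') t AS device, country;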
  • 15. Q & A
    Thank you!
