Heterogeneous Resource Scheduling Using
Apache Mesos for Cloud Native Frameworks
Sharma Podila
Senior Software Engineer
Ne...
Scale to Internet traffic
Scale to Internet traffic
… embrace failure
Context
● Operational intelligence, Edge Engineering
Context
● Operational intelligence, Edge Engineering
● Near real time detection of anomalies from
customers' streaming exp...
Context
● Operational intelligence, Edge Engineering
● Near real time detection of anomalies from
customers' streaming exp...
Context
● Operational intelligence, Edge Engineering
● Near real time detection of anomalies from
customers' streaming exp...
Efficient platform
Efficient platform
● do it cheap
Efficient platform
● do it cheap
● do it quick
Efficient platform
● do it cheap
● do it quick
● scale with Netflix traffic
Mantis: reactive stream processing
● Cloud native
Mantis: reactive stream processing
● Cloud native
● Lightweight jobs
Mantis: reactive stream processing
● Cloud native
● Lightweight jobs
● Dynamic jobs
Mantis: reactive stream processing
● Cloud native
● Lightweight jobs
● Dynamic jobs
● Custom SLAs
Bi-directional data sourcingSourceappinstances
0
1
2
999
0
1
2
3
4
5
Source connector
job
Subscribe with query
Data pushed...
Mantis system overview
MantisMantis
Apache Mesos
Apache Mesos
Mantis
Apache Mesos
ASG
ASG
ASG
Fenzo
Mesos
Framework
ZooKee...
Mantis scheduling model
JobJobJob
Source
Stage1
Stage2
Sink
Jobs may be perpetual or
on-demand/interactive
Slave instances
Apache Mesos
Apache Mesos Architecture
Mesos slave
FrmWrk2 executor
TaskTask
Mesos slave
FrmWrk2 executor
FrmWrk1 executor
TaskTask
Mes...
Why Apache Mesos in a Cloud?
Resource granularity
Instance 1
Instance 1
Task A
Instance 2
Task B
Instance 1
Task A
Instance 1
Task A
Task B
Task C Task...
Task start latency
vs.
Instance startup in mins. Mesos task startup in <1sec
Do we need yet another Mesos
framework?
A choice of Mesos frameworks
exists in the community
Likely easy to write a new framework
Likely easy to write a new framework
What about scale, performance, fault tolerance, and
availability?
Likely easy to write a new framework
What about scale, performance, fault tolerance, and
availability?
Scheduling is a har...
Long term justification needed to
create a new Mesos framework
Our motivations for new framework
● Cloud native
○ Large variation in peak to trough usage
○ Servers are more ephemeral
Our motivations for new framework
● Cloud native
○ Large variation in peak to trough usage
○ Servers are more ephemeral
● ...
Cluster autoscaling challenge
Host 1 Host 2 Host 3 Host 4
Cluster autoscaling challenge
Host 1 Host 2 Host 3 Host 4
Host 1 Host 2 Host 3 Host 4
vs.
Cluster autoscaling challenge
Host 1 Host 2 Host 3 Host 4
Host 1 Host 2 Host 3 Host 4
vs.
Stream locality challenge
Data
stream
Host A
Task1
Host B
Task2
Host C
Task3
Stream locality challenge
Data
stream
Host A
Task1
Host B
Task2
Host C
Task3
Data
stream
Host X
Task1
Task2
Task3
vs.
Redu...
Backpressure challenge
On back pressure, scale
horizontally
Processor task
Source
Stage1
Stage2
Stage3
Sink
Hot source Vs....
Fenzo task scheduler
Fenzo task scheduler
Generic task scheduler
Heterogeneous
Autoscale
Visibility
Plugins for
Constraints, Fitness
High speed
Fenzo usage in frameworks
Mesos master
Mesos framework
Tasks
requests
Available
resource
offers
Fenzo task
scheduler
Task ...
Scheduling problem
Fitness
Pending
Assigned
UrgencyN tasks to assign from M possible slaves
Scheduling optimizations
Speed Accuracy
First fit assignment Optimal assignment
Real world trade-offs
~ O (1) ~ O (N * M)1...
Scheduling strategy
For each task
Validate constraints
Eval fitness on each host
Until fitness good enough, and
A minimum ...
Task constraints
Soft
Hard
Task constraints
Soft
Hard
Some built-in
Host attribute based
Co-tasks based
Extensible
Fitness evaluation
Degree of fitness
Composable
Bin packing fitness calculator
Fitness for
Host1 Host2 Host3 Host4 Host5
fitness = usedCPUs / totalCPUs
Bin packing fitness calculator
Fitness for 0.25 0.5 0.75 1.0 0.0
Host1 Host2 Host3 Host4 Host5
fitness = usedCPUs / totalC...
Bin packing fitness calculator
Fitness for 0.25 0.5 0.75 1.0 0.0
✔
Host1 Host2 Host3 Host4 Host5
fitness = usedCPUs / tota...
Stream locality fitness
Stream locality fitness
= percentage of tasks connecting to the
same stream
Composable fitness calculators
Fitness
= ( BinPackFitness * BinPackWeight +
RuntimePackFitness * RuntimeWeight +
StreamLoc...
Cluster autoscaling in Fenzo
ASG/Cluster:
mantisagent
MinIdle: 8
MaxIdle: 20
CooldownSecs:
360
ASG/Cluster:
mantisagent
Mi...
Rules based cluster autoscaling
● Set up rules per host attribute value
○ E.g., one autoscale rule per ASG/cluster, one cl...
Shortfall based cluster autoscaling
● Rule-based scale up has a cool down period
○ What if there’s a surge of incoming req...
Combining predictive autoscaling
● EC2 Auto Scaling groups have 3 numbers to control
● Fenzo sets “desirable” value
● Set ...
Job autoscaling
Autoscale
resources of a job’
s stage based on
backpressure or
its usage of CPU,
memory, network
Job autoscaling
job’s
network
bandwidth
usage
number of
job
processes
Example from a
job that reads
from a queue to
proces...
To summarize...
Why Apache Mesos in a Cloud?
Resource granularity
Task start latency
Fenzo task scheduler
Generic task scheduler
Heterogeneous
Autoscale
Visibility
Plugins for
Constraints, Fitness
High speed
Mantis system overview
MantisMantis
Apache Mesos
Apache Mesos
Mantis
Apache Mesos
ASG
ASG
ASG
Fenzo
Mesos
Framework
ZooKee...
Heterogeneous Resource Scheduling Using
Apache Mesos for Cloud Native Frameworks
Sharma Podila
@podila
of 62

Resource Scheduling using Apache Mesos in Cloud Native Environments

Presentation at VMware, as part of their Tech Talk series.
Published on: Mar 4, 2016
Published in: Software      
Source: www.slideshare.net


Transcripts - Resource Scheduling using Apache Mesos in Cloud Native Environments

  • 1. Heterogeneous Resource Scheduling Using Apache Mesos for Cloud Native Frameworks Sharma Podila Senior Software Engineer Netflix June 2015
  • 2. Scale to Internet traffic
  • 3. Scale to Internet traffic … embrace failure
  • 4. Context ● Operational intelligence, Edge Engineering
  • 5. Context ● Operational intelligence, Edge Engineering ● Near real time detection of anomalies from customers' streaming experience
  • 6. Context ● Operational intelligence, Edge Engineering ● Near real time detection of anomalies from customers' streaming experience ● Stream processing platform to run ad hoc queries on live event streams
  • 7. Context ● Operational intelligence, Edge Engineering ● Near real time detection of anomalies from customers' streaming experience ● Stream processing platform to run ad hoc queries on live event streams ● Iterative detection of coarse grain and fine grain signals
  • 8. Efficient platform
  • 9. Efficient platform ● do it cheap
  • 10. Efficient platform ● do it cheap ● do it quick
  • 11. Efficient platform ● do it cheap ● do it quick ● scale with Netflix traffic
  • 12. Mantis: reactive stream processing ● Cloud native
  • 13. Mantis: reactive stream processing ● Cloud native ● Lightweight jobs
  • 14. Mantis: reactive stream processing ● Cloud native ● Lightweight jobs ● Dynamic jobs
  • 15. Mantis: reactive stream processing ● Cloud native ● Lightweight jobs ● Dynamic jobs ● Custom SLAs
  • 16. Bi-directional data sourcingSourceappinstances 0 1 2 999 0 1 2 3 4 5 Source connector job Subscribe with query Data pushed with subscription ID User Job 1 User Job 2 User Job 3 Subscription Discovery
  • 17. Mantis system overview MantisMantis Apache Mesos Apache Mesos Mantis Apache Mesos ASG ASG ASG Fenzo Mesos Framework ZooKeeper Instance Apache Mesos Slave Cluster
  • 18. Mantis scheduling model JobJobJob Source Stage1 Stage2 Sink Jobs may be perpetual or on-demand/interactive Slave instances
  • 19. Apache Mesos
  • 20. Apache Mesos Architecture Mesos slave FrmWrk2 executor TaskTask Mesos slave FrmWrk2 executor FrmWrk1 executor TaskTask Mesos master Standby master Standby master Mesos slave FrmWrk1 executor TaskTask FrmWrk1 FrmWrk2 ZooKeeper quorum
  • 21. Why Apache Mesos in a Cloud?
  • 22. Resource granularity Instance 1 Instance 1 Task A Instance 2 Task B Instance 1 Task A Instance 1 Task A Task B Task C Task D vs.
  • 23. Task start latency vs. Instance startup in mins. Mesos task startup in <1sec
  • 24. Do we need yet another Mesos framework?
  • 25. A choice of Mesos frameworks exists in the community
  • 26. Likely easy to write a new framework
  • 27. Likely easy to write a new framework What about scale, performance, fault tolerance, and availability?
  • 28. Likely easy to write a new framework What about scale, performance, fault tolerance, and availability? Scheduling is a hard problem to solve
  • 29. Long term justification needed to create a new Mesos framework
  • 30. Our motivations for new framework ● Cloud native ○ Large variation in peak to trough usage ○ Servers are more ephemeral
  • 31. Our motivations for new framework ● Cloud native ○ Large variation in peak to trough usage ○ Servers are more ephemeral ● Resource usage optimizations
  • 32. Cluster autoscaling challenge Host 1 Host 2 Host 3 Host 4
  • 33. Cluster autoscaling challenge Host 1 Host 2 Host 3 Host 4 Host 1 Host 2 Host 3 Host 4 vs.
  • 34. Cluster autoscaling challenge Host 1 Host 2 Host 3 Host 4 Host 1 Host 2 Host 3 Host 4 vs.
  • 35. Stream locality challenge Data stream Host A Task1 Host B Task2 Host C Task3
  • 36. Stream locality challenge Data stream Host A Task1 Host B Task2 Host C Task3 Data stream Host X Task1 Task2 Task3 vs. Reduces network bandwidth usage
  • 37. Backpressure challenge On back pressure, scale horizontally Processor task Source Stage1 Stage2 Stage3 Sink Hot source Vs. cold source
  • 38. Fenzo task scheduler
  • 39. Fenzo task scheduler Generic task scheduler Heterogeneous Autoscale Visibility Plugins for Constraints, Fitness High speed
  • 40. Fenzo usage in frameworks Mesos master Mesos framework Tasks requests Available resource offers Fenzo task scheduler Task assignment result • Host1 • Task1 • Task2 • Host2 • Task3 • Task4 Persistence
  • 41. Scheduling problem Fitness Pending Assigned UrgencyN tasks to assign from M possible slaves
  • 42. Scheduling optimizations Speed Accuracy First fit assignment Optimal assignment Real world trade-offs ~ O (1) ~ O (N * M)1 1 Assuming tasks are not reassigned
  • 43. Scheduling strategy For each task Validate constraints Eval fitness on each host Until fitness good enough, and A minimum #hosts evaluated
  • 44. Task constraints Soft Hard
  • 45. Task constraints Soft Hard Some built-in Host attribute based Co-tasks based Extensible
  • 46. Fitness evaluation Degree of fitness Composable
  • 47. Bin packing fitness calculator Fitness for Host1 Host2 Host3 Host4 Host5 fitness = usedCPUs / totalCPUs
  • 48. Bin packing fitness calculator Fitness for 0.25 0.5 0.75 1.0 0.0 Host1 Host2 Host3 Host4 Host5 fitness = usedCPUs / totalCPUs
  • 49. Bin packing fitness calculator Fitness for 0.25 0.5 0.75 1.0 0.0 ✔ Host1 Host2 Host3 Host4 Host5 fitness = usedCPUs / totalCPUs
  • 50. Stream locality fitness Stream locality fitness = percentage of tasks connecting to the same stream
  • 51. Composable fitness calculators Fitness = ( BinPackFitness * BinPackWeight + RuntimePackFitness * RuntimeWeight + StreamLocalityFitness * StreamWeight ) / 3.0
  • 52. Cluster autoscaling in Fenzo ASG/Cluster: mantisagent MinIdle: 8 MaxIdle: 20 CooldownSecs: 360 ASG/Cluster: mantisagent MinIdle: 8 MaxIdle: 20 CooldownSecs: 360 ASG/cluster: computeCluster MinIdle: 8 MaxIdle: 20 CooldownSecs: 360 Fenzo ScaleUp action: Cluster, N ScaleDown action: Cluster, HostList
  • 53. Rules based cluster autoscaling ● Set up rules per host attribute value ○ E.g., one autoscale rule per ASG/cluster, one cluster for network- intensive jobs, another for CPU/memory-intensive jobs ● Sample: ○ ClusterName, MinIdle, MaxIdle, CoolDownSecs ○ NwkClstr,10,20,360; CmputeClstr,5,20,300 #Idle hosts Trigger down scale Trigger up scale min max
  • 54. Shortfall based cluster autoscaling ● Rule-based scale up has a cool down period ○ What if there’s a surge of incoming requests? ● Pending requests trigger shortfall analysis ○ Scale up happens regardless of cool down period ○ Remembers which tasks have already been covered
  • 55. Combining predictive autoscaling ● EC2 Auto Scaling groups have 3 numbers to control ● Fenzo sets “desirable” value ● Set max value to limit scale up ● Scryer can set min value 1 http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html
  • 56. Job autoscaling Autoscale resources of a job’ s stage based on backpressure or its usage of CPU, memory, network
  • 57. Job autoscaling job’s network bandwidth usage number of job processes Example from a job that reads from a queue to process data across multiple processes
  • 58. To summarize...
  • 59. Why Apache Mesos in a Cloud? Resource granularity Task start latency
  • 60. Fenzo task scheduler Generic task scheduler Heterogeneous Autoscale Visibility Plugins for Constraints, Fitness High speed
  • 61. Mantis system overview MantisMantis Apache Mesos Apache Mesos Mantis Apache Mesos ASG ASG ASG Fenzo Mesos Framework ZooKeeper Instance Apache Mesos Slave Cluster
  • 62. Heterogeneous Resource Scheduling Using Apache Mesos for Cloud Native Frameworks Sharma Podila @podila

Related Documents