
Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

Jeremy Rust's presentation on Avoiding Downtime Using Linux High Availability. The presentation was given during the Nagios World Conference North America held Oct 13th - Oct 16th, 2014 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/conference
Published on: Mar 3, 2016
Published in: Technology      
Source: www.slideshare.net


Transcripts - Nagios Conference 2014 - Jeremy Rust - Avoiding Downtime Using Linux High Availability

  • 1. Avoiding Downtime Using Linux High Availability Jeremy Rust Jeremy@linbit.com @linbit @nerdhacker
  • 2. Introduction & Agenda • Downtime is not cheap • What is High Availability? It is not a backup! • RAID, or RAID over the network (DRBD) • SANs and clustered applications • The Linux cluster stack • Cluster management with Pacemaker • Disaster Recovery / Linking sites • DRBD and the Cloud
  • 3. DRBD HA and DR
  • 4. Downtime = $$$ • Lost revenue • Lost reputation • Almost every business these days has a critical database or file system that they could not do without. • HP estimates $31,705 per hour, 3.8 hours a year, totaling $481,900/year • 40% of internet traffic stops when Google goes down Survey: Critical “IT-Systems in the medium-sized business”, 2013 Techconsult, on behalf of HP Germany Basis: 300 medium-sized companies from Germany
  • 5. Downtime = $$$ “YOU LOST THE DATABASE?!?!” • “Ummm, can you ping ____?” • “I can’t seem to reach our inventory system.” • “Can you try pulling up this record?”
  • 6. Devotion to Duty - xkcd
  • 7. Why Monitor? • Hardware dies • DDoS attacks • Set it and forget it mentality • Internet connection • Security programs
  • 8. Hosting / XaaS • Reliability • Security • Multi-tenant architecture • Scalability • Uptime
  • 9. The Pillars of IT Security Integrity
  • 10. Types of Clustering Solutions • Hardware redundancy • SAN solutions • NAS boxes • External hard drives or JBODs • Software Solutions
  • 11. Recovery Time/Point Objectives
  • 12. What is RAID? Is it enough?
  • 13. RAID Microsoft Library http://msdn.microsoft.com/en-us/library/aa226166(v=sql.70).aspx#sql7perftune_moreinfo
  • 14. What Could Go Wrong • Your shiny new hardware will fail • Single points of failure are dangerous • Dropped alerts • Internet outage • Power outage
  • 15. • Easy SAN/NAS to implement - high cost per TB • Large SLAs - quality of technicians • Management via GUI • Scalable - with the right packages • SAN maintenance - learning curve • Off-site replication is expensive
  • 16. Single Point of Failure SAN NFS MySQL VMs
  • 17. Pitfalls • High initial and ongoing costs • Vendor lock-in is required • Ongoing worry of voiding the warranty • Maintenance is tricky and ongoing • It is a black box, typically Solaris based • Cannot add or remove features • It is still a single point of failure
  • 18. Software Only Solutions Things to look for: • Synchronous or Asynchronous replication • Stability / maturity • Time to recovery • Chance of data loss • Onsite / offsite • Is it real time (live) or snapshots
  • 19. Asynchronous Architecture
  • 20. Synchronous Architecture Secondary
  • 21. Layer Cake of Replication • Virtualization • Application • File system • Object store • Block layer http://images.pinkcakebox.com/cake696.jpg
  • 22. Cluster Cake Fail
  • 23. Common Issues / Pitfalls • File locking • Network congestion • Data consistency / data corruption • High overhead and/or additional CPU cycles • Asynchronous or even backup based • Require ongoing licensing and royalties
  • 24. DRBD • Completely hardware and application agnostic • German engineering • In development since 2001 • Created by LINBIT founder and CEO Philipp Reisner • DRBD built into the native Linux kernel as of 2.6.33 • Ships in all major Linux distributions • Does not void RHEL or Oracle support
  • 25. DRBD Users
  • 26. A DRBD Cluster Stack LAN Server High Speed NIC Replication Network High Speed RAID-Controller RAID shared nothing Storage
  • 27. Fully Redundant System Active Passive Storage 1 Storage 2 MySQL.com
  • 28. Fully Redundant System Storage 1 Storage 2 Active Passive Passive Active MySQL.com Ausweb.com
  • 29. Heartbeat/Corosync: The Comm Layer • These are the communication tools of the cluster • “Are you dead?” • “Are you alive?” • Heartbeat is seasoned and stable (reliability = HA) • Corosync is newer and under development
  • 30. Pacemaker The Linux Cluster Resource Manager • The powerful and bossy cluster manager • Manages all aspects of the system • Decides who is alive and primary • Well known • Widely deployed • Does not require applications to have specific plugins
  • 31. Pacemaker : Sleep All Night • It lets you sleep through the night even if there’s a failure. • Highly Configurable • Used with a number of clustering tools / File Systems • Very powerful if done well, disastrous if done wrong
  • 32. Linux HA Stack
  • 33. Disaster Recovery / Offsite Replication • True Disaster Recovery happens live • Interval based snapshots no longer meet today’s SLA requirements • DRBD does real-time replication on-site and off-site • DRBD Proxy tool mitigates throughput constraints and latency - highly configurable
  • 34. Real-time Disaster Recovery
  • 35. Scaling DRBD • DRBD Proxy is typically deployed in 3 node configurations. • Extremely configurable • Proxy mitigates bandwidth constraints and latency • Can replicate across 4 machines even across distances
  • 36. 3 node HA / DR Proxy Proxy Proxy Location A Live Site Location B DR Site
  • 37. 4 node DR + Active-Active HA Proxy Proxy Proxy Proxy Location A Live Site Location B DR Site
  • 38. Dedicated Proxy-­‐Many Resources Proxy Proxy Location A Live Site Location B DR Site Dedicated Server Dedicated Server
  • 39. How to apply this in your cloud • DRBD works in the cloud and AWS VPC • On native bare hardware or as part of your hardware or software appliance • DRBD can be used as backing storage for iSCSI http://www.gamesparks.com/wp-content/uploads/2013/07/the-cloud.jpg
  • 40. HA with Nagios! • Filesystem (which has many symlinks in it) • MySQL • PostgreSQL • Crond • Ndo2db • The Nagios application itself • A Virtual IP
  • 41. Q+A Jeremy Rust Jeremy@linbit.com @NerdHacker 877-DRBD247 www.linkedin.com/in/RustJeremy DRBD.org Linbit.com Linux-HA.org
  • 42. DRBD 9 the future
  • 43. DRBD 8 Branch build structure
  • 44. DRBD 9 Branch build structure
  • 45. 2 Full redundant systems Storage 1 Storage 2 Active Passive Passive Active MySQL.com Ausweb.com
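
Slides 18 to 20 contrast synchronous and asynchronous replication and list "chance of data loss" as something to look for. The toy model below is a rough sketch of that trade-off, not DRBD's actual implementation: the class names, the `drain` step, and the `order-1`/`order-2` payloads are all hypothetical and exist only for illustration. It assumes a two-node setup where a synchronous write is acknowledged only once the secondary has the data, while an asynchronous write is acknowledged immediately and shipped later.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A storage node holding an ordered log of written blocks (hypothetical)."""
    name: str
    blocks: list = field(default_factory=list)


class ReplicatedVolume:
    """Toy two-node volume; `synchronous` selects the acknowledgement rule."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = Node("primary")
        self.secondary = Node("secondary")
        # Writes already acknowledged to the application
        # but not yet stored on the secondary.
        self.in_flight = []

    def write(self, data: str) -> None:
        self.primary.blocks.append(data)
        if self.synchronous:
            # Synchronous: return only after the peer also has the data.
            self.secondary.blocks.append(data)
        else:
            # Asynchronous: return at once; the peer catches up later.
            self.in_flight.append(data)

    def drain(self) -> None:
        """Ship queued writes to the secondary (background replication)."""
        self.secondary.blocks.extend(self.in_flight)
        self.in_flight.clear()

    def fail_primary(self) -> list:
        """Simulate losing the primary; return acknowledged writes that are gone."""
        lost = list(self.in_flight)
        self.in_flight.clear()
        return lost


def demo(synchronous: bool) -> list:
    """Write twice, lose the primary before the second write is shipped."""
    vol = ReplicatedVolume(synchronous)
    vol.write("order-1")
    vol.drain()
    vol.write("order-2")  # the primary dies before the next drain
    return vol.fail_primary()


if __name__ == "__main__":
    print("sync lost:", demo(True))
    print("async lost:", demo(False))
```

In this model the synchronous volume loses nothing on failover, while the asynchronous one loses whatever was acknowledged since the last drain. The price of the synchronous mode, which the sketch does not model, is that every write waits on the replication link.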
