Failover and High Availability
Solutions for Nagios XI
Andy Brist
abrist@nagios.com

Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions

Andy Brist's presentation on High Availability and Failover Solutions for Nagios XI. The presentation was given during the Nagios World Conference North America held Oct 13th - Oct 16th, 2014 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/conference
Published on: Mar 3, 2016
Published in: Technology      
Source: www.slideshare.net


Transcripts - Nagios Conference 2014 - Andy Brist - Nagios XI Failover and HA Solutions

  • 1. Failover and High Availability Solutions for Nagios XI Andy Brist abrist@nagios.com
  • 2. Introduction • Who am I? • Nagios Support Team Manager • Team Lead for Nagios-Plugins (github.com/nagios-plugins)
  • 3. Disclaimer • Every environment is different • Failover/HA is, by nature, a customized solution • My case studies are not your production environments • I know Nagios/XI, not your SLA • Test in a lab. First.
  • 4. Agenda ● Short overview of the different failback/failover solutions ● Nagios XI Data Locations and other files/services relevant to failover scenarios. ● Snapback ● Failback ● Failover ● HA? Failover ● Observations, Considerations
  • 5. Backup (snapback) ● Restore VM snapshot or spin up a new instance and restore a backup ● Most common implementation ● Easiest of all options ● Most potential downtime of scenarios ● Maximum historical and configuration data lost = the interval between snapshots ● Requires manual intervention
  • 6. Automated XI Backups ● XI provides a method for scheduled backups through the "Scheduled Backups Component" – ssh – ftp – local fs ● Useful for remote backups or manual failback
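Where the component is not an option, the same effect can be scripted by hand. A cron fragment sketch, assuming XI's bundled `backup_xi.sh` script writes to `/store/backups/nagiosxi` (the default on stock installs; verify on yours) and using "standby" as a placeholder hostname:

```shell
# /etc/cron.d/nagiosxi-backup -- sketch; schedule, paths and the
# "standby" hostname are placeholders to adapt to your environment.
# 02:00 nightly: take a full XI backup, then push the backup dir off-box.
0 2 * * * root /usr/local/nagiosxi/scripts/backup_xi.sh && rsync -a /store/backups/nagiosxi/ standby:/store/backups/nagiosxi/
```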
  • 7. Failback
  • 8. Failback ● Secondary is periodically updated from an XI backup. ● The nagios process is started by hand when the master has an issue. ● Cronjob on the secondary restores newest backup once a day. ● If unconcerned with historical data and mrtg performance data, just push/restore the object configs and sql dumps (if not offloaded) ● Not to be confused with snapback as this is a separate, different instance/image, not just a previous state of the failed instance.
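The "cronjob on the secondary restores newest backup once a day" step can be sketched as below; `BACKUP_DIR` and the `restore_xi.sh` path are assumptions to check against your own install.

```shell
#!/bin/sh
# restore_latest.sh -- daily cron job on the standby (sketch).
# Restores the newest backup tarball the primary pushed over.
BACKUP_DIR=${BACKUP_DIR:-/store/backups/nagiosxi}
# ls -t sorts by mtime, newest first; head takes the most recent tarball.
LATEST=$(ls -t "$BACKUP_DIR"/*.tar.gz 2>/dev/null | head -n 1)
if [ -n "$LATEST" ]; then
    /usr/local/nagiosxi/scripts/restore_xi.sh "$LATEST"
else
    echo "no backup found in $BACKUP_DIR" >&2
fi
```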
  • 9. Additional Considerations ● Easy to implement with the “Scheduled Backups” XI component. ● Agents must maintain 2+ allowed hosts ● SNMP traps must be configured to push to 2+ hosts ● May experience substantial downtime if the backup is large and the primary fails during a data restore on the secondary.
  • 10. Failover ● Difficult to get right ● Demanding on I/O resources and network speed ● Very little to no loss of historical data ● Minimal downtime ● Fully automated ● Can provide minimal clustering for XI services through “High Availability”
  • 11. Failover
  • 12. Nagios XI ● Object Configuration ● Check Status ● Object State ● Program State ● Historical State Data ● Performance Data
  • 13. Nagios XI - Services nagios – Monitoring engine mysql – Object configuration and ndo historical data ndo2db – Writes historical data to mysql database postgresql – Nagios XI settings/user database npcd – Performance data daemon crond – Task scheduler httpd – Web server
  • 14. XI Data and Redundancy Absolute minimum redundant data required for any failover scenario: ● (Working) Object configuration ● Mysql 'nagiosql' database ● Postgresql 'nagiosxi' database
  • 15. Full Check Redundancy Additional requirements for full check redundancy: ● mrtg config and RRDs (for bandwidth checks) ● nagios libexec folder (plugins) Any additional dependencies for plugins. For example: ● VMWare SDK ● Oracle Perl Library ● Java JRE
  • 16. Runtime State Redundancy Additional requirements for runtime state redundancy: ● retention.dat (state, runtime options, acknowledgments, notification depth) ● NDO mysql database "nagios"
  • 17. Historical Redundancy Additional Data required for complete historical redundancy: ● nagios.log and archives directory ● perfdata RRDs ● mrtg config and RRDs ● NDO mysql database "nagios"
  • 18. XI Data Summary Logs/archives Perfdata Mrtg/configs Databases Object configs Plugins
  • 19. XI Data Summary /usr/local/nagios/var/nagios.log /usr/local/nagios/var/archives/ /usr/local/nagios/share/perfdata/ /var/lib/mrtg/ /etc/mrtg/ /var/lib/pgsql/ /var/lib/mysql/ /usr/local/nagios/etc/ /usr/local/nagios/libexec/ /usr/local/nagiosxi/
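The paths above map directly onto the rsync option from the shared-storage slide. A minimal sketch, with "standby" as a placeholder hostname; the live database directories (`/var/lib/mysql/`, `/var/lib/pgsql/`) are deliberately left out, since copying them while the daemons run risks corrupt replicas — ship SQL dumps instead.

```shell
#!/bin/sh
# sync_to_standby.sh -- push XI data to a warm standby with rsync (sketch).
STANDBY=${STANDBY:-standby}
for d in /usr/local/nagios/var/ \
         /usr/local/nagios/share/perfdata/ \
         /var/lib/mrtg/ /etc/mrtg/ \
         /usr/local/nagios/etc/ \
         /usr/local/nagios/libexec/ \
         /usr/local/nagiosxi/; do
    # -a preserves perms/times/ownership; --delete mirrors removals too.
    rsync -a --delete "$d" "$STANDBY:$d"
done
```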
  • 20. High Availability? 1. Elimination of single points of failure. 2. Reliable crossover/failover. 3. Detection of failures as they occur.
  • 21. High Availability? Why would you need it? ● Least amount of downtime ● (limited) Service clustering ● Shared volumes solve the issues with syncing historical data in redundant configurations
  • 22. High Availability/Failover Major components: ● Shared storage ● Virtual IP ● Management applications/scripts
  • 23. Shared Storage ● DRBD – block level replication, part of the linux kernel, well supported and understood. Works well for all XI data types (including RRDs/DBs) ● NFS – Fine option, just make sure the NFS share does not have an i/o latency issue or your checks WILL get behind. Do not mount the volume on more than one server at a time, to avoid writing duplicate checks in the case of a partial failover. ● Replicated DBs – Fine solution, clusters well. Use DNS or virtual IPs to control access to the databases. ● rsync – Not immediate replication, but close. Easy to implement. ● GlusterFS – More problematic to set up, but good for offloaded mrtg/RRDs
  • 24. DRBD ● Active/passive suggested ● Low latency storage ● Active mount should move with the vip ● Refer to Jeremy Rust's presentation notes for more information
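For orientation, a minimal active/passive resource definition in DRBD 8.x style; the hostnames, backing disk, and addresses are placeholders, and real deployments should follow the DRBD user's guide for their version.

```
# /etc/drbd.d/nagios.res -- sketch; hostnames, disk and IPs are placeholders
resource nagios {
    protocol C;                 # synchronous replication
    device    /dev/drbd0;
    disk      /dev/sdb1;
    meta-disk internal;
    on xi-primary {
        address 192.0.2.10:7789;
    }
    on xi-secondary {
        address 192.0.2.11:7789;
    }
}
```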
  • 25. Virtual IP ● pacemaker vip script ● Custom ifconfig/ip shell scripts ● uCarp Scripts ● keepalived
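The "custom ifconfig/ip shell scripts" option can be as small as the sketch below; the address, prefix, and interface are illustrative assumptions, and it must run as root on the node taking over.

```shell
#!/bin/sh
# vip-up.sh -- claim the virtual IP on takeover (sketch).
VIP=${VIP:-192.0.2.50/24}
IFACE=${IFACE:-eth0}
# Bind the VIP as a secondary address on the interface.
ip addr add "$VIP" dev "$IFACE"
# Gratuitous ARP so switches and peers update their caches immediately
# (${VIP%/*} strips the prefix length, leaving the bare address).
arping -U -c 3 -I "$IFACE" "${VIP%/*}"
```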
  • 26. HA Failover Management ● Pacemaker/Heartbeat (the HA stack) ● uCarp scripts ● keepalived scripts Custom Scripts: ● nagios itself – Event handler driven ● cron – Job that checks the master for connectivity. Reuse the check_icmp or check_http plugins for this purpose.
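The cron-driven check of the master mentioned above can be sketched like this; the primary's IP, the plugin path, and the three-strikes threshold are illustrative assumptions.

```shell
#!/bin/sh
# check_master.sh -- run every minute from cron on the standby (sketch).
PRIMARY=${PRIMARY:-192.0.2.10}
CHECK=${CHECK:-/usr/local/nagios/libexec/check_icmp}
FAILS=0
# Probe three times, a few seconds apart, counting failures.
for try in 1 2 3; do
    "$CHECK" -H "$PRIMARY" >/dev/null 2>&1 || FAILS=$((FAILS + 1))
    sleep 5
done
# Take over only after three consecutive failures, so one dropped
# packet does not trigger a failover.
if [ "$FAILS" -ge 3 ]; then
    service nagios start
fi
```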
  • 27. Extra Considerations ● STONITH ● Clustering? ● DRBD/Shared Storage ● High Latency HA ● NDO/Databases ● Recovery
  • 28. STONITH (shoot the other node in the head) ● Mechanism by which a failing server is guaranteed to be removed from the cluster ● Not required, but advised ● Hardware (including UPS) and software (VMware STONITH “device” and shell scripts) ● Only failing over when the primary is unreachable is safest ● Beware of overzealous failover conditions as they can lead to a . .
  • 29. Deathmatch! No, really. Stonith gives your servers the ability to KILL THEMSELVES and FRIENDS ● Beware of services whose init actions/failures should not cause failover/stonith ● Any actions requiring a shared volume in active/passive mode should not immediately cause failover due to potential latency during volume mounts ● Test, test, test the disaster scenarios in a LAB first or the fragfest may include your job!
  • 30. Clustering/Fencing ● A number of portions of Nagios Core and Nagios XI are clusterable. Processes that can potentially be clustered: – offloaded postgresql – offloaded mysql/ndo2db – offloaded mrtg ● Services that are dependent on the core monitoring engine and filesystem and should not be clustered: – nagios, npcd, cronjobs – httpd – snmptrapd, snmptt
  • 31. DUAL DRBD Primary ● Disconnecting from the master before mounting of the shared volume during failover is no longer needed. ● Careful implementation allows multiple servers to concurrently access the shared volume. Potentially useful for ambitious clusters and shared historical records. ● Slower, as the “secondary” can lock blocks. ● More prone to “split-brains” ● Usually requires clustered file systems
  • 32. High Latency HA ● Problematic if the HA solution was not designed for potential high latency ● Will potentially cause i/o wait issues ● It may be better to push checks to a central server(s) with NRDP/outbound checks/etc, keeping HA solutions local, or to pay for a faster pipe. ● DRBD Proxy – A good solution if high latency HA is a must – uses an asynchronous buffer for block writes to the secondary volumes (does not support dual primary)
  • 33. NDO Considerations ● Enforce single ndo instance access to mysql ● If multiple ndo processes must connect to a single ndo db, consider using ndo db instances ● You can control ndo's access to the mysql server through iptables and the vip. ● Offload ndo2db to the offloaded mysql server ● Configure ndomod to connect through a TCP socket. This can potentially decrease load on the nagios server.
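One way to enforce the single-writer rule above is to firewall MySQL so that only whichever node holds the VIP can reach it. A firewall fragment sketch for the offloaded database server, with 192.0.2.50 as a placeholder VIP:

```shell
# On the offloaded mysql server: accept 3306 only from the virtual IP,
# drop everyone else. Persist the rules with your distribution's
# iptables-save mechanism.
iptables -A INPUT -p tcp --dport 3306 -s 192.0.2.50 -j ACCEPT
iptables -A INPUT -p tcp --dport 3306 -j DROP
```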
  • 34. Database Considerations ● Initiating failover due to crashed DBs may cause a deathmatch as all nodes will fail (due to their shared nature) ● Offload both postgresql and mysql databases. Requires a virtual ip or careful management of DNS. ● XI has scripts to repair the databases, use them!
  • 35. Recovering from Failover ● Degraded ex-primaries should not be added back to the cluster automatically. Doing so may cause split brains. ● Split brains REQUIRE manual intervention if preservation of historical data is desired. ● Stonith Deathmatches – Have a primary image/instance without stonith enabled for recovery ● Maintain an ultimate disaster recovery server instance/image outside of the cluster pool for when all else has failed.
  • 36. A Plea from Nagios Support ● Failover/HA != backups ● Test, test, TEST! Use your lab please. ● Document. Everything. The biggest barrier and largest hurdle for support are unknown, undocumented, non-standard configurations. Failover/HA deployments definitely qualify.
  • 37. Final Comparisons ● Snapback: Easy. Slow recovery. Requires manual intervention. Highest potential historical loss. ● Failback: Intermediate. Moderate recovery. Can be automated. Less historical loss. ● Failover: Difficult. Fast recovery. Fully automated. Nearly no historical loss. ● High Availability: Difficult. Fast recovery. Automated. Redundancy across WAN links. Limited clustering. Least potential downtime. Multiple potential issues with split-brain, stonith/deathmatches and latency, so care should be given, and scenarios tested.
  • 38. Food for thought . . . . ● HA in a federated model . . . . . . . .
  • 39. Final Questions For You ● How much of Nagios XI, or Core, can truly be set up to be "HA"? Do you care? :P ● Do you need HA/failover, or will failback/snapback suffice? ● Is the time trade off in your environment worth it?
  • 40. Questions for Me? Any questions? (common/critical answers noted below for the sake of efficiency) ● 11 meters/sec (unladen European swallow) ● 42 ● The Prime Directive ● 3 Times ● The Categorical Imperative/Pragmatism (choose 1) ● No.* ● Evasive Subjunctive ● . . . Yes?
  • 41. The End Andy Brist abrist@nagios.com
