Nagios Conference 2011 - Mike Guthrie - Distributed Monitoring With Nagios

Mike Guthrie's presentation on distributed monitoring solutions for Nagios. The presentation was given during the Nagios World Conference North America held Sept 27-29th, 2011 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna
Published on: Mar 3, 2016
Published in: Technology      
Source: www.slideshare.net


Transcripts - Nagios Conference 2011 - Mike Guthrie - Distributed Monitoring With Nagios

  • 1. Distributed Monitoring with Nagios: Past, Present, Future Mike Guthrie [email_address]
  • 2. Distributed Monitoring Introduction <ul><li>Basic Definition: Splitting up your monitoring server over multiple machines
  • 3. Why use distributed monitoring? </li><ul><li>Multiple sites with firewall restrictions
  • 4. Large installations that exceed the CPU and memory resources that a single machine can offer. </li></ul></ul>
  • 5. Understanding CPU Limitations <ul><li>The primary task of the Nagios Core engine is to schedule checks
  • 6. Example Monitoring Server </li><ul><li>1000 hosts, 4 services per host, 5-minute interval
  • 7. Check load = ( 5000 checks / 5 min ) / 60 seconds </li><ul><li>About 16.7 checks per second
  • 8. In 1 second: About 16 scripts or binary processes are being launched, with about 16 sets of results coming in and being processed by Nagios and written to disk.
  • 9. When the check schedule exceeds CPU limitations, you get “check latency” </li></ul></ul></ul>
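The check-load arithmetic from the example server can be reproduced with a short Python sketch. The function name is hypothetical, and the breakdown of the 5000 checks into 1000 host checks plus 4000 service checks is an assumption; the slides give only the total.

```python
def checks_per_second(hosts: int, services_per_host: int, interval_minutes: float) -> float:
    """Average number of checks that must be launched every second."""
    # Assumption: each host contributes one host check plus its service
    # checks, which gives the slide's total of 5000 (1000 + 1000 * 4).
    total_checks = hosts + hosts * services_per_host
    return total_checks / (interval_minutes * 60)


rate = checks_per_second(hosts=1000, services_per_host=4, interval_minutes=5)
print(f"{rate:.1f} checks per second")
```

Doubling the host count or halving the interval doubles the rate, which is a quick way to estimate when a single server will start to show check latency.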
  • 10. Picking the Right Distributed Model <ul><li>Pick the right model for your environment
  • 11. Think logistics: PLAN before implementation </li><ul><li>Every hour spent in planning logistics will save tens or even hundreds of man hours later on
  • 12. A 30-minute task on 1 server = 5 hours on 10 servers.
  • 13. Consider how to effectively view information across multiple machines
  • 14. As data quantity increases, discerning useful information from it becomes more important
  • 15. Viewing 10,000 hosts and 50,000 services on a page is too much raw data to be effective information </li></ul></ul>
  • 16. The Classic Distributed Model [Diagram: distributed servers run active checks and forward results after every check to a central server (passive only)]
  • 17. The Classic Distributed Model
  • 18. The Classic Distributed Model <ul><li>Central Monitoring vs Central Viewing? </li><ul><li>OCSP vs Event Handlers
  • 19. OCSP runs after every check
  • 20. Event handlers run only on state changes </li></ul><li>Freshness checking ensures current data
  • 21. Child servers can also do local monitoring without forwarding results
  • 22. Distributed servers can also receive passive checks and forward them along, creating a multi-level tree structure </li></ul>
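The difference between forwarding with OCSP and forwarding with event handlers can be illustrated with a small Python model. This is a simplified sketch, not Nagios code: all function names are hypothetical, soft/hard state types and retries are ignored, and the freshness test is reduced to a plain age comparison.

```python
from typing import List


def ocsp_forwards(states: List[int]) -> int:
    """OCSP-style forwarding: one forward per completed check."""
    return len(states)


def event_handler_forwards(states: List[int], initial_state: int = 0) -> int:
    """Event-handler-style forwarding: one forward per state change.

    Simplification: soft/hard state types and retries are ignored.
    """
    forwards = 0
    previous = initial_state
    for state in states:
        if state != previous:
            forwards += 1
        previous = state
    return forwards


def is_stale(result_age_seconds: float, freshness_threshold_seconds: float) -> bool:
    """A passive result counts as stale once it is older than the threshold."""
    return result_age_seconds > freshness_threshold_seconds


# Hypothetical sequence of plugin return codes (0 = OK, 2 = CRITICAL).
history = [0, 0, 2, 2, 2, 0, 0]
print(ocsp_forwards(history))           # 7
print(event_handler_forwards(history))  # 2
```

In this model, OCSP-style forwarding delivers a result to the central server at every interval, so a stale result there suggests a broken link or a stopped child server. With event handlers, long gaps between results are normal, and any freshness threshold would have to allow for them.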
  • 23. The Classic Distributed Model <ul><li>Strengths: </li><ul><li>Well tested, well documented, proven solution
  • 24. All built into the Nagios Core package
  • 25. Extremely flexible for checks, performance graphing, notifications, etc.
  • 26. Can be combined with other distributed models </li></ul><li>Challenges: </li><ul><li>Maintaining configs on multiple machines
  • 27. Which server issued the check?
  • 28. Where to process/view performance data? </li></ul></ul>
  • 29. The Classic Distributed Model <ul><li>Workarounds: </li><ul><li>Use SVN, rsync, or cron to automatically maintain host and service configs on both distributed and central servers.
  • 30. Use templating as much as possible </li><ul><li>Read Core Docs on “Object Inheritance”
  • 31. Keep template definitions separate </li></ul><li>Use naming conventions to keep configs organized
  • 32. Nagios XI distributed tools: </li><ul><li>Inbound and Outbound Checks
  • 33. Unconfigured Objects </li></ul></ul></ul>
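The value of templating for multi-server configs can be seen in a small Python model of object inheritance. This is a hedged sketch only: it follows a single `use` chain in which local values override inherited ones, and it ignores multiple templates, `register 0`, and the other rules covered in the Core docs. The object names and values are hypothetical.

```python
from typing import Dict


def resolve(name: str, objects: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Merge an object with its template chain; local values win.

    Simplification: single inheritance only, no cycle detection.
    """
    definition = objects[name]
    parent = definition.get("use")
    inherited = resolve(parent, objects) if parent else {}
    merged = {**inherited, **definition}
    merged.pop("use", None)  # the template reference is not an attribute
    return merged


# Hypothetical definitions: two templates and one concrete service.
objects = {
    "generic-service": {"check_interval": "5", "max_check_attempts": "3"},
    "remote-service": {"use": "generic-service", "check_interval": "10"},
    "web01-http": {"use": "remote-service", "check_command": "check_http"},
}

resolved = resolve("web01-http", objects)
print(resolved)
```

Because the concrete service carries only what differs from its templates, a change to a shared setting touches one template file, which is the file that then needs to be synced to every server.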
  • 34. The Cluster Model – Nagios Load Balancing <ul><li>Nagios checks are managed by a sub-process and distributed evenly across multiple servers
  • 35. Works like a load balancer
  • 36. Two Popular Examples: </li><ul><li>DNX: Distributed Nagios eXecutor
  • 37. Mod Gearman </li></ul><li>Check results and configs are all managed at the central server </li></ul>
  • 38. The Cluster Model – DNX
  • 39. The Cluster Model – DNX <ul><li>DNX: How it works </li><ul><li>When a check is scheduled to execute, the job is passed to a worker node
  • 40. The worker node executes the check and sends the results directly to the results queue
  • 41. Checks are not associated with any particular worker node
  • 42. Bypasses the nagios.cmd pipe to eliminate a potential bottleneck
  • 43. If a worker goes down, all checks continue </li></ul></ul>
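The idea that checks are not tied to a particular worker, and keep running when a worker goes down, can be sketched in Python. This is not DNX code: round-robin assignment is a stand-in chosen for illustration, since the slides do not describe the actual scheduling, and all names are hypothetical.

```python
from itertools import cycle
from typing import Dict, Iterable, List, Set


def dispatch(checks: Iterable[str], workers: List[str], down: Set[str]) -> Dict[str, List[str]]:
    """Hand each check to the next available worker (round robin)."""
    alive = [w for w in workers if w not in down]
    if not alive:
        raise RuntimeError("no worker nodes available")
    assignments: Dict[str, List[str]] = {w: [] for w in alive}
    # zip stops when the checks run out; cycle keeps supplying workers.
    for check, worker in zip(checks, cycle(alive)):
        assignments[worker].append(check)
    return assignments


checks = [f"check-{i}" for i in range(6)]
workers = ["worker-a", "worker-b", "worker-c"]

all_up = dispatch(checks, workers, down=set())
one_down = dispatch(checks, workers, down={"worker-b"})
print({w: len(jobs) for w, jobs in all_up.items()})
print({w: len(jobs) for w, jobs in one_down.items()})
```

With one worker marked as down, the same six checks are still assigned, spread across the two remaining workers.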
  • 44. The Cluster Model – DNX <ul><li>DNX: Strengths: </li><ul><li>Central configuration management
  • 45. Checks redistributed if a worker is down
  • 46. Worker nodes can be added at any time </li></ul><li>Challenges: </li><ul><li>Performance data is still handled at the central server
  • 47. If the master goes down, all checks cease </li></ul></ul>
  • 48. The Cluster Model – Mod Gearman
  • 49. The Cluster Model – Mod Gearman <ul><li>Strengths: </li><ul><li>Central configuration management
  • 50. Checks can be split by hostgroups or servicegroups, which can come in useful if groups are located in different network segments </li></ul><li>Challenges: </li><ul><li>Performance data is still handled at the central server
  • 51. If the master goes down, all checks cease
  • 52. Effectively viewing more than 10k services on a single machine </li></ul></ul>
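Splitting checks by hostgroup or servicegroup amounts to choosing a queue per check, so that workers in a given network segment only pick up the jobs meant for them. The Python sketch below is a model of that routing decision, not Mod Gearman code. The function name, the queue names, and the rule that servicegroups are consulted before hostgroups are all assumptions made for illustration.

```python
from typing import Iterable, Set


def pick_queue(host_groups: Iterable[str], service_groups: Iterable[str],
               routed_hostgroups: Set[str], routed_servicegroups: Set[str],
               default: str = "service") -> str:
    """Choose the queue a check is placed on, based on group membership."""
    # Assumption: servicegroup routes are consulted before hostgroup routes.
    for group in service_groups:
        if group in routed_servicegroups:
            return f"servicegroup_{group}"
    for group in host_groups:
        if group in routed_hostgroups:
            return f"hostgroup_{group}"
    # Checks without a routed group go to the shared default queue.
    return default


# Hypothetical setup: hosts in the "dmz" hostgroup get a dedicated queue.
print(pick_queue(["dmz"], [], routed_hostgroups={"dmz"}, routed_servicegroups=set()))
print(pick_queue(["lan"], [], routed_hostgroups={"dmz"}, routed_servicegroups=set()))
```

A worker placed inside the DMZ would then listen only on the DMZ queue, while general-purpose workers handle the default queue.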
  • 53. The Central Dashboard Model <ul><li>Checks are executed and managed on multiple distributed servers
  • 54. Central viewer unifies all servers
  • 55. Central viewer polls data from each server and displays tactical data in the UI
  • 56. Examples: </li><ul><li>Nagios Fusion
  • 57. MNTOS
  • 58. check_MK Multisite </li></ul></ul>
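The central viewer's job of combining tactical data from several servers can be sketched as a simple aggregation in Python. This is an illustrative model, not how any of the listed products work: the polling step is replaced by a hard-coded dictionary, and the server names, state names, and counts are hypothetical.

```python
from collections import Counter
from typing import Dict


def tactical_overview(server_status: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sum per-state service counts across all polled servers."""
    totals: Counter = Counter()
    for counts in server_status.values():
        totals.update(counts)  # adds counts key by key
    return dict(totals)


# Stand-in for data polled from each distributed server.
polled = {
    "nagios-east": {"ok": 4200, "warning": 35, "critical": 12},
    "nagios-west": {"ok": 3900, "warning": 18, "critical": 4, "unknown": 2},
}

overview = tactical_overview(polled)
print(overview)
```

Only the summarized counts reach the central viewer in this model, which reflects why a dashboard of this kind needs little CPU compared with the servers that run the checks.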
  • 59. The Central Dashboard Model
  • 60. The Central Dashboard Model: Nagios Fusion <ul><li>Displays tactical overview for each server
  • 61. Monitoring and object configurations compartmentalized to each server
  • 62. Good for geographically distributed servers where local management is required
  • 63. Unified login for all XI servers (basic auth still required for Core machines) </li></ul>
  • 64. The Central Dashboard Model: Nagios Fusion <ul><li>Strengths: </li><ul><li>Easy to add new servers
  • 65. User-level control of server views
  • 66. High level overview
  • 67. Very little CPU usage
  • 68. Commercial solution with support </li></ul><li>Challenges: </li><ul><li>Not a monitoring solution by itself
  • 69. Free 60 day trial, requires a license </li></ul></ul>
  • 70. The Central Dashboard Model: Nagios Fusion
  • 71. The Central Dashboard Model: MNTOS
  • 72. The Central Dashboard Model: Multisite
  • 73. Single Server – Distributed Parts <ul><li>Not all environments require check distribution </li><ul><li>Offload NDOUtils (DB backend) to a different machine
  • 74. Offload performance data processing to a different machine
  • 75. Mount disk I/O-intensive files on a RAM disk
  • 76. A Nagios Core install can run between 10k and 20k checks, depending on what is being checked and how it is configured </li></ul></ul>
  • 77. Where To Go From Here? <ul><li>Future of Distributed Monitoring? </li><ul><li>Improved information viewing instead of just raw data
  • 78. Aggregated reporting and statistics
  • 79. Business process views and monitoring
  • 80. What do you, as admins, need to see in this area of software development? </li></ul></ul>
  • 81. Conclusion <ul><li>Pick the right setup for your environment
  • 82. Any of these models can be mixed and combined
  • 83. PLAN before implementation: </li><ul><li>Plan for efficient maintenance
  • 84. One environment with 250k services overseen by a single server took almost an entire year of planning and implementation to get right
  • 85. Environments can scale even larger with the right logistics planning in place </li></ul></ul>
  • 86. Conference Resources <ul><li>Daniel Wittenberg: “Scaling Nagios At A Giant Insurance Company” @2pm Thursday </li><ul><li>35,000 hosts and 1.4 million services </li></ul><li>Mike Weber: “Reducing Server Load with Mod Gearman” @10:30am Friday
  • 87. Dave Williams: Author of DNX </li></ul>