
Nagios Conference 2011 - William Leibzon - Nagios In Cloud Computing Environments

William Leibzon's presentation on using Nagios in a cloud computing environment. The presentation was given during the Nagios World Conference North America held Sept 27-29th, 2011 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna
Published on: Mar 3, 2016
Published in: Technology      
Source: www.slideshare.net


Transcripts - Nagios Conference 2011 - William Leibzon - Nagios In Cloud Computing Environments

  • 1. Nagios and Cloud Computing
      Presentation by William Leibzon ( [email_address] ). Thanks for being here!
      Nagios 2011 Conference in Saint Paul, Minnesota
  • 2. Cloud Computing
      - What is Cloud Computing? Virtualized systems that are independent of the underlying hardware and leased to various customers, in what is referred to as Infrastructure as a Service
      Image courtesy of thetechlabs.com
  • 3. Virtualization and Cloud Computing
      - Virtualization
          - Separates hardware from user software, so either one can be upgraded independently of the other
          - Efficient use of modern multi-core processors
          - Micro-kernel design is simpler and easier to support
      - More Servers with Less Hardware
          - Unused system resources can be utilized by other types of servers with different resource usage
          - Less energy, more power-efficient use of resources
          - Less rack space in expensive datacenters
      - Virtualization is the core of Cloud Computing
  • 8. Cloud Computing Architecture
      - Virtualized Systems in a Cloud
          - Can be managed entirely remotely
          - Can be moved (even live) from one hardware host to another
          - Can be shut down, saved to disk and started again when required
          - Can be easily cloned so that another identical system is started exactly when it is needed
      The cloud makes it possible to automate scaling up of infrastructure to handle peak traffic load and scaling down afterwards to keep overall cost low
      - This requires monitoring of all system resources!
  • 12. Cloud Solutions and Vendors
      - Hypervisors (Virtualization Kernels):
          - Commercial: VMware ESX, IBM z/VM, Microsoft VirtualPC
          - Open-Source: Xen, KVM, OpenVZ, QEMU, VirtualBox
          - Xen originally implemented paravirtualization, which required a modified OS and limited it to Linux. KVM and the newer Xen-HVM can do full virtualization, but require QEMU and CPU virtualization extensions (Intel's VT or AMD's SVM)
      - Virtualization and Cloud Software Suites
          - Commercial: VMware vCloud, Microsoft Azure
          - Open-Source: Eucalyptus, OpenNebula, OpenStack, Baracus
          - Commercial based on Open-Source: Citrix XenServer, Oracle VM, Ubuntu Enterprise Cloud, Red Hat CloudForms, Parallels Virtuozzo
      - Cloud Infrastructure providers
          - Amazon EC2 (modified Xen), Rackspace (Xen), Linode (Xen), Savvis (VMware), many more...
  • 17. Open-Source Cloud Software
      - Open-Source Hypervisors used in Cloud Systems
          - Xen - http://www.xen.org/
          - KVM - http://www.linux-kvm.org/
          - OpenVZ - http://www.openvz.org/
      - Open-Source Cloud Management Software
          - Eucalyptus - http://open.eucalyptus.com/
          - OpenNebula - http://www.opennebula.org/
          - OpenStack - http://www.openstack.org/
          - Baracus - http://baracus-project.org/
          - Proxmox - http://pve.proxmox.com/
  • 24. Monitoring for the Cloud
      - Monitoring of hardware (host OS) & hypervisor
          - More static, hardware does not change as often
          - Monitoring of system resources is often integrated into the virtualizer, and the info is not available to the cloud customer
      - Monitoring of virtual systems
          - Dynamic, should be able to handle addition and removal of server instances
          - Focus on application and network performance
          - Ideally should monitor utilization and be able to launch new server instances (auto-scaling)
          - Monitoring system should itself be robust and handle more servers without impacting performance
  • 29. Cloud Monitoring Architecture
      - Horizontal Scaling
          - Clouds can be as small as 10 servers and as large as 10,000+. When developing the architecture, you need to support its future growth from the start.
      - Scaling on Demand
          - A pro-active system should handle big changes in the number of cloud instances. You may have 2 webserver instances at 6am and grow to 20 at 10pm.
      - High Availability
          - Good system design should be fully fault-tolerant, and the application as a whole should continue to function without interruption if any one server instance dies
      This means a cluster!
  • 35. Nagios Cluster Options
      The base nagios-core package is for stand-alone monitoring, where one server does all service checks. It can be extended to a Nagios cluster with:
      - Passive Service Checks (Classic Distributed Model)
          - "Old Way": NSCA is used to forward results of checks from client servers to the main nagios server; not robust
      - Shared database (Central Dashboard Model)
          - NDO-Mod and Merlin projects implement this with a combination of NEB modules, daemon & database
      - Worker Nodes (Load Balancing of Checks)
          - DNX and Mod-Gearman do it with a combination of a loaded NEB module, server daemon & client servers
  • 41. Passive Service Checks
      - How
          - One central server holds all services; it does not run any checks itself and lists them all as passive
          - Separate client nagios servers run plugins and do checks for specific sets of hosts; each has its own subset of the full nagios config
          - Scripts are set up that capture results from each client host and send them to the central server using NSCA, which puts them into the nagios command queue
      - Advantages
          - This will work with any nagios server; organizations have been doing it since at least 2002
      - Disadvantages
          - Requires a lot of custom scripting to organize nagios configs. Not reliable if a server dies. Not robust enough to automate cloud instances being added and deleted
      [Diagram labels: NSCA, NSCA, Nagios Client Server, Nagios Client Server]
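The forwarding step on the passive-checks slide can be sketched in Python. This is only an illustration, not a script from the presentation: the function names are made up here. It formats one service result in the tab-delimited form that the send_nsca client reads from standard input, and in the PROCESS_SERVICE_CHECK_RESULT external-command form that ends up in the nagios command queue.

```python
import time


def nsca_line(host, service, return_code, output):
    """Format one service result as send_nsca expects it on stdin:
    host<TAB>service<TAB>return_code<TAB>plugin output<NEWLINE>."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0 (OK), 1, 2 or 3")
    # Tabs and newlines are field and record delimiters, so keep them
    # out of the free-text plugin output.
    clean = output.replace("\t", " ").replace("\n", " ")
    return f"{host}\t{service}\t{return_code}\t{clean}\n"


def external_command(host, service, return_code, output, now=None):
    """Format the same result as a Nagios external command line."""
    timestamp = int(time.time() if now is None else now)
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")
```

A client-side script would typically pipe the output of `nsca_line` into send_nsca; how results are captured on each client is site-specific and left out here.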
  • 48. Shared Database
      - Who: NDO-DB and Merlin
      - How
          - Multiple peer Nagios servers, each with a different config file specifying which services it checks
          - All servers use a common database to share the results of checks and the status of the services they are monitoring
      - Advantages
          - There is no master nagios server. There is a master DB server, but how to build a DB cluster is a better understood topic
          - Using NEB avoids slow command-queue processing
      - Disadvantages
          - Partitioning of the monitoring infrastructure among servers is still a manual process. It is not easy to use this for a dynamic cloud environment; however, it works very well for fault-tolerance
  • 56. DNX and Mod-Gearman Worker Nodes
      - How
          - As with Passive Service Checks, there is a central Nagios server that does not execute any plugins.
          - Unlike with Passive Checks, nagios does schedule the checks. After that, the NEB module takes over.
          - The module passes information on which plugin(s) to run to the DNX server (or the Gearman server for Mod-Gearman), which manages the worker nodes.
          - Worker nodes are separate servers, each running a special worker daemon. The daemon communicates with the management server and gets information (the plugin command) on what to run. It then passes results back to the management server, and the NEB module writes these results directly into nagios memory.
  • 60. Advantages of DNX and Mod-Gearman
      - Robust and Scalable
          - Checks are automatically distributed among all cluster worker nodes (round-robin on an equal basis by default)
          - All worker nodes are essentially the same, and no additional re-configuration is necessary to add a new node
          - This fully achieves horizontal scaling of nagios checks
      - Easy to Use in a Cloud Environment
          - As nodes are all the same, an existing worker node can be replicated and started with no special config
          - Adding a node lets you expand the cluster on demand
      - Efficient Integration with Nagios
          - Using NEB loaded modules achieves low-level integration with nagios, much better than NSCA and the command queue
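The round-robin distribution described on this slide can be pictured with a small Python sketch. It is not how DNX or Mod-Gearman are implemented internally (the architecture slide describes DNX workers pulling jobs from the server); the class and method names below are hypothetical and only show why identical workers can be added with no other re-configuration.

```python
from collections import deque


class RoundRobinDispatcher:
    """Hand each scheduled check to the next worker in turn."""

    def __init__(self, workers=()):
        self._workers = deque(workers)

    def add_worker(self, name):
        # A new node simply joins the rotation; nothing else changes.
        self._workers.append(name)

    def remove_worker(self, name):
        self._workers.remove(name)

    def assign(self, check):
        if not self._workers:
            raise RuntimeError("no worker nodes available")
        worker = self._workers[0]
        self._workers.rotate(-1)  # move the chosen worker to the back
        return worker, check
```

With two workers, four checks alternate between them; after `add_worker`, the new node receives every third check without any change to the check definitions.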
  • 64. Disadvantages of DNX and Mod-Gearman
      - Single Instance of Nagios Server
          - The solution has no direct disadvantages; however, it only achieves horizontal scaling of nagios checks.
          - It still relies on a single central nagios server to process the results, send alerts and schedule new checks.
      - Does not achieve fault-tolerance
          - If the central nagios server dies, the entire system is out
          - The author of this presentation does have a patch to DNX that allows results to be multicast to multiple instances of a nagios server (the second one would be a stand-by that does not schedule checks and only receives results). This is experimental.
  • 67. DNX Architecture
      - DNX Server and DNX Client (Worker) daemons are multi-threaded. The client thread model is controlled by the pool settings in the configuration excerpt at the end of this slide.
      - Communication between server and client uses DNX's own UDP protocol, passing XML packets.
      - Almost all communication is from client to server. The client contacts the DNX server dispatcher port, receives a list of checks to run, runs them and returns results on the collector port
      - The DNX client can support having common checks built into the client. check_nrpe was included before, but was pulled out of the package as it required nagios source.
      Configuration excerpt:
          #poolInitial = 20
          #poolMin = 20
          #poolMax = 100
          #poolGrow = 10
          channelDispatcher = udp://10.1.1.1:12480
          channelCollector = udp://10.1.1.1:12481
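To make the dispatcher and collector settings concrete, here is a small Python sketch that reads key = value lines like the ones on the slide and splits each channel URL into protocol, host and port. It is an assumption-laden illustration: the parser is generic, not DNX's own, and it treats lines starting with # as comments, which is how the pool settings on the slide appear to be written.

```python
from urllib.parse import urlparse

# Sample text copied from the slide.
SAMPLE = """\
#poolInitial = 20
#poolMin = 20
#poolMax = 100
#poolGrow = 10
channelDispatcher = udp://10.1.1.1:12480
channelCollector = udp://10.1.1.1:12481
"""


def parse_settings(text):
    """Return active key/value pairs, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def endpoint(url):
    """Split a channel URL into (protocol, host, port)."""
    parsed = urlparse(url)
    return parsed.scheme, parsed.hostname, parsed.port
```

Parsing the sample shows that the worker needs only two addresses on the server: one port to ask for work and one to return results.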
  • 71. DNX System Internals
      [Diagrams: DNX Server System Internals; DNX Client (Worker Node) System Internals]
  • 72. Mod-Gearman
      [Diagrams: Mod-Gearman System; Nagios Checks and Mod-Gearman Queues]
  • 73. DNX vs Mod-Gearman
      DNX
      - Single package, no external dependencies. Includes all job cluster control components
          - Hard to maintain and test for non-Linux environments
      - Can use localCheckPattern in the server configuration to direct jobs, but it is not documented
      - Supports nagios-2.x with a patch and nagios-3.x as is
      - Client can be extended with nagios-specific features. Planned are: Embedded Perl, check_icmp, check_snmp, check_nrpe
      Mod-Gearman
      - Mod-Gearman is built around the Gearman project
          - Better maintained since Gearman has many uses
          - Enjoys benefits of wider testing on new releases
      - Easy to configure and direct to separate queues depending on hostgroup & servicegroup
      - Only supports nagios 3.x
      - Supports eventhandlers and not just checks!
      - Nagios-only features are hard to add at node level
  • 80. Combining Shared Database and Worker Nodes
      Nagios cluster options can be combined! DNX or Mod-Gearman with Merlin or NDO are a great fit:
      - DNX offers horizontal scaling for all checks and relieves Nagios of the need to run them
      - Merlin provides horizontal scaling and failover for Nagios itself, for infrastructure of thousands of hosts
  • 81. Ideal Fully Fault-Tolerant Nagios Cluster Architecture
      [Diagram components: Nagios Server, Backup Nagios Server, Merlin/NDO DB, Merlin/NDO DB Backup, DB Proxy, Standby DB Proxy, Nagios Web Interface Server, Backup Nagios Web Interface Server, four Worker Nodes, Performance Data (RRD) Server (like NagiosGrapher), Backup Performance Data (RRD) Server. Link labels: replication, udpecho cross-monitor, udp, udp heartbeat]
      Ideally you would have each of the above as a separate cloud server, but even those with 1000s of servers may find this hard to maintain
  • 82. Nagios Cloud Cluster with 4 hosts
      [Diagram: MAIN NAGIOS SERVER and STANDBY NAGIOS SERVER (each marked N P C D), each running Nagios Daemon, Apache, MySQL DB, Merlin, PNP w/ RRD and DNX Server; two DNX Clients; link labels: replication, cross-monitor]
      - The standby server has all checks disabled (except checking the main nagios host)
      - Cross-monitoring of the other nagios server does not use the DNX cluster
      - If the main server dies, the backup takes over and registers itself in the dynDNS server, replacing the primary.
      - DNX clients use the dynDNS address; they are restarted on server switch
  • 86. Configuration of a cloud host
          define host {
              use             wprod-server                  ; Template for all webservers
              host_name       w1
              alias           webserv1                      ; This is a second way to search
              address         w1.dynamic.cloud1.mydomain    ; Local DNS
              hostgroups      production,loadbalanced,linux_centos5,webserv
              parents         loadbalancer1,loadbalancer2
              contact_groups  admins
          }
      The best way to configure monitoring of cloud hosts with multiple instances is to have a template and define all services by hostgroups.
      Then starting a new host of the same type is just a matter of adding config like the above, but for w2, etc.
      One alternative is to add a few extra hosts to the nagios config and disable all service checks on those hosts, enabling them with a script when the server is launched.
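Adding "config like above but for w2" is easy to script. The Python sketch below renders a host definition for the n-th webserver from the values shown on the slide; the function name and the idea of deriving the alias as webservN are assumptions made for illustration, not part of the presentation.

```python
HOST_TEMPLATE = """define host {{
    use             {template}
    host_name       {name}
    alias           {alias}
    address         {name}.{domain}
    hostgroups      {hostgroups}
    parents         {parents}
    contact_groups  {contacts}
}}
"""


def render_host(index,
                template="wprod-server",
                domain="dynamic.cloud1.mydomain",
                hostgroups=("production", "loadbalanced",
                            "linux_centos5", "webserv"),
                parents=("loadbalancer1", "loadbalancer2"),
                contacts="admins"):
    """Return a Nagios host definition for webserver number `index`."""
    if index < 1:
        raise ValueError("index must be 1 or greater")
    return HOST_TEMPLATE.format(
        template=template,
        name=f"w{index}",
        alias=f"webserv{index}",
        domain=domain,
        hostgroups=",".join(hostgroups),
        parents=",".join(parents),
        contacts=contacts,
    )
```

Because all services are attached through the hostgroups, writing this one block to a config file and reloading nagios is all a new instance needs; no per-service definitions are generated.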
  • 87. Auto-Scaling
      - Event handlers or a custom check can be used.
      - Trigger based on the total number of open http sockets (check_netstat, check_apache_status) from all servers
      - Write a custom script that keeps the number of currently active servers in a DB or local file, to set the name of the new server.
      - Pass the new server name as a parameter when launching the cloud instance. Write startup scripts that use this to set the hostname and register the IP in the local dynamic DNS server.
      - For Amazon EC2, the aws utility is very useful to automate launching of new servers. Get it at http://timkay.com/aws/
      - An extra nagios worker node is launched similarly, triggered when enough servers have been launched. This can also be done based on nagios stats (check_nagios)
      - Scale down after an hour or more of low resource usage; you can do it with a check that relies on RRD data
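The scale-down rule on this slide can be sketched in Python. The slide suggests reading the history from RRD data; to stay self-contained, this sketch assumes the samples have already been fetched into a list of (timestamp, value) pairs, and the function name and parameters are hypothetical.

```python
def should_scale_down(samples, now, low_threshold, window=3600):
    """Return True only if usage stayed below `low_threshold` for the
    whole `window` (in seconds) ending at `now`.

    `samples` is a list of (unix_timestamp, value) pairs.
    """
    if not samples:
        return False
    # Require history reaching back to the start of the window, so a
    # freshly launched server is not removed on its first quiet sample.
    oldest = min(t for t, _ in samples)
    if oldest > now - window:
        return False
    recent = [v for t, v in samples if now - window <= t <= now]
    if not recent:
        return False
    return all(v < low_threshold for v in recent)
```

A custom check built on this would report the decision, and an event handler would take one instance out of rotation; a single spike inside the window keeps the server running.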
  • 94. Use of SQL DB for Auto-Scaling
      This is for illustration of logic only. Not real code.

          CREATE TABLE ServerData (
            id bigint(10) unsigned NOT NULL,
            name varchar(50) default NULL,
            connections bigint(20) unsigned default 0,
            started_on datetime default NULL,
            PRIMARY KEY(id));

      After you get the results of a server check (for example from an event handler that runs):

          UPDATE ServerData SET connections=<data from nagios check> WHERE name=<server host>

      Custom check to see if a new server should be started:

          $count=sqlexec("SELECT COUNT(id) FROM ServerData")
          $sumit=sqlexec("SELECT SUM(connections) FROM ServerData")
          $lastlaunched=sqlexec("SELECT MAX(started_on) FROM ServerData")
          # launch only if average load is high and the last launch was over 600 seconds ago
          if $sumit/$count > $threshold && ($now-$lastlaunched)>600 {
            <figure out the name and id>
            launch_new_server_instance($newname)
            sqlexec("INSERT INTO ServerData VALUES ($newid, '$newname', 0, NOW())")
            enable_nagios_service_checks($newname)
          }
  • 95. Additional Cloud Monitoring Tips
      - Cloud servers are not entirely independent, and other servers on the same hardware server may affect yours
          - For a virtualized OS, system load checks are less useful and can show "false" spikes in load. Put larger emphasis on the 15-minute load and do more checks before alerts are sent
          - But if you control the cloud, find a way to get the cloud hardware system load. Write a check showing the physical server name
          - For load issues, rely more on the number of connections (TCP sessions) and the time to process each request. Test beforehand how many connections one server should handle
      - Remember, you can always just launch a new server
          - Do not spend too much time investigating the cause: take the server out of production first, replace it, and investigate later
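The advice to "do more checks before alerts are sent" amounts to requiring several consecutive bad readings, which nagios itself can do through a service's retry settings. The Python sketch below shows the idea in isolation; the class name and default of three readings are illustrative assumptions, and the values fed in are assumed to be 15-minute load averages.

```python
class LoadAlerter:
    """Raise an alert only after several consecutive high readings."""

    def __init__(self, threshold, required_failures=3):
        self.threshold = threshold
        self.required_failures = required_failures
        self.failures = 0

    def observe(self, load15):
        """Record one 15-minute load reading; return True if an alert
        should be sent now."""
        if load15 > self.threshold:
            self.failures += 1
        else:
            # Any healthy reading resets the count, so a short spike
            # caused by a noisy neighbour never reaches the alert stage.
            self.failures = 0
        return self.failures >= self.required_failures
```

Feeding it the readings 5, 5, 1, 5, 5, 5 with a threshold of 4 produces an alert only on the last one, since the healthy third reading resets the counter.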
  • 98. Nagios Cluster Software
      - Nagios, NDO-Utils, NSCA - http://www.nagios.org/
      - DNX (Distributed Nagios eXecutor) - http://dnx.sourceforge.net/
      - Mod-Gearman - http://labs.consol.de/lang/de/nagios/mod-gearman/
      - Gearman - http://gearman.org/
      - Merlin (Module for Effortless Redundancy and Loadbalancing by OP5) - http://www.op5.org/community/plugin-inventory/op5-projects/merlin
      - Check-Multisite (collect data from multiple servers) - http://www.my-plugin.de/check_multi/
      - Ganglia (open-source computing cluster monitoring, can be integrated with nagios) - http://www.ganglia.info
  • 106. Demo & Questions
      Questions?
  • 107. More Questions? Feedback?
      William Leibzon < [email_address] >
      My Nagios Page (mostly plugins): http://william.leibzon.org/nagios/
