Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Dave Williams's presentation on Multi-Tenant Nagios Monitoring, given during the Nagios World Conference North America held Oct 13th - Oct 16th, 2014 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/conference
Published on: Mar 3, 2016
Source: www.slideshare.net


Transcripts - Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

  • 1. Multi-Tenant Nagios Monitoring
      October 14th 2014
      Dave Williams, Technical Architect
      © Bull, 2014
  • 2. Agenda
      Background
      Multi-Tenant Monitoring
      Why Multi-Tenant
      Multi-Tenant Design
      Service Catalogue
      Futures & ‘Blue Sky thinking’
      Questions
  • 3. Background
      UK based
      Mainframe (IBM & Honeywell)
      Unix (HP-UX, AIX, Solaris)
      Linux (Red Hat, SLES, Debian)
      Network (CASE, 3Com, Cisco)
      Working for Bull
      French computer manufacturer
      Mainframes, Unix, HPC, Security, Managed Services, Advisory Services
  • 4. Background
      System Monitoring: OpenView, NetView, Open Master
      Open Source Monitoring: NetSaint on AIX, Nagios
  • 5. Why Multi-Tenant?
      Outsourcing Support & Monitoring
      Multiple Customers
        – Different levels of security
        – Different hardware / software platforms
      One Support Team
        – Only need to know about real problems
        – Can be driven by support ticket, not Nagios
      Required 365 x 24
        – Infrastructure must survive all outages without loss of service
  • 6. Multi-Tenant Design
      Each customer may have 2-3000 hosts
      10-100 services per host
      Real time monitoring
      Customer profile
      SLA reporting
      Batch event completion
      Different SLAs for each Business Process per customer
      Different alerting & escalation methods per customer
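A rough sizing sketch for the figures on slide 6. It reads "2-3000 hosts" as 2,000 to 3,000 hosts and assumes a 5-minute check interval; the slides state neither, so both are assumptions and the result is only an order-of-magnitude guide.

```python
def checks_per_second(hosts, services_per_host, interval_seconds):
    """Average active checks per second if every service is checked once per interval."""
    return hosts * services_per_host / interval_seconds


# Assumed 5-minute (300 s) check interval; not taken from the slides.
ASSUMED_INTERVAL_SECONDS = 300

low = checks_per_second(2000, 10, ASSUMED_INTERVAL_SECONDS)
high = checks_per_second(3000, 100, ASSUMED_INTERVAL_SECONDS)
print(f"{low:.1f} to {high:.1f} checks/s per customer")
```

Under these assumptions the load per customer spans more than an order of magnitude, which is one reason to size the central platform per customer profile.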
  • 7. Multi-Tenant Design: Hardware Platform – Central Support
      Virtualised Platform (Intel based)
        – XenServer Hypervisor
            ▪ Allows clustering with shared storage
            ▪ Inexpensive licensing
      Shared Storage
        – NAS
            ▪ Using QNAP appliances with underlying RAID-5 & hot spare protection
            ▪ Network connection using dual interfaces bonded across multiple switches
            ▪ Could have used FreeNAS
      LAN Infrastructure
        – Dual connections to all hardware
        – SNMP managed switches
  • 8. Hardware Platform – Basic Schematic
  • 9. Multi-Tenant Design: Hardware Platform – Resilience
      Virtualised Platform (Intel based)
        – XenServer Hypervisor
            ▪ Allows clustering with shared storage
            ▪ If primary node fails, cluster will ‘spin up’ image on 2nd node
            ▪ Same data / logs (shared storage)
      LAN Infrastructure
        – Dual connections to all hardware
            ▪ Bonded interfaces for NAS access – no data loss / access loss with failure
            ▪ SNMP managed switches
  • 10. Hardware Setup
  • 11. Multi-Tenant Design: Hardware Platform – Recovery
      Virtualised Platform (Intel based)
        – XenServer Hypervisor
            ▪ Allows clustering with shared storage
            ▪ If primary site fails, will spin up image
            ▪ Internet access fails over – using BGP
      Shared Storage – replicated from primary site
        – NAS
            ▪ Using QNAP appliances with underlying RAID-5 & hot spare protection
            ▪ Using RTRR (Real Time Remote Replication) between sites
            ▪ Network connection using dual interfaces bonded across multiple switches
      LAN Infrastructure
        – Dual connections to all hardware
            ▪ Bonded interfaces for NAS access – no data loss / access loss with failure
            ▪ SNMP managed switches
  • 12. Hardware Platform – Resilience
  • 13. Hardware Platform – Customer Site
      Using generic netbooks
      Minimum requirement
        – 1 GB memory, Atom processor, Ethernet port
        – Running CentOS 6.4 64-bit operating system
      Can use Raspberry Pi for small customers
        – 512 MB memory, ARM processor, Ethernet port
        – Running Raspbian operating system
  • 14. Software Platform – Central Site
      Nagios Core
        – Running latest 4.0.8
        – Using MK Livestatus for interfacing
        – Using Thruk for visualisation
      Graylog2 / Elasticsearch
        – Store all logs & syslog in ‘Big Data’ repository using MongoDB
      Asterisk PBX
        – Allow all alerting to use standard dial-up with speech synthesis + IVR
      SMS-Client
        – Still using TAPI to SMS text contacts
  • 15. Software Platform – Central Site (contd)
      NRPE
        – Running 2.1.5
      NSCA & NSCA-ng
        – Using NSCA for external communication
        – Using NSCA-ng for issuing remote commands
      Postfix / Procmail
        – Used to generate emails but also handle responses
        – Routes unsolicited alerting emails (HP Insight, Pingdom)
      OTRS
        – Record alerts, track issues
  • 16. Software Platform – Remote Site
      Nagios Core
        – Running latest 4.0.8
      NRPE
        – Running 2.14
      NSCA
        – Using NSCA for external communication
      OpenVPN
        – Communication via IPSec VPN
  • 17. Customer Multi-Tenant
  • 18. Multi Tenant Schematic
  • 19. Service Catalogue
      ITIL flavour
      Really just services & their characteristics
  • 20. Service Catalogue
      Agreed list of servers / services
      With importance levels
      With alerting paths
      With escalation paths
      Recovery options
      Feeds into Service Level Agreements and Operational Level Agreements
      Basis of agreed reporting structures
  • 21. Examples
      Basic spreadsheet plus shell script
        – Usually easy to create; the shell script is different for each customer, based on an initial standard script
      Chef or Puppet
        – Use Exported Resources
        – Nagios Cookbook – Nagios Conference 2012 presentation
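The "spreadsheet plus script" approach on slide 21 can be sketched as follows. This is a minimal illustration in Python rather than the shell scripts the talk describes; the CSV column names, the sample rows, the `generic-host` / `generic-service` templates and the `check_nrpe` command are all assumptions, not details from the presentation.

```python
import csv
import io

# Hypothetical service-catalogue export; column names are illustrative only.
SAMPLE_CATALOGUE = """customer,location,hostname,address,service,check_command
acme,loc1,server01,192.0.2.10,Disk,check_nrpe!check_disk
acme,loc1,server01,192.0.2.10,Load,check_nrpe!check_load
"""


def nagios_name(row):
    """Prefix the real hostname so that every customer's server01 stays unique."""
    return f"{row['customer']}{row['location']}-{row['hostname']}"


def render(catalogue_csv):
    """Turn catalogue rows into Nagios host and service object definitions."""
    blocks = []
    seen_hosts = set()
    for row in csv.DictReader(io.StringIO(catalogue_csv)):
        name = nagios_name(row)
        if name not in seen_hosts:  # emit each host definition only once
            seen_hosts.add(name)
            blocks.append(
                "define host {\n"
                "    use                  generic-host\n"  # assumed template
                f"    host_name            {name}\n"
                f"    alias                {row['hostname']}\n"
                f"    address              {row['address']}\n"
                "}\n"
            )
        blocks.append(
            "define service {\n"
            "    use                  generic-service\n"  # assumed template
            f"    host_name            {name}\n"
            f"    service_description  {row['service']}\n"
            f"    check_command        {row['check_command']}\n"
            "}\n"
        )
    return "\n".join(blocks)


print(render(SAMPLE_CATALOGUE))
```

Keeping the real hostname in `alias` preserves the customer's own name alongside the prefixed Nagios name, which matters for the naming issues raised on the next slide.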
  • 22. Multi Tenant Issues
      Naming conventions
      Every customer has a server01
      Customers' naming conventions are obscure
      Customers have multiple physical locations or levels of security
        – This gives rise to Nagios names that differ from the actual names:
        – Custloc1-swfeltsw01
        – Custloc2-nwfeltsw01
      Not so smart when a non-Nagios originated alert is received
        – ‘swfeltsw01 – RAID battery backup failure’ from HP Insight, for example
        – The external alert processor has to perform table lookups before building the appropriate NSCA command, for example
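The table lookup described on slide 22 might look roughly like this. It is a sketch under stated assumptions: the lookup table, its key (alert sender plus real hostname), the sender address, the `External Alert` service name and the CRITICAL state are all hypothetical. The output line follows the tab-separated host, service, return code, plugin output format that `send_nsca` reads for passive service checks.

```python
import re

# Hypothetical lookup table: (alert sender, real hostname) -> Nagios host name.
# The sender is part of the key because the same real name can exist at several customers.
NAME_TABLE = {
    ("insight@customer1.example", "swfeltsw01"): "Custloc1-swfeltsw01",
}

# Matches subjects such as 'swfeltsw01 – RAID battery backup failure'
# (en dash or plain hyphen as the separator).
ALERT_RE = re.compile(r"^(?P<host>\S+)\s+[–-]\s+(?P<message>.+)$")


def build_nsca_line(sender, subject, service="External Alert", state=2):
    """Translate an external alert into one passive-check line for send_nsca."""
    match = ALERT_RE.match(subject.strip())
    if match is None:
        raise ValueError(f"unrecognised alert subject: {subject!r}")
    # KeyError here means the alert came from a host missing from the table.
    nagios_host = NAME_TABLE[(sender, match["host"].lower())]
    return f"{nagios_host}\t{service}\t{state}\t{match['message']}\n"


line = build_nsca_line(
    "insight@customer1.example",
    "swfeltsw01 – RAID battery backup failure",
)
print(repr(line))
```

The resulting line would typically be piped to `send_nsca` with the central server given via `-H` and the client configuration via `-c`; unknown hosts raise an error instead of being guessed, so a missing table entry is noticed.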
  • 23. Futures & Blue Sky thinking
      The Nagios visualisation is resource heavy
      All customers want their own dashboard
      All customers want a different screen layout
      Why not move the visualisation into the cloud?
      Use an Amazon EC2 image to access central Livestatus via https
      Allow end user to authenticate
      Customer portal allows ‘spin up’ & ‘spin down’ of images
        – Move billing to the customer
        – Scale horizontally for visualisation
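A per-customer dashboard such as the one proposed on slide 23 would need to restrict its Livestatus queries to that customer's hosts. The sketch below only builds the query text in Livestatus Query Language; it assumes hosts carry a customer/location prefix such as `Custloc1-` (as on slide 22), and it leaves out the transport, since the slides do not say how the HTTPS front end would be implemented.

```python
import re


def build_problem_query(customer_prefix):
    """Build a Livestatus query listing non-OK services for one customer's hosts."""
    # Accept only simple prefixes so nothing unexpected ends up inside the regex filter.
    if not re.fullmatch(r"[A-Za-z0-9-]+", customer_prefix):
        raise ValueError(f"invalid customer prefix: {customer_prefix!r}")
    lines = [
        "GET services",
        "Columns: host_name description state plugin_output",
        f"Filter: host_name ~ ^{customer_prefix}-",  # regex match on the prefixed name
        "Filter: state != 0",                        # successive filters are ANDed
        "OutputFormat: json",
    ]
    # A Livestatus query is terminated by an empty line.
    return "\n".join(lines) + "\n\n"


query = build_problem_query("Custloc1")
print(query)
```

Filtering on the server side keeps each tenant's view limited to its own hosts, whatever screen layout the dashboard then applies.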
  • 24. Load Sharing
      Using plugins like check_wmi_plus puts a strain on the monitoring system: a large number of queries that take wall-clock time to complete and parse.
      Better to have ‘worker nodes’ via Merlin, Mod-Gearman or similar to perform these functions – Raspberry Pi, for example.
      No great expense to add 2-3 Pis to customer site configurations; easy fall back if they fail – no unique locally stored data
  • 25. BPI Example
  • 26. Dashboard Example
  • 27. Questions?
