Monitoring Best Practices
Enterprise Monitoring Services provides tools and expertise for monitoring IT services. Monitoring is a critical component of providing reliable IT services. Every service owner and team should follow these best practices to monitor their services:
Planning
- Identify staff resources to regularly monitor alerts and reports generated from monitoring tools
- Update staff job descriptions to include monitoring responsibilities (e.g., "Ensure services and systems are reliably monitored for security and performance on a regular basis").
- Include monitoring goals in staff performance reviews to improve performance, availability and reachability monitoring (e.g., reviewing alerts, setting up a dashboard, or creating a report).
- Develop a weekly or monthly availability, reliability or performance report for services.
- Review all alert thresholds and ensure services are properly monitored.
- Learn how to leverage monitoring tools on a regular basis for troubleshooting and performance monitoring.
- Test monitoring alerts at least twice per year.
- Set up security detections and device groups for services in ExtraHop.
- Work with your vendors to ensure you are following best practices for monitoring your services.
- Work with your clients to review and agree on the critical components that require monitoring for reliability, reachability, performance and availability.
- Ensure agreements for any managed services (either in the cloud or on campus) include requirements to monitor those services for performance, reliability, reachability and availability. Verify that alerts for issues can be delivered to Princeton staff.
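For the weekly or monthly availability report mentioned above, one common way to express availability is the share of the reporting period during which the service was up. The following is a minimal sketch of that arithmetic, not a description of how any particular monitoring tool calculates it. The function name and the sample outage durations are hypothetical.

```python
from datetime import timedelta


def availability_percent(period: timedelta, outages: list[timedelta]) -> float:
    """Return the percentage of the reporting period the service was available."""
    if period <= timedelta(0):
        raise ValueError("reporting period must be positive")
    downtime = sum(outages, timedelta(0))
    # Cap downtime at the period length so over-reported outages
    # cannot produce a negative availability figure.
    downtime = min(downtime, period)
    return 100.0 * (1 - downtime / period)


# Hypothetical example: a 30-day month with two outages totaling 43.2 minutes.
month = timedelta(days=30)
outages = [timedelta(minutes=30), timedelta(minutes=13, seconds=12)]
print(f"{availability_percent(month, outages):.2f}%")  # 99.90%
```

The outage durations would normally come from your monitoring tool's reports or from incident records. Agree with your clients on whether scheduled downtime counts against the figure.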
Design
- Identify critical components and metrics for your service and ensure those are monitored.
- Identify your most critical service alerts and configure those alerts to include SMS text messaging or other emergency notifications so that you are notified 24x7.
- Ensure you are sending your service logs to a centralized security information and event management (SIEM) system so that alerts and reports can be configured on the logs that are generated.
- Identify any dependencies on other services and ensure those services are properly monitored.
- Develop a written playbook defining procedures to handle service degradation issues, service disruption issues, or security issues for your alerts and monitoring.
Implementation
- Follow Service Management best practices, including opening incidents when monitoring tools generate alerts, documenting your service monitoring in a knowledge article, and submitting requests for any monitoring updates.
- Contact the Enterprise Monitoring Team, either by submitting a Service Portal request for "Application Monitoring" or "System Monitoring" or by requesting a meeting to review the implementation process (you can write to us at epm-list@princeton.edu).
- Enable and leverage any built-in monitoring that is inherent to the service or application.
- Subscribe to any vendor status pages or vendor mailing lists to receive alerts for any cloud applications.
Operations
- Ensure staff with operational responsibilities are leveraging monitoring tools on a daily basis.
- Ensure staff with engineering or programming responsibilities are leveraging monitoring tools on a weekly or bi-weekly basis to identify any potential gaps or issues.
- Regularly test your alerts and review your reports for accuracy (especially when there are staff changes or service changes).
- Regularly tune your alerts and update your reports for any changes in your services (especially when there are staff changes or service changes).
- Review trending reports to identify future service improvements or changes needed in service delivery.
- Do not ignore alerts or allow issues to accumulate. As unresolved alerts pile up, it becomes harder to identify actual problems. Either fix the underlying problems or tune alerts to be more actionable.
- If your service, host or application has a problem, please schedule downtime, include notes confirming you are aware of the issue, and ask the Monitoring team to filter out any alerts until the issue is resolved.
- Include monitoring information in your incident management updates - time stamps, error messages, graphs, and other monitoring information are critical components of incident management.
- Use monitoring tools after any major system changes or upgrades to ensure services are running as expected. This includes reviewing log files and confirming that all critical services are operational and performing as expected.
Email Best Practices
Monitoring Alert Email Addresses
- Alert/Notification email can come from the following addresses:
- Nagios XI: nagios@princeton.edu
- ExtraHop: extrahopalerts@princeton.edu (alerts & reports) and no-reply@notify.extrahop.com (detections)
- Statseeker: statseeker@princeton.edu (formerly statseeker@statseeker.princeton.edu)
- Do not reply to or send email to these addresses. Send any questions to the Enterprise Monitoring Team.
Listserv lists
- Email Listserv lists should be utilized for receiving alerts from the monitoring systems.
- If the Listserv list is set to private, allow the following addresses by adding them to Special (located under the Access Control Tab):
- *@princeton.edu
- alerts@thousandeyes.com
- no-reply@notify.extrahop.com
- Adding *@princeton.edu also allows EPM team members to send email to alert contacts.
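To see which alert senders the entries above would cover, the sketch below applies simple shell-style wildcard matching to a few of the addresses listed earlier. This is only an illustration of how a pattern such as *@princeton.edu behaves. It is an assumption that Listserv treats the wildcard the same way, and the function name is hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns taken from the list above.
ALLOWED_PATTERNS = [
    "*@princeton.edu",
    "alerts@thousandeyes.com",
    "no-reply@notify.extrahop.com",
]


def is_allowed(sender: str, patterns=ALLOWED_PATTERNS) -> bool:
    """Return True if the sender matches any allowed pattern (case-insensitive)."""
    sender = sender.strip().lower()
    return any(fnmatchcase(sender, pattern.lower()) for pattern in patterns)


for address in [
    "nagios@princeton.edu",
    "extrahopalerts@princeton.edu",
    "no-reply@notify.extrahop.com",
    "statseeker@statseeker.princeton.edu",  # former Statseeker address
]:
    print(address, is_allowed(address))
```

Under this kind of matching, the pattern *@princeton.edu covers addresses directly at princeton.edu but not addresses at a subdomain, such as the former Statseeker address.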
Remove the 10 Minute Listserv Email Delay
- By default, listserv lists have a 10-minute delay. You can disable the delay by setting Spam-Delay to 0 (located under the Error Handling Tab).
Don't Forget to Check Spam
- Verify alert emails are not going to your Spam or Junk Email folder.
- Add the monitoring email addresses or domains to your safe senders list.
Other Alert Options
- Can alerts be sent via an SMS Text Alert?
- Yes, using an email-to-SMS address. Email is sent to an address made up of your 10-digit cell number, followed by @ and your provider's gateway domain:
- Verizon uses @vtext.com
- AT&T uses @txt.att.net
- T-Mobile uses @tmomail.net
- Can alerts be sent to Microsoft Teams?
- Alerts can be sent to Microsoft Teams in one of two ways, using the email address associated with your Teams channel or using a webhook. Contact the Enterprise Monitoring Team for more information.
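For the email-to-SMS option above, the alert address is the 10-digit cell number joined to the provider's gateway domain. The sketch below builds such an address from the three domains listed in this section. The carrier keys and the function name are hypothetical, and carriers may change or retire these gateways, so confirm the domain with your provider before relying on it.

```python
import re

# Gateway domains as listed in this section; the dictionary keys are
# illustrative labels, not official carrier identifiers.
SMS_GATEWAYS = {
    "verizon": "vtext.com",
    "att": "txt.att.net",
    "tmobile": "tmomail.net",
}


def sms_email_address(phone: str, carrier: str) -> str:
    """Build an email-to-SMS address from a 10-digit cell number and a carrier key."""
    digits = re.sub(r"\D", "", phone)  # drop spaces, dashes, parentheses
    if len(digits) != 10:
        raise ValueError("expected a 10-digit cell number")
    try:
        domain = SMS_GATEWAYS[carrier.lower()]
    except KeyError:
        raise ValueError(f"unknown carrier: {carrier!r}") from None
    return f"{digits}@{domain}"


# Hypothetical number for illustration only.
print(sms_email_address("(609) 555-0100", "verizon"))  # 6095550100@vtext.com
```

The resulting address can be added as a contact for your most critical alerts, or subscribed to the Listserv list that receives them.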
Please contact the Enterprise Monitoring team in OIT to ensure your services are monitored. You can open a ServiceNow request for monitoring services or email epm-list@princeton.edu for further details.