Categories
ZSuite

CEM & BCM: what’s the difference and which do you need?

Sivaraman Subramaniam

As a Resilience Management Professional, you may be familiar with the concepts of Critical Event Management (CEM) and Business Continuity Management (BCM). While these two areas of focus may seem similar, they are actually quite distinct and have different objectives. In this blog post, we will explore the key differences between CEM and BCM and how they complement each other to ensure the overall resilience and continuity of an organization.

Critical Event Management

CEM is a proactive approach to identify, assess, and respond to potential crises and emergencies that could significantly impact an organization. The goal of CEM is to minimize the negative effects of a critical incident and to ensure a timely and effective response.

CEM typically involves developing incident response plans, conducting risk assessments, and training employees on how to respond to critical events.

Business Continuity Management

BCM, on the other hand, is a proactive approach to ensure the continuation of essential business functions during and after a disruptive incident. The goal of BCM is to minimize the impact of the incident on the organization and to restore normal operations as soon as possible. BCM typically involves developing continuity plans, identifying critical functions and dependencies, and testing and exercising the plans.

Key Differences

CEM and BCM are different in their scope and objectives. CEM is focused on managing and responding to a specific incident, while BCM is focused on ensuring the continuity of essential business functions. CEM is typically activated during an incident, while BCM is always active, providing an ongoing process.

CEM is more incident-driven and focuses on immediate response, rescue and recovery, while BCM is more process-driven and focuses on continuity of operations and recovery. CEM typically deals with short-term issues and disruptions, while BCM deals with longer-term disruptions and the resilience of the organization.

The Importance of Coordination

While CEM and BCM are distinct, they are closely related and should be coordinated to ensure an overall effective response and continuity of operations. Having a clear incident response plan and continuity plan in place, along with regular testing and exercising, can help an organization to quickly respond and recover from a critical event.

CEM and BCM are complementary and the objectives of each are designed to work together to ensure the overall resilience and continuity of an organization. This means that critical incident management is an integral part of overall business continuity management, and this two-pronged approach provides a comprehensive, holistic approach to ensuring the organization can continue operating in the face of disruptions.

Benefits of critical event management

Critical event management (CEM) is a strategy for dealing with unexpected and potentially disruptive events that can have a significant impact on an organization. Some benefits of CEM include:

Faster response times: By having a pre-planned and well-rehearsed response to critical events, organizations can respond more quickly and effectively to minimize the impact of the event.

Improved communication: CEM helps to ensure that all relevant parties are informed and involved in the response to a critical event, improving coordination and reducing confusion.

Automated alerts: Automated alerts and seamless engagement with responders speed up resolution and reduce impact.

Reduced impact: By identifying and mitigating potential risks, CEM can help to minimize the impact of a critical event on an organization.

Seamless data integration: Integrating with multiple databases and syncing contacts through custom integrations helps build a complete operating picture of the organization.

Compliance: Deep analytics, audit logs, and consolidated reporting give management real-time updates, since CEM is operational in nature and enabled by AI platforms.

Cost savings: Effective CEM can help to minimize the financial impact of a critical event on an organization.

Reputation management: By demonstrating a capability to effectively respond to critical events, organizations can improve their reputation and maintain customer trust.

Conclusion

In conclusion, while CEM and BCM share some similarities, they are distinct in their focus, objectives, and implementation. CEM deals with immediate response, rescue, and recovery, while BCM deals with ensuring continuity of operations and resilience of the organization. Having both in place along with regular testing and exercising, will help organizations minimize the impact of critical events and restore normal operations as soon as possible.

Categories
White papers

IT Event Management

 

Sivaraman Subramaniam

Before we go much deeper into this, let’s understand what an IT Event means. An IT event is simply a state of change to an IT service. The goal of IT event management is to detect and log these changes in order to gain full visibility into the IT service. For example, a user login, information about a recent deployment, or the completion of a server maintenance are changes that technical teams need to keep track of.

While such changes don’t inherently imply service degradation, they can indicate simmering issues that may affect customers. Thus, events must be collected, prioritized, and acted upon as necessary. In modern IT shops, each IT service is monitored by a combination of monitoring tools.
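As a rough sketch of the collect-and-prioritize step, the following Python models an event as a timestamped state change and sorts a batch so the most severe events surface first. The `ITEvent` class, its field names, and the three-level severity scale are illustrative assumptions, not part of any specific monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical severity scale: a lower number means more urgent.
SEVERITY_ORDER = {"critical": 0, "warning": 1, "info": 2}


@dataclass(frozen=True)
class ITEvent:
    service: str         # IT service whose state changed
    description: str     # what changed
    severity: str        # one of the keys in SEVERITY_ORDER
    timestamp: datetime  # when the change was detected


def prioritize(events):
    """Return events sorted by severity first, then oldest first."""
    return sorted(events, key=lambda e: (SEVERITY_ORDER[e.severity], e.timestamp))


events = [
    ITEvent("auth", "user login", "info",
            datetime(2023, 1, 1, 9, 0, tzinfo=timezone.utc)),
    ITEvent("db", "server maintenance completed", "info",
            datetime(2023, 1, 1, 8, 0, tzinfo=timezone.utc)),
    ITEvent("web", "error rate rising after deployment", "warning",
            datetime(2023, 1, 1, 9, 5, tzinfo=timezone.utc)),
]
ordered = prioritize(events)
```

Routine events such as logins stay in the log, while the warning tied to the recent deployment moves to the front of the queue for review.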


Shift focus to CEM platform ROI for clients

 

How will your clients get funding and prioritization for their Critical Event Management platform and enterprise resilience programs once the acute crisis ends? It’s a question well worth asking. And the answer won’t surprise you: get your clients thinking about the ROI of their Critical Event Management programs.

 

How the ROI of Critical Event Management depends on the top business risks your clients face

The easiest way to do so is to subtract the estimated cost of the program (including associated tools and resources) from the projected revenue loss the business would risk if a disruptive event occurred without proper business continuity and resilience safeguards.
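To make the arithmetic concrete, here is a minimal Python sketch of that subtraction. The function names and dollar figures are illustrative assumptions, and the ROI ratio (net benefit divided by program cost) is a common convention rather than something prescribed in this post.

```python
def cem_net_benefit(projected_loss: float, program_cost: float) -> float:
    """Projected revenue loss avoided, minus what the program costs."""
    return projected_loss - program_cost


def cem_roi(projected_loss: float, program_cost: float) -> float:
    """Net benefit expressed as a multiple of the program cost."""
    if program_cost <= 0:
        raise ValueError("program_cost must be positive")
    return cem_net_benefit(projected_loss, program_cost) / program_cost


# Hypothetical figures: $2M of revenue at risk, $500K program cost.
net = cem_net_benefit(2_000_000, 500_000)
roi = cem_roi(2_000_000, 500_000)
```

With these assumed inputs the program returns three dollars of avoided loss for every dollar spent. Real inputs should come from the client's own risk assessment.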

But not just any disruptive event. Clients must determine which risks pertain to their business.

Those risks, likely to change over time, will be based on factors such as geography, industry, political and regulatory climate, customer base, etc.

Generic risk indicators are a good place to start. For instance, the Allianz Risk Barometer 2022 lists the following top global business risks:

 

  • Cyber incidents
  • Business interruption, including supply-chain disruption
  • Natural catastrophes
  • Pandemic outbreak
  • Changes in legislation and regulation
  • Climate change
  • Fire, explosion
  • Market developments
  • Shortage of skilled workforce
  • Macroeconomic developments, e.g., monetary policies, austerity programs, commodity price increase, deflation, inflation

 

How the cost of unplanned downtime factors into the ROI of Critical Event Management

Of course, executives don’t just want to know what’s likely to happen. They’ll demand to know how much the unplanned downtime from that disruption is likely to cost, as well.

 

Here, costs are likely to vary by industry. Unplanned interruption in heavy industry, for instance, entails higher machine costs than in education.

Cross-industry estimates, though, will get clients thinking. Practitioners can feed data, such as the following, into ROI calculations.

 

 

  • Server downtime. The hourly cost of server downtime tops $1 million for 44 per cent of enterprises.
  • Data breach. In 2021, the cost of a data breach was $4.24 million, representing a 10 per cent jump in two years. Lost business (including increased customer turnover, lost revenue due to system downtime and the increasing cost of acquiring new business due to diminished reputation) constituted 38 per cent of the total, or $1.59 million.
  • Among other business interruption incidents, Allianz estimates that:
    • The average value of a fire/explosion-related insurance claim comes in around $6.7 million.
    • The average value of a storm-related insurance claim comes in around $4.4 million.
    • The average value of an earthquake-related insurance claim comes in around $1.6 million.
    • The average value of a machinery breakdown-related insurance claim comes in around $0.62 million.
    • The average value of a water damage-related insurance claim comes in around $0.55 million.

Executives will ask for more precise data, requiring forensic analysis. But trend lines are important, too.

And those lines point to the fact that the cost of unplanned interruption is going up – precipitously.

Investing in Critical Event Management platform strategies will lower costs to the business, and can even put money back into the business when business continuity interventions identify expensive deficiencies before disruptions occur.

The only way to do so, though, is for clients to get the right digital Critical Event Management software.

Indeed, not all platforms enhance ROI. Clients, with your help, will have to do due diligence to scout out the platforms that automate key Critical Event Management functions, to make business resilience and management easy.

 


Why collaboration matters during major incidents, and the evolution of ChatOps

 

Collaboration is the most fundamental and critical element of any incident response team. Information silos between responders during a major incident increase the overall incident resolution time. Since the pandemic, collaboration between remote teams has become a must, because responders can’t simply walk over to one another and work together. Remote responders need an easier way to collaborate on each incident, and ChatOps is the solution. ChatOps is designed to eliminate the information silos that hinder inter-team or interdepartmental collaboration and proactive decision-making.

 

Chat Operations (ChatOps) is the use of real-time chat tools to facilitate software development and operations. Also known as “conversation-driven collaboration” or “conversation-driven DevOps,” ChatOps is designed for fast and simple instant messaging between incident response team members.

 

ChatOps offers a collaboration model that connects people, tools, process, and automation into a transparent workflow per incident. This flow connects the work needed, the work happening, and the work done in a persistent location staffed by the people, bots, and related tools. The transparency tightens the feedback loop, improves information sharing, and enhances team collaboration, not to mention team culture and cross-training.

 

What are the benefits of ChatOps?

  • Collaboration: Removes silos and communication barriers between teams and departments.
  • Engagement: Builds and sustains distributed team culture to align communication and decision-making.
  • Productivity: Enhances business processes via real-time information provision.
  • Security and compliance: Provides current and historical task documentation to enhance safety and regulation.
  • Transparency: Aligns communication and documentation of project statuses.

Individually and collectively, these benefits ultimately strengthen DevOps and ITOps by speeding up team communications, which shortens development pipelines and incident response time.

 

How to deploy a ChatOps environment

Deploying a ChatOps environment requires using the following tool types:

  • Notification system to send alerts to chat rooms when incidents occur.
  • Chat client (e.g., Slack and Microsoft Teams) to execute pre-programmed commands.
  • Incident management software with built-in ChatOps tools (integrated into the ChatOps environment) for improving ticket tracking and automating incident remediation workflows.
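As an illustration of the notification piece, the sketch below builds a chat alert and posts it to a Slack-style incoming webhook, which accepts a JSON body with a `text` field. The helper names, the message wording, and the incident fields are assumptions made for this example, and the webhook URL must come from your own chat workspace.

```python
import json
import urllib.request


def build_alert_payload(incident_id: str, service: str, priority: str) -> dict:
    """Build the JSON body for a Slack-style incoming webhook."""
    return {"text": f"[{priority}] Incident {incident_id}: {service} is degraded"}


def send_alert(webhook_url: str, payload: dict) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Hypothetical incident details; send_alert is not called here because it
# needs a real webhook URL.
payload = build_alert_payload("INC-1042", "checkout-api", "P1")
```

Keeping payload construction separate from delivery makes the message format easy to test without touching the network.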

 

Beneficial use cases for ChatOps

Depending on the phase of implementation you’re in and your intentions for creating a ChatOps environment, the following use cases are applicable to your enterprise:

  • Access control and security: Many companies implement ChatOps for its agile engagement features. The intricate communication architecture enhances project access control, which enables long-term chat security operations (ChatSecOps).
  • Application deployment: ChatOps improves the visibility of application development pipelines amongst DevOps teams. This helps them collectively consider deployment options and orchestrate their selections accordingly.
  • Incident management: When it comes to incident detection, response and resolution, ChatOps is an invaluable tool. Throughout the resolution process, it keeps teams informed and automatically updates tickets throughout their remediation workflow.
  • Continuous delivery (CD): ChatOps integrates DevOps, ITOps and automation processes into a singular workflow. It bridges team communication, pipeline development and operational tasks for the continuous delivery of apps.

 

Zapoj IT Event Management comes with built-in ChatOps and a war room for each incident to enhance interdepartmental collaboration and proactive decision-making.

 


How IT Alerting Improves Overall IT Incident Response

Incident response is an organized approach to addressing and managing the aftermath of an IT Service disruption, also known as an IT incident. The goal is to handle the situation in a way that limits damage and reduces recovery time and business losses — and prevents it from happening again. Incident response generally includes an outline of processes that need to be executed upon in the event of an IT incident.

Ideally, incident response activities within your company or organization are built up over time and get better with each incident. Often, the knowledge of how to conduct thorough incident response is lost when a team member leaves, making it ever more crucial to have a documented process. Incident response processes are also often fragmented and require significant manual work to align the right technical responders and business stakeholders. A delay in notifying the incident responders, a delay in acknowledging the incident, or notifying the wrong IT expert increases the overall mean time to resolution (MTTR). But this doesn’t have to be the case.

 

Let’s understand how IT alerting can help companies mobilize and automate a coordinated incident response.

What is IT Alerting?

IT alerting automates the manual process of identifying the right IT team or support personnel and contacting them on the right communication channels. In a sense, it streamlines the way IT notifies and communicates during major IT incidents to resolve issues faster and minimize the overall impact on the business. It provides consistent messages to the right IT teams and keeps all stakeholders and impacted customers informed on resolution progress.

What are the key features of IT Alerting?

Under the hood, IT alerting leverages IT service ownership details, such as which IT team owns a specific IT service, the team’s on-call schedules, and how to notify responders according to their preferences and the incident priority.

 

On-call schedules:

Whenever a new incident is launched, IT alerting identifies the right IT team in real time based on the IT service impacted, picks the correct shift based on the date and time of the incident, and selects the right people based on the type of incident, the time of day, and the skill set required for that incident.
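The lookup described above can be sketched in a few lines of Python. Everything here is a simplified assumption for illustration: the `Shift` structure, the service-to-team map, the responder names, and the daily hour-based shifts are hypothetical, and a real scheduler would also handle dates, time zones, and overrides.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Shift:
    responder: str
    start_hour: int    # inclusive, 0-23
    end_hour: int      # exclusive; a smaller value means the shift wraps past midnight
    skills: frozenset


# Hypothetical ownership and schedule data.
SERVICE_OWNERS = {"payments-api": "payments-team"}
SCHEDULES = {
    "payments-team": [
        Shift("asha", 8, 20, frozenset({"database", "network"})),
        Shift("ben", 20, 8, frozenset({"database"})),
    ]
}


def on_shift(shift: Shift, hour: int) -> bool:
    """True if the given hour falls inside the shift, including overnight shifts."""
    if shift.start_hour <= shift.end_hour:
        return shift.start_hour <= hour < shift.end_hour
    return hour >= shift.start_hour or hour < shift.end_hour


def find_responders(service: str, when: datetime, skill: str) -> list:
    """Service -> owning team -> current shift -> responders with the skill."""
    team = SERVICE_OWNERS[service]
    return [s.responder for s in SCHEDULES[team]
            if on_shift(s, when.hour) and skill in s.skills]


night = find_responders("payments-api", datetime(2023, 3, 1, 23, 30), "database")
day = find_responders("payments-api", datetime(2023, 3, 1, 10, 0), "network")
```

The same incident on the same service reaches a different person depending on the time of day and the skill required, which is the point of service-based ownership.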

 

 

Multi-channel communications and escalations:

IT alerting sends consistent messages about the incident on multiple communication channels, such as voice call, SMS, mobile app push notification, and email. On each channel, the message gives the incident responder an option to acknowledge it. Whenever a responder acknowledges the message, the system stores the communication channel and the date and time of acknowledgement. This information is used to calculate the MTTR (mean time to respond).

 

If the primary responder doesn’t acknowledge the incident notifications within the threshold time, the system automatically launches escalation notifications according to the escalation rules.
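A minimal sketch of such an escalation rule follows, assuming a fixed acknowledgement threshold per level and an ordered escalation chain. The function name, the chain, and the five-minute threshold are hypothetical choices for the example, not the behavior of any particular product.

```python
from datetime import datetime, timedelta


def escalation_target(chain, started_at, now, threshold, acked_at=None):
    """Return who should be notified at `now`, or None once acknowledged.

    Each elapsed `threshold` without acknowledgement moves one step up the
    chain; the last entry keeps being notified once the chain is exhausted.
    """
    if acked_at is not None and acked_at <= now:
        return None
    level = int((now - started_at) / threshold)  # whole thresholds elapsed
    return chain[min(level, len(chain) - 1)]


chain = ["primary", "secondary", "manager"]   # hypothetical escalation chain
start = datetime(2023, 1, 1, 12, 0)
threshold = timedelta(minutes=5)              # assumed acknowledgement window

first = escalation_target(chain, start, start + timedelta(minutes=3), threshold)
second = escalation_target(chain, start, start + timedelta(minutes=7), threshold)
last = escalation_target(chain, start, start + timedelta(minutes=30), threshold)
done = escalation_target(chain, start, start + timedelta(minutes=7), threshold,
                         acked_at=start + timedelta(minutes=6))
```

Because the target is derived from timestamps rather than stored state, the rule can be re-evaluated on every notification tick.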

 

Best practices for successful IT alerting

  • On-call planning: Make sure each team’s on-call schedule is planned for the next few quarters. Also ensure each team has conducted on-call readiness checks and found no gaps.
  • Always provide more than one communication channel for each responder. Phone calls are an important channel for high-priority incidents.
  • Ensure each shift is planned with multiple responders, so that several responders can be notified at once to orchestrate a real-time, cross-functional response.
  • Encourage responders to acknowledge notifications, so that the recorded response activity helps reveal gaps in response SLAs.

 

Advantages of IT Alerting as part of Automated Incident Response

 

  1. Automatically identify who should respond for immediate response.
  2. Automatically send multi-channel alerts until acknowledged.
  3. Automatically escalate alerts until acknowledgement.
  4. Self-service calendar and notification management for best efficiency.
  5. Record response metrics to understand and improve overall MTTR.

 


Prioritize incidents with an incident priority matrix

 

 

Before we go deeper into the rest of this post, let’s understand what incident priority is.

“Incident priority is defined as the intersection of the impact and urgency of an incident. When you consider the impact and urgency of a situation, you can easily assign a priority and allocate adequate resources. You start by calculating the impact and the urgency, and then assign the incident a priority value.”

At the same time, it’s important to remember that priority is relative. It defines the actions you will take in a particular situation. However, the actions are not set in stone and will change with the situation and context. It isn’t about an objective priority level, but what’s the highest priority among your options.

Factors determining the Impact of an Incident 

 

  • Number of users or customers impacted
  • Loss of revenue or cost incurred in incident resolution
  • Number of IT services involved 

 

Now let’s understand the priority matrix and how it helps determine incident priority.

 

 

What is an Incident Priority Matrix 

 

Definition :

An incident priority matrix is a method of prioritizing incidents based on their impact and urgency.  “Impact” is a measure of the extent of an incident and the potential damage it can cause. “Urgency” is a measure of how quickly a resolution is required.

Effective incident management relies on the ability to focus on impact rather than the order in which issues arose. In defining urgency, it’s important to create a hierarchy for handling issues that reflect your business demands—such as restoring customer service as quickly as possible before handling other problems.

In general, many IT departments use the following as guidelines for categorizing incident urgency:

High urgency

  • Mission critical for daily operations
  • Extremely time sensitive
  • Propagation rate rapidly expanding in scope
  • Visibility to business stakeholders or C-suite

Low urgency

  • Optional services (i.e., “nice to have, but not essential”)
  • Issue affects only a small section of the IT environment—not expanding
  • Low visibility in terms of affecting the business

 

 

Priority Matrix 

 

Category: Description

High : A significant incident that has a broad impact. You should repair the problem as soon as possible to minimize downtime costs, keep customers happy, and maintain your company’s good reputation

Medium : A medium-level incident that may not directly cause lost revenue but may escalate without swift action

Low : A low-level incident that has almost no chance of reducing revenue. Customer experience may be degraded, but not enough to make them switch to a competitor

 

Priority Levels and SLA’s

 

The first step toward ensuring an effective incident management response is to properly define and implement standardized incident priority matrix levels.

The best way to map out incident priority is in an incident management matrix. In the matrix, we map out various incidents according to their impact and urgency, and a priority class is automatically assigned. If both urgency and impact are low, then the incident is assigned a low priority (P4 or P5). If both values are high, then it’s a high-priority incident (P1 or P2), and if the values lie somewhere in the middle, then it’s a medium-priority (P3) incident.

 

Priority Level    Description    SLA Response Time
P1                Critical       Immediate
P2                High           10 mins
P3                Medium         1 hr
P4                Low            24 hrs
P5                Very Low       Up to 7 days
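The matrix and the SLA table can be combined in a short Python sketch. The three-level scale for impact and urgency and the additive scoring rule are assumptions chosen so that high/high gives P1, the middle gives P3, and low/low gives P5; your own matrix may assign the cells differently. The SLA values are taken from the table above.

```python
from datetime import timedelta

# Assumed three-level scale for both impact and urgency.
LEVELS = {"high": 1, "medium": 2, "low": 3}

# SLA response times per priority level, as listed in the table.
SLA_RESPONSE = {
    "P1": timedelta(0),            # immediate
    "P2": timedelta(minutes=10),
    "P3": timedelta(hours=1),
    "P4": timedelta(hours=24),
    "P5": timedelta(days=7),       # up to 7 days
}


def priority(impact: str, urgency: str) -> str:
    """Map an impact/urgency pair to P1..P5 with a simple additive rule."""
    return f"P{LEVELS[impact] + LEVELS[urgency] - 1}"


def response_deadline(impact: str, urgency: str) -> timedelta:
    """Look up the SLA response time for the resulting priority."""
    return SLA_RESPONSE[priority(impact, urgency)]
```

For example, an incident with high impact but medium urgency lands on P2 under this rule and inherits the 10-minute response target.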

 

Zapoj IT Event Management is committed to empowering your team with the actionable real-time data it needs to act successfully—following the guidance you provide in your defined incident priority matrix to reduce alert fatigue, downtime, and incident impact.


Fundamentals of IT Incident Management

 

Sivaraman Subramaniam

 

IT incident management is a combination of practice, process, and tools that enables IT teams to bring a failed IT service back to normal as quickly as possible after a disruption. It helps keep critical systems and applications online and available for customers. An area of IT service management (ITSM), IT incident management helps keep an organization prepared for unexpected hardware, software, and security failings, and it reduces the duration and severity of disruption from these events.

 

A focus on IT incident management processes and established best practices will minimize the duration of an IT incident, shorten the recovery time of the IT service, and help prevent future issues. Understanding the foundations of incident management and applying the fundamentals will create a polished and swift response process. More importantly, keeping services “always on” makes your customers happy.

 

Companies tend to gravitate toward different types of incident management processes. For IT incidents specifically, most successful companies adopt an established ITSM framework, such as the IT Infrastructure Library (ITIL) or COBIT, or base their process on a combination of guidelines and best practices established over time. Organizations that follow ITIL or ITSM practices may use the term major incident management instead.

 

The Importance of IT Incident Management 

 

Whenever a disruption to an IT service occurs, it in turn impacts the related business services. Many organizations report that disruption of IT services costs more than $300,000 per hour, according to Gartner. For organizations where most customer interactions happen over the web, that number can be dramatically higher.

Incidents are generally categorized as low, medium, or high priority. Low-priority incidents do not interrupt end users, who typically can complete work despite the issue. Medium-priority incidents are issues that affect end users, but the disruption is either slight or brief. High-priority incidents, however, are issues that affect large numbers of end users and prevent the proper functioning of a system.

 

Assigning the right priority, severity, impact scope, and description is as important as resolving the incident itself. Lacking proper process and training jeopardizes the entire IT incident management practice.

 

IT incident management gives DevOps, SecOps, IT operations, and MIM (major incident management) teams reliable methods to prioritize incidents, assign severity, and get to resolution faster. It offers the following functions:

 

  • Identify, log, and categorize an incident, so that the proper teams can be identified to recover the failed IT service.
  • Provide situational awareness and a common operating picture of IT services across the organization.
  • Map the relationship between affected IT services and the underlying business services.
  • Communicate clearly with customers, stakeholders, service owners, and others in the organization.
  • Collaborate effectively to solve the issue faster as a team and remove barriers to resolution.
  • Continuously improve by learning from outages and applying the lessons to improve the service and refine the process for the future.

Lifecycle of an IT Incident

 

Detect:

NOC, SOC, and MIM teams at your company can become aware of incidents in many ways. They can be alerted by DevOps monitoring tools, through customer support cases, or by observing the problem themselves. Identifying the problem and detecting the true incident is the real challenge, because all of these sources generate a huge volume of data; this is where event management helps. Once the team realizes there is an incident, its first step is to log an incident ticket with a description, severity, and priority. Metrics such as MTTI (mean time to identify) are important to track against SLAs.

 

The next step is to figure out whom to assign the incident to.

 

Rally & Respond:

This is a very important stage in the IT incident management process: identifying the right team and the right person with the appropriate skills, assigning the incident, and notifying them on the right communication channel. Most organizations take an average of 30 minutes to identify and rally the right people, either because incident alerting is manual or because siloed tools are used, with the incident logged in one tool and the right person notified via another. Manual, semi-automated, or siloed IT incident response increases MTTR (mean time to resolution).

 

An automated incident response solution like Zapoj IT Incident Management helps organizations orchestrate identifying the right on-call personnel and notifying them on the right communication channel, to resolve incidents faster and reduce mean time to resolution (MTTR).

 

Diagnose:

Once the incident is assigned, on-call personnel can begin investigating the type, cause, and possible solutions for the incident. Situational awareness of what has happened, and what is happening now, with the specific IT service involved helps teams narrow down the root cause of the problem. A common operating picture shows which upstream and downstream services are already impacted or soon will be, so that on-call personnel can ask for other teams to be notified or escalate within their own team. After an incident is diagnosed, you can determine the appropriate remediation steps.

 

Resolve:

Once the right teams join and a plan of attack has been formulated, the incident resolution phase begins. Here the incident commander determines what needs to be shared with the public, stakeholders, and customers. The ability to share relevant information with each of the involved parties improves customer satisfaction and compliance. A Critical Event Management platform like Zapoj helps the major incident manager quickly identify the impacted customers and launch mass notifications, and provides ChatOps and video conferencing features to incident responders and status pages to stakeholders.

 

Closing incidents typically involves finalizing documentation and evaluating the steps taken during response. SLA metrics to be tracked include MTBF (mean time between failures).
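As a worked example of these metrics, the sketch below computes mean time to resolution and mean time between failures from a list of incident start and end times. The incident data and the observation window are made up for illustration, and MTBF is taken here as total uptime in the window divided by the number of failures, which is one common definition.

```python
from datetime import datetime, timedelta

# Hypothetical incidents as (detected, resolved) pairs.
incidents = [
    (datetime(2023, 1, 1, 0, 0), datetime(2023, 1, 1, 1, 0)),    # 1 hour
    (datetime(2023, 1, 11, 0, 0), datetime(2023, 1, 11, 3, 0)),  # 3 hours
]
window_start = datetime(2023, 1, 1)
window_end = datetime(2023, 1, 21)


def total_downtime(incidents) -> timedelta:
    """Sum of incident durations."""
    return sum((end - start for start, end in incidents), timedelta())


def mttr(incidents) -> timedelta:
    """Mean time to resolution: average incident duration."""
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents, window_start, window_end) -> timedelta:
    """Mean time between failures: uptime in the window per failure."""
    uptime = (window_end - window_start) - total_downtime(incidents)
    return uptime / len(incidents)


mean_resolution = mttr(incidents)
mean_between_failures = mtbf(incidents, window_start, window_end)
```

A falling MTTR shows the response process is getting faster, while a rising MTBF shows the service itself is failing less often.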

 

Learn:

 

Learning from recent and historical incidents is arguably the most important step in the IT incident Management process. It’s in the aftermath that your team is able to look and see what went well or what didn’t go so well, and what you can do to prevent things from happening again. Incident post-mortems and analytics are a great way for teams to continuously learn and serve as a way to iteratively improve your infrastructure and IT incident management  process. 

 

Roles In IT Incident Management 

 

Every organization typically has its own custom roles and responsibilities. Below are some of the most common IT incident management roles:

  • End user: The client or customer who usually experiences the first sign of an outage or disruption and flags it through a customer support case.
  • Customer service desk: Typically the first point of contact when a customer case involves an IT service; initiates the IT incident management process by requesting an IT incident ticket.
  • IT incident manager: A key stakeholder in the IT incident management process who drives the entire lifecycle from response to resolution.
  • NOC service desk or IT operations: Technicians with primary knowledge of major incidents involving applications, infrastructure, and systems management.
  • IT incident responders (DevOps, SecOps, app teams, DBAs): Specialist technicians with advanced knowledge of specific areas of the company’s infrastructure and applications. These professionals are usually brought in for complex incidents, maintenance, and remediation.
  • IT incident operator: Typically moderates incident communications between response teams, customers, and key stakeholders.

 

Choosing IT Incident Management Software 

 

21st-century businesses are “always on.” Whether the business is small, large, or a multinational corporation, IT downtime can mean thousands of dollars in lost revenue, negative customer sentiment, or hundreds of lost customers. These consequences mean IT executives (CIO, CTO) must maintain and manage cloud and on-premises infrastructure, applications, APIs, and containers, all while rolling out enhancements and upgrades in near real time to meet ever-changing customer demands.

 

There are many siloed or traditional IT service management (ITSM) tools to choose from when implementing your IT incident management processes, and their features vary. To remain competitive, it’s critical that the tool you choose offers the following features.

 

Here are important features to look out for:

  1. Is the IT incident management platform always up, even when your IT is down?
  2. Does it support an IT service-based approach rather than an IT team-based approach?
  3. How easily can your teams integrate their preferred monitoring, security, and change management tools?
  4. Does it support automated incident alerting and response for your IT response teams?
  5. Does it provide real-time situational awareness of incidents and the overall health of your IT services?
  6. Does it use AI to reduce the number of incidents or to convert alerts into incidents?
  7. Does it offer important collaboration features like built-in ChatOps and conferencing?
  8. Does it offer customer and key stakeholder notifications to update status?
  9. Can you gain actionable insights from its data (IT services, incidents, teams, on-call schedules, alerts)?
  10. Finally, what does the software cost to own and operate?

 

Learn more about Zapoj IT Event Management and its automated major incident management, which covers everything from detection and resolution to learning and prevention, to support IT teams as they move towards owning their code in production.

 


The Biggest Problem With Oncall Management, And How You Can Fix It

 

With the increasing digitalization of business services, businesses worldwide depend on IT services more than ever before. An outage to an IT service can affect millions of people, with real impact: they can’t pay their bills, they can’t book their flights, they can’t contact family or friends.

 

These IT incidents not only cost businesses $700 billion per year in North America alone, but also damage the reputation of your company, your product, and your team.

More than ever, organizations need a way to instantly and accurately spin up a precise multi-team, business-wide response for major incidents and accelerate the speed of resolution, to mitigate increasingly costly impacts from unexpected disruptions. 

 

The Problem

 

The always-on, always-available expectation of digital services has increased the need for IT teams to be ready to respond around the clock. Being on call means that a person can be contacted at any time to investigate and fix issues in the system they are responsible for. This creates anxiety in IT teams about how to be ready around the clock while balancing personal life. One of the biggest challenges for teams taking on a new on-call responsibility is the reputation that on-call is highly disruptive to responders’ lives. No one wants to miss family events, holidays, or sleep. Before we look at how to solve this problem, let us cover a few fundamentals.

 

What is Oncall

On-call is the practice of designating a specific person to be available at specific dates and times to respond in the event of an IT service disruption, even outside normal business hours.

 

On-call is a critical responsibility inside many DevOps, SecOps, NOC, IT Ops, developer, and customer support teams who run services where customers expect 24/7 availability.

The Solution

Let’s see how Oncall anxiety can be solved. Creating a better on-call experience for your team requires cultural best practices and Oncall Management software. 

 

Guidelines for Planning Oncall 

 
 

To alleviate the fear of going on call, teams that haven’t been on call before can follow the guidelines below to navigate the murky waters of on-call.

 

  1. Clearly define the on-call schedule dates & times

    The on-call duration should be clearly defined. This helps prevent burnout, confusion, and frustration. We suggest documenting your incident response process and expectations for what it means to be on call.

  2. Have primary and secondary responders and responsibilities

    Life doesn’t stop just because someone is on call. Just like an unexpected personal emergency can take a developer offline during the work day, the same can happen when they’re on call. Putting a backup in place limits the potential damage from this kind of interruption. 

  3. Make sure alerts are being assigned to the right person

    Getting your alerting tooling dialed in shouldn’t be overlooked. Having a clear alerting flow and escalation process with the right notifications and overrides can avoid a lot of headaches.

  4. Fine-tune and review schedules

    Teams are not static, and neither should your on-call schedule be. We recommend a culture of continuously reviewing, adjusting, and improving your on-call practices.

  5. Make sure they have access and familiarity with all the relevant diagnostics tools

    Every team varies in the tools they use to track operational health, application performance, resource utilization, and so on. Make sure your on-call engineers are familiar with the tools used and have proper access to them.
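
As a rough illustration of guidelines 1 and 2, the sketch below builds a weekly rotation in which every week has a named primary and a secondary (backup) responder. The member names, the weekly shift length, and the `build_rotation` helper are assumptions made for this example; they are not part of any particular on-call product.

```python
from datetime import date, timedelta


def build_rotation(members, start, weeks):
    """Return one (week_start, primary, secondary) tuple per week.

    Assumes weekly shifts; the secondary is simply the next person in line.
    """
    if len(members) < 2:
        raise ValueError("need at least two members for a primary and a backup")
    schedule = []
    for week in range(weeks):
        primary = members[week % len(members)]
        secondary = members[(week + 1) % len(members)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


# Hypothetical team of three, four weeks starting on a Monday.
rotation = build_rotation(["asha", "ben", "chen"], date(2024, 1, 1), 4)
for week_start, primary, secondary in rotation:
    print(week_start.isoformat(), "primary:", primary, "secondary:", secondary)
```

Because the secondary is always the person who is primary the following week, each responder gets a lighter "warm-up" week before taking the primary shift. A real schedule would also need overrides for holidays and swaps.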

Oncall Management Software 

The following are the important features to look for in on-call management software:

 

Oncall  Planning and Notification Channel preference 

 

Your on-call management software can help you plan and manage the staffing schedules per team, the responsibilities of on-call members within a team, and which notification routes will be most effective, whether email, text message, phone calls, chat messages, or other methods. It can then help your team configure their on-call accounts with the appropriate notifications to meet their needs and response requirements.

 

Readiness Reports 

 

An “On-Call Readiness Report” helps your team get organized around notification types. The Readiness Report looks at the team members in your on-call management software and determines whether they have set up their notifications to meet certain standards, ranging from “More than email” to “Never miss a page”. Different teams may have different preferences for how their notifications are set up based on the services they support. Some organizations may set this as a top-down mandate, or it could be an individual team decision. However you set your standards, the On-Call Readiness Report is a useful tool to ensure standardization across the team.
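
To make the idea concrete, here is a minimal sketch of how such a report could classify each responder. The two level names come from the text above, but the criteria behind them (phone plus SMS for “Never miss a page”, any channel besides email for “More than email”), the “Not ready” label, and the channel names are assumptions for illustration only.

```python
def readiness_level(channels):
    """Classify one responder's notification channels (criteria are assumed)."""
    channels = set(channels)
    if {"phone", "sms"} <= channels:
        return "Never miss a page"
    if channels - {"email"}:
        return "More than email"
    return "Not ready"  # hypothetical label for email-only or no channels


# Hypothetical team and their configured notification channels.
team = {
    "asha": ["email", "sms", "phone"],
    "ben": ["email", "push"],
    "chen": ["email"],
}
report = {name: readiness_level(channels) for name, channels in team.items()}
for name, level in report.items():
    print(f"{name}: {level}")
```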

 

 Analytics 

Both response teams and management should be able to analyze the following metrics to address burnout in the response teams and improve their productivity:

 

  • Total number of Incidents received per Team 
  • Average number of Oncall Members per schedule 
  • Number of days or hours a specific person is Oncall
  • Number of Incidents assigned to a specific Oncall Person 
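
The four metrics above can be computed from fairly simple records. The sketch below assumes two hypothetical data sets, a list of incidents and a list of on-call shifts; the field names and sample values are invented for the example.

```python
from collections import Counter

# Hypothetical incident records.
incidents = [
    {"team": "payments", "assignee": "asha"},
    {"team": "payments", "assignee": "ben"},
    {"team": "search", "assignee": "asha"},
]
# Hypothetical shifts: (schedule, person, hours on call).
shifts = [
    ("payments-primary", "asha", 48),
    ("payments-primary", "ben", 24),
    ("search-primary", "asha", 12),
]

# Total number of incidents per team and per on-call person.
incidents_per_team = Counter(i["team"] for i in incidents)
incidents_per_person = Counter(i["assignee"] for i in incidents)

# Hours each person is on call, and the distinct members on each schedule.
hours_per_person = Counter()
members_per_schedule = {}
for schedule, person, hours in shifts:
    hours_per_person[person] += hours
    members_per_schedule.setdefault(schedule, set()).add(person)

# Average number of on-call members per schedule.
avg_members = sum(len(m) for m in members_per_schedule.values()) / len(members_per_schedule)

print(incidents_per_team, incidents_per_person, hours_per_person, avg_members)
```

A person with both high on-call hours and a high incident count is an early signal of burnout risk.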

The Zapoj CEM platform has a number of useful tools to help make sure your team is ready to go on call.


The 5 stages of effective IT incident Detection and Assigning 

 

Ideally, monitoring and alerting tools will detect an incident and alert your IT teams before your customers even notice, though sometimes you’ll first learn about an incident from customer support cases. For successful early incident detection, you need a holistic view into the health of your IT infrastructure. By implementing different monitoring tools to appropriately monitor disparate and new systems, you can gain full-stack observability.

 

No matter how the incident is detected, your first step as a Major IT Incident Manager is to record a new incident with the appropriate details. Not all incidents are created equal. The impact and severity of a system outage affecting 10% of your users are different from those of an outage affecting 90%.

 

While the process of incident detection can grow to be quite complex, you can break it down into the following main stages:

 

Stages of Incident Detection and Creation

Identification:

The first and most obvious step is identifying the problem. Identifying the problem isn’t just about finding the breach, though. The Incident Manager should ask the following questions:

  • What is the impact on customers (internal or external)?
  • What are customers seeing?
  • How many customers are affected (some, all)?
  • When did it start?
  • How many support cases have customers opened?
  • Are there other factors, e.g. Twitter, security, or data loss?

 

 

Logging :

The next step is logging and tracking the problem to make sure each issue and contingency is being documented as it happens. Tracking is vital to ensure that the same breaches don’t happen more than once, and that teams can learn from past weaknesses and/or errors. 

 

 

 

Categorization :

Classifying the breach or incident into categories helps to show trends over time, which exposes recurring issues and vulnerabilities. The categorization stage should map the problem to a specific business service or, even better, to a specific IT service.

 

Prioritization :

Prioritization defines how fast a responder should react to the incident. Deciding which issues to address first can be done in a number of different ways. Oftentimes, it’s done by determining how many users are affected by a particular incident. However, sometimes an interruption affecting just a small number of users can be highly impactful. So it’s important to create an internal incident priority matrix that best suits your organization.

An ITIL incident management priority matrix provides critical baseline information and a hierarchical guide that defines the potential impact on your IT environment, along with a ranked measure of urgency, for deciding priority. By following ITIL impact and urgency recommendations, your organization will be better prepared to respond to and resolve incidents effectively.
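
A priority matrix of this kind can be expressed in a few lines. The sketch below assumes a three-level scale for both impact and urgency and a simple additive rule that yields priorities from 1 (handle first) to 5 (handle last). The exact cell values are an assumption for illustration; each organization should define its own matrix.

```python
# Assumed three-level scale; 1 is the most severe level.
LEVELS = {"high": 1, "medium": 2, "low": 3}


def priority(impact, urgency):
    """Return a priority from 1 (handle first) to 5 (handle last).

    Uses an assumed additive rule over a 3x3 impact/urgency matrix.
    """
    return LEVELS[impact] + LEVELS[urgency] - 1


# Print the full matrix, one row per impact level.
for impact in LEVELS:
    row = [f"P{priority(impact, urgency)}" for urgency in LEVELS]
    print(f"impact={impact:<6}", row)
```

Note how a low-impact but high-urgency incident lands at the same priority as a high-impact, low-urgency one. That is the point made above: the number of affected users alone does not decide priority.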

 

Severity : 

Incident severity levels are a measurement of the impact an incident has on the business. Typically, the lower the severity number, the more impactful the incident. Severity levels can also help build guidelines for response expectations. Using a numbering system for severity levels helps quickly define and communicate the incident. The following are some examples of how to assign severity:

 

Severity 1

Description: A critical incident with very high impact 

Examples :

  • A customer-facing service is unavailable for all users
  • Confidentiality or privacy is breached
  • Customer data loss

Severity 2

Description: A major incident with significant impact

Examples :

  • A customer-facing service is unavailable for some, but not all, customers
  • Core functionality is significantly impacted

Severity 3

Description: A minor incident with low impact

Examples :

  • A minor inconvenience to customers, workaround available
  • Usable performance degradation
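
The example levels above can be turned into a simple rule set. This is a minimal sketch: the `Incident` fields and the `severity` function are hypothetical names, and the rules encode only the examples listed here, not a complete severity policy.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical fields mirroring the examples in the severity list.
    unavailable_for: str = "none"  # "all", "some", or "none" of the customers
    data_loss: bool = False
    privacy_breach: bool = False
    core_functionality_impacted: bool = False


def severity(incident):
    """Return 1 (critical), 2 (major), or 3 (minor)."""
    if incident.unavailable_for == "all" or incident.data_loss or incident.privacy_breach:
        return 1
    if incident.unavailable_for == "some" or incident.core_functionality_impacted:
        return 2
    return 3


print(severity(Incident(unavailable_for="all")))
print(severity(Incident(unavailable_for="some")))
print(severity(Incident()))
```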

 

Assignment :

Finally, you need to assign the incident to the right team. If you have an effective incident response plan in place, the various teams and responsibilities should be clearly laid out. That means when something does happen, you’re able to swiftly assign incidents to people in key roles, and they’ll be prepared to handle them. Automating incident response helps address potential and active breaches quickly, efficiently, and effectively.

 


Learn more about Zapoj IT Event Management and its incident response, which covers everything from detection and resolution to learning and prevention, supporting developers as they move towards owning their code in production.


Situational awareness for Effective IT Incident response

 

Incident response (IR) is the process ITOps, DevOps, and dev teams use to react to IT disruptions such as cyberattacks, security breaches, and server downtime.

An incident response process is something you hope to never need, but when you do, it’s critical that it gives you all the information you need to make important decisions and take the necessary steps so the response goes smoothly.

 

Information Overload 

 

In today’s always-on digital world, IT infrastructure is monitored by multiple monitoring solutions that generate events and alerts, which indicate changes to the IT environment or a monitor being in a failed state. Many IT and development teams get hundreds of emails a day from their monitoring systems due to alert storms that flood their inboxes. This type of notification traffic creates  ‘alert fatigue,’ which makes it very difficult for incident responders to properly diagnose the incident and respond effectively.  

To improve incident response time, it’s imperative to have situational awareness of all IT services: the upstream and downstream dependencies of each IT service, the current status of incidents associated with each IT service, and the mapping of IT services to business services. This enables the incident responder to make response decisions easily and swiftly.
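
As a small illustration of why dependency information matters, the sketch below walks a service dependency map to find every downstream service affected when one service fails. The service names and the `DEPENDS_ON` map are hypothetical; a real platform would build this map from its service catalog.

```python
from collections import deque

# Hypothetical map: service -> the upstream services it depends on.
DEPENDS_ON = {
    "checkout": ["payments-api", "cart"],
    "payments-api": ["database"],
    "cart": ["database"],
    "search": ["search-index"],
}


def impacted_by(failed):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a service to its dependents.
    dependents = {}
    for service, upstreams in DEPENDS_ON.items():
        for upstream in upstreams:
            dependents.setdefault(upstream, []).append(service)

    # Breadth-first walk over the dependents.
    seen, queue = set(), deque([failed])
    while queue:
        for service in dependents.get(queue.popleft(), []):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen


print(sorted(impacted_by("database")))
```

With this view, a responder looking at a database incident can immediately see that checkout is at risk, even if no alert has fired for it yet.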

 

Situational awareness provides Incident responders, business responders and leaders a live, shared view of system health and incident impact to improve real-time decision making when operational issues affect customers.

 

What are the important elements of IT Operations Situational awareness

 

The important elements of situational awareness fall into three categories:

  • Information gathering: know the typical sources of incidents by collecting the information available from ITSM and monitoring tools
  • Understanding and correlation of information: be able to relate the information gathered to IT services and present it using dashboards, status pages, and service dependency graphs
  • Root cause anticipation: be able to anticipate the root cause of an incident

 

Let’s elaborate on each of these and see how it works.

 

Information gathering 

 

Though the monitoring tools have sophisticated features such as reducing duplicate alerts or flapping, it’s still hard for the monitoring tools themselves to determine which information is vital for an incident responder to make a decision.  Adjusting rules or thresholds or setting up maintenance mode at the source can be an arduous and manual process, and forces teams to look to multiple sources of truth to see all their data.

 

At the same time, operations teams perform best when they can see all their data in a single pane of glass. To strike a balance between providing a single pane of glass and not overwhelming the incident responder with monitoring data, information from the various monitoring tools and ITSM needs to be fused properly so that only the actionable events needed for that specific incident are shown.

 

Understanding and correlation of information

The gathered information needs to be fused and presented in a way that forms a clear operating picture, through the following status views and dashboards:

 

Service Activity :

Service Activity allows users to see incident activity across many of their services at once, so that they can quickly understand their digital operations health, and identify if there’s a widespread major incident. 

 

Status Dashboard :

Status dashboard provides technical responders, business responders, and leaders a live, shared view of system health to improve awareness of operational issues. It displays the current status of key business services and sends notifications to alert users when business services are impacted. This feature improves communication between response teams and stakeholders during incidents. 

 

Response Status :

On-call Response Status provides a quick and easy way to see which users are on call across all escalation policies.

 

Timelines :

The Incidents and Changes Timeline provides a timeline of incidents and change events across all services.

 

Anticipation

Anticipation of the root cause of an incident can be achieved using an ML algorithm that leverages historical incident data to create vector representations (or embeddings) of those incidents in a multi-dimensional space. Historical incidents that are similar (consistently occurring within the same time window, triggered from dependent resources or from the same source) are clustered together in the multi-dimensional space, based on the distance between their vectors. When new alerts are triggered, the ML algorithm converts them into their vector representations. Alerts related to the same incident share similar context (e.g., occurring within a time window) and hence will be converted to vectors that fall within the same cluster and be grouped together.
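
The grouping step can be sketched as a nearest-centroid lookup. This is a simplified illustration, not the algorithm used by any specific product: the centroids, the three-dimensional vectors, the alert names, and the distance threshold are all made-up values, and a real system would learn the embeddings and clusters from historical data.

```python
import math

# Hypothetical cluster centroids, assumed to be learned from historical incidents.
CENTROIDS = {
    "database-outage": (0.9, 0.1, 0.0),
    "network-latency": (0.1, 0.8, 0.2),
}


def nearest_cluster(vector, max_distance=0.5):
    """Return the closest cluster name, or None if nothing is close enough."""
    name, centroid = min(CENTROIDS.items(), key=lambda kv: math.dist(vector, kv[1]))
    return name if math.dist(vector, centroid) <= max_distance else None


# Hypothetical new alerts, already converted to vectors.
alerts = {
    "db-conn-errors": (0.85, 0.15, 0.05),
    "slow-queries": (0.8, 0.2, 0.0),
    "disk-full": (0.0, 0.0, 1.0),
}

# Group alerts that fall into the same cluster.
groups = {}
for alert, vector in alerts.items():
    groups.setdefault(nearest_cluster(vector), []).append(alert)

print(groups)
```

Alerts that land in the same cluster are presented as one incident, and the cluster's history gives the responder a first hint at the likely root cause. Alerts with no nearby cluster stay ungrouped for manual triage.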

Zapoj IT Event Management is known as the digital Operations platform for driving real-time incident response. Our customers leverage the core Zapoj CEM platform for coordinating incident response processes when operational issues occur. Many customers also use our situational awareness  features to act as a central nervous system through which digital signals from their various distributed systems can be routed, correlated, and surfaced automatically.