How to host an effective downtime postmortem
Posted by: Lucid Content Team
Downtime incidents are never a good thing. They can cost your company money, resources, and consumer trust. However, hosting a downtime postmortem can turn an incident into a learning experience by giving your team the opportunity to develop a better understanding of the incident and identify best practices for preventing downtime in the future.
What is a network downtime postmortem meeting?
Postmortem meetings bring critical team members together for a review and problem-solving session. The focus? Reviewing the incident, compiling a list of learnings, and preventing future downtime events. Usually, the output for this type of meeting is a postmortem report and action plan.
Following the meeting, your network engineering team should have specific action items to work on. Your organization may decide to use part of your written postmortem report to create external communications for customers who were impacted by the incident. Additionally, you may be asked to share information internally with relevant teams.
Keep in mind that your meeting’s emphasis should be on the future, using references to the past only to move your team forward.
Benefits of hosting a network downtime postmortem
Postmortems provide your team with an opportunity to develop a 360-degree view of what happened and what needs to happen next. This process plays a vital part in the continuity of service for your customers. Since ongoing uptime is essential for your organization’s operations, conducting a postmortem meeting after a downtime event allows you to salvage a valuable learning opportunity out of an otherwise negative experience.
There are other benefits, too, such as:
- Getting your organization on the same page
- Developing consistent messaging so you can communicate with your customers
- Reviewing the timeline and incident details for accuracy
When you plan the meeting, invite any team members who were close to the incident and the resulting response. Beyond that, consider including any decision-makers who must approve the changes and decisions established at your meeting. For stakeholders or managers who can’t make it, be sure to take good notes and identify action items that require follow-up.
Data, metrics, logs, and timelines
It’s important that you have the right records available to review as a team, including any logs and timestamped details from the incident and response. Be as specific as possible with times, event histories, and details to ensure accuracy. And resist the temptation to round numbers and change nine minutes to “about ten minutes” or 1:32 pm to “around 1:30 pm.” Specific details may yield valuable information later.
Your organization should gather metrics about how long the downtime episode lasted, how quickly you initiated a response, and how serious the incident was. These details can then be compared with other incidents to track how well your organization is doing at incident response and resolution.
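As a rough illustration of the metrics described here, the sketch below computes time to respond and time to resolve from three timestamps. The timestamps, the function name `incident_metrics`, and the metric names are all assumptions made up for this example, so substitute the values from your own logs.

```python
from datetime import datetime, timedelta

# All timestamps are invented illustration values, not real incident data.
outage_start = datetime(2024, 3, 5, 13, 32)      # first failed health check
response_start = datetime(2024, 3, 5, 13, 41)    # on-call engineer acknowledges
service_restored = datetime(2024, 3, 5, 14, 58)  # service confirmed healthy


def incident_metrics(start, responded, restored):
    """Return exact durations for one incident; nothing is rounded."""
    if not (start <= responded <= restored):
        raise ValueError("timestamps must be in chronological order")
    return {
        "time_to_respond": responded - start,
        "time_to_resolve": restored - start,
    }


metrics = incident_metrics(outage_start, response_start, service_restored)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Because the durations are kept as exact `timedelta` values, a nine-minute response stays nine minutes, and the same function can be run over past incidents to compare results.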
Working from available information, carefully recreate the timeline as a group, noting how your organization responded and what it ultimately took to successfully resolve the downtime incident.
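One way to start recreating the timeline is to merge timestamped records from several sources into a single chronological list. The sketch below assumes three hypothetical sources (monitoring alerts, chat history, and a change log) with invented entries. The helper name `build_timeline` and the timestamp format are also assumptions.

```python
from datetime import datetime

# Hypothetical records from three sources; the entries and times are invented.
monitoring_log = [
    ("2024-03-05 13:32", "monitoring", "Health check fails for edge router"),
    ("2024-03-05 14:58", "monitoring", "Health check passes"),
]
chat_history = [
    ("2024-03-05 13:41", "chat", "On-call engineer acknowledges page"),
    ("2024-03-05 14:20", "chat", "Config rollback proposed"),
]
change_log = [
    ("2024-03-05 13:27", "changes", "Routing config change deployed"),
    ("2024-03-05 14:45", "changes", "Routing config rolled back"),
]


def build_timeline(*sources):
    """Parse every record and return all events sorted by timestamp."""
    events = [
        (datetime.strptime(ts, "%Y-%m-%d %H:%M"), source, note)
        for records in sources
        for ts, source, note in records
    ]
    return sorted(events, key=lambda event: event[0])


timeline = build_timeline(monitoring_log, chat_history, change_log)
for when, source, note in timeline:
    print(f"{when:%H:%M}  [{source}]  {note}")
```

Keeping the source name next to each event makes it easier for the group to check an entry against the original record during the meeting.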
Human vs. technological errors and shortcomings
As you review the data, your team can identify the issues and errors that contributed to the incident. Determining whether specific errors had a human or a technological cause will be key to drafting an actionable prevention plan.
While you have everyone’s attention, it’s the perfect time to start looking for possible solutions.
Best practices for effective network downtime postmortem meetings
Follow these best practices to achieve your goals for the meeting and keep everyone productive.
Get the timing right
Ideally, you should host your downtime postmortem as soon as possible after the incident. Plan to meet within the first 48 hours or so, and definitely no later than five business days afterward. You want to act while the events are still fresh in your team’s memory and before customers and other stakeholders start to wonder why you haven’t offered an explanation. Plus, the sooner you can uncover what went wrong, the more easily you can prevent another occurrence.
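To turn these guidelines into calendar dates, a small sketch like the one below can help. It assumes the clock starts when the incident is resolved, treats Monday through Friday as business days, and ignores public holidays. The resolution time and the helper `add_business_days` are hypothetical.

```python
from datetime import datetime, timedelta


def add_business_days(start, days):
    """Step forward one day at a time, counting only Monday through Friday."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday; holidays are ignored
            remaining -= 1
    return current


# Invented resolution time for illustration (a Tuesday afternoon).
resolved_at = datetime(2024, 3, 5, 14, 58)

target_meeting_by = resolved_at + timedelta(hours=48)
latest_meeting_by = add_business_days(resolved_at, 5)

print(f"Aim to meet by:   {target_meeting_by:%A %Y-%m-%d %H:%M}")
print(f"Meet no later than: {latest_meeting_by:%A %Y-%m-%d %H:%M}")
```

If your organization observes holidays or uses a different work week, adjust the weekday check accordingly.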
Leave blame at the door
Don’t make it personal. Postmortem meetings aren’t the time to place blame, accuse individual team members, or point fingers at particular groups or departments. Blame discourages honesty and active participation, both of which are essential to successful postmortems. Remind your network engineering team that the postmortem’s true purpose is prevention, not punishment for past mistakes or failures.
Keep the meeting focused with an agenda
Open-ended postmortems are often unproductive. Instead of taking an unstructured approach, create an agenda and share it with meeting attendees beforehand to keep everyone on track.
Important agenda items include:
- Incident summary: As a team, create a short summary of what happened.
- Analysis: Diagnose the root causes of the incident.
- Architecture vulnerabilities: Examine possible vulnerabilities in your network architecture.
- Prevention/solution discovery: Conduct a brainstorming session discussing possible solutions and prevention strategies.
- Notifications: Discuss who should be notified about the downtime event and what should be communicated internally and externally about the incident.
- Action list: Create a list and assign owners to specific tasks.
- Follow up: Plan accountability in advance and decide how you’ll keep everyone updated.
Save more unstructured conversation for the end of the meeting. In any remaining time after you get through your agenda items, take the opportunity to open the floor for other questions and comments. This respects everyone’s time and helps you reach your meeting goals.
Incentivize your team
Throughout the postmortem process, try to win buy-in from your team. Encourage the group to collaborate on the meeting report and incident response plan. In addition, provide an opportunity for each individual to create their own postmortem summary. Remind participants that their incident plan will guide the entire organization’s actions and serve as an important reference for management.
Reference your cloud architecture diagram
As you conduct your postmortem, it’s important to get everyone working with the same data. By referring to your cloud architecture diagram during these conversations, you can facilitate clear discussions of what systems were impacted during the incident, why they were impacted, and what architectural improvements are currently under discussion.
Finish with a postmortem report
This is the final outcome of your incident response planning and downtime postmortem meeting. Your report should summarize your group’s findings and recommendations. To make things simple, consider starting with a report template and drafting your report in a shared document. This will make it easier for people to pitch in when you’re ready for the group to review the document.
To prepare your report, consider including the following elements:
- Incident summary
- Response to the incident
- Timeline of the event and response
- Discoveries and opportunities
To evenly distribute the workload, think about taking time during your meeting to assign different sections of the report to different team members. This will also ensure that the people closest to each section of the report are the ones drafting content. Making your report a collaborative effort allows you to quickly create an effective and accurate incident response report.
Learn from your postmortem meeting
Once your report is drafted, it’s time to share it with key stakeholders. You may want to share your team’s findings across the organization. This will keep everyone informed and will give customer-facing departments the information they need to handle inquiries. In some instances, it may even make sense to publish some of your findings for customers or publicly share the changes you’re making to prevent future downtime events.
After the meeting, don’t forget to follow up with an internal audit. Find out how everyone is doing with the new changes, policies, or technologies in place and make sure they have the support and resources they need. If anyone still needs buy-in or permission for action items they own, it’s important to provide reassurance and official backing. Maintaining a culture of visibility and accountability without blame will empower your team long after the downtime incident is resolved.