How to Create a Downtime Dashboard Tool with Zapier and Google Sheets

How to Create a Downtime Dashboard Tool with Zapier and Google Sheets thumbnail

At Lucid we want to monitor the health of our services both individually and in aggregate. Last month we worked on setting up a new downtime dashboard tool. We wanted the following functionality:

  • a record of incidents containing the duration of downtime, the service(s) affected, and information about the cause and resolution
  • ability to mark incidents as false positives and exclude them from downtime calculations
  • ability to correct the downtime recorded for an incident
  • ability to see the number of minutes of downtime for one or more services over a given time period
  • a graph of monthly downtime per service
  • a graph of total downtime

After several hours of searching, we weren’t able to find any tool that could do what we wanted better than a spreadsheet. So, we decided to use Google Sheets to record incidents and graph trends. The next problem was how to get the incident data into Google Sheets. At first we were going to write a cron job that would poll OpsGenie for alerts, process any alerts about downtime, and use Zapier to add rows to a Google Sheet. However, after looking into Zapier a little, we discovered Zapier could actually do everything we needed the cron job to do.

Diagram of data flow from OpsGenie to Google Sheets

Zapier is tool that makes it easier to send data between different cloud services. In many ways it is similar to piping unix commands together. You take data from a source, perform some transformations on the data, and feed it into a destination. In our case, the source was OpsGenie and the destination was a Google Sheet.

Step 1: Getting the data

The first thing we needed to do was get the incident information into Zapier. There are two kinds of web hooks that Zapier supports: catch hooks and polling. With a catch hook, Zapier provides you with a URL to which you can post JSON data to begin the pipeline. With polling, on the other hand, you give Zapier a URL which it will periodically fetch, and then you look for differences. If there is a difference from the last poll, it will start processing the new data. Catch hooks are preferable, since a request is only made when new data is actually available. However, depending on your application, catch hooks may not be feasible.

Selecting the type of web hook

Fortunately, OpsGenie has the capability to send data to a web hook on various events (the web hook integration), so we added a web hook that would send any alerts containing the string “downtime:” in the message to Zapier when the alert was closed.

Unfortunately, the incident data sent by OpsGenie didn’t contain the time the alert was closed at, which we needed to calculate the duration of the downtime. Thankfully, Zapier had another feature which allowed us to get the information we needed. As a second step in the Zapier pipeline, we added another web hook that makes a GET request to OpsGenie to get the full incident report using the id sent from OpsGenie to Zapier in the first step.

At this point we have all the information we need about the incident in Zapier.

Data values pulled in from OpsGenie

Step 2: Sending data to Google Sheets

The main reason we went with Zapier is that it makes it extremely easy to add data to Google Sheets. Essentially you just create a template that uses data retrieved from previous steps in the pipeline, and that template is used to create or update a row in a spreadsheet. The template is basically a form with fields that come from a header row on the spreadsheet that the data will be exported to. The sheets integration also has a “Lookup Spreadsheet Row” option, which allows you to look up a row based on a key column, and possibly add a new row, if one doesn’t exist. This is the option we used, to avoid adding duplicate entries, using the alert id as the deduplication key.

Google Sheets Zapier template

One slight inconvenience is that Zapier doesn’t really have a good way to handle spreadsheet expressions in the template that reference cells in the same row. There are two ways to deal with this. The first is to output a spreadsheet expression in the template that uses the inputs from earlier in the pipeline directly. For example, the start field in the screenshot. This works in many cases, but not always. In our case we wanted the user to be able update two columns in the spreadsheet, “Additional Downtime” and “Include in calculation,” and automatically update the “Total downtime” column. In order to do this, we needed to create a formula that could somehow reference cells in the same row without the actual cell name. This is possible using the INDIRECT and ROW functions. For example, in order to reference the D cell on the current cell you would use INDIRECT("D"&ROW()). This concatenates the string “D” with the number of the row and then converts the string to a cell reference. This works pretty well, but is somewhat brittle since if the order of the columns ever changes the Zapier template needs to be updated to match.

Finally, although not completely necessary, transforming the data a little bit in Zapier before sending it to Google Sheets made building the spreadsheet easier and more performant. This is possible using the “Code by Zapier” action, which allows you to run a simple javascript function to transform your data. We did three things with this:

  1. Parse the name of the service from the message of the OpsGenie alert.
  2. Convert the timestamp of the alert creation from nanoseconds to days because days are the units Google Sheets uses for datetimes.
  3. Convert the closeTime (i.e. the duration) from milliseconds to minutes.

Step 3: Analyzing Data

Once we had the incident data in a page of our spreadsheet, we needed to create pages for the analysis and graphs. The graphs themselves are pretty straightforward, but generating the data to graph is less so.

First we wanted to be able to calculate the minutes of downtime and percentage of uptime for each service between a start and end date. In order to do so, we created a cell for the start date and a cell for the end date. Conveniently, Google sheets will pop up a date selector if you double click on a cell containing a date.

To sum all the downtime for all incidents in the time range split by services we use the following formula =sumifs(incidents!C:C,A2,arrayformula(datevalue(incidents!D:D)),">"&$D$2,arrayformula(datevalue(incidents!D:D)),"<"&$E$2), where incidents!C:C is a range containing the total downtime for the incidents, A2 is the cell containing the name of the service, incidents!D:D is a range containing the timestamps corresponding to the durations in C, and D2 and E2 are the cells containing the start and end date, respectively. To make the dashboard easier to use, we also generate the service names using the formula =sort(unique(incidents!C:C)).

Spreadsheet formula and incidents sheet header

Creating data for graphs by service works very similarly, except that rather than using specific cells for the start and end date, we use the boundaries of months and create a grid of service vs month that can then be graphed.

Graphing total downtime is a little more complicated because if multiple services have downtime at the same time, then we don’t want to double count the overlapping time (because typically that means they are related). To do that we wrote a javascript function to use in Google app script, which is accessible from Tools -> Script Editor, which takes a range containing the start and end times for each incident and calculates the total downtime collapsing any overlapping time ranges. Here is the code:


function totalDowntime(range) {
  if (typeof(range) != 'object') {
    return 0;
  }
  var merged = [];
  var current = null;
  for (var i = 0; i < range.length; i++) {
    var start = range[i][0],
        end = range[i][1];
    if (current && start <= current[1]) {
      current[1] = end;
    } else {
      current = [start, end];
      merged.push(current);
    }
  }
  return merged.reduce(function(minutes, dur) {
    return minutes + (dur[1] - dur[0])/60000;
  }, 0);
}

Once defined, the function can be used from formulas in cells of the spreadsheet.

Conclusion

Although we were disappointed when we were unable to find a downtime dashboard tool that provided the functionality we needed, it was surprisingly easy to build something that worked using Zapier and Google Sheets. So far we have been pretty happy with the result. Eventually we may need something a little more robust and scalable, but for now an automatically generating spreadsheet is sufficient. What was your most successful project that was quickly hacked together with minimal resources?

No Comments, Be The First!

Your email address will not be published.