While catching up on emails this weekend, I ran into a problem—a link in one of the messages ruined my Sunday and almost ruined my professional reputation.

I clicked on the link. Nothing. Refreshed the page. Nothing. Tried a different browser. No joy. Changed my ISP. Nothing. Changed DNS providers. And, as you might guess…nothing.

Just this error: The site cannot be reached. Traceroutes looked fine.

At Catchpoint, we rely on Microsoft Office 365 for all our corporate communication. The service includes a critical feature, Advanced Threat Protection (ATP) Safe Links. This feature vets every link in every company email to ensure that we don’t click on anything malicious (phishing links, attachments, etc.).

I told myself: I’ll work on something else, it’ll fix itself at some point. Minutes turned to hours. And suddenly, I got another email from the same person, asking, “Why are you ignoring me?!”

Since we drink our own champagne here at Catchpoint, I jumped into our own solution, specifically our SaaS monitoring. We have deployed a Catchpoint synthetic agent in every office to monitor every service we use (Office 365, Salesforce, Slack, NetSuite, Zoom, etc.). Because we had run into a similar issue previously, we’d enabled a monitor to observe this specific Microsoft service.

Lo and behold, I discovered this was a widespread issue. I was not alone! It was affecting our Bangalore, Boston, Los Angeles, and New York offices!

Some IP addresses were more affected than others.

Also, most of the time, the problem was in establishing the TCP connection.
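For readers who want to check this kind of symptom themselves, here is a minimal sketch of timing the TCP connect phase against each IP address behind a hostname. It is not how Catchpoint's agents work; the function names (`tcp_connect_time`, `resolve_ipv4`, `probe_host`), the default port 443, and the 5-second timeout are all assumptions made for illustration.

```python
import socket
import time
from typing import Dict, List, Optional


def tcp_connect_time(ip: str, port: int = 443, timeout: float = 5.0) -> Optional[float]:
    """Return the seconds needed to open a TCP connection, or None if it fails."""
    start = time.monotonic()
    try:
        # create_connection returns once the TCP handshake has completed.
        with socket.create_connection((ip, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return None


def resolve_ipv4(host: str, port: int = 443) -> List[str]:
    """Return the distinct IPv4 addresses the resolver hands back for a host."""
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def probe_host(host: str, port: int = 443, timeout: float = 5.0) -> Dict[str, Optional[float]]:
    """Time the TCP connect separately for every IPv4 address behind a host."""
    return {ip: tcp_connect_time(ip, port, timeout) for ip in resolve_ipv4(host, port)}
```

Calling `probe_host` with the hostname that appears in the rewritten link returns one entry per IP address. A value of `None` for some addresses but not others would match the pattern described above, where only part of the address pool fails to accept connections.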

 

 

 

We also validated that this issue was not limited to our offices. Comparing the data to measurements taken from our extensive backbone and broadband testing network showed a similar pattern.

The issue started on May 3rd, at 16:04:40 EST to be precise, and is still ongoing! Interestingly, this issue started shortly after the Office 365 outage of May 2nd. Maybe this incident was a side effect?

Of course, I emailed our corporate IT team, but they didn’t know anything either. When they checked the Microsoft status page in our Office 365 admin panel, there were no reported issues.

In the meantime, I annoyed a customer because I couldn’t review the document she sent in time, got our corporate IT team to troubleshoot an issue on a weekend, and wasted hours of productivity and precious family time.

This episode is a prime example of how complicated things have become. Any little grain of sand can have a significant ripple effect, given the complexity surrounding every service we use and the total lack of visibility that comes from having no control over those systems and services.

We live in a new enterprise IT world, where best-of-breed solutions are easily integrated into the enterprise. To ensure customers get all the benefits of the best-of-breed cloud ecosystem, these tools must work! In a SaaS world, SLAs between vendors and customers have never been more critical. As a vendor of a SaaS solution, how do you make sure you monitor all the services that can cause an SLA breach? Moreover, as a buyer of SaaS solutions, what can you do to make sure your vendors are meeting the SLA that was mutually agreed upon? Furthermore, should you trust your SaaS vendors with their SLA reports? These are some pertinent questions you must address when using SaaS. And going forward, I will follow this old Russian proverb: TRUST BUT VERIFY!
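As a rough illustration of what "verify" can mean in practice, the sketch below computes availability from your own probe results and compares it with an SLA target. The `ProbeResult` type, the function names, and the definition of availability as the share of successful checks are assumptions for this example; a real SLA defines downtime in its own contractual terms.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ProbeResult:
    timestamp: float  # Unix time of the check
    success: bool     # True if the check completed without error


def availability(results: Iterable[ProbeResult]) -> float:
    """Return the percentage of checks that succeeded."""
    results = list(results)
    if not results:
        raise ValueError("no probe results to evaluate")
    ok = sum(1 for r in results if r.success)
    return 100.0 * ok / len(results)


def meets_sla(results: Iterable[ProbeResult], target_percent: float) -> bool:
    """Compare measured availability with the agreed target, e.g. 99.9."""
    return availability(results) >= target_percent
```

With 1,000 checks of which 2 failed, `availability` gives 99.8 percent, which would miss a 99.9 percent target even if the vendor's status page showed no incident.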

Mehdi

Author Mehdi Daoudi
