Geotab operates a large and sophisticated server infrastructure: thousands of servers in cloud-hosted environments, all working together. For many companies, telematics has gone from a useful service to a mission-critical function, so keeping these servers and systems operational is essential. Geotab systems are thoroughly tested and built to very high standards of reliability, but they rely in part on third-party components, and engineers sometimes make mistakes that affect customers.
Geotab’s engineers take every issue or failure extremely seriously. Geotab has recently established a team to further improve how such issues are handled and resolved. The designated 24/7 Server Operations On-Call team, formerly a sub-team within Engineering Support, is always one alarm away from solving any critical problem. The team is expected not only to handle the daily on-call requests, but also to work towards preventing future issues from arising in the first place.
Who are the Server Operations On-Call engineers?
The Server Operations engineers are Geotab’s best troubleshooters. They are recruited internally and chosen for their ability to thrive under pressure, their deep understanding of Geotab’s systems and, most importantly, their excellent troubleshooting skills. They are required to make business-critical decisions independently on a regular basis. If a production system goes down, for whatever reason, team members will be woken up at any hour (night or day) to work on the issue until it is fully resolved or, at the least, a workaround has been found that restores service while the root cause is identified and corrected.
What is the troubleshooting process?
The Server Operations On-Call team is Geotab’s first line of defense, making sure the company meets its Service Level Agreements for uptime. In practice, this means responding 24/7 to any of the automated alarms that monitor Geotab’s 1000+ production servers, as well as to any critical issues raised by Resellers or Strategic Partners.
Once an “on-call” event is triggered, the Server Operations engineer will triage the issue and, if they are unable to resolve it on their own, will escalate to the relevant subject matter expert in departments such as Internal Development, MyGeotab Development, Security, IT, Development Operations, and more. All of Geotab’s technical teams have a 24/7 on-call rotation to monitor all service disruptions, encouraging fast resolution and minimizing customer impact.
Using the vast number of performance metrics available to it, the team has built many tools and dashboards to help assess the state of whichever machine or service triggered the on-call. Millions of queries are made daily to help make sure all systems are operational. In the event of a failure, each type of issue has a specific troubleshooting strategy. The team works constantly to improve its monitoring tools and eliminate false positives.
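One common way to cut false positives in automated monitoring is to alarm only after several failed checks in a row, so that a single dropped probe does not wake anyone up. The article does not describe how Geotab's tools do this, so the following is only an illustrative sketch: the class name `ConsecutiveFailureAlarm`, the threshold of three, and the server name are all hypothetical.

```python
from collections import defaultdict


class ConsecutiveFailureAlarm:
    """Hypothetical helper: trigger an on-call only after `threshold`
    failed health checks in a row for the same server."""

    def __init__(self, threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        # server name -> length of the current failure streak
        self._failures = defaultdict(int)

    def record(self, server: str, healthy: bool) -> bool:
        """Record one check result; return True if an on-call should fire."""
        if healthy:
            self._failures[server] = 0  # any success resets the streak
            return False
        self._failures[server] += 1
        # Fire exactly once, when the streak first reaches the threshold.
        return self._failures[server] == self.threshold


# One isolated failure is ignored; three failures in a row fire once.
alarm = ConsecutiveFailureAlarm(threshold=3)
results = [False, True, False, False, False, False]
fired = [alarm.record("server-a", ok) for ok in results]
```

The trade-off in a scheme like this is detection delay: a higher threshold suppresses more noise but postpones the alarm by the same number of check intervals.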
What is the “War Room”?
The team’s escalation policy culminates in what is called a “War Room.” On the rare occasions when on-call teams are unable to resolve a service degradation or see a clear path to resolving it, or when there is a widespread outage, the team initiates its “War Room” policy. All Software Development leads and the CEO are engaged, no matter the time of day. Depending on the hour, participants meet in either a boardroom or a virtual room and stay until they come up with a solution, or at least a temporary workaround.
Regardless of what level of escalation is required, the Server Operations engineer who received the initial on-call stays involved with the issue until it is fully resolved, both to orchestrate the efforts of the many teams involved and to promote accountability.
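The escalation path described above (on-call engineer, then subject matter expert, then War Room) can be pictured as a small ladder. The sketch below is an assumption-laden illustration, not Geotab's actual tooling: the `Level` names and the `next_level` function are hypothetical, and the real decision is made by people rather than code.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical escalation levels, modeled on the article's description."""
    ON_CALL = 1         # Server Operations engineer triages the issue
    SUBJECT_EXPERT = 2  # on-call expert from the relevant technical team
    WAR_ROOM = 3        # Software Development leads and the CEO are engaged


def next_level(current: Level, clear_path: bool, widespread_outage: bool) -> Level:
    """Return the level that should handle an unresolved issue next."""
    if widespread_outage:
        return Level.WAR_ROOM  # a widespread outage goes straight to the top
    if clear_path:
        return current  # responders see a path to resolution and keep working
    # No clear path: move up one step, capped at the War Room.
    return Level(min(current + 1, Level.WAR_ROOM))


# The engineer cannot resolve the issue alone, so an expert is engaged.
step_one = next_level(Level.ON_CALL, clear_path=False, widespread_outage=False)
# The expert also sees no clear path, so the War Room is initiated.
step_two = next_level(step_one, clear_path=False, widespread_outage=False)
```

Note that the sketch tracks only who is engaged; per the policy above, the engineer who received the initial on-call remains the owner at every level.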
How does the team help improve future products?
In addition to responding to active on-calls, the team applies its knowledge and experience to the development of new iterations of Geotab products. Working closely with both customers and the Development team, team members are in a unique position to assist ongoing efforts to build more robust and reliable solutions.
When the Server Operations team identifies an outage or failure of the service, they dig down to its root cause, understanding at the most fundamental level how it happened and how it affected the customer. The developers then make changes to the code to prevent it from happening again. The Server Operations team helps validate the changes, making sure the problem has indeed been fixed.
The Server Operations On-Call engineers form the backbone of Geotab’s commitment to reliability, providing customers, resellers and partners with dependable service now and in the future.