
03 Dec Co-Responsibility in Hybrid IT

Operational Continuity Takes a Village

Today’s post departs from my current stream of topics because I have been thinking about this subject often lately. I apologize for the ambiguous title, but I think it encapsulates what I want to talk about. “Hybrid IT” describes the technology supporting an organization in which the systems and capabilities that support the business are deployed in a combination of on-premises data centers, often in separate places, and “cloud” locations that are both geographically and legally separated. The legal separation is often defined by separate company charters and articles, and mediated by contracts between the entities. These contracts may define the uptime and recovery service levels to be maintained, and the separate and mutual responsibilities for keeping technology up and running. These agreements are commonly known as Service Level Agreements or SLAs.

Co-Responsibility Puzzle

Business continuity can suffer greatly from simple failures, especially when they occur at critical nodes such as Domain Name Servers, telephony systems and primary storage systems. Authentication problems arising from forgotten policy settings can bring a company to its digital knees. It takes a village to troubleshoot many types of system failures, and the more complex the systems portfolio, the more important it is to formalize the co-responsibility model.
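One way to see why a more complex portfolio raises the stakes is to work through the arithmetic of chained service levels. The sketch below is illustrative only: it assumes the services form a serial dependency chain (each must be up for the business function to work) and that their failures are independent, which is rarely exactly true in practice. The function names and the availability figures are made up for the example and do not come from any contract discussed here.

```python
from math import prod


def composite_availability(availabilities):
    """Availability of a serial chain, assuming independent failures.

    Each value is a fraction between 0 and 1 (for example 0.999).
    """
    return prod(availabilities)


def allowed_downtime_hours(availability, hours_in_period=30 * 24):
    """Downtime budget implied by an availability target over one period."""
    return (1 - availability) * hours_in_period


# Hypothetical chain: DNS, authentication and a cloud-hosted application,
# each covered by its own SLA.
chain = [0.999, 0.999, 0.9995]
end_to_end = composite_availability(chain)
print(f"end-to-end availability: {end_to_end:.4%}")
print(f"monthly downtime budget: {allowed_downtime_hours(end_to_end):.2f} hours")
```

Under these assumptions, the end-to-end figure is always lower than the weakest individual SLA, which is one reason the parties need to agree up front on who troubleshoots what.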


Section 8 #11


responsible (from Reverso.net)
      adj

1    postpositive; usually foll by: for –  having control or authority (over)
2    postpositive; foll by: to –  being accountable for one’s actions and decisions (to)
      e.g.  to be responsible to one’s commanding officer
3    (of a position, duty, etc.) involving decision and accountability
4    often foll by: for –  being the agent or cause (of some action)
      e.g.  to be responsible for a mistake
5    able to take rational decisions without supervision; accountable for one’s own actions
      e.g.  a responsible adult
6    able to meet financial obligations; of sound credit
     (C16: from Latin responsus, from respondere to respond)
In Information Technology Service Management (ITSM), support responsibilities could apply to applications, databases, servers, storage, networks (including telephone and mobile) and anything for which an IT organization may have the charge of maintaining and troubleshooting. Even electricity generators may be in the mix. The more complex the mix is, the more diverse the skill sets needed to provide continuity of operations. And while the buck may stop at a senior manager or leader, the work is inevitably distributed across team members in various backgrounds, locations and even separate companies.
Maintaining uptime and service levels involves co-responsibility in hybrid IT shops, where experts from vendors work in teams with internal IT to deliver business continuity. Outsourcing capabilities to “cloud” vendors can be tricky if you are not prepared for the costs. Let me share with you a recent comment I made on a LinkedIn discussion.
Himanshu Thakur asked: “I am wondering if anyone has worked for, or tried to achieve, an enterprise IT architecture entirely on the cloud? My key concerns with the concept are:
  1. Integration – As an EA, I would like to choose the best platform for the function I am trying to put on the cloud. Different functions may go to different vendors (on their clouds) but for my enterprise, it is crucial that they integrate well.
  2. Customisation – Even if the software my enterprise chooses is the best of the breed, I may still want to customise it for my organisation. How much flexibility would this arrangement offer?
  3. Control – What if the cloud platform we use for a particular function [is] not up to scratch? How easy is it to move away from one vendor to another?
  4. Management Reporting – are there any platforms available, again in the cloud possibly, that could source data from disparate applications all over and put together management reports tailored for my enterprise needs?

I am sure there are other challenges too.” 

The thread has many great responses. My response was from an engagement in which I served as IT Architect for a new company. “We did it because we had to. In 1Q 2013, Land of Lincoln Health in Illinois received funding to establish a Consumer Operated and Oriented Health Plan (COOP) through the US Government Affordable Care Act. There was enough funding to establish the systems needed to create plans, define benefits, sell plans, enroll members, connect with medical professionals, process claims, process billing to members, process payments to doctors, etc… – but we had to have all that up and running by October 1st of the same year. We had to go from absolute zero to 100 in less than 9 months, and the only way to do it was using the established capabilities in the cloud exclusively.
We succeeded, and began enrolling members on October 1st. All the responses to your questions are valid and I endorse the advice you have received. My great learning from this experience is that it is necessary to formalize the Co-Responsibility for maintaining SLAs early and thoroughly, spelling out collaborative troubleshooting procedures, so in significant failures you can achieve rapid progress and restoration of services rather than finger-pointing. Without such collaboration, problems can be prolonged to no one’s advantage. Shared tools and shared access to information, including tickets (ITSM), config information (CMDB), logs and packet sniffer streams, are extremely helpful. Clearly established primary and secondary communication channels are also essential to get the SMEs talking to each other rapidly and frequently when things go south.” (LinkedIn discussion)
 SLA Management Tools

ITSM

Let’s look at the components and why each is important. First, ITSM is often where trouble tickets begin their lives. A ticket may be handled by the same person who identifies the problem, in case of pro-active troubleshooting based on automated indications and warnings, or it may be escalated to experts and handled by a team of people with different skills. ITSM usually recognizes at least three levels of support:
  1. Level 1 support often receives calls, creates tickets and resolves those for which adequate triage scripts enable the help desk generalist to perform the work. If access restrictions prohibit help desk personnel from the necessary access, the ticket will be assigned to someone who has the needed access. If the complexity of the problem is beyond the generalist’s ability, the ITSM tool should provide specific guidance as to what person or group is qualified and available to handle the ticket. Mature organizations reduce as many common or recurring issues as possible to scripts so they can be handled by the generalists without escalating to specialists.
  2. Level 2 support is for issues beyond the ability of the generalists or existing triage scripts, and requires either special skills or special access, or both to resolve. Many IT shops have several technical people who are capable of and assigned to handling trouble tickets. The more complex the IT systems portfolio, especially the more diverse the technologies in the mix, the larger this group needs to be to avoid more expensive Level 3 support escalations.
  3. Level 3 support is for problems that require the gurus. People with deep expertise in specific technologies, often outside experts from system vendors, are the big guns needed to solve the most vexing problems.
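The three levels above can be sketched as a simple routing rule. This is a minimal illustration, not how any particular ITSM product works: the `Ticket` fields, the issue-type labels and the `route_ticket` function are all hypothetical names invented for the example, and a real tool would also weigh availability, priority and contract terms.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import IntEnum


class SupportLevel(IntEnum):
    LEVEL_1 = 1  # help desk generalists working from triage scripts
    LEVEL_2 = 2  # internal technical staff with special skills or access
    LEVEL_3 = 3  # deep experts, often from the system vendor


@dataclass(frozen=True)
class Ticket:
    issue_type: str             # hypothetical category label
    needs_special_access: bool  # True if the help desk lacks the access


def route_ticket(
    ticket: Ticket,
    triage_scripts: set[str],
    internal_skills: set[str],
) -> SupportLevel:
    """Pick the lowest support level able to handle the ticket."""
    # Level 1: a triage script exists and no special access is required.
    if ticket.issue_type in triage_scripts and not ticket.needs_special_access:
        return SupportLevel.LEVEL_1
    # Level 2: internal technical staff are assigned to this issue type.
    if ticket.issue_type in internal_skills:
        return SupportLevel.LEVEL_2
    # Level 3: nobody internal is qualified, so escalate to the gurus.
    return SupportLevel.LEVEL_3


# Illustrative data only.
scripts = {"password_reset", "printer_jam"}
skills = {"password_reset", "vpn_outage", "db_slow"}
for t in (Ticket("password_reset", False), Ticket("storage_firmware", False)):
    print(t.issue_type, "->", route_ticket(t, scripts, skills).name)
```

Every issue type added to `scripts` moves work down to Level 1, which is the point made above about mature organizations reducing recurring issues to scripts.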

Joe Hertvik defines level 0 and level 4 support as well. Level 0 is self-service and may include automated password reset, web forms for requesting IT support, and FAQ or knowledge base lookup. Level 0 support requires no Help Desk technician. Level 4 involves hardware and/or software vendors for specialized application support and for printer, copier and other equipment maintenance. Level 4 support, sometimes called depot maintenance, is contracted by an organization for specific services, but the providers are not part of the organization. “Generally speaking, the bigger the organization the more stratified these roles” (Joe Hertvik). A good definition of escalation is available on a Washington.edu wiki.

Tools like ServiceNow, Jira, Cherwell and Heat support these capabilities. When a problem can be easily isolated to a specific machine, application, disk or database, troubleshooting can be pretty straightforward. Sometimes, however, the root cause is not apparent and a team is needed to perform repairs and restore service.

Monitoring

Tools like SolarWinds, Splunk and Microsoft SCOM provide ongoing streams of system health information. Systems like these can help monitor network, server and workstation performance, optimize applications and databases, and oversee security. The capabilities range from electronic asset discovery (an automatic way to document the IT portfolio) to predictive intelligence on system failure modes and security breaches. These tools are getting smarter every year, but human experts are still needed for troubleshooting almost all complex problems. Monitoring is an indispensable component for maintaining operational continuity, but visibility needs to be shared with experts from both internal and vendor organizations to maximize the value of collaboration.

Configuration Management

With the rapid progress of technology innovation and the frequent penetration of sensitive resources by malicious or otherwise unwanted hackers, many systems need periodic patches and updates. Sometimes these are managed by the vendors themselves, with automated access to the networks and devices that need patches and updates. Sometimes this is done using internal IT managed configuration tools such as MS SCCM, sometimes in ITSM tools such as ServiceNow, and sometimes in monitoring tools such as SolarWinds. The Hybrid IT organization is likely to have multiple sources of configuration information across on-premises and cloud-based capabilities. The ideal scenario is for all this information to be accessible in a single place. This is seldom the case in real life, so the next best thing is for collaborating experts to use a desktop sharing tool, such as WebEx, or another communication mechanism to share critical configuration data (and monitoring data if available) with other team members in a collaborative troubleshooting session.
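To make the "single place" idea concrete, here is a minimal sketch of combining configuration records exported from several tools into one view per asset, and flagging attributes on which the tools disagree. It assumes each tool's data has already been exported into plain dictionaries; the tool labels, asset names and attribute names are hypothetical, and no vendor API is being modeled.

```python
from collections import defaultdict


def merge_config_sources(sources):
    """Combine per-tool config records into one view per asset.

    `sources` maps a tool label to {asset_id: {attribute: value}}.
    Returns (merged, conflicts). `merged` holds only the attributes on
    which every reporting tool agrees; `conflicts` lists the rest as
    (asset_id, attribute, {tool: value}) for a human to resolve.
    """
    # asset_id -> attribute -> {tool: value}
    seen = defaultdict(lambda: defaultdict(dict))
    for tool, assets in sources.items():
        for asset_id, attributes in assets.items():
            for attribute, value in attributes.items():
                seen[asset_id][attribute][tool] = value

    merged, conflicts = {}, []
    for asset_id, attributes in seen.items():
        merged[asset_id] = {}
        for attribute, by_tool in attributes.items():
            values = set(by_tool.values())
            if len(values) == 1:
                merged[asset_id][attribute] = values.pop()
            else:
                conflicts.append((asset_id, attribute, dict(by_tool)))
    return merged, conflicts


# Illustrative exports from two hypothetical sources.
sources = {
    "patch_tool": {"web01": {"os_patch": "2024-05", "owner": "internal"}},
    "itsm_cmdb": {
        "web01": {"os_patch": "2024-03", "owner": "internal"},
        "crm-saas": {"owner": "vendor"},
    },
}
merged, conflicts = merge_config_sources(sources)
print(merged)
print(conflicts)
```

Keeping the disagreements visible rather than silently picking a winner is a deliberate choice here: in a troubleshooting session, a mismatch between two tools is often the clue.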

Electronic Conferencing

Control centers like NASA Mission Control in Houston or Strategic Air Command in Omaha once served as the model for IT support centers. With the experts geographically separated, the room with banks of large monitors is no longer feasible, nor needed. In lieu of this, however, there are many times when real-time communications and data sharing are needed to troubleshoot complex problems, or even simple problems in complex environments. That’s when electronic conferencing comes in handy. WebEx, Join.me, Skype, Google Hangouts and other tools provide good solutions for enabling the experts to combine their knowledge and insights to solve complex problems more efficiently.

Conclusion

In a collaborative troubleshooting strategy, in which all parties are collectively responsible for uptime, and all parties are empowered with the information and communications needed to effectively execute the strategy, cooperation is the key to success. A predictable outcome of the combination of strategy and enabling tools is more rapid recovery, even in complex scenarios. The strategy requires mature processes from vendor contracting to tool implementation and integration. If proper configuration management and monitoring tools are not in the budget, good luck. But even if you have good configuration management and monitoring tools, there are two important elements of the strategy that require attention and diligence:
  1. Keeping the configuration database up to date, including external vendors’ integration points
  2. Ensuring monitoring coverage for the entire critical portfolio
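Both checks in the list above lend themselves to a simple periodic audit. The sketch below assumes the configuration database can be exported as a dictionary with a `critical` flag and a `last_verified` date per asset, and that the monitoring tool can list the asset ids it covers. Those field names, the `find_gaps` function and the 90-day freshness threshold are assumptions made for illustration.

```python
from datetime import date, timedelta


def find_gaps(cmdb, monitored_ids, today, max_age_days=90):
    """Return (unmonitored, stale) asset ids among the critical assets.

    `cmdb` maps asset_id -> {"critical": bool, "last_verified": date}.
    An asset is stale if its record was last verified more than
    `max_age_days` before `today`.
    """
    critical = {asset_id for asset_id, rec in cmdb.items() if rec["critical"]}
    unmonitored = sorted(critical - set(monitored_ids))
    cutoff = today - timedelta(days=max_age_days)
    stale = sorted(a for a in critical if cmdb[a]["last_verified"] < cutoff)
    return unmonitored, stale


# Illustrative data only, including a vendor-hosted integration point.
cmdb = {
    "dns01": {"critical": True, "last_verified": date(2024, 1, 10)},
    "claims-saas": {"critical": True, "last_verified": date(2024, 6, 1)},
    "test-vm": {"critical": False, "last_verified": date(2023, 1, 1)},
}
unmonitored, stale = find_gaps(cmdb, {"dns01"}, today=date(2024, 6, 30))
print("critical but not monitored:", unmonitored)
print("critical with stale records:", stale)
```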
As companies and their IT organizations mature, the ability to combine point solutions into a coherent strategy for maintaining operational continuity improves. As the tool vendors improve the quality and breadth of their offerings, this maturation process can be accelerated. There are several places in this process flow where expert system style models and rules can improve the quality of automated tools, and an ontology representing objects in the ecosystem can add semantic context to the overall solution. As more and more capabilities migrate off premise into vendor hosted and Whatever as a Service (WaaS) offerings, the need for establishing solid co-responsibility models, contracts, tools and smarter procedures increases, and will continue to grow as we move from the Information Age into the Age of Knowledge.
