Software Quality for Cloud Apps: An Afterthought

Cloud software provides the opportunity to make various services more elastic.  For example, network services can be deployed as VNFs (Virtual Network Functions) that are orchestrated using open-source tools such as OpenStack.  Instances of VNFs can be dynamically instantiated and decommissioned depending on network events, customer requirements, capacity management objectives, and other criteria.  While running, the VNFs rely on other cloud components, such as the software data plane (DPDK, OVS, SR-IOV, and so on).  A system's hypervisors may play an integral role in passing packets, and applications may run in containers.  There are a variety of "cloud" configurations, and your organization may use one or more cloud architectures to accomplish its network services mission. In other words, Network Function Virtualization (NFV) makes it possible to build highly elastic networks that can provide elastic services; the network itself becomes a "living entity," if you will, that grows, shrinks, and changes shape as old services are retired and new services are envisioned.  The net result is a potentially significant reduction in OpEx and a growing revenue stream, as services can adapt very quickly to market demands.
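As a concrete, and necessarily simplified, illustration of that elasticity, here is a minimal sketch using the openstacksdk Python library to bring up and later decommission a VNF instance.  The cloud name, image, flavor, and network UUIDs are placeholders, not values from any real deployment:

```python
import openstack

# Connect using a named cloud from clouds.yaml ("mycloud" is a placeholder).
conn = openstack.connect(cloud="mycloud")

# Orchestrate a VNF instance in response to some network event or capacity trigger.
# image_id / flavor_id / network UUID are placeholders for real environment values.
server = conn.compute.create_server(
    name="vnf-firewall-01",
    image_id="IMAGE_UUID",
    flavor_id="FLAVOR_UUID",
    networks=[{"uuid": "NETWORK_UUID"}],
)
server = conn.compute.wait_for_server(server)
print(f"VNF instance {server.name} is {server.status}")

# Later, decommission it when the demand that triggered it goes away.
conn.compute.delete_server(server)
```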


So, at a 30,000-foot level, the above is an umbrella statement of the benefits of "cloud" for organizations, such as ISPs, that offer network services.
Unfortunately, the benefits of "cloud" come at a very significant price.  Cloud-based network services offered by ISPs and telecom/datacom service providers cannot sacrifice availability, reliability, resiliency, and performance (ARRP) for elasticity gains.  Although it is true that elasticity can yield significant OpEx reductions through automation of deployment, configuration, customer provisioning, and closed-loop (policy-based) orchestration, there is one major operations activity that stands to be severely impacted.


“SERVICE ASSURANCE”
Traditionally, enterprise customers of telecoms expect 99.999% availability (referred to as "five nines"), link and node outage resiliency of <= 75 ms, and very low latency, jitter, and packet loss, depending on the application.  For example, enterprises that use Telepresence (real-time high-definition video and audio) become very upset when even a single pixel is missing from a video conference.  To say that they become upset is the understatement of the century.  Telepresence meetings are typically between high-level executives, and when anything detracts from the meeting, anything at all, the executives become quite angry and immediately escalate to the top executive at the telecom, usually the chairperson.  The escalation then lands directly on the mid-level manager (i.e., the "Director") of the telecom group responsible for Service Assurance of network services.
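To put "five nines" in concrete terms, a quick back-of-the-envelope calculation (a small Python sketch; the list of availability targets is just for context) shows how little downtime each availability class allows per year:

```python
# Allowed downtime per year for common availability targets.
# 99.999% ("five nines") works out to roughly 5.26 minutes per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for availability in (99.9, 99.99, 99.999):
    downtime_min = (1 - availability / 100) * MINUTES_PER_YEAR
    print(f"{availability}% availability -> {downtime_min:.2f} minutes of downtime/year")
```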


Network problems of this nature typically require dispatching diagnostic equipment to troubleshoot the links and nodes along the service path.  In a legacy switch-and-router (i.e., node) context, the troubleshooting paradigm resembles a binary search: you start between the nearest and farthest nodes, then progressively halve the path to isolate the most likely node pair, running diagnostics on the link between those nodes and on the nodes themselves.  In most cases the problem is found rather quickly, within the SLA (Service Level Agreement) parameters, including the time-to-repair SLAs.  Most issues are resolved by Tiers 1 and 2, very few reach Tier 3, and it is rare indeed for a problem of this nature to go to development (Tier 4).  Generally speaking, the further up the tiers you go, the more expertise you bring to the problem and thus the more expensive it becomes.
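As a rough illustration, here is a minimal Python sketch of that bisection approach; the path_is_healthy probe is hypothetical and stands in for whatever loopback or diagnostic test is actually run between the head end and a given hop:

```python
def isolate_fault(hops, path_is_healthy):
    """Binary-search an ordered list of hops for the first failing segment.

    hops: node names ordered from nearest to farthest along the service path.
    path_is_healthy(i): hypothetical probe; True if the path from the head end
    through hops[i] tests clean (e.g., a loopback diagnostic).
    Returns the index of the first hop whose segment fails.
    """
    lo, hi = 0, len(hops) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if path_is_healthy(mid):
            lo = mid + 1   # fault is farther out along the path
        else:
            hi = mid       # fault is at or before this hop
    return lo

# Example: a 6-hop path where everything past "router-3" tests dirty.
hops = ["lan-a", "router-1", "router-2", "router-3", "router-4", "lan-b"]
print(hops[isolate_fault(hops, lambda i: i < 3)])  # -> "router-3"
```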


Data Path in traditional/legacy networks:
App Server (e.g., Telepresence) -> LAN -> Router -> WAN -> Router -> LAN -> Users


The troubleshooting paradigm in "cloud" is different.  It is, almost by definition, more of a software debugging exercise, because there are more layers of software needed to offer the same services.  In our Telepresence example, the first order of business is likely identifying the VNFs involved.  The VNFs are likely running on servers (i.e., "compute nodes") in different locations, and each VNF is deployed in a cloud context, for example, a "network cloud" environment that may consist of three compute nodes and three controller nodes.  There are also likely two or more tiers of a Clos switch fabric connecting all of the servers, and there are layers of software on each server providing different functions.  For example, the VNFs may have been orchestrated using OpenStack, there are likely hypervisors in the data path between the VNFs, and there are likely software-enabled data planes for each VNF, such as DPDK-accelerated OVS, with perhaps virtual routers on top of those (i.e., Vr over Vs).  Much of this software is likely open source, which means there could have been literally thousands of developers behind a given version of a particular component.


The software in a cloud context is therefore much "deeper" both vertically AND horizontally.  That is, the software stack itself has several more layers than traditional router and switch software, and the end-to-end software data path is several segments longer.


So the "binary search" method of troubleshooting no longer holds, at least not on its own.  If one assumes the problem exists between a near-end and a far-end VNF, the binary search method can be used only to a limited degree.  Even if one could use it to isolate the issue to two VNFs and the link between them, that "link" now consists of several software layers: hypervisors, operating systems, and the VNF application code.  The VNF application itself may not be developed and tested like legacy software in a legacy switch or router.  The operating system within which the VNF runs is not a real-time OS, but something like Ubuntu 16.04.  The hypervisors are also now part of the data path.
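To make that concrete: once the fault is isolated to a single "link" between two VNFs, each software layer along that link still has to be checked individually.  The sketch below is only illustrative; the layer-to-command mapping is an assumption about a typical KVM/OVS compute node, not a prescribed runbook, and "eth0" is a placeholder interface name:

```python
import subprocess

# Illustrative per-layer health checks on one compute node along the VNF-to-VNF "link".
LAYER_CHECKS = {
    "guest OS / VNF instance": ["virsh", "list", "--all"],     # is the VM running at all?
    "virtual switch (OVS)":    ["ovs-vsctl", "show"],          # bridges, ports, tunnels
    "host networking":         ["ip", "-s", "link", "show"],   # per-interface error counters
    "NIC / fabric edge":       ["ethtool", "-S", "eth0"],      # NIC hardware statistics
}

for layer, cmd in LAYER_CHECKS.items():
    print(f"--- {layer}: {' '.join(cmd)}")
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
        print(result.stdout or result.stderr)
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        print(f"    check skipped: {exc}")
```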


One Possible Data Path in a Cloud Context: 
App Server (e.g., Telepresence) -> VNF app -> OS -> Hypervisors -> Vr -> Vs -> NIC -> Switch Fabric -> Router -> WAN -> Router -> Switch Fabric -> NIC -> Vs -> Vr -> Hypervisors -> OS -> VNF App -> Users


So what happens when pixelization occurs in a Telepresence HD video conference on a network built on cloud technology?  Well, the executives become extremely upset, as one would expect.  Tier 1 support does some sectionalization, but unless the problem is obvious, such as a WAN link issue, it is difficult to determine which direction to go next.  The WAN router is now only a very small part of the data path, with a very long "software" data path from the WAN router to the VNFs on both sides.  Hence, expertise is required in traditional Layer 1 through Layer 3 networking as well as in cloud software technologies, especially "DEVELOPMENT" expertise in the cloud, networking, and operating system domains.


By DEFINITION, issues with software turn into software debugging exercises (Tiers 3 and 4), which turn into fix development and testing (Tier 4).  This is a LONG process that requires development time, unit test, integration test, field test, and finally field turn-up.  It's true that legacy network issues sometimes require software fixes, but in classic network environments the software is professionally developed, tested, and deployed in the telecom provider's network, the very model that telecoms are moving away from.  Moreover, telecom operations were able to handle most issues at the Tier 1 and 2 levels, which meant fast MTTR and no SLA payouts.  In the cloud context, because of the greatly extended software path, Tiers 3 and 4 are necessarily engaged in most issues, and outage times can be extensive.  There will certainly be many more SLA payouts.
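To see why the extended software path matters economically, here is a toy model (all numbers are purely hypothetical, chosen only to illustrate the shape of the argument): expected repair time is just each tier's escalation probability weighted by a typical repair time at that tier.

```python
# Purely hypothetical per-tier repair times (hours) and escalation profiles.
REPAIR_HOURS = {1: 2, 2: 8, 3: 72, 4: 400}   # Tier 4 = fix development and test

legacy_profile = {1: 0.60, 2: 0.30, 3: 0.08, 4: 0.02}   # most issues stop at Tier 1/2
cloud_profile  = {1: 0.30, 2: 0.30, 3: 0.25, 4: 0.15}   # far more issues reach Tier 3/4

def expected_mttr(profile):
    """Expected repair time, weighted by where the issue finally gets resolved."""
    return sum(p * REPAIR_HOURS[tier] for tier, p in profile.items())

print(f"legacy expected MTTR: {expected_mttr(legacy_profile):.1f} h")
print(f"cloud  expected MTTR: {expected_mttr(cloud_profile):.1f} h")
```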


So, the bottom line: there is a huge price to pay for the elasticity brought about by cloud-based services.  Critical and Major field defects that impact enterprises can easily consume several weeks while developers, most of whom have moved on, scramble to fix the defect(s).  The developers who actually wrote the open-source code are in fact no longer around, and more likely than not they do not work for the telecom providing the services.  On legacy switches and routers, the developer may still be around and/or the code is professionally maintained by a maintenance group.
One more very important point: in the cloud space, SQA (Software Quality Assurance) has taken a back seat to outsourcing, so that the best minds produce code that is feature-rich.  Best practices that have been used for decades to make software as reliable as possible, such as Software Reliability Engineering (SRE) and Software Failure Mode and Effects Analysis (SW-FMEA), are no longer used.  This means that even mission-critical software, such as real-time RAN controllers (for 4G and 5G), is going out the door with an almost-guaranteed increase in the frequency of Major and Critical defects.
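For readers unfamiliar with SW-FMEA, the core of the practice is simple enough to sketch: each potential failure mode is scored for severity, likelihood of occurrence, and likelihood of escaping detection, and the product (the Risk Priority Number) drives which risks get engineering attention before release.  The failure modes and scores below are purely illustrative:

```python
# Minimal SW-FMEA sketch: Risk Priority Number = severity * occurrence * detection,
# each scored 1-10 (10 = worst).  Failure modes and scores are illustrative only.
failure_modes = [
    ("packet loss under vSwitch overload",           9, 6, 7),
    ("VNF restart loses provisioning state",         8, 4, 5),
    ("hypervisor upgrade breaks SR-IOV passthrough", 7, 3, 8),
]

for name, sev, occ, det in sorted(failure_modes,
                                  key=lambda m: m[1] * m[2] * m[3], reverse=True):
    print(f"RPN {sev * occ * det:4d}  {name}")
```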


Cloud has been around for some time now.  Up until now, services like web hosting, e-commerce, chat rooms, etc., have been successful, largely because the end users are consumers of mostly non-real-time services, where backup systems are easily deployed and where convergence times for restoration of service after a failure are within minutes, which has been acceptable to most consumers.

As applications like NFV are written on top of cloud to provide telecom and datacom services to enterprises, availability, reliability, resiliency, and performance (ARRP) become paramount, just as they are in legacy networks.  Availability is presumed to be "five nines," convergence times <= 75 ms, latency <= 10 ms, jitter <= 2 ms, and packet loss near zero.
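Those targets are concrete enough to encode directly; the sketch below checks a set of measured KPIs against them (the measured values and field names are invented for illustration):

```python
# SLA targets taken from the text; measured values below are invented for illustration.
SLA_TARGETS = {
    "availability_pct": ("min", 99.999),
    "convergence_ms":   ("max", 75),
    "latency_ms":       ("max", 10),
    "jitter_ms":        ("max", 2),
    "packet_loss_pct":  ("max", 0.001),
}

measured = {"availability_pct": 99.9991, "convergence_ms": 68,
            "latency_ms": 12.4, "jitter_ms": 1.1, "packet_loss_pct": 0.0004}

for kpi, (kind, limit) in SLA_TARGETS.items():
    value = measured[kpi]
    ok = value >= limit if kind == "min" else value <= limit
    status = "OK" if ok else f"VIOLATION (limit {limit})"
    print(f"{kpi:18s} {value:>10}  {status}")
```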

We therefore need to step back for a moment and think hard about what we are doing: does enabling a great degree of elasticity, and the resulting significant decrease in OpEx, justify the disasters that will surely occur?  The answer should be obvious to those who build network services that enable real-time and/or mission-critical applications.
