When it absolutely, positively has to work 24/7/365 (Like always! Get it?)
So this morning at 5:30am I'm reading my email & see one exactly twelve hours old from one of our video distribution techs saying something to the effect of, "Any idea why our connection to Atlanta is down?" He had addressed the email to me and to our IT support email address.
So I immediately call the 24/7 video tech desk to see if maybe the report was a false positive and the guy who answers the phone says, "No, Bill's not here - everything seems to be fine..." Several hours, emails and phone calls later I discover that this almost critical 10gig point-to-point or "P2P" circuit was indeed down for six and a half hours and the only person who seemed to notice was Bill - which is good because that was part of his job.
What was bad is that our firm is evolving from just moving very large video files around the country to streaming live video, and what would have been an inconvenience in the past will become "Hey, you're fired" when a critical connection fails to stay connected - "like always!". Because I like my employer and my job, I've created the following "Critical Connections Network Management Checklist" which I now look at daily to remind myself, "It has to f#@king work all the time, Dan - now go find and fix what's likely to break next!"
The Package is Here, or it's Not! The Light is Green, or it's Not!
As the FedEx commercial from the 1980s below aptly illustrates, one does not need to be a transportation engineer to understand & plan around, "the package is either here tomorrow morning or it's not."
The same goes for WAN or wide area network management. I'm an IT project manager, not an IP network engineer. All I need to know is: 1) Is the critical connection light green or red? 2) What bad things happen when it turns red? 3) How often does it turn red & when was the last time? 4) What's the short-term & long-term cost of providing one or more failover backup critical paths to ensure the green light always stays green or immediately goes back to green?
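Question 3 is the easiest of the four to answer with a few lines of code once outages are actually written down. The following is only a minimal sketch in Python, assuming a hand-kept list of outage start and end times. The `Outage` record, the `summarize` helper and the sample dates are all hypothetical; only the six and a half hour duration comes from the outage described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    start: datetime  # when the light turned red
    end: datetime    # when it went back to green

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def summarize(outages: list[Outage], window_start: datetime, window_end: datetime) -> dict:
    """Answer 'how often does it turn red & when was the last time?'"""
    total_down = sum((o.duration for o in outages), timedelta())
    window = window_end - window_start
    return {
        "count": len(outages),
        "last_outage_start": max((o.start for o in outages), default=None),
        "total_downtime_hours": total_down.total_seconds() / 3600,
        "availability_pct": 100 * (1 - total_down / window),
    }


# Hypothetical log: the dates are made up, the 6.5 hour duration is not.
log = [Outage(datetime(2024, 1, 10, 17, 30), datetime(2024, 1, 11, 0, 0))]
report = summarize(log, datetime(2024, 1, 1), datetime(2024, 1, 31))
print(report)
```

Even a spreadsheet would do the same job; the point is that the answer only exists if someone records each red light when it happens.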
So Easy, Even a Vice-President Can Do It! (Work through the checklist, and decide, that is...)
Maintaining critical connections is a two-step process. Step one, which we identified with the video above, is simply comprehending "green is good, red is bad & we'll only pay so much money to keep it green" as discovered with the four questions.
Step two is deciding what to do about spending any money at all to keep critical connections green and never red.
Yes, once the four questions identified above are answered, someone has to press the "let's do this" button to expend the resources of employee time and operating or capital budget. The secret to accomplishing this is to keep the decision process so simple that, like the video above suggests, "Even a vice-president can do it!" That's how we've designed our "Critical Connections Network Management Checklist".
Critical Connections Network Management Checklist
Step One: Who owns what parts of the critical connection?
As the first video above makes clear in just 30 seconds, the path the box of bulbs took from where it started to Ted's Bulbs business involved trusting "Ding Bats Air Express" for the middle part of the critical connection path, and now Ted is looking for a new job. When our connection from LA to Atlanta was down for over six hours, my first job was to examine each of the separate, individual path parts to discover what part failed and why.
Over the past six months the circuit had suffered several outages that we ultimately determined were caused by a bad connector in one of the network appliances at our video studio. While we wanted the problem to be our carrier's fault, ultimately we discovered the fault was ours because we were finally able to replicate the carrier's testing of the video circuit between the two ends of the circuit. We then found and replaced the failing connector. We assumed this outage was the same - until the carrier later reported that this instance was indeed their fault, as "AT&T found a bad circuit pack at their central office & replaced it".
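Examining each part of the path in order is easier when the parts and their owners are written down somewhere a script can read. Here is a rough sketch of that idea in Python. The `Segment` record, the `first_failed_segment` helper, the segment names and the stand-in tests are all hypothetical; real tests would be whatever quick check applies to each part.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Segment:
    name: str                 # which part of the path this is
    owner: str                # who is responsible and accountable for it
    test: Callable[[], bool]  # quick check that returns True when the part is healthy


def first_failed_segment(path: list[Segment]) -> Optional[Segment]:
    """Test each part in path order and return the first one that fails, if any."""
    for segment in path:
        if not segment.test():
            return segment
    return None


# Hypothetical inventory; each lambda stands in for a real check.
path = [
    Segment("LA studio network appliance", "our IT department", lambda: True),
    Segment("LA last mile", "carrier", lambda: True),
    Segment("Carrier central office", "carrier", lambda: False),
    Segment("Atlanta last mile", "carrier", lambda: True),
    Segment("Atlanta studio network appliance", "our IT department", lambda: True),
]

failed = first_failed_segment(path)
if failed is not None:
    print(f"Call {failed.owner}: '{failed.name}' failed its test")
```

The design choice worth noting is that the owner travels with the part, so the first failing test also tells you who to call.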
Step Two: Who's monitoring & reporting on the individual, separate parts that make up the critical connection path?
"AT&T's a pretty big company", I told the AT&T reseller we contracted with for this point-to-point circuit, "why didn't AT&T tell you and then you tell us the circuit was down - why did we have to tell you?" That's when our carrier trouble ticket tech reminded me that our special point-to-point ("P2P") circuit was an unmanaged "layer 2" connection. It operates like a single 1,946 mile long computer cable that connects our LA studio to our Atlanta studio and requires no network addressing. This is different from a managed "layer 3" connection, which relies on IP addressing to route traffic between the two ends, the way traffic over the public internet is routed.
"You can pay us a bunch of extra money to put layer 3 managed routers on either end of your layer 2 circuit and then we can tell you when it stops working," he politely added. But then I remembered the reason we got the low latency layer 2 circuit: we wanted the very fastest connection between our LA & Atlanta studios without adding extra monitoring equipment that might slow the connection down or introduce an extra point of failure in the connection path.
Step Three: Who's responsible for monitoring the monitors of the individual critical connection path "plumbing parts"?
As our IT department project manager, my job is to be familiar with and regularly monitor each of the multiple individual parts that make up a single critical connection path, i.e. notice when the light turns from green to red. To effectively do this job one really only needs to understand the basics of the first three layers of the seven layer OSI model, which is learnable in under five minutes by watching the following video.
Layer One - Stuff you can touch with your fingers. When the trouble tech says, "Unplug the computer cable and then plug it back in" or they say, "Is there a green light next to the hole on the appliance you plug the computer cable into?" they are conducting a layer one test - testing the physical connections of the boxes and connectors that the connection circuits plug into to see if anything the circuit is "handshaking" with is physically broken.
Layer Two - Naked, empty pipe or dumb connections. "Try a different computer cable" is usually the next check a trouble ticket tech will ask for. This layer two test is designed to determine if there's a breakdown inside the cable that physically connects your laptop to your internet modem at home. Think of a layer two connection like a garden hose - there's nothing smart about a garden hose. Whatever you put in one end comes out the other end with no "addressing" required. With my issue where our LA to Atlanta P2P circuit went down, it was just like having a bad cable between my laptop and modem, if my laptop and modem were 1,946 miles apart from one another. And just as you don't have two extra devices between your laptop and modem at home where the only job of the two extra devices is to tell you when the computer cable between your laptop & modem is bad, we didn't have any gear to tell us the connection was down.
Layer Three - Segmented & route-by-address connections. Every year my wife and I invite our neighbors to our Halloween cul-de-sac party by taping an invitation flyer to their door. This is analogous to a layer two invitation connection - I walked out of my front door and up to their front door with no stops in between. Now if I decided to mail them the invitation by sticking it into an envelope, attaching a stamp and dropping it into a mailbox, that would be a layer three invitation. The nice thing about a layer three connection is all I have to worry about is whether or not I have the address and the postage stamp correct. Assuming I do, I need not worry about who pulls the letter out of the mailbox or what happens to it until my neighbor finds it in their mailbox.
Documenting who's responsible and accountable for each individual and unique part of any critical connection path is itself critical because when the connection breaks, one must first review the same documentation to determine which part has failed by quickly testing each of the individual connection path parts.
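For a dumb layer two pipe like ours, one low-cost way to notice the light turning red is to have a machine at one end regularly try to reach a machine at the other end across the circuit. This is not something we had in place, and the sketch below is only an illustration in Python. It assumes there is some host at the far end that accepts TCP connections on a known port; `check_far_end` and `CircuitWatcher` are hypothetical names, and the demonstration uses a fake probe so it runs without a real circuit.

```python
import socket
from typing import Callable


def check_far_end(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the far-end host succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class CircuitWatcher:
    """Reports 'red' after a number of consecutive failed probes."""

    def __init__(self, probe: Callable[[], bool], failures_to_alarm: int = 3):
        self.probe = probe
        self.failures_to_alarm = failures_to_alarm
        self.consecutive_failures = 0

    def poll(self) -> str:
        if self.probe():
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return "red" if self.consecutive_failures >= self.failures_to_alarm else "green"


# Demonstration with a fake probe: one success, three failures, one success.
results = iter([True, False, False, False, True])
demo = CircuitWatcher(lambda: next(results), failures_to_alarm=3)
states = [demo.poll() for _ in range(5)]
print(states)
```

In real use the probe would be something like `lambda: check_far_end(far_end_host, far_end_port)` with your own far-end address and port filled in, called once a minute by a scheduler. Requiring several failures in a row is a deliberate choice so one dropped probe doesn't wake anyone up.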
Step Four: When was it last working & when was it last failover tested?
The most frustrating part of being an IT project manager is that when a new service or critical connection is first installed, especially if it's a complicated, multi-part connection, all the techs and engineers involved in getting the service to work in the first place immediately head off to their next engineering task, knowing that "the project manager" will take care of any loose ends - like documenting all the IP addresses of all the individually addressable connection points. Once something's working, no one ever wants to do something that might make it stop working, like a failover test.
If you're a project manager who finds yourself in this position, where no one is ever up for the regular failover testing needed to document when the connection was last working and when it was last failover tested, there's a secret weapon you can employ that will eliminate most future failover problems: plan from the start to implement "carrier neutral IP addressing" and always have truly redundant and diverse "last mile paths".
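Whatever the testing appetite is, the two dates in this step are worth keeping in a form that can nag you. A small sketch in Python follows, with the caveat that the `ConnectionRecord` fields, the connection names, the dates and the 90 day interval are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class ConnectionRecord:
    name: str
    last_confirmed_working: date
    last_failover_test: Optional[date]  # None means it has never been tested


def overdue_for_failover_test(records, today: date, max_age: timedelta):
    """Return the names of connections never tested or not tested within max_age."""
    return [
        r.name
        for r in records
        if r.last_failover_test is None or today - r.last_failover_test > max_age
    ]


# Hypothetical records and a hypothetical 90 day testing interval.
records = [
    ConnectionRecord("LA-Atlanta P2P", date(2024, 3, 1), None),
    ConnectionRecord("LA-Atlanta internet path", date(2024, 3, 1), date(2024, 2, 15)),
]
overdue = overdue_for_failover_test(records, today=date(2024, 3, 1), max_age=timedelta(days=90))
print(overdue)
```

A connection that has never been failover tested is treated as overdue on purpose, since "it has worked so far" is not a test result.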
Step Five: Eliminate the "One, Two Punch" that causes most connection outages
There's an obvious reason your home has both a front door and a back door and a similar reason most families have at least two cars. The lack of "two doors and two cars" is the source of most critical connection path outages. If it's super important for my company to ensure the connection is always up between our LA and Atlanta studio then it's super important to have two separate paths that connect the two sites - which we do, one P2P path and one public internet path. Now even though we have two separate paths spanning the 1,946 miles between our two sites, the distance between these two diverse paths comes down to nothing over the "last mile" that connects our two separate carrier paths to our studio on either end of the connection.
While a carrier outage is the usual "number one" punch that takes out a connection, the "number two" punch is usually some physical disruption of the last mile path like a backhoe digging up the street half a block from your business. While many businesses have redundant carriers to failover in case of a single carrier outage, the most resilient businesses have diverse "last miles" that avoid the backhoe outages by having one or both of their carrier paths primarily or secondarily available to cover the last mile in the air as opposed to the ground.
This airborne connection is called "fixed wireless" and the only difference between a fixed wireless connection and an in-ground connection is the distance between your business and the closest connection carrier POP or point of presence. This wireless connection is analogous to having a helicopter on your roof that's able to bypass street traffic problems and take you directly to the airport.
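The "one, two punch" can be put into rough numbers. The sketch below is a back-of-the-envelope calculation in Python, not a measurement: the availability figures are invented purely to make the arithmetic visible, and it assumes the parts fail independently of one another, which real circuits sharing a trench or a building entrance often do not. The `series` and `parallel` helpers are hypothetical names for the two standard rules: parts in a chain multiply their availabilities, and redundant parts are only down when all of them are down.

```python
def series(*availabilities: float) -> float:
    """All parts must be up: multiply the availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """At least one part must be up: 1 minus the chance that all are down."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


# Hypothetical availabilities, chosen only for illustration.
carrier_path = 0.999  # each long-haul carrier path
last_mile = 0.999     # each last-mile segment

# Two carrier paths, but both share one last mile at each end.
shared = series(last_mile, parallel(carrier_path, carrier_path), last_mile)

# Two carrier paths, each with its own last mile at both ends.
diverse = parallel(
    series(last_mile, carrier_path, last_mile),
    series(last_mile, carrier_path, last_mile),
)

HOURS_PER_YEAR = 24 * 365
print(f"shared last mile:  {(1 - shared) * HOURS_PER_YEAR:.2f} hours down per year")
print(f"diverse last mile: {(1 - diverse) * HOURS_PER_YEAR:.2f} hours down per year")
```

With a shared last mile, the second carrier barely helps, because the single last mile at each end sits in series with everything else and dominates the downtime.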
Once you have multiple paths for both your last mile and your main connection, the other requirement for maximum resiliency is to use carrier neutral IP addressing. This is important because businesses that have implemented failover connection paths have really only addressed one half of the connection path outage problem - getting traffic out from the business over a carrier connection. Most businesses only concern themselves with getting out to the internet. They figure if their employees can connect out to the internet they are able to work - which is correct. If, however, people and entities need an inbound connection to your business from the internet, and your business's "IP address" is part of an individual carrier's circuit, then when that circuit goes down, entities trying to reach your business reach only a dead end.
Carrier neutral IP addressing detaches your business's IP address from any single carrier circuit and assigns it instead to a physical edge appliance at your business, which your multiple carrier connections then connect to. With carrier neutral IP addressing, when your primary connection circuit fails, the vendors who provide the carrier neutral IP addresses know to switch your connection to the alternate connection path. This is the basic value add of what you may have heard about from "SD-WAN" or software defined wide area network vendors.
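The switching decision itself can be pictured very simply. The sketch below is a toy illustration in Python of "use the highest priority path that is currently healthy"; it is not how any particular SD-WAN vendor implements failover, and the function name and path names are hypothetical.

```python
from typing import Optional


def select_active_path(paths_in_priority_order: list[str], health: dict[str, bool]) -> Optional[str]:
    """Return the first healthy path in priority order, or None if every path is down."""
    for path in paths_in_priority_order:
        if health.get(path, False):  # a path with no health report is treated as down
            return path
    return None


# Hypothetical paths behind one carrier neutral address.
paths = ["carrier A circuit", "carrier B circuit"]
print(select_active_path(paths, {"carrier A circuit": True, "carrier B circuit": True}))
print(select_active_path(paths, {"carrier A circuit": False, "carrier B circuit": True}))
```

Because the address belongs to the business rather than to either circuit, inbound traffic can follow whichever path this kind of decision picks.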
Step Six: Measure & Monitor your critical connection paths
Solarwinds SNMP. Back before the internet became a world wide web that connected every business to everything on the internet, including their other locations, physically separate business locations were connected to one another via layer 2-like private network data connections. The virtual private network or VPN connections that the internet ushered in were actually the second data connection businesses added to their networks after the usual first private pipe.
Many businesses abandoned their old dumb & expensive pipes and elected to put all their data through their internet enabled VPN data pipes. As the cost of internet bandwidth dropped, businesses simply increased the size of their internet pipe or got multiple pipes. Unfortunately the latency increase and traffic congestion on the VPN pipes caused their own set of problems, which the first generation and now second generation of SD-WAN have somewhat addressed, mostly by being applied blindly like a thick coat of paint over an old fence that no one took the time to sand or prime.
A thick coat of SD-WAN is fine in the short run to address immediate problems. But once a business is past the critical connection emergency that it threw SD-WAN at, it must at some point specifically measure and monitor all the critical connection paths on its network to find where measurable challenges actually exist. It can then apply SD-WAN only where it's needed and only for the bandwidth that warrants the SD-WAN premium, because SD-WAN is generally priced by the bandwidth of the pipe. Covering all the traffic on a 10gig pipe is much more expensive than covering the traffic of a 1gig pipe.
By putting enterprise network monitoring tools like Solarwinds, Auvik or others in the hands of a professional skilled at using SNMP to accurately measure and document which applications are transiting which critical paths and where they encounter contention, you will quickly create a specific list of network management requirements. From that list an experienced network architect can design a customized but flexible critical connection network management plan that's perfectly sized for your business.
(The short video above presents all a non-engineer needs to understand about SNMP: it's easy to employ to monitor most any part of your company network)
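For a sense of what "measuring" means here: SNMP devices expose running byte counters per interface (the standard 64-bit one for inbound traffic is `ifHCInOctets`), and a monitoring tool turns two readings of that counter into a utilization figure. The sketch below shows only that arithmetic in Python. It does not talk to any device, the readings are made up, and `utilization_pct` is a hypothetical helper rather than a function from Solarwinds, Auvik or any SNMP library.

```python
COUNTER64_MAX = 2**64  # a 64-bit SNMP counter wraps back to zero after this many octets


def utilization_pct(octets_before: int, octets_after: int,
                    interval_seconds: float, link_bps: float) -> float:
    """Percent utilization of a link between two readings of an octet counter."""
    # The modulo keeps the difference correct if the counter wrapped once between readings.
    delta_octets = (octets_after - octets_before) % COUNTER64_MAX
    bits = delta_octets * 8
    return 100.0 * bits / (interval_seconds * link_bps)


# Hypothetical readings taken 60 seconds apart on a 10 Gbps link.
pct = utilization_pct(1_000_000_000, 38_500_000_000, 60, 10e9)
print(f"{pct:.1f}% utilized")
```

Collected over days rather than a minute, numbers like this are what show which pipes are actually congested and which ones only feel that way.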
Step Seven: Plan a migration path to utilizing an all-in-one network services vendor through a TSD
AireSpring, Spectrotel, NHC, Momentum, etc.
Step Eight: Ensure you get the best terms and price by employing a dual vendor, dual TSD policy
Step Nine: Ensure you get the best "after signature" professional services by selecting your all-in-one network services vendor and TSD through a TMBG affiliated trusted technology advisor
Step Ten: Download & use TMBG's proprietary "Critical Connections Network Management Checklist" to plan your network migration to perfectly balance resiliency, risk management and cost effectiveness
More: Application testing, user engagement & empowerment.