Why 2008 Was a Milestone Year for IPv6
Derek Morr from Penn State with an end-of-year 2008 IPv6 State of the Net. Great reading as always from CircleID.
An introduction and tutorial (part 1): NetFlow, sFlow, and other technologies
This is Part 1 of a 4-part series on NetFlow.
There are a variety of tools available to the network engineer today either to troubleshoot their networks or have awareness of the traffic passing through them. These solutions include passive taps that copy all traffic, active taps that allow selection of traffic, promiscuous sniffers like Ethereal (now WireShark) and their commercial variants, and other tools that tie somewhere into the ecosystem of monitoring toolsets.
One area that has become increasingly important to engineers is flow-based monitoring. Flow-based monitors don’t necessarily seek to capture every packet; instead, they summarize traffic into flows, which gives a clearer view of what is actually passing over the network. Usually the data comes from a random sample of packets, but it is still representative of the true network traffic. Flow data is organized by a variety of keys - source and destination IP (srcIP/dstIP), source and destination port, etc. For traditional flows, that is to say not peer to peer, flow-based monitoring works remarkably well and can give you new forms of insight into your network and its behavior.
Examples of these technologies include Cisco’s NetFlow, sFlow, the emerging IPFIX standard (which is based on NetFlow v9), jFlow/cflowd, etc. We’ll focus on NetFlow as our primary example, but we’ll do a more involved comparison and examination of the other technologies in the next post in this series.
Flow-based systems have evolved to give us the following capabilities:
- Autonomous system (AS-based) traffic matrixes
- Intrusion and network anomaly detection
- Network behavior changes
- Billing systems
All of these are interesting in and of their own right.
Earlier in this post, we suggested these tools provide a “key” to the traffic. Flow export works by creating a multidimensional “key” for a flow, with the value being some data about the flow. You might be thinking hey, with source IP, dest IP, and ports used, I already get some useful data. But we can also find out great things like # of bytes and packets, QoS data, and even TCP flags! If the router has a view of the BGP routing table, it can also show us data such as which end autonomous systems our traffic is going to.
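To make the key/value idea concrete, here is a minimal Python sketch of a flow cache. It is not how any particular router implements flow export; the `FlowKey` fields, the `aggregate_flows` helper, and the packet dictionaries are all assumptions made for illustration, and real exporters typically key on more fields than these five.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Hypothetical multidimensional key identifying one flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int


def aggregate_flows(packets):
    """Group packets by flow key; the value holds counters for that flow."""
    flows = defaultdict(lambda: {"packets": 0, "octets": 0, "tcp_flags": 0})
    for pkt in packets:
        key = FlowKey(pkt["src_ip"], pkt["dst_ip"],
                      pkt["src_port"], pkt["dst_port"], pkt["protocol"])
        record = flows[key]
        record["packets"] += 1
        record["octets"] += pkt["length"]
        # OR together every TCP flag seen on the flow (0 for non-TCP packets).
        record["tcp_flags"] |= pkt.get("tcp_flags", 0)
    return dict(flows)


# Made-up packets using documentation address ranges.
packets = [
    {"src_ip": "192.0.2.1", "dst_ip": "198.51.100.7", "src_port": 40000,
     "dst_port": 80, "protocol": 6, "length": 60, "tcp_flags": 0x02},
    {"src_ip": "192.0.2.1", "dst_ip": "198.51.100.7", "src_port": 40000,
     "dst_port": 80, "protocol": 6, "length": 1500, "tcp_flags": 0x10},
    {"src_ip": "192.0.2.9", "dst_ip": "198.51.100.53", "src_port": 53000,
     "dst_port": 53, "protocol": 17, "length": 80},
]

flows = aggregate_flows(packets)
for key, record in flows.items():
    print(key, record)
```

The two TCP packets share a key, so they collapse into one flow record with summed byte and packet counts, while the UDP packet becomes its own flow.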
An open source package called flow-tools is commonly used to process this data, but there are a ton of other toolsets out there. flow-tools can give us a variety of customizable reports, such as the below, an example of an AS report:
> # --- ---- ---- Report Information --- --- ---
> #
> # Fields: Total
> # Symbols: Enabled
> # Sorting: Descending Field 3
> # Name: Source/Destination AS
> #
> #
> # mode: streaming
> # compress: off
> # byte order: little
> # stream version: 3
> # export version: 7
> #
> #
> # src AS dst AS flows octets packets
> #
> TEST4 SHELL 50122943 51315023448 110768537
> TEST3 IANA-RSVD2 733920 8261023010 18519158
> TEST2 IANA-RSVD2 1106575 6476666638 18169333
>
>
As you can see, it’s pretty valuable data (and that’s just the AS report!). We can use this data to work out roughly what percentage of our traffic goes to each AS, which helps when we manually set preferences to shift, say, 20% of our traffic to a given link.
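As a rough illustration of the percentage calculation, the Python sketch below parses the three data rows from the report above and computes each AS pair’s share of octets. The `parse_as_report` and `octet_shares` helpers are hypothetical, not part of flow-tools, and the shares are relative only to the rows shown here rather than to all traffic on the network.

```python
REPORT = """\
# src AS dst AS flows octets packets
#
TEST4 SHELL 50122943 51315023448 110768537
TEST3 IANA-RSVD2 733920 8261023010 18519158
TEST2 IANA-RSVD2 1106575 6476666638 18169333
"""


def parse_as_report(text):
    """Parse whitespace-separated report rows, skipping '#' comment lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src_as, dst_as, flows, octets, packets = line.split()
        rows.append({
            "src_as": src_as,
            "dst_as": dst_as,
            "flows": int(flows),
            "octets": int(octets),
            "packets": int(packets),
        })
    return rows


def octet_shares(rows):
    """Return {(src_as, dst_as): fraction of total octets} for the given rows."""
    total = sum(row["octets"] for row in rows)
    return {(row["src_as"], row["dst_as"]): row["octets"] / total for row in rows}


rows = parse_as_report(REPORT)
shares = octet_shares(rows)
for (src_as, dst_as), share in shares.items():
    print(f"{src_as} -> {dst_as}: {share:.1%}")
```

A share computed this way is only as good as the sampling and the time window behind the report, so treat it as a starting point for traffic engineering rather than an exact figure.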
There’s a ton more we can do with flow data, as we’ll show in the next posts in this series.
Japanese FTTH deployments not bearing out higher bandwidth requirements
Nyquist Capital: Still No Japanese Exaflood in Sight. This is an excellent review of some statistics and studies regarding FTTH deployment in Japan. Essentially, the subscriber counts went up as expected (high penetration), but the bandwidth usage didn’t - there’s no significant trend change.
What’s essentially being said here is that there are no killer applications yet. This isn’t surprising given how the population is housed in Japan (not that many large households, a lot of single/double person accommodations). What is kind of surprising is that with all the investment, nobody’s come up with good applications. I guess there isn’t a lot of HD media streaming to the households in Japan. But I think the bandwidth growth is right around the corner - new technologies like telepresence, graphics cloud computing/other cloud computing, etc. will start driving the BW up.
The study referenced in the article at Nyquist has some good IX data, so it’s worth a read.
The Importance of Awareness (And Tuning Out)
A truly disciplined operational environment doesn’t discard any event that occurs. That is to say that if there is an alert you figure out why, irrespective of whether that particular element or component has alerted 50 times today or not. If you have a good operational model, your team is already taking a look at the element after the first alert, and if spurious alerts are occurring, your network management team is working on the NMS in order to evaluate whether this behavior is correct.
An undisciplined operational environment’s approach is fundamentally different. The on-call engineer gets an alert (hopefully), logs on, and fixes the problem. An hour later s/he gets another alert, finds the problem is still fixed. Three or four other alerts follow over the next hour. Eventually the engineer just turns the pager or cell phone to silent, believing all these alerts are related. Problems ensue. Someone probably gets disciplined or fired the next day.
Once you have data, you need to have a well-disciplined environment. Ideas like CAR (call-and-response) pages, regular check-ins, “flag alerts” that need a reply, etc. are all great ones. In a 24×7 NOC, sometimes the activity can be pretty intense, particularly if the roles are blurred and people are doing provisioning, triage, real engineering work to fix a larger problem, etc. Alerts can take many different forms.
I urge people to think outside of the traditional operational view. A pager or SMS isn’t necessarily enough. People need to give careful thought to network alerts firing in given directions: IM, SMS, XMPP, desktop clients like Growl, web alerts, event routers, etc. They all have their place (though too much is too much).
There are also some really innovative ideas that never saw much light of day, I think for a variety of reasons. schearnet is a good example of this. SuperCollider is a programming language for audio events, kind of like a MIDI tracker although that’s not the right terminology. schearnet makes network traffic into audio. This allows you to have “background noise” in your NOC. I speculate that there are studies out there that suggest human beings, being extraordinary pattern recognizers, might find it easier to detect a minor difference in ambient sound (what’s that clicking? that’s not right) that may not trigger a normal network alert.
There are a host of applications out there like schearnet. Worth looking at.
FUD at 10 Gigabits Per Second
Canned Platypus » Blog Archive » FUD at 10 Gigabits Per Second. Come on, get your head right. Jeff, it’s not about bandwidth; it’s about the magical word “Ethernet” and what it means to people. I think we both agree that serious data center builders and HPC people will pick InfiniBand, but the reason people are looking at 10 Gigabit Ethernet for cluster interconnects probably has nothing to do with bandwidth and everything to do with cost, virtualization, and scale.
More here.
The Bullshit of Outage Language
Well worth the read: 37 Signals has a scathing indictment of outage language in the industry.
Is the Mediterranean the Achilles heel of the web?
From New Scientist: Why the Mediterranean is the Achilles heel of the web.
Radio Silence, or not
Sorry about the quietness of the past few days; we’re working on a few big posts, including an in-depth tutorial about NetFlow. More soon.
The Impact of Operational Data
There is an old management adage that says “You can’t manage what you don’t measure”. You frequently hear about people monitoring things like bandwidth and interface errors, neighbor relationship status, and sometimes even NetFlow/sFlow data, but building these scaffolds usually comes after an operational failure and rarely as a preconceived notion. As busy network engineers we rarely think about how the features or services we implement will be monitored and measured (and provisioning is a whole other ballgame).
Larger service providers have whole teams of people dedicated to network management, although sometimes these teams are woefully ineffective, with little inherent knowledge in networking or understanding what truly characterizes a fault or a condition that warrants further investigation.
This is the first post in a series on network management and operational health. We’ll introduce concepts in this series like KPIs and KSIs, continuous improvement plans, measurement methodologies like MIR, DMAIC, and RAND DMI, and key driver charts.
Key Performance, Success, Quality, and Service Indicators
The Key Performance Indicator is often used as the “beachhead” term when talking to network management people, but it is often misunderstood. KPIs can be simple or complex, but usually they’re not as simple as “how many minutes or seconds was my service down for the year or hour or day or month”. KPIs can be things like “the data service is available for 100% of our service hours, which are 8am to 5pm, M-F, local”, or things like “average TCP throughput over the network”. The taxonomy can be pretty complex; throughput and goodput are two separate things, and there are all kinds of perspectives on each KPI. For example, your management might view the same KPI differently than your user base does.
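Here is a minimal Python sketch of the service-hours availability KPI described above. It assumes naive local-time datetimes, minute-level resolution, and a made-up list of outage intervals; the `in_service_hours` and `availability` functions are illustrative names, not part of any NMS.

```python
from datetime import datetime, timedelta

SERVICE_START_HOUR = 8   # 8am local (assumed to be inclusive)
SERVICE_END_HOUR = 17    # 5pm local (assumed to be exclusive)


def in_service_hours(moment):
    """True if the minute falls inside 8am-5pm, Monday through Friday."""
    return moment.weekday() < 5 and SERVICE_START_HOUR <= moment.hour < SERVICE_END_HOUR


def availability(period_start, period_end, outages):
    """Fraction of service-hour minutes in the period with no outage."""
    service_minutes = 0
    down_minutes = 0
    moment = period_start
    while moment < period_end:
        if in_service_hours(moment):
            service_minutes += 1
            if any(start <= moment < end for start, end in outages):
                down_minutes += 1
        moment += timedelta(minutes=1)
    if service_minutes == 0:
        return None  # no service hours in the period, so the KPI is undefined
    return (service_minutes - down_minutes) / service_minutes


# Hypothetical week (Monday to Monday) with two made-up outages.
week_start = datetime(2009, 1, 5)
week_end = datetime(2009, 1, 12)
outages = [
    (datetime(2009, 1, 6, 16, 30), datetime(2009, 1, 6, 17, 30)),  # straddles 5pm
    (datetime(2009, 1, 10, 2, 0), datetime(2009, 1, 10, 4, 0)),    # Saturday night
]

kpi = availability(week_start, week_end, outages)
print(f"Service-hours availability: {kpi:.3%}")
```

Only the 30 minutes of the Tuesday outage that fall before 5pm count against the KPI; the rest of that outage and the whole Saturday outage fall outside service hours, which is exactly why this number can look different to management than it does to users.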
The other thing is that a “performance” indicator isn’t always the best or most appropriate term for what you’re trying to measure. Sometimes you want to measure simple success (percentage of successful deployments this week) or service awareness (the network was working great, but this service wasn’t usable from time X to time Y). Measuring all these things and managing them really helps you not only understand how well you’re doing, but it can help in areas like keeping customers (think of a KPI as a sales tool - and your management of it as a sales tool), employee performance management (when employee Y is on duty, the statistics drop), etc. In the next posts in this series, we’ll show example methodologies for key indicator management.
Continuous Improvement Plans and Business Performance Management
Network engineers hate this crap. But “working smarter” saves us a lot more time and the more you build and realize the benefits of this process, the more work it will save you. More importantly, process implemented by you trumps process implemented by other people who don’t understand your day to day job.
There are all kinds of iterative processes for continually improving the indicators we talked about above, although very few of them have a network focus. Here are some brief terms that we’ll set for some discussion in the future:
- Continuous Improvement Process: Dr. W. Edwards Deming was essentially the founder of this family of processes; CIP is a metaprocess that helps improve other processes in the business. The core principle is feedback - “reflection” of processes - and then improving efficiency (by reducing or changing suboptimal processes) and evolving existing processes.
- Plan Do Check Act (PDCA) - This is the “Shewhart” or “Deming” cycle, and is also called “PDSA”, for Plan-Do-Study-Act. It basically suggests that you make a stab at a problem, then study the changes your response made. It’s based on the scientific method. Six Sigma has its own name for a similar cycle - DMAIC.
- DMAIC & Six Sigma: Six Sigma is a set of practices focused around manufacturing and reducing the number of defects in a manufacturing process, but there is some valuable wisdom for network engineers in Six Sigma. DMAIC is a Define-Measure-Analyze-Improve-Control cycle. Six Sigma introduces a great number of concepts around quality management.
- RAND DMI: A process by the RAND Corporation and the US Army focused
- Key-Driver Analysis: This has been called a few different things by a few different people, but it defines what the “key drivers” are that help your customers do the things you want them to do (e.g. buy product). In a network engineering context, things like fast page load times are key drivers. You can tie KPIs into this to help create a performance dashboard.
We’ll explore these concepts and how they fit in to network engineering life in the next post in the series…
