
Wednesday, June 3, 2015

ACI Example - Simplified Infrastructure Upgrades

Most of the ACI literature tends to focus on the application automation aspects: the possibility of defining application connectivity using a declarative model that states the needs of the various application components. This brings along the option to keep doing networking the way we did before, and also to move to a new way of doing networking where concepts like VLANs and subnets are no longer required in the same way as before.

But amid all these conversations, some of the basics about ACI get lost, and the basics alone are very important. In the little blogging that I get to do, I have already written about how onboarding new devices into a fabric becomes a very simple task when you use ACI (see here). The same is true for replacing hardware, should you need to RMA a device.

Another advantage that ACI brings over traditional networking is simplified software management. The fabric software, the bits that run in both the APIC controller and the ACI switches, is referred to as firmware in APIC. 

In this blog post I describe how you can completely upgrade an entire fabric including the controllers without service disruption. This is a very important part of maintaining and operating an infrastructure.

Customers looking to deploy any SDN solution should look at how upgrades are done and at the operational complexity involved. I believe they will find that this is another area where the integrated overlay has significant advantages over a server-only overlay.

Let's look at the steps for doing this in ACI.

(1) Add software to the Firmware Repository

There are various ways to do this. The simplest is to configure a download task that pulls the controller .iso and the switch .bin images into the firmware repository, which is accessible in the FIRMWARE section under the ADMIN tab.




Once you set it up, verify the operational status; it should show as downloading:


The above tasks can be done through the GUI as shown in the pictures, through the REST API, or via the CLI on the APIC with admin privileges. When the download reaches 100%, the image is added to the Firmware Repository. You can then click there to confirm:



You have to repeat the above to add the .bin image for upgrading the switches. Once done, we move to step 2.
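Since these tasks can also be driven through the REST API, as mentioned above, here is a minimal Python sketch of what that could look like using the requests library. The aaaLogin call is the standard APIC authentication endpoint; the download-task class and attributes used here (firmwareOSource and its URL fields), as well as the hostnames and credentials, are assumptions for illustration, so verify them against the object model of your APIC release.

import requests

APIC = "https://apic.example.com"     # hypothetical APIC address
USER, PWD = "admin", "password"       # lab credentials, for illustration only

session = requests.Session()
session.verify = False                # lab only: skip certificate validation

# Authenticate against the APIC (standard aaaLogin call).
login = {"aaaUser": {"attributes": {"name": USER, "pwd": PWD}}}
session.post(f"{APIC}/api/aaaLogin.json", json=login).raise_for_status()

# Create a download task that pulls each image from an HTTP server into the
# firmware repository. The class name "firmwareOSource" and its attributes are
# assumptions for illustration; check them against your APIC object model.
for image_url in ("http://images.example.com/aci/apic-image.iso",
                  "http://images.example.com/aci/switch-image.bin"):
    task = {"firmwareOSource": {"attributes": {
        "name": image_url.rsplit("/", 1)[-1],
        "proto": "http",
        "url": image_url,
    }}}
    session.post(f"{APIC}/api/mo/uni/fabric/fwrepop.json", json=task).raise_for_status()

You could then poll the corresponding objects to follow the same download percentage that the GUI shows.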

(2) Upgrade the Controller Firmware. 

Again, go to the ADMIN area and click on the Firmware tab. You will see the options below, and when you right-click on Controller Firmware you can select to upgrade the controller.

A window opens where you select the desired firmware level from the drop-down menu. You can choose to apply it now or at a later time, then submit.






Because we selected "Apply Now", the upgrade process begins. The upgrade status can be seen by clicking on the "Controller Firmware" option:

It is important to note that while the upgrade is in progress, the fabric remains fully operational and traffic flows through without any problem.

Eventually, the APIC controller that is being upgraded will reboot. You will see a reboot message on the console if you are connected to it, or an error in your browser indicating that the session has been closed.

After a few minutes, the controller comes back up and you can log in again. You can then check whether the upgrade was successful:



(3) Upgrade the Fabric nodes.

Now we have to upgrade the fabric nodes. Let's check that the switch .bin image is also in the firmware repository (it needs to be uploaded too, using the same Download Task procedure as for the .iso):

In the options on the right-hand side under ADMIN -> Firmware you see the Fabric Node Firmware menu. We right-click on Fabric Node Firmware, select "Firmware Upgrade Wizard", and see something like the below:

We are going to create a firmware group with all switches, and select the right firmware level that we want for all of them (partial upgrades are possible too):


Then, very important, we are going to create two maintenance groups: one for odd-numbered switches and one for even-numbered switches (remember that when you commission a switch into the fabric you have to assign it a node ID):



Now that both maintenance groups have been created, we can roll out the upgrade first on the odd-numbered switches, then on the even-numbered ones. Because our servers and external routers are all dual-homed, doing it this way ensures no service interruption. We click on the maintenance group for odd switches and click on "upgrade now":
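For readers who prefer the API, a rough sketch of how the odd/even split might be expressed programmatically is shown below. The login call is the standard APIC aaaLogin; the maintenance-group class and attribute names (maintMaintGrp, fabricNodeBlk, from_/to_), along with the node IDs and credentials, are assumptions for illustration and should be checked against the object model of your APIC release.

import requests

APIC = "https://apic.example.com"          # hypothetical APIC address
session = requests.Session()
session.verify = False                     # lab only: skip certificate validation
session.post(f"{APIC}/api/aaaLogin.json",
             json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

def maintenance_group(name, node_ids):
    """Build a maintenance-group payload covering the given node IDs.

    Class and attribute names (maintMaintGrp, fabricNodeBlk, from_/to_) are
    assumptions used for illustration; verify them against your APIC release.
    """
    return {
        "maintMaintGrp": {
            "attributes": {"name": name},
            "children": [
                {"fabricNodeBlk": {"attributes": {
                    "name": f"blk-{n}", "from_": str(n), "to_": str(n)}}}
                for n in node_ids
            ],
        }
    }

# One group per upgrade wave: odd-numbered leaves first, even-numbered ones second.
for name, nodes in (("odd-switches", [101, 103]), ("even-switches", [102, 104])):
    r = session.post(f"{APIC}/api/mo/uni/fabric.json",
                     json=maintenance_group(name, nodes))
    r.raise_for_status()

Triggering each wave is then a matter of updating the group's maintenance policy (the "upgrade now" click in the GUI), which I have left out of the sketch.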


After one final confirmation, the upgrade process begins:


The upgrade process progresses:


The switches on the odd-switch group have been upgraded:



It is important to mention that during the upgrade we had a ping running between two VMs on different servers, as well as a ping running from one of the VMs to the default gateway (the default gateway is, of course, on the ACI leaf switches):






Of course, when the leaf switches reboot, the end hosts will see a link go down, so in order to avoid service interruptions they must be dual-homed (one port to a switch with an even ID, one to a switch with an odd ID - hence our upgrade policy). In our case the hosts are running ESXi, and we see the link down flagged as an alarm:
Our upgrade is now complete:



When the upgrade is complete, clicking on the Fabric Node Firmware will show the new release for all fabric nodes:





And that is it!  

Compared to a traditional network built of individually managed switches, there is no need to set up TFTP servers, download the new code to each switch, or script your way into automating every switch upgrade and reboot.



Thursday, April 23, 2015

Whitebox switches are Black


I am a pragmatic person at heart, and perhaps because of that I have a hard time seeing the value of debating so much about terminology. We saw plenty of that when talking about Software Defined Networking (SDN): what SDN is and is not, and so on.

Now we are seeing a lot of debate about what whitebox networking is and what it is not. And new terms arise as well: brite box, white brand, … I am sure I am missing some.

There is an irony in all of this: white boxes aren't white. I invite the reader to check out the models from Accton, Quanta, Penguin, … they are all black!


A black whitebox switch


Leaving the joke aside, I do understand that semantics matter. But at the same time I think we waste time discussing so much about terminology. I think people end up being confused, and sometimes I wonder if certain companies promote that confusion.

What is the point about white-whatever network devices?


I think (and I believe most readers would agree) it's about disaggregation. Disaggregation is about having choice: the choice of running any network OS on any network hardware, like you do with servers, where you can buy a server from Cisco and run an OS from Red Hat, or Microsoft, etc. The expectation is that with disaggregation will come cost savings. That last part is not a given, of course, and it is where the "white" part comes in: if you can run your network OS of choice on a cheap "whitebox" piece of hardware, you should be able to save money. But the point is really about being able to run any network OS on any hardware (ideally). This all assumes, of course, that all hardware is equal and of minimal value, so you can use the cheapest hardware, because the value is in the software alone. I personally disagree with that reasoning.

Then of course, some people like to (over)simplify things and make it all about the boot loader. If you do ONIE, you are cool. ONIE alone means you are disaggregated, open, whitebox, and cool. If you don't do ONIE, then you are not open, you are not whitebox … whatever. To me this is like being at school and always wanting to play whatever game the cool, popular kid you admired proposed.

It is hard to believe that the networking industry is different from the server industry simply because it lacked an open boot loader, and that this will change now that ONIE is around.

In the end, the objective seems to be being able to run any network OS on any network hardware. That is disaggregation. Mind you, I am not saying that disaggregation is good or bad; I am simply stating what the "new" thing with disaggregation is.

Whitebox or not? … well … imagine that you run Open Network Linux (ONL) on an Arista-branded box. What do we call that? … I am sure there's a name for it; someone will come up with one. But I think it does not really matter (it is also not possible anyway, at least not for now, and if it were possible, I hardly see any value in it).

Now that we agree that the point is about disaggregation (and we still have not debated about whether that is good or not, or to what extent it is happening), let's also talk about what is not the point of the "whitebox" discussion. SDN is not the point.

I don't want to get into a heated semantic debate about what SDN is and is not. However, the goal of SDN was to simplify and/or change the way we do networking. It was about getting rid of all that box-by-box configuration of discrete network units running legacy network control protocols. Wasn't it? … I think we will all agree that if I run ONL on a Quanta box and I build my network using spanning tree, I can't say I am doing SDN. It is a different thing, of course, if I use ONL on a Quanta box with an ODL controller to build L2 segments. So it's clear to me that we are talking about two different things: SDN is not about disaggregation, and disaggregation is not about SDN.

Is disaggregation in the network industry happening?


It is, and it always has been. I remember attending Supercomm in Atlanta back in 2000 and talking to a Taiwanese company that offered "whitelabel" (yet another term!) switches. I could have started my own networking company and produced NilloNet-branded switches. "What OS can I run on my NilloNet switches then?" I asked the kind salesperson at the booth … "Well, we provide you with one that we could even customise with your brand, or you can bring your own." Impressive. A booming industry we had, back in the year 2000.

But anecdotes aside, clearly what is happening today is somewhat different. ODM vendors have been there for a long time. Merchant silicon is not new either. What was missing was a decent network operating system to run on ODM-provided hardware. There were network OSes you could use. I experimented with XorPlus a while back (maybe in 2010?) … but I was … well … less than impressed. These days there are more options for doing traditional networking on ODM-provided hardware. Better options too.

The server industry and the networking industry are very different


If we look at the server industry, any server easily has a dozen different OS options on its list of supported software, if not more. This has been the case for ages now. And yet, this has not translated into ODM-provided servers becoming dominant. Quite the contrary. People primarily buy branded servers from HP, Cisco, Dell … and for good reasons. For quality, logistics and primarily operational reasons, customers see greater value in certain brands. The promise that running low-cost whitebox hardware saves you money has not yet proved true for enterprise customers.

Of course, on the networking side of the house, things are still different: you cannot choose your network OS with your hardware of choice (and for good reasons IMHO).

But .... this will change, some say. This is already changing, others say ... Promoters of the disaggregated model are quick to point out the Dell S6000 (the ONIE model supports only two options, Cumulus Linux and Big Switch Light OS), the Juniper OCX1100 (but it only supports JunOS), or the recently announced HP support for Cumulus Linux. Actually, no HP switch supports Cumulus Linux today (not that I know of, at least. The educated reader will be so kind as to correct me). Instead, HP resells one of the Accton models that are on the Cumulus Hardware Compatibility List (HCL). So if you buy that switch model through HP, you get an HP-supported open platform that can run … Cumulus Linux (no, apparently it cannot run HP's Comware … only Cumulus Linux).

This is what is happening. So much for choice …

I am sure the skeptical reader will be quick to point out that "these are early days", "wait and see in one year". Sure, they are early days.

Will we get to the point where we see a switch from a vendor (any vendor, white or coloured) running any network OS? Just imagine the ordering web page of a hardware company, ... a drop-down menu where we choose the operating system from a list with NX-OS, JunOS, EOS, Cumulus Linux, FTOS, … Well, I am very skeptical that we will ever see that day. And I don't think customers are generally looking for that either.

Why network disaggregation is not like the server disaggregation model


It is always nice to use parallels to draw analogies in order to explain something. But parallels can be deceptive too. We have seen this before with those who compared the server virtualisation industry with "rising" network virtualisation: a great marketing message, a clear failure (look back to 2011, and look at us today ...).

I believe we may be in the same situation whenever we talk about switch disaggregation and compare it with server disaggregation. We are comparing similar processes in two industries that are related, but not quite the same.

I think it is very different for three reasons.

First, volume. The server industry is much, much larger than the networking industry. This is very evident, but to illustrate it anyway: in every standard rack (if you have proper cooling) you can fit up to 40 1RU servers, and two 1RU ToR switches. It is very clear that the number of physical servers in any datacenter outnumbers the number of network switches by more than an order of magnitude. The larger volume creates a different dynamic for vendors when dealing with margins, R&D, the cost of integrating and validating software, etc.

Second reason: servers are general purpose. A server uses a general-purpose CPU to run a general-purpose operating system that runs many different applications. This means that there is both a need for, and an interest in, developing multiple operating systems (in a market that has large volume): a market large enough for many large and small players to be profitable on both the software and the hardware side. Networking devices, on the other hand, have as their main purpose (sole purpose, perhaps) to move packets securely and efficiently, with minimal failures. This requires specialised hardware. That leads to our next point …

Third reason: the lack of an industry-standard instruction set for networking hardware. In the server industry, the x86 architecture prevails. It does not take a lot of effort to ensure you can run RHEL, Windows, ESXi, Hyper-V, etc. on an HP server or on a Cisco server, because both servers use the same processors. Granted, you need to develop drivers for specific vendor functions or hardware (NICs for instance, power management, etc.), but the processor instruction set is always the same.

In the networking industry the same is not true, at all. Back in 2011, some people thought that OpenFlow was going to be "the x86 of networking". I for one was certain that would not be the case. I think we can all agree today that indeed it isn't the case. But why, and what does it mean for the network vendors? … As for the why, leaving aside the limitations of OpenFlow itself, there is little to no interest among the merchant silicon vendors in agreeing on a common architecture.

Broadcom, Intel, Marvell, Cavium … they all have their own hardware architectures and their own SDKs, and they try to differentiate their offerings by keeping it that way. In some cases (in most cases), there is even more than one SDK and hardware family within a single vendor's offering.

For a network OS vendor this means that you need to develop for multiple ASICs, which translates into greater development effort and inconsistent feature sets. Take Cumulus Linux, for instance: switchd today works on one Broadcom chip family, but does not yet work on other Broadcom chip families, or on Intel, Cavium, etc.

At this point some readers may be thinking "well, but ultimately everybody uses Broadcom Trident chips, that's all that is needed anyway, so Broadcom will provide the de facto standard". Not really. Vendors seek differentiation, and will strive to find it. And contrary to the common mantra of the day, differentiation and value do not come exclusively from software. Hardware is needed. Hardware is not merely a necessary evil, as some think of it; it is part of the solution and brings value.

As soon as all vendors have table stakes in hardware (by using the same chips), those vendors will seek to add value to avoid a race to the bottom. Those with the capability to create additional value by developing better hardware will inevitably do it. Those without it will seek to use a different merchant vendor's offering.

Why else would JNPR work to bring up a new line of switches using their own silicon? … Brocade also developed their own ASICs for certain switches. And if you look at Arista, they currently have products using Intel chips to provide a differentiated product line, and there are rumours that they are working on adding Cavium to the list. In a way, in the end, all these vendors are following the path set by Cisco: combining merchant offerings with better silicon (in Cisco's case, its own silicon).

The challenge then, from a disaggregation point of view, is that you need to develop network operating systems that work with very dissimilar hardware platforms underneath. This is a substantial development effort (along with its associated ongoing support). Add to that the fact that networking hardware tends to have a much longer lifetime than server hardware, and it becomes even more complicated.

[As an anecdote to illustrate an extreme case of this point, last year I dealt with a customer who was running a non-negligible part of their datacenter on good old … Catalyst 5500s!! … This equipment has been in service since the late 90s ... How about that?
(An interesting side thought: if this customer had chosen ANY of the competitive offerings to the Catalyst 5500 back in the day, they'd be running equipment from companies that no longer exist. Although in all probability, if that had been the case, they would not have been able to run that hardware for so long at all.)]

Net net, when you consider that the networking market is (a) an order of magnitude smaller than the server market, (b) served by a smaller number of players developing for it, and (c) without a standard for network hardware chips, I think it is very unlikely that we will get to see disaggregation like we've seen in the server industry.

Aren't new software players changing that?


Some may have read about IP Infusion launching its own network operating system. I don't think this changes anything I've written so far. I look at this like I look at Arista developing EOS to run on hardware from multiple merchant silicon vendors (Intel, Broadcom ...). It is definitely possible, and it requires significant development investment with additional support cost.

Ultimately, the point is whether they will be able to deliver more value than a vendor that integrates their offering with their hardware (potentially better hardware).

I see this move from IP Infusion more as a response to being disrupted than as being part of the disruption. I may be wrong of course, but IP Infusion builds its business on selling protocol stacks to companies that can't develop them themselves, or choose not to for whatever reason. Cisco, JNPR and ALU may have their own routing stacks (BGP, OSPF, ...), MPLS stacks, etc. Others rely on companies like IP Infusion to "acquire" a protocol stack that they can't develop. If that "others" part of the market is now competing with the likes of Cumulus, those parts of the protocol stack are being filled in by open source projects like Quagga and so on. So I think this is a response to try and stay competitive in that part of the market.

But disaggregation IS happening


So many people say it, it has to be true! ... Again, yes it is. It always was. It always had a part of the market. A very small part of the market. Now that part of the market may be larger because cloud providers in particular may be interested in that model, complications and all. I keep writing "may" because this is not a given. To explain why, I'd like to separate another concept here. Just like SDN is not disaggregation, and disaggregation is not SDN, disaggregation is not Linux.

Many who think of Cumulus Networks think of white boxes. Therefore they think customers who buy into Cumulus Linux do it to buy cheaper hardware from white box vendors. But Cumulus does not sell white boxes. They sell software subscriptions. White boxes are a vehicle, a route to market (a necessary one for Cumulus). The value proposition for Cumulus is not really so much about disaggregation, I think. I believe that disaggregation is a great eye catcher to generate attention, to open the door, to create debate and confront established players. The value proposition is about providing a credible networking offering using Linux. Not an operating system that uses Linux, but Linux networking. It is about managing your network devices like you manage your servers. Now that is a much more interesting thought. One that is not for everybody, at least not at the moment.

Conclusion


All the above is just a brain dump. Nothing more. Food for thought. I believe that disaggregation of the networking industry is heavily hyped at the moment. This hype, and the semantics battle around it, are great for analysts, bloggers and, to some extent, investors and vendors to create debate and offer their products along the way (yes, they all offer their products along the way …).

Ultimately, competition is great. New players, ODM vendors paired with new OS vendors, are all welcome. The more competition, the better for customers, the better for the industry, the better for everybody.



Tuesday, February 3, 2015

ACI Example - Fast deployment of infrastructure


Many people focus on ACI from the point of view of network virtualization. ACI indeed delivers a powerful network virtualization solution, through an integrated VXLAN overlay which can be used in a programmatic way. These virtualization capabilities are built on a policy model concept that links well with application definition. This is where most people stop.

But the ACI policy model extends beyond providing application connectivity. The APIC also provides many functions that are useful for a network/fabric administrator, in terms of topology management, switch onboarding, policy-based configuration and so on.

This blog post is a simple example to illustrate what it takes to bring up a new rack, for instance. In this example, I will add a new ToR to an existing fabric and will leverage the APIC policy model to provide connectivity for an ESXi host.

We start with a working fabric of only one leaf, as below:


Now I want to add another physical switch to that fabric. We can imagine that I have just racked a new ToR (leaf) and connected uplink ports eth1/49 and eth1/50 to the spines using 40GE. I can already see that APIC has discovered the switch:
















The switch still has no management address, because we have not registered it to the fabric. But the network admin does not need to console to the switch, think of its management address, or use any configuration management tools to provision it. All we have to do now is register the switch:











We give it an ID and a name, and that is it: the switch is now added to the pod (Pod 1) we are working with. All that is necessary for it to work with the fabric is taken care of. At the switch console we see that the name has already been changed as well:
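The same registration step can also be scripted. Below is a hedged Python sketch of what it might look like over the REST API; the fabricNodeIdentP class with its serial/nodeId/name attributes, the serial number and the credentials are assumptions for illustration, so validate them against your APIC release before relying on them.

import requests

APIC = "https://apic.example.com"      # hypothetical APIC address
session = requests.Session()
session.verify = False                 # lab only: skip certificate validation
session.post(f"{APIC}/api/aaaLogin.json",
             json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Register the discovered switch by serial number, assigning its node ID and name.
# The class "fabricNodeIdentP" and its attribute names are assumptions for
# illustration; check them against the object model of your APIC release.
new_leaf = {"fabricNodeIdentP": {"attributes": {
    "serial": "SAL1234ABCD",           # serial reported by fabric discovery (example)
    "nodeId": "102",
    "name": "leaf-102",
}}}
session.post(f"{APIC}/api/mo/uni/controller/nodeidentpol.json",
             json=new_leaf).raise_for_status()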


Because we already had a leaf working with connected servers, we had previously created a switch profile for it, with associated interface selector profiles. All we need to do is add the new switch to the switch selector for the right interface configuration to become available. In our case, this means setting a number of ports to GE with CDP enabled, etc., and other ports to 10GE with CDP and LLDP, etc.




That is it. In this lab I have created two interface policies, with single links: one for GE-connected ESXi hosts and another for 10GE-connected ESXi hosts (CDP, MTU and other settings are part of the profile). The same model can be applied if using vPCs, of course. The right ports have already been configured for GE and with the proper VLANs from a pre-defined pool. As soon as we plug in the ESXi hosts and apply their configuration, they already show the leaf via CDP:


The ESXi hosts were already configured as well, as they were previously connected to a standalone Nexus 9K lab. ESXi infrastructure traffic is mapped to application profiles, and traffic of a particular kind to its own EPG (vMotion, VSAN, iSCSI, NFS ...). As soon as we add the EPG bindings, we see for instance all the iSCSI hosts (statically mapped here):
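For a rough idea of what one of those static mappings looks like underneath, here is a hedged sketch that binds a leaf port and VLAN into an iSCSI EPG. The tenant, application profile and EPG names are made up for the example, and the fvRsPathAtt class with its tDn/encap attributes reflects my understanding of how static path bindings are modelled, so treat all of it as an assumption.

import requests

APIC = "https://apic.example.com"      # hypothetical APIC address
session = requests.Session()
session.verify = False                 # lab only: skip certificate validation
session.post(f"{APIC}/api/aaaLogin.json",
             json={"aaaUser": {"attributes": {"name": "admin", "pwd": "password"}}})

# Statically map port eth1/10 on leaf 102, VLAN 3000, into the iSCSI EPG.
# Tenant/AP/EPG names and the fvRsPathAtt class are assumptions for illustration.
epg_dn = "uni/tn-vsphere-lab/ap-infra-traffic/epg-iSCSI"
binding = {"fvRsPathAtt": {"attributes": {
    "tDn": "topology/pod-1/paths-102/pathep-[eth1/10]",
    "encap": "vlan-3000",
}}}
session.post(f"{APIC}/api/mo/{epg_dn}.json", json=binding).raise_for_status()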













And we can then also benefit from immediate visibility into each of the vSphere traffic types, without adding any other tools (which can of course also be used!):























This is but a really basic example of how a fully programmable fabric is useful, beyond providing network virtualisation …

Thanks to my good friend @alonso_Inigo for helping me ramp up on so many things! :)



Wednesday, January 7, 2015

Let's talk about a 64 Tbps Firewall

Much has been written about SDN, ACI and NSX. The debates are heated, often times perceived as if enemies are fighting one another. In the end, we are just writing about technology, and we should all remember that. Of course we all defend what we think is best, and of course we all try to bring forward considerations that are better suited to whatever position we are defending. This is human. 

I certainly do not want to contribute to the perception that it's all about bashing one solution or another. Certainly, I do not want to bash any solution, NSX in particular. But I believe we all need to contribute somehow to a more moderate debate, and I just dislike exaggeration. With all my respect for @bradhedlund, tweets like the one below fall into exaggeration, in my humble opinion:


To put this in context, the tweet comes out of a discussion over a blog post by @cobendien and @chadh0517 [which you can read here]. The post is about how open Cisco ACI is and what its Total Cost of Ownership (TCO) is. A TCO comparison is provided, presenting a Cisco ACI design and another one built with NSX and Arista commodity switches. Brad's argument is that the TCO analysis is flawed, because NSX provides an E/W firewall, which should be added to the ACI design. Brad is talking about using the NSX DFW for an environment of up to 96,000 virtual machines (which is the size of the design presented in the original post).

The logic here goes that to compare ACI and NSX costs, you need to add to the ACI design an enormous amount of hardware to match the firewall and load-balancing capabilities that NSX provides. NSX load balancing, however, appears to be nothing more than HAProxy with a front end for provisioning. HAProxy is, as we all know, free, and a device package for HAProxy is easy to build in order to integrate it with ACI. Similarly, the NSX Edge firewall capabilities are very similar to what you can get on a Linux VM using iptables. The key argument therefore lies in the Distributed Firewall feature, which provides East-West stateful packet filtering at the vNIC level.

I am certain that many customers see value in NSX, and specifically in the Distributed Firewall feature. However, based on my understanding of NSX, it is unfair (and not true) to pretend that 3,200 hosts running NSX are equal to a 64 Tbps firewall. This isn't true in essence, nor in math. In my opinion, that is an exaggeration and I believe it should be avoided, and I have seen similar ones many times.

I will try to provide details about why I think this way, based of course on my humble knowledge of NSX as it is today. I do hope I'll be corrected if and where I am wrong. Comments are open, and welcome!

The need for East/West Filtering



I do not contest that some customers have a clear interest in, and in some cases a real need for, per-VM East-West traffic filtering. I do contest that per-VM East-West traffic filtering is a requirement in every case, and/or that the best way to implement it is by running it at the hypervisor kernel level.

In the first place, it is fairly obvious that not everything runs in a VM. Many (many) apps still run on bare metal, so a security solution should provide East-West filtering for both environments in order to be comprehensive. Second, implementing advanced traffic filtering at the hypervisor level complicates having a seamless policy solution across multiple hypervisors, because the data plane would need to be replicated in every kernel, something which isn't easy to accomplish (even if most vendors today have an open API for accessing the kernel for network functions). Third, the "cost" of filtering at every virtual ingress point may be higher than it appears, and it isn't free in any case. Fourth, and as a direct consequence of the previous point, advanced filtering isn't available today at very high speed on x86 architectures.

This leads me to the tweet line below, part of the same tweet exchange from above:




What a Firewall is, and what a Firewall isn't 



I am not a security expert, but I believe that the definition of a Firewall isn't a matter of "opinion". It is a matter of the functionality provided. That said, not every place in the network needs advanced packet analysis and filtering, so the functionality required from a firewall isn't always the same, and that's the real story. 

One of the fundamental arguments for proposing a DFW like that provided by NSX (i.e. a stateful packet filtering engine at the vNIC level) is that it prevents threats from propagating East-West. Explained shortly: if you have a vulnerability exploited on a VM on a particular application tier, the exploited VM cannot be (easily) used to launch attacks against others in the same tier and/or other tiers of VMs.

But I think that this argument contains a bit of a marketing exaggeration. 

Imagine a typical three-tier application. On the user-facing side, the application presents a web front end with a number of VMs running Apache, nginx, IIS … your pick. All those VMs listen on tcp/80, and any of them will eventually have vulnerabilities. If one of those VMs is exploited through a vulnerability of the web server accessible on that port, the chances that ALL other VMs on the same tier can be equally exploited are very high. This is why your front-end perimeter firewall should be able to inspect HTTP traffic in order to protect the web server. For this you need a real firewall that can do HTTP inspection, HTML analysis, etc. The fact that, once one web VM is exploited, your DFW prevents it from talking to the other web VMs does very little in this sense, because the attack vector was the legitimate port. Then, even if those exploited web VMs can only communicate with other application tiers on authorised protocols and ports (something which is also enforced by the ACI fabric, since each tier will typically be in a different EPG - or EPGs - and can communicate only on a contract-permitted basis), we all know that this protection is not incredibly strong either because, again, it only filters on protocols and ports. The real exploit may (will) happen through the legitimate port and protocol. This is why modern firewalls (aka NGFWs) these days inspect traffic without necessarily concerning themselves with port and protocol. To continue our example, SQL traffic may be allowed between the App and DB tiers, but you need the capability to inspect inside the SQL conversations to detect malicious attempts to exploit a bug in a particular SQL implementation … something which an NGFW does (but the ACI fabric filtering and the NSX FW do not).

This does not mean that ACI or NSX DFW filtering isn't important and/or useful. I am not writing this as a means to discredit the NSX DFW feature, which may be interesting in terms of compliance or for other reasons. Multi-tenancy may be a reason for using such filtering, other than security. ACI offers very similar functionality and, being a naturally stateless filtering fabric, it may appeal to security ops as well.

This is truly not about bashing a technology, but I do want to put the reader into a mood where they think about these tools as parts of a solution: not as the holy grail (which they are not). I hope the reader will ask her/himself whether a feature is really valuable, or whether it simply is presented in such a way because it is … well … the only thing that a particular vendor has to offer. 

In reality, if security really is a top concern within the DC, you will likely still need a real firewall (from a real firewall vendor), and what matters in that case is how service redirection can happen in an effective and consistent manner (for both physical and virtual).


3,200 Hosts and a 64 Tbps Firewall


There is a bigger point of exaggeration in the tweet line above. It's not only the hype about DFW capabilities and how much security it can really enforce. The other point that I think falls into both hype and exaggeration in the tweet exchange above is performance and scale. I'd like to spend some time on that to put things into perspective.

I want to clarify that this isn't about "software can't scale, and hardware is better" or vice versa. It is about the fact that oversimplifying things isn't helpful for anybody, and that distributed performance is sometimes presented almost in mythical ways.

Let us analyse an environment with up to 96,000 VMs, on 3,200 hosts, each dual-homed with 20 Gbps. Such an infrastructure cannot be considered in terms of a SINGLE FEATURE of the underlying software or hardware. There are many more dimensions to consider. And even when considering one single feature, the devil is in the details.

Let us consider the following:

(1) Such an infrastructure needs not only E/W filtering, but N/S as well. Perimeter firewalling is what will first and foremost protect that front end of yours. You will typically want an NGFW for this. Depending on the design, the NGFW resources may be virtualised and therefore shared between both functions (perimeter and East/West). If you use NSX, the resources allocated to the DFW are only available for basic E/W filtering. No sharing in that case. This point must also be considered in the TCO.

(2) Following from the above, the DFW feature isn't free to run. The tweet line above was presented as if NSX were adding that feature on top of a design at no additional cost. But the real cost of the DFW feature, or any other feature that runs on a hypervisor, cannot be evaluated only in terms of software licensing; it must also include the general-purpose CPU cores and memory consumed by the feature, cores and memory that aren't available to other applications. Nothing is free.

(3) To infer that because you have 3,200 hosts, each with 20 Gbps, you have a 64 Tbps firewall is marketing math at its best. First, because to date we have not seen any independent validation (or vendor validation, for that matter) of the real performance of the DFW (or any other NSX feature). This is a clear contrast with VMware vSphere features, or with VMware VSAN, where there is a plethora of vendor and third-party benchmark testing publicly available. Second, because with current NSX scalability limits (as per the data sheet), those 3,200 hosts cannot all be part of a single NSX domain, and therefore need to be split into smaller Distributed Firewalls with no policy synchronisation. More details on this below.

(4) Understanding that real security (beyond protocol and port filtering) will require real NGFW filtering at least between critical apps and/or app tiers, the NSX solution must also be complemented with an offering from a company like Palo Alto (and to date, only that company as far as I know). This has an impact on performance, on cost, and on overall resources (because each PA VM-Series also requires vCPUs and vRAM, which are incremental to those consumed by the DFW feature itself).

(5) Traffic must be routed in and out of that virtual environment. Those 96K VMs will need to communicate with users, if nothing else, and with their storage in most cases too. In fact, if the server has two 10GE interfaces, it is likely that a fair amount of that bandwidth will be dedicated to storage, which won't be protected by the DFW feature in any case. This also means that you do not need 20 Gbps of firewall throughput per server. The amount of North/South traffic will largely depend on what the infrastructure is being used for: what is running on those 96K VMs. Imagine that you need to onboard lots of data for a virtual Hadoop cluster that you decide to run there; you may have significant peaks of inbound traffic. If you assume North/South is 10% of total bandwidth, you then need in excess of 6 Tbps of routed traffic in and out of the overlay … I will come back to this point later.
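To make these bandwidth figures concrete, here is the quick arithmetic behind the 64 Tbps headline and the North/South estimate above (plain Python, with the numbers taken from the scenario in the post):

hosts = 3200
host_bandwidth_gbps = 2 * 10              # each host dual-homed with 10GE

total_access_gbps = hosts * host_bandwidth_gbps
print(total_access_gbps / 1000)           # 64.0 -> the "64 Tbps firewall" figure

# Assuming 10% of the access bandwidth is North/South traffic:
north_south_gbps = 0.10 * total_access_gbps
print(north_south_gbps / 1000)            # 6.4 -> "in excess of 6 Tbps" in/out of the overlay

The first number is an aggregate of server access links, which is precisely why quoting it as the throughput of a single firewall is misleading.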

To expand a bit on point (3) above, regarding per-host performance, it is well known that ESXi in general does not do 10GE at line rate for small packet sizes, for instance (http://www.vmware.com/files/pdf/techpaper/VMware-vSphere-VXLAN-Perf.pdf). This keeps getting better, of course, and I'd expect vSphere 6.0 to improve in this area as well. The NSX performance figures that I have seen presented at VMworld SFO 2014 [Session NET1883] showed that for small packet sizes the DFW was maxing out at around 14 Gbps in a test with 32 TCP flows in one direction, whilst for large packet sizes it could reach up to 18 Gbps per host. Since this test was done with unidirectional traffic only and with a (very) small number of flows (and we do not know how many cores were dedicated to achieving this performance, because that information wasn't shared), we cannot infer what the real performance is for IMIX or EMIX, but it is clear that it will not be 20 Gbps per server.

The performance considerations of a distributed environment can be looked at in many ways. Distributed performance may scale better in certain cases, and worse in others. The performance of each point of distribution needs to be considered. In this case, there are per-host limitations which may or may not be acceptable. Again, leaving aside whether current NSX DFW performance is 14 or 18 Gbps per host or whatever (in any case, it will get better with time for sure), as you add NGFW capabilities today this is lowered by an order of magnitude (even if that will also get better over time). The PA VM-Series maxes out at 1 Gbps per VM as per its data sheet.

Oh, but you only send to the Palo Alto VM-Series the traffic that requires deep inspection … by filtering on port and protocol with the DFW? … This is a possibility, but then you are saying that certain ports and protocols won't be sent to deep packet inspection because they are deemed safe? … That argument doesn't hold, and the Palo Alto folks in particular know that, since they make a selling point of … well … being a firewall that cannot be fooled by port and protocol. Anyway, the level of security required will of course be customer dependent; not all environments are the same.

Back to performance, arguably you can also say that any particular host won't be bursting at maximum performance all the time. So it does not matter if you can do only 1-2 Gbps, because maybe that is all that your application requires. But it is clear that when an application is busy on a particular host, likely traffic is high on that particular host as well, and you do need the firewall performance for that particular host at that very moment. 

In other words, a 3,200-host NSX environment isn't really a 64 Tbps DFW at all. If anything, it'd be 10 or 12 smaller ones, since you cannot have 96K VMs under a single NSX Manager (see below for where the 10-12 comes from).

Of course, the same argument about statistical traffic requirements (i.e. all hosts won't require maximum performance at once) can be made for a semi-distributed NGFW design option. By semi-distributed I mean a scale-out model built with appliances and dynamic service chaining. This isn't all or nothing. You can distribute firewall functions to every host, but the alternative isn't a single large, monolithic firewall. You can have a scale-out cluster of dedicated hardware appliances too.

For the sake of argument, consider using Palo Alto physical appliances in a scale-out model vs. per-host Palo Alto VM-Series integrated with NSX. A vSphere cluster of 32 hosts would max out at 1 Gbps of firewall throughput per host (as per the Palo Alto VM-Series data sheet here). You could say that the vSphere NSX cluster is a 32 Gbps NGFW for E/W, assuming you use it to filter all traffic (which is probably not required in many environments). To accomplish this you need the VM-Series licenses, the NSX licenses, and dedicated physical hosts for running the NSX controllers and manager. Also, you are dedicating a total of 256 cores on that vSphere cluster to Palo Alto alone (the equivalent of 5-12 servers, depending on the CPU configuration of a dual-socket server). This is not counting the cores required for the DFW, which are harder to estimate. In such a solution, no single VM on any particular host can exceed the maximum throughput of the VM-Series appliance (1 Gbps today). That is: traffic between any two VMs crossing the VM-Series can never reach 10 Gbps.

Consider now a design with Palo Alto physical appliances instead. If you use two PA-5060s with service chaining, you have 40 Gbps of NGFW capacity for East/West. You also get better performance for any single VM-to-VM communication, which can now fully use all available host bandwidth. Moreover, that NGFW serves not only the ESXi hosts … but anything else you have connected to the fabric! …
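A quick back-of-the-envelope comparison of the two options just described (plain Python; the per-device throughput figures are the ones quoted above, and the 8 vCPUs per VM-Series instance is my reading of the 256-core figure, so treat them as illustrative assumptions):

# East/West NGFW capacity of a 32-host cluster protected by per-host VM-Series.
hosts = 32
vm_series_gbps = 1                      # per-VM-Series limit quoted above
nsx_cluster_ngfw_gbps = hosts * vm_series_gbps          # 32 Gbps aggregate
nsx_per_flow_ceiling_gbps = vm_series_gbps              # any VM pair capped at ~1 Gbps

# The scale-out physical alternative: two PA-5060s with service chaining.
pa5060_gbps = 20                        # implied by the 40 Gbps figure above
physical_ngfw_gbps = 2 * pa5060_gbps                    # 40 Gbps aggregate
physical_per_flow_ceiling_gbps = 10     # limited by the host NIC, not the firewall

# Cores dedicated to the VM-Series alone (the 256-core figure in the post).
vcpus_per_vm_series = 8                 # assumption consistent with 32 hosts -> 256 cores
cores_for_palo_alto = hosts * vcpus_per_vm_series       # 256

print(nsx_cluster_ngfw_gbps, nsx_per_flow_ceiling_gbps,
      physical_ngfw_gbps, physical_per_flow_ceiling_gbps, cores_for_palo_alto)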

Now run the numbers for TCO of those two environments ... the numbers do not lie. I do not know the pricing of Palo Alto appliances or VM-Series, so I will let the reader contact their favourite reseller. Of course I can make the same case for ASA and compare (so I know the results :-) ) but I do not want this post to be a Cisco selling speech.



However, by using NSX + Palo Alto you get to use vCenter attributes to configure policy, and you can dynamically change the security settings for VMs as required. Well, actually, that would be possible without NSX as well from a policy point of view, because nothing technically prevents Panorama from interfacing directly with vCenter and the latter from communicating the mapping of VM attributes to IP/MAC, etc. The solution has been productised in a way that requires Panorama to talk to NSX Manager, but I see no (technical) reason why it couldn't be done in other ways, since it is vCenter that holds the key information.


Now let's come back to point (5) from above. To do that, we need to consider the design of the 96K VM infrastructure in greater detail. The design possibilities for such an environment are so varied that I won't take a shot at covering them here. But I do want to point out some things for consideration. For instance, let's review a possible fabric design first, with a variation of what was proposed above:

- 3,200 hosts could fit in as little as 80 racks, using 1 RU dual socket servers (provided cooling can be done in the facilities)
- using redundant ToRs per rack, this translates into 160 ToRs. We would consider four spines.
- if using ACI, the fabric would have three APIC controllers in a cluster.

Now let's consider the virtual infrastructure:

(1) vCenter 5.5 maxes out at 10,000 active VMs [http://www.vmware.com/pdf/vsphere5/r55/vsphere-55-configuration-maximums.pdf]. I do not know if customers would feel comfortable operating the environment at such high utilisation numbers, considering that vCenter is also a single point of failure. Also, if operating at maximum possible scale, the vCenter DB must be considered carefully as well. For the sake of an exercise, let us assume you add a 20% safety margin so you operate at 8,000 VMs per vCenter; you would then need 12 vCenters. To make the math easier, let's assume 9,600 VMs per vCenter, so you need a total of 10 vCenter instances.

(2) The point above is important because NSX, as of v6.1, has a 1:1 mapping between NSX Manager, controller cluster and vCenter. This means that you also need 10 different controller clusters and 10 different NSX Managers. Notice that there is no federation of controllers or managers today. This means that from a DFW standpoint (and from every other standpoint, in fact), you have to replicate your policies 10 times (or more, if you choose to use fewer VMs per vCenter). This is why I alluded above that if we are to believe Brad's marketing claim of 3,200 hosts being a 64 Tbps firewall, it should at a minimum be split into 10 chunks (and in fact likely more, for practical purposes).

(3) We are assuming here that each controller cluster and NSX Manager can handle the load for up to 9,600 VMs over 320 physical hosts. We have not talked about how many Logical Switches, Distributed Logical Routers, etc. may be required in the environment, or the impact that such considerations may have on the overall scalability of NSX. Again, an area where we all live in the grey mist of ignorance, since no public guidelines are available. But it is clear that if we choose 10 vCenters, the controller clusters will be operating near their maximum advertised capacity (which is 10,000 VMs) in any case. With this in mind, if you design for maximum availability and minimal performance compromises, you need:

- 10 servers for vCenter
- 30 servers for NSX Controllers
- 10 servers for NSX Managers

The above must of course pay vSphere licenses as well, and would take up almost two racks' worth of space, network connectivity and power. Arguably, you could run two or even three controllers per server, each belonging to a different cluster. This has an impact on system availability, because in that case a single host failure (or maintenance) would impact two or three clusters, as opposed to only one (so you duplicate or triplicate - or more - the size of your failure domain). If you opt for up to three controllers per host, you'd use 10 servers for controllers instead of 30, at the expense of a larger failure domain. In any case, I doubt anybody would design to operate the environment at 96% of advertised capacity for vCenter or NSX, so the actual figures will likely be larger (although vSphere 6.0 is coming out soon and may raise some of these limits).

It is worth remarking that, effectively, you have 10 isolated islands of connectivity in that infrastructure. How does a VM on one island talk to a VM on another island? … Through an NSX gateway. So East/West traffic across islands requires gateway resources: another point that was ignored in the 64 Tbps DFW marketecture (and another cost).


Now comes the perimeter. The North/South. Again, let's say you design for 10% of the deployed fabric capacity. No, let's say it's 5%. You want 3 Tbps of North/South capacity available. Let's forget the perimeter firewall, because you could argue that whether you use ACI or NSX or both, you need it anyway and it could be built from the same clusters of ASAs or Palo Altos or whatever. But you have to route those subnets in which your 96K VMs live. For NSX, that means using NSX Edge, peering with the DLR. How many DLRs do you need? It depends: on how many tenants you have, on how many subnets you need to isolate and route, etc. Let's forget those too. At VMworld session NET1883 they presented a test case where NSX Edge could route at about 7 Gbps. Again, a test with minimal data shared: no RFC-level testing, no IMIX consideration, no word on tolerated loss rates or latency, …

Let's say each NSX Edge routes 10 Gbps though, and that you can load-balance perfectly across several instances as required. To route 3 Tbps in and out of that overlay you need 300 NSX Edge VMs. If they run any stateful services (e.g. NAT) and you want redundancy? … then you need to add 300 more … in standby.

Now, if we put two NSX Edges per physical server (each with a dedicated 10GE NIC), we are adding 300 servers to the mix. That is almost 8 racks' worth of gear that pays full vSphere (and NSX) licenses and can do one thing alone: run NSX Edge. Let's forget the complexity of operating 300 mini-routers, each of them independent of one another.
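Pulling the Edge sizing together, the arithmetic behind the 300-server figure looks like this (same assumptions as the paragraphs above: roughly 10 Gbps per Edge, a standby peer for each active Edge, and two Edge VMs per physical server):

north_south_target_gbps = 3000      # ~5% of the 64 Tbps access figure, rounded down
edge_throughput_gbps = 10           # optimistic per-Edge figure discussed above

active_edges = north_south_target_gbps // edge_throughput_gbps    # 300
standby_edges = active_edges        # stateful services + redundancy -> 1:1 standby
total_edges = active_edges + standby_edges                        # 600

edges_per_server = 2                # each Edge with a dedicated 10GE NIC
edge_servers = total_edges // edges_per_server                    # 300 dedicated servers
racks = edge_servers / 40           # ~7.5, i.e. almost 8 racks of gear
print(active_edges, total_edges, edge_servers, racks)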

In the drawing above I show how NSX Edge is also required if East/West traffic flows between NSX domains, but notice that the calculations above of how many NSX Edge VMs you need do not include any requirement for E/W between the different NSX domains. That would add even more NSX Edge VMs. We have also not considered the DLR VMs, which would also require dedicated servers.

Now imagine an upgrade! ...


And what about ACI?


With ACI, the fabric can be shared by more than one vCenter. Also, gateway functions are not required to connect VMs with other stuff, whether that is users through the core or WAN, or bare-metal applications. As for East/West, it really depends on the environment. Where the intention is simply to have filtering to segregate tenants or apps, it is possible that the ACI stateless filtering model intrinsic to the Application Profile definition is sufficient. In other cases, filtering is required at the vSwitch level, and in that case the Application Virtual Switch (AVS) can be leveraged in vSphere environments. Or one could imagine that NSX is also leveraged in that sense, with ACI providing the service chaining. We have seen that using physical firewall appliances in a scale-out model with dynamic service chaining can be more cost-effective and deliver better performance than the virtual model. The graph below illustrates a possible alternative, built using vCenter with ACI but without NSX.




Because you do not need the NSX Edge gateway functions and you need fewer servers for management and control, you will see savings that can probably pay for the NGFW functions. This depends on each case and on the scale and requirements, of course. But you can see that for this particular scenario you probably require between 200 and 300 fewer servers, with the accompanying license, space and power costs.


In the end, it's not all black and white


To wrap it up, this isn't about saying that NSX is a bad product. This isn't about saying that the NSX DFW feature is good or bad. It is about expanding the conversation and the considerations beyond hype and marketing. The fact is, many customers that are looking at ACI or NSX have scalability requirements well below the design discussed here. Ultimately, customers may see value in both solutions, or in neither of them! For instance, using the NSX DFW as a form of advanced PVLANs, but using ACI fabric service chaining to redirect to NGFWs running in scale-out clusters.

The important thing, IMHO, is that customers can make an educated decision. And while all vendors legitimately try to steer the conversations to their advantage, we should all try to avoid falling into blatant exaggerations. Particularly when it touches on security aspects.