Thursday, February 4, 2010

Portchanneling, or how to bring the LAN down

I have had a lot of fun doing on-site LAN refresh implementations for our client over the last 18 months. The sites had between 100 and 700+ users, and between 5 and 40 switches. Gathering information about the existing LAN and preparing the design and configuration is one thing; on-site implementation is something different, more challenging and more interesting.

Yesterday I found out that it is extremely easy to break such a LAN remotely. With one move we lost access to the Core switch, and the whole site was down for 5-10 minutes. After the Core switch was reloaded and reconfigured, everything was fine.

The task was to move a server from one VLAN to another, and to force that server to communicate with the rest of the site through the firewall installed on site. Routing for the new VLAN is done by a UTM-1 Edge firewall, which is connected to the Core switch. The server was connected to an Access switch (the same one as the WAN router). I wrote a step-by-step procedure for my colleague who had to perform the task, and I made one mistake concerning portchanneling: I asked him to modify the physical interfaces instead of the portchannel interface.

As soon as he started the change, I got a call from him that the site was down. I went to his PC and saw the PuTTY session with the last command entered: "switchport trunk allowed vlan add 201", as instructed. Everything was down, so we called the site. They confirmed that the site was down, and we asked for the switch to be reloaded. That took 5-10 minutes, and in the meantime we checked the command reference for portchanneling. One of the mistakes was that the switchport configuration was edited on the PHYSICAL interface instead of the virtual PORTCHANNEL (Po5) interface. The member ports of a channel are expected to carry the same allowed VLAN list, so a change made on a single physical port makes it inconsistent with the rest of the bundle.

After the reload, my colleague added the new VLAN on the portchannel interface of the Access switch first, then on the portchannel interface of the Core switch, and everything went OK (the configuration of the physical interfaces was updated automatically, as expected). The change went fine, the server was migrated, and all the NATted connections towards the server worked as expected.
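
For reference, here is a minimal sketch of the change the way it should have been written in the procedure: on the portchannel interface, never on its members. VLAN 201 and Po5 come from this site (Po5 is what the Core switch shows). The portchannel number on the Access switch side is an assumption on my part, so check it before pasting anything. The two show commands at the end are just my usual sanity checks, not part of the original procedure.

```
! Access switch first, then repeat the same change on the Core switch.
! Port-channel5 on the Access side is an assumption; verify the number locally.
ACCESS#configure terminal
ACCESS(config)#interface Port-channel5
ACCESS(config-if)#switchport trunk allowed vlan add 201
ACCESS(config-if)#end
! Check that all member ports are still bundled in the channel
ACCESS#show etherchannel summary
! Check that VLAN 201 is now in the allowed list of the trunk
ACCESS#show interfaces trunk
```

Note the "add" keyword: without it, the allowed VLAN list is replaced by the single VLAN instead of being extended.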

The "mystery" remained: why the heck did we lose access to the Core switch? The Core switch has a loopback interface, and even that was not reachable until the switch was rebooted. I was enlightened by one of our colleagues, a CCIE R&S holder.

On 158 of the 160 sites, the WAN router (provided by the ISP) is physically connected directly to the Core switch. On 2 of the sites (I got this info today) the WAN router wasn't placed in the same room as the Core switch, so there we use a portchannel bundling 4 or more Gigabit physical interfaces between the Core switch and the "Access" switch that is physically connected to the WAN router. Of course, I didn't check whether this was the case. So the logical L3 diagram was WAN->CORE---->ACCESS, but physically it was WAN->ACCESS---->CORE. By breaking the portchannel between the Core and Access switches, we lost access to the Core switch, as the Core switch wasn't physically connected to the WAN router.

Lessons learned:

1. Verify the network diagram. Verify that the configuration of the device corresponds to the diagram. (This should take less than 10 minutes; you can find the outputs below.)

2. Check the command reference and/or examples if you haven't done the task recently (in this case, adding a VLAN on a port-channel).

3. Do not make too many assumptions.

CORE#show ip route
S*   0.0.0.0/0 [1/0] via 10.122.134.1

CORE#show arp | inc 10.122.134.1
Internet  10.122.134.1   24   0000.0c07.ac01  ARPA   Vlan100

CORE#show mac address-table | inc 0000.0c07.ac01
100    0000.0c07.ac01    DYNAMIC     Po5

CORE#show int po5 | inc Members
  Members in this channel: Gi1/0/5 Gi1/0/6 Gi2/0/5 Gi2/0/6


CORE#show cdp nei Gi1/0/5
Device ID        Local Intrfce     Holdtme    Capability  Platform  Port
ACCESS           Gig 1/0/5             120           S I      WS-C3750- Gig 1/0/1
