VM cannot ping Host and vice versa

This is a very puzzling problem.  VMware support has been trying to figure this out, as has Dell, so I am throwing it out to the community to see if anyone else has experienced this issue and may have a solution.  I have 3 identical Dell R720 servers.  2 work with no issues, but 1 (call it VM8) has been giving me problems since day 1.  Dell checked the hardware today and had me update the BIOS, firmware, and drivers on VM8, which did not resolve the issue.  VMware technicians have checked every network setting over the past several weeks and still cannot find the cause.

 

VM8 has ESXi 5.5.0 installed.  The server has 2 NIC cards with 4 ports each.  In the current configuration, vmnics 0-3 are connected to our LAN, 4-5 to our DMZ, and 6-7 to our SAN (iSCSI).  HA flaps up and down because VM8 keeps losing connectivity to our isolation address (the gateway).
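
In case it helps, here is roughly how I pull that layout from the ESXi shell (a sketch; your vSwitch and portgroup names will differ):

    # List physical NICs with driver, link state, and speed
    esxcli network nic list

    # List standard vSwitches with their uplinks and portgroups
    esxcli network vswitch standard list

    # List vmkernel interfaces; management should show 172.20.100.9
    esxcli network ip interface list
    esxcli network ip interface ipv4 get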

 

VM8's management network IP is 172.20.100.9, and it only has 1 VM (172.20.100.40) on the same subnet (255.255.255.0).  Pinging .40 from .9 with vmkping times out.  When I ping .9 from .40, the first packet gets a quick reply, then all following packets time out.  According to VMware, a ping from a VM to its own host never goes out through the physical NIC to the physical switch; everything stays internal to the vmnic and vSwitch.  When I ping my gateway (172.20.100.1), the ping is successful.  When I ping .9 from my workstation, the first packet times out, then the following packets get a reply.  It's the exact opposite of pinging from the VM.
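
For reference, these are the pings I run from the ESXi shell (a sketch, assuming the management vmkernel interface is vmk0; adjust the interface name to your setup):

    # Ping the VM from the host's management vmkernel interface;
    # -I pins the outgoing interface, -c sets the packet count
    vmkping -I vmk0 -c 4 172.20.100.40

    # Same test against the gateway (this one succeeds)
    vmkping -I vmk0 -c 4 172.20.100.1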

 

Here's a better breakdown:

.9 VM8 Host

.40 VM on VM8 host

.1 Gateway

.122 workstation on LAN

.25 vRanger (physical server on LAN)

 

Ping results:

.9 to .40 (100% packet loss)

.40 to .9 (75% packet loss) first packet gets a reply, the next 3 time out

.9 to .122 (0% packet loss) good ping

.122 to .9 (0% packet loss) good ping

.9 to .25 (75% loss) vmkping does not display each packet as it is sent, but based on the other results I can safely assume the first packet times out

.25 to .9 (75% loss) first packet timed out, the following 3 got a reply

.40 to .122 (0% packet loss) good ping

.122 to .40 (100% packet loss)
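
That first-packet pattern in both directions is exactly what you would see while an ARP entry is being resolved or corrected, so the host's neighbor cache and the actual frames seem worth capturing. A rough sketch using ESXi 5.5 tools, assuming the management interface is vmk0 and its active uplink is vmnic0:

    # Dump the vmkernel ARP/neighbor cache while reproducing the pings;
    # look for stale, missing, or duplicate MAC entries for .40 and .1
    esxcli network ip neighbor list

    # Capture on the vmkernel port while the VM pings .9 (Ctrl-C stops it),
    # then open the pcap in Wireshark to see where the first packet dies
    pktcap-uw --vmk vmk0 -o /tmp/vmk0.pcap

    # Repeat at the physical uplink to separate internal vSwitch drops
    # from drops on the wire
    pktcap-uw --uplink vmnic0 -o /tmp/vmnic0.pcap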

 

All 3 can ping .1, yet about every 20 minutes on VM8 I get a "vSphere HA agent on this host could not reach the isolation address 172.20.100.1" warning.

 

Also, throughout the day I get the message "vSphere HA agent on this host cannot reach some of the management network addresses of other hosts, and HA may not be able to restart VMs if a host failure occurs."  I have come to work in the morning and found all of my VMs on VM8 migrated to my other 2 hosts.  My backups of VMs on VM8 don't work either.  I use vRanger, and when I ping VM8 from the vRanger physical server, the first packet times out and the following packets get a reply, so when vRanger goes to back up my VMs, it fails because of that initial packet loss.

These are the things I have already tried.  I tested each physical NIC individually and removed every port on both NICs to try to isolate a specific port: all 4 vmnics are active adapters under NIC Teaming in the Management Network properties, and I moved each vmnic to Unused one at a time to test each port.  I replaced the Cat6 cables.  I used different Dell switches and different ports on the switch, and even swapped ports with the ones another host uses, ruling out a switch port configuration issue; port security is disabled on the ports, and we do not use VLANs.  I upgraded ESXi 5.5.0 to a newer build.  There's a known issue with the tg3 driver, so I upgraded it to the latest version, which did not fix the problem, and I also applied the tg3 workaround of disabling NetQueue.  Dell tech support states that it is not a hardware issue and believes it is a Layer 2 issue, but is not sure where.  Basically, it's either a problem strictly internal to VM8 (the vSwitches or vmnics) or a hardware gremlin in our Dell R720 box.
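
For completeness, this is roughly how the driver check and the NetQueue workaround look from the ESXi shell (a sketch; the kernel setting change requires a reboot):

    # Confirm which driver and version a given vmnic is using
    esxcli network nic get -n vmnic0

    # Show the loaded tg3 module details
    esxcli system module get -m tg3

    # Disable NetQueue host-wide, then reboot for it to take effect
    esxcli system settings kernel set --setting="netNetqueueEnabled" --value="false"
    esxcli system settings kernel list -o netNetqueueEnabled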

 

Dell's final recommendation is to blow away ESXi on the server and install a clean copy.  This is extremely frustrating and I am running out of ideas.

 

Thanks in advance.

