Ref: 11640123
Title: Case Study of Packet Handling Between IB/3s Over Serial Lines
Date: 4/24/90

Copyright 3Com Corporation, 1991.  All rights reserved.

Problem:  A user has two concurrent T1 lines running between two IB/3s.
Most of his traffic is DEC.  When one of the T1 lines fails, the IB/3
switches the traffic fine, but when the line is restored the data is not
handled properly and all the LAT terminals freeze.  The IB/3s are running
IB version 2.0; the user has tried both MaxMode=OFF and MaxMode=ON, to no
avail.

Solution:  LAT does not normally tolerate out-of-sequence packets; thus,
the LAT terminals may be locking up because packets are getting dropped, or
because the servers are receiving packets out of sequence when the second of
the two parallel lines comes back into service.  (Note:  MaxMode=ON is usually
required for LAT.)

To understand specifically what is going on, consider two scenarios--one
with MaxMode=ON, one with MaxMode=OFF.


SCENARIO 1:  MaxMode=ON

With two working serial lines, packets are transmitted as follows:
Ethernet addresses seen as coming from the local Ethernet (E1) are stored in
a routing table as being on E1.  Both serial lines are the same speed, so we
assign the same number of E1 source addresses to each line.  Packets from an
E1 source are always sent across the line to which that source was assigned.
The same thing is happening on the IB/3 over on the second Ethernet (E2).

Note:  There is no guarantee that packets exchanged between a given E1
device and a given E2 device will go over the same line in both directions.
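
The assignment described above can be sketched as follows.  This is only an
illustration, not the IB/3 implementation:  the function name, the station
names, and the round-robin order are assumptions; the text says only that
equal-speed lines get the same number of source addresses.

```python
def assign_sources(sources, lines):
    """Spread source addresses evenly over the serial lines that are up.

    Round-robin order is an assumption made for illustration.
    """
    if not lines:
        raise ValueError("no serial line is available")
    return {src: lines[i % len(lines)] for i, src in enumerate(sources)}


# Made-up station names standing in for Ethernet addresses seen on E1.
e1_sources = ["station-a", "station-b", "station-c", "station-d"]

both_up = assign_sources(e1_sources, ["line1", "line2"])
one_down = assign_sources(e1_sources, ["line1"])

# Every packet from a source follows that source's current assignment.
for src in e1_sources:
    print(src, both_up[src], one_down[src])
```

With both lines up, each line carries two of the four sources; with one line
down, all four sources map to the remaining line.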

When one of the lines goes out of commission, we assign all source
addresses to the remaining line.  This is not a problem yet because there
have probably been few out-of-sequence packets.  (Out-of-sequence packets may
have resulted from a line deteriorating before going down, causing, for
example, packet #1 to be passed successfully, packet #2 to be lost due to
serial line error, and packet #3 to be passed--the host would see #1, then
#3.  The station would then ask for packet #2 and the host would likely
send #2 and #3.)

So, all of the traffic is flowing down a single line, mostly in sequence,
and everything is fine.  Then the second line comes back up.  Now that we
have a second line, we reassign about half of the addresses from the line
that has been carrying all the traffic to the newly available second line.
What happens to the traffic?

Remember, the IB/3 is a Layer 2 device, so it has no concept of "session."
It does not know about packet destinations other than itself, and it views
each packet as something to be forwarded or discarded as appropriate given
its own configuration and placement in the network.

The IB/3 has a transmit queue of packets to forward which it has received
from the local Ethernet.  It sends them all down one line until the address
assignments are made as a result of the second line coming back.  Then, at
least for those addresses assigned to the newly available line, some of the
packets get sent down the second line.  So, for this brief period, it is as
if we were load-balancing the LAT traffic; that is, packet #1 to line #1,
packet #2 (and all later packets) to line #2.  If packet #2 traverses line
#2 faster than packet #1 traverses line #1, then the end-station sees #2
before #1.  The end-station may see even more out-of-sequence packets because
there may be more than one packet in the pipe (on each T1 span) at a given
instant.  So the end-station may see packets in the following order:  #2, #1,
#4, #3, ... (for two or more packets in each pipe).  This is more of a problem
with faster lines, because slower lines usually have fewer packets en route
(on the line itself at one time) as a result of the increased time to put
them on the line.
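
The reordering can be reproduced with a very small model.  This is a sketch
under stated assumptions, not a description of the IB/3 internals:  time is
counted in packet slots, both lines run at the same speed, and the backlog of
three packets on line 1 is a made-up figure.

```python
def arrival_order(packets, backlog):
    """Return sequence numbers in the order the far end receives them.

    packets: (seq, line) pairs in the order the bridge queues them.
    backlog: packets already waiting on each line, keyed by line name.
    Time is counted in packet slots; both lines run at the same speed.
    """
    free_at = dict(backlog)          # slot at which each line becomes idle
    arrivals = []
    for seq, line in packets:
        free_at[line] += 1           # one slot to clock the packet out
        arrivals.append((free_at[line], seq))
    return [seq for _, seq in sorted(arrivals)]


# Packet #1 was queued on line 1 behind three earlier packets; its source is
# then reassigned, so #2 and all later packets go to the idle line 2.
order = arrival_order([(1, "line1"), (2, "line2"), (3, "line2"), (4, "line2")],
                      backlog={"line1": 3, "line2": 0})
print(order)   # [2, 3, 4, 1]
```

If all four packets stay on line 1, the same function returns them in their
original order, which is the single-line case described earlier.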

Two ways to get around this problem are:

1.  Transmit all packets currently in the queue over the first line and
transmit no other packets over the second line until these packets are sent;
however, this could cause timeouts.

2.  Flush all packets currently queued; this has the same problem as above
(possible timeouts), but may be preferable.
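
A sketch of workaround 1 follows, again counting time in packet slots.  The
function, its parameters, and the figure of three packets queued ahead are
assumptions for illustration only.  It shows both the benefit (packets arrive
in order) and the cost (later packets are delayed, which is where timeouts
could come from).

```python
def arrival_order(queued, new, drain_first, ahead=3):
    """Return (arrival_slot, seq) pairs sorted by arrival.

    queued: sequence numbers already queued on the old line.
    new: later sequence numbers from a source reassigned to the new line.
    drain_first: if True, apply workaround 1 and keep the new line idle
        until everything queued on the old line has been sent.
    ahead: assumed amount of other traffic queued ahead on the old line.
    """
    arrivals = [(ahead + i + 1, seq) for i, seq in enumerate(queued)]
    start = ahead + len(queued) if drain_first else 0
    arrivals += [(start + i + 1, seq) for i, seq in enumerate(new)]
    return sorted(arrivals)


without = arrival_order([1, 2, 3], [4, 5, 6], drain_first=False)
with_drain = arrival_order([1, 2, 3], [4, 5, 6], drain_first=True)

print([seq for _, seq in without])      # [4, 5, 6, 1, 2, 3]
print([seq for _, seq in with_drain])   # [1, 2, 3, 4, 5, 6]
```

In this model packet #4 arrives in slot 7 instead of slot 1 when the queue is
drained first, so the ordering is bought with added delay.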


SCENARIO 2:  MaxMode=OFF

With MaxMode=OFF, we sequence (assign sequence numbers to) packets passing
between IB/3s.  These sequence numbers are meaningful to the IB/3s only and
are seen only on the serial lines.  (The receiving IB/3 strips the IB/3
sequencing number before forwarding the packet received on the serial line
to another serial line or the Ethernet.)

While the two lines are available, packets may arrive at the far IB/3 out of
sequence, but in most cases the far IB/3 will be able to put them into proper
sequence before forwarding them on.  There is a performance hit associated
with doing this, and there is a limit to how many packets can be
re-sequenced:  the queue depth is 32, so we can wait for an expected packet
until the next 31 packets have arrived.  (The expected packet would be, for
example, packet #5 if we had already received packets #1 through #4 and had
also received packet #6--we wait to see if we will receive #5.)  If we
receive 31 additional packets without receiving the expected packet, we have
to decide between continuing to wait for the expected packet (which may never
arrive, as it may have had a CRC error on the line) and forwarding the other
31 (and continuing to forward packets), letting the end-to-end transport or
session level retransmit the packet that was lost.  We forward the other 31
and let the end-systems retransmit the lost packet.
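
The re-sequencing rule can be sketched as follows.  This is a minimal model
built from the description above, not the IB/3 code:  the class and method
names are hypothetical, sequence-number wraparound is ignored, and dropping a
late packet after we have given up on it is an assumption.

```python
class Resequencer:
    """Hold out-of-sequence packets until the expected one arrives."""

    def __init__(self, max_waiting=31, first_seq=1):
        self.expected = first_seq       # next sequence number to forward
        self.max_waiting = max_waiting  # 31 held packets + 1 expected = 32
        self.held = set()

    def receive(self, seq):
        """Return the sequence numbers forwarded as a result of this packet."""
        if seq < self.expected:
            return []                   # assumed: late packet is dropped
        self.held.add(seq)
        forwarded = []
        while self.expected in self.held:
            self.held.remove(self.expected)
            forwarded.append(self.expected)
            self.expected += 1
        if len(self.held) >= self.max_waiting:
            # Give up on the missing packet: forward everything held, in
            # order, and leave the retransmission to the end-systems.
            forwarded = sorted(self.held)
            self.expected = forwarded[-1] + 1
            self.held.clear()
        return forwarded


r = Resequencer()
for seq in (1, 2, 3, 4, 6, 5):
    print(seq, "->", r.receive(seq))
```

In the run above, packet #6 is held until #5 arrives, and then both are
forwarded together.  If #5 never arrived, the 31st held packet would trigger
the give-up branch.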

So, we drop to a single line when the second line goes down.  As two lines
are configured for the same port, we probably continue to send packets with
sequence numbers even though we are sending all packets down the first line
now; if this is the case, I do not know why we would have a problem unless:

1.  There are so many out-of-sequence packets received at the far IB/3
that it cannot wait for the expected packets, so the end-systems have to
retransmit more often than they can tolerate.  Or (more likely), the
retransmissions, along with the normal traffic load, begin to overflow the
IB/3 Ethernet receive buffers, and the end-stations abort the session after
reaching their maximum number of retransmissions for a given packet.  However,
a terminal would be unlikely to lock up for this reason unless the packet lost
was control information and not data.

2.  There is a bug that causes our algorithm to get confused and continue to
wait, discarding all other packets received from the serial lines in favor of
the expected packet.  Eventually, the sequence numbers will roll and some
other packet will have the expected sequence number.  If there are about 4000
sequence numbers, we would not forward the next 4000 packets.  This would be
long enough for all sessions to drop.
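
The arithmetic behind item 2 can be sketched as follows.  The sequence space
of 4096 is an assumption chosen to match "about 4000"; the actual size is not
stated here.  The function names are hypothetical, and the packet rate is
left as a parameter because no rate is given.

```python
SEQ_SPACE = 4096  # assumed size of the sequence-number space


def discarded_before_recovery(expected, seq_space=SEQ_SPACE):
    """Count packets thrown away by a receiver that waits forever.

    The hypothetical bug: the receiver discards every packet whose sequence
    number is not `expected`, so it recovers only when the numbers wrap
    around and some later packet carries `expected` again.
    """
    discarded = 0
    seq = (expected + 1) % seq_space
    while seq != expected:
        discarded += 1
        seq = (seq + 1) % seq_space
    return discarded


def outage_seconds(discarded, packets_per_second):
    """Length of the outage for a given (assumed) packet rate."""
    return discarded / packets_per_second


print(discarded_before_recovery(5))   # 4095
```

Whatever the real packet rate is, every session crossing the link loses all
of its packets for that whole interval.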

But what happens with MaxMode=OFF when the second line comes back up?  All
packets are sequenced (as long as both lines have Path State=ON and both
lines are assigned to the port), and all arrive in sequence as long as we
have only a single line.  When the second line comes back, we begin using
it--with sequence numbers still being used, of course.  This should not be
a problem, unless there are too many out-of-sequence packets or there is a
bug.


CONCLUSIONS

MaxMode=OFF may not work for performance reasons, even if few serial
line errors occur.

MaxMode=ON probably causes out-of-sequence packets to arrive at the
destination when a line is brought back into service.

To determine conclusively what is causing the problem, we need a trace
on the segment where the terminals lock up, showing data to and from a
terminal from before the problem occurs until it occurs.

