PJ_WONG
Contributor

Packet timeout with unknown reason in Quantum Force 9100 only

Hi Checkmates,

Is anyone else using the Quantum Force 9100 and seeing intermittent packet timeouts with no identifiable cause?

The reason I specify the Quantum 9100 is that we performed a tech refresh from the Quantum 5000 series, and the issue only appeared after the move to the 9100, even though all configuration remained unchanged.

During the issue we did not see any drops in zdebug or in the SmartConsole logs. I do see some TX drops on the interface; could these be related? The packet volume does not seem very high, as shown in the attached image.

TAC said it was a hardware issue, but after an RMA the issue is still there.


Thanks.

 

 

8 Replies
Chris_Atkinson
Employee

Please describe the timeout issue in more detail.

Additionally, could you please share some further information, starting with the following (a quick way to collect most of it from the CLI is sketched after the list):

Major Version

Jumbo take

SecureXL mode (KPPAK or UPPAK)

Enabled blades
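
Most of this can be pulled straight from the gateway CLI, roughly along these lines (standard Gaia / Check Point commands; exact output format varies a little by version):

fw ver                            # major version
cpinfo -y all | grep -i jumbo     # installed hotfixes, including the Jumbo take
enabled_blades                    # enabled software blades
fwaccel stat                      # SecureXL / acceleration status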

 

CCSM R77/R80/ELITE
PJ_WONG
Contributor

Hi Chris,

The application appears to run a continuous connectivity check on TCP port 4, similar to a keepalive, and intermittently those checks time out.


Old Device - Check Point 5800

HOTFIX_R81_20_JUMBO_HF_MAIN      Take:  84

 

New Device - Check Point 9100

HOTFIX_R81_20_JUMBO_HF_MAIN      Take:  99


Only the Firewall blade is enabled. The 9100 runs UPPAK by default; after we hit the issue we switched to KPPAK, but the issue persists.
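
For reference, the acceleration state before/after the switch can be confirmed with the standard SecureXL commands (a rough sketch; the exact output differs between UPPAK and KPPAK):

fwaccel stat          # accelerator status and enabled acceleration features
fwaccel stats -s      # summary counters: accelerated vs. F2F (slowpath) packets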

 

Thanks.

the_rock
Legend

Hey @PJ_WONG 

I often work with a bank that purchased a brand-new 9100 cluster for one of their branches, and so far there have been no issues. Can you please confirm whether there are any relevant logs/drops related to this?

Andy

Chris_Atkinson
Employee

Are the QoS and flow control settings consistent with those on the switchports that connected the old gateways, and are any errors/drops reported there?

CCSM R77/R80/ELITE
PhoneBoy
Admin

Those interface errors point to a cable and/or switch issue.

PJ_WONG
Contributor

Hi Guys,

Thanks for your earlier responses.

Unfortunately, we do not have access to the switch, so we’re unable to check for errors, drops, or misconfigurations on the switchport side.

On the firewall side:

  • Only the Firewall blade is enabled; no QoS.

  • We do not observe any drops in "fw ctl zdebug + drop" or via SmartConsole logs. 

  • However, during the issue, we captured the following in "fw ctl kdebug" logs:

 

@;39117251.1778855408; 5Jun2025 16:55:13.085051;[cpu_2];[fw4_5];fwconn_ent_early_expiration: [now=1749113713] conn <dir 0, 172.30.94.88:55510 -> 172.30.79.5:4 IPP 6> reached early expiration;
@;39117251.1778855409; 5Jun2025 16:55:13.085052;[cpu_2];[fw4_5];fwconn_ent_early_expiration: return expire (timeout=25, aggr_timeout=5, new_ttl=20);
@;39117251.1778855410; 5Jun2025 16:55:13.085053;[cpu_2];[fw4_5];fwconn_ent_eligible_for_del : conn <dir 0, 172.30.94.88:55510 -> 172.30.79.5:4 IPP 6> is eligible for deletion;
@;39117251.1778855411; 5Jun2025 16:55:13.085054;[cpu_2];[fw4_5];fwconn_ent_early_expiration: [now=1749113713] conn <dir 0, 172.30.94.88:55504 -> 172.30.79.5:4 IPP 6> reached early expiration;
@;39117251.1778855412; 5Jun2025 16:55:13.085054;[cpu_2];[fw4_5];fwconn_ent_early_expiration: return expire (timeout=25, aggr_timeout=5, new_ttl=20);
@;39117251.1778855413; 5Jun2025 16:55:13.085055;[cpu_2];[fw4_5];fwconn_ent_eligible_for_del : conn <dir 0, 172.30.94.88:55504 -> 172.30.79.5:4 IPP 6> is eligible for deletion;
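
(For anyone wanting to reproduce the capture: this is roughly the standard kernel-debug sequence for the fw module's connection flags, not an exact transcript of what we ran; adjust buffer size, flags, and file name to your environment.)

fw ctl debug 0                          # reset any existing debug flags
fw ctl debug -buf 32000                 # allocate the kernel debug buffer
fw ctl debug -m fw + conn               # enable connection debugging in the fw module
fw ctl kdebug -T -f > /var/log/kdebug_port4.txt   # stream with timestamps to a file (name is arbitrary)
fw ctl debug 0                          # reset flags once the issue has been reproduced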



We tried disabling aggressive aging and increasing the timeout values for this specific service (TCP port 4) in SmartConsole, but these messages still appear.

Besides, this is not the only traffic that reaches early expiration; kdebug shows plenty of other traffic hitting the same condition.

These observations raised a few questions:

  1. Does the presence of "fwconn_ent_early_expiration" and "eligible_for_del" in kdebug necessarily indicate that aggressive aging is triggering? When I check the firewall there are still plenty of memory and CPU resources remaining (a quick sanity check is sketched after this list).

  2. Could this be related to some other timeout mechanism, or maybe a TCP/IP stack or connection tracking issue?

  3. Could it point to some underlying issue at L2/L1, even if no drops show up in SmartConsole or zdebug? 

  4. AFAIK, the peer switch is a Cisco device. Is it possible that microbursts are occurring? I'm not entirely sure how microbursting would manifest for the Quantum Force, but could the gateway be sending traffic faster than the switch can handle at times?
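
For question 1, a quick way to sanity-check the aggressive aging state and the connections table usage on the gateway (a sketch using standard commands):

fw ctl pstat                  # includes an "Aggressive Aging" status line plus memory/kernel statistics
fw tab -t connections -s      # current (#VALS) and peak (#PEAK) entries in the connections table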

Any guidance, especially from those who have dealt with early expiration or similar kdebug patterns, would be greatly appreciated.

 

Thanks in advance.


Chris_Atkinson
Employee

Generally you'll have other signs in your SmartConsole logs that aggressive aging is in effect.

Did you consider using bonds when deploying the new gateways, or was the scope simply a like-for-like swap?

I remember an issue years ago that was eventually tracked to default QoS/queue settings on the switchports; it only became apparent after moving to a bond "solved" a performance issue.

 

sk106126: TX-DRP usually indicates that there is a downstream issue and the gateway has to drop the packets, as it is unable to put them on the wire fast enough. Increasing the bandwidth through link aggregation or introducing flow control may be a possible solution to this problem.
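
As a rough starting point on the TX-DRP side (standard Linux/Gaia tools; counter names vary by NIC driver, and replace eth4 with your physical interface):

netstat -ni                                      # per-interface RX-DRP / TX-DRP counters
ethtool -S eth4 | grep -i -E "drop|err|pause"    # driver-level counters on the physical interface
ethtool -a eth4                                  # current flow control (pause) settings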

 

 

CCSM R77/R80/ELITE
Timothy_Hall
Legend

TX-DRPs are rare and usually indicate a buffering issue between SecureXL and the NIC driver for outbound traffic.  TX-DRPs do not indicate a NIC or cabling problem, especially when they are logged against tagged interfaces only (as in your case) and not the leading physical interface (eth4).  At least on the RX-DRP side, DRP hits reported on a tagged interface instead of the physical interface are almost always unknown EtherTypes and/or improperly pruned VLAN tags, or so-called "junk" traffic that we can't handle anyway.  I would assume the same is true for TX-DRP.

Given that you are only seeing this behavior on a 9100 which uses UPPAK by default, and that UPPAK has its tendrils sunk deep into the NIC driver code, UPPAK is probably responsible.  First step is to try forcing this traffic into the slowpath and see what happens (sk104468: How to exclude traffic from SecureXL) to determine if it is some issue with partial or full acceleration only, as the connection timers are handled somewhat differently there compared to the slowpath.  If that still doesn't help it is probably time to disable UPPAK from cpconfig and try again.  
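
A quick way to confirm which path the affected connection is actually taking before and after such a change (a sketch; the server IP is taken from your debug output):

fwaccel stats -s                      # overall accelerated vs. F2F (slowpath) packet counters
fwaccel conns | grep 172.30.79.5      # SecureXL connection entries and flags for the affected server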

I don't think those entries you saw in your debug are related to your problem; the timeout of 25 appears to be a TCP connection that is just starting, is going to fail anyway in the slowpath, and is getting expired early.

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices
Self-Guided Video Series Coming Soon