Frustrating Site to Site VPN Issue with Azure Virtual WAN

So we have been troubleshooting this for months and I'm about to lose my mind. We have an Azure Virtual WAN with a site-to-site VPN connection, everything is peered back to the Virtual WAN hub, and on premises we have a Cisco ASA. The thing can be up and running just fine for weeks, sometimes even a month or two, when suddenly we'll start having the strangest issues. The site-to-site VPN doesn't go down, but it's almost like it's not fully up either. We can remote out to most of our servers in Azure, and SaaS/PaaS services (like Arc and storage accounts) all still seem to be able to communicate back on premises, but our servers can't seem to reach back on premises. They start losing their trust with our domain, we can't ping them, and RDP may or may not work. It's also not like everything is on or off; it's more like we'll watch as hosts on one peer start to have trouble, then another, then another. So at first we don't even think to check the VPN.
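If it helps anyone in the same spot, here is a rough sketch of the kind of reachability sweep that makes the "hosts dropping one by one" pattern visible from an Azure VM; the host IPs and ports below are placeholders, not anything from our environment.

```python
# Rough reachability sweep, run from an Azure VM, to watch which on-prem hosts
# stop answering and when. Host IPs and ports are placeholders.
import socket
import time

ONPREM_HOSTS = ["10.10.0.10", "10.10.0.11", "10.10.0.12"]  # placeholder on-prem IPs
PORTS = [3389, 445]  # RDP and SMB as examples; adjust to whatever matters to you

def tcp_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for host in ONPREM_HOSTS:
        status = " ".join(
            f"{port}={'up' if tcp_up(host, port) else 'DOWN'}" for port in PORTS
        )
        print(f"{stamp} {host} {status}")
    time.sleep(60)
```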

Once we reset the gateway things come back. Sometimes they stay up for weeks or months again; other times certain services go down within 5 minutes of resetting the gateway while others are fine. At this point we usually open a ticket with Microsoft. They spend time going through logs, doing the usual back and forth forever without making any real suggestions. Within a day or two we end up rebooting the gateway for the billionth time, and voila, everything is fine for weeks!
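For anyone who ends up scripting the resets instead of clicking through the portal each time, here is a minimal sketch with the Python SDK. I'm assuming azure-mgmt-network exposes the Virtual WAN VPN gateway reset as vpn_gateways.begin_reset; the subscription, resource group, and gateway names are placeholders, so verify the operation name against your SDK version before relying on it.

```python
# Minimal sketch: reset a Virtual WAN S2S VPN gateway with the Python SDK.
# Assumes azure-identity and azure-mgmt-network are installed and that the SDK
# exposes the gateway reset operation as vpn_gateways.begin_reset -- verify
# against your SDK version before relying on it.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

SUBSCRIPTION_ID = "<subscription-id>"      # placeholder
RESOURCE_GROUP = "<vwan-hub-rg>"           # placeholder
GATEWAY_NAME = "<s2s-vpn-gateway-name>"    # placeholder

client = NetworkManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Long-running operation; .result() blocks until the reset completes.
poller = client.vpn_gateways.begin_reset(RESOURCE_GROUP, GATEWAY_NAME)
gateway = poller.result()
print(f"Reset finished, provisioning state: {gateway.provisioning_state}")
```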

Digging through the logs, we are seeing "unknown status win32" and "Gateway Tenant instance GatewayTenantWorker_IN_0 starting maintenance" just before this sort of thing starts. We believe the issue is Microsoft performing maintenance on their VPN hardware, but that doesn't explain why we see so much strange behavior for hours, or even days, after an event.
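In case it's useful for lining those maintenance events up against when hosts start dropping, this is roughly how I pull the relevant timestamps out of the exported gateway diagnostic logs. The file name and line layout are assumptions based on the snippets in this thread.

```python
# Rough scan of exported gateway diagnostic logs for the maintenance marker and
# the odd win32 status lines, so the timestamps can be compared with when hosts
# started dropping. The file name and line layout are assumptions.
import re

MARKERS = [
    "starting maintenance",
    "unknown status win32",
    "IkeCleanupMMNegotiation",
]

with open("gateway-diagnostics.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if any(marker.lower() in line.lower() for marker in MARKERS):
            # Keep the leading timestamp plus a short excerpt of the message.
            print(line.strip()[:200])
```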

We are actually dealing with this right now. We've reset the gateway three times so far. Some things manage to stay up; others are down within 5-30 minutes.

Any updates on this? I'm running into the same problem: the S2S connection from Azure to UniFi keeps dropping. The thing is that the connection used to work with the IPsec/IKE policy set to "Default", and now its status is "Not connected". I have to manually configure the IKE settings, and then the connection drops after approximately 2 hours. Then I have to change the encryption type to reset it, which makes it work for another short time before it drops again (there's a rough sketch of scripting that policy change below, after the log lines). The logs say:

6/11/2024 3:52:23 PM|SESSION_ID :{5****0f4-d***b-4***5-a****a-10*****b7b} IkeCleanupMMNegotiation called with error 13868 and flags 102

6/11/2024 3:52:23 PM|SESSION_ID :{5a***0f4-*****-4d***5-a01a-********} Not closing tunnel for mm, MM Owns Tunnel = 262144

6/11/2024 3:52:23 PM|[SEND Network Packet] Remote **.***.**.***:500: Local **.**.**.***:500: Packet : IKEVersion : IKEv2 ; iCookie : 0x4f11b60f66***** ; rCookie : 0xbbb9c0******* ; Exchange type : IKEv2 SA Init Mode ; Length : 36 ; NextPayload : NOTIFY ; Flags :

Any help appreciated
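For anyone wanting to script the custom IKE/IPsec policy change instead of toggling it in the portal each time, here is a minimal sketch. The vpn_connections operations and the IpsecPolicy field names reflect my reading of the azure-mgmt-network models, and the policy values are only examples, so double-check both against your SDK version and whatever the UniFi side expects.

```python
# Minimal sketch: apply an explicit IKE/IPsec policy to a Virtual WAN S2S
# connection instead of leaving it on "Default". Model and field names are my
# assumptions from the azure-mgmt-network SDK -- verify before using.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import IpsecPolicy

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "<vwan-hub-rg>"        # placeholder
GATEWAY_NAME = "<s2s-vpn-gateway>"      # placeholder
CONNECTION_NAME = "<site-connection>"   # placeholder

client = NetworkManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Fetch the existing connection and attach an explicit policy (example values only).
connection = client.vpn_connections.get(RESOURCE_GROUP, GATEWAY_NAME, CONNECTION_NAME)
connection.ipsec_policies = [
    IpsecPolicy(
        sa_life_time_seconds=3600,
        sa_data_size_kilobytes=102400000,
        ipsec_encryption="AES256",
        ipsec_integrity="SHA256",
        ike_encryption="AES256",
        ike_integrity="SHA256",
        dh_group="DHGroup14",
        pfs_group="None",
    )
]

poller = client.vpn_connections.begin_create_or_update(
    RESOURCE_GROUP, GATEWAY_NAME, CONNECTION_NAME, connection
)
print(poller.result().connection_status)
```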

Have you configured tunnel keepalive? I had a similar issue and found that the cause was that it was missing. It could also be that the ASA is tearing down the tunnel whenever a certain event happens, so what do the ASA logs say? Could you send a snippet of the ASA configuration?

Here's an example of what I see in the logs around the time I got a notice after resetting the gateway last night. For reference, I got a notice about the server being unreachable at 8:29 PM from a separate monitoring tool.

8:22 - SESSION_ID :{#} IkeCleanupMMNegotiation called with error 13805 and flags 0

8:24 - SESSION_ID :{#} IkeCleanupMMNegotiation called with error 13805 and flags 0

8:26 - SESSION_ID :{#} IkeCleanupMMNegotiation called with error 13805 and flags 0

8:28 - SESSION_ID :{#} Not closing tunnel for mm, MM Owns Tunnel = 262144

8:28 - SESSION_ID :{#} Not closing tunnel for mm, MM Owns Tunnel = 262144

8:28 - SESSION_ID :{#} IkeCleanupMMNegotiation called with error 13805 and flags 0

8:32 - SESSION_ID :{#} IkeCleanupMMNegotiation called with error 13805 and flags 0

8:34 - SESSION_ID :{#} Process Payload Notify - NotifyType = 16393

8:34 - SESSION_ID :{#} Remote LocalIP:500: Local AzureIP:500: Received Traffic Selector payload request- [Tsid 0x158 ]Number of TSIs 2: StartAddress # EndAddress # PortStart 0 PortEnd 65535 Protocol 0, StartAddress # EndAddress # PortStart 0 PortEnd 65535 Protocol 0 Number of TSRs 2:StartAddress # EndAddress #PortStart 0 PortEnd 65535 Protocol 0, StartAddress # EndAddress # PortStart 0 PortEnd 65535 Protocol 0

8:34 - SESSION_ID :{#} Remote #:500: Local #:500: [SEND] Proposed Traffic Selector payload will be (Final Negotiated) - [Tsid 0x158 ]Number of TSIs 2: StartAddress # EndAddress # PortStart 0 PortEnd 65535 Protocol 0, StartAddress # EndAddress # PortStart 0 PortEnd 65535 Protocol 0 Number of TSRs 2:StartAddress # EndAddress # PortStart 0 PortEnd 65535 Protocol 0, StartAddress # EndAddress # PortStart 0 PortEnd 65535 Protocol 0

8:34 - SESSION_ID :{#} Remote #:500: Local #:500: [RECEIVED]Received IPSec payload: Policy1:Cipher=AES-CBC-256 Integrity=SHA256 PfsGroup=PfsNone

8:34 - SESSION_ID :{#} Remote #:500: Local #:500: [SEND][CHILD_QM_SA] Sending CREATE_CHILD QM_SA response message for tunnelId 0x2 and tsId 0x158

8:34 - SESSION_ID :{#} Remote #:500: Local #:500: [SEND]Sending IPSec policy Payload for tunnel Id 0x2, tsId 0x158: Policy1:Integrity=SHA256 Cipher=AES-CBC-256

8:34 - SESSION_ID :{#} Remote #:500: Local #:500: [RECEIVED][SA_DELETE] Received IPSec SA delete message for tunnelid 0x2 and tsid 0x156

8:34 - SESSION_ID :{#} Remote #:500: Local #:500: [SEND][SA_DELETE] Sending IKE SA delete ACK for icookie 0x2D41F56A1A81013F and rCookie 0x6AA4CE2A7AB44889

After the VPN was re-established, I started seeing the IkeCleanup and "Not closing tunnel" messages again:

8:36 - SESSION_ID :{#} IkeCleanupMMNegotiation called with error 13805 and flags 0

The tunnel appears to be up the entire time we are having connectivity issues, but things are just randomly down. I didn't do a packet capture for this recent incident, and of course after resetting our virtual gateway again this morning things appear to be stable.
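If it flares up again, a gateway-side packet capture would be the next thing worth grabbing alongside these IKE logs. A sketch only: I'm assuming the Python SDK surfaces the Virtual WAN packet-capture operations as vpn_gateways.begin_start_packet_capture / begin_stop_packet_capture, so verify against your SDK version first; the names below are placeholders.

```python
# Sketch only: kick off a packet capture on the Virtual WAN VPN gateway the next
# time this happens, so there is a trace to line up against the IKE logs.
# I'm assuming the SDK exposes this as vpn_gateways.begin_start_packet_capture
# (and a matching begin_stop_packet_capture) -- verify against your SDK version.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "<vwan-hub-rg>"        # placeholder
GATEWAY_NAME = "<s2s-vpn-gateway>"      # placeholder

client = NetworkManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Long-running operation; the result describes the capture session.
poller = client.vpn_gateways.begin_start_packet_capture(RESOURCE_GROUP, GATEWAY_NAME)
print(poller.result())

# Reproduce the issue, then stop the capture with begin_stop_packet_capture,
# which needs a SAS URL for the storage container that receives the .pcap files.
```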