So we have been troubleshooting this for months and I’m about to lose my mind. We have an Azure Virtual WAN with a site-to-site VPN connection, everything is peered back to the Virtual WAN hub, and on-premises we have a Cisco ASA. The thing can be up and running just fine for weeks, sometimes even a month or two, when suddenly we’ll start having the strangest issues. The site-to-site VPN doesn’t go down, but it’s almost like it’s not fully up either. We can remote out to most of our servers in Azure, and SaaS/PaaS services (like Arc and storage accounts) all still seem to be able to communicate back on-premises, but our servers can’t seem to reach back on-premises. They start losing their trust with our domain, we can’t ping them, and RDP may or may not work. It’s also not like everything is on or off; it’s more like we’ll watch as hosts on one peer start to have trouble, then another, then another. So at first we don’t even think to check the VPN.
Once we reset the gateway, things come back. Sometimes they stay up for weeks or months again; other times certain services go down within 5 minutes of resetting the gateway, while others may be fine. At this point we usually open a ticket with Microsoft. They spend time going through logs, doing the usual back-and-forth forever without making any real suggestions. Within a day or two we end up rebooting the gateway for like the billionth time, and voila, everything will be fine for weeks!
Digging through the logs, we are seeing “unknown status win32” and “Gateway Tenant instance GatewayTenantWorker_IN_0 starting maintenance” just before this sort of thing starts. We believe the issue is Microsoft performing maintenance on their VPN hardware, but that doesn’t explain why we have so much strange behavior for hours or days after an event.
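In case it helps anyone trying to line up outages with those messages, here is a rough sketch of how the two log strings quoted above could be picked out of an exported log. It assumes the gateway diagnostics have been exported to plain text with one entry per line; that export format, the function name `find_marker_lines`, and the sample lines are all made up for illustration and are not what Azure actually produces.

```python
# Substrings taken from the two messages quoted in the post.
# Matching is a case-insensitive substring search, nothing smarter.
MARKERS = (
    "unknown status win32",
    "starting maintenance",
)


def find_marker_lines(lines):
    """Return (line_number, marker, text) for every line containing a marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        for marker in MARKERS:
            if marker in lowered:
                hits.append((number, marker, line.strip()))
    return hits


# Hypothetical sample: real log entries will carry timestamps and other fields.
sample = [
    "some unrelated log entry",
    "Gateway Tenant instance GatewayTenantWorker_IN_0 starting maintenance",
    "unknown status win32",
]

for number, marker, text in find_marker_lines(sample):
    print(f"line {number}: [{marker}] {text}")
```

Running the same function over the lines of a real export (for example `find_marker_lines(open(path, encoding="utf-8"))`, where `path` is wherever the export was saved) would give a list of entries to compare against the times hosts started dropping off.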
We are actually dealing with this right now. We’ve reset the gateway 3 times so far. Some things manage to stay up, others are down within 5-30 minutes.