This is a technical article that goes into some depth on networking protocols. To follow it, you will need a general understanding of how UDP networking works, and a basic idea of what NAT and IP MASQ are.
We’ve just spent quite a while investigating issues with a small number of games failing to make connections to perfectly good servers. Grapple, the LGP networking layer, is designed to work around the presence of firewalls, NAT, and all that good stuff, so that users can host a game from any machine on their network, whether it be exposed to the outside world or not.
So, when we found out that some people couldn’t connect to games, we were a bit concerned. After all, our code should be just fine for this. It has worked in all of our test scenarios.
Unfortunately we assumed that all networking at the kernel level worked. This was a mistake.
So, take a scenario. Let’s say host A (let’s call it gateway) is the gateway server, running IP MASQ to NAT all machines on its local network. Host B (call it laptop) is a machine on the internal network.
Now, as we go through this article I will cover some aspects of NAT, STUN, TURN, and other nice things about firewall and NAT handling. I won’t be going through the exact protocols, but I’ll give you a bit of an idea of how it all works.
So, Laptop is the machine I want to use to start up a game server. I start a server and bind a UDP socket to port 2000 on all interfaces. I then use STUN to find out what kind of connection I am on. The method for STUN is shown below.
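As a rough illustration of that first step, here is a minimal sketch in Python. It binds a UDP socket on all interfaces and builds an empty STUN Binding Request using the header layout from RFC 5389. The function names are made up for this example, and the article does not describe the wire format Grapple actually uses, so treat the details as an assumption.

```python
import os
import socket
import struct
from typing import Optional

STUN_BINDING_REQUEST = 0x0001
STUN_MAGIC_COOKIE = 0x2112A442  # fixed value defined by RFC 5389


def open_game_socket(port: int = 2000) -> socket.socket:
    """Bind a UDP socket to the given port on all interfaces."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    return sock


def build_binding_request(transaction_id: Optional[bytes] = None) -> bytes:
    """Return a 20-byte STUN Binding Request that carries no attributes."""
    if transaction_id is None:
        transaction_id = os.urandom(12)
    if len(transaction_id) != 12:
        raise ValueError("transaction id must be exactly 12 bytes")
    # Message type (2 bytes), attribute length (2 bytes, 0 here), magic cookie (4 bytes).
    header = struct.pack("!HHI", STUN_BINDING_REQUEST, 0, STUN_MAGIC_COOKIE)
    return header + transaction_id
```

The request would then be sent from the game socket itself, for example `sock.sendto(build_binding_request(), (stun_host, 3478))`, where `stun_host` is a placeholder for whichever STUN server is in use. Sending from the same socket matters, because the NAT mapping being probed belongs to that socket.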
STUN tells me that the server is behind a symmetric NAT. This means that the only things that can communicate with the server are things that the server itself has talked to first.
This is the most restrictive type of NAT. Under usual circumstances it is effectively impossible to make connections to this laptop game server without manually adding a route for each client to the NAT routing table on the gateway machine. Any client will try to connect to the laptop, but its packets will be rejected because the laptop has not first connected to the client.
However, Grapple can handle this, as I will demonstrate.
So, on the gateway machine (it is important for later on to note that the gateway machine is also the client trying to connect to the laptop server to play the game), I start up a client. The client tries to directly connect to the server, sending a number of UDP connection requests. The server does not receive them.
This is not unexpected. As described earlier, the server is behind a symmetric NAT and hasn’t sent any data to the client, so the server’s routing will reject anything sent to it from the client.
So, at this point it looks fairly hopeless. How can the server magically know the client is trying to connect to it?
Well, the answer lies in the STUN server. The laptop server has already talked to the STUN server and so the NAT routing tables allow the STUN server to send more data to the laptop.
So, the client gateway knows about the same STUN server, and so sends a request to the STUN server to ask the laptop server to connect to it.
The laptop receives this request, and connects to the address that the STUN server requests. The client should then, we hope, be able to communicate.
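The sequence so far can be played through with a toy model of the NAT’s filtering rule. This is only a sketch of the behaviour described above, not real NAT code: the class name and the addresses are invented for the example, and port translation is ignored.

```python
class RestrictiveNat:
    """Toy model: inbound traffic is accepted only from endpoints
    that the inside host has already sent a packet to."""

    def __init__(self):
        self.allowed = set()  # remote (ip, port) pairs seen in outbound traffic

    def outbound(self, remote):
        self.allowed.add(remote)

    def inbound_allowed(self, remote):
        return remote in self.allowed


STUN_SERVER = ("198.51.100.7", 3478)  # hypothetical address
CLIENT = ("203.0.113.5", 40000)       # hypothetical address

nat = RestrictiveNat()
nat.outbound(STUN_SERVER)             # the laptop queried STUN at startup

direct_attempt = nat.inbound_allowed(CLIENT)        # the client's first requests
relayed_request = nat.inbound_allowed(STUN_SERVER)  # "please connect to CLIENT"
nat.outbound(CLIENT)                                # the laptop sends to the client
after_punch = nat.inbound_allowed(CLIENT)

print(direct_attempt, relayed_request, after_punch)  # False True True
```

The only thing that changes between the failed direct attempt and the later success is that the laptop has sent one packet towards the client, which is exactly what the request via the STUN server is for.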
Now at this point this can go two ways.
1) Client is not behind a symmetric NAT or firewall
In this case, the client receives the data packet, and communication can start. We’re pretty much done handshaking, the client and server can talk to each other.
2) Client IS behind a symmetric NAT or firewall
The client will never receive the packet from the server. No communication is possible.
Now, in the instance in question, we discover the situation is an impossible one. The client receives the packet from the server, but then is unable to reply.
This should be completely impossible. The routing table has seen a packet go from the laptop to the gateway, and so should allow two-way communication between the two machines (on the same ports). But for some reason, it didn’t!
This was the sticking point. How could this be possible? We knew that the laptop worked this way because we had communicated with the STUN server successfully, and this relies on the fact that an outbound packet creates a tunnel back to the server. This is vitally important for all networking, or you would just send out data and never get anything back, rendering NAT pretty much useless.
The issue turned out to be the IP MASQ NAT on the gateway.
The laptop sent a packet to the external IP address of the client. This means that it should have gone out to the NAT layer, and been NAT’d to the external IP and port of the server socket communicating to the client. The client would have received a request that external IP 123.123.123.123 was trying to talk, and would respond to that address. The response would have gone through the same IP MASQ layer, and the server would get back a response from the location it expected.
HOWEVER
IP MASQ has decided in its infinite wisdom to take a short cut. It spots that a packet will be routed to the gateway, and decides that, hey, they are on the same internal network. They can talk directly! And so, the packet presented to the client does not show as coming from the outside internet; it shows as coming from 10.2.2.13 (the internal IP address of the laptop). When the client gets this packet, it sends a response back to that address. Because the response goes over the internal network, it never passes through IP MASQ, and so it arrives at the server from 10.2.1.1. As the server sent its data to 123.123.123.123, the server’s own firewall (independent of the NAT) says ‘hey, I don’t know you’ and rejects the packet.
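To make the mismatch concrete, here is a small simulation of the two paths, using the addresses from this article. It is a toy model with hypothetical function names, it ignores ports entirely, and the firewall check is reduced to the single rule "a reply must come from the address I sent to".

```python
from typing import NamedTuple


class Packet(NamedTuple):
    src: str
    dst: str


GATEWAY_EXTERNAL = "123.123.123.123"
GATEWAY_INTERNAL = "10.2.1.1"
LAPTOP = "10.2.2.13"


def seen_by_client(pkt: Packet, shortcut: bool) -> Packet:
    """The laptop's packet as the client on the gateway sees it."""
    if shortcut:
        return pkt  # observed: source address left untouched
    return Packet(src=GATEWAY_EXTERNAL, dst=pkt.dst)  # expected: masqueraded


def reply_at_laptop(seen: Packet, shortcut: bool) -> Packet:
    """The client's reply as it arrives back at the laptop."""
    if shortcut:
        # Sent over the internal network, so it carries the internal address.
        return Packet(src=GATEWAY_INTERNAL, dst=seen.src)
    # Passed back through the masquerading layer.
    return Packet(src=GATEWAY_EXTERNAL, dst=LAPTOP)


def laptop_accepts(sent: Packet, reply: Packet) -> bool:
    """Stateful check: a reply must come from the address we sent to."""
    return reply.src == sent.dst


def simulate(shortcut: bool) -> bool:
    sent = Packet(src=LAPTOP, dst=GATEWAY_EXTERNAL)
    reply = reply_at_laptop(seen_by_client(sent, shortcut), shortcut)
    return laptop_accepts(sent, reply)


print(simulate(shortcut=False), simulate(shortcut=True))  # True False
```

With the short cut in place the reply is perfectly deliverable, it just carries a source address the laptop never sent anything to.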
All because IP MASQ takes a short cut.
In the end, the problem is solved like this: if the client doesn’t receive further communication from the server, it falls back to its final communication option, TURN. We have already demonstrated that the STUN server can talk to both sides, as both sides initiated communication with it. This means that the STUN server can act as a relay between the client and the server, with any and all data passing via this third party. However, to be honest, we don’t like doing that. It obviously increases the latency of games, and it also means that all game traffic that has to resort to TURN passes through our servers. A few games running at once, no problem, but more and more games means more and more bandwidth used, and we’d like to avoid that!
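The order of preference can be sketched as a simple fallback chain. The method names and the stand-in callables below are hypothetical and are not taken from Grapple. Each callable stands for one real connection attempt and returns whether it succeeded.

```python
from typing import Callable, Optional, Sequence, Tuple

Attempt = Tuple[str, Callable[[], bool]]


def connect_with_fallback(attempts: Sequence[Attempt]) -> Optional[str]:
    """Try each method in order and return the name of the first that works."""
    for name, attempt in attempts:
        if attempt():
            return name
    return None


# Stand-ins for the real attempts; in the case described here only the relay works.
attempts = [
    ("direct", lambda: False),
    ("stun-assisted", lambda: False),
    ("turn-relay", lambda: True),
]
print(connect_with_fallback(attempts))  # turn-relay
```

Keeping the relay last reflects the cost described above: it is the only option that adds latency and uses third-party bandwidth, so it should only be reached when the cheaper methods have failed.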
So in the end, the problem is solved, but really, if not for IP MASQ making a shortcut for efficiency, there would never have been a problem in the first place.
[Edit: I am adding the layout of the network below, to explain a little better]