Today I want to talk about an issue I was having with UDP sockets in Go, and how I learned more about my program by making use of two tools that ship with Mac OS X called dtruss, and DTrace.
I do most of my programming on a Mac, and I have been working on a project written in Go which makes use of UDP sockets. When trying to add some unit tests to my code I decided to create one around reconnections. In my test I created two sockets on localhost with two different ports, read/wrote from them, closed them, and then repeated this several hundred times. Eventually I would receive a "bind: address already in use" error upon the creation a socket. I wondered, why was this happening if I was closing the sockets ?
In Go we can use the net.ListenPacket
method to start listening for UDP packets on a given network address. This function returns a net.PacketConn
interface.
According to the Go documentation for PacketConn
, the Close
method should either close the connection, or return an error:
type PacketConn interface {
...
// Close closes the connection.
// Any blocked ReadFrom or WriteTo operations will be unblocked and return errors.
Close() error
...
}
If this were true then why was I getting these EADDRINUSE errors ? Time to investigate 🔎 But first a brief recap on system calls.
Systems calls are the way your program interfaces with your operating system. In unix-based x86 systems specifically your compiled code will contain a software interrupt which is passed a value corresponding to a function in the kernel. In our case we were interested in two system calls; bind
and close
. bind
assigns a file descriptor to a network address. close
deallocates a file descriptor.
How can we see when our program is making system calls like bind
and close
?
dtruss is a tool that ships with Mac OS X. It was written by Brendan Gregg. It is similar to strace on linux systems. You can use it to see which system calls a program is making to the kernel. This means we can see if a program is attempting to close a socket when we expect it to, or if in fact the Go standard library is lying to us. After running dtruss you will see something like this:
$ sudo dtruss -a -t bind go test -run TestReconnect
PID/THRD RELATIVE ELAPSD CPU SYSCALL(args) = return
82830/0x978a1c: 1357371 99 95 close(0x5) = 0 0
82830/0x978a20: 1378137 20 16 socket(0x2, 0x2, 0x0) = 5 0
82830/0x978a20: 1378139 4 0 fcntl(0x5, 0x2, 0x1) = 0 0
82830/0x978a20: 1378142 2 0 fcntl(0x5, 0x3, 0x0) = 2 0
82830/0x978a20: 1378144 4 1 fcntl(0x5, 0x4, 0x6) = 0 0
82830/0x978a20: 1378148 6 2 setsockopt(0x5, 0xFFFF, 0x20) = 0 0
82830/0x978a20: 1378161 8 4 bind(0x5, 0xC821AA576C, 0x10) = -1 Err#48
So here we can see a bunch of interesting information.
One interesting thing here is that on the last line at the very end we see Err#48
. This is our "bind: address already in use" error. In this particular trace we can see on the first line it looks like we close the 0x5
socket with no errors, and in the last line it says 0x5
is already in use when we try to bind
it. Confusingly the 0x5
just refers to the file descriptor, and not the actual network address. The network address is stored in the second argument 0xC821AA576C
above. This is a pointer to the sockaddr
struct passed to the kernel, which when dereferenced would look something like this:
struct sockaddr {
sa_family_t sa_family;
char sa_data[14]; // 14 bytes of protocol address
}
The part we're interested in is the sa_data
field.
So we need to dereference the pointer somehow. How is this possible ? Enter DTrace!
Dtrace is a powerful debugging tool originally built by Sun Microsystems for Solaris. It is also available on Mac OS X. The amazing thing about Dtrace is that it ships with its own scripting language called D. With these scripts you can write tracing programs which will execute upon entry/return of specified system calls. So using this we can actually write some tracing around bind
to dereference the sockaddr
struct, and ultimately find out which port is being bound.
dtrace -c "go test -run TestReconnect" -n 'bind:entry { trace(arg0); trace(copyinstr(arg1)); }' -f bind
This command is saying, "Execute my Go program, but whenever you see a bind
system call invocation, run this D script." In the script itself I'm tracing the first two arguments. With the second argument I am tracing the result of a function called copyinstr
. In D, the copyinstr
function will dereference a pointer in user space and output the result as a string.
Running DTrace with these arguments gives me something like this:
0 358 bind:entry 5
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
0: 10 02 0b b9 7f 00 00 00 00 00 00 00 00 00 00 00 ................
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
...
At the end of the first line we see the 5
file descriptor being used like before. On the next lines we see the output of copyinstr(arg1)
. This is the dereferenced pointer to the sockaddr
struct. It looks like mostly unused memory. If we look at 3rd and 4th bytes in the above output we see the port number of the socket which is 0x0bb9
(3001). Now I can see that file descriptor 0x5
is binding to port 3001. Whew! 😅 Using this tool I was able to see that in fact an existing file descriptor was still bound to the same port when the subsequent bind
call was made.
After digging through the Go standard library I eventually found my answer. Go documentation does mention that any blocked ReadFrom
or WriteTo
calls will be unblocked and return errors. What it fails to mention is that until that happens, Go will not close the socket. Meaning that even after calling Close()
on a net.PacketConn
there is no guarantee that the socket has actually been closed. This is due to a mutex around the underlying file descriptor in Go. The solution in my case was to use a semaphore in my code to ensure that all ReadFrom
and WriteTo
calls had returned before attempting to open a new socket on the same port.
The important thing about this exercise for me was being able to know for certain that I understood what my program was doing, and what my operating system was doing. This means I could be confident that the problem was sitting in between these layers in the stack.
Happy debugging! 😄