Unix Socket FAQ

12. Where can I get a library for programming sockets?

You can't force it. Period. TCP makes up its own mind as to when it can send data. Now, normally when you call write() on a TCP socket, TCP will indeed send a segment, but there's no guarantee and no way to force this. There are lots of reasons why TCP will not send a segment: a closed window and the Nagle algorithm are two things to come immediately to mind.

(Snipped suggestion from Andrew Gierth to use TCP_NODELAY)

Setting this only disables one of the many tests, the Nagle algorithm. But if the original poster's problem is this, then setting this socket option will help.

A quick glance at tcp_output() shows around 11 tests TCP has to make as to whether to send a segment or not.

Now from Dr. Charles E. Campbell Jr. (cec@gryphon.gsfc.nasa.gov):

As you've surmised, I've never had any problem with disabling Nagle's algorithm. It's basically a buffering method; there's a fixed overhead for all packets, no matter how small. Hence, Nagle's algorithm collects small packets together (with no more than a 0.2 second delay) and thereby reduces the number of overhead bytes being transferred. This approach works well for rcp, for example: the 0.2 second delay isn't humanly noticeable, and multiple users have their small packets transferred more efficiently. It helps in university settings where most folks using the network are using standard tools such as rcp and ftp, and programs such as telnet may use it, too.

However, Nagle's algorithm is pure havoc for real-time control and not much better for keystroke-interactive applications (control-C, anyone?). It has seemed to me that the types of new programs using sockets that people write usually do have problems with small packet delays. One way to bypass Nagle's algorithm selectively is to use "out-of-band" messaging, but that is limited in its content and has other effects, such as a loss of sequentiality. (By the way, out-of-band is often used for that ctrl-C, too.)

More from Vic:

So to sum it all up, if you are having trouble and need to flush the socket, setting the TCP_NODELAY option will usually solve the problem. If it doesn't, you will have to use out-of-band messaging, but according to Andrew, "out-of-band data has its own problems, and I don't think it works well as a solution to buffering delays (haven't tried it though). It is not 'expedited data' in the sense that exists in some other protocols; it is transmitted in-stream, but with a pointer to indicate where it is."

I asked Andrew something to the effect of "What promises does TCP make about when it will get around to writing data to the network?" I thought his reply should be put under this question:

Not many promises, but some.

I'll try and quote chapter and verse on this:

References:

RFC 1122, "Requirements for Internet Hosts" (also STD 3)
RFC 793, "Transmission Control Protocol" (also STD 7)

The socket interface does not provide access to the TCP PUSH flag.
RFC1122 says (4.2.2.2): A TCP MAY implement PUSH flags on SEND calls. If PUSH flags are not implemented, then the sending TCP: (1) must not buffer data indefinitely, and (2) MUST set the PSH bit in the last buffered segment (i.e., when there is no more queued data to be sent).
RFC793 says (2.8): When a receiving TCP sees the PUSH flag, it must not wait for more data from the sending TCP before passing the data to the receiving process. [RFC1122 supports this statement.]
Therefore, data passed to a write() call must be delivered to the peer within a finite time, unless prevented by protocol considerations.
There are (according to a post from Stevens quoted in the FAQ [earlier in this answer - Vic]) about 11 tests made which could delay sending the data. But as I see it, there are only 2 that are significant, since things like retransmit backoff are a) not under the programmer's control and b) must either resolve within a finite time or drop the connection.

The first of the interesting cases is "window closed" (i.e., there is no buffer space at the receiver; this can delay data indefinitely, but only if the receiving process is not actually reading the data that is available).

Vic asks:

OK, it makes sense that if the client isn't reading, the data isn't going to make it across the connection. I take it this causes the sender to block after the receive queue is filled?

The sender blocks when the socket send buffer is full, so buffers will be full at both ends.

While the window is closed, the sending TCP sends window probe packets. This ensures that when the window finally does open again, the sending TCP detects the fact. [RFC1122, section 4.2.2.17]

The second interesting case is "Nagle algorithm" (small segments, e.g. keystrokes, are delayed to form larger segments if ACKs are expected from the peer; this is what is disabled with TCP_NODELAY)

Vic Asks:

Does this mean that my tcpclient sample should set TCP_NODELAY to ensure that the end-of-line code is indeed put out onto the network when sent?

No. tcpclient.c is doing the right thing as it stands: trying to write as much data as possible in as few calls to write() as is feasible. Since the amount of data is likely to be small relative to the socket send buffer, it is likely (since the connection is idle at that point) that the entire request will require only one call to write(), and that the TCP layer will immediately dispatch the request as a single segment (with the PSH flag; see requirement (2) of RFC1122 section 4.2.2.2, quoted above).

The Nagle algorithm only has an effect when a second write() call is made while data is still unacknowledged. In the normal case, this data will be left buffered until either: a) there is no unacknowledged data; or b) enough data is available to dispatch a full-sized segment. The delay cannot be indefinite, since condition (a) must become true within the retransmit timeout or the connection dies.

Since this delay has negative consequences for certain applications, generally those where a stream of small requests is being sent without response, e.g. mouse movements, the standards specify that an option must exist to disable it. [RFC1122, section 4.2.3.4]

Additional note: RFC1122 also says:

[DISCUSSION]:: When the PUSH flag is not implemented on SEND calls, i.e., when the application/TCP interface uses a pure streaming model, responsibility for aggregating any tiny data fragments to form reasonable sized segments is partially borne by the application layer.

So programs should avoid calls to write() with small data lengths (small relative to the MSS, that is); it's better to build up a request in a buffer and then do one call to sock_write() or equivalent.
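The advice above (build the request in a buffer, then hand it to TCP with one call) might look something like the following. This is only a sketch: write_all() and send_request() are names made up for this illustration, and the "command argument\r\n" request format is an assumption, not part of any particular protocol.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on short writes and EINTR.
   Returns 0 on success, -1 on error (errno is set by write()). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before anything was written */
            return -1;
        }
        p += n;
        len -= (size_t) n;
    }
    return 0;
}

/* Hypothetical request format: "<command> <argument>\r\n".
   The whole line is assembled first, so TCP sees one write()
   instead of several tiny ones. */
int send_request(int fd, const char *command, const char *argument)
{
    char request[512];
    int len = snprintf(request, sizeof request, "%s %s\r\n",
                       command, argument);

    if (len < 0 || (size_t) len >= sizeof request) {
        errno = EMSGSIZE;       /* request would not fit in the buffer */
        return -1;
    }
    return write_all(fd, request, (size_t) len);
}
```

The loop in write_all() is there because write() may accept fewer bytes than requested; in the common case of a small request on an idle connection it runs once.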

The other possible sources of delay in the TCP are not really controllable by the program, but they can only delay the data temporarily.

Vic asks:

By temporarily, you mean that the data will go as soon as it can, and I won't get stuck in a position where one side is waiting on a response, and the other side hasn't received the request? (Or at least I won't get stuck forever)

You can only deadlock if you somehow manage to fill up all the buffers in both directions... not easy.

If it is possible to do this, (can't think of a good example though), the solution is to use nonblocking mode, especially for writes. Then you can buffer excess data in the program as necessary.
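For what it's worth, here is roughly what "nonblocking mode with excess data buffered in the program" could look like. Treat it as a sketch under assumptions: struct outbuf and the three function names are invented for this example, and a real program would call outbuf_flush() again whenever select() reports the socket writable.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical application-side output queue. */
struct outbuf {
    char   data[8192];
    size_t len;                 /* bytes still waiting to be written */
};

/* Put a descriptor into nonblocking mode; returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Append data to the queue; returns -1 if it does not fit. */
int outbuf_add(struct outbuf *ob, const void *src, size_t n)
{
    if (n > sizeof ob->data - ob->len)
        return -1;
    memcpy(ob->data + ob->len, src, n);
    ob->len += n;
    return 0;
}

/* Write as much queued data as the socket accepts right now.
   Returns 0 if there was no hard error (data may remain queued),
   -1 on a real error. */
int outbuf_flush(struct outbuf *ob, int fd)
{
    while (ob->len > 0) {
        ssize_t n = write(fd, ob->data, ob->len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return 0;       /* send buffer full; try again later */
            return -1;
        }
        /* Drop what was written and keep the rest at the front. */
        memmove(ob->data, ob->data + n, ob->len - (size_t) n);
        ob->len -= (size_t) n;
    }
    return 0;
}
```

Because the write never blocks, the program stays free to keep reading from the peer, which is what prevents both directions from filling up at once.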

There is the Simple Sockets Library by Charles E. Campbell, Jr. PhD. and Terry McRoberts. The file is called ssl.tar.gz, and you can download it from this FAQ's home page. For C++ there is the Socket++ library, which is at ftp://ftp.virginia.edu/pub/socket++-1.11.tar.gz. There is also C++ Wrappers; the file is ftp://ftp.huji.ac.il/pub/languages/C++/C++_wrappers.tar.gz. Thanks to Bill McKinnon for tracking it down for me! You should be able to find the ACE toolkit from http://www.cs.wustl.edu/~schmidt. Another C++ library called libtcp++ is available at http://www.sashanet.com/internet/download.html. PING Software Group has some libraries that include a sockets interface among other things, although it seems to be all Java stuff now. You can find them at http://www.dystance.net/ping/pingutil/index.html. Thanks to Jim Kassabian for hunting that down for us (again)!

Philippe Jounin has developed a cross-platform library which includes high-level support for the http and ftp protocols, with more to come. You can find it at http://perso.magic.fr/jounin-ph/P_tcp4u.htm, and you can find a review of it at http://www6.zdnet.com/cgi-bin/texis/swlib/hotfiles/info.html?fcode=000H4F

I don't have any experience with any of these libraries, so I can't recommend one over the other.

13. How come select says there is data, but read returns zero?

The data that causes select to return is the EOF because the other side has closed the connection. This causes read to return zero. For more information see 2.1 How can I tell when a socket is closed on the other end?
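In code, the situation looks something like this. It is a minimal sketch; wait_and_read() is a name invented for the example, and it waits on a single descriptor only to keep things short.

```c
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* Wait until fd is readable, then read once.
   Returns >0 for data, 0 when the peer has closed the connection
   (end of file), and -1 on error. */
ssize_t wait_and_read(int fd, void *buf, size_t len)
{
    fd_set readfds;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    /* NULL timeout: block until the descriptor becomes readable. */
    if (select(fd + 1, &readfds, NULL, NULL, NULL) < 0)
        return -1;

    /* "Readable" only promises that read() will not block.
       A return of 0 here is the EOF, not an error. */
    return read(fd, buf, len);
}
```

A caller would typically close its own end and stop selecting on the descriptor as soon as this returns 0.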

14. What's the difference between select() and poll()?

From Richard Stevens:

The basic difference is that select()'s fd_set is a bit mask and therefore has some fixed size. It would be possible for the kernel to not limit this size when the kernel is compiled, allowing the application to define FD_SETSIZE to whatever it wants (as the comments in the system header imply today) but it takes more work. 4.4BSD's kernel and the Solaris library function both have this limit. But I see that BSD/OS 2.1 has now been coded to avoid this limit, so it's doable, just a small matter of programming. :-) Someone should file a Solaris bug report on this, and see if it ever gets fixed. [Ed. Note - This was fixed in Solaris 7 - see below.]

With poll(), however, the user must allocate an array of pollfd structures, and pass the number of entries in this array, so there's no fundamental limit. As Casper notes, fewer systems have poll() than select, so the latter is more portable. Also, with original implementations (SVR3) you could not set the descriptor to -1 to tell the kernel to ignore an entry in the pollfd structure, which made it hard to remove entries from the array; SVR4 gets around this. Personally, I always use select() and rarely poll(), because I port my code to BSD environments too. Someone could write an implementation of poll() that uses select(), for these environments, but I've never seen one. Both select() and poll() are being standardized by POSIX 1003.1g.

Colm Smyth (Colm.Smyth@Sun.COM) writes:

I thought you might be interested to know that this was resolved in Solaris 7; here is an extract from the select(3C) man-page:

NOTES

The default value for FD_SETSIZE (currently 1024) is larger than the default limit on the number of open files. To accommodate 32-bit applications that wish to use a larger number of open files with select(), it is possible to increase this size at compile time by providing a larger definition of FD_SETSIZE before the inclusion of <sys/types.h>. The maximum supported size for FD_SETSIZE is 65536. The default value is already 65536 for 64-bit applications.
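To make the poll() side of the comparison concrete, here is a small sketch of the "allocate an array of pollfd structures" approach described above. The function name read_first_ready() is made up for this example. Note how a descriptor of -1 is simply skipped by the kernel, which is the SVR4 behaviour mentioned earlier.

```c
#include <poll.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Poll nfds descriptors for up to timeout_ms milliseconds and read from
   the first one that is ready. Entries equal to -1 are ignored by poll().
   Returns the read() result, or -1 on timeout or error. */
ssize_t read_first_ready(const int *fds, size_t nfds, int timeout_ms,
                         void *buf, size_t len)
{
    struct pollfd *pfds;
    ssize_t result = -1;
    size_t i;

    if (nfds == 0)
        return -1;

    /* The array is as large as the caller needs; no FD_SETSIZE here. */
    pfds = calloc(nfds, sizeof *pfds);
    if (pfds == NULL)
        return -1;

    for (i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
    }

    if (poll(pfds, (nfds_t) nfds, timeout_ms) > 0) {
        for (i = 0; i < nfds; i++) {
            if (pfds[i].revents & (POLLIN | POLLHUP)) {
                result = read(pfds[i].fd, buf, len);
                break;
            }
        }
    }

    free(pfds);
    return result;
}
```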

15. How do I send [this] over a socket?

Anything other than single bytes of data will probably get mangled unless you take care. For integer values you can use htons() and friends, and strings are really just a bunch of single bytes, so those should be OK. Be careful not to send a pointer to a string though, since the pointer will be meaningless on another machine. If you need to send a struct, you should write sendthisstruct() and readthisstruct() functions for it that do all the work of taking the structure apart on one side, and putting it back together on the other. If you need to send floats, you may have a lot of work ahead of you. You should read RFC 1014 which is about portable ways of getting data from one machine to another (thanks to Andrew Gabriel for pointing this out).
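As an illustration of the "take the structure apart on one side, put it back together on the other" idea, here is a sketch. The struct player record and its fields are invented for the example, as are pack_player() and unpack_player(); a sendthisstruct() function would pack into a buffer like this and then write() the buffer, and readthisstruct() would do the reverse.

```c
#include <arpa/inet.h>   /* htonl(), htons(), ntohl(), ntohs() */
#include <stdint.h>
#include <string.h>

/* Hypothetical record; the field names are made up for illustration. */
struct player {
    uint32_t id;
    uint16_t level;
    char     name[16];          /* NUL-terminated */
};

/* Size of the record on the wire: no padding, fixed field order. */
#define PLAYER_WIRE_SIZE (4 + 2 + 16)

/* Take the struct apart into a fixed-layout, network-byte-order buffer. */
void pack_player(const struct player *p, unsigned char *out)
{
    uint32_t id = htonl(p->id);
    uint16_t level = htons(p->level);

    memcpy(out, &id, 4);
    memcpy(out + 4, &level, 2);
    memcpy(out + 6, p->name, 16);
}

/* Put the struct back together on the receiving side. */
void unpack_player(const unsigned char *in, struct player *p)
{
    uint32_t id;
    uint16_t level;

    memcpy(&id, in, 4);
    memcpy(&level, in + 4, 2);
    p->id = ntohl(id);
    p->level = ntohs(level);
    memcpy(p->name, in + 6, 16);
    p->name[15] = '\0';         /* never trust the peer to terminate it */
}
```

Sending the struct's raw memory instead would expose the sender's byte order and padding to the receiver, which is exactly the mangling this answer warns about.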

16. How do I use TCP_NODELAY?

First off, be sure you really want to use it in the first place. It will disable the Nagle algorithm (see 2.11 How can I force a socket to send the data in its buffer?), which will cause network traffic to increase, with smaller than needed packets wasting bandwidth. Also, from what I have been able to tell, the speed increase is very small, so you should probably do it without TCP_NODELAY first, and only turn it on if there is a problem.

Here is a code example, with a warning about using it from Andrew Gierth:


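  #include <netinet/in.h>     /* IPPROTO_TCP */
  #include <netinet/tcp.h>    /* TCP_NODELAY */
  #include <stdio.h>          /* perror() */
  #include <sys/socket.h>     /* setsockopt() */

  int flag = 1;
  int result = setsockopt(sock,            /* socket affected */
                          IPPROTO_TCP,     /* set option at TCP level */
                          TCP_NODELAY,     /* name of option */
                          (char *) &flag,  /* the cast is historical cruft */
                          sizeof(flag));   /* length of option value */
  if (result < 0)
     perror("setsockopt(TCP_NODELAY)");    /* ... handle the error ... */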

TCP_NODELAY is for a specific purpose; to disable the Nagle buffering algorithm. It should only be set for applications that send frequent small bursts of information without getting an immediate response, where timely delivery of data is required (the canonical example is mouse movements).

17. What exactly does the Nagle algorithm do?

It groups together as much data as it can between ACKs from the other end of the connection. I found this really confusing until Andrew Gierth (andrew@erlenstar.demon.co.uk) drew the following diagram, and explained:

This diagram is not intended to be complete, just to illustrate the point better...

Case 1: client writes 1 byte per write() call. The server program is tcpserver.c from the FAQ examples.

      CLIENT                                  SERVER
APP             TCP                     TCP             APP
                [connection setup omitted]

 "h" --------->          [1 byte]
                    ------------------>
                                           -----------> "h"
                                   [ack delayed]
 "e" ---------> [Nagle alg.              .
                 now in effect]          .
 "l" ---------> [ditto]                  .
 "l" ---------> [ditto]                  .
 "o" ---------> [ditto]                  .
 "\n"---------> [ditto]                  .
                                         .
                                         .
                       [ack 1 byte]
                    <------------------
                [send queued
                data]
                        [5 bytes]
                    ------------------>
                                          ------------> "ello\n"
                                          <------------ "HELLO\n"
                   [6 bytes, ack 5 bytes]
                    <------------------
 "HELLO\n" <----
              [ack delayed]
                 .
                 .
                 .   [ack 6 bytes]
                    ------------------>

Total segments: 5. (If TCP_NODELAY was set, could have been up to 10.) Time for response: 2*RTT, plus ack delay.

Case 2: client writes all data with one write() call.


      CLIENT                                  SERVER
APP             TCP                     TCP             APP
                [connection setup omitted]

 "hello\n" --->          [6 bytes]
                    ------------------>
                                          ------------> "hello\n"
                                          <------------ "HELLO\n"
                   [6 bytes, ack 6 bytes]
                    <------------------
 "HELLO\n" <----
            [ack delayed]
                 .
                 .
                 .   [ack 6 bytes]
                    ------------------>

Total segments: 3.

Time for response = RTT (therefore minimum possible).

Hope this makes things a bit clearer...

Note that in case 2, you don't want the implementation to gratuitously delay sending the data, since that would add straight onto the response time.

18. What is the difference between read() and recv()?

read() is equivalent to recv() with a flags parameter of 0. Other values for the flags parameter change the behaviour of recv(). Similarly, write() is equivalent to send() with flags == 0.

It is unlikely that send()/recv() would be dropped; perhaps someone with a copy of the POSIX drafts for socket calls can check...

Portability note: non-unix systems may not allow read()/write() on sockets, but recv()/send() are usually ok. This is true on Windows and OS/2, for example.

19. I see that send()/write() can generate SIGPIPE. Is there any advantage to handling the signal, rather than just ignoring it and checking for the EPIPE error?

In general, the only parameter passed to a signal handler is the signal number that caused it to be invoked. Some systems have optional additional parameters, but they are no use to you in this case.

My advice is to just ignore SIGPIPE as you suggest. That's what I do in just about all of my socket code; errno values are easier to handle than signals (in fact, the first revision of the FAQ failed to mention SIGPIPE in that context; I'd got so used to ignoring it...)

There is one situation where you should not ignore SIGPIPE; if you are going to exec() another program with stdout redirected to a socket. In this case it is probably wise to set SIGPIPE to SIG_DFL before doing the exec().

Jesse Norell has pointed out that if you are using SO_KEEPALIVE to test the connection, and you aren't doing reads or writes very frequently, you might want to leave SIGPIPE enabled so that your server process gets signalled when the system determines your link is dead. Normally though you will just check returns from read()/write() and act appropriately.

20. After the chroot(), calls to socket() are failing. Why?

On systems where sockets are implemented on top of Streams (e.g. all SysV-based systems, presumably including Solaris), the socket() function will actually be opening certain special files in /dev. You will need to create a /dev directory under your fake root and populate it with the required device nodes (only).

Your system documentation may or may not specify exactly which device nodes are required; I can't help you there (sorry). (Editor's note: Adrian Hall (adrian@hottub.org) suggested checking the man page for ftpd, which should list the files you need to copy and devices you need to create in the chroot'd environment.)

A less-obvious issue with chroot() is if you call syslog(), as many daemons do; syslog() opens (depending on the system) either a UDP socket, a FIFO or a Unix-domain socket. So if you use it after a chroot() call, make sure that you call openlog() *before* the chroot.

21. Why do I keep getting EINTR from the socket calls?

This isn't really so much an error as an exit condition. It means that the call was interrupted by a signal. Any call that might block should be wrapped in a loop that checks for EINTR, as is done in the example code (See 1.8. Sample Source Code).
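Such a loop is short. The following is a sketch of the idea rather than the FAQ's own sample code; read_retry() is a name chosen for this example.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* read() that restarts itself when interrupted by a signal.
   Same return convention as read(): >0 bytes read, 0 on EOF,
   -1 on any error other than EINTR. */
ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;

    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);

    return n;
}
```

The same pattern applies to write(), accept(), connect() and friends, with the caveat that an interrupted connect() needs more care than a simple retry on some systems.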

22. When will my application receive SIGPIPE?

Very simple: with TCP you get SIGPIPE if your end of the connection has received an RST from the other end. What this also means is that if you were using select instead of write, the select would have indicated the socket as being readable, since the RST is there for you to read (read will return an error with errno set to ECONNRESET).

Basically an RST is TCP's response to some packet that it doesn't expect and has no other way of dealing with. A common case is when the peer closes the connection (sending you a FIN) but you ignore it because you're writing and not reading. (You should be using select.) So you write to a connection that has been closed by the other end and the other end's TCP responds with an RST.

23. What are socket exceptions? What is out-of-band data?

Unlike exceptions in C++, socket exceptions do not indicate that an error has occurred. Socket exceptions usually refer to the notification that out-of-band data has arrived. Out-of-band data (called "urgent data" in TCP) looks to the application like a separate stream of data from the main data stream. This can be useful for separating two different kinds of data. Note that just because it is called "urgent data" does not mean that it will be delivered any faster, or with higher priority than data in the in-band data stream. Also beware that unlike the main data stream, the out-of-band data may be lost if your application can't keep up with it.

24. How can I find the full hostname (FQDN) of the system I'm running on?

Some systems set the hostname to the FQDN and others set it to just the unqualified host name. I know the current BIND FAQ recommends the FQDN, but most Solaris systems, for example, tend to use only the unqualified host name.

Regardless, the way around this is to first get the host's name (perhaps an FQDN, perhaps unqualified). Most systems support the POSIX way to do this using uname(), but older BSD systems only provide gethostname(). Call gethostbyname() to find your IP address. Then take the IP address and call gethostbyaddr(). The h_name member of the hostent{} should then be your FQDN.

25. How do I monitor the activity of sockets?

From: Matthias Rabast (matthias.rabast@ubs.com)

How can I find out,

which sockets have highest throughput ?
how big is the tcp window size for each socket ?
how often does a special socket block and go again ?

For monitoring throughput there are tools such as IPAudit. I can't remember which tool I used to use for this purpose, but a quick search found IPAudit. I haven't tried it, so let me know if it works, or if you know of some better tools.

You can use netstat -a under Solaris and look at the Swind and Rwind columns for send and receive window sizes.

I'm not aware of any tools for monitoring how often a socket blocks. Someone please add a comment if you have any suggestions for this.

You could parse the output of snoop/tcpdump to get some of this information. Let me know if you know a good parser and I'll list it here.

3. Writing Client Applications (TCP/SOCK_STREAM)

1. How do I convert a string into an internet address?

If you are reading a host's address from the command line, you may not know if you have an aaa.bbb.ccc.ddd style address, or a host.domain.com style address. What I do with these is first try to use it as an aaa.bbb.ccc.ddd type address, and if that fails, then do a name lookup on it. Here is an example:


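#include <arpa/inet.h>    /* inet_addr() */
#include <netdb.h>        /* gethostbyname(), struct hostent */
#include <netinet/in.h>   /* struct in_addr, INADDR_NONE */
#include <stddef.h>       /* NULL */

/* Converts ascii text to in_addr struct.  NULL is returned if the 
   address can not be found.  The result points at static data, so
   copy it before calling this (or gethostbyname()) again. */
struct in_addr *atoaddr(char *address) {
  struct hostent *host;
  static struct in_addr saddr;

  /* First try it as aaa.bbb.ccc.ddd. */
  saddr.s_addr = inet_addr(address);
  if (saddr.s_addr != INADDR_NONE) {
    return &saddr;
  }
  /* Not a dotted quad, so look it up as a host name. */
  host = gethostbyname(address);
  if (host != NULL) {
    return (struct in_addr *) *host->h_addr_list;
  }
  return NULL;
}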

2. How can my client work through a firewall/proxy server?

If you are running through separate proxies for each service, you shouldn't need to do anything. If you are working through sockd, you will need to "socksify" your application. Details for doing this can be found in the package itself, which is available at:

ftp://coast.cs.purdue.edu/pub/tools/unix/socks/

3. Why does connect() succeed even before my server did an accept()?

Once you have done a listen() call on your socket, the kernel is primed to accept connections on it. The usual UNIX implementation of this works by immediately completing the SYN handshake for any incoming valid SYN segments (connection attempts), creating the socket for the new connection, and keeping this new socket on an internal queue ready for the accept() call. So the socket is fully open before the accept is done.

The other factor in this is the 'backlog' parameter for listen(); that defines how many of these completed connections can be queued at one time. If the specified number is exceeded, then new incoming connects are simply ignored (which causes them to be retried).

4. Why do I sometimes lose a server's address when using more than one server?

Take a careful look at struct hostent. Notice that almost everything in it is a pointer? All these pointers will refer to statically allocated data.

For example, if you do:


    struct hostent *host = gethostbyname(hostname);

then (as you should know) a subsequent call to gethostbyname() will overwrite the structure pointed to by 'host'.

But if you do:


    struct hostent myhost;
    struct hostent *hostptr = gethostbyname(hostname);
    if (hostptr) myhost = *hostptr;

to make a copy of the hostent before it gets overwritten, then it still gets clobbered by a subsequent call to gethostbyname(), since although myhost won't get overwritten, all the data it is pointing to will be.

You can get round this by doing a proper 'deep copy' of the hostent structure, but this is tedious. My recommendation would be to extract the needed fields of the hostent and store them in your own way.

Robin Paterson (etmrpat@etm.ericsson.se) has added:

It might be nice if you mention that MT-safe libraries provide complementary functions for multithreaded programming. On the Solaris machine I'm typing at, we have gethostbyname and gethostbyname_r (_r for reentrant). The main difference is, you provide the storage for the hostent struct so you always have a local copy and not just a pointer to the static copy.

5. How can I set the timeout for the connect() system call?

Normally you cannot change this. Solaris does let you do this, on a per-kernel basis with the ndd tcp_ip_abort_cinterval parameter.

The easiest way to shorten the connect time is with an alarm() around the call to connect(). A harder way is to use select(), after setting the socket nonblocking. Also notice that you can only shorten the connect time; there's normally no way to lengthen it.

From Andrew Gierth (andrew@erlenstar.demon.co.uk):

First, create the socket and put it into non-blocking mode, then call connect(). There are three possibilities:

connect succeeds: the connection has been successfully made (this usually only happens when connecting to the same machine)
connect fails: obvious
connect returns -1/EINPROGRESS: the connection attempt has begun, but not yet completed.

If the connection succeeds:

the socket will select() as writable (and will also select as readable if data arrives)

If the connection fails:

the socket will select as readable *and* writable, but either a read or write will return the error code from the connection attempt. Also, you can use getsockopt(SO_ERROR) to get the error status - but be careful; some systems return the error code in the result parameter of getsockopt, but others (incorrectly) cause the getsockopt call *itself* to fail with the stored value as the error.

6. Should I bind() a port number in my client program, or let the system choose one for me on the connect() call?

** Let the system choose your client's port number **

The exception to this is if the server has been written to be picky about what client ports it will allow connections from. Rlogind and rshd are the classic examples. This is usually part of a Unix-specific (and rather weak) authentication scheme; the intent is that the server allows connections only from processes with root privilege. (The weakness in the scheme is that many O/Ss (e.g. MS-DOS) allow anyone to bind any port.)

The rresvport() routine exists to help out clients that are using this scheme. It basically does the equivalent of socket() + bind(), choosing a port number in the range 512..1023.

If the server is not fussy about the client's port number, then don't try and assign it yourself in the client, just let connect() pick it for you.

If, in a client, you use the naive scheme of starting at a fixed port number and calling bind() on consecutive values until it works, then you buy yourself a whole lot of trouble:

The problem is if the server end of your connection does an active close. (E.g. client sends 'QUIT' command to server, server responds by closing the connection). That leaves the client end of the connection in CLOSED state, and the server end in TIME_WAIT state. So after the client exits, there is no trace of the connection on the client end.

Now run the client again. It will pick the same port number, since as far as it can see, it's free. But as soon as it calls connect(), the server finds that you are trying to duplicate an existing connection (although one in TIME_WAIT). It is perfectly entitled to refuse to do this, so you get, I suspect, ECONNREFUSED from connect(). (Some systems may sometimes allow the connection anyway, but you can't rely on it.)

This problem is especially dangerous because it doesn't show up unless the client and server are on different machines. (If they are the same machine, then the client won't pick the same port number as before). So you can get bitten well into the development cycle (if you do what I suspect most people do, and test client & server on the same box initially).

Even if your protocol has the client closing first, there are still ways to produce this problem (e.g. kill the server).

7. Why do I get "connection refused" when the server isn't running?

The connect() call will only block while it is waiting to establish a connection. When there is no server waiting at the other end, it gets notified that the connection can not be established, and gives up with the error message you see. This is a good thing, since if it were not the case clients might wait for ever for a service which just doesn't exist. Users would think that they were only waiting for the connection to be established, and then after a while give up, muttering something about crummy software under their breath.

8. What does one do when one does not know how much information is coming over the socket? Is there a way to have a dynamic buffer?

This question asked by Niranjan Perera (perera@mindspring.com).

When the size of the incoming data is unknown, you can either make the size of the buffer as big as the largest possible (or likely) message, or you can re-size the buffer on the fly during your read. When you malloc() a large buffer, most (if not all) variants of Unix will only allocate address space, but not physical pages of RAM. As more and more of the buffer is used, the kernel allocates physical memory. This means that malloc'ing a large buffer will not waste resources unless that memory is used, and so it is perfectly acceptable to ask for a meg of RAM when you expect only a few K.

On the other hand, a more elegant solution that does not depend on the inner workings of the kernel is to use realloc() to expand the buffer as required in say 4K chunks (since 4K is the size of a page of ram on most systems). I may add something like this to sockhelp.c in the example code one day.
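A rough sketch of the realloc() approach follows. It is not the sockhelp.c code; read_all() and the CHUNK constant are names picked for this example, and it assumes the sender marks the end of the data by closing the connection.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 4096              /* grow in page-sized steps */

/* Read from fd until end of file, growing the buffer as needed.
   On success returns a malloc()ed buffer (the caller frees it) and
   stores the byte count in *out_len; on failure returns NULL. */
char *read_all(int fd, size_t *out_len)
{
    char *buf = NULL;
    size_t used = 0, cap = 0;

    for (;;) {
        ssize_t n;

        if (used == cap) {                  /* buffer full: grow it */
            char *bigger = realloc(buf, cap + CHUNK);
            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
            cap += CHUNK;
        }

        n = read(fd, buf + used, cap - used);
        if (n < 0) {
            if (errno == EINTR)
                continue;                   /* interrupted: just retry */
            free(buf);
            return NULL;
        }
        if (n == 0)                         /* peer closed: all data read */
            break;
        used += (size_t) n;
    }

    *out_len = used;
    return buf;
}
```

For a protocol that keeps the connection open, the loop would stop at a delimiter or a length prefix instead of at EOF, but the growth logic stays the same.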

9. How can I determine the local port number?

From: Fajun Shi (fajun@cs.msstate.edu):

Hi, my question is: When I write a client, how can I know the port number that the socket bound in my machine?
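One commonly used way to find this out is getsockname(), which reports the local address a socket is bound to. The sketch below assumes an IPv4 socket; local_port() is a name made up for the example. In a client that lets the system choose the port, call it after connect() has succeeded.

```c
#include <arpa/inet.h>    /* ntohs() */
#include <netinet/in.h>   /* struct sockaddr_in */
#include <string.h>
#include <sys/socket.h>   /* getsockname() */

/* Return the local port (in host byte order) of an IPv4 socket,
   or -1 on error. An unbound socket reports port 0. */
int local_port(int sock)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;

    memset(&addr, 0, sizeof addr);
    if (getsockname(sock, (struct sockaddr *) &addr, &len) < 0)
        return -1;
    if (addr.sin_family != AF_INET)
        return -1;
    return ntohs(addr.sin_port);
}
```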

4. Writing Server Applications (TCP/SOCK_STREAM)

1. How come I get "address already in use" from bind()?

You get this when the address is already in use. (Oh, you figured that much out?) The most common reason for this is that you have stopped your server, and then re-started it right away. The sockets that were used by the first incarnation of the server are still active. This is further explained in 2.7 Please explain the TIME_WAIT state, and 2.5 How do I properly close a socket?
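The usual remedy for the restart case is to set the SO_REUSEADDR socket option before calling bind(). Here is a sketch of a listening-socket setup that does so; listen_on() is a hypothetical helper written for this example, and it assumes an IPv4 TCP server bound to all interfaces.

```c
#include <arpa/inet.h>    /* htonl(), htons() */
#include <netinet/in.h>   /* struct sockaddr_in, INADDR_ANY */
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP socket listening on the given port with SO_REUSEADDR set.
   Returns the descriptor, or -1 on error. */
int listen_on(unsigned short port, int backlog)
{
    struct sockaddr_in addr;
    int on = 1;
    int sock = socket(AF_INET, SOCK_STREAM, 0);

    if (sock < 0)
        return -1;

    /* Must be set before bind() to have any effect on it. */
    if (setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on) < 0) {
        close(sock);
        return -1;
    }

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(sock, (struct sockaddr *) &addr, sizeof addr) < 0 ||
        listen(sock, backlog) < 0) {
        close(sock);
        return -1;
    }
    return sock;
}
```

This lets a restarted server bind while old connections are still in TIME_WAIT; it does not let two servers listen on the same address and port at the same time on most systems.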

2. Why don't my sockets close?

When you issue the close() system call, you are closing your interface to the socket, not the socket itself. It is up to the kernel to close the socket. Sometimes, for really technical reasons, the socket is kept alive for a few minutes after you close it. It is normal, for example, for the socket to go into a TIME_WAIT state, on the server side, for a few minutes. People have reported ranges from 20 seconds to 4 minutes to me. The official standard says that it should be 4 minutes. On my Linux system it is about 2 minutes. This is explained in great detail in 2.7 Please explain the TIME_WAIT state.

3. How can I make my server a daemon?

There are two approaches you can take here. The first is to use inetd to do all the hard work for you. The second is to do all the hard work yourself.

If you use inetd, you simply use stdin, stdout, or stderr for your socket. (These three are all created with dup() from the real socket.) You can use these as you would a socket in your code. The inetd process will even close the socket for you when you are done. For more information on setting this up, look at the man page for inetd.

If you wish to write your own server, there is a detailed explanation in "Unix Network Programming" by Richard Stevens (see 1.6 Where can I get source code for the book [book title]?). I also picked up this posting from comp.unix.programmer, by Nikhil Nair (nn201@cus.cam.ac.uk). You may want to add code to ignore SIGPIPE, because if this signal is not dealt with, it will cause your application to exit. (Thanks to ingo@milan2.snafu.de for pointing this out).

I worked all this lot out from the GNU C Library Manual (on-line
documentation).  Here's some code I wrote - you can adapt it as necessary:


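#include <stdio.h>      /* fprintf(), perror() */
#include <stdlib.h>     /* exit() */
#include <signal.h>     /* signal(), sig_atomic_t */
#include <unistd.h>     /* chdir(), fork(), close(), setsid() */
#include <sys/types.h>  /* pid_t */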

/* Global variables */
...
volatile sig_atomic_t keep_going = 1; /* controls program termination */


/* Function prototypes: */
...
void termination_handler (int signum); /* clean up before termination */


int
main (void)
{
  ...

  if (chdir (HOME_DIR))         /* change to directory containing data 
                                    files */
   {
     fprintf (stderr, "`%s': ", HOME_DIR);
     perror (NULL);
     exit (1);
   }

   /* Become a daemon: */
   switch (fork ())
     {
     case -1:                    /* can't fork */
       perror ("fork()");
       exit (3);
     case 0:                     /* child, process becomes a daemon: */
       close (STDIN_FILENO);
       close (STDOUT_FILENO);
       close (STDERR_FILENO);
       if (setsid () == -1)      /* request a new session (job control) */
         {
           exit (4);
         }
       break;
     default:                    /* parent returns to calling process: */
       return 0;
     }

   /* Establish signal handler to clean up before termination: */
   if (signal (SIGTERM, termination_handler) == SIG_IGN)
     signal (SIGTERM, SIG_IGN);
   signal (SIGINT, SIG_IGN);
   signal (SIGHUP, SIG_IGN);

   /* Main program loop */
   while (keep_going)
     {
       ...
     }
   return 0;
}

void
termination_handler (int signum)
{
  keep_going = 0;
  signal (signum, termination_handler);
}

4. How can I listen on more than one port at a time?

The best way to do this is with the select() call. This tells the kernel to let you know when a socket is available for use. You can have one process do i/o with multiple sockets with this call. If you want to wait for a connect on sockets 4, 6 and 10 you might execute the following code snippet:


fd_set socklist;

FD_ZERO(&socklist); /* Always clear the structure first. */
FD_SET(4, &socklist);
FD_SET(6, &socklist);
FD_SET(10, &socklist);
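if (select(11, &socklist, NULL, NULL, NULL) < 0)
  perror("select");

The kernel will notify us as soon as a file descriptor which is less than 11 (the first parameter to select()) and is a member of our socklist becomes readable. A listening socket becomes readable when a connection is waiting to be accept()ed, which is why socklist is passed as the read set (the second parameter) rather than the write set. See the man page on select() for more details.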
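
To round this out, here is a sketch that opens listening sockets and accepts from whichever one is ready. The helper names make_listener() and accept_any() are invented for the example, and error handling is kept to a minimum.

```c
#include <string.h>
#include <unistd.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Create a TCP socket listening on the given port (0 lets the kernel
 * pick one).  Returns the descriptor, or -1 on error. */
int make_listener(unsigned short port)
{
    struct sockaddr_in addr;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    if (s < 0)
        return -1;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(s, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(s, 5) < 0) {
        close(s);
        return -1;
    }
    return s;
}

/* Block until one of the n listening sockets in lsock[] has a pending
 * connection, accept it, and return the new descriptor (-1 on error). */
int accept_any(const int *lsock, int n)
{
    fd_set readable;
    int maxfd = -1;

    FD_ZERO(&readable);
    for (int i = 0; i < n; i++) {
        FD_SET(lsock[i], &readable);
        if (lsock[i] > maxfd)
            maxfd = lsock[i];
    }
    /* first argument is the highest descriptor plus one */
    if (select(maxfd + 1, &readable, NULL, NULL, NULL) < 0)
        return -1;
    for (int i = 0; i < n; i++)
        if (FD_ISSET(lsock[i], &readable))  /* this one has a caller waiting */
            return accept(lsock[i], NULL, NULL);
    return -1;
}
```

A server would call make_listener() once per port, keep the descriptors in an array, and call accept_any() in its main loop. Note that select() modifies the set it is given, so it has to be rebuilt before every call, as accept_any() does.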

5. What exactly does SO_REUSEADDR do?

This socket option tells the kernel that even if this port is busy (in the TIME_WAIT state), go ahead and reuse it anyway. If it is busy, but with another state, you will still get an address already in use error. It is useful if your server has been shut down, and then restarted right away while sockets are still active on its port. You should be aware that if any unexpected data comes in, it may confuse your server, but while this is possible, it is not likely.

It has been pointed out that "A socket is a 5 tuple (proto, local addr, local port, remote addr, remote port). SO_REUSEADDR just says that you can reuse local addresses. The 5 tuple still must be unique!" by Michael Hunter (mphunter@qnx.com). This is true, and this is why it is very unlikely that unexpected data will ever be seen by your server. The danger is that a packet belonging to such a 5 tuple is still floating around on the net, and while it is bouncing around, a new connection from the same client, on the same system, happens to get the same remote port. This is explained by Richard Stevens in 2.7 "Please explain the TIME_WAIT state".
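
For reference, setting the option looks roughly like this. The wrapper name enable_reuseaddr() is invented here; the setsockopt() call itself is the standard one.

```c
#include <sys/socket.h>

/*
 * Turn on SO_REUSEADDR for socket s.  Call this after socket() and
 * before bind(); setting it after bind() is too late to help.
 * Returns 0 on success, -1 on error (errno is set by setsockopt()).
 */
int enable_reuseaddr(int s)
{
    int on = 1;

    return setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on);
}
```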

6. What exactly does SO_LINGER do?

On some Unixes this does nothing. On others, it instructs the kernel to abort TCP connections instead of closing them properly. This can be dangerous. If you are not clear on this, see 2.7 "Please explain the TIME_WAIT state".
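
For the curious, this is roughly how the "abort" behaviour is requested on systems that implement it: the option takes a struct linger, and switching it on with a zero timeout asks for an abortive close. The function name set_abortive_close() is invented for this sketch, and as noted above the effect varies between systems.

```c
#include <sys/socket.h>

/*
 * Request an abortive close on socket s: with l_onoff set and
 * l_linger zero, close() discards any unsent data and, on TCP stacks
 * that honour the option, resets the connection instead of going
 * through the normal shutdown sequence.
 * Returns 0 on success, -1 on error.
 */
int set_abortive_close(int s)
{
    struct linger lg;

    lg.l_onoff = 1;     /* enable SO_LINGER */
    lg.l_linger = 0;    /* linger time in seconds */
    return setsockopt(s, SOL_SOCKET, SO_LINGER, &lg, sizeof lg);
}
```

Because an aborted connection skips TIME_WAIT, this is sometimes used to dodge "address already in use"; that is exactly the dangerous use the answer warns about.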

7. What exactly does SO_KEEPALIVE do?

The SO_KEEPALIVE option causes a packet (called a 'keepalive probe') to be sent to the remote system if a long time (by default, more than 2 hours) passes with no other data being sent or received. This packet is designed to provoke an ACK response from the peer. This enables detection of a peer which has become unreachable (e.g. powered off or disconnected from the net). See 2.8 "Why does it take so long to detect that the peer died?" for further discussion.

Note that the figure of 2 hours comes from RFC 1122, "Requirements for Internet Hosts". The precise value should be configurable, but I've often found this to be difficult. The only implementation I know of that allows the keepalive interval to be set per-connection is SVR4.2.

8. How can I bind() to a port number < 1024?

The restriction on access to ports < 1024 is part of a (fairly weak) security scheme particular to UNIX. The intention is that servers (for example rlogind, rshd) can check the port number of the client, and if it is < 1024, assume the request has been properly authorised at the client end.

The practical upshot of this is that binding a port number < 1024 is reserved to processes having an effective UID == root.

This can, occasionally, itself present a security problem, e.g. when a server process needs to bind a well-known port, but does not itself need root access (news servers, for example). This is often solved by creating a small program which simply binds the socket, then restores the real userid and exec()s the real server. This program can then be made setuid root.

9. How do I get my server to find out the client's address / hostname?

After accept()ing a connection, use getpeername() to get the address of the client. The client's address is, of course, also returned by accept(), but it is essential to initialise the address-length parameter before the accept() call for this to work.

Jari Kokko (jkokko@cc.hut.fi) has offered the following code to determine the client address:

#include <stdio.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>

int t;                      /* connected socket returned by accept() */
socklen_t len;              /* must hold the buffer size before the call */
struct sockaddr_in sin;
struct hostent *host;

len = sizeof sin;
if (getpeername(t, (struct sockaddr *) &sin, &len) < 0)
        perror("getpeername");
else {
        if ((host = gethostbyaddr((char *) &sin.sin_addr,
                                  sizeof sin.sin_addr,
                                  AF_INET)) == NULL)
            herror("gethostbyaddr");  /* resolver errors are in h_errno, not errno */
        else printf("remote host is '%s'\n", host->h_name);
}

10. How should I choose a port number for my server?

The list of registered port assignments can be found in STD 2 or RFC 1700. Choose one that isn't already registered, and isn't in /etc/services on your system. It is also a good idea to let users customize the port number in case of conflicts with other unregistered port numbers in other servers. The best way of doing this is to hardcode a service name and use getservbyname() to look up the actual port number. This method allows users to change the port your server binds to by simply editing the /etc/services file.

11. What is the difference between SO_REUSEADDR and SO_REUSEPORT?

SO_REUSEADDR allows your server to bind to an address which is in a TIME_WAIT state. It does not allow more than one server to bind to the same address. It was mentioned that use of this flag can create a security risk because another server can bind to the same port by binding to a specific address as opposed to INADDR_ANY. The SO_REUSEPORT flag allows multiple processes to bind to the same address provided all of them use the SO_REUSEPORT option.

This is a newer flag that appeared in the 4.4BSD multicasting code (although that code was from elsewhere, so I am not sure just who invented the new SO_REUSEPORT flag).

What this flag lets you do is rebind a port that is already in use, but only if all users of the port specify the flag. I believe the intent is for multicasting apps, since if you're running the same app on a host, all need to bind the same port. But the flag may have other uses. For example the following is from a post in February:

From Stu Friedberg (stuartf@sequent.com):

SO_REUSEPORT is also useful for eliminating the try-10-times-to-bind hack in ftpd's data connection setup routine. Without SO_REUSEPORT, only one ftpd thread can bind to TCP (lhost, lport, INADDR_ANY, 0) in preparation for connecting back to the client. Under conditions of heavy load, there are more threads colliding here than the try-10-times hack can accommodate. With SO_REUSEPORT, things work nicely and the hack becomes unnecessary.

I have also heard that DEC OSF supports the flag. Also note that under 4.4BSD, if you are binding a multicast address, then SO_REUSEADDR is considered the same as SO_REUSEPORT (p. 731 of "TCP/IP Illustrated, Volume 2"). I think under Solaris you just replace SO_REUSEPORT with SO_REUSEADDR.

From a later Stevens posting, with minor editing:

Basically SO_REUSEPORT is a BSD'ism that arose when multicasting was added, even though it was not used in the original Steve Deering code. I believe some BSD-derived systems may also include it (OSF, now Digital Unix, perhaps?). SO_REUSEPORT lets you bind the same address *and* port, but only if all the binders have specified it. But when binding a multicast address (its main use), SO_REUSEADDR is considered identical to SO_REUSEPORT (p. 731, "TCP/IP Illustrated, Volume 2"). So for portability of multicasting applications I always use SO_REUSEADDR.

12. How can I write a multi-homed server?

The original question was actually from Shankar Ramamoorthy (shankar@viman.com):

I want to run a server on a multi-homed host. The host is part of two networks and has two ethernet cards. I want to run a server on this machine, binding to a pre-determined port number. I want clients on either subnet to be able to send broadcast packets to the port and have the server receive them.

And answered by Andrew Gierth (andrew@erlenstar.demon.co.uk):

Your first question in this scenario is, do you need to know which subnet the packet came from? I'm not at all sure that this can be reliably determined in all cases.

If you don't really care, then all you need is one socket bound to INADDR_ANY. That simplifies things greatly.

If you do care, then you have to bind multiple sockets. You are obviously attempting to do this in your code as posted, so I'll assume you do.

I was hoping that something like the following would work. Will it? This is on Sparcs running Solaris 2.4/2.5.

I don't have access to Solaris, but I'll comment based on my experience with other Unixes.

[Shankar's original code omitted]

What you are doing is attempting to bind all the current host's unicast addresses as listed in hosts/NIS/DNS. This may or may not reflect reality, but much more importantly, it neglects the broadcast addresses. It seems to be the case in the majority of implementations that a socket bound to a unicast address will not see incoming packets with broadcast addresses as their destinations.

The approach I've taken is to use SIOCGIFCONF to retrieve the list of active network interfaces, and SIOCGIFFLAGS and SIOCGIFBRDADDR to identify broadcastable interfaces and get the broadcast addresses. Then I bind to each unicast address, each broadcast address, and to INADDR_ANY as well. That last is necessary to catch packets that are on the wire with INADDR_BROADCAST in the destination. (SO_REUSEADDR is necessary to bind INADDR_ANY as well as the specific addresses.)

This gives me very nearly what I want. The wrinkles are:

I don't assume that getting a packet through a particular socket necessarily means that it actually arrived on that interface.
I can't tell anything about which subnet a packet originated on if its destination was INADDR_BROADCAST.
On some stacks, apparently only those with multicast support, I get duplicate incoming messages on the INADDR_ANY socket.

13. How can I read only one character at a time?

This question is usually asked by people who are testing their server with telnet, and want it to process their keystrokes one character at a time. Without special direction from the server telnet will buffer each line of text that you type, so when you press a key, telnet won't send it until you press enter. The correct way to read a single character is (as you would expect):

read(s,buf,1) or recv(s,buf,1,flags)

The rest of this answer assumes that you want to force telnet to send individual characters and not do line buffering.

According to Roger Espel Llima (espel@drakkar.ens.fr), you can have your server send a sequence of control characters: 0xff 0xfb 0x01 0xff 0xfb 0x03 0xff 0xfd 0x03, which translates to IAC WILL ECHO IAC WILL SUPPRESS-GO-AHEAD IAC DO SUPPRESS-GO-AHEAD. For more information on what this means, check out std8, std28 and std29. Roger also gave the following tips:

This code will suppress echo, so you'll have to send the characters the user types back to the client if you want the user to see them.
Carriage returns will be followed by a null character, so you'll have to expect them.
If you get a 0xff, it will be followed by two more characters. These are telnet escapes.

Thanks to Cyrus Patel (cyp@fb14.uni-mainz.de) for emailing me some pointers on clarifying this answer.
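
In code, sending Roger's sequence might look like the sketch below. The names char_mode and request_char_mode() are invented for the example, and a real server would also need to read and handle the client's replies to the negotiation.

```c
#include <unistd.h>

/* IAC WILL ECHO, IAC WILL SUPPRESS-GO-AHEAD, IAC DO SUPPRESS-GO-AHEAD */
static const unsigned char char_mode[] = {
    0xff, 0xfb, 0x01,
    0xff, 0xfb, 0x03,
    0xff, 0xfd, 0x03
};

/*
 * Ask the telnet client on descriptor s to send each keystroke as it
 * is typed.  Returns 0 if the whole sequence was written, -1 otherwise.
 */
int request_char_mode(int s)
{
    ssize_t n = write(s, char_mode, sizeof char_mode);

    return n == (ssize_t)sizeof char_mode ? 0 : -1;
}
```

Call it once, right after accept(), before reading anything from the client.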

14. I'm trying to exec() a program from my server, and attach my socket's IO to it, but I'm not getting all the data across. Why?

If the program you are running uses printf(), etc (streams from stdio.h) you have to deal with two buffers. The kernel buffers all socket IO, and this is explained in section 2.11. The second buffer is the one that is causing you grief. This is the stdio buffer, and the problem was well explained by Andrew:

(The short answer to this question is that you want to use a pty rather than a socket; the remainder of this article is an attempt to explain why.)

Firstly, the socket buffer controlled by setsockopt() has absolutely nothing to do with stdio buffering. Setting it to 1 is guaranteed to be the Wrong Thing(tm).

Perhaps the following diagram might make things a little clearer:

        Process A                   Process B
    +---------------------+     +---------------------+
    |                     |     |                     |
    |    mainline code    |     |    mainline code    |
    |         |           |     |         ^           |
    |         v           |     |         |           |
    |      fputc()        |     |      fgetc()        |
    |         |           |     |         ^           |
    |         v           |     |         |           |
    |    +-----------+    |     |    +-----------+    |
    |    | stdio     |    |     |    | stdio     |    |
    |    | buffer    |    |     |    | buffer    |    |
    |    +-----------+    |     |    +-----------+    |
    |         |           |     |         ^           |
    |         |           |     |         |           |
    |      write()        |     |       read()        |
    |         |           |     |         |           |
    +-------- | ----------+     +-------- | ----------+
              |                           |                  User space
  ------------|-------------------------- | ---------------------------
              |                           |                Kernel space
              v                           |
         +-----------+               +-----------+
         | socket    |               | socket    |
         | buffer    |               | buffer    |
         +-----------+               +-----------+
              |                           ^
              v                           |
      (AF- and protocol-          (AF- and protocol-
       dependent code)             dependent code)

Assuming these two processes are communicating with each other (I've deliberately omitted the actual comms mechanisms, which aren't really relevant), you can see that data written by process A to its stdio buffer is completely inaccessible to process B. Only once the decision is made to flush that buffer to the kernel (via write()) can the data actually be delivered to the other process.

The only guaranteed way to affect the buffering within process A is to change the code. However, the default buffering for stdout is controlled by whether the underlying FD refers to a terminal or not; generally, output to terminals is line-buffered, and output to non-terminals (including but not limited to files, pipes, sockets, non-tty devices, etc.) is fully buffered. So the desired effect can usually be achieved by using a pty device; this, for example, is what the 'expect' program does.

Since the stdio buffer (and the FILE structure, and everything else related to stdio) is user-level data, it is not preserved across an exec() call, hence trying to use setvbuf() before the exec is ineffective.
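
If you do control the source of the program being exec'd, "changing the code" can be as small as the sketch below, which has to run inside that program, before it writes anything to stdout. The function name is invented for the example.

```c
#include <stdio.h>

/*
 * Switch stdout to line buffering so that each completed line is
 * handed to the kernel even when stdout is a socket or pipe.
 * Must be called before any other operation on stdout.
 * Returns 0 on success, non-zero on failure.
 */
int make_stdout_line_buffered(void)
{
    return setvbuf(stdout, NULL, _IOLBF, 0);
}
```

Calling fflush(stdout) after each message is the other common in-program fix; passing _IONBF instead of _IOLBF turns buffering off altogether.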

A couple of alternate solutions were proposed by Roger Espel Llima (espel@drakkar.ens.fr):

If it's an option, you can use some standalone program that will just run something inside a pty and buffer its input/output. I've seen a package by the name pty.tar.gz that did that; you could search around for it with archie or AltaVista.

Another option (**warning, evil hack**) , if you're on a system that supports this (SunOS, Solaris, Linux ELF do; I don't know about others) is to, on your main program, putenv() the name of a shared executable (*.so) in LD_PRELOAD, and then in that .so redefine some commonly used libc function that the program you're exec'ing is known to use early. There you can 'get control' on the running program, and the first time you get it, do a setbuf(stdout, NULL) on the program's behalf, and then call the original libc function with a dlopen() + dlsym(). And you keep the dlsym() value on a static var, so you can just call that the following times.

(Editor's note: I still haven't done an example of how to use ptys, but I hope I will be able to do one after I finish the non-blocking example code.)

5. Writing UDP/SOCK_DGRAM applications

1. When should I use UDP instead of TCP?

UDP is good for sending messages from one system to another when the order isn't important and you don't need all of the messages to get to the other machine. This is why I've only used UDP once to write the example code for the FAQ. Usually TCP is a better solution. It saves you having to write code to ensure that messages make it to the desired destination, or to ensure the message ordering. Keep in mind that every additional line of code you add to your project is another line that could contain a potentially expensive bug.

If you find that TCP is too slow for your needs you may be able to get better performance with UDP so long as you are willing to sacrifice message order and/or reliability.

Philippe Jounin would like to add...

In chapter 5.1 you say UDP allows more throughput than TCP. It is rarely the case if you have to pass several routers.
For instance, if you connect two LANs via X25 (a common way in Europe!), every UDP datagram will :

establish a Virtual Channel (VC)

send the data

close the VC,

whereas the VC remains during a TCP dialog.

UDP must be used to multicast messages to more than one other machine at the same time. With TCP an application would have to open separate connections to each of the destination machines and send the message once to each target machine. This limits your application to only communicate with machines that it already knows about.

2. What is the difference between "connected" and "unconnected" sockets?

If a UDP socket is unconnected, which is the normal state after a bind() call, then send() or write() are not allowed, since no destination address is available; only sendto() can be used to send data.

Calling connect() on the socket simply records the specified address and port number as being the desired communications partner. That means that send() or write() are now allowed; they use the destination address and port given on the connect call as the destination of the packet.

3. Does doing a connect() call affect the receive behaviour of the socket?

Yes, in two ways. First, only datagrams from your "connected peer" are returned. All others arriving at your port are not delivered to you.

But most importantly, a UDP socket must be connected to receive ICMP errors. Pp. 748-749 of "TCP/IP Illustrated, Volume 2" give all the gory details on why this is so.

4. How can I read ICMP errors from "connected" UDP sockets?

If the target machine discards the message because there is no process reading on the requested port number, it sends an ICMP message to your machine which will cause the next system call on the socket to return ECONNREFUSED. Since delivery of ICMP messages is not guaranteed, you may not receive this notification on the first transaction.

Remember that your socket must be "connected" in order to receive the ICMP errors. I've been told, and Alan Cox has verified, that Linux will return them on "unconnected" sockets. This may cause porting problems if your application isn't ready for it, so Alan tells me they've added a SO_BSDCOMPAT flag which can be set for Linux kernels after 2.0.0.
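
Here is a sketch of what this looks like from the application's side. The function udp_probe() is invented for the example; it talks to the loopback address only, and uses poll() merely to avoid waiting forever when no ICMP error (and no reply) comes back.

```c
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Send one datagram to 127.0.0.1:port over a connected UDP socket and
 * wait up to timeout_ms for something to happen.  Returns 0 if a reply
 * arrived, ETIMEDOUT if nothing did, or the errno reported on the
 * socket (ECONNREFUSED if an ICMP port-unreachable came back).
 */
int udp_probe(unsigned short port, int timeout_ms)
{
    struct sockaddr_in peer;
    struct pollfd pfd;
    char reply[512];
    int s, r, err = 0;

    s = socket(AF_INET, SOCK_DGRAM, 0);
    if (s < 0)
        return errno;

    memset(&peer, 0, sizeof peer);
    peer.sin_family = AF_INET;
    peer.sin_port = htons(port);
    peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    /* connect() only records the peer address; no packet is sent */
    if (connect(s, (struct sockaddr *)&peer, sizeof peer) < 0 ||
        send(s, "ping", 4, 0) < 0) {
        err = errno;
        close(s);
        return err;
    }

    pfd.fd = s;
    pfd.events = POLLIN;
    r = poll(&pfd, 1, timeout_ms);
    if (r == 0)
        err = ETIMEDOUT;
    else if (r < 0)
        err = errno;
    else if (recv(s, reply, sizeof reply, 0) < 0)
        err = errno;        /* a pending ICMP error surfaces here */
    close(s);
    return err;
}
```

Pointing this at a port with no listener will usually produce ECONNREFUSED, but since ICMP delivery is not guaranteed, a timeout is also a legitimate outcome.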

5. How can I be sure that a UDP message is received?

You have to design your protocol to expect a confirmation back from the destination when a message is received. Of course if the confirmation is sent by UDP, then it too is unreliable and may not make it back to the sender. If the sender does not get confirmation back by a certain time, it will have to re-transmit the message, maybe more than once. Now the receiver has a problem because it may have already received the message, so some way of dropping duplicates is required. Most protocols use a message numbering scheme so that the receiver can tell that it has already processed this message and return another confirmation. Confirmations will also have to reference the message number so that the sender can tell which message is being confirmed. Confused? That's why I stick with TCP.
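
To make the numbering scheme a little more concrete, here is a sketch of the receiver's decision for a simple one-message-at-a-time protocol. Everything in it (the struct, the enum, the function name) is invented for illustration and is not taken from any particular protocol.

```c
#include <stdint.h>

struct receiver {
    uint32_t next_seq;      /* sequence number we expect next */
};

enum rx_action {
    RX_DELIVER,             /* new message: process it, then confirm */
    RX_DUPLICATE,           /* seen before: confirm again, don't process */
    RX_OUT_OF_ORDER         /* from the future: drop it or queue it */
};

/* Decide what to do with an incoming message numbered seq. */
enum rx_action on_message(struct receiver *r, uint32_t seq)
{
    if (seq == r->next_seq) {
        r->next_seq++;
        return RX_DELIVER;
    }
    /* the signed difference copes with the counter wrapping around */
    if ((int32_t)(seq - r->next_seq) < 0)
        return RX_DUPLICATE;
    return RX_OUT_OF_ORDER;
}
```

A duplicate still gets a confirmation because its arrival suggests the earlier confirmation was lost; the confirmation carries seq so the sender can match it to the right message.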

6. How can I be sure that UDP messages are received in order?

You can't. What you can do is make sure that messages are processed in order by using a numbering system as mentioned in 5.5 "How can I be sure that a UDP message is received?". If you need your messages to be received and be received in order you should really consider switching to TCP. It is unlikely that you will be able to do a better job implementing this sort of protocol than the TCP people already have, without a significant investment of time.

7. How often should I re-transmit unacknowledged messages?

The simplest thing to do is simply pick a fairly small delay such as one second and stick with it. The problem is that this can congest your network with useless traffic if there is a problem on the LAN or on the other machine, and this added traffic may only serve to make the problem worse.

A better technique, described with source code in "UNIX Network Programming" by Richard Stevens (see 1.6 Where can I get source code for the book [book title]?), is to use an adaptive timeout with an exponential backoff. This technique keeps statistical information on the time it is taking messages to reach a host and adjusts timeout values accordingly. It also doubles the timeout each time it is reached so as not to flood the network with useless datagrams. Richard has been kind enough to post the source code for the book on the web. Check out his home page at http://www.kohala.com/~rstevens.
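
The doubling part on its own is tiny, as the sketch below shows. This is not Richard's code and it leaves out the adaptive (round-trip measuring) half entirely; the function name and the idea of a fixed upper limit are assumptions for the example.

```c
/*
 * Return the timeout to use after the current one has expired:
 * twice as long, but never more than max_ms.
 */
unsigned next_timeout_ms(unsigned current_ms, unsigned max_ms)
{
    if (current_ms >= max_ms / 2)   /* doubling would reach or pass the cap */
        return max_ms;
    return current_ms * 2;
}
```

Starting at one second with a 64 second cap, successive timeouts would be 1, 2, 4, 8, 16, 32 and then 64 seconds; after a successful exchange the sender goes back to its starting value.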

8. How come only the first part of my datagram is getting through?

This has to do with the maximum size of a datagram on the two machines involved. This depends on the systems involved, and the MTU (Maximum Transmission Unit). According to "UNIX Network Programming", all TCP/IP implementations must support a minimum IP datagram size of 576 bytes, regardless of the MTU. Assuming a 20 byte IP header and 8 byte UDP header, this leaves 548 bytes as a safe maximum size for UDP messages. The maximum size is 65507 bytes (the 65535 byte limit on an IP datagram, less the same 28 bytes of headers). Some platforms support IP fragmentation which will allow datagrams to be broken up (because of MTU values) and then re-assembled on the other end, but not all implementations support this.

This information is taken from my reading of "UNIX Network Programming" (see 1.6 Where can I get source code for the book [book title]?).

Andrew has pointed out the following regarding large UDP messages:

Another issue is fragmentation. If a datagram is sent which is too large for the network interface it is sent through, then the sending host will fragment it into smaller packets which are reassembled by the receiving host. Also, if there are intervening routers, then they may also need to fragment the packet(s), which greatly increases the chances of losing one or more fragments (which causes the entire datagram to be dropped). Thus, large UDP datagrams should be avoided for applications that are likely to operate over routed nets or the Internet proper.

9. Why does the socket's buffer fill up sooner than expected?

From Paul W. Nelson (nelson@thursby.com):

In the traditional BSD socket implementation, sockets that are atomic such as UDP keep received data in lists of mbufs. An mbuf is a fixed size buffer that is shared by various protocol stacks. When you set your receive buffer size, the protocol stack keeps track of how many bytes of mbuf space are on the receive buffer, not the number of actual bytes. This approach is used because the resource you are controlling is really how many mbufs are used, not how many bytes are being held in the socket buffer. (A socket buffer isn't really a buffer in the traditional sense, but a list of mbufs).

For example: Let's assume your UNIX has a small mbuf size of 256 bytes. If your receive socket buffer is set to 4096, you can fit 16 mbufs on the socket buffer. If you receive 16 UDP packets that are 10 bytes each, your socket buffer is full, and you have 160 bytes of data. If you receive 16 UDP packets that are 200 bytes each, your socket buffer is also full, but contains 3200 bytes of data. FIONREAD returns the total number of bytes, not the number of messages or bytes of mbufs. Because of this, it is not a good indicator of how full your receive buffer is.

Additionally, if you receive UDP messages that are 260 bytes, you use up two mbufs, and can only receive 8 packets before your socket buffer is full. In this case, only 2080 bytes of the 4096 are held in the socket buffer.

This example is greatly simplified, and the real socket buffer algorithm also takes into account some other parameters. Note that some older socket implementations use a 128 byte mbuf.
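
The arithmetic in Paul's example can be written down as the sketch below. It models only the simplified picture given above (whole mbufs per datagram, nothing else counted), and the function name is invented for the illustration.

```c
#include <stddef.h>

/*
 * How many datagrams of `payload` bytes fit in a socket buffer of
 * sockbuf_bytes when each datagram occupies a whole number of mbufs.
 */
size_t datagrams_that_fit(size_t sockbuf_bytes, size_t mbuf_size,
                          size_t payload)
{
    /* round up: a 260 byte datagram needs two 256 byte mbufs */
    size_t mbufs_per_dgram = (payload + mbuf_size - 1) / mbuf_size;

    if (mbufs_per_dgram == 0)       /* assume even an empty datagram takes one */
        mbufs_per_dgram = 1;
    return (sockbuf_bytes / mbuf_size) / mbufs_per_dgram;
}
```

With a 4096 byte buffer and 256 byte mbufs this gives 16 datagrams for 10 or 200 byte payloads and 8 datagrams for 260 byte payloads, matching the figures above.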

6. Advanced Socket Programming

1. How would I put my socket in non-blocking mode?

Technically, fcntl(soc, F_SETFL, O_NONBLOCK) is incorrect since it clobbers all other file flags. Generally one gets away with it since the other flags (O_APPEND for example) don't really apply much to sockets. In a similarly rough vein, you would use fcntl(soc, F_SETFL, 0) to go back to blocking mode.

To do it right, use F_GETFL to get the current flags, set or clear the O_NONBLOCK flag, then use F_SETFL to set the flags.

And yes, the flag can be changed either way at will.

2. How can I put a timeout on connect()?

Andrew Gierth (andrew@erlenstar.demon.co.uk) has outlined the following procedure for using select() with connect(), which will allow you to put a timeout on the connect() call:

First, create the socket and put it into non-blocking mode, then call connect(). There are three possibilities:

connect succeeds: the connection has been successfully made (this usually only happens when connecting to the same machine)
connect fails: obvious
connect returns -1/EINPROGRESS. The connection attempt has begun, but not yet completed.

If the connection succeeds:

the socket will select() as writable (and will also select as readable if data arrives)

If the connection fails:

the socket will select as readable *and* writable, but either a read or write will return the error code from the connection attempt. Also, you can use getsockopt(SO_ERROR) to get the error status - but be careful; some systems return the error code in the result parameter of getsockopt(), but others (incorrectly) cause the getsockopt call itself to fail with the stored value as the error.

Sample code that illustrates this can be found in the file .

3. How do I complete a read if I've only read the first part of something, without again calling select()?

4. How to use select routine

5. RAW sockets

6. Restricting a socket to a given interface

7. Receiving all incoming traffic through a RAW-socket?