Passive TCP Fingerprinting

To better understand the nature of flow rates in the Internet, we sought to build on our aggregate flow analysis by inferring the prevalence of different TCP variants in our trace samples. Because TCP is a complex protocol with many configurable options and vendor implementations, each TCP stack has its own idiosyncrasies. Several existing tools (nmap [1], TBIT [3]) exploit these fingerprints by actively probing a host with various packets to learn which operating system and TCP stack are likely running on it. Working from passive traces is both an advantage, because we see a complete cross section of real clients and servers, and a disadvantage, because we cannot send unusual probe packets to reveal host details. Nevertheless, our analysis yields results representative of all flows in a portion of the Internet. These results are useful for building accurate simulations and models as well as for observing TCP deployment.

As a starting point for our passive TCP fingerprinting, we used the fingerprints from the p0f program [2] to build an entirely new tool. Our program, sieve, is designed to probabilistically classify the IP addresses seen in a libpcap-format capture file or on a live capture device by operating system, operating system group and TCP variant. Two important design goals were speed and the ability to handle very large packet captures. sieve builds a list of fingerprints from the input fingerprint file (sieve.fp) and then attempts to classify each packet. Packet state is kept in a hash table, against which later matching is done. We are building additional functionality into sieve to detect packet retransmissions, estimate RTTs and infer the congestion control algorithm.
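The matching step described above can be sketched as follows. This is a minimal illustration and not sieve itself: the Fingerprint and Syn records, the four fields compared, the TTL rounding heuristic and the match-fraction score are all assumptions introduced here, and the real sieve.fp and p0f signature formats carry more detail.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical signature record; not the actual sieve.fp format."""
    window: int          # advertised TCP window in the SYN
    initial_ttl: int     # TTL the stack is assumed to start from
    df: bool             # IP don't-fragment bit
    mss: Optional[int]   # None acts as a wildcard
    os_name: str
    os_group: str


@dataclass(frozen=True)
class Syn:
    """Fields pulled from one captured SYN packet (assumed subset)."""
    src_ip: str
    window: int
    ttl: int             # TTL as observed at the monitor
    df: bool
    mss: int


def guess_initial_ttl(observed: int) -> int:
    """Round the observed TTL up to the nearest common initial value."""
    for candidate in (32, 64, 128, 255):
        if observed <= candidate:
            return candidate
    return 255


def score(fp: Fingerprint, syn: Syn) -> float:
    """Fraction of compared fields that agree (a stand-in for a real
    probabilistic model)."""
    checks = [
        fp.window == syn.window,
        fp.initial_ttl == guess_initial_ttl(syn.ttl),
        fp.df == syn.df,
        fp.mss is None or fp.mss == syn.mss,
    ]
    return sum(checks) / len(checks)


def classify(syns: Iterable[Syn],
             fingerprints: List[Fingerprint],
             threshold: float = 0.75) -> Dict[str, Tuple[str, float]]:
    """Map each source IP to (OS group, score) using a hash table.

    fingerprints must be non-empty. An IP keeps its best-scoring match,
    and matches scoring below the threshold are discarded.
    """
    best_by_ip: Dict[str, Tuple[str, float]] = {}
    for syn in syns:
        best = max(fingerprints, key=lambda fp: score(fp, syn))
        s = score(best, syn)
        previous = best_by_ip.get(syn.src_ip)
        if s >= threshold and (previous is None or s > previous[1]):
            best_by_ip[syn.src_ip] = (best.os_group, s)
    return best_by_ip
```

Keying the table on the source address keeps memory proportional to the number of unique hosts rather than the number of packets, which is one plausible way to meet the large-capture goal stated above.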

By pairing TCP fingerprinting with flow capture, we can analyze the flow characteristics of different systems in the Internet. We are specifically interested in whether TCP variant and operating system significantly influence flow rates. In addition to collecting flow records, we modified our passive monitor to capture the SYN packet of each unique IP address. Using sieve, we analyzed the SYN capture file and built a mapping between each IP address and its likely operating system. Table 1 shows the operating system class composition of the unique IP addresses from an hour-long SYN capture at a major peering point.

Table 1. Percentage of Unique IP Addresses per Operating System
OS         Count   Percent
Windows    96761    92.10%
Linux       4081     3.88%
Solaris      537     0.51%
BSD          829     0.79%
Mac         1805     1.72%
Other       1043     0.99%
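As a quick arithmetic check, the percentages in Table 1 can be recomputed from the counts. The sketch below uses only the counts given in the table; the function name and the two-decimal rounding are choices made here for illustration.

```python
# Counts of unique IP addresses per operating system class, from Table 1.
TABLE1_COUNTS = {
    "Windows": 96761,
    "Linux": 4081,
    "Solaris": 537,
    "BSD": 829,
    "Mac": 1805,
    "Other": 1043,
}


def percentages(counts):
    """Return each class's share of the total, in percent, to 2 decimals."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 2) for name, n in counts.items()}


if __name__ == "__main__":
    for name, pct in percentages(TABLE1_COUNTS).items():
        print(f"{name:<8} {TABLE1_COUNTS[name]:>6} {pct:6.2f}%")
```

The counts sum to 105056 unique addresses, and the recomputed shares agree with the Percent column.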

The flow records from this hour-long capture were then divided by operating system. The number of flows, packets and bytes for each is given in Table 2. The first row presents the full capture totals, the subsequent rows present the flows as divided by sieve, and the final row sums the matched flows. Note that sieve matched 78% of the flows to an operating system; the remainder could not be matched because packets from the TCP connection establishment were missing.

Table 2. Flow Composition by Operating System Class
Capture      Flows    Packets        Bytes
Full       5180242   74303120  21250002764
Windows    3008477   49530139  14191681234
Linux       779671    8504282   2426971685
Solaris      31346    1886730    550507605
BSD         121483    1401550    867149117
Mac          44732     537773     92109301
Other        75541    1043317    354203279
Total      4061250   62903791  18482622221
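The division of flow records by operating system can be sketched as a join between the flow records and the IP-to-OS mapping. The record layout (source IP, packets, bytes), the use of the source address as the join key and the "Unmatched" label are assumptions made for this illustration; the text does not specify them.

```python
from collections import defaultdict


def split_by_os(flows, ip_to_os):
    """Aggregate (flows, packets, bytes) per operating system class.

    flows: iterable of (src_ip, packets, nbytes) tuples (assumed layout).
    ip_to_os: dict from IP address to OS class, e.g. built from SYN captures.
    Flows whose source IP has no mapping are counted under "Unmatched".
    """
    totals = defaultdict(lambda: [0, 0, 0])
    for src_ip, packets, nbytes in flows:
        row = totals[ip_to_os.get(src_ip, "Unmatched")]
        row[0] += 1
        row[1] += packets
        row[2] += nbytes
    return {os_name: tuple(row) for os_name, row in totals.items()}


def matched_fraction(totals):
    """Fraction of flows that were assigned to some operating system."""
    all_flows = sum(row[0] for row in totals.values())
    unmatched = totals.get("Unmatched", (0, 0, 0))[0]
    return (all_flows - unmatched) / all_flows
```

With the totals in Table 2, the matched fraction is 4061250 / 5180242, or about 0.784, consistent with the 78% quoted above.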

Figure 1 shows the complementary distribution of flow byte volume for each operating system. We see significant differences between operating systems. In the aggregate, approximately 90% of the flows were 10,000 bytes or smaller; broken out by operating system, the 90th-percentile flow size ranges from approximately 2,000 to 10,000 bytes. The maximum flow size varied by more than an order of magnitude, with a small number of Solaris boxes accounting for the largest flows.

Figure 1. Complementary CDF of Flow Volume
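For readers reproducing such plots, an empirical complementary CDF and a percentile lookup can be computed as in the sketch below. The function names and the nearest-rank percentile definition are choices made here, not necessarily those used to produce the figures.

```python
from bisect import bisect_right


def ccdf(samples):
    """Return sorted (x, P[X > x]) pairs for the empirical distribution."""
    xs = sorted(samples)
    n = len(xs)
    # bisect_right gives the number of samples <= x in the sorted list.
    return [(x, 1.0 - bisect_right(xs, x) / n) for x in sorted(set(xs))]


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    xs = sorted(samples)
    rank = max(1, (p * len(xs) + 99) // 100)  # integer ceiling of p*n/100
    return xs[rank - 1]
```

Calling percentile(flow_bytes, 90) on each operating system's flow sizes gives the kind of per-system 90% point discussed above; plotting the ccdf pairs on log-log axes gives a curve of the kind captioned in Figure 1.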

Figure 2 presents the complementary distribution of flow duration for each operating system. Variations between operating systems are again evident, with BSD systems exhibiting a mode in their distribution at approximately two seconds.

Figure 2. Complementary CDF of Flow Duration

Figure 3 shows the rate distributions seen in our flow trace. Windows-based boxes exhibited the highest peak rate but had one of the lowest average rates. Conversely, Solaris machines appear bounded at approximately 1 Mbps but showed a higher percentage of fast flows.

Figure 3. Complementary CDF of Flow Rates
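The contrast between peak and average rates can be made concrete with a small sketch. It assumes a flow's rate is its byte count times eight divided by its duration, and that zero-duration flows are skipped; the text does not state how rates were derived, so both are assumptions, as is the record layout.

```python
from collections import defaultdict


def flow_rate_bps(nbytes, duration_s):
    """Average rate of one flow in bits per second, or None if undefined."""
    if duration_s <= 0:
        return None  # e.g. single-packet flows with no measurable duration
    return 8.0 * nbytes / duration_s


def rate_summary(flows):
    """Return {os_name: (max_rate, mean_rate)} in bits per second.

    flows: iterable of (os_name, nbytes, duration_s) tuples (assumed layout).
    """
    rates = defaultdict(list)
    for os_name, nbytes, duration_s in flows:
        rate = flow_rate_bps(nbytes, duration_s)
        if rate is not None:
            rates[os_name].append(rate)
    return {name: (max(r), sum(r) / len(r)) for name, r in rates.items()}
```

A class can have the highest maximum and still a low mean when a few fast flows sit alongside many slow ones, which is the pattern reported for Windows above.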

For each operating system's flow rate distribution, we again attempt to fit the log of the flow rate samples to a normal distribution in Figure 4. We see that while the aggregate flow rates are generally well described by a log-normal distribution, the flows of individual operating systems fit only to varying degrees, and sometimes poorly. We note that the dominance of Windows flows in our trace may skew these results, as too few sample flows were captured for the other operating systems.

Figure 4. Quantile Plots (panels: Windows, Linux, Solaris, BSD, Macintosh, Other)
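A quantile plot of this kind pairs each sorted log-rate with the corresponding quantile of a fitted normal distribution. The sketch below is one way to generate those pairs using only the Python standard library; the base-10 logarithm, the (i - 0.5)/n plotting positions and the correlation summary are assumptions made here and may differ from the procedure behind Figure 4.

```python
import math
from statistics import NormalDist, mean, stdev


def lognormal_qq(rates_bps):
    """Return (theoretical, observed) quantile pairs for log10 of the rates.

    Needs at least two distinct positive rates so that the fitted
    standard deviation is non-zero.
    """
    logs = sorted(math.log10(r) for r in rates_bps if r > 0)
    n = len(logs)
    fitted = NormalDist(mean(logs), stdev(logs))
    # Pair the i-th smallest log-rate with the fitted quantile at (i - 0.5)/n.
    return [(fitted.inv_cdf((i - 0.5) / n), observed)
            for i, observed in enumerate(logs, start=1)]


def qq_correlation(points):
    """Pearson correlation of the pairs; values near 1 suggest a good fit."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

Points that fall close to a straight line indicate that the log-rates are approximately normal. With few samples, as for the less common operating systems here, the tails of such a plot are noisy and should be read with caution.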


[1] nmap, http://www.insecure.org/nmap/
[2] p0f, http://www.stearns.org/p0f/
[3] J. Padhye, S. Floyd, "On Inferring TCP Behavior", ACM SIGCOMM, 2001.