Learning and Exploiting Structural Properties of the Internet
Learning and Exploiting Structural Properties of the Internet
Robert Beverly

Overview

The Internet's topological structure, traffic properties and policy are complex and governed by many inter-related factors. Ideally, a "knowledge plane" would provide oracle information on every node in the network. Unfortunately, the size and dynamics of the Internet preclude complete information. Instead, this research examines the ability to leverage the Internet's inherent structure due to physical, logical and administrative boundaries. Specifically, we turn to machine learning techniques to learn and exploit Internet structure.

Because IP addresses are overloaded (identifying both end hosts and topological locations) they define security policy, provide reputation, and influence service selection. Thus, learning network structure endows Internet agents with many new abilities. For example:

  • Source reputation
  • Intelligent reorganization/distribution
  • Fine-grained/user-directed routing
  • Service selection
  • Latency prediction

Clustering

A common problem in many Internet contexts is making inferences on sparse data. For example, the distribution of IP addresses of spam received by an email server may be skewed toward particular address ranges. Can the email server determine which blocks are more likely to source spam? What is the correct size of clustered blocks given sparse samples? How many samples are required for a given network prefix? Can the server form predictions based on these inferred clusters? Can the server determine when a structural change has taken place that influences its prior clustering? Often, these questions are addressed in a static manner (e.g. relying on /24's as the ideal prefix size).

Our research considers these questions through an IP address clustering algorithm. More details available in:

An Internet Protocol Address Clustering Algorithm
Robert Beverly and Karen Sollins

Latency prediction

We examine the ability to exploit the hierarchical structure of Internet addresses in order to endow network agents with predictive capabilities. Specifically, we consider Support Vector Machines (SVMs) for prediction of round-trip latency to random network destinations the agent has not previously interacted with. We use kernel functions to transform the structured, yet fragmented and discontinuous, IP address space into a feature space amenable to SVMs. Our SVM approach is accurate, fast, suitable to on-line learning and generalizes well. SVM regression on a large, randomly collected data set of 30,000 Internet latencies yields a mean prediction error of 25ms using only 20% of the samples for training. Our results are promising for equipping end-nodes with intelligence for service selection, user-directed routing, resource scheduling and network inference. Finally, feature selection analysis finds that the eight most significant IP address bits provide surprisingly strong discriminative power.

More details available in:

SVM Learning of IP Address Structure for Latency Prediction
Robert Beverly, Karen Sollins and Arthur Berger.

Data Set

To collect data for our experiments, we use a simple active measurement procedure. We select unsigned 32-bit integers at random until one is found as a valid IP address in a public global routing table. Based on the approximately 1.8B publicly advertised addresses, filtering with the BGP table reduces our search space by approximately half. If the randomly selected destination responds to ICMP echo requests, i.e. ``ping,'' we record the average of five ping times from our measurement host as the round-trip latency. Our data set consists of approximately 30,000 randomly selected (IP,Latency) pairs. The dataset used in the paper can be found at:

http://ana.csail.mit.edu/rbeverly/latency/latency.dat.gz