
Experiences Building PlanetLab

Larry Peterson, Andy Bavier, Marc E. Fiuczynski, Steve Muir

Department of Computer Science
Princeton University

Abstract. This paper reports our experiences building PlanetLab over the last four years. It identifies the requirements that shaped PlanetLab, explains the design decisions that resulted from resolving conflicts among these requirements, and reports our experience implementing and supporting the system. Due in large part to the nature of the “PlanetLab experiment,” the discussion focuses on synthesis rather than new techniques, balancing system-wide considerations rather than improving performance along a single dimension, and learning from feedback from a live system rather than controlled experiments using synthetic workloads.

1 Introduction

PlanetLab is a global platform for deploying and evaluating network services [21, 3]. In many ways, it has been an unexpected success. It was launched in mid-2002 with 100 machines distributed to 40 sites, but today includes 700 nodes spanning 336 sites and 35 countries. It currently hosts 2500 researchers affiliated with 600 projects. It has been used to evaluate a diverse set of planetary-scale network services, including content distribution [33, 8, 24], anycast [35, 9], DHTs [26], robust DNS [20, 25], large-file distribution [19, 1], measurement and analysis [30], anomaly and fault diagnosis [36], and event notification [23]. It supports the design and evaluation of dozens of long-running services that transport an aggregate of 3-4TB of data every day, satisfying tens of millions of requests involving roughly one million unique clients and servers.

To deliver this utility, PlanetLab innovates along two main dimensions:

• Novel management architecture. PlanetLab administers nodes owned by hundreds of organizations, which agree to allow a worldwide community of researchers—most complete strangers—to access their machines. PlanetLab must manage a complex relationship between node owners and users.

• Novel usage model. Each PlanetLab node should gracefully degrade in performance as the number of users grows. This gives the PlanetLab community an incentive to work together to make best use of its shared resources.

In both cases, the contribution is not a new mechanism or algorithm, but rather a synthesis (and full exploitation) of carefully selected ideas to produce a fundamentally new system.

Moreover, the process by which we designed the system is interesting in its own right:

• Experience-driven design. PlanetLab’s design evolved incrementally based on experience gained from supporting a live user community. This is in contrast to most research systems that are designed and evaluated under controlled conditions, contained within a single organization, and evaluated using synthetic workloads.

• Conflict-driven design. The design decisions that shaped PlanetLab were responses to conflicting requirements. The result is a comprehensive architecture based more on balancing global considerations than improving performance along a single dimension, and on real-world requirements that do not always lend themselves to quantifiable metrics.

One could view this as a new model of system design, but of course it isn’t [6, 27].

This paper identifies the requirements that shaped the system, explains the design decisions that resulted from resolving conflicts among these requirements, and reports our experience building and supporting the system. A side-effect of the discussion is a fairly complete overview of PlanetLab’s current architecture, but the primary goal is to describe the design decisions that went into building PlanetLab, and to report the lessons we have learned in the process. For a comprehensive definition of the PlanetLab architecture, the reader is referred to [22].

2 Background

This section identifies the requirements we understood at the time PlanetLab was first conceived, and sketches the high-level design proposed at that time. The discussion includes a summary of the three main challenges we have faced, all of which can be traced to tensions between the requirements. The section concludes by looking at the relationship between PlanetLab and similar systems.


2.1 Requirements

PlanetLab’s design was guided by five major requirements that correspond to objectives we hoped to achieve as well as constraints we had to live with. Although we recognized all of these requirements up-front, the following discussion articulates them with the benefit of hindsight.

(R1) It must provide a global platform that supports both short-term experiments and long-running services. Unlike previous testbeds, a revolutionary goal of PlanetLab was that it support experimental services that could run continuously and support a real client workload. This implied that multiple services be able to run concurrently since a batch-scheduled facility is not conducive to a 24×7 workload. Moreover, these services (experiments) should be isolated from each other so that one service does not unduly interfere with another.

(R2) It must be available immediately, even though no one knows for sure what “it” is. PlanetLab faced a dilemma: it was designed to support research in broad-coverage network services, yet its management (control) plane is itself such a service. It was necessary to deploy PlanetLab and start gaining experience with network services before we fully understood what services would be needed to manage it. As a consequence, PlanetLab had to be designed with explicit support for evolution. Moreover, to get people to use PlanetLab—so we could learn from it—it had to be as familiar as possible; researchers are not likely to change their programming environment to use a new facility.

(R3) We must convince sites to host nodes running code written by unknown researchers from other organizations. PlanetLab takes advantage of nodes contributed by research organizations around the world. These nodes, in turn, host services on behalf of users from other research organizations. The individual users are unknown to the node owners, and to make matters worse, the services they deploy often send potentially disruptive packets into the Internet. That sites own and host nodes, but trust PlanetLab to administer them, is unprecedented at the scale PlanetLab operates. As a consequence, we must correctly manage the trust relationships so that the risks to each site are less than the benefits they derive.

(R4) Sustaining growth depends on support for autonomy and decentralized control. PlanetLab is a world-wide platform constructed from components owned by many autonomous organizations. Each organization must retain some amount of control over how their resources are used, and PlanetLab as a whole must give geographic regions and other communities as much autonomy as possible in defining and managing the system. Generally, sustaining such a system requires minimizing centralized control.

(R5) It must scale to support many users with minimal resources. While a commercial variant of PlanetLab might have cost recovery mechanisms to provide resource guarantees to each of its users, PlanetLab must operate in an under-provisioned environment. This means conservative allocation strategies are not practical, and it is necessary to promote efficient resource sharing. This includes both physical resources (e.g., cycles, bandwidth, and memory) and logical resources (e.g., IP addresses).

Note that while the rest of this paper discusses the many tensions between these requirements, two of them are quite synergistic. The requirement that we evolve PlanetLab (R2) and the need for decentralized control (R4) both point to the value of factoring PlanetLab’s management architecture into a set of building block components with well-defined interfaces. A major challenge of building PlanetLab was to understand exactly what these pieces should be.

To this end, PlanetLab originally adopted an organizing principle called unbundled management, which argued that the services used to manage PlanetLab should themselves be deployed like any other service, rather than bundled with the core system. The case for unbundled management has three arguments: (1) to allow the system to more easily evolve; (2) to permit third-party developers to build alternative services, enabling a software bazaar, rather than rely on a single development team with limited resources and creativity; and (3) to permit decentralized control over PlanetLab resources, and ultimately, over its evolution.

2.2 Initial Design

PlanetLab supports the required usage model through distributed virtualization—each service runs in a slice of PlanetLab’s global resources. Multiple slices run concurrently on PlanetLab, where slices act as network-wide containers that isolate services from each other. Slices were expected to enforce two kinds of isolation: resource isolation and security isolation, the former concerned with minimizing performance interference and the latter concerned with eliminating namespace interference.

At a high level, PlanetLab consists of a centralized front-end, called PlanetLab Central (PLC), that remotely manages a set of nodes. Each node runs a node manager (NM) that establishes and controls virtual machines (VMs) on that node. We assume an underlying virtual machine monitor (VMM) implements the VMs. Users create slices through operations available on PLC, which results in PLC contacting the NM on each node to create a local VM. A set of such VMs defines the slice.

We initially elected to use a Linux-based VMM due to Linux’s high mind-share [3]. Linux is augmented with Vservers [16] to provide security isolation and a set of schedulers to provide resource isolation.


2.3 Design Challenges

Like many real systems, what makes PlanetLab interesting to study—and challenging to build—is how it deals with the constraints of reality and conflicts among requirements. Here, we summarize the three main challenges; subsequent sections address each in more detail.

First, unbundled management is a powerful design principle for evolving a system, but we did not fully understand what it entailed nor how it would be shaped by other aspects of the system. Defining PlanetLab’s management architecture—and in particular, deciding how to factor management functionality into a set of independent pieces—involved resolving three main conflicts:

• minimizing centralized components (R4) yet maintaining the necessary trust assumptions (R3);

• balancing the need for slices to acquire the resources they need (R1) yet coping with scarce resources (R5);

• isolating slices from each other (R1) yet allowing some slices to manage other slices (R2).

Section 3 discusses our experiences evolving PlanetLab’s management architecture.

Second, resource allocation is a significant challenge for any system, and this is especially true for PlanetLab, where the requirement for isolation (R1) is in conflict with the reality of limited resources (R5). Part of our approach to this situation is embodied in the management structure described in Section 3, but it is also addressed in how scheduling and allocation decisions are made on a per-node basis. Section 4 reports our experience balancing isolation against efficient resource usage.

Third, we must maintain a stable system on behalf of the user community (R1) and yet evolve the platform to provide long-term viability and sustainability (R2). Section 5 reports our operational experiences with PlanetLab, and the lessons we have learned as a result.

2.4 Related Systems

An important question to ask about PlanetLab is whether its specific design requirements make it unique, or if our experiences can apply to other systems. Our response is that PlanetLab shares “points of pain” with three similar systems—ISPs, hosting centers, and the GRID—but pushes the envelope relative to each.

First, PlanetLab is like an ISP in that it has many points-of-presence and carries traffic to/from the rest of the Internet. Like ISPs (but unlike hosting centers and the GRID), PlanetLab has to provide mechanisms that can be used to identify and stop disruptive traffic. PlanetLab goes beyond traditional ISPs, however, in that it has to deal with arbitrary (and experimental) network services, not just packet forwarding.

Second, PlanetLab is like a hosting center in that its nodes support multiple VMs, each on behalf of a different user. Like a hosting center (but unlike the GRID or ISPs), PlanetLab has to provide mechanisms that enforce isolation between VMs. PlanetLab goes beyond hosting centers, however, because it includes third-party services that manage other VMs, and because it must scale to large numbers of VMs with limited resources.

Third, PlanetLab is like the GRID in that its resources are owned by multiple autonomous organizations. Like the GRID (but unlike an ISP or hosting center), PlanetLab has to provide mechanisms that allow one organization to grant users at another organization the right to use its resources. PlanetLab goes far beyond the GRID, however, in that it scales to hundreds of “peering” organizations by avoiding pair-wise agreements.

PlanetLab faces new and unique problems because it is at the intersection of these three domains. For example, combining multiple independent VMs with a single IP address (hosting center) and the need to trace disruptive traffic back to the originating user (ISP) results in a challenging problem. PlanetLab’s experiences will be valuable to other systems that may emerge where any of these domains intersect, and may in time influence the direction of hosting centers, ISPs, and the GRID as well.

3 Slice Management

This section describes the slice management architecture that evolved over the past four years. While the discussion includes some details, it primarily focuses on the design decisions and the factors that influenced them.

3.1 Trust Assumptions

Given that PlanetLab sites and users span multiple organizations (R3), the first design issue was to define the underlying trust model. Addressing this issue required that we identify the key principals, explicitly state the trust assumptions among them, and provide mechanisms that are consistent with this trust model.

Over 300 autonomous organizations have contributed nodes to PlanetLab (they each require control over the nodes they own) and over 300 research groups want to deploy their services across PlanetLab (the node owners need assurances that these services will not be disruptive). Clearly, establishing 300×300 pairwise trust relationships is an unmanageable task, but it is well-understood that a trusted intermediary is an effective way to manage such an N×N problem.

PLC is one such trusted intermediary: node owners trust PLC to manage the behavior of VMs that run on their nodes while preserving their autonomy, and researchers trust PLC to provide access to a set of nodes that are capable of hosting their services. Recognizing this role for PLC, and organizing the architecture around it, is the single most important aspect of the design beyond the simple model presented in Section 2.2.


Figure 1: Trust relationships among principals.

With this backdrop, the PlanetLab architecture recognizes three main principals:

• PLC is a trusted intermediary that manages nodes on behalf of a set of owners, and creates slices on those nodes on behalf of a set of users.

• An owner is an organization that hosts (owns) PlanetLab nodes. Each owner retains ultimate control over their own nodes, but delegates management of those nodes to the trusted PLC intermediary. PLC provides mechanisms that allow owners to define resource allocation policies on their nodes.

• A user is a researcher that deploys a service on a set of PlanetLab nodes. PlanetLab users are currently individuals at research organizations (e.g., universities, non-profits, and corporate research labs), but this is not an architectural requirement. Users create slices on PlanetLab nodes via mechanisms provided by the trusted PLC intermediary.

Figure 1 illustrates the trust relationships between node owners, users, and the PLC intermediary. In this figure:

1. PLC expresses trust in a user by issuing it credentials that let it access slices. This means that the user must adequately convince PLC of its identity (e.g., affiliation with some organization or group).

2. A user trusts PLC to act as its agent, creating slices on its behalf and checking credentials so that only that user can install and modify the software running in its slice.

3. An owner trusts PLC to install software that is able to map network activity to the responsible slice. This software must also isolate resource usage of slices and bound/limit slice behavior.

4. PLC trusts owners to keep their nodes physically secure. It is in the best interest of owners to not circumvent PLC (upon which they depend for accurate policing of their nodes). PLC must also verify that every node it manages actually belongs to an owner with which it has an agreement.

Given this model, the security architecture includes the following mechanisms. First, each node boots from an immutable file system, loading (1) a boot manager program, (2) a public key for PLC, and (3) a node-specific secret key. We assume that the node is physically secured by the owner in order to keep the key secret, although a hardware mechanism such as TCPA could also be leveraged. The node then contacts a boot server running at PLC, authenticates the server using the public key, and uses HMAC and the secret key to authenticate itself to PLC. Once authenticated, the boot server ensures that the appropriate VMM and NM are installed on the node, thus satisfying the fourth trust relationship.

Second, once PLC has vetted an organization through an off-line process, users at the site are allowed to create accounts and upload their public keys. PLC then installs these keys in any VMs (slices) created on behalf of those users, and permits access to those VMs via ssh. Currently, PLC requires that new user accounts are authorized by a principal investigator associated with each site—this provides some degree of assurance that accounts are only created by legitimate users with a connection to a particular site, thus satisfying the first trust relationship.

Third, PLC runs an auditing service that records information about all packet flows coming out of the node. The auditing service offers a public, web-based interface on each node, through which anyone that has received unwanted network traffic from the node can determine the responsible users. PLC archives this auditing information by periodically downloading the audit log.

3.2 Virtual Machines and Resource Pools

Given the requirement that PlanetLab support long-lived slices (R1) and accommodate scarce resources (R5), the second design decision was to decouple slice creation from resource allocation. In contrast to a hosting center that might create a VM and assign it a fixed set of resources as part of an SLA, PlanetLab creates new VMs without regard for available resources—each such VM is given a fair share of the available resources on that node whenever it runs—and then expects slices to engage one or more brokerage services to acquire resources.

To this end, the NM supports two abstract objects: virtual machines and resource pools. The former is a container that provides a point-of-presence on a node for a slice. The latter is a collection of physical and logical resources that can be bound to a VM. The NM supports operations to create both objects, and to bind a pool to a VM for some fixed period of time. Both types of objects are specified by a resource specification (rspec), which is a list of attributes that describe the object. A VM can run as soon as it is created, and by default is given a fair share of the node’s unreserved capacity. When a resource pool is bound to a VM, that VM is allocated the corresponding resources for the duration of the binding.

Global management services use these per-node operations to create PlanetLab-wide slices and assign resources to them. Two such service types exist today: slice creation services and brokerage services. These services can be separate or combined into a single service that both creates and provisions slices. At the same time, different implementations of brokerage services are possible (e.g., market-based services that provide mechanisms for buying and selling resources [10, 14], and batch scheduling services that simply enforce admission control for use of a finite resource pool [7]).

As part of the resource allocation architecture, it was also necessary to define a policy that governs how resources are allocated. On this point, owner autonomy (R4) comes into play: only owners are allowed to invoke the “create resource pool” operation on the NM that runs on their nodes. This effectively defines one or more “root” pools, which can subsequently be split into sub-pools and reassigned. An owner can also directly allocate a certain fraction of its node’s resources to the VM of a specific slice, thereby explicitly supporting any services the owner wishes to host.

3.3 Delegation

PlanetLab’s management architecture was expected to evolve through the introduction of third-party services (R2). We viewed the NM interface as the key feature, since it would support the many third-party creation and brokerage services that would emerge. We regarded PLC as merely a “bootstrap” mechanism that could be used to deploy such new global management services, and thus, we expected PLC to play a reduced role over time.

However, experience showed this approach to be flawed. This is for two reasons, one fundamental and one pragmatic. First, it failed to account for PLC’s central role in the trust model of Section 3.1. Maintaining trust relationships among participants is a critical role played by PLC, and one not easily passed along to other services. Second, researchers building new management services on PlanetLab were not interested in replicating all of PLC’s functionality. Instead of using PLC to bootstrap a comprehensive suite of management services, researchers wanted to leverage some aspects of PLC and replace others.

To accommodate this situation, PLC is today structured as follows. First, each owner implicitly assigns all of its resources to PLC for redistribution. The owner can override this allocation by granting a set of resources to a specific slice, or divide resources among multiple brokerage services, but by default all resources are allocated to PLC.

Second, PLC runs a slice creation service—called pl_conf—on each node. This service runs in a standard VM and invokes the NM interface without any additional privilege. It also exports an XML-RPC interface by which anyone can invoke its services. This is important because it means other brokerage and slice creation services can use pl_conf as their point-of-presence on each node rather than have to first deploy their own slice. Originally, the PLC/pl_conf interface was private as we expected management services to interact directly with the node manager. However, making this a well-defined, public interface has been a key to supporting delegation.

Third, PLC provides a front-end—available either as a GUI or as a programmatic interface at www.planet-lab.org—through which users create slices. The PLC front-end interacts with pl_conf on each node using the same XML-RPC interface that other services use.

Finally, PLC supports two methods by which slices are actually instantiated on a set of nodes: direct and delegated. Using the direct method, the PLC front-end contacts pl_conf on each node to create the corresponding VM and assign resources to it. Using delegation, a slice creation service running on behalf of a user contacts PLC for a ticket that encapsulates the right to create a VM or redistribute a pool of resources. A ticket is a signed rspec; in this case, it is signed by PLC. The agent then contacts pl_conf on each node to redeem this ticket, at which time pl_conf validates it and calls the NM to create a VM or bind a pool of resources to an existing VM. The mechanisms just described currently support two slice creation services (PLC and Emulab [34], the latter uses tickets granted by the former), and two brokerage services (Sirius [7] and Bellagio [2], the first of which is granted capacity as part of a root resource allocation decision).
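Since a ticket is just an rspec signed by PLC, the delegated path reduces to a sign-then-redeem exchange. The sketch below uses an HMAC purely as a stand-in for PLC's actual signature scheme, and the helper names are hypothetical:

    import hmac, hashlib, json

    PLC_KEY = b"plc-signing-key"  # stand-in for PLC's signing key

    def grant_ticket(rspec):
        # PLC signs the rspec; the ticket encapsulates the right to
        # create a VM or redistribute a pool of resources.
        blob = json.dumps(rspec, sort_keys=True).encode()
        sig = hmac.new(PLC_KEY, blob, hashlib.sha256).hexdigest()
        return {"rspec": rspec, "sig": sig}

    def redeem_ticket(ticket, nm):
        # pl_conf validates the ticket before calling the NM on the
        # slice creation service's behalf.
        blob = json.dumps(ticket["rspec"], sort_keys=True).encode()
        good = hmac.new(PLC_KEY, blob, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(good, ticket["sig"]):
            raise ValueError("ticket not signed by a trusted authority")
        return nm.create_vm(ticket["rspec"])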

Note that the delegated method of slice creation is push-based, while the direct method is pull-based. With delegation, a slice creation service contacts PLC to retrieve a ticket granting it the right to create a slice, and then performs an XML-RPC call to pl_conf on each node. For a slice spanning a significant fraction of PlanetLab’s nodes, an implementation would likely launch multiple such calls in parallel. In contrast, PLC uses a polling approach: each pl_conf contacts PLC periodically to retrieve a set of tickets for the slices it should run.

While the push-based approach can create a slice in less time, the advantage of the pull-based approach is that it enables slices to persist across node reinstalls. Nodes cannot be trusted to have persistent state since they are completely reinstalled from time to time due to unrecoverable errors such as corrupt local file systems. The pull-based strategy views all nodes as maintaining only soft state, and gets the definitive list of slices for that node from PLC. Therefore, if a node is reinstalled, all of its slices are automatically recreated. Delegation makes it possible for others to develop alternative slice creation semantics—for example, a “best effort” system that ignores such problems—but PLC takes the conservative approach because it is used to create slices for essential management services.
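The pull-based path thus amounts to a periodic reconciliation loop in pl_conf, along these lines (a sketch with hypothetical helpers, reusing redeem_ticket from the sketch above):

    import time

    def reconcile_loop(plc, nm, period_sec=3600):
        # Local slice state is soft state: the definitive list comes
        # from PLC, so a freshly reinstalled node automatically
        # recreates every slice it is supposed to host.
        while True:
            tickets = plc.get_tickets_for_node()
            wanted = set()
            for t in tickets:
                name = t["rspec"]["slice"]
                wanted.add(name)
                if name not in nm.local_slices():
                    redeem_ticket(t, nm)
            for name in set(nm.local_slices()) - wanted:
                nm.destroy_vm(name)  # slice was deleted at PLC
            time.sleep(period_sec)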


3.4 Federation

Given our desire to minimize the centralized elements of PlanetLab (R4), our next design decision was to make it possible for multiple independent PlanetLab-like systems to co-exist and federate with each other. Note that this issue is distinct from delegation, which allows multiple management services to co-exist within a single PlanetLab.

There are three keys to enabling federation. First, there must be well-defined interfaces by which independent instances of PLC invoke operations on each other. To this end, we observe that our implementation of PLC naturally divides into two halves: one that creates slices on behalf of users and one that manages nodes on behalf of owners, and we say that PLC embodies a slice authority and a management authority, respectively. Corresponding to these two roles, PLC supports two distinct interfaces: one that is used to create and control slices, and one that is used to boot and manage nodes. We claim that these interfaces are minimal, and hence, define the “narrow waist” of the PlanetLab hourglass.

Second, supporting multiple independent PLCs implies the need to name each instance. It is PLC in its slice authority role that names slices, and its name space must be extended to also name slice authorities. For example, the slice cornell.cobweb is implicitly plc.cornell.cobweb, where plc is the top-level slice authority that approved the slice. (As we generalize the slice name space, we adopt “.” instead of “_” as the delimiter.) Note that this model enables a hierarchy of slice authorities, which is in fact already the case with plc.cornell, since PLC trusts Cornell to approve local slices (and the users bound to them).

This generalization of the slice naming scheme leads to several possibilities:

• PLC delegates the ability to create slices to regionalslice authorities (e.g., plc.japan.utokyo.ubiq);

• organizations create “private” PlanetLabs (e.g., epfl.chawla) that possibly peer with each other, or with the “public” PlanetLab; and

• alternative “root” naming authorities come into existence, such as one that is responsible for commercial (for-profit) slices (e.g., com.startup.voip).

The third of these is speculative, but the first two scenarios have already happened or are in progress, with five private PlanetLabs running today and two regional slice authorities planned for the near future. Note that there must be a single global naming authority that ensures all top-level slice authority names are unique. Today, PLC plays that role.

Service          Lines of Code   Language
Node Manager     2027            Python
Proper           5752            C
pl_conf          1975            Python
Sirius           850             Python
Stork            12803           Python
CoStat + CoMon   1155            C
PlanetFlow       5932            C

Table 1: Source lines of code for various management services

The third key to federation is to design pl_conf so that it is able to create slices on behalf of many different slice authorities. Node owners allocate resources to the slice authorities they want to support, and configure pl_conf to accept tickets signed by slice authorities that they trust. Note that being part of the “public” PlanetLab carries the stipulation that a certain minimal amount of capacity be set aside for slices created by the PLC slice authority, but owners can reserve additional capacity for other slice authorities and for individual slices.

3.5 Least Privilege

We conclude our description of PlanetLab’s management architecture by focusing on the node-centric issue of how management functionality has been factored into self-contained services, moved out of the NM and isolated in their own VMs, and granted minimal privileges.

When PlanetLab was first deployed, all management services requiring special privilege ran in a single root VM as part of a monolithic node manager. Over time, stand-alone services have been carved off of the NM and placed in their own VMs, multiple versions of some services have come and gone, and new services have emerged. Today, there are five broad classes of management services. The following summarizes one particular “suite” of services that a user might engage; we also identify alternative services that are available.

Slice Creation Service: pl_conf is the default slice creation service. It requires no special privilege: the node owner creates a resource pool and assigns it to pl_conf when the node boots. Emulab [34] offers an alternative slice creation service that uses tickets granted by PLC and redeemed by pl_conf.

Brokerage Service: Sirius [7] is the most widely used brokerage service. It performs admission control on a resource pool set aside for one-hour experiments. Sirius requires no special privilege: pl_conf allocates a sub-pool of resources to Sirius. Bellagio [2] and Tycoon [14] are alternative market-based brokerage services that are initialized in the same way.

Monitoring Service: CoStat is a low-level instrumentation program that gathers data about the state of the local node. It is granted the ability to read /proc files that report data about the underlying VMM, as well as the right to execute scripts (e.g., ps and top) in the root context. Multiple additional services—e.g., CoMon [31], PsEPR [5], SWORD [18]—then collect and process this information on behalf of users. These services require no additional privilege.

Environment Service: Stork [12] deploys, updates, and configures services and experiments. Stork is granted the right to mount the file system of a client slice, which Stork then uses to install software packages required by the slice. It is also granted the right to mark a file as immutable, so that it can safely be shared among slices without any slice being able to modify the file. Emulab and AppManager [28] provide alternative environment services without extra privilege; they simply provide tools for uploading software packages.

Auditing Service: PlanetFlow [11] is an auditing service that logs information about packet flows, and is able to map externally visible network activity to the responsible slice. PlanetFlow is granted the right to run ulogd in the root context to retrieve log information from the VMM.

The need to grant narrowly-defined privileges to certain management services has led us to define a mechanism called Proper (PRivileged OPERation) [17]. Proper uses an ACL to specify the particular operations that can be performed by a VM that hosts a management service, possibly including argument constraints for each operation. For example, the CoStat monitoring service gathers various statistics by reading /proc files in the root context, so Proper constrains the set of files that can be opened by CoStat to only the necessary directories. For operations that affect other slices directly, such as mounting the slice’s file system or executing a process in that slice, Proper also allows the target slice to place additional constraints on the operations that can be performed, e.g., only a particular directory may be mounted by Stork. In this way we are able to operate each management service with a small set of additional privileges above a normal slice, rather than giving out coarse-grained capabilities such as those provided by the standard Linux kernel, or co-locating the service in the root context.
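An ACL in the spirit of Proper might look like the sketch below. The entries and matching rules are illustrative, not Proper's actual configuration syntax:

    # Illustrative per-slice ACL: operation names plus argument
    # constraints (e.g., a path prefix or an allowed command list).
    ACL = {
        "princeton.costat": [
            ("open_file", {"path_prefix": "/proc/"}),
            ("exec_root", {"commands": ["ps", "top"]}),
        ],
        "arizona.stork": [
            ("mount_dir", {"path_prefix": "/vservers/"}),
            ("set_immutable", {}),
        ],
    }

    def allowed(slice_name, op, args):
        # Grant the operation only if some ACL entry matches both
        # the operation name and all of its argument constraints.
        for name, constraints in ACL.get(slice_name, []):
            if name != op:
                continue
            prefix = constraints.get("path_prefix")
            if prefix and not args.get("path", "").startswith(prefix):
                continue
            cmds = constraints.get("commands")
            if cmds and args.get("command") not in cmds:
                continue
            return True
        return False

    # e.g., allowed("princeton.costat", "open_file",
    #               {"path": "/proc/loadavg"}) returns True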

Finally, Table 1 quantifies the impact of moving functionality out of the NM in terms of lines of code. The LOC data is generated using David A. Wheeler’s SLOCCount. Note that we show separate data for Proper and the rest of the node manager; Proper’s size is in part a function of its implementation in C.

One could argue that these numbers are conservative, as there are additional services that these management services employ. For example, CoBlitz is a large-file transfer mechanism that is used by Stork and Emulab to disseminate large files across a set of nodes. Similarly, a number of these services provide a web interface that must run on each node, which would greatly increase the size of the TCB if the web server itself had to be included in the root context.

4 Resource Allocation

One of the most significant challenges for PlanetLab has been to maximize the platform’s utility for a large user community while dealing with the reality of limited resources. This challenge has led us to a model of weak resource isolation between slices. We implement this model through fair sharing of CPU and network bandwidth, simple mechanisms to avoid the worst kinds of interference on other resources like memory and disk, and tools to give users information about resource availability on specific nodes. This section reports our experiences with this model in practice, and describes some of the techniques we’ve adopted to make the system as effective and stable as possible.

4.1 Workload

PlanetLab supports a workload mixing one-off experiments with long-running services. A complete characterization of this workload is beyond the scope of this paper, but we highlight some important aspects below.

CoMon—one of the performance-monitoring services running on PlanetLab—classifies a slice as active on a node if it contains a process, and live if, in the last five minutes, it used at least 0.1% (300ms) of the CPU. Figure 2 shows, by quartile, the number of active and live slices across PlanetLab during the past year. Each graph shows five lines; 25% of PlanetLab nodes have values that fall between the first and second lines, 25% between the second and third, and so on.
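These two definitions are easy to state precisely. The sketch below assumes per-slice process counts and CPU-time samples are available; the inputs are hypothetical:

    WINDOW_SEC = 300.0  # five minutes

    def classify(num_processes, cpu_sec_in_window):
        # Active: the slice has at least one process on the node.
        # Live: it used at least 0.1% of the CPU (300ms) over the
        # last five minutes.
        active = num_processes > 0
        live = active and cpu_sec_in_window >= 0.001 * WINDOW_SEC
        return active, live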

Looking at each graph in more detail, Figure 2(a) illustrates that the number of active slices on most PlanetLab nodes has grown steadily. The median active slice count has increased from 40 slices in March 2005 to the mid-50s in April 2006, and the maximum number of active slices has increased from 60 to 90. PlanetLab nodes can support large numbers of mostly idle slices because each VM is very lightweight. Additionally, the data shows that 75% of PlanetLab nodes have consistently had at least 40 active slices during the past year.


Figure 2: Active and live slices on PlanetLab, by quartile: (a) active slices (slices with a process); (b) live slices (slices using >0.1% CPU)

Figure 2(b) shows the distribution of live slices. Note that at least 50% of PlanetLab nodes consistently have a live slice count within two of the median. Additional data indicates that this is a result of three factors. First, some monitoring slices (like CoMon and PlanetFlow) are live everywhere, and so create a lower bound on the number of live slices. Second, most researchers do not appear to greedily use more nodes than they need; for example, only 10% of slices are deployed on all nodes, and 60% are deployed on fewer than 50 nodes. We presume researchers are self-organizing their services and experiments onto disjoint sets of nodes so as to distribute load, although there are a small number of popular nodes that support over 25 live slices. Third, the slices that are deployed on all nodes are not live on all of them at once. For instance, in April 2006 we observed that CoDeeN was active on 436 nodes but live on only 269. Robust (and adaptive) long-running services are architected to dynamically balance load to less utilized nodes [33, 26].

Of course we did not know what PlanetLab’s workload would look like when we made many early design decisions. As reported in Section 2.2, one such decision was to use Linux+Vservers as the VMM, primarily because of the maturity of the technology. Since that time, alternatives like Xen have advanced considerably, but we have not felt compelled to reconsider this decision. A key reason is that PlanetLab nodes run up to 25 live VMs, and up to 90 active VMs, at a time. This is possible because we could build a system that supports resource overbooking and graceful degradation on a framework of Vserver-based VMs. In contrast, Xen allocates specific amounts of resources, such as physical memory and disk, to each VM. For example, on a typical PlanetLab node with 1GB memory, Xen can support only 10 VMs with 100MB memory each, or 16 with 64MB memory. Therefore, it’s not clear how a PlanetLab based on Xen could support our current user base. Note that the management architecture presented in the previous section is general enough to support multiple VM types (and a Xen prototype is running in the lab), but resource constraints make it likely that most PlanetLab slices will continue to use Vservers for the foreseeable future.

4.2 Graceful Degradation

PlanetLab’s usage model is to allow as many users on a node as want to use it, enable resource brokers that are able to secure guaranteed resources, and gracefully degrade the node’s performance as resources become over-utilized. This section describes the mechanisms that support such behavior and evaluates how well they work.

4.2.1 CPU

The earliest version of PlanetLab used the standard Linux CPU scheduler, which provided no CPU isolation between slices: a slice with 400 Java threads would get 400 times the CPU of a slice with one thread. This situation occasionally led to collapse of the system and revealed the need for a slice-aware CPU scheduler.

Fair share scheduling [32] does not collapse under load, but rather supports graceful degradation by giving each scheduling container proportionally fewer cycles. Since mid-2004, PlanetLab’s CPU scheduler has performed fair sharing among slices. During that time, however, PlanetLab has run three distinct CPU schedulers: v2 used the SILK scheduler [3], v3.0 introduced CKRM (a community project in its early stages), and v3.2 (the current version) uses a modification of Vserver’s CPU rate limiter to implement fair sharing and reservations. The question arises, why so many CPU schedulers?

The answer is that, for the most part, we switched CPU schedulers for reasons other than scheduling behavior. We switched from SILK to CKRM to leverage a community effort and reduce our code maintenance burden. However, at the time we adopted it, CKRM was far from production quality and the stability of PlanetLab suffered as a result. We then dropped CKRM and wrote another CPU scheduler, this time based on small modifications to the Vservers code that we had already incorporated into the PlanetLab kernel. This CPU scheduler gave us the capability to provide slices with CPU reservations as well as shares, which we lacked with SILK and CKRM. Perhaps more importantly, the scheduler was more robust, so PlanetLab’s stability dramatically improved, as shown in Section 5. We are solving the code maintenance problem by working with the Vservers developers to incorporate our modifications into their main distribution.

Figure 3: CPU % available on PlanetLab, by quartile

The current (v3.2) CPU scheduler implements fair sharing and work-conserving CPU reservations by overlaying a token bucket filter on top of the standard Linux CPU scheduler. Each slice has a token bucket that accumulates tokens at a specified rate; every millisecond, the slice that owns the running process is charged one token. A slice that runs out of tokens has its processes removed from the runqueue until its bucket accumulates a minimum number of tokens. This filter was already present in Vservers, which used it to put an upper bound on the amount of CPU that any one VM could receive; we simply modified it to provide a richer set of behaviors.

The rate at which tokens accumulate depends on whether the slice has a reservation or a share. A slice with a reservation accumulates tokens at its reserved rate: for example, a slice with a 10% reservation gets 100 tokens per second, since a token entitles it to run a process for one millisecond. The default share is actually a small reservation, providing the slice with 32 tokens every second, or 3% of the total capacity.

The main difference between reservations and shares occurs when there are runnable processes but no slice has enough tokens to run: in this case, slices with shares are given priority over slices with reservations. First, if there is a runnable slice with shares, tokens are given out fairly to all slices with shares (i.e., in proportion to the number of shares each slice has) until one can run. If there are no runnable slices with shares, then tokens are given out fairly to slices with reservations. The end result is that the CPU capacity is effectively partitioned between the two classes of slices: slices with reservations get what they’ve reserved, and slices with shares split the unreserved capacity of the machine proportionally.
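The per-slice accounting can be sketched as follows. This is a simplification of the modified Vservers mechanism, with parameter values chosen to match the examples above:

    class SliceBucket:
        # One token entitles a slice to run a process for 1ms.
        def __init__(self, rate, minimum=50, capacity=1000):
            self.rate = rate          # tokens accumulated per second
            self.minimum = minimum    # balance needed to rejoin runqueue
            self.capacity = capacity  # bucket never exceeds this
            self.tokens = capacity
            self.runnable = True

        def tick(self, dt_sec):
            # Accumulate tokens; a slice that ran dry rejoins the
            # runqueue only once it reaches the minimum balance.
            self.tokens = min(self.capacity,
                              self.tokens + self.rate * dt_sec)
            if self.tokens >= self.minimum:
                self.runnable = True

        def charge(self, ran_ms):
            # Each millisecond of CPU costs one token.
            self.tokens -= ran_ms
            if self.tokens <= 0:
                self.runnable = False

    # A 10% reservation accumulates 100 tokens/second; the default
    # share behaves like a small reservation of 32 tokens/second.
    reserved = SliceBucket(rate=100)
    default_share = SliceBucket(rate=32)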

CoMon indicates that the average PlanetLab node has its CPU usage pegged at 100% all the time. However, fair sharing means that an individual slice can still obtain a significant percentage of the CPU. Figure 3 shows, by quartile, the CPU availability across PlanetLab, obtained by periodically running a spinloop in the CoMon slice and observing how much CPU it receives. The data shows large amounts of CPU available on PlanetLab: at least 10% of the CPU is available on 75% of nodes, at least 20% CPU on 50% of nodes, and at least 40% CPU on 25% of nodes.

4.2.2 Memory

Memory is a particularly scarce resource on PlanetLab, and we were faced with choosing between four designs. One is the default Linux behavior, which either kernel panics or randomly kills a process when memory becomes scarce. This clearly does not result in graceful degradation. A second is to statically allocate a fixed amount of memory to each slice. Given that there are up to 90 active VMs on a node, this would imply an impractically small 10MB allocation for each VM on the typical node with 1GB of memory. A third option is to explicitly allocate memory to live VMs, and reclaim memory from inactive VMs. This implies the need for a control mechanism, but globally synchronizing such a mechanism across PlanetLab (i.e., to suspend a slice) is problematic at fine-grained time scales. The fourth option is to dynamically allocate memory to VMs on demand, and react in a more predictable way when memory is scarce.

We elected the fourth option, implementing a simple watchdog daemon, called pl_mom, that resets the slice consuming the most physical memory when swap has almost filled. This penalizes the memory hog while keeping the system running for everyone else.

Although pl_mom was noticeably active when first deployed—as users learned to not keep log files in memory and to avoid default heap sizes—it now typically resets an average of 3-4 VMs per day, with higher rates during heavy usage (e.g., major conference deadlines). For example, 200 VMs were reset during the two-week run-up to the OSDI deadline. We note, however, that roughly one-third of these resets were on under-provisioned nodes (i.e., nodes with less than 1GB of memory).
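The reset policy itself fits in a few lines. This sketch assumes hypothetical helpers for reading swap usage and per-slice memory, and omits pl_mom's logging and notification:

    import time

    SWAP_THRESHOLD = 0.9   # assumed trigger: swap is almost full

    def watchdog(node, period_sec=30):
        while True:
            if node.swap_used_fraction() > SWAP_THRESHOLD:
                # Reset the single slice consuming the most physical
                # memory, penalizing the hog while keeping the node
                # usable for everyone else.
                hog = max(node.slices(), key=node.resident_memory_mb)
                node.reset_slice(hog)
            time.sleep(period_sec)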

Figure 4 shows the cumulative distribution function of how much physical memory individual VMs were consuming when they were reset between November 2004 and April 2006. We note that about 10% of the resets (corresponding largely to the first 10% of the distribution) occurred on nodes with less than 1GB memory, where memory pressure was tighter. In over 80% of all resets, the slice had allocated at least 128MB. Half of all resets occurred when the slice was using more than 400MB of memory, which on a shared platform like PlanetLab indicates either a memory leak or poor experiment design (e.g., a large in-memory logfile).

Figure 4: CDF of memory consumed when slice reset

Figure 5 shows CoMon’s estimate of how many MB of memory are available on each PlanetLab node. CoMon estimates available memory by allocating 100MB, touching random pages periodically, and then observing the size of the in-memory working set over time. This serves as a gauge of memory pressure, since if physical memory is exhausted and another slice allocates memory, these pages would be swapped out. The CoMon data shows that a slice can keep a 100MB working set in memory on at least 75% of the nodes (since only the minimum and first-quartile lines are really visible), so it appears that there is not as much memory pressure on PlanetLab as we expected. This also reinforces our intuition that pl_mom resets slices mainly on nodes with too little memory or when the slice’s application has a memory leak.

Figure 5: Memory availability on PlanetLab
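CoMon's gauge can be approximated by the probe sketched below. The buffer size and touching behavior follow the description above; the sampling period and VmRSS parsing are our own scaffolding:

    import random, time

    def resident_kb():
        # Resident set size of this process, from /proc (Linux).
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
        return 0

    def memory_pressure_probe(size_mb=100, period_sec=10):
        # Hold a 100MB buffer and touch random pages periodically.
        # If physical memory is scarce, pages get swapped out and
        # the observed in-memory working set shrinks.
        buf = bytearray(size_mb * 1024 * 1024)
        while True:
            for _ in range(size_mb * 256):  # ~one touch per 4kB page
                buf[random.randrange(len(buf))] = 1
            print("resident working set: %d kB" % resident_kb())
            time.sleep(period_sec)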

Figure 5 shows CoMon’s estimate of how many MB ofmemory are available on each PlanetLab node. CoMonestimates available memory by allocating 100MB, touch-ing random pages periodically, and then observing thesize of the in-memory working set over time. This servesas a gauge of memory pressure, since if physical memoryis exhausted and another slice allocates memory, thesepages would be swapped out. The CoMon data showsthat a slice can keep a 100MB working set in memoryon at least 75% of the nodes (since only the minimumand first quartile line are really visible), so it appears thatthere is not as much memory pressure on PlanetLab as weexpected. This also reinforces our intuition that pl momresets slices mainly on nodes with too little memory orwhen the slice’s application has a memory leak.4.2.3 BandwidthHosting sites can cap the maximum rate at which thelocal PlanetLab nodes can send data. PlanetLab fairlyshares the bandwidth under the cap among slices, usingLinux’s Hierarchical Token Bucket traffic filter [15]. Thenode bandwidth cap allows sites to limit the peak rate atwhich nodes send data so that PlanetLab slices cannotcompletely saturate the site’s outgoing links.

The sustained rate of each slice is limited by thepl mom watchdog daemon. The daemon allows eachslice to send a quota of bytes each day at the node’s caprate, and if the slice exceeds its quota, it imposes a muchsmaller cap for the rest of the day. For example, if theslice’s quota is 16GB/day, then this corresponds to a sus-tained rate of 1.5Mbps; once the slice sends more than

100

1000

10000

100000

06/Jan/21 06/Feb/04 06/Feb/18 06/Mar/04 06/Mar/18 06/Apr/01 06/Apr/15

Tx

Ban

dwid

th(K

b/s)

1st Q Median 3rd Q Max

(a) Transmit bandwidth in Kb/s, by quartile

100

1000

10000

100000

06/Jan/21 06/Feb/04 06/Feb/18 06/Mar/04 06/Mar/18 06/Apr/01 06/Apr/15

Rx

Ban

dwid

th(K

b/s)

1st Q Median 3rd Q Max

(b) Receive bandwidth in Kb/s, by quartile

Figure 6: Sustained network rates on PlanetLab

16GB, it is capped at 1.5Mbps until midnight GMT. Thegoal is to allow most slices to burst data at the node’s caprate, but prevents slices that are sending large amounts ofdata from badly abusing the site’s local resources.There are two weaknesses of PlanetLab’s bandwidth

capping approach. First, some sites pay for bandwidthbased on the total amount of traffic they generate permonth, and so they need to control the node’s sustainedbandwidth rather than the peak. As mentioned, pl momlimits sustained bandwidth, but it operates on a per-slice(rather than per-node) basis and cannot currently be con-trolled by the sites. Second, PlanetLab does not currentlycap incoming bandwidth. Therefore, PlanetLab nodescan still saturate a bottleneck link by downloading largeamounts of data. We are currently investigating ways tofix both of these limitations.
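The per-slice limiter reduces to a daily byte-quota check. In this sketch the numbers mirror the 16GB/day example, and set_cap stands in for reconfiguring the slice's HTB rate:

    DAY_SEC = 86400

    def enforce_quota(slice_name, bytes_sent_today, quota_bytes,
                      node_cap_kbps, set_cap):
        # Below quota, a slice may burst at the node's cap; above it,
        # the slice is clamped to the sustained rate implied by its
        # quota until midnight GMT.
        sustained_kbps = quota_bytes * 8 / DAY_SEC / 1000
        if bytes_sent_today > quota_bytes:
            set_cap(slice_name, sustained_kbps)
        else:
            set_cap(slice_name, node_cap_kbps)

    # e.g., a 16GB/day quota yields 16e9 * 8 / 86400 / 1000,
    # or roughly 1481 kbps (~1.5Mbps), once the quota is exceeded.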

Figure 6: Sustained network rates on PlanetLab, by quartile: (a) transmit bandwidth in Kb/s; (b) receive bandwidth in Kb/s

Figure 6 shows, by quartile, the sustained rates at which traffic is sent and received on PlanetLab nodes since January 2006. These are calculated as the sums of the average transmit and receive rates for all the slices of the machine over the last 15 minutes. Note that the y-axis is logarithmic, and the minimum line is omitted from the graph. The typical PlanetLab node transmits about 1Mb/s and receives 500Kb/s, corresponding to about 10.8GB/day sent and 5.4GB/day received. These numbers are well below the typical node bandwidth cap of 10Mb/s. On the other hand, some PlanetLab nodes do actually have sustained rates of 10Mb/s both ways.


Figure 7: Disk usage, by quartile, on PlanetLab

4.2.4 Disk

PlanetLab nodes do not provide permanent storage: data is not backed up, and any node may be reinstalled without warning. Services adapt to this environment by treating disk storage as a cache and storing permanent data elsewhere, or else replicating data on multiple nodes. Still, a PlanetLab node that runs out of disk space is essentially useless. In our experience, disk space is usually exhausted by runaway log files written by poorly-designed experiments. This problem was mitigated, but not entirely solved, by the introduction of per-slice disk quotas in June 2005. The default quota is 5GB, with larger quotas granted on a case-by-case basis.

Figure 7 shows, by quartile, the disk utilization on PlanetLab. The visible dip shortly after May 2005 is when quotas were introduced. We note that, though disk utilization grows steadily over time, 75% of PlanetLab nodes still have at least 50% of their disks free. Some PlanetLab nodes do occasionally experience full disks, but most are old nodes that do not meet the current system requirements.

4.2.5 Jitter

CPU scheduling latency can be a serious problem for some PlanetLab applications. For example, in a packet forwarding overlay, the time between when a packet arrives and when the packet forwarding process runs will appear as added network latency to the overlay clients. Likewise, many network measurement applications assume low scheduling latency in order to produce precisely spaced packet trains. Many measurement applications can cope with latency by knowing which samples to trust and which must be discarded, as described in [29]. Scheduling latency is more problematic for routing overlays, which may have to drop incoming packets.

A simple experiment indicates how scheduling latency can affect applications on PlanetLab. We deploy a packet forwarding overlay, constructed using the Click modular software router [13], on six PlanetLab nodes co-located at Abilene PoPs between Washington, D.C. and Seattle. Our experiment then uses ping packets to compare the RTT between the Seattle and D.C. nodes on the network and on the six-hop overlay. Each of the six PlanetLab nodes running our overlay had load averages between 2 and 5, and between 5 and 8 live slices, during the experiment. We observe that the network RTT between the two nodes is a constant 74ms over 1000 pings, while the overlay RTT varies between 76ms and 135ms. Figure 8 shows the CDF of RTTs for the network (leftmost curve) and the overlay (rightmost curve). The overlay CDF has a long tail that is chopped off at 100ms in the graph.

Figure 8: RTT CDF on the network, and on the overlay with and without SCHED_RR

There are several reasons why the overlay could have its CPU scheduling latency increased, including: (1) if another task is running when a packet arrives, Click must wait to forward the packet until the running task blocks or exhausts its timeslice; (2) if Click is trying to use more than its “fair share”, or exceeds its CPU guarantee, then its token bucket CPU filter will run out of tokens and it will be removed from the runqueue until it acquires enough tokens to run; (3) even though the Click process may be runnable, the Linux CPU scheduler may still decide to schedule a different process; and (4) interrupts and other kernel activity may preempt the Click process or otherwise prevent it from running.

We can attack the first three sources of latency using existing scheduling mechanisms in PlanetLab. First, we give the overlay slice a CPU reservation to ensure that it will never run out of tokens during our experiment. Second, we use chrt to run the Click process on each machine with the SCHED_RR scheduling policy, so that it will immediately jump to the head of the runqueue and preempt any running task. The Proper service described in Section 3.5 enables our slice to run the privileged chrt command on each PlanetLab node.
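On Linux, the effect of chrt can also be had programmatically. This sketch assumes the forwarder's pid is known and that Proper has granted the necessary privilege:

    import os

    def boost_to_realtime(pid, rt_priority=1):
        # Move the packet forwarder into the SCHED_RR real-time class
        # so that, whenever it is runnable, it preempts ordinary
        # timesharing tasks (equivalent to: chrt --rr -p 1 <pid>).
        os.sched_setscheduler(pid, os.SCHED_RR,
                              os.sched_param(rt_priority))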

The middle curve in Figure 8 shows the results of re-running our experiment with these new CPU scheduling parameters. The overhead of the Click overlay, around 3ms, is clearly visible as the difference between the two leftmost curves. In the new experiment, about 98% of overlay RTTs are within 3ms of the underlying network RTT, and 99% are within 6ms. These CPU scheduling mechanisms are employed by PL-VINI, the VINI (VIrtual Network Infrastructure) prototype implemented on PlanetLab, to reduce latency introduced in an overlay network as an artifact of CPU scheduling delay [4].

We note two things. First, the obstacle to making this solution available on PlanetLab is primarily one of policy—choosing which slices should get CPU reservations and bumps to the head of the runqueue, since it is not possible to reduce everyone’s latency on a heavily loaded system. We plan to offer this option to short-term experiments via the Sirius brokerage service, but long-running routing overlays will need to be handled on a case-by-case basis. Second, while our approach can provide low latency to the Click forwarder in our experiment 99% of the time, it does not completely solve the latency problem. We hypothesize that the remaining CPU scheduling jitter is due to the fourth source of latency identified earlier, i.e., kernel activity. If so, we may be able to further reduce it by enabling kernel preemption, a feature already available in the Linux 2.6 kernel.

4.2.6 Remarks

Note that only limited conclusions can be drawn from the fact that there is unused capacity available on PlanetLab nodes. Users are adapting to the behavior of the system (including electing to not use it) and they are writing services that adapt to the available resources. It is impossible to know how many resources would have been used, even by the same workload, had more been available. However, the data does document that PlanetLab’s fair share approach is behaving as expected.

5 Operational StabilityThe need to maintain a stable system, while at the sametime evolving it based on user experience, has been a ma-jor complication in designing PlanetLab. This sectionoutlines the general strategies we adopted, and presentsdata that documents our successes and failures.5.1 StrategiesThere is no easy way to continually evolve a system thatis experiencing heavy use. Upgrades are potentially dis-ruptive for at least two reasons: (1) new features intro-duce new bugs, and (2) interface changes force users toupgrade their applications. To deal with this situation, weadopted three general strategies.First, we kept PlanetLab’s control plane (i.e., the ser-

First, we kept PlanetLab’s control plane (i.e., the services outlined in Section 3) orthogonal from the OS. This meant that nearly all of the interface changes to the system affected only those slices running management services; the vast majority of users were able to program to a relatively stable Linux API. In retrospect this is an obvious design principle, but when the project began, we believed our main task was to define a new OS interface tailored for wide-area services.

In fact, the one example where we deviated from this principle—by changing the socket API to support safe raw sockets [3]—proved to be an operational disaster because the PlanetLab OS looked enough like Linux that any small deviation caused disproportionate confusion.

Second, we leveraged existing software wherever possible. This was for three reasons: (1) to improve the stability of the system; (2) to lower the barrier-to-entry for the user community; and (3) to reduce the amount of new code we had to implement and maintain. This last point cannot be stressed enough. Even modest changes to existing software packages have to be tracked as those packages are updated over time. In our eagerness to reuse rather than invent, we made some mistakes, the most notable of which is documented in the next subsection.

Third, we adopted the well-established practice of rolling new releases out incrementally. This is for the obvious reason—to build confidence that the new release actually worked under realistic loads before updating all nodes—but also for a reason that’s somewhat unique to PlanetLab: some long-running services maintain persistent data repositories, and doing so depends on a minimal number of copies being available at any time. Updates that reboot nodes must happen incrementally if long-running storage services are to survive.

Note that while few would argue with these principles—and it is undoubtedly the case that we would have struggled had we not adhered to them—our experience is that many other factors (some unexpected) had a significant impact on the stability of the system. The rest of this section reports on these operational experiences.

5.2 Node Stability

We now chronicle our experience operating and evolving PlanetLab. Figure 9 illustrates the availability of PlanetLab nodes from September 2004 through April 2006, as inferred from CoMon. The bottom line indicates the PlanetLab nodes that have been up continuously for the last 30 days (stable nodes), the middle line is the count of nodes that came online within the last 30 days, and the top line is all registered PlanetLab nodes. Note that the difference between the bottom and middle lines represents the “churn” of PlanetLab over a month’s time, and the difference between the middle and top lines indicates the number of nodes that are offline. The vertical lines in Figure 9 are important dates, and the letters at the top of the graph let us refer to the intervals between the dates.

[Figure 9: Node Availability. Lines show stable nodes (up > 30 days), nodes active in the last 30 days, and all registered nodes. Intervals: A: run-up to NSDI ’05 deadline; B: after NSDI ’05 deadline; C: 3.0 rollout begins; D: 3.0 stable release; E: 3.1 stable release; F: 3.2 rollout begins; G: 3.2 stable release.]

There have clearly been problems providing the community with a stable system. Figure 9 illustrates several reasons for this:

• Sometimes instability stems from the community stressing the system in new ways. In Figure 9, interval A is the run-up to the NSDI ’05 deadline. During this time, heavy use combined with memory leaks in some experiments caused kernel panics due to Out-of-Memory errors. This is the common behavior of Linux when the system runs out of memory and swap space. pl_mom (Section 4.2.2) was introduced in response to this experience.

• Software upgrades that require a reboot obviously affect the set of stable nodes (e.g., intervals C and D), but installing buggy software has a longer-term effect on stability. Interval C shows the release of PlanetLab 3.0. Although the release had undergone extensive off-line testing, it had bugs, and a relatively long period of instability followed. PlanetLab was still usable during this period, but nodes rebooted at least once per month.

• The pl_mom watchdog is not perfect. There is a slight dip in the number of stable core nodes (bottom line) in interval D, when about 50 nodes were rebooted because of a slice with a fast memory leak; as memory pressure was already high on those nodes, pl_mom could not detect and reset the slice before the nodes ran out of memory.

We note, however, that the 3.2 software release from late December 2005 is the best so far in terms of stability: as of February 2006, about two-thirds of active PlanetLab nodes have been up for at least a month. We attribute most of this to abandoning CKRM in favor of VServers’ native resource management framework and a new CPU scheduler.

One surprising fact to emerge from Figure 9 is that a lot of PlanetLab nodes are dead (denoted by the difference between the top and middle lines). Research organizations gain access to PlanetLab simply by hooking up two machines at their local site.

This formula for growth has worked quite well: a low barrier-to-entry provided the right incentive to grow PlanetLab. However, there have never been well-defined incentives for sites to keep nodes online. Providing such incentives is obviously the right thing to do, but we note that the majority of the offline nodes are at sites that no longer have active slices—and at the time of this writing only 12 sites had slices but no functioning nodes—so it’s not clear what incentive will work.

Now that we have reached a fairly stable system, it becomes interesting to study the “churn” of nodes that are active yet are not included in the stable core. We find it useful to differentiate between three categories of nodes: those that came up that day (and stayed up), those that went down that day (and stayed down), and those that have rebooted at least once. Our experience is that, on a typical day, about 2 nodes come up, about 2 nodes go down, and about 4 nodes reboot. On 10% of days, at least 6 nodes come up or go down, and at least 8 nodes reboot.
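A sketch of how this three-way classification can be computed from two consecutive daily monitoring snapshots appears below; the structure and field names are hypothetical stand-ins for CoMon-style data, not an actual PlanetLab tool.

/* Hypothetical sketch: classify a node's day-over-day behavior
 * from two monitoring snapshots. Field names are illustrative. */
#include <stdint.h>

enum node_event { CAME_UP, WENT_DOWN, REBOOTED, STEADY };

struct snapshot {
    int     responding;  /* node answered the monitor that day */
    int64_t boot_time;   /* boot timestamp reported by the node */
};

static enum node_event classify(struct snapshot yesterday,
                                struct snapshot today)
{
    if (!yesterday.responding && today.responding)
        return CAME_UP;    /* came up that day (and stayed up) */
    if (yesterday.responding && !today.responding)
        return WENT_DOWN;  /* went down that day (and stayed down) */
    if (yesterday.responding && today.responding &&
        today.boot_time != yesterday.boot_time)
        return REBOOTED;   /* rebooted at least once in between */
    return STEADY;
}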

Looking at the archives of the support and planetlab-users mailing lists, we are able to identify the most common reasons nodes come up or go down: (1) a site takes its nodes offline to move them or change their network configuration, (2) a site takes its nodes offline in response to a security incident, (3) a site accidentally changes a network configuration that renders its nodes unreachable, or (4) a node goes offline due to a hardware failure. The last is the most common reason for nodes being down for an extended period of time; the third reason is the most frustrating aspect of operating a system that embeds its nodes in over 300 independent IT organizations.

Understanding the relative frequency of different sorts of site events may be important for designers of other large-scale distributed systems; this is a topic for further study.

5.3 Security Complaints

Of the operational issues that PlanetLab faces, responding to security complaints is perhaps the most interesting, if only because of what they say about the current state of the Internet. We comment on three particular types of complaints.

The most common complaints are the result of IDS alerts. One frequent scenario corresponds to a perceived DoS attack. These are sometimes triggered by a poorly designed experiment (in which case the responsible researchers are notified and expected to take corrective action), but they are more likely to be triggered by totally innocent behavior (e.g., 3 unsolicited UDP packets have triggered the threat of legal action). In other cases, the alerts are triggered by simplistic signatures for malware that could not possibly run in our Linux-based environment. In general, we observe that any traffic that deviates from a rather narrow range of acceptable behavior is increasingly viewed as suspect, which makes innovating with new types of network services a challenge.

An increasingly common type of complaint comes from home users monitoring their firewall logs. They see connections to PlanetLab nodes that they do not recognize, assume PlanetLab has installed spyware on their machines, and demand that it be removed. In reality, they have unknowingly used a service (e.g., a CDN) that has imposed itself between them and a server. Receiving packets from a location service that also probes the client to select the most appropriate PlanetLab node to service a request only exacerbates the situation [35, 9]. The take-away is that even individual users are becoming increasingly security-sensitive (if less security-sophisticated than their professional counterparts), which makes the task of deploying alternative services increasingly problematic.

Finally, PlanetLab nodes are sometimes identified as the source or sink of illegal content. In reality, the content is only cached on the node by a slice running a CDN service, but an overlay node looks like an end node to the rest of the Internet. PlanetLab staff use PlanetFlow to identify the responsible slice, which in turn typically maintains a log that can be used to identify the ultimate source or destination. This information is passed along to the authorities, when appropriate. While many hosting sites are justifiably gun-shy about such complaints, the main lesson we have learned is that trying to police content is not a viable solution. The appropriate approach is to be responsive and cooperative when complaints are raised.

6 Discussion

Perhaps the most fundamental issue in PlanetLab’s design is how to manage trust in the face of pressure to decentralize the system, where decentralization is motivated by the desire to (1) give owners autonomous control over their nodes and (2) give third-party service developers the flexibility they need to innovate.

At one end of the spectrum, individual organizations could establish bilateral agreements with those organizations that they trust, and with which they are willing to peer. The problem with such an approach is that reaching the critical mass needed to foster a large-scale deployment has always proved difficult. PlanetLab started at the other end of the spectrum by centralizing trust in a single intermediary—PLC—and it is our contention that doing so was necessary to getting PlanetLab off the ground. To compensate, the plan was to decentralize the system through two other means: (1) users would delegate the right to manage aspects of their slices to third-party services, and (2) owners would make resource allocation decisions for their nodes. This approach has had mixed success, but it is important to ask if these limitations are fundamental or simply a matter of execution.

With respect to owner autonomy, all sites are allowed to set bandwidth caps on their nodes, and sites that have contributed more than the two-node minimum required to join PlanetLab are allowed to give excess resources to favored slices, including brokerage services that redistribute those resources to others. In theory, sites are also allowed to blacklist slices they do not want running locally (e.g., because they violate the local AUP), but we have purposely not advertised this capability in an effort to “unionize” the research community: take all of our experiments, even the risky ones, or take none of them. (As a compromise, some slices voluntarily do not run at certain sites so as not to force the issue.) The interface by which owners express their wishes is clunky (and sometimes involves assistance from the PlanetLab staff), but there does not seem to be any architectural reason why this approach cannot provide whatever control over resource allocation owners require (modulo meeting the requirements for joining PlanetLab in the first place).

With respect to third-party management services, success has been much more mixed. There have been some successes—Stork, Sirius, and CoMon being the most notable examples—but this issue is a contentious one in the PlanetLab developer community. There are many possible explanations, including there being few incentives and many costs to providing 24/7 management services; users preferring to roll their own management utilities rather than learn a third-party service that doesn’t exactly satisfy their needs; and the API being too much of a moving target to support third-party development efforts.

While these possibilities provide interesting fodder for debate, there is a fundamental issue of whether the centralized trust model impacts the ability to deploy third-party management services. For those services that require privileged access on a node (see Section 3.5) the answer is yes—the PLC support staff must configure Proper to grant the necessary privilege(s). While in practice such privileges have been granted in all cases that have not violated PlanetLab’s underlying trust assumptions or jeopardized the stability of the operational system, this is clearly a limitation of the architecture.

Note that choice is not just limited to what management services the central authority approves, but also to what capabilities are included in the core system—e.g., whether each node runs Linux, Windows, or Xen. Clearly, a truly scalable system cannot depend on a single trusted entity making these decisions. This is, in fact, the motivation for evolving PlanetLab to the point that it can support federation. To foster federation we have put together a software distribution, called MyPLC, that allows anyone to create their own private PlanetLab, and potentially federate that PlanetLab with others (including the current “public” PlanetLab).


This returns us to the original issue of centralized versus decentralized trust. The overriding lesson of PlanetLab is that a centralized trust model was essential to achieving some level of critical mass—which in turn allowed us to learn enough about the design space to define a candidate minimal interface for federation—but that it is only by federating autonomous instances that the system will truly scale. Private PlanetLabs will still need bilateral peering agreements with each other, but there will also be the option of individual PlanetLabs scaling internally to non-trivial sizes. In other words, the combination of bilateral agreements and trusted intermediaries allows for flexible aggregation of trust.

7 Conclusions

Building PlanetLab has been a unique experience. Rather than leveraging a new mechanism or algorithm, it has required a synthesis of carefully selected ideas. Rather than being based on a pre-conceived design and validated with controlled experiments, it has been shaped and proven through real-world usage. Rather than being designed to function within a single organization, it is a large-scale distributed system that must be cognizant of its place in a multi-organization world. Finally, rather than having to satisfy only quantifiable technical objectives, its success has depended on providing various communities with the right incentives and being equally responsive to conflicting and difficult-to-measure requirements.

Acknowledgments

Many people have contributed to PlanetLab. Timothy Roscoe, Tom Anderson, and Mic Bowman have provided significant input to the definition of its architecture. Several researchers have also contributed management services, including David Lowenthal, Vivek Pai and KyoungSoo Park, John Hartman and Justin Cappos, and Jay Lepreau and the Emulab team. Finally, the contributions of the PlanetLab staff at Princeton—Aaron Klingaman, Mark Huang, Martin Makowiecki, Reid Moran, Faiyaz Ahmed, Brian Jones, and Scott Karlin—have been immeasurable.

We also thank the anonymous referees, and our shepherd, Jim Waldo, for their comments and help in improving this paper.

This work was funded in part by NSF Grants CNS-0520053, CNS-0454278, and CNS-0335214.

References

[1] ANNAPUREDDY, S., FREEDMAN, M. J., AND MAZIERES, D. Shark: Scaling File Servers via Cooperative Caching. In Proc. 2nd NSDI (Boston, MA, May 2005).

[2] AUYOUNG, A., CHUN, B., NG, C., PARKES, D., SHNEIDMAN, J., SNOEREN, A., AND VAHDAT, A. Bellagio: An Economic-Based Resource Allocation System for PlanetLab. http://bellagio.ucsd.edu/about.php.

[3] BAVIER, A., BOWMAN, M., CULLER, D., CHUN, B., KARLIN, S., MUIR, S., PETERSON, L., ROSCOE, T., SPALINK, T., AND WAWRZONIAK, M. Operating System Support for Planetary-Scale Network Services. In Proc. 1st NSDI (San Francisco, CA, Mar 2004).

[4] BAVIER, A., FEAMSTER, N., HUANG, M., PETERSON, L., AND REXFORD, J. In VINI Veritas: Realistic and Controlled Network Experimentation. In Proc. SIGCOMM 2006 (Pisa, Italy, Sep 2006).

[5] BRETT, P., KNAUERHASE, R., BOWMAN, M., ADAMS, R., NATARAJ, A., SEDAYAO, J., AND SPINDEL, M. A Shared Global Event Propagation System to Enable Next Generation Distributed Services. In Proc. 1st WORLDS (San Francisco, CA, Dec 2004).

[6] CLARK, D. D. The Design Philosophy of the DARPA Internet Protocols. In Proc. SIGCOMM ’88 (Stanford, CA, Aug 1988), pp. 106–114.

[7] LOWENTHAL, D. Sirius: A Calendar Service for PlanetLab. http://snowball.cs.uga.edu/dkl/pslogin.php.

[8] FREEDMAN, M. J., FREUDENTHAL, E., AND MAZIERES, D. Democratizing Content Publication with Coral. In Proc. 1st NSDI (San Francisco, CA, Mar 2004).

[9] FREEDMAN, M. J., LAKSHMINARAYANAN, K., AND MAZIERES, D. OASIS: Anycast for Any Service. In Proc. 3rd NSDI (San Jose, CA, May 2006).

[10] FU, Y., CHASE, J., CHUN, B., SCHWAB, S., AND VAHDAT, A. SHARP: An Architecture for Secure Resource Peering. In Proc. 19th SOSP (Lake George, NY, Oct 2003).

[11] HUANG, M., BAVIER, A., AND PETERSON, L. PlanetFlow: Maintaining Accountability for Network Services. ACM SIGOPS Operating Systems Review 40, 1 (Jan 2006).

[12] CAPPOS, J., AND HARTMAN, J. Stork: A Software Package Management Service for PlanetLab. http://www.cs.arizona.edu/stork.

[13] KOHLER, E., MORRIS, R., CHEN, B., JANNOTTI, J., AND KAASHOEK, M. F. The Click Modular Router. ACM Transactions on Computer Systems 18, 3 (Aug 2000), 263–297.

[14] LAI, K., RASMUSSON, L., ADAR, E., SORKIN, S., ZHANG, L., AND HUBERMAN, B. A. Tycoon: An Implementation of a Distributed Market-Based Resource Allocation System. Tech. Rep. arXiv:cs.DC/0412038, HP Labs, Palo Alto, CA, USA, Dec 2004.

[15] LINUX ADVANCED ROUTING AND TRAFFIC CONTROL. http://lartc.org/.

[16] LINUX VSERVERS PROJECT. http://linux-vserver.org/.

[17] MUIR, S., PETERSON, L., FIUCZYNSKI, M., CAPPOS, J., AND HARTMAN, J. Privileged Operations in the PlanetLab Virtualised Environment. SIGOPS Operating Systems Review 40, 1 (2006), 75–88.

[18] OPPENHEIMER, D., ALBRECHT, J., PATTERSON, D., AND VAHDAT, A. Distributed Resource Discovery on PlanetLab with SWORD. In Proc. 1st WORLDS (San Francisco, CA, Dec 2004).

[19] PARK, K., AND PAI, V. S. Scale and Performance in the CoBlitz Large-File Distribution Service. In Proc. 3rd NSDI (San Jose, CA, May 2006).

[20] PARK, K., PAI, V. S., PETERSON, L. L., AND WANG, Z. CoDNS: Improving DNS Performance and Reliability via Cooperative Lookups. In Proc. 6th OSDI (San Francisco, CA, Dec 2004), pp. 199–214.

[21] PETERSON, L., ANDERSON, T., CULLER, D., AND ROSCOE, T. A Blueprint for Introducing Disruptive Technology into the Internet. In Proc. HotNets–I (Princeton, NJ, Oct 2002).

[22] PETERSON, L., BAVIER, A., FIUCZYNSKI, M., MUIR, S., AND ROSCOE, T. PlanetLab Architecture: An Overview. Tech. Rep. PDN–06–031, PlanetLab Consortium, Apr 2006.

[23] RAMASUBRAMANIAN, V., PETERSON, R., AND SIRER, E. G. Corona: A High Performance Publish-Subscribe System for the World Wide Web. In Proc. 3rd NSDI (San Jose, CA, May 2006).

[24] RAMASUBRAMANIAN, V., AND SIRER, E. G. Beehive: O(1) Lookup Performance for Power-Law Query Distributions in Peer-to-Peer Overlays. In Proc. 1st NSDI (San Francisco, CA, Mar 2004), pp. 99–112.

[25] RAMASUBRAMANIAN, V., AND SIRER, E. G. The Design and Implementation of a Next Generation Name Service for the Internet. In Proc. SIGCOMM 2004 (Portland, OR, Aug 2004), pp. 331–342.

[26] RHEA, S., GODFREY, B., KARP, B., KUBIATOWICZ, J., RATNASAMY, S., SHENKER, S., STOICA, I., AND YU, H. OpenDHT: A Public DHT Service and its Uses. In Proc. SIGCOMM 2005 (Philadelphia, PA, Aug 2005), pp. 73–84.

[27] RITCHIE, D. M., AND THOMPSON, K. The UNIX Time-Sharing System. Communications of the ACM 17, 7 (Jul 1974), 365–375.

[28] HUEBSCH, R. PlanetLab Application Manager. http://appmanager.berkeley.intel-research.net/.

[29] SPRING, N., BAVIER, A., PETERSON, L., AND PAI, V. S. Using PlanetLab for Network Research: Myths, Realities, and Best Practices. In Proc. 2nd WORLDS (San Francisco, CA, Dec 2005).

[30] SPRING, N., WETHERALL, D., AND ANDERSON, T. Scriptroute: A Public Internet Measurement Facility. In Proc. 4th USITS (Seattle, WA, Mar 2003).

[31] PAI, V., AND PARK, K. CoMon: A Monitoring Infrastructure for PlanetLab. http://comon.cs.princeton.edu.

[32] WALDSPURGER, C. A., AND WEIHL, W. E. Lottery Scheduling: Flexible Proportional-Share Resource Management. In Proc. 1st OSDI (Monterey, CA, Nov 1994), pp. 1–11.

[33] WANG, L., PARK, K., PANG, R., PAI, V. S., AND PETERSON, L. L. Reliability and Security in the CoDeeN Content Distribution Network. In Proc. USENIX ’04 (Boston, MA, Jun 2004).

[34] WHITE, B., LEPREAU, J., STOLLER, L., RICCI, R., GURUPRASAD, S., NEWBOLD, M., HIBLER, M., BARB, C., AND JOGLEKAR, A. An Integrated Experimental Environment for Distributed Systems and Networks. In Proc. 5th OSDI (Boston, MA, Dec 2002), pp. 255–270.

[35] WONG, B., SLIVKINS, A., AND SIRER, E. G. Meridian: A Lightweight Network Location Service without Virtual Coordinates. In Proc. SIGCOMM 2005 (Philadelphia, PA, Aug 2005).

[36] ZHANG, M., ZHANG, C., PAI, V. S., PETERSON, L. L., AND WANG, R. Y. PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services. In Proc. 6th OSDI (San Francisco, CA, Dec 2004), pp. 167–182.