Fermilab Computing Division

Lark brings distributed high throughput computing to the network

Full Title: Lark brings distributed high throughput computing to the network
Date & Time: 03 May 2013 at 14:00
Event Location: WH One West
Event Moderator(s):
Event Info: Speaker: Brian Bockelman, OSG Developer, University of Nebraska–Lincoln

Abstract: Distributed High Throughput Computing has a long history of finding resources for user jobs. This involves a delicate matchmaking process. A job will try to describe the resources it needs (number of cores, megabytes of RAM, gigabytes of disk), and systems such as HTCondor attempt to find a matching worker node.
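As a rough illustration of the matchmaking step described above, the sketch below checks a job's resource request against a list of worker nodes and returns the first node that satisfies it. This is a toy model, not HTCondor's actual matchmaking language; the `Job`, `WorkerNode`, `matches`, and `find_match` names and all the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Job:
    """Resources a job asks for (hypothetical model)."""
    cores: int
    ram_mb: int
    disk_gb: int


@dataclass(frozen=True)
class WorkerNode:
    """Resources a worker node offers (hypothetical model)."""
    name: str
    cores: int
    ram_mb: int
    disk_gb: int


def matches(job: Job, node: WorkerNode) -> bool:
    # A node matches when it offers at least what the job requests.
    return (
        node.cores >= job.cores
        and node.ram_mb >= job.ram_mb
        and node.disk_gb >= job.disk_gb
    )


def find_match(job: Job, nodes: List[WorkerNode]) -> Optional[WorkerNode]:
    # Return the first matching node, or None if nothing fits.
    for node in nodes:
        if matches(job, node):
            return node
    return None


nodes = [
    WorkerNode("node-a", cores=2, ram_mb=4096, disk_gb=20),
    WorkerNode("node-b", cores=8, ram_mb=16384, disk_gb=100),
]
job = Job(cores=4, ram_mb=8192, disk_gb=50)
match = find_match(job, nodes)
print(match.name if match else "no match")
```

A real matchmaker also lets the node express requirements about the job and ranks candidates rather than taking the first fit; the sketch leaves both out for brevity.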

On highly distributed platforms such as the OSG, we've found that networking resources are relevant to the matchmaking process: Does the job need an incoming network connection? Does it require access to a special network? How much bandwidth is necessary? How much bandwidth is available? On a university cluster, the networking between the scheduler and worker nodes may be relatively homogeneous – but, on the OSG, the bandwidth between a scheduler and worker node may differ by an order of magnitude.

The NSF-funded Lark project (award #1245864) aims to study the matchmaking language, policy, and technical mechanisms needed to make HTCondor aware of the network layer. We hope to enable HTCondor to 1) reactively make scheduling decisions based on perfSONAR network monitoring and 2) proactively reconfigure each batch slot's network based on the job description.
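To make the "reactive" goal concrete, here is a minimal sketch of a scheduling decision that takes measured bandwidth into account. It assumes the measurements have already been collected into a plain dictionary; how they would be obtained from network monitoring is not shown, and the `pick_node` function, node names, and figures are all hypothetical rather than part of Lark or HTCondor.

```python
from typing import Dict, Optional


def pick_node(required_mbps: float, measured_mbps: Dict[str, float]) -> Optional[str]:
    """Pick the node with the highest measured bandwidth that meets the job's need.

    measured_mbps maps a node name to the bandwidth (in Mbit/s) observed
    between the scheduler and that node. Values here are made up.
    """
    eligible = {name: bw for name, bw in measured_mbps.items() if bw >= required_mbps}
    if not eligible:
        return None
    # Prefer the best-connected node among those that qualify.
    return max(eligible, key=eligible.get)


measurements = {"campus-node": 100.0, "remote-node-1": 1000.0, "remote-node-2": 800.0}
print(pick_node(500.0, measurements))
```

The order-of-magnitude spread in the made-up measurements mirrors the situation described for the OSG, where a bandwidth requirement can rule out otherwise suitable worker nodes.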

To manage the batch slot's network resource, we use a Linux feature called "network namespaces" to provide a per-batch-slot network device. This isolates each batch slot from the host network and other jobs, allowing us to provide job-specific configurations. We can further bridge the job's device onto the external network and give it an externally routable address (similar to how bridge networking works with the KVM hypervisor). If a job is addressable, a sufficiently intelligent network can treat it separately from other jobs. For example, some jobs will get access to a private network, while others stay on the public network.
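The sketch below lists the kind of `ip` commands that could create a network namespace for one batch slot and attach it to a host bridge. It only builds and prints the command lines; it does not run them, since doing so requires root. The use of a veth pair, the naming scheme, the `namespace_setup_commands` helper, and the bridge name `br0` are assumptions for illustration, not a description of how Lark itself does it. Address assignment is omitted.

```python
from typing import List


def namespace_setup_commands(slot: str, bridge: str) -> List[List[str]]:
    """Build iproute2 commands for a per-slot namespace (hypothetical layout)."""
    ns = f"ns-{slot}"          # namespace for the batch slot
    host_if = f"veth-{slot}"   # end of the veth pair that stays on the host
    job_if = f"job-{slot}"     # end that is moved into the job's namespace
    # Note: Linux interface names are limited to 15 characters.
    return [
        ["ip", "netns", "add", ns],
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", job_if],
        ["ip", "link", "set", job_if, "netns", ns],
        ["ip", "link", "set", host_if, "master", bridge],
        ["ip", "link", "set", host_if, "up"],
        ["ip", "netns", "exec", ns, "ip", "link", "set", job_if, "up"],
    ]


commands = namespace_setup_commands("slot1", "br0")
for cmd in commands:
    print(" ".join(cmd))
```

With the host end enslaved to a bridge that also contains the physical interface, the job's device sits on the external network, which is the bridging arrangement the abstract compares to KVM.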

Attend the talk to learn how we create per-job network devices and hook them into the network.

For more OSG Technology Area updates, with insights into life in Distributed High Throughput Computing, follow our blog.

Other documents for this event

CS-doc-# | Title | Author(s) | Topic(s) | Last Updated
5152-v1 | Bringing High Throughput Computing to the Network with Lark | Brian Bockelman | Networks | 28 Jun 2013
