Lark brings distributed high throughput computing to the network
|Full Title:||Lark brings distributed high throughput computing to the network|
|Date & Time:||03 May 2013 at 14:00|
|Event Location:||WH One West|
|Event Info:||Speaker: Brian Bockelman, OSG Developer, University of Nebraska-Lincoln|
Abstract: Distributed High Throughput Computing has a long history of finding resources for user jobs. This involves a delicate matchmaking process. A job will try to describe the resources it needs (number of cores, megabytes of RAM, gigabytes of disk), and systems such as HTCondor attempt to find a matching worker node.
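The resource description above can be sketched as a minimal HTCondor submit file. The `request_cpus`, `request_memory`, and `request_disk` commands are real HTCondor submit syntax; the job script name is a placeholder:

```
# Minimal HTCondor submit description: the job states the
# resources it needs, and the matchmaker finds a worker node
# whose machine ClassAd satisfies them.
universe       = vanilla
executable     = analyze.sh    # placeholder job script
request_cpus   = 4             # number of cores
request_memory = 2048          # megabytes of RAM
request_disk   = 10485760      # kilobytes of disk (~10 GB)
queue
```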
On highly distributed platforms such as the OSG, we've found that networking resources are relevant to the matchmaking process: Does the job need an incoming network connection? Does it require access to a special network? How much bandwidth is necessary? How much bandwidth is available? On a university cluster, the networking between the scheduler and worker nodes may be relatively homogeneous, but on the OSG the bandwidth between a scheduler and a worker node may differ by an order of magnitude.
The NSF-funded Lark project (award #1245864) aims to study the matchmaking language, policy, and technical mechanisms needed to make HTCondor aware of the network layer. We hope to enable HTCondor to 1) reactively make scheduling decisions based on perfSONAR network monitoring and 2) proactively reconfigure each batch slot's network based on the job description.
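Network-aware matchmaking could be expressed in HTCondor's ClassAd language along these lines; the attribute names below (`AvailableBandwidthMbps`, `HasPublicAddress`) are invented for illustration and are not Lark's actual attributes:

```
# Hypothetical ClassAd Requirements expression: a job asks for
# a worker node whose measured bandwidth (e.g. from perfSONAR
# monitoring) and network configuration meet its needs.
Requirements = (AvailableBandwidthMbps >= 100) && (HasPublicAddress == True)
```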
To manage the batch slot's network resource, we use a Linux feature called "network namespaces" to provide a per-batch-slot network device. This isolates each batch slot from the host network and other jobs, allowing us to provide job-specific configurations. We can further bridge the job's device onto the external network and give it an externally routable address (similar to how bridge networking works with the KVM hypervisor). If a job is addressable, a sufficiently intelligent network can treat it separately from other jobs. For example, some jobs will get access to a private network, while others stay on the public network.
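The per-slot setup described above can be sketched with standard iproute2 commands (these require root, and names like `job1` and `br0` are placeholders, not Lark's actual implementation):

```shell
# Create a network namespace for the batch slot.
ip netns add job1

# Create a veth pair: one end stays on the host, the other
# is moved into the job's namespace as its network device.
ip link add veth-job1 type veth peer name eth0-job1
ip link set eth0-job1 netns job1

# Attach the host end to an existing bridge (br0) so the job
# can be given an externally routable address, similar to
# bridge networking under the KVM hypervisor.
ip link set veth-job1 master br0
ip link set veth-job1 up

# Bring up the devices inside the job's namespace.
ip netns exec job1 ip link set lo up
ip netns exec job1 ip link set eth0-job1 up
```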
Attend this seminar to learn how we create per-job network devices and hook them into the network.
For more OSG Technology Area updates, with insights into life in Distributed High Throughput Computing, follow our blog.