Fermilab Computing Division

Physics and data science: reflections from a physicist in industry

Full Title: Physics and data science: reflections from a physicist in industry
Date & Time: 30 Oct 2015 at 14:00
Event Location: WH1W
Event Topic(s): Computing Techniques Seminar
Event Moderator(s):
Event Info: Speaker: Jim Pivarski

In the past few years, "data science" has become a catch-all phrase for everything from statistical analysis of business problems to dealing with large datasets. These are traditionally the domain of high-energy physics (HEP), so data science has become a popular destination for physicists beyond academia. In this talk, I argue that wisdom can flow in the other direction, too: physicists can learn from industry. I'll start with a personal view of how physicists and data scientists work and how they use data mining models. Unsupervised techniques, predictive models, change detection, and modeling-as-compression are less widely used in HEP than they could be.

The majority of the talk will focus on software tools: the "big data" ecosystem that includes Hadoop, Spark, Storm, Akka, lambda architectures, and more. Some of these systems are for distributing calculations that require internal coupling, in which parts of the problem depend on others. Although most physics analyses are "embarrassingly parallel," I argue that alignment and calibration are expectation-maximization problems, ideally suited to iterative map-reduce in Spark. Other tools are for making real-time data monitoring more robust, which has an obvious application in DAQ.
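As a rough illustration of why alignment is a coupled, iterative problem rather than an embarrassingly parallel one, here is a minimal sketch in plain Python (not Spark, and not the method presented in the talk). It assumes a hypothetical toy detector with four layers, each shifted by an unknown offset, with layer 0 held fixed as the reference. All names (`TRUE_OFFSETS`, `make_tracks`, `map_partition`, `combine`, `align`) and all numbers are invented for the example. Each pass refits every track with the current offsets (a map over partitions), sums the per-layer residuals (a reduce), and updates the offsets; the alternation is in the spirit of expectation-maximization.

```python
import random
from functools import reduce

# Hypothetical layer misalignments; layer 0 is the fixed reference.
TRUE_OFFSETS = [0.0, 0.3, -0.2, 0.5]

def make_tracks(n_tracks, seed=1):
    """Simulate hits: true track position + layer offset + small noise."""
    rng = random.Random(seed)
    tracks = []
    for _ in range(n_tracks):
        x = rng.uniform(-10.0, 10.0)
        tracks.append([x + off + rng.gauss(0.0, 0.01) for off in TRUE_OFFSETS])
    return tracks

def map_partition(tracks, offsets):
    """Map step: refit each track with the current offsets and
    return (per-layer residual sums, number of tracks)."""
    sums = [0.0] * len(offsets)
    for hits in tracks:
        corrected = [h - o for h, o in zip(hits, offsets)]
        x_fit = sum(corrected) / len(corrected)  # track fit: mean of corrected hits
        for layer, c in enumerate(corrected):
            sums[layer] += c - x_fit
    return sums, len(tracks)

def combine(a, b):
    """Reduce step: add residual sums and track counts from two partitions."""
    return [x + y for x, y in zip(a[0], b[0])], a[1] + b[1]

def align(partitions, n_layers, n_iterations=60):
    """Iterate map-reduce passes; each pass moves offsets by the mean residual."""
    offsets = [0.0] * n_layers
    for _ in range(n_iterations):
        sums, count = reduce(combine, (map_partition(p, offsets) for p in partitions))
        # Layer 0 stays pinned at zero to remove the global-shift degeneracy.
        offsets = [0.0] + [o + s / count for o, s in zip(offsets[1:], sums[1:])]
    return offsets

tracks = make_tracks(200)
partitions = [tracks[i:i + 50] for i in range(0, len(tracks), 50)]
print(align(partitions, len(TRUE_OFFSETS)))
```

The coupling is visible in the loop: each track fit depends on the current offsets, and each offset update depends on all of the track fits, so the passes cannot be collapsed into a single independent job per track.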

Whereas most physics analyses are performed in C++ and ROOT, data scientists in industry use Java for almost all distributed systems and Python and R for laptop analyses. I will discuss the relative merits of these three classes of languages, and how some shortcomings can be mitigated.

I'll finish by talking about what I see as growing trends. Immutable data, monoids, and actors are programming constraints that make distributed calculations easier to reason about. Static typing, in particular a type-safe null, adds robustness to large-scale and long-running calculations. These techniques are becoming more popular in industry, and may be put to good use in physics as well.
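To make the appeal of immutable data and monoids concrete, here is a minimal sketch in Python; it is an illustration under stated assumptions, not code from the talk. The `Moments` class and the `summarize` helper are hypothetical names. A monoid is a type with an associative combine operation and an identity element, so partial results computed on different workers can be merged in any grouping and still give the same answer.

```python
from dataclasses import dataclass
from functools import reduce

@dataclass(frozen=True)  # frozen: instances are immutable
class Moments:
    """Running count, sum, and sum of squares; Moments() is the identity."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def __add__(self, other):
        # Associative combine: returns a new object, never mutates.
        return Moments(self.n + other.n,
                       self.total + other.total,
                       self.total_sq + other.total_sq)

    @staticmethod
    def of(x):
        """Lift a single value into the monoid."""
        return Moments(1, x, x * x)

    def mean(self):
        # Returns None for an empty summary instead of dividing by zero.
        return self.total / self.n if self.n else None

def summarize(values):
    """Fold a sequence of numbers into one Moments summary."""
    return reduce(lambda acc, x: acc + Moments.of(x), values, Moments())

# Two "workers" summarize their own chunks; the results merge afterwards.
left = summarize([1.0, 2.0])
right = summarize([3.0, 4.0])
print((left + right).mean())
```

Because the combine step is associative and has an identity, a framework is free to choose how to split the data and in what order to merge the pieces, which is what makes such calculations easier to reason about.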

Jim Pivarski did his graduate work at Cornell where he studied QCD in the Upsilon system with the CLEO detector. He then contributed to the commissioning of CMS at the LHC from 2006 to 2011 as a postdoc with Texas A&M University. In particular, he performed an alignment of the CMS muon system and related tracking and magnetic field studies. He later searched for exotic leptonic jets in early LHC data.

Almost five years ago, he joined Open Data Group, a small data analytics company. There, he works as a consultant for clients with big data problems. He has encountered datasets as varied as hyperspectral satellite photos, automobile traffic, network traffic, web trends, Twitter sentiment analyses, real-time geolocation, U.S. Census mining, and virtual machine performance. He has also helped pure analysts transition their models to production environments, and invented a specification for platform-independent mining models called the Portable Format for Analytics (PFA). This specification is now being standardized by the Data Mining Group, a vendor-neutral consortium.

Remote connection details:

Vidyo (requires a CERN account; lightweight accounts can be requested here: https://account.cern.ch/account/Externals/)

pin: 1010582680
link: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=b7SM4a9wTxrt

phone call link: http://information-technology.web.cern.ch/services/fe/howto/users-join-vidyo-meeting-phone
