Fermilab Computing Division

CS Document 5643-v1

Physics and Data Science: Reflections from a physicist in industry

Document #:
CS-doc-5643-v1
Document type:
Presentation
Submitted by:
Oliver Gutsche
Updated by:
Oliver Gutsche
Document Created:
02 Nov 2015, 16:53
Contents Revised:
02 Nov 2015, 16:53
Metadata Revised:
02 Nov 2015, 16:53
Viewable by:
  • Public document
Modifiable by:


Abstract:
In the past few years, "data science" has become a catch-all phrase for everything from statistical analysis of business problems to dealing with large datasets. These are traditionally the domain of high-energy physics (HEP), so data science has become a popular destination for physicists leaving academia. In this talk, I argue that wisdom can flow in the other direction, too: physicists can learn from industry. I'll start with a personal view of how physicists and data scientists work and how they use data mining models. Unsupervised techniques, predictive models, change detection, and modeling-as-compression are less widely used in HEP than they could be.

The majority of the talk will focus on software tools: the "big data" ecosystem that includes Hadoop, Spark, Storm, Akka, lambda architectures, and more. Some of these systems are for distributing calculations that require internal coupling, in which parts of the problem depend on others. Although most physics analyses are "embarrassingly parallel," I argue that alignment and calibration are expectation-maximization problems, ideally suited to iterative map-reduce in Spark. Other tools are for making real-time data monitoring more robust, which has an obvious application in DAQ.
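
As a rough illustration of the iterative map-reduce pattern mentioned above, the sketch below runs a toy expectation-maximization (EM) fit in plain Python rather than Spark. It is not taken from the talk: the function names (map_partition, combine, em_fit), the two-component unit-variance Gaussian model, and the sample data are all assumptions chosen for brevity. Each pass maps over data partitions to collect partial sums, reduces them into one total, and updates the parameters before the next pass.

```python
import math
from functools import reduce

# Hypothetical toy model (an assumption, not from the talk): two Gaussian
# components with unit variance and equal weights; only the means are fitted.

def map_partition(points, means):
    """E-step on one partition: per component, (sum of weights, weighted sum of x)."""
    stats = [(0.0, 0.0) for _ in means]
    for x in points:
        likes = [math.exp(-0.5 * (x - m) ** 2) for m in means]
        norm = sum(likes)
        stats = [(w + l / norm, wx + x * l / norm)
                 for (w, wx), l in zip(stats, likes)]
    return stats

def combine(a, b):
    """Reduce step: add partial sums element-wise (associative, so order-free)."""
    return [(wa + wb, wxa + wxb) for (wa, wxa), (wb, wxb) in zip(a, b)]

def em_fit(partitions, means, iterations=10):
    """Iterate map -> reduce -> parameter update (M-step)."""
    for _ in range(iterations):
        partial = [map_partition(p, means) for p in partitions]  # map
        total = reduce(combine, partial)                         # reduce
        means = [wx / w for w, wx in total]                      # M-step
    return means

# Made-up data: two clusters centred on -5 and +5, split over two partitions.
partitions = [[-5.1, -4.9, 5.0], [5.1, 4.9, -5.0]]
fitted = em_fit(partitions, means=[-1.0, 1.0])
print(fitted)
```

In a distributed setting the list comprehension would presumably become a map over partitions on worker nodes, with only the small partial sums and the current parameters moving between iterations.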

Whereas most physics analyses are performed in C++ and ROOT, data scientists in industry use Java for almost all distributed systems and Python and R for laptop analyses. I will discuss the relative merits of these three classes of languages, and how some shortcomings can be mitigated.

I'll finish by talking about what I see as growing trends. Immutable data, monoids, and actors are programming constraints that make distributed calculations easier to reason about. Static typing, in particular a type-safe null, adds robustness to large-scale and long-running calculations. These techniques are becoming more popular in industry, and may be put to good use in physics as well.
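
To make the point about immutable data and monoids concrete, here is a minimal Python sketch (not from the talk). The Moments class and the summarize helper are hypothetical names introduced for illustration. A monoid is a value type with an associative combine operation and an identity element, so partial results computed on different workers can be merged in any grouping and still give the same answer.

```python
from dataclasses import dataclass
from functools import reduce

# Hypothetical example type (an assumption for illustration): an immutable
# accumulator of count, sum, and sum of squares.
@dataclass(frozen=True)
class Moments:
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    @staticmethod
    def of(x):
        """Lift a single value into the monoid."""
        return Moments(1, x, x * x)

    def combine(self, other):
        """Associative merge; Moments() is the identity element."""
        return Moments(self.n + other.n,
                       self.total + other.total,
                       self.total_sq + other.total_sq)

    def mean(self):
        return self.total / self.n

    def variance(self):
        return self.total_sq / self.n - self.mean() ** 2

def summarize(values):
    """Fold one partition into a single Moments value."""
    return reduce(Moments.combine, (Moments.of(v) for v in values), Moments())

# Three partitions, as if held by three workers (made-up data).
a, b, c = (summarize(p) for p in [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]])

left = a.combine(b).combine(c)    # (a + b) + c
right = a.combine(b.combine(c))   # a + (b + c)
print(left == right, left.mean())
```

Because the instances are frozen, no worker can modify a partial result in place, and because combine is associative, the merge tree can take whatever shape the scheduler prefers.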

Files in Document:
Associated with Events:
Physics and data science: reflections from a physicist in industry held on 30 Oct 2015 in WH1W