Filecules and Small Worlds in the DZero Workload: Characteristics and Relevance for Data Management

Most of today's science depends on the processing of massive amounts of data in multi-institutional and even international collaborations. Usage patterns are particularly relevant in designing and evaluating resource management solutions, yet user workloads from production-mode wide-area science collaborations are rarely publicly available. This lack of evidence on usage characteristics has three significant consequences: (1) resource management solutions are evaluated on irrelevant traces; (2) quantitative comparison of alternative solutions to the same problem becomes impossible due to different experimental assumptions and synthetically generated workloads; and (3) solutions are designed in isolation, to fit the particular and possibly transitory needs of specific groups.

These concerns led us to analyze more than two years of workloads from DZero. In addition to finding that these workloads contradict previously accepted models, we discovered two novel data-usage patterns. First, a data-centric analysis reveals the existence of "filecules", groups of files that are always processed together. Second, a user-centric analysis uncovers small-world properties in data sharing that show emergent, interest-based grouping of users. We show that exploiting these patterns for designing resource management solutions leads to better scalability, lower costs, and increased adaptability to changing environments.

