Minutes of December 1, 2003 CD Operations Meeting

- 692 days without a lost time injury.
- Two walkthroughs.
- Six machine shop safety classes.

-- dCache delivered an average of 23 TB/day last week, peak of 33 TB on
-- Production farms re-processed 33 M events
   Time was lost to a dFarm problem and startup of a new data stream. Currently
   investigating an as yet unidentified limitation in throughput.
   Seeking help from experts of several sub-systems, as well as exploring
   possible changes in architecture.

   On Thursday, discovered a problem with calibrations that has prevented
   processing of new raw data using standard mechanisms. This should be
   fixed today. Also addressing the reason that the problem escaped

- Prechallenge event production.  Processing is supposed to be completed
  by Feb. 2004.  Using onsite CPUs and GRID-available offsite CPUs.  The
  CPU-bound jobs are running at 95% efficiency.  Disks are filling up at
  remote sites, causing jobs to be lost.  OSCAR application has 300 MB
  output logs.  Encountered a stage-out problem causing lots of work by
  hand.  Want to deploy a storage element with storage resource management
  capabilities, not just ftp.
- Discussion of dCache server software deployment at Fermilab.
- In preparation for pile-up processing, have set up dCache on all worker
  nodes to serve input for pile-up.
- Worker nodes: a handful of problems encountered in burn-in. Roughly 7%
  failure rate so far. Also, company did not install proper network cables.
  Not happy at all with performance of this vendor.
- Discussion about interactions with Angstrom vendor.

Farm: 9.7 M events collected, 25.4 M processed at FNAL, 20 M processed
offsite. p14 reprocessing: 524.7 M events to process, 325.4 M processed.
The network cables for the new farm and cab nodes are unacceptable.  Talking
to Angstrom.

Tape usage
167 TB 9940b,
113.9 TB LTO,
209 TB 9940a

Analysis stations:
                 Data analyzed   Events analyzed   Data transferred in
D0mino              18  TB          150 M                 -
CAB                 10  TB          325 M                 9    TB
Clued0              0.8 TB           13 M                 0.32 TB
D0Karlsruhe         3.5 TB           45 M                 0.2  TB

D0 also unhappy with Angstrom installation of new worker nodes.

CAB CPU usage shown.  Has been steadily climbing with time.
Data transfer into CAB shown.  Data transfer peaked in June-July for
Lepton Photon 03. 90 TB/month.

- Pictures of the week: M81 spiral galaxy, supernova factory galaxy, blue
  galaxy, all in the same piece of the sky.
- Two ITNAs completed.
- Processed all of the imaging data up through Thanksgiving evening.

- Finished processing on the farm of the Minos calibration detector.
  Ongoing use will be processing the Far Detector data.
- Issue with dCache discussed.
- Possibility of using DCCP instead of ENCP for farms data.
- Problems with CVS repository on Minos1.  check-access script dying.
- Network problems on WH12 for Windows systems using DHCP.

- Networking
 - Maintenance this Thursday. Minute of downtime.
 - DHCP node registration project continuing to make progress.  Hope
   to cut over CD within the next week and a half.
- Fiber
 - Upgrades in WH.
 - Minos near end equipment package.
- Storage
 - New microcode to fix the tape problem.
   Successfully read only 1 of 2 problem tapes.
   Took another dump and sent it back to STK.
- Security
 - Service Certificate expires on Thursday.  Replacement available on web page.
 - Couple of events with KDCs.  Failure of power controller.
 - Automation system paged computer security before support team knew about
   the automation system. Discussion of the automatic paging logic.

Finalizing the order for Microsoft software support. Cost has come down as
divs/secs/exps have sharpened their pencils to a fine point. Will be bugging
CD department heads about expensive packages -- Visio, Project, FrontPage.

Reminder: briefing for Angstrom post-mortem, Dec. 10.

1.      CSI

No major operational issues.

2.      DSG

The move of the cdfofline development boxes from fcdfora2 to fcdfora1 is
complete, except for reestablishing replication.  The sam and datafile
catalog schemas are up and running.

The size of catalog stores of events to the d0ofprd/sam database increased
10-fold last week (20 million events in < 24 hours).  Found that 'test' or
'garbage' data was being stored into the production database, slowing
the store process and filling the archive log area.  CD asks that D0 stop
storing test data into the production database, use integration instead,
and alert CD if/when testing needs to be done.

3.      ESS

-Off-site Loan Activity:
 NEW Loans: Bill Finstrom reported, C2054 - Affolder - CMS - UCSB - all
 signed off and waiting for HACK to go to Shipping - Liz has paperwork.

(I'm including this as a reminder that Logistics Support is available to
assist with off-site loan and shipping needs.)

4.      SCS

* Slow connections to Enstore from GP farms are being investigated. Datacomm
 and Enstore admins are helping. It seems to be software, as we have tested
 the hardware and it works OK to various other nodes. We are still

5.      CSG

There were about 100 FNAL Services certificates renewed last week.  This was
the first experience with a high volume of certificates to renew. The process
is not very robust, since each certificate has to be done one at a time.
There are a handful of certificates to renew this morning.  This week Rick
will again request that CDF and D0 clients specify experiment affiliation
in the comments field.

Over the 4-day holiday weekend, 7 Remedy tickets were created.  Five were
from automation:
     KDC     I-KRB-3         11/28 08:08
     IA      FNCOPS1         11/29 15:42
     KDC     I-KRB-6         11/29 22:35
     CSI     FNALDBC         11/30 00:12
     SCS     CDFFarm1        11/30 01:22

Operator reported items (2):
     SCS     FNCDF86         11/28 17:09
     DCN     NT Domain Srvr  11/28 21:06

- Discussion of certificates.
- Property Audit: concerned about the process of wiping disks when
  excessing equipment, and about the possibility of sensitive data on
  equipment taken offsite.

- No report.

Planning and Customer Support
- Briefing this Wednesday 9 a.m. about Metrics.
- Budget meeting a week from tomorrow.  Dec. 9 at 1 p.m.
- Christmas lunch Dec. 19.

- Status meetings continue.
- Proposed article in Fermilab Today about our contribution to SC2003.

- Doing well with power and cooling in FCC so far.
- CDF UPS currently running at 95%; halving fcdfsgi2 will free up some power.
- Run 2 farms power looks OK.
- Have begun the New Muon power upgrade for CDF and other systems.
- Wideband: going out to bid for demolition phase. Construction phase will
  follow.  Hard decisions needed.  Vicky wants a meeting with the experiments
  and the department heads this week about the extent of UPS in the building.
  Meeting this Thursday at 9 a.m.

Respectfully Submitted,
Robert Harris


> ---
> - Prechallenge event production.  Processing supposed to be completed
>   by Feb. 2004.  Using onsite CPUs and GRID available offsite CPUs.
>   Running at 95% efficiency the CPU bound jobs.  Disks are filling up at
>   remote sites causing jobs to be lost.  OSCAR application has 300 MB
>   output logs.

Just for the sake of a clear statement:

 - OSCAR application emits enormous stdout and stderr logs (3-4 MB per
   event).  MOP lets these flow through GASS cache, causing large loads
   on the MOP master and increased disk usage.