Minutes of April 28, 2003 CD Operations Meeting

475 Days w/o Lost-time Injury

Div. Head Walkthroughs -  1
Dept. Head Walkthroughs - 0

ESH Training

ITNAs -  3 people in the division without ITNAs.

Sexual Harassment for Managers and Supervisors - May 6 and July 16

GERT -  reminder has been sent to those past due.

- Data Handling
  - Rob Kennedy completed a 1st pass of load tests of dCache with mixed results.
    When no staging from tape to disk is necessary, dCache handles 20 TB/day
    without problems.  When intensive staging from tape to disk was required,
    the system could not handle 20 TB/day due to a bug in dCache which caused
    multiple pools to be blocked and the tape reading jobs to time out.  The
    bug in the dCache server occasionally leads to "phantom" mover entries that
    take up all the active mover slots in a pool, thus blocking or at least
    reducing normal access to files in that pool for the lifetime of those
    phantoms.  The bug is being worked on by DESY with high priority.  The
    bug must be fixed before we can put dCache into production at CDF.
  - CDF apologizes for a computer security false alarm over the weekend
    involving the dCache admin nodes.  An alias of the ls command on the CAF
    monitoring account made a soft link show up as the file it links to, and
    led to false reports by two CDF department people that the systems had
    been hacked.
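
The false alarm above came from a listing that hid the fact that an entry was a soft link. A minimal sketch of how a link can be told apart from its target, assuming a POSIX system and Python: `os.lstat` inspects the link itself rather than following it. The minutes do not say which `ls` option the alias used (a dereferencing option such as `-L` would have this effect), and the file names below are hypothetical.

```python
import os
import stat
import tempfile


def describe_path(path):
    """Describe a path without following a symbolic link at its last component."""
    info = os.lstat(path)  # lstat examines the link itself, not its target
    if stat.S_ISLNK(info.st_mode):
        return "symlink -> " + os.readlink(path)
    if stat.S_ISDIR(info.st_mode):
        return "directory"
    return "regular file or other"


def demo():
    """Create a throwaway file and a soft link to it, then describe both."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "real_file")  # hypothetical file name
        link = os.path.join(tmp, "soft_link")    # hypothetical link name
        with open(target, "w") as handle:
            handle.write("example\n")
        os.symlink(target, link)
        return describe_path(target), describe_path(link), target


if __name__ == "__main__":
    for line in demo()[:2]:
        print(line)
```

Checking with `lstat` before drawing conclusions about an unexpected file is cheap and does not depend on how `ls` is aliased on a given account.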

- LHC2003 this week.
  Posters for LHC2003 were shown. Software distribution plots and
  software dependencies. dCache performance.  Network attached storage
  device tests and performance.

- Farm: 7.3 M events collected, 14.1 M processed.
  Still running p13.06.01.  The farms had an absolutely splendid week.
  Completed all the p13 reprocessing.
- 217.7TB in STK,  3TB
  81.0TB in LTO,  1 TB
- Bonnie's group to add tapes to the clean silo.
  We will be switching over to using 9940Bs for raw data, April 28.
- Analysis stations
               Projects   Data Analyzed   Events Analyzed   Data transferred in
  D0mino           1612           14 TB              100M                  6 TB
  CAB               292            7 TB              300M                  2 TB
  Clued0            231          1.2 TB               20M                0.7 TB
  D0Karlsruhe        36          0.2 TB               13M                0.2 TB
- With the Calibration DBservers turned on in p14, users who are
  re-reconstructing picked event samples are seeing large overhead in getting
  the constants.  Half an hour per run is typical, and it can take as long as
  two hours, while users frequently have only one or two events per run.  The
  hope is that, because the farm processes large chunks of runs together and
  a partition takes a long time to process, the current calibration dbserver
  performance will be adequate for production.
- NFS on D02ka hung once last week and was rebooted using the script.  SGI
  has requested dumps on all reboots and wants to take over the system the
  next time the system hangs during working hours.
- Some problem with memory on D0ora1 over the weekend, SAM team to investigate.
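
The calibration overhead reported above can be put in per-event terms. This is a rough sketch only: it spreads the per-run fetch time quoted in the minutes (half an hour typical, two hours worst case) over the events in a run. The farm figure of 100,000 events per run is an invented assumption for illustration, not a number from the meeting.

```python
def overhead_minutes_per_event(fetch_minutes_per_run, events_per_run):
    """Calibration-fetch overhead charged to each event of a run, in minutes."""
    if events_per_run <= 0:
        raise ValueError("events_per_run must be positive")
    return fetch_minutes_per_run / events_per_run


# Figures quoted in the minutes: 30 minutes typical, up to 120 minutes,
# with one or two picked events per run.
typical_one_event = overhead_minutes_per_event(30, 1)
typical_two_events = overhead_minutes_per_event(30, 2)
worst_one_event = overhead_minutes_per_event(120, 1)

# ASSUMPTION: 100,000 events per run is an invented figure for a farm job.
FARM_EVENTS_PER_RUN = 100_000
farm_seconds_per_event = overhead_minutes_per_event(30, FARM_EVENTS_PER_RUN) * 60

if __name__ == "__main__":
    print("picked events, typical:", typical_one_event, "or",
          typical_two_events, "minutes per event")
    print("picked events, worst case:", worst_one_event, "minutes per event")
    print("farm (assumed run size):", farm_seconds_per_event, "seconds per event")
```

Under that assumption the same fetch cost is negligible per event on the farm, which is the reasoning behind the hope that current dbserver performance is adequate for production.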

- Gamma-ray burst caught by SDSS.  It is a smoking gun that some gamma-ray
  bursts are caused by supernovae.
- Running out of DLT 4000 tapes.

- Minos starting to think about data handling requirements.  Exploring
  TDCachefile in Root.

- Networking
  - Thursday Maintenance: new power supply and modules for CDF.  5 minute
    downtime between 6:30 and 7.  Changing routing authentication for
    CDF workgroup LAN.  Trying to do the same for D0.
  - Gigabit links installed between fcc1 and fcc2.
  - WH10 waiting on materials for desktop upgrades.
  - Misc work in conference rooms.
  - Started working with DCN on location of switches for CDF & D0 farms.
  - Offsite fiber: conference call with Ciena.
- Security
  - Peer review completed and a success.
  - This week going to catch up with GRID projects.
- Storage
  - CDF enstore operations have not been smooth. Myriad of errors.
    Problem logging raw data 3 times in the last month.

1.      CSS
1.1     Upcoming Downtimes

Thursday, May 1st Quick reboot to install Netscape IMAP patches on

Email gateway relay cutoff date May 15th. Contacting users.

1.2     DSG

The nic box that runs the miscomp production report server did not reboot.
M.Mihalek is investigating. (This means any miscomp oracle production
reports could not be run.)

A request was made to attach .pdf files to reqs and send to mms. A test
proved unsuccessful: the 1st page of the .pdf was blank. It was found that
this is a known bug in Oracle Financials that will require many patches. BSS
will make a problem report, but this is a low priority for them.

There have been several problems with the new printed req report, which
frequently hits a formatter error. The root cause is not yet determined;
S.Jones is working on it.

Work on node registration included several meetings to discuss requirements
and planning. J.Trumbo, N.Ho and S.Jones participated. S.Jones created a
report for I.Gaines listing system managers with 3 or more systems, and a
report for A.Walters showing location changes based on node registrations.

1.2.2   Run II DB's

D0 calib user server d0dbsrv5 automatically failed over to the failover node
when its 2 GB of RAM was exhausted. This could indicate that 2 GB of RAM is
insufficient to run the user server. Some of the users are scheduled to move
off of d0dbsrv5 to d0dbsrv4, but Taka has not yet coordinated that move.

1.3     CSI
1.3.1   Email

IMAPSERVERB problems on Thursday night/Friday am. Tracked to 500MB file.
IMAPSERVERs now set to 50MB file size limit.

1.3.2   AFS

Restarted AFS processes on FSUS02 on Friday at 6:52pm because of WWW
slowness for about 1 hour. Still investigating.

Waiting for Division to approve Sine Nomine req for AFS & OpenAFS support.

Installed OpenAFS v1.2.9 on abyss and fsus01. Seems to be working fine.
Systems also patched with latest Solaris8 patches. Tested OpenAFS Irix 6.5
on oasis.

1.3.3   Storage

Ray is reading up on the Lustre filesystem and the Ibrix filesystem. The
Lustre filesystem is already being funded by DOE and seems to be in use by

1.4     TOC
1.4.1   Windows Domains

We patched the NT4 server systems for the WebDAV vulnerability (Unchecked
Buffer in Windows Component).

An OU admin inadvertently deleted a user account in the FERMI domain. This
should not have been possible, as the W2K domain has policies in place to
prevent it, but they did not work. Ken implemented a workaround, and all OUs
have been updated. An incident has been opened with Microsoft, and they are
looking into the problem.

FNALBDC-CDF started having trouble the weekend before last (mentioned in the
last op report). It turned out the system's hard drive was failing, but the
system was still up in degraded mode. A new hard drive was installed, and we
rebuilt the system from backups.  It took about 6 hours from diagnosis to
having the system back up.

1.5     SCS
1.5.1   Farms Eval

Reqs for worker nodes for FT farms (32) and FNALU (16) are in the queue.
Final technical specs are being polished. Results of this bid will decide
the "top five" vendors for future procurements for the next two years.

Eval machines have all been sent back.

1.5.2   Run II Farms

D0 farms network glitch update: there may be a NIC firmware issue here.
Research by Chuck from datacom and Ken has turned up much information on the
net. Further work is being done to see what may help us. Special monitors
have also been put in place to catch errors and data.

1.5.3   Monitoring

Ganglia is now running on the Suns and SGIs in fnalu. Had a group meeting on
Ganglia. We are trying to standardize and scale to thousands of nodes.

See: http://flxd01.fnal.gov/

1.6     ELS
1.6.1   Maintenance Contracts

Non-CD Chargeback reports were sent on Friday.  We have asked that they be
returned to us by May 16th for processing.   Final numbers need to be turned
in to Mike Smith by the end of May.

SGI O2000s - We have requested a quote for the renewal of the SGI O2000
maintenance contract.

We are also trying to set up a meeting with SGI to discuss extending the
current site contract for two more years.

1.7     ESS
1.7.1   Beams Support

Continued significant effort toward modifying BPM Pre-Amps for BD.  Received
PCBs for the new BPM Pre-Amp daughter cards designed by ESE, called the
"Calibration Board".  Stew Bledsoe assembled (2) prototype calibration
boards and will work with Marcos T. (ESE) this week testing the prototype
boards. Peter Prieto (BD) will be given one of the prototype calibration
boards before ESS assembles additional boards.

1.7.2   Run II Support

-Dzero: Received the replacement capacitors for the D0 BiRa/Benchmrk  HV Pod
ECO. Completed the following D0 items (April 21 - April 25)
 (12) Benchmrk: HVS2P HV Pods
 (3) BiRa: VMEHVS2P HV Pods

-CDF: Completed the following CDF items (April 21 - April 25)
 (1) Motorola: 2301
 (1) Motorola: 2400-0323
 (2) Fermi: VRB-10

- New recommended root version v3_05_04b.

Planning and Customer Support
- Helpdesk report: 2100 tickets this year. 84% destined for Computing
  Division, 14% destined for Beams. Tickets by group were also shown.
  179 tickets created with NGOP. 53% processed as expected. 9% corrections
  made to NGOP or Remedy.  31% maintenance not set: tickets cut through
  NGOP where the sysadmin did not set the system downtime in NGOP, so a
  ticket got cut when the downtime was detected.
- Budget meeting tomorrow 1 p.m. FCC1
- System usage shown for farms.
- Effort reporting due tonight.
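
The NGOP ticket percentages in the helpdesk report above can be turned into approximate ticket counts. This is an illustrative sketch: the percentages in the minutes are already rounded, so the counts are estimates, and the category labels are shorthand chosen here rather than helpdesk terminology.

```python
def breakdown(total, shares_percent):
    """Convert percentage shares of `total` into approximate whole counts.

    Returns the estimated counts and the percentage not covered by any share.
    """
    counts = {
        label: round(total * percent / 100)
        for label, percent in shares_percent.items()
    }
    unaccounted_percent = 100 - sum(shares_percent.values())
    return counts, unaccounted_percent


NGOP_TICKETS = 179  # from the minutes
NGOP_SHARES = {     # percentages from the minutes; labels are shorthand
    "processed as expected": 53,
    "corrections to NGOP or Remedy": 9,
    "maintenance not set": 31,
}

counts, unaccounted = breakdown(NGOP_TICKETS, NGOP_SHARES)

if __name__ == "__main__":
    for label, count in counts.items():
        print(f"{label}: about {count} tickets")
    print(f"not covered by the listed categories: {unaccounted}%")
```

The three listed categories add up to 93%, so roughly 7% of the NGOP tickets fall outside the categories recorded in these minutes.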

- No report.

- Discussion of safety here and in private industry.  Stressed the need
  for safety walkthroughs.

Respectfully Submitted,
Robert Harris