Minutes of April 28, 2003 CD Operations
475 Days w/o Lost-time Injury
Div. Head Walkthroughs - 1
Dept. Head Walkthroughs - 0
ITNAs - 3 people in the division without ITNAs.
Sexual Harassment for Managers and Supervisors - May 6 and July 16
GERT - reminder has been sent to those past due.
- Data Handling
- Rob Kennedy completed a 1st pass of load tests of dCache with mixed results.
When no staging from tape to disk is necessary, dCache handles 20 TB/day
without problems. When intensive staging from tape to disk was required
the system could not handle 20 TB/day due to a bug in dCache which caused
multiple pools to be blocked and the tape reading jobs to time out. The
bug in the dCache server occasionally leads to "phantom" mover entries that
take up all the active mover slots in a pool, thus blocking or at least
reducing normal access to files in that pool for the lifetime of those
phantoms. The bug is being worked on by DESY with high priority. The
bug must be fixed before we can put dCache into production at CDF.
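For scale, the 20 TB/day load-test figure can be converted to a sustained transfer rate. This is only a back-of-the-envelope sketch; it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which the minutes do not state, and the function name is illustrative.

```python
# Convert the load-test target quoted above into a sustained transfer rate.
# Assumption: 1 TB = 10**12 bytes and 1 MB = 10**6 bytes (decimal units).

SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_mb_per_s(tb_per_day: float) -> float:
    """Average rate in MB/s needed to move tb_per_day terabytes in one day."""
    bytes_per_day = tb_per_day * 10**12
    return bytes_per_day / SECONDS_PER_DAY / 10**6


if __name__ == "__main__":
    rate = sustained_rate_mb_per_s(20)
    print(f"20 TB/day is about {rate:.0f} MB/s sustained")
```

Under these assumptions the target works out to roughly 230 MB/s averaged over the day; peak rates during staging would need to be higher.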
- CDF apologizes for a computer security false alarm over the weekend
involving the dCache admin nodes. An alias of the ls command on the CAF
monitoring account made a soft link show up as the file it links to, and
led to false reports by two CDF department people that the systems had
- LHC2003 this week.
Posters for LHC2003 were shown. Software distribution plots and
software dependencies. dCache performance. Network attached storage
device tests and performance.
- Farm 7.3 M events collected, 14.1 M processed.
Still running p13.06.01. The farms had an absolutely splendid week.
Completed all the p13 reprocessing.
- 217.7TB in STK, 3TB
81.0TB in LTO, 1 TB
- Bonnie's group to add tapes to the clean silo.
We will be switching over to using 9940Bs for raw data on April 28.
- Analysis stations
Station       Projects   Data Analyzed   Events Analyzed   Data transferred in
D0mino        1612       14 TB           100M              6 TB
CAB           292        7 TB            300M              2 TB
Clued0        231        1.2 TB          20M               0.7 TB
D0Karlsruhe   36         0.2 TB          13M               0.2 TB
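The station figures above can be totalled and turned into a rough "data read per event analyzed" number for comparing stations. This is a sketch only: the values are copied from the table, decimal units are assumed (1 TB = 10^9 kB), and the names STATIONS, totals and kb_per_event are illustrative, not part of any D0 tool.

```python
# Station figures copied from the table above.
# Columns: projects, data analyzed (TB), events analyzed (millions),
#          data transferred in (TB)
STATIONS = {
    "D0mino": (1612, 14.0, 100, 6.0),
    "CAB": (292, 7.0, 300, 2.0),
    "Clued0": (231, 1.2, 20, 0.7),
    "D0Karlsruhe": (36, 0.2, 13, 0.2),
}


def totals(stations):
    """Sum each column across all stations (TB values rounded to 0.1)."""
    projects = sum(row[0] for row in stations.values())
    data_tb = round(sum(row[1] for row in stations.values()), 1)
    events_m = sum(row[2] for row in stations.values())
    transferred_tb = round(sum(row[3] for row in stations.values()), 1)
    return projects, data_tb, events_m, transferred_tb


def kb_per_event(data_tb, events_millions):
    """Average data read per analyzed event, in kB (assumes 1 TB = 10**9 kB)."""
    return data_tb * 1e9 / (events_millions * 1e6)


if __name__ == "__main__":
    print("totals:", totals(STATIONS))
    for name, (_, data_tb, events_m, _) in STATIONS.items():
        print(f"{name}: {kb_per_event(data_tb, events_m):.1f} kB/event")
```

The per-event figure is a derived average, not a measured event size; it mainly shows that the stations were reading quite different data tiers or skims.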
- With the Calibration DBservers turned on in p14, users who are
re-reconstructing picked event samples are seeing large overhead in getting
the constants. Half an hour per run is typical, and it can be as long as two
hours, while users frequently have only one or two events per run. The
hope is that, since the farm runs large chunks of runs together and it
takes a long time to process a partition, the current calibration
dbserver performance will be adequate for production.
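The argument above is an amortization one: the per-run cost of fetching constants is spread over however many events are processed from that run. A minimal sketch follows; the 30-minute and 2-hour fetch times and the one-or-two-event counts come from the report, while the 100,000-event farm figure is a purely hypothetical number chosen for illustration.

```python
def overhead_per_event_s(fetch_minutes: float, events_in_run: int) -> float:
    """Calibration fetch time per run, amortized over the events processed."""
    if events_in_run <= 0:
        raise ValueError("events_in_run must be positive")
    return fetch_minutes * 60 / events_in_run


if __name__ == "__main__":
    # Picked-event user: typical 30 min fetch, one or two events per run.
    print(overhead_per_event_s(30, 1))    # seconds per event
    print(overhead_per_event_s(30, 2))
    # Worst case quoted in the report: two hours for a single event.
    print(overhead_per_event_s(120, 1))
    # Hypothetical farm job processing 100,000 events from the same run.
    print(overhead_per_event_s(30, 100_000))
```

With only a handful of events per run the fetch dominates the job, whereas a job that processes a large share of a run pays the same fetch cost only once.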
- NFS on D02ka hung once last week and was rebooted using the script. SGI
has requested dumps on all reboots and wants to take over the system the
next time the system hangs during working hours.
- Some problem with memory on D0ora1 over the weekend, SAM team to investigate.
- Gamma Ray burst caught by SDSS. It is a smoking gun that some gamma ray
bursts are caused by supernovae.
- Running out of DLT 4000 tapes.
- Minos starting to think about data handling requirements. Exploring
TDCacheFile in ROOT.
- Thursday Maintenance: new power supply and modules for CDF. 5 minute
downtime between 6:30 and 7. Changing routing authentication for
CDF workgroup LAN. Trying to do the same for D0.
- Gigabit links installed between fcc1 and fcc2.
- WH10 waiting on materials for desktop upgrades.
- Misc work in conference rooms.
- Started working with DCN on location of switches for CDF & D0 farms.
- Offsite fiber: conference call with Ciena.
- Peer review completed and a success.
- This week going to catch up with GRID projects.
- CDF enstore operations have not been smooth. Myriad of errors.
Problem logging raw data 3 times in the last month.
1.1 Upcoming Downtimes
Thursday, May 1st Quick reboot to install Netscape IMAP patches on
Email gateway relay cutoff date May 15th. Contacting users.
The nic box that runs the miscomp production report server did not reboot.
M.Mihalek is investigating. (This means any miscomp Oracle production
reports could not be run.)
A request was made to attach .pdf files to reqs and send to mms. A test
proved unsuccessful. The 1st page of the .pdf was blank. It was found that
this is a known bug in Oracle Financials that will require many patches. BSS
will make a problem report, but this is a low priority for them.
There have been several problems with the new printed req report. It
frequently gets a formatter error. The root cause is not yet determined;
S.Jones is working on it.
Work on node registration included several meetings to discuss requirements
and planning. J.Trumbo, N.Ho and S.Jones participated. S.Jones created a
report for I.Gaines listing system managers with 3 or more systems, and a
report for A.Walters showing location changes based on node registrations.
1.2.2 Run II DB's
D0 calib user server d0dbsrv5 automatically failed over to the failover node
after its 2 GB of RAM was exhausted. This RAM exhaustion could indicate that
2 GB of RAM is insufficient to run the user server. Some of the users are
scheduled to move off of d0dbsrv5 to d0dbsrv4, but Taka
has not yet coordinated that move.
IMAPSERVERB problems on Thursday night/Friday am. Tracked to 500MB file.
IMAPSERVERs now set to 50MB file size limit.
Restarted AFS processes on FSUS02 on Friday at 6:52pm because of WWW
slowness for about 1 hour. Still investigating.
Waiting for Division to approve Sine Nomine req for AFS & OpenAFS support.
Installed OpenAFS v1.2.9 on abyss and fsus01. Seems to be working fine.
Systems also patched with latest Solaris8 patches. Tested OpenAFS Irix 6.5
Ray is reading up on the Lustre filesystem and the Ibrix filesystem. The
Lustre filesystem is already being funded by DOE and seems to be in use by
LLNL, LANL, and SNL.
1.4.1 Windows Domains
We patched the NT4 server systems for the WebDAV vulnerability (Unchecked
Buffer in Windows Component).
Inadvertently, an OU admin deleted a user account in the FERMI domain. This
should not have happened as the W2K domain has policies in place to stop it.
That did not work. Ken implemented a workaround, and all OUs have been
updated. An incident has been opened with Microsoft, and they are looking
into the problem.
FNALBDC-CDF started to have troubles weekend before last (mentioned in last
op report). It turned out the system's hard drive was failing, but the
system was still up in degraded mode. A new hard drive was installed, and we
rebuilt the system from backups. It was about 6 hours from diagnosis to
system back up.
1.5.1 Farms Eval
Reqs for worker nodes for FT farms (32) and FNALU (16) are in the queue.
Final technical specs are being polished. Results of this bid will decide
the "top five" vendors for future procurements for the next two years.
Eval machines have all been sent back.
1.5.2 Run II Farms
D0 FARMS network glitch update: there may be a NIC firmware issue here. Some
research by Chuck from datacom and Ken has found much info on the net.
Further work is being done to see what may help us. Also, some special
monitors have been put in place to catch errors and data.
Ganglia running on Suns and SGIs in FNALU now. Had a group meeting on
Ganglia. We are trying to standardize and scale to 1000's of nodes.
1.6.1 Maintenance Contracts
Non-CD Chargeback reports were sent on Friday. We have asked that they be
returned to us by May 16th for processing. Final numbers need to be turned
in to Mike Smith by the end of May.
SGI O2000's - We have requested a quote for the renewal of the SGI O2000's
We are also trying to set up a meeting with SGI to discuss extending the
current site contract for two more years.
1.7.1 Beams Support
Continued significant effort toward modifying BPM Pre-Amps for BD. Received
PCBs for the new BPM Pre-Amp daughter cards designed by ESE, called the
"Calibration Board". Stew Bledsoe assembled (2) prototype calibration
boards and will work with Marcos T. (ESE) this week testing the prototype
boards. Peter Prieto (BD) will be given one of the prototype calibration
boards before ESS assembles additional boards.
1.7.2 Run II Support
-Dzero: Received the replacement capacitors for the D0 BiRa/Benchmrk HV Pod
ECO. Completed the following D0 items (April 21 - April 25)
(12) Benchmrk: HVS2P HV Pods
(3) BiRa: VMEHVS2P HV Pods
-CDF: Completed the following CDF items (April 21 - April 25)
(1) Motorola: 2301
(1) Motorola: 2400-0323
(2) Fermi: VRB-10
- New recommended root version v3_05_04b.
Planning and Customer Support
- Helpdesk report: 2100 tickets this year. 84% destined to computing
division, 14% destined to beams. Tickets by group were also shown.
179 tickets created with NGOP. 53% processed as expected. 9% corrections
made to NGOP or Remedy. 31% maintenance not set: tickets cut through
NGOP where the sysadmin did not set the system downtime in NGOP so a
ticket got cut when the downtime was detected.
- Budget meeting tomorrow 1 p.m. FCC1
- System usage shown for farms.
- Effort reporting due tonight.
- No report.
- Discussion of safety here and in private industry. Stressed the need
for safety walkthroughs.