Minutes of November 10, 2003 CD Operations
- 671 days without a lost time injury.
- Service coordinator class held last week. All departments
other than EAG present.
- Reminder about no cardboard in computing rooms.
-- dCache operated at average of 22 TB/day, peak of 28 TB/day (Tuesday,
-- Production farms processed 50 M events last week.
Farms operations continue to be an issue. The problem is understood to
be nodes that hang, thereby freezing out the associated disks.
Concatenation jobs then fail because they cannot access
required data files on the frozen disks. With high data throughput,
this process eventually brings the farm to a halt due to lack of disk
The highest priority is to understand the cause of hanging nodes. Also
investigating whether an observed increase in the hanging node rate
during the summer (pre-dating the current executable) is related to the
In light of these problems and the impending arrival of new farm nodes
that will double the farm capacity, we are seeking help to review the farm
architecture to ensure that the capacity will scale.
- Working on the upgrade. Have 80 worker nodes, 17 servers,
and FC switch. Worker nodes have arrived but not racks.
Concerned about networking: local cabling not completed yet.
CCF indicates they are working on it, should be going in now,
and will be completed within the week.
- Farms: 0 events collected, 23.1 M processed
146TB in 9940b, 205TB in mezsilo, 104TB in LTO
Analysis station   Projects   Data     Analyzed Events   Analyzed Data
D0mino             1385       26TB     170M              19.8TB
CAB                262        12.5TB   325M              6.5TB
Clued0             269        1.1TB    9M                0.4TB
D0Karlsruhe        102        3.6TB    82M               1.1TB
- Brian Yanny reporting due to SNAP collab meeting this week.
- Moving toward data release 2 for the collaboration this month and the
public 3 months later. 5-6 TB of data and astro catalogues. Catalog
loading stage. Will be investigating ideas for multi-TB server systems.
Have had some problems with losing disk.
- Running MINOS data on the farms, and optimizing the system for best
performance. 9500 files to go; 300 ran over the weekend.
- Certified cabling out at CDF. Tested for gigabit support.
- Installed 10 GigE switch for CDF, worked well.
- Two CDF reconstruction farm racks arrived and were connected.
- D0 installed Cisco switch.
- CMS installing expansion cabling in FCC1, and fiber uplinks.
- WH: office remodelling on 15th and 7th floor.
- Conference rooms: D0 DAB3 conf room requirements set, surveyed,
and ordered materials.
- Starlight: review of NWU agreement. Met with SBC Nortel reps.
- Maintenance: Thursday 11/20 6-7 a.m. FCC2 power panel work and FCC2
general switch upgrade.
- Network node registration project: prototype web page was shown.
Plan to release in phases after Nov. 18.
URL for DHCP registration template (under development):
- DHCP registration statistics being accumulated. 1200-1400 addresses
requesting DHCP per day, half unregistered.
URL for unregistered DHCP system listings:
- Focussing on reliability enhancements for KDC.
- Presentation on Analysis of Enstore alarms.
- Categories: Stuck (transfer will not complete)
Netscan : checks IP listeners
Rate is 0: rate checks over network.
Too long in state: Active or Seek
- May was a bad month for alarms: 5K total, while 2K is typical.
Lots of false alarms that month.
CSS Operations 2003-11-10
"Old" FNPRT was shut down for the last time last Thursday. This was the
last production VAX/VMS system in CSS and probably in all of CD. (See picture.)
From March 13, 1994 to September 15, 2003 (the lpd server was turned off
that week) the old FNPRT handled a total of 5,134,281 print jobs.
Sent out emails to Div/Sec/Exp on Microsoft support costs under new
Developed a draft statement on Linux support (based on the briefing and
discussions at HEPIX), see:
- Ken F working with Rick Thies setting up NGOP to monitor CSI
supported Windows systems.
HSG group hasn't existed since reorg and last two remaining members
(Rick vanC., Bruce K.) are being trained as sysadmins (farms & desktops,
respectively). This means we are seriously ramping down hardware installations
& plans need to include provision for installations (suggest vendor or D1).
Will drain the current queue of commitments.
Rick H. has also begun training w/ CSG 2 days/week on desktops.
Issue w/ authorization of grid certificates. Rick Thies is the only person
at Fermi authorized and trained to do this and will be gone Nov 25 & 26.
Can these be put on hold for two business days? Note that "FNAL services"
certificates are due to expire by Dec 4.
Two automated tickets reported problems over the weekend that were
responded to (fcdflnx2, fnprnt).
There were 178 new Remedy tickets created last week. 46 of those tickets
are still open. Overall there are 205 open Remedy tickets. 48 MISJOB
tickets were written last week, 43 for hardware and 5 for installation.
Automation created 19 Remedy tickets. For 5 tickets the automation worked
as scheduled. 9 of the tickets were from scheduled maintenance where
the sysadmins did not set the node down in NGOP for the outage.
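As a quick cross-check of the ticket counts above, the sketch below recomputes the derived figures. It assumes that a new ticket not listed as still open has been closed; the function and field names are illustrative only and are not part of Remedy.

```python
def ticket_summary(new, still_open, auto_total, auto_ok, auto_maintenance):
    """Derive summary figures from the weekly ticket counts.

    Assumption: every new ticket that is not still open has been closed.
    """
    closed = new - still_open
    return {
        "closed": closed,
        "closed_pct": round(100.0 * closed / new, 1),
        # Automated tickets not covered by either stated category.
        "auto_other": auto_total - auto_ok - auto_maintenance,
    }


# Counts taken from the report above.
summary = ticket_summary(new=178, still_open=46,
                         auto_total=19, auto_ok=5, auto_maintenance=9)
print(summary)
```

With the counts reported above, 132 of the 178 new tickets are no longer open, and 5 of the 19 automated tickets fall into neither stated category.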
11/06 Escalations stopped running (again - last occurred 09/19). It
appears to have happened at the same time that the scheduled network
outage occurred at WH, ~06:00. The net result was no pages went out on
3-4 tickets that NGOP cut. It turned out that these tickets were
non-issues. I have opened a new ticket with Remedy Support. 11/07 Remedy
support recommends that we upgrade to patch 1247.
-Continued considerable effort repairing Oxford University VFB
(VARC Front-End) Boards for Far Detector, Soudan Mine, MN. Last week,
our first batch of repaired VFB boards was shipped to Soudan.
Run II Electronics Support:
-CDF: Repaired &/or Tested the following items (Nov. 3-7).
(14) MCH: DT96TDC - TDC 9U boards
-Dzero: Repaired &/or Tested the following items (Nov. 3-7).
(30) BENCHMRK: VMEHV2SP HV Pod
(2) BiRa: VMEHVS2P HV Pod
2 Oracle security alerts need attention; working on scheduling.
1 alert is for databases, 1 is for the Oracle application server.
cdfonprd halted because the archive log reached 100%. A massive data cleanup
started running without alerting the DBAs, filling the archive logs and also
affecting replication. Halted the job and will reschedule it with the DBAs.
d0 offline production database was upgraded to 18.104.22.168.
jtrumbo & r.jetton assisted r.st.denis in getting fcdfora4 and fcdfsun1
into a ganglia environment for monitoring. (Show picture.)
* A large number of CDF farm nodes have been failing. The cause seems
to be 99 percent software. We are working with CDF to help fix the issues.
The problem is starting to generate calls outside of agreed-to coverage
periods and on weekends. We are going to put in a script
that will automatically detect a hung node and reboot it. We
also plan to meet with CDF to work out a plan to get this fixed
if we can.
* Started the burn-in process for the 64 new CDF nodes. So far so good.
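The hung-node script mentioned above could take roughly the following shape. This is a minimal sketch, not the script the group deployed: the heartbeat table `last_seen`, the 600-second timeout, and the `reboot` callback are all hypothetical, and the minutes do not say how a hang is actually detected.

```python
import time
from typing import Callable, Dict, List, Optional


def find_hung_nodes(last_seen: Dict[str, float], now: float,
                    timeout_s: float = 600.0) -> List[str]:
    """Return the nodes whose last heartbeat is older than timeout_s seconds."""
    return sorted(node for node, seen in last_seen.items()
                  if now - seen > timeout_s)


def reboot_hung_nodes(last_seen: Dict[str, float],
                      reboot: Callable[[str], None],
                      now: Optional[float] = None,
                      timeout_s: float = 600.0) -> List[str]:
    """Call reboot(node) for every hung node and return the list of them.

    `reboot` is a placeholder for whatever mechanism the site uses
    (for example a remote power controller); it is an assumption here.
    """
    if now is None:
        now = time.time()
    hung = find_hung_nodes(last_seen, now, timeout_s)
    for node in hung:
        reboot(node)
    return hung


# Example with made-up node names and heartbeat times (seconds).
rebooted: List[str] = []
heartbeats = {"node01": 1000.0, "node02": 400.0, "node03": 100.0}
hung = reboot_hung_nodes(heartbeats, rebooted.append, now=1000.0)
print(hung)
```

Keeping detection separate from the reboot action lets the detection rule be checked against recorded heartbeats before any node is actually power-cycled.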
- DB Group
- Presentation was given by Lee Lueking on DB Statistics for D0/CDF.
- CDF User and farms connection time distributions are very different:
user times are much longer. SVX Beam position is a long query.
Run list is the most popular query. SiCHipPed has the longest average duration.
- D0 middle tier server architecture gives a different pattern of usage.
Latency to respond is short. Silicon requests are most popular, one
hundred times more frequent than others.
- SAM info. Farm server is very active.
- SNAP R&D: setup for beam test at Indiana.
- FPGA radiation test setup. Very preliminary test results shown.
After ~50 Krads the chip starts to draw a lot of current, doubling
at 64 Krads.
Planning and Customer Support
- 244 out of 251 performance appraisals completed.
- Projects meeting this Wednesday. MINOS and CDF reporting.
- All projects cycled through every 3 months.
- Lab-Wide party arrangements. Need milestones from departments.
Need pictures that will go up on posters in the atrium.