Minutes of June 23, 2003 CD Operations Meeting
--------------------------------------------

ES&H (AP)
----
- 531 days without a lost time injury.
- For info on a new incentive program, see:
http://www-esh.fnal.gov/pls/default/lsc.html
- 0 Department Head walk-throughs; 0 Division Head walk-throughs.
- Gave the computer workstation review class. A number of summer students
attended.  There are still more who need to take it. The next one is June 24 at
9:15 am at the Training and Development Center.
- 10 people in the Division are past due on GERT.  Next week, I will distribute
a list of those past due.
- ITNA (and other aspects) on the ES&H web site now requires an individual
username/password per DOE policy.  To obtain an account, go to:
http://www-esh.fnal.gov/pls/default/create_account.html
- Tomorrow is an Ozone Alert Day.

CDF
---
- <No Report>

CMS
---
- <No Report>

D0 (JB)
--
- Farms: 14.1 M events collected, 13.8 M processed.
9940B robot 30 TB
mezosilo   219 TB
LTO         89 TB

Station      Projects     Data Analyzed    Events Analyzed
D0mino       25           14     TB               65 M
CAB          10            8.2   TB              425 M
Clued0        4            0.168 TB                1 M
D0ora machines are going to be down tomorrow and the next day to move to the new
machine.

EAG (SK)
---
Data Release 1 provided: 2 TB copied from FNAL; ~50 independent sites.
A second (mirror) site is being set up in Texas (1/2 TB copied, 2 1/2 TB to go).
About 1 - 2 emails per day come to the help desk.  Most are answered by pointing
people to web pages.

Getting ready for CD review of group scheduled for 7/28 and 7/29.

EXP (LBG)
---
Had a db discussion with Julie et al.  Generally good comments about the discussion
(LBG--please add any details I should have included.)
About to obtain possession of WH12NW.
Want to have a working control room in the September time frame.

CCF (JB)
---
Successful copying of files by Lattice QCD between NERSC and FNAL and
between NCSA (?) and FNAL. Able to get all files here.
Have obtained an agreement with Rob Kennedy regarding operations for the next
several months.  (JB--please clarify if needed.)

CCF (PD)
---
Not a maintenance week.  Core Router @ D0 reset at 8:34 pm Sunday (?).  No one
seems to have noticed, but we are looking into it nevertheless.

NIKHEF last week tried a large data transfer.  Saw 600 Mb/sec at ESnet.  Ditto at
our DMZ router, but the Border Router only saw 400 Mb/sec.  -- (VW) What did NIKHEF
see?  (PD) 400 Mb/sec.  Looking into this.  The router should have handled it.

Last Wed. the Lab A Router dropped off.  Went to investigate.  Found this amusing
or perhaps not so amusing situation (PD--shows picture of the router rack held up
by a chain to the ceiling, with the floor underneath absent).  (AP) Discussed with
the SSO in PPD.  (PD) How would we have been able to work on this?  Discussion of
the situation.  Oddity of not being notified.

CCF (RC)
---

Overhead network cable tray in Fiber Central completed.  Will make it easier to
avoid putting cables under the floor.  Conference Room work:  equipment moved out
of 12NW (to 8X); walls being put up.  PPD has a July 10th conference which will
need this.  Did 22 JULIE locates for FESS, including one in which the locate went
back to FESS.  Relates to an effort to measure land movement.  Met with a CISCO rep
re Starlight regarding equipment needed.

CCF (WB)
---

Dzero downtime will allow us to get some needed infrastructure work done.

CCF (RR)
---

No security matters to discuss.

(GB) What about e-mails from New Muon Lab?  (I gather e-mails are to be sent in
the case of power shutdowns due to high temperatures.  Equipment is to at least
stop processing.  Some discussion about whether or not personnel are to
physically visit New Muon Lab when one of the e-mails is received.--DR).  I
believe JB indicated he would get back to GB on this.

CSS (Submitted by SN; Comments in mtg added in [....] by DR)
---

CSS Dept. Operations meeting notes.

ELS:
1. Non-CD Chargebacks - The chargeback process is complete.  We collected a
    total of $245k spent by other divisions & universities on maintenance
contracts. [Note that this has been declining over the years...]


CSI:

1. Security incident on Saturday with the iis2 web server. In progress;
access has been restricted. [Apparently a Microsoft FrontPage server with
anonymous access enabled. Someone noted that it was amazing it took so long to
be 'discovered'.]

2. ngopcli had a very strange network problem.  This caused NGOP to be down for 6
    hours on Tuesday. [CSI is investigating.]


ESS:

1. The first batch (40) of modified Pre-Amps were installed at MI60 and
    the installation went smoothly and was successful.
2. The second batch (160) have been completely modified and are being
    tested by ESS using the Beams test stand.
3. The third batch (37) were received by ESS last week.
    Modifications are starting this week.


DSG:
1. The miscomp database halted Friday and Saturday morning due to the
archivelog area filling up. Found a loop in the Remedy escalation code process.
Working on a fix. [Discussion about details prompted by VW; (JM) Not down all
weekend--Diane and Julie dealt with the problem remotely until today, when the
problem could be further investigated (I think this was the sense of things--DR)]

2. D0ora1 (host for the D0 off-line production database) crashed due
to a bad CPU. System admins are planning to replace the CPU next week. [Not
actually going away but will be repaired to become the development machine.]


SCS:

1. The Farm bid is due back next week.
2. Security issue with a person sharing a crypto card.

TOC:

1. Target student starts Tuesday.

[AW: When is the next power outage (scheduled, that is)? GB: August time frame.]

(VW) We need to have a presentation on the Microsoft Licensing Directions.  Am
starting to hear discussion in the hallway that isn't quite accurate.  Need to
provide the right info for people.


CEPA (PM)
----

Mtg in DC about the follow-on to the SciDAC project - 250 people attending.  (PM--I am
not sure I caught anything more than the fact that it was happening.  Please add
if there should be more.--DR)

Tev BPM moving ahead.  Some pressure.  Requirements will hopefully be included in
the final report.  (RT) Status of Echotech boards?  (AW) All in.  60-70 tested last
week.  All should be done by now.

Planning
--------
(JM) Working on getting the connection between system status web page and NGOP.
Involves IIS connection certificates.

(RET) Presenting graphs of resolved tickets.  Much discussion.  Also presenting a
graph of unresolved tickets over the last 9 weeks.  More discussion.  (DR)
Trying to figure out if it answers the question of how many people are mad at
us--for taking a long time to respond to their help desk request. [Was not
obvious (to me) at the time but JM explained it to me afterwards--Still not sure
I could say what the answer was--DR]

Projects
--------
<No Report>

Operations
----------

(GB) Will schedule another Computing Room planning meeting.  Will present
revised info about plans for systems to be purchased in FY03.  Additional nodes
are possible.  The assumption previously made that old systems would be
decommissioned prior to new ones being added was not correct.  This necessitates
placing the FY03 systems from CDF and DZero into the location of the removed ADIC
in order to get the equipment onto the new power of 2 UPSs and 4 new panels (with
a transformer to be added), so as not to max out existing power arrangements.

As regards FY04 and beyond, the A&E firm has quoted for the next phase of the
study regarding FY04 requirements.  Discussion of the usage projection profile.
Rumors of CDF perhaps proposing a factor of 4 increase in data.  Implications of
that need to be worked through.  (VW: still just at a discussion stage.)

(RT) What about the Blade Server analysis? (GB) Keith to give a report.  (VW) McClier
may be a little optimistic on this option for future technology.  Seems to be a
solution as far as space constraints go, but not power constraints.  [See below for
a post-meeting e-mail related to this--DR]

(GB) Did people find their PCs powered off over the weekend?  Was it just an FCC
occurrence or did it include WH?  Discussion: (the implication seemed to be it was
just FCC--DR)

(DR) Working Group to Estimate Requirements to vacate WH7SW is proceeding.
Expect to be getting to a rough draft of a report soon.

(RT) Meetings related to designing something for the "lobby" in FCC1W CDO area
are proceeding.  Designer has some ideas.  Plasma Screens may be involved in a
way that allows dual use for something by Lepton-Photon and also at SC2003.

(VW) Just returned from SLAC discussions of SC2003.  Concept seems to be a
"build your own booth" in order to save costs. (LL) <Did not catch your remarks,
Lee--DR>

(DR) NCSA Trip - 7,800 sq ft of computing space for $8M.  => $1000/sq ft of
computing space.  => $500/sq ft of total space (=computing space + a/c space).
An interesting aspect: six-foot-high underfloor areas so one can walk upright
(well, not DP, I guess).
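
A minimal sketch of the arithmetic behind those figures (Python, for reference
only); the roughly 2:1 ratio of total space to computing space is inferred from
the $500/sq ft number rather than stated directly.

    # Rough check of the NCSA cost-per-square-foot figures quoted above.
    # The total-space area is an inference: $500/sq ft implies roughly twice
    # the computing floor area once a/c space is included.
    cost_dollars = 8_000_000
    computing_sqft = 7_800
    total_sqft = 2 * computing_sqft              # assumed: computing + a/c space

    print(cost_dollars / computing_sqft)         # ~1026 $/sq ft of computing space
    print(cost_dollars / total_sqft)             # ~513 $/sq ft of total space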

(LBG) When are Performance Goals due?  Discussion.  VW will send around
something.  Deadline of deadlines is August 11th.

Respectfully Submitted,
David Ritchie



====Post-Meeting=E-mail=Report=From=Keith=Via=JB=============================

> Keith,
>
> FY02 CDF, D0 and CMS systems are configured at 32 nodes
> per rack and were measured at 7 Kilowatts per rack.
> Also, the dual 3 GHz systems that were evaluated for
> FY03 purchases were measured at 288 W/node. 42 nodes would be 12 KW.
>
> Gerry
> ===================================
> ----- Original Message -----
> From: Jon A. Bakken
> To: Gerald J. Bellendir ; Vicky White ; R Tschirhart
> Sent: Monday, June 23, 2003 2:37 PM
> Subject: Re: "Blade Solution" (fwd)
>
> Here is Keith's update on the blade computing solution.
>
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Jon A. Bakken    bakken@fnal.gov   (630) 840-4790
>
> ---------- Forwarded message ----------
> Date: Mon, 23 Jun 2003 14:27:54 -0500
> From: Keith Chadwick <chadwick@fnal.gov>
> To: Jon Bakken <bakken@fnal.gov>
> Subject: Re: "Blade Solution"
>
> Ok, Here is the scoop:
>
> Currently, measured (by Jack and Gerry) rack power dissipations with the 1U servers
> range from 6 to 8 Kilowatts.  My calculations of the expected thermal loads are
> in close agreement with the measurements.
>
> Various vendors offer blade computing "solutions" based on the Intel Xeon
> and the Pentium III which have specified power dissipations in the range from
> ~15 to as high as 28 Kilowatts per 42U rack (my calculations are in fairly
> close agreement).
>
> So based on currently available products, on a per rack basis, the CPU per rack can
> increase by a factor of 2, but the power dissipation will also increase by a factor
> of between 2 and 3.  This will result in a higher and denser heat load, which in
> turn will require greater cooling and engineering to move the heat out of the
> racks.
>
> There is one possibility which may dramatically reduce the thermal load, and that
> is if we can locate products (or convince a vendor to make) based on the mobile
> Pentium 4 chips - these processors have power dissipations which are ~50% of the
> regular P4.
>
> The bottom line, using today's products, is that blade computing is the appropriate solution when:
>
>         1.  Space is at a premium (as is the case in FCC).
>         2.  There is abundant cooling (this is NOT the case in FCC).
>         3.  There is abundant electrical (this is NOT the case in FCC).
>
> If we can find or convince a vendor to make for us blades based on the mobile
> Pentium 4, then this would tip the balance in favor of blade-based computing.
> Even in this case, a significant amount of electrical and HVAC engineering will be
> required to provide adequate power and cooling.
>
> I will be writing and circulating a (draft) report on the results of this research
> later this week.  In the meantime, I have attached an Excel spreadsheet with the
> raw numbers...
>
> -Keith.
>
> At 6/23/2003 01:47 PM, you wrote:
>>Gerry B reported at the division meeting that you were
>>optimistic about your blade solution.  Both Phil and
>>I remember the opposite.  Can you clarify this?  Also
>>it was stated that you were writing a report.
>>
>>Thanks,
>>Jon
>>
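
For reference, a minimal Python sketch (not part of the e-mail above) checking
the per-rack power figures it quotes; the per-node number for the FY02 racks is
derived from the measured 7 kW, and everything else is as reported.

    # Rough check of the rack power figures quoted in the e-mail (illustrative
    # only; all input numbers are the ones reported above).
    fy02_nodes_per_rack = 32
    fy02_rack_kw = 7.0                            # measured, FY02 CDF/D0/CMS racks
    fy02_w_per_node = 1000 * fy02_rack_kw / fy02_nodes_per_rack   # ~219 W per node

    fy03_w_per_node = 288                         # dual 3 GHz FY03 evaluation systems
    fy03_rack_kw = 42 * fy03_w_per_node / 1000.0  # ~12.1 kW for a 42-node rack

    print(round(fy02_w_per_node), round(fy03_rack_kw, 1))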