
Start | What System/Service is affected | What is happening? | What will be affected? | Actions

2017-10-21 17:15

LSST

Two public/protected network switches are down in racks N76 and O76 at NPCF

All verify-worker[25-48] and qserv-db[11-20] nodes cannot reach DNS, LDAP, etc., so they largely cannot communicate with other nodes. For example, there is no communication between the affected verify-worker nodes and the Slurm scheduler on lsst-dev01, and none between the affected qserv-db nodes and the rest of the qserv nodes or the sui nodes.

In progress. Working to connect qserv-db[11-20] to other nearby switches as a workaround; a replacement switch is already on order.

UPDATE 2017-10-23 13:28 Borrowing two switches from L1 to put in place of failed switches at NPCF. This will require all qserv, sui and verify-worker nodes to go offline for a period of time while the switches are swapped out. No ETA at this time.

Include the keyword "issue" in updates above to trigger actions.
