Comment: iforge gpfs issue and scheduler pause

...

Start | End | What System/Service is affected | What is happening? | What will be affected? | Contact Person | Status
2022-03-02
930
2022-03-07
1200
ICC

Emergency PM

We are seeing some network issues on the cluster. To resolve these issues, we need to upgrade code on our InfiniBand infrastructure.


UPDATE: We are currently experiencing unforeseen technical issues with the cluster. We are investigating and expect resolution and restoration of all Campus Cluster services by March 3rd, 12PM.

UPDATE2: We are still experiencing issues where the compute clients will not properly mount storage. We are engaged with vendor support and continue to work on the situation. Thank you for your patience. We have moved the expected return to service to March 4th, 12PM.

UPDATE3: Campus Cluster is experiencing SLURM job failures in certain pods (sections) of the cluster. Investigations continue, and there is a partial return to service, with login nodes, storage, and data transfer services still operational. New full return-to-service date: Monday, March 7th, 12PM.

ICCP filesystem will be offline. Most projects will be impacted. Special arrangements have been made with some to be able to operate to some degree during the outage.

help@campuscluster.illinois.edu

IN PROGRESS

2022-03-02

1237

TBD

iForge

Emergency Work on iForge

Pause of scheduler at this time

We have identified some issues with GPFS and communications to a subset of the nodes. The scheduler is paused and we are working on the issue.


GPFS is slow to respond and the scheduler is paused, so no new jobs will start. Compute nodes are being rebooted; the login node may also need a reboot.

help@ncsa.illinois.edu

IN PROGRESS


Upcoming Scheduled Maintenance

...