
status.ncsa.illinois.edu


Watch this page in the wiki to subscribe to automatic updates to this status page.

Please do not refer to any NCSA Industry Partners on this page. Please use the iforge nomenclature for all of the *forge infrastructure.

To see older events, see Archive of NCSA Status Home

Report a problem 

Current Status  

Start | End | What System/Service is affected | What is happening? | What will be affected? | Contact Person | Status
2022-03-10 16:30hrs |  | ICC | IO hanging. GPFS servers could not talk to compute nodes. Some compute nodes were expelled. Nodes are currently returning to service. The filesystem and scheduler survived without significant interruption, except on a select number of nodes. | Some Slurm compute nodes | help@campuscluster.illinois.edu

IN PROGRESS


Upcoming Scheduled Maintenance

Listed below in chronological order.

Start | End | What System/Service is affected | What is happening? | What will be affected? | Contact Person | Status
2022-03-10 0700hrs | 2022-03-10 1200hrs | Distribution panel DP-5C-1020. Power feed C to the north east corner power panels | De-energizing electrical distribution panel DP-5C-1020 to tie in power cables to Holl-I system

Known resources impacted:

Granite: already planned to be offline for maintenance

iForge: cluster offline for the duration

Radiant: cluster online, without power redundancy

help@ncsa.illinois.edu

SCHEDULED

2022-03-16 1700 | 2022-03-16 1800 | DNS1 | Hardware replacement on DNS1 server. | DNS lookups will be down on the primary DNS server while the hardware is being swapped. DNS2 will remain up. | help+neteng@ncsa.illinois.edu

SCHEDULED

2022-03-14 1800 | 2022-03-15 0700 | NCSA File & Print Servers | Scheduled Windows Server Maintenance | File & Print Shares will be unavailable during maintenance. Users will not be able to access shares on Fileserver (e.g. home, busnoff, hr, etc.), and printing will be unavailable. | help+service@ncsa.illinois.edu

SCHEDULED

2022-04-20 0800 | 2022-04-20 2000 | ICC | ICC Quarterly Maintenance | ICC Cluster nodes only

help@campuscluster.illinois.edu

SCHEDULED

2022-07-20 0800 | 2022-07-20 2000 | ICC | ICC Quarterly Maintenance | All ICC services

help@campuscluster.illinois.edu

SCHEDULED

2022-10-19 0800 | 2022-10-19 2000 | ICC | ICC Quarterly Maintenance | ICC Cluster nodes only

help@campuscluster.illinois.edu

SCHEDULED


Previous Outages or Maintenance

Start | End | What System/Service was affected? | What happened? | What was affected? | Contact Person | Status
2022-03-10 0700hrs | 2022-03-10 1500hrs | Distribution panel DP-5C-1020. Power feed C to the north east corner power panels | De-energizing electrical distribution panel DP-5C-1020 to tie in power cables to Holl-I system

Known resources impacted:

Granite: already planned to be offline for maintenance

iForge: cluster offline for the duration

Radiant: cluster online, without power redundancy

help@ncsa.illinois.edu

COMPLETE


2022-03-09 0700 | 2022-03-09 0810 | linux.ncsa.illinois.edu (aka public-linux) | Upgrade server to RHEL 8 and add NCSA Duo 2FA authentication | Server was unavailable during maintenance. | help+service@ncsa.illinois.edu

COMPLETE

2022-03-02 0930 | 2022-03-07 1715 | ICC

Emergency PM

We are seeing some network issues on the cluster. To resolve these issues, we need to upgrade code on our InfiniBand infrastructure.


UPDATE: We are currently experiencing unforeseen technical issues with the cluster. We are investigating and expect resolution and restoration of all Campus Cluster services by March 3rd, 12PM.

UPDATE2: We are still experiencing issues where the compute clients will not properly mount storage. We are engaged with vendor support and continue to work on the situation. Thank you for your patience. We have moved expected return to service to March 4th, 12PM.

UPDATE3: Campus Cluster is experiencing SLURM job failures in certain pods (sections) of the cluster. Investigations continue, and there is a partial return to service with login nodes, storage, and data transfer services still operational. New full return to service date: Monday, March 7th, 12PM.

ICCP filesystem will be offline. Most projects will be impacted. Special arrangements have been made with some to be able to operate to some degree during the outage. | help@campuscluster.illinois.edu

COMPLETE

2022-03-02 1237 | 2022-03-02 1715 | iforge (iforge.ncsa.illinois.edu) | GPFS issue with interruption of filesystem leading to scheduler pause | 1 running job was aborted, and any new jobs were paused during the interruption | help@ncsa.illinois.edu

COMPLETE

2022-03-02 0600 | 2022-03-02 0630 | Jira | Adding RAM to improve performance | Jira will be unavailable during maintenance

COMPLETE

2022-03-01 1800 | 2022-03-01 1810 | ldap2 server, clients of NCSA LDAP | On-line maintenance: restart rsyslog and LDAP after relocating /var/logs | Clients should have redundant servers configured | Timothy Bouvet

COMPLETE

2022-02-28 1800 | 2022-02-28 1830 | ldap1 server, clients of NCSA LDAP | On-line maintenance. Had to restart rsyslog and LDAP after relocating /var/log | Slow response from ldap1, but clients should have redundant servers configured | Timothy Bouvet

COMPLETE

2022-02-28 0900 | 2022-02-28 1030 | CMDB | V1.7.20220228 Release | CMDB database will be unavailable. ITSM's openDCIM will be down for a short period (~5 minutes) while the data is reloaded.

kimber7@illinois.edu

COMPLETE

2022-02-26 0730 | 2022-02-26 0750 | NCSA GitLab | GitLab was updated to latest version | All GitLab services were unavailable | help+service@ncsa.illinois.edu

COMPLETE

2022-02-25 10:00 | 2022-02-25 13:00 | Taiga - CenterWide FS | Full file system outage | All clients mounting Taiga

COMPLETE

2022-02-09 1400 | 2022-02-25 1030 | Jira, Internal/Savannah, LDAP, POP, Hosted web servers, virtual classroom, vcenter

The NCSA VMware cluster is experiencing storage performance issues.

-- Update: Adjustments have been made to storage used by the LDAP servers, and other non-essential VM instances have been disabled. Testing indicates that response times have improved and services are working normally again.

We are monitoring services. Please report any issues to help@ncsa.illinois.edu | Timothy Bouvet

RESOLVED FOR NOW

2022-02-24 1000 | 2022-02-24 1115

cerberus2.ncsa.illinois.edu, tg-kdc1.security.ncsa.illinois.edu, bwbh2.ncsa.illinois.edu

One of the IRST ESXi machines unexpectedly shut down. | The listed hosts are currently unavailable

COMPLETE

2022-02-23 1700 | 2022-02-23 1900 | DNS2 | DNS2 hardware will be replaced. | There will be a brief outage of DNS2 while IPs are migrated to the new server. | help+neteng@ncsa.illinois.edu

COMPLETE

2022-02-22 0825 | 2022-02-22 1324 | Slack

Info from Slack (https://status.slack.com/)

We've resolved the issue, and all impacted customers should now be able to access Slack. You may need to reload Slack (Cmd/Ctrl + Shift + R) to see the fix on your end. If that doesn't work, try clearing cache (Help > Troubleshooting > Clear Cache and Restart from the app menu). Thanks for bearing with us and we apologize for the disruption to your work day!

Feb 22, 1:24 PM CST

We're seeing signs of improvement. Please try reloading Slack, and if not a cache reset. We’re still monitoring the situation. We’ll confirm once this issue is fully resolved.

Feb 22, 11:07 AM CST

Slack is not loading for some users. We are continuing to investigate the cause and will provide more information as soon as it's available.

Feb 22, 9:23 AM CST

We're still working towards a full resolution. We'll be back with another update soon. Thank you for your patience.

Feb 22, 8:44 AM CST

We’re investigating the issue where Slack is not loading for some users. We’re looking into the cause and will provide more information as soon as it's available.

Feb 22, 8:25 AM CST

Various issues accessing and using Slack | help@ncsa.illinois.edu

COMPLETE

2022-02-18 12:10PM | 2022-02-18 2PM | Jira | Reboot to add RAM/swap. This is to improve stability. | Jira tickets unavailable | Timothy Bouvet

COMPLETE

2022-02-10 1030 | 2022-02-18 3:55pm | Ngale filesystem

The Lustre filesystem is not loading correctly. The support team has been contacted.

Still in progress. MDT0001 is partially recovered. Vendor is attempting to fully restore.

Near completion: Working with vendor on additional configuration changes. Hope to complete final validation and return to service by close of business 2022-02-18.

/ngale filesystem is not accessible. | Peter Hartman

COMPLETE


2022-02-14 1PM | 2022-02-14 4:15PM | All NCSA LDAP servers | Expanding schema and restarting servers | Systems will reconnect to LDAP server after restart

COMPLETE

2022-02-09 1000 | 2022-02-09 1200 | Facility UPS | UPS DC voltage calibration | UPS will be taken to maintenance bypass and all connected systems will be fed from an unprotected power source (no power interruption). | rantissi@illinois.edu

COMPLETE

2022-02-09 0900 | 2022-02-09 0940 | Line card failure in Core-East | Line card failure in Core-East, which is resulting in connectivity issues for some infrastructure in NCSA 3003. | DNS2 and LSST systems in 3003 were down until the uplinks could be migrated to a new port on Cores | help+neteng@ncsa.illinois.edu

COMPLETE

2022-02-01 8AM | 2022-02-01 4PM | Jira/ldap-auth1 | login issues | Jira Access
2022-02-09 0534 | 2022-02-09 0811

LDAP (and dependent services, incl. Jira)

vSphere/ICI VMware

Authorization timeouts/failures in dependent services.

ICI staff are investigating.

LDAP (and dependent services, incl. Jira)

vSphere/ICI VMware

The cause of the most severe issues was power fluctuations around 0555, but certain LDAP servers showed degradation slightly earlier.


COMPLETE

2022-02-09 0600 | 2022-02-09 0645 | NCSA MySQL

MySQL database servers need to be synchronized to bring replicated database servers online.

NOTE: The MySQL database is back up, but users may experience issues due to an LDAP issue.

Wiki, JIRA, Savannah/Internal, Identity, and some web sites will stop working. More details are linked here.

help+service@ncsa.illinois.edu

COMPLETE

2022-02-08 7AM | 2022-02-08 3:15PM | iforge / vforge / license servers | Regular Maintenance | iforge, vforge, license servers

COMPLETE

2022-02-08 1000 | 2022-02-08 1245 | CMDB | V1.6.20220207 Release | CMDB database will be unavailable. ITSM's openDCIM will not be impacted. | kimber7@illinois.edu

COMPLETE

2022-02-04 0600 | 2022-02-04 0640 | NCSA GitLab | GitLab was updated to latest version | All GitLab services were unavailable | help+service@ncsa.illinois.edu

COMPLETE

2022-02-01 0800 | 2022-02-01 0900 | cilogon.org | Update to OA4MP v5.2.4 | Improvements in the back-end service | help@cilogon.org

COMPLETE

2022-01-25 | 2022-01-25 | Facility UPS | Replace UPS batteries | All systems with facility UPS feed | rantissi@illinois.edu

COMPLETE

2022-01-24 1800 | 2022-01-24 2000 | NCSA File & Print Servers | Scheduled Windows Server Maintenance | File & Print Shares will be unavailable during maintenance. Users will not be able to access shares on Fileserver (e.g. home, busnoff, hr, etc.), and printing will be unavailable. | help+service@ncsa.illinois.edu

COMPLETE

2022-01-24 0400 | 2022-01-24 0630 | Failed line card on neo-hpc-1 switch

Line card failure is affecting devices that are plugged into the neo-hpc-1 aggregation switch. We've migrated links off the failed card to other ports on the same switch.

No services are currently impacted.

help+neteng@ncsa.illinois.edu

IN PROGRESS

2022-01-19 0800 | 2022-01-19 2000 | ICC | ICC Quarterly Maintenance | All ICC services

help@campuscluster.illinois.edu

COMPLETE

2022-01-18 0800 | 2022-01-18 0830 | cilogon.org | Upgrade MyProxy CA servers to CentOS 7 | Upgrade back-end MyProxy CA VMs from CentOS 6 to CentOS 7. No downtime is expected. | help@cilogon.org

COMPLETE

2022-01-14 0600 | 2022-01-14 1715 | Business IT database had bad data. | A database that NCSA mirrors from campus changed without notice, breaking our MIS system. Business IT isolated the issue and corrected the data. | Multiple complex systems have been affected by this data corruption issue. | help+service@ncsa.illinois.edu

RESOLVED

2022-01-14 0800 | 2022-01-14 1720 | NCSAnet wireless | NCSAnet Wireless was unavailable due to bad data in LDAP | Users couldn't connect to the NCSAnet wireless network | help+neteng@ncsa.illinois.edu

RESOLVED

2022-01-05 1100 | 2022-01-05 1145 | CMDB | Version V1.5.20211223 release | CMDB database will be unavailable for a few moments; openDCIM will be unavailable for a few moments. | kimber7@illinois.edu

COMPLETE

2021-12-20 1830 | 2021-12-20 2030 | Jira | Version upgrade to address security issue | Jira will be unavailable | help+service@ncsa.illinois.edu

COMPLETE

2021-12-17 1300 | 2021-12-17 1340 | CMDB | Version V1.4.20211217 release | CMDB database will be unavailable for a few moments; openDCIM will not be affected.

kimber7@illinois.edu

COMPLETE

2021-12-17 0600 | 2021-12-17 0622 | NCSA GitLab | The server was updated with some new Puppet configurations. | GitLab services were unavailable for a few minutes as the SSL certificate for the service was updated. | help+service@ncsa.illinois.edu

COMPLETE

2021-12-16 1400 | 2021-12-16 1430 | HTTP web proxy: httpproxy.ncsa.illinois.edu | NCSA's general purpose HTTP web proxy server was rebuilt. | HTTP web proxying through httpproxy was unavailable. | help+service@ncsa.illinois.edu

COMPLETE

2021-12-10 0700 | 2021-12-10 1345 | iForge | InfiniBand switch maintenance | All systems unavailable | iforge-admin@lists.ncsa.illinois.edu

COMPLETE

2021-12-10 0900 | 2021-12-10 1000 | Bastion Hosts (Production group B) | Patching out of cycle | Bastion Hosts (Production group B) were individually unavailable during reboot | help+security@ncsa.illinois.edu

COMPLETE

2021-12-09 0900 | 2021-12-09 0931 | Bastion Hosts (Production group A) | Patching out of cycle | Bastion Hosts (Production group A) were individually unavailable during reboot

COMPLETE

2021-12-09 0800 | 2021-12-09 0900 | All IDDS services | IDDS Postgres and Ruby on Rails upgrades | All IDDS services | tolbert@illinois.edu

COMPLETE

2021-12-09 0600 | 2021-12-09 0613 | NCSA GitLab | GitLab was updated to latest version | All GitLab services were unavailable for about 5 minutes | help+service@ncsa.illinois.edu

COMPLETE

2021-12-07 1400 | 2021-12-07 1443 | LSST | Kubernetes on NTS is not working properly after updates | Kubernetes on NCSA Test Stand | lsst-admin@ncsa.illinois.edu

RESOLVED

2021-12-07 0800 | 2021-12-07 1400 | LSST | LSST Quarterly Maintenance | All LSST services hosted at NCSA | lsst-admin@ncsa.illinois.edu

COMPLETE

2021-12-07 0930 | 2021-12-07 1030 | ACHE Firewalls | Software maintenance | Firewalls will be upgraded using failover procedures - no traffic impact expected | James Eyrich - eyrich on Slack

COMPLETE

Legend:

IN PROGRESS

RESOLVED

SCHEDULED

MONITORING

