Start | End | What System/Service was affected? | What happened? | What was affected? | Contact Person | Status
2019-07-18 08:00 | 2019-07-18 10:15 | LSST

Monthly Maintenance:

  • OS updates and reboots
  • Dell firmware updates
  • firmware update on bastion01 (06:00-08:00)
  • pfSense firewall maintenance postponed

ALL LSST systems will be updated, including:

  • lsst-dev01, lsst-xfer, etc.
  • Slurm verification cluster
  • PDAC/Kubernetes/LSP clusters
  • tus-ats01
  • L1 test stand
lsst-admin@ncsa.illinois.edu

COMPLETE with the following exceptions:

  • lsp-int is still inaccessible
  • two L1 test nodes are still down (lsst-l1-cL-frwd16 and lsst-l1-cl-ocs)
  • production Oracle services (lsst-oradb.ncsa.illinois.edu) are still down
2019-07-11 05:00 | 2019-07-11 07:00 | NCSAnet Wireless | Tech Services performed a software upgrade on the wireless access points at the NCSA and NPCF buildings. | Brief interruption in wireless network connectivity at NPCF and NCSA. Wired networking remained available during the maintenance. | help+neteng@ncsa.illinois.edu | Maintenance was completed by Tech Services. Wireless connectivity has been restored. If users notice any wireless issues, please contact neteng.
2019-07-08 05:17:41

2019-07-08 06:38:00

iforge login node | Lost GPFS a couple of minutes after Qualys scans started on the iforge cluster | iforge login node | iforge-admin@ncsa.illinois.edu

Remount failed; the system needed a reboot.

Resolved.

2019-07-05 1PM | 2019-07-05 6PM | Blue Waters Compute | NPCF power hit, BW compute requires a full reboot. | Blue Waters compute | Resolved
2019-06-30 ~16:15 | 2019-06-30 ~16:15 | iForge | Power sags on all feeds at the NPCF datacenter caused reboots of several Skylake compute nodes. | iforge[129-136] rebooted | iforge-admin@ncsa.illinois.edu | Resolved
Jun 28th, 2019, 9:00 AM | Jun 28th, 2019, 12:15 PM | Globus NCSA_Nearline Endpoint (HPSS) | Emergency maintenance to restore the system | Nearline storage systems | James Glasgow | Resolved
2019-06-27 11:04 | 2019-06-27 11:25 | iForge | GPFS restarted on main head node (iforge.ncsa.illinois.edu)

On iforge.ncsa.illinois.edu only:

  • Access to GPFS shared file system was temporarily unavailable (1-2 minutes only), which could have affected user interactive processes.
  • TORQUE commands (qstat, pbsnodes, etc.) were unavailable until the service was restarted at 11:25.
iforge-admin@ncsa.illinois.edu | Resolved
2019-06-23 06:00 | 2019-06-23 20:00

Facility power and cooling; all production areas will be impacted


Commissioning new protective relays: 30 minute rolling power outage

Sonexion re-power: 12 hours work duration 

Expect down time of 2 hours minimum for all production areas

Blue Waters downtime between 10-12 hours

rantissi@illinois.edu | Completed
2019-06-23 0700 | 2019-06-23 16:55 | iForge/aForge | Power maintenance in NPCF | All systems were unavailable during the maintenance. | iforge-admin@ncsa.illinois.edu
  • iForge was returned to service by 16:55
  • aForge will be returned to service no sooner than June 24
2019-06-23 0700 | 2019-06-23 1400 | DNS updates / Changes | DNS will be split into views (Internal / External). External requests will no longer be able to perform lookups of *.internal.ncsa.edu or reverse lookups on the 10.0.0.0/8 and 172.24.0.0/13 IP space. This is a security improvement to prevent lookup of internal DNS records from clients external to NCSA. | neteng+help@ncsa.illinois.edu | Done
2019-06-23 0700 | 2019-06-23 1400 | LSST | Power maintenance in NPCF
  • lsst-dev01, lsst-xfer, lsst-dbb-gw
  • Slurm verification cluster
  • PDAC/Kubernetes/LSP clusters
lsst-admin@ncsa.illinois.edu | Returned to service
2019-06-18 13:00 | 2019-06-18 13:20 | npcf-core-east | Primary routing engine crashed on one of the core routers in NPCF | No visible impact to users since redundant hardware took over. If users noticed any issues, please contact neteng. | help+neteng@ncsa.illinois.edu | Returned to service
2019-06-18 06:00 | 2019-06-18 07:30 | NCSA ITS vSphere vCenter | Upgraded ITS vSphere vCenter server to latest version

All VMs remained online during the maintenance, but management through vCenter was unavailable.

Note: Users of the HTML5 interface may need to clear cached browser data after the upgrade. Clear cookies and cached site data from your web browser.

help+its@ncsa.illinois.edu

Complete

2019-06-12 12:45 | 2019-06-12 13:00

NCSA Fileserver

NCSA-Print

NCSA AD-B

Windows servers experienced a brief outage; the cause is currently unknown

Fileserver shares were unavailable

Shared printers from NCSA-Print were unavailable

help+its@ncsa.illinois.edu


Returned to Service
2019-06-11 1200 | 2019-06-11 1215 | NAPS | Minor updates being applied to NAPS | NAPS will be unavailable for a very short period from noon to 12:15pm | help+idds@ncsa.illinois.edu

Complete

2019-06-05 13:20 | 2019-06-05 18:05 | NCSA Fileserver | The storage used by NCSA Fileserver lost networking and is offline.

SMB file sharing is down, e.g.:

  • H drive in Windows
  • shared folders
help+its@ncsa.illinois.edu | Returned to Service
2019-06-04 0900 | 2019-06-04 1700 | ISDA VMs including NCSA Open Source | Upgrade of the storage server hosting the VM data, as well as upgrades to VM servers | All ISDA VMs will be down during this time, including NCSA Open Source | All machines upgraded, including firmware. Systems back online
2019-06-03 0830 | 2019-06-03 14:45 | Bluewaters Nearline | One tape library is having robotic issues. It will be unavailable while the vendor fixes the issue. This will mean some files will be unavailable for retrieval while the issue is addressed. Transfers into the system should proceed as usual. | Bluewaters Nearline | Brian Dickinson

Returned to Service

5/31/19 9:30AM | 5/31/19 11:30 AM | Blue Waters Nearline | Emergency reboot to clear issues with two libraries | Nearline | Brian Dickinson | Returned to Service
2019-05-28 0940 | 2019-05-28 1530 | BW/HPSS | One tape library is having robot issues. It will be unavailable while the vendor fixes the issue. This will mean some files are unavailable for retrieval. Transfers into the system should proceed as usual. | HPSS ncsa#Nearline | sstevens@illinois.edu | Returned to Service


2019-05-22 0800 | 2019-05-22 0800 | External Samba/Windows File Sharing (TCP port 445) will be shut off. | Direct external connections to TCP port 445 (Windows File Sharing / Samba) will be turned off on the NCSA Border. Internal clients will no longer be allowed to connect to external SMB/Windows File Sharing. | neteng+help@ncsa.illinois.edu | Completed.
2019-05-21 11:00am | 2019-05-22 7:00pm | LSST, K8s clusters | Planned k8s cluster migration. Most primary services did return around the scheduled time on the 21st. Development services took an extra day to stabilize. | All LSST k8s at NPCF | lsst-admin@ncsa.illinois.edu

Complete

Note: Nearly all services have been stabilized. The few remaining issues affect a very limited number of developers.

2019-05-22 1403 | 2019-05-22 1450 | iForge | GPFS was accidentally shut down on iforge (the main head node). | On iforge only: GPFS filesystem access, TORQUE (e.g., qsub), system crons | iforge-admin@ncsa.illinois.edu | Node was rebooted and returned to service.
2019-05-22 1200 | 2019-05-22 1300 | IDDS sybdev databases

PostgreSQL will be upgraded to version 9.6

The outage will likely last about 10 minutes, but reserving the hour in case of issues.

All uses of IDDS production databases, except Tableau dashboards

  • job usage
  • accounting
  • acctd, jobd, accounting.sh
  • IDDS API
  • NAPS
  • identity services
help+idds@ncsa.illinois.edu | PostgreSQL upgraded to 9.6
2019-05-21 11:00 | 2019-05-21 11:30 | hub.ncsa.illinois.edu | Upgrade of software | Short outage during upgrades | Upgraded to latest version
2019-05-15 1300 | 2019-05-16 17:07 | Blue Waters | Projects has an unavailable OSS and fsck in progress; scheduler paused | Small portion of Projects FS | Timothy Bouvet | Resolved/repaired
2 user files affected and restored
2019-05-16 0800 | 2019-05-16 1200

LSST

Monthly Maintenance:

  • Reference new LDAP & Kerberos servers

No interruption of service or downtime.

All LSST systems will be updated, including:

  • lsst-dev01, lsst-xfer, etc.
  • Slurm verification cluster
  • PDAC/Kubernetes/LSP clusters
  • tus-ats01
lsst-admin@ncsa.illinois.edu

Complete

2019-05-15 0930 | 2019-05-15 11:30

ad.ncsa.edu Domain Controllers (ad-a, ad-b)

Print Servers (ncsa-print, ncsa-printz)

Applying patches due to Microsoft Remote Desktop vulnerability CVE-2019-0708

Login to Windows Machines on the ad.ncsa.edu domain should not be affected, but may be interrupted for a short time.

Printing may be interrupted for a short time.

help+its@ncsa.illinois.edu | Windows Domain Controllers and Print Servers were patched. All services should be functioning normally.
2019-05-14 08:00 | 2019-05-14 09:00 | CILogon | Update CILogon OAuth2/OIDC service to support public clients and redirect URI schemes other than https://. | https://cilogon.org/oauth2/register updated with new functionality. | help@cilogon.org | Update was completed successfully
2019-05-10 0930 | 2019-05-10 1010

NCSA Identity
LSST Identity

Upgrade PHP to v7 | Minimal, momentary downtime is expected. | help+its@ncsa.illinois.edu | Upgrade was completed successfully.
2019-05-08 1000 | 2019-05-08 1100 | IRST SYSLOG collectors

syslog-sec.ncsa.illinois.edu will be changed from a DNS A record to a CNAME pointing to syslog.security.ncsa.illinois.edu

Systems that send logs to IRST | help+security@ncsa.illinois.edu | Completed successfully.

2019-05-07 11:00 | 2019-05-08 10:45 | RSA Authentication Manager and the RSA Self-service Console | The services were upgraded to the latest release to solve a serious security concern | The self-service portal was down overnight and users were unable to change their PIN during that outage | otp@ncsa.illinois.edu | Now running Authentication Manager 8.4.0.3.0
2019-05-07 07:00 | 2019-05-07 20:00 | iForge/aForge | Quarterly Maintenance | All systems were unavailable during the maintenance. | iforge-admin@ncsa.illinois.edu | Maintenance successfully completed
2019-05-02 14:00 | 2019-05-02 14:30 | LSST | Brief network outage for Oracle (oradb) | Oracle services hosted on oradb[01-03] will be interrupted briefly while the servers are moved to a different network switch | lsst-admin@ncsa.illinois.edu

2019-05-02 14:00 | 2019-05-02 14:20 | crashplan | crashplan was upgraded to the latest 6.8.8 release to resolve security issues | Backups were suspended while the servers restarted the service | crashplan@ncsa.illinois.edu | Now running Code42 6.8.8
2019-04-30 0610 | 2019-04-30 0711 | NCSA Wiki & Jira | The Wiki & Jira servers did not start up as expected after a kernel update. | The NCSA Wiki & Jira were unavailable
Wiki & Jira now available
2019-04-30 0900 | 2019-04-30 0930 | CILogon (https://cilogon.org) | New Logout endpoint (https://cilogon.org/logout) and updated ORCID credentials | ORCID users may be asked to re-consent to release of ORCID iD to CILogon. | help@cilogon.org | Service update completed.
2019-04-23 1200 | 2019-04-23 1700 | CILogon 2.0 AWS (COmanage and LDAP) | On April 23, 2019, the Amazon Web Services (AWS) infrastructure supporting the CILogon COmanage Registry and LDAP services will be modified to increase the high availability (HA) posture. A new network load balancer (NLB) will be introduced and DNS entries modified to point to the new NLB interfaces. The existing NLB interfaces will continue to function for 72 hours after the transition to support any clients that have cached the older (current) DNS mappings. | CILogon COmanage and LDAP | help@cilogon.org | Work completed.
2019-04-18 0800 | 2019-04-18 1200 | LSST

Monthly Maintenance:

  • 10G Network switch maintenance
  • GPFS server updates
  • OS updates and reboots
  • Dell firmware updates
  • Kubernetes update
  • Pending Puppet changes

ALL LSST systems, including:

  • lsst-dev01, lsst-xfer, etc.
  • Slurm verification cluster
  • PDAC/Kubernetes/LSP clusters
  • tus-ats01
lsst-admin@ncsa.illinois.edu | Maintenance completed.
2019-04-18 0900 | 2019-04-18 0930 | NCSA Open Source | Upgrade Confluence to apply security patch | NCSA Open Source Confluence; all other services are unaffected | opensource@ncsa.illinois.edu | Confluence upgraded to 6.15.2
2019-04-17 0800 | 2019-04-17 1000 | ADS | ICCP Carne Maintenance | All services will be down.
Maintenance completed.
2019-04-17 07:30 | 2019-04-17 19:45 | ICCP

Quarterly Maintenance

  • Ur1carne router code upgrade
  • CentOS 7.6 upgrade
  • Deployment of HDR
All services unavailable | iccp-admins@campuscluster.illinois.edu | Maintenance completed.
2019-04-12 1410 | 2019-04-12 1534 | Blue Waters/Scheduler | HSN issue | Scheduler paused; new login sessions hang | tbouvet@illinois.edu | HSN recovered, scheduling resumed
2019-04-11 0555 | 2019-04-11 0702 | LDAP | LDAP process crashed | Authentication to LDAP-backed services | help+its@ncsa.illinois.edu | LDAP was restarted
2019-04-10 0800 | 2019-04-1530 | wiki | wiki was taken off-line for a security-related upgrade | wiki was unavailable | help+its@ncsa.illinois.edu | Now running the latest version of Confluence

2019-04-09 0900 | 2019-04-09 0930 | CILogon (https://cilogon.org), myproxy.xsede.org, tfca.ncsa.illinois.edu | Deploy new Luna SA HSM (hsm5) to production and take one old HSM (hsm4) offline (to serve as emergency backup). | No downtime is expected. Use instructions at SafeNet LunsaSA HSM Monthly Testing to change pool of available HSMs on {warm,cool,tepid}.ncsa.illinois.edu. | help@cilogon.org | {warm,cool}.ncsa.illinois.edu now use hsm3+hsm5. tepid.ncsa.illinois.edu uses hsm5+hsm3. hsm4 will eventually be powered off and reserved as a backup.

2019-04-07 0645 | 2019-04-07 1650 | Campus Cluster and ADS | We were experiencing network connectivity issues to both the WAN and some systems internal to ICCP, but all the traffic that was suspicious was going through the cc-core. Rebooting cc-core0 seems to have resolved the issue. | Intermittent connectivity issues causing login and job submission to fail. | iccp-admins@campuscluster.illinois.edu | Rebooting cc-core0 seems to have resolved the issue.
04/07/2019 9:30AM | 04/07/2019 2:30PM | Blue Waters | Scheduler paused, OSS hardware was replaced on scratch. Filesystem check in progress.

New jobs not starting.
Current jobs may stall if they access the bad OSS.

tbouvet@illinois.edu | OSS hardware replaced and scheduler resumed

2019-04-04 0800 | 2019-04-04 0830 | NCSA LSST Resources | Switches servicing LSST hardware in NCSA-3003 were migrated to a new aggregation router. | A brief network blip (~60s) occurred. All hosts have been verified after the move. | neteng@ncsa.illinois.edu | Maintenance has been completed.
2019-04-02 0900 | 2019-04-02 1500 | NCSA Open Source | Upgrade software and server | Server and/or services can be down during this time | opensource@ncsa.illinois.edu | Upgrade completed
2019-04-02 0900 | 2019-04-02 1100 | CILogon (https://cilogon.org) | Upgrade PHP from v5.6 to v7.3 | No downtime is expected. | help@cilogon.org | Upgrade completed.
2019-April-01 | NCSA Duo | Backup code reminder emails were sent to all NCSA Duo participants in error. Your previously created backup codes are still valid. We are investigating why this email was sent. | NCSA Duo | help+security@ncsa.illinois.edu | API changes required re-coding the backup code process.
20190318 - 1400 | 20190318 - 1500 | BW Nearline Endpoint | Scheduled HPSS software patch roll-up | Access to BW Nearline endpoint is suspended | help+bw@ncsa.illinois.edu | Patch installation complete
2019-03-12 07:00 | 2019-03-13 17:45 | LSST - LSST dev/Slurm compute nodes

network testing

24 compute nodes were reserved for admin use for this testing | lsst-sysadm@ncsa.illinois.edu | Testing was extended into the 13th but was completed, and nodes have been returned to service
2019-03-12 13:25 | 2019-03-12 14:25 | LSST

public DNS names were inadvertently removed for LSST's Oracle servers/service and the service became unavailable

LSST Oracle servers/service | lsst-sysadm@ncsa.illinois.edu
  • DNS was completely restored by 14:25
  • slowness following return to service was initially reported by one user but this seems to have resolved itself
2019-03-09 22:35 | 2019-03-09 22:35 | LSST | Power sag caused 27 L1 "NCSA test stand" nodes to reboot | 27 L1 "NCSA test stand" nodes | lsst-sysadm@ncsa.illinois.edu | Servers rebooted themselves
2019-03-09 09:56 | 2019-03-09 10:31 | NCSA Jira, Pop, File-server | A VM host kernel panicked, causing its VMs to restart on alternate hosts. | Jira, pop mail server, and file-server services | help+its@ncsa.illinois.edu | VMs automatically restarted themselves.
2019-03-08 06:15 | 2019-03-08 06:45 | NCSA Storage Condo | There was an IB error on the storage network causing the core servers to lose connectivity to disk. | NFS/GridFTP/Remote Cluster Mounts | ckerner@illinois.edu | The node with the IB issue has been temporarily removed from service and will be placed back in when corrected.
06-Mar-2019 8am (CST) | 06-Mar-2019 9am (CST) | All services behind pfSense firewall at NCSA (qserv, verify, lsp, oradb) | pfSense network config update to stage 'k8s-prod' deployment. Requires failover of firewall, and may cause short (~60s) outage of systems behind the firewall. | All services behind pfSense firewall at NCSA (qserv, verify, lsp, oradb)

lsst-sysadm@ncsa.illinois.edu

help+security@ncsa.illinois.edu

Complete
2019-03-04 2:00 pm | 2019-03-04 2:08 pm | NAPS

IDDS will be applying several updates to the NCSA Allocations Processing Service (NAPS):

(1) Searches for logins will only find those logins for the current domain

(2) Logins will always be created for the same organization as the domain (instead of always creating an NCSA login)

(3) Valid login rules will check the rules for the organization of the current domain

(4) Bug fix to make sure int args to procedures are passed as ints, not strings

(5) Speed up project loading process

(6) Dynamically determine compute resources

(7) Correct information in confirmation message when terminating a user from a project

(8) When selecting allocation for new users, only show the most current allocations for each resource

NAPS | help+idds@ncsa.illinois.edu | Complete
2019-03-01 09:00 | 2019-03-01 09:33 | aForge | Multiple Ambari Services were in an error state. Individual service starts would fail. | Job submission was down | aforge-admin@ncsa.illinois.edu | Cluster was restarted
2019-02-27 23:00 | 2019-02-28 08:23 | NCSA VPN | Campus moved Duo to a different instance (off of DUO1) to improve performance and reduce future downtime. NCSA Duo is bundled with campus Duo and is also affected. The vendor has completed changes, but additional work appears to be needed on the NCSA VPN to accommodate this change. | NCSA VPN was not working with Duo push; entering the 6-digit passcodes generated by the Duo app can be used as a workaround.

help+neteng@ncsa.illinois.edu


NCSA VPN is now working for both push and passcodes.
2019-02-27 23:00 | 2019-02-27 23:59 | Any system using Duo authentication | The vendor moved us to a different Duo instance (off of DUO1) to improve performance and reduce future downtime.

Anyone who has a current session will not be impacted; only people trying to authenticate into a new session will be. We expect Duo to be up for most of this change window and actual downtime to be minutes. All systems using Duo are affected including:

help+neteng@ncsa.illinois.edu

help+security@ncsa.illinois.edu


Vendor has completed work and most systems appear to be functioning, however it appears some local changes are needed for the NCSA VPN - see separate posting.
2019-02-22 06:30 | 2019-02-22 07:00 | ICCP WAN

This morning during a routine generator transfer test, one of the UPS units in a Tech Services networking node, node-1, failed resulting in a loss of power to portions of node-1.  Network Engineers were on-site during the test and were able to quickly resolve all issues stemming from that loss in power.  Not all equipment hosted in node-1 was impacted but one of the campus core routers, equipment hosting the science DMZ (CARNE and thus ICCP WAN as a whole) and other parts of the ICCN (Inter-Campus Communication Network) were impacted. 

All networking in and out of ICCP was down. Intra-cluster networking within ICCP was not affected. | help+neteng@ncsa.illinois.edu | ICCN network engineers resolved the issues and things came back up successfully
2019-02-21 10:00 | 2019-02-21 14:00 | ICCP | Moab core dump during startup. | No one can submit jobs and no new jobs will start. | help@campuscluster.illinois.edu | Able to restart Moab after removing all checkpoint files.
2019-02-21 08:00 | 2019-02-21 12:00 | LSST

Monthly maintenance

  • OS/Yum updates
  • Switch maintenance in NPCF N73 & P73
  • pfSense update & port negotiation change
  • GPFS server updates
  • Firmware updates for Dell C6420s

ALL LSST systems, including:

  • lsst-dev01, lsst-xfer, etc.
  • PDAC, verification, and Kubernetes clusters
  • tus-ats01
lsst-admin@ncsa.illinois.edu

Maintenance was successfully completed with one pending issue:

  • monitoring hosts (lsst-int-monitor; monitor-ncsa) were not showing status information due to a problem reaching InfluxDB (resolved)
2019-02-21 09:20 AM | 2019-02-21 09:26 AM | Services using DUO | The DUO1 deployment experienced a load balancer failure resulting in 100% of authentication requests failing to complete. | All systems using Duo were affected including: | help+security@ncsa.illinois.edu | This issue was identified and resolved via automated remediation by the vendor. See http://stspg.io/940af334e for details.
2019-02-18 01:31 PM | 2019-02-18 05:05 PM | ICCP | Moab was crashing after a few minutes of starting. | Jobs could be submitted, but would not start. | iccp-admins@campuscluster.illinois.edu | Moab was restarted with no additional commands run (showconfig, etc.). This allowed Moab to properly index the job database. After completion, the scheduler was stable again.
2019-02-18 9:00 AM | 2019-02-18 11:00 AM | LSST - K8s | Security update of Docker and Kubernetes packages to address CVE-2019-5736 | Qserv, all LSST services running in K8s. | lsst-admin@ncsa.illinois.edu | Patching completed on time (10:00 AM). Additional troubleshooting of lsp-stable & lsp-int indirectly related to maintenance.
2019-02-15 1:15 PM | 2019-02-15 about 1:45 PM | Some internet connectivity

ICCN router card crashed. Some commodity internet traffic was affected during the timeframe listed.

Commodity traffic to/from NCSA. | neteng+help@ncsa.illinois.edu | This has been resolved.
2019-02-13 17:00 | 2019-02-13 21:00 | netdot.ncsa.illinois.edu | NetEng will be migrating Netdot to a new platform. | Users will not be able to log in to the NetDot IPAM and make/view DNS entries. The DNS servers will remain available throughout the window. | help+neteng@ncsa.illinois.edu | This has been completed.
2019-02-10 11:40am | 2019-02-12 11:50am | ICCP | A controller failed, which caused an interruption with the redundant controller. A new enclosure is in place; still waiting on a valid second controller. Cluster has returned on one controller after FSCK came back clean on the file system. | Shared file systems on cluster were unavailable | set@ncsa.illinois.edu | After force verifying the pools, running FSCK on the file system, and swapping the enclosure, the file system returned to service. New controller successfully installed on 02/13; opened PMR with IBM on FSCK duration
2019-02-11 11:00 | 2019-02-11 17:50 | IDDS job processing

We will be doing a correction to a large number of Blue Waters job records in the IDDS database.
This process will begin at 11am and is expected to last around 6-7 hours.

There will be a small interruption to real time job loading for Blue Waters that should last around 1 hour.
Although there should be little impact to other systems, database access to the jobs table might be sluggish.
help+idds@ncsa.illinois.edu | Complete
2019-02-10 21:00 | 2019-02-11 09:15 | NCSA Open Source | Kernel crashed. Proxy server is down, resulting in all NCSA Open Source services being unreachable

NCSA OpenSource: JIRA, WIKI, BAMBOO, Confluence

devops.isda@lists.illinois.edu | Physical reboot of server resolved issue
2019-02-08 13:00 | 2019-02-08 17:30

BlueWaters HPSS

ncsa#Nearline globus service

HPSS core server encountered a bug and crashed


Vendor is installing a patch to the core hpss server. 


Anticipating the system will be returning to service by 17:20

BlueWaters HPSS storage

Globus transfers to/from ncsa#Nearline

Vendor installed a patch

HPSS and ncsa#Nearline were returned to service


2019-02-07 5:00 AM | 2019-02-07 5:30 PM | BW/HPSS, ncsa#Nearline (GO) | Scheduled Maintenance | Software and firmware updates completed. | help+bw@ncsa.illinois.edu | ncsa#Nearline (GO) returned to service
2019-02-06 9:05 AM | 2019-02-06 3:14 PM | BW/Scheduler | HSN issue; full reboot to recover | Mainframe rebooted and all running jobs were lost. | help+bw@ncsa.illinois.edu | BW returned to service
2019-02-05 07:00 | 2019-02-05 22:00 | iForge/aForge | Quarterly Maintenance (20190205 Maintenance for iForge) | All systems were unavailable during the maintenance. | iforge-admin@ncsa.illinois.edu | Maintenance was successfully completed. iForge and aForge were returned to service by 22:00.
2019-02-02 6:40 | 2019-02-02 10:20 | ICCP scheduler | Root filesystem filled up on cc-mgmt1. | Both resource manager and scheduler were down | iccp-admins@campuscluster.illinois.edu

Booted the system into single-user mode, gzipped the old messages file, and moved it to GPFS.

Had an issue restarting Moab after that. Restarting Moab with the clear checkpoint option worked.

2019-01-31 06:00 | 2019-01-31 07:10 | NCSA ITS vSphere vCenter | Upgraded ITS vSphere vCenter server to latest version | All VMs remained online during the maintenance, but management through vCenter was unavailable.

help+its@ncsa.illinois.edu

Upgrade complete

2019-01-30 10:00 p.m. | 2019-01-30 12:00 p.m. | NCSA XSEDE DNS server | Performing patching/upgrade on ns1.xsede.org | While patching, the ns1.xsede.org DNS server will be unavailable intermittently. Backup DNS servers will remain available during this time frame. | help+neteng@ncsa.illinois.edu | Maintenance complete

2019-01-23 5PM | 2019-01-24 8AM | Fileserver | Scheduled Maintenance | Shares on Fileserver were unavailable during the outage. | help+its@ncsa.illinois.edu

Maintenance complete

2019-01-18 12:14 | 2019-01-18 14:32 | RSA OTP user portal | An ESXi server crashed, taking down several VMs it was hosting. The OTP VM rebooted on an alternate ESXi host. | RSA OTP user portal | help+its@ncsa.illinois.edu

RSA OTP user portal online

2019-01-18 12:14 | 2019-01-18 13:30 | JIRA, file-server, ad-a, jabber, vsphere, email relay | An ESXi server crashed, taking down several VMs it was hosting. The VMs all rebooted on alternate ESXi hosts.

JIRA, file-server, ad-a, jabber, vsphere, and email relay all rebooted

JIRA had index files corrupted and took a while to repair those

help+its@ncsa.illinois.edu

JIRA, file-server, ad-a, jabber, vsphere, and email relay rebooted and online

2019-01-17 08:00 | 2019-01-17 12:00 | LSST

Monthly maintenance

  • Power rebalancing in NPCF L73
  • Switch maintenance in NPCF M73, N73, P73
  • Critical security patching
  • Dell firmware upgrades
ALL LSST systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters, and tus-ats01) | lsst-admin@ncsa.illinois.edu

Maintenance was completed successfully with the following caveats:

  • lsp services in Kubernetes are not fully functional (this is carryover from before the PM; see discussion on Slack, dm-lsp-users and possibly other channels)
  • lsst-l1-cl-dmcs will not boot after firmware updates

Please open tickets if you notice other issues.

2019-01-16 10:00 | 2019-01-16 10:15 | NPCF Emergency power off | Emergency power off panel was energized | Facility electrical and HVAC systems | mrantissi@illinois.edu | Panel is armed
01/12/2019 8AM | 01/12/2019 1PM | BW/Mainframe resource | Hung threads on scratch/home, paused the scheduler; HSN requires full reboot to recover (9:30AM) | Mainframe rebooted and all running jobs were lost. | Timothy Bouvet | BW returned to service 1PM

2019-01-10 5:35PM | 2019-01-10 5:55PM | code42 crashplan pro e services had update for data-loss bug with MS OneDrive | Code42 crashplan service was updated to the latest release to fix a data-loss problem with clients also running MS OneDrive. | Backup services were interrupted for a few minutes while services updated | crashplan@ncsa.illinois.edu | Now running Code42 6.8.6
2019-01-10 3:20PM | 2019-01-10 5:00PM | DUO 2-Factor Auth | DUO upstream vendor reported issues with their service. https://status.duo.com/ | NCSA systems that use DUO for 2FA | help+security@ncsa.illinois.edu | DUO brought their systems back online
2019-01-09 10:28 AM | 2019-01-10 3:00 PM | BW/HPSS | Power event at NPCF and recovery from fallout | HPSS ncsa#Nearline | Glasgow, James A, glassgow@illinois.edu | HPSS ncsa#Nearline RTS
2019-01-09 10:28 AM | 2019-01-09 4:35 PM | BW/All Resources Down | Power event at NPCF and recovery from fallout | All BW Resources Down | tbouvet@illinois.edu | Power Restored, All Resources Except HPSS RTS
2019-01-09 1015 | 2019-01-09 12:15 | Industry systems/LSST systems | Power event at NPCF caused some Industry and some LSST systems to go offline | Running jobs on iforge and other systems

help+industry@ncsa.illinois.edu

lsst-admin@ncsa.illinois.edu

The affected systems have been returned to service and users are being notified of which jobs to rerun
2019-01-08 18:00 | 2019-01-08 19:00 | NCSA office net firewall | Software upgrade on NCSA firewall and some config changes. | NCSAnet wireless, wired network (closed and partially-closed nets). IllinoisNet wireless will remain available during the maintenance. | help+neteng@ncsa.illinois.edu | Firewall upgrade did not go through; however, all services have been restored. NetEng is investigating and will work with the vendor to figure out a solution.

01/08/2018 2:20PM | 1/08/2018 2:40PM | code42 crashplan pro e services had update for security issues | Code42 crashplan service was updated with the latest security fixes | Backup services were interrupted for a few minutes while services updated | crashplan@ncsa.illinois.edu | Now running Code42 6.8.5
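
For anyone checking whether a lookup is affected by the 2019-06-23 DNS views change listed above, the following is a minimal sketch only. It assumes the two address ranges and the zone name exactly as stated in that entry; the function names are hypothetical and it does not query any DNS server or reflect the actual server configuration.

```python
import ipaddress

# Address ranges and zone taken from the 2019-06-23 DNS entry in this log.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.24.0.0/13"),
]
INTERNAL_ZONE = "internal.ncsa.edu"


def reverse_lookup_is_internal_only(address: str) -> bool:
    """Hypothetical helper: True if the address is in a range whose
    reverse lookups are described as no longer answered externally."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in INTERNAL_NETWORKS)


def name_is_internal_only(hostname: str) -> bool:
    """Hypothetical helper: True if the name falls under *.internal.ncsa.edu."""
    name = hostname.rstrip(".").lower()
    return name.endswith("." + INTERNAL_ZONE)
```

Note that 172.24.0.0/13 spans 172.24.0.0 through 172.31.255.255, so an address such as 172.23.0.1 is outside the listed range.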
