status.ncsa.illinois.edu

 

Watch this page in the wiki to subscribe to automatic updates to this status page.

Current Status

 

Start | What System/Service is affected | What is happening? | What will be affected? | Actions
2017-10-19 08:00 | LSST | Outage and migration of qserv-master01: provisioning of new hardware, copying of data from old server to new. | qserv-master01 (and any services that depend on qserv-master01, which may include services provided by qserv-db*, qserv-dax01, and sui*)

in progress

UPDATE (2017-10-19 15:15): OS install took much longer than anticipated and completed at 15:00. Data sync has started. Extending outage until 22:00.

Include the keyword "issue" in updates above to trigger actions.

Report a problem

Upcoming Scheduled Maintenance

Start | End | What System/Service is affected | What is happening? | What will be affected?
2017-10-23 08:00 | 2017-10-24 08:00 | Campus Cluster | Moving and rewiring DES rack, installing new cc-core switches and rewiring Shared Services POD. | Scheduler will be down, GPFS will be down, and there will be no outside connectivity for a brief moment.
2017-10-23 08:00 | 2017-10-24 08:00 | Active Data Service | Since these machines will be down due to the ICCP network maintenance, I am going to take the time to patch them and upgrade GPFS to 4.2.3.4. | All services running on the ADS systems.
2017-10-19 08:00 | 2017-10-19 16:00 | LSST | Outage and migration of qserv-master01: provisioning of new hardware, copying of data from old server to new. | qserv-master01 (and any services that depend on qserv-master01, which may include services provided by qserv-db*, qserv-dax01, and sui*)


Previous Outages

Start | End | What System/Service was affected? | What happened? | What was affected? | Outcome
2017-10-19 08:00 | 2017-10-19 12:00 | LSST | Routine patching and reboots, pfSense firmware updates (NPCF), Dell server firmware updates (NPCF). | All NCSA-hosted resources except for Nebula. | Maintenance completed successfully. (The qserv-master migration is ongoing; see separate status entry.)
2017-10-18 14:45 | 2017-10-18 15:35 | Campus Cluster | Restart of resource manager failed after removing all block array jobs. | Job submission | Opened case with Adaptive (#25796). Found more array jobs and bad jobs in jobs directories. Removed all of those.
2017-10-15 08:15 | 2017-10-15 08:30 | Open Source | Emergency upgrade of Atlassian Bamboo. | Bamboo will be down for a few minutes during this outage window. | Bamboo upgraded to the latest version.
2017-10-14 22:15 | 2017-10-14 23:35 | Campus Cluster | Scheduler crash | Job submission | Opened case with Adaptive, ran diag and uploaded the output along with the core file. Restarted Moab.
2017-10-14 13:00 | 2017-10-14 15:23 | Campus Cluster | Resource manager crash | Job submission | Applied patch from Adaptive, which helps with faster recovery. Suspended/blocked all current and new array jobs until we have a resolution.
2017-10-06 09:00 | 2017-10-11 01:00 | Nebula | Gluster and network issues. 1) Gluster sync issues continue from 2017-10-05's Nebula incident. 2) At approximately 2017-10-06 16:10, a Nebula networking issue (unrelated to the Gluster issues) occurred, resulting in host network drops within the Nebula infrastructure. This internal networking incident resulted in additional Gluster and iSCSI issues. | Many instances are broken because iSCSI is broken from the Nebula network issues. Any instances that were broken because of Gluster are still broken. | All instances have been restarted and are in a state for admins to run. Some mounted file systems might require an fsck to verify. If there are other issues please send a ticket. As the file system continues to heal we may see slower interaction.

2017-10-10 16:30 | 2017-10-10 19:10 | Campus Cluster | Resource manager crash | Job submission | After removing problematic jobs from the queue, we were able to restart the RM. Opened a case with Adaptive and forwarded those job scripts and core files.
2017-10-05 14:00 | 2017-10-05 17:00 | Nebula | Gluster sync issues | One of the gluster storage servers within Nebula had to be restarted. | Approximately 100 VM instances experienced IO issues and were restarted.
2017-10-06 08:00 | 2017-10-06 17:00 | NCSA direct peering with ESnet | A fiber cut between Peoria and Bloomington caused our ESnet direct peering to go down. | All traffic that would have taken the ESnet peering rerouted through our other WAN peers. As such there were no reported outages of connectivity to resources that users would normally access via this peering. | The fiber cut has been repaired and the peering has been re-established.
2017-10-06 08:00 | 2017-10-06 10:00 | LSST | Kernel and package updates to address various security vulnerabilities, including the PIE kernel vulnerability described in CVE-2017-1000253. This will involve an upgrade to CentOS 7.4 and updates to GPFS client software on relevant nodes. | All NCSA-hosted LSST resources except for Nebula (incl. LSST-Dev, PDAC, and verification/batch nodes) will be patched and rebooted. | Maintenance completed successfully. Pending updates to a couple of management nodes (adm01 and repos01) and one Slurm node that is draining (verify-worker11).
2017-10-04 07:40 | 2017-10-04 09:55 | Campus Cluster | Resource Manager crash | Job submission | Failure on initial restart attempt. After looking through the core, decided to try a restart again without any change. This time it worked.
2017-10-03 13:00 | 2017-10-03 19:00 | Campus Cluster | Resource Manager crash | Job submission | After removing ~30 problematic jobs from the queue, we were able to restart the RM. Opened a case with Adaptive and forwarded those job scripts and core files.
2017-09-21 02:57 | 2017-09-21 09:40 | Storage server (AFS, iSCSI, web, etc.) | The parchment storage server stopped responding on the network. | Several websites were down, including www.ncsa.illinois.edu, cybergis.illinois.edu, nationaldataservice.org, etc. iSCSI storage mounted to fileserver went offline. Several AFS volumes, including some users' home directories, were offline. | Replaced optical transceiver on the machine and networking restarted. Also updated kernel and AFS.
2017-09-20 08:00 | 2017-09-20 13:45 | Campus Cluster | September Maintenance | Total cluster outage | Maintenance completed successfully.
2017-09-20 08:00 | 2017-09-20 11:30 | NCSA Storage Condo | Normal maintenance: firmware upgrade on NetApps so new disk trays could be attached for DSIL | Total file system outage | The quarterly maintenance was completed.
2017-09-18 11:20 | 2017-09-18 13:30 | Active Data Storage | RAID failure in NSD server and disk failure on secondary NSD server. | ADS service was unavailable | Recovered RAID configuration on NSD server and replaced failed disk on secondary NSD. ADS restored.
2017-09-15 06:20 | 2017-09-15 09:28 | public-linux | OpenAFS storage was not running or mounted after rebooting to a new kernel. | AFS storage was not available from this server | Reinstalled the dkms-openafs package and restarted the openafs-client. AFS is now working as expected.
2017-09-10 09:45 | 2017-09-10 11:30 | NCSA Open Source | Upgrade of Bamboo, JIRA, Confluence, BitBucket, FishEye, Crowd | During the upgrade the services will be unavailable for a short amount of time. | All services upgraded successfully.
2017-08-31 11:07 | 2017-08-31 11:11 | NCSA LDAP | NCSA LDAP Timeouts | NCSA LDAP was overloaded and timing out. Users were not able to authenticate via NCSA LDAP during that time. | NCSA LDAP stopped timing out at 11:11 am and authentication resumed.
2017-08-28 11:55 | 2017-08-28 12:59 | NCSA GitLab | NCSA GitLab server ran out of disk space for the OS | The web interface at https://git.ncsa.illinois.edu wasn't working | Web interface is now working. Space freed up by clearing CrashPlan caches.
2017-08-24 13:00 | 2017-08-24 14:30 | netact.ncsa.illinois.edu | Transient config issues from some system patching caused Apache to not be able to start on the netact server | Network Activation | The issues were fixed and Network Activation is working again.
2017-08-24 08:00 | 2017-08-24 15:30 | LSST | Rack upgrades in NCSA 3003 | Most LSST Developer services offline during upgrade | All LSST systems are back online with new racks and switches.
2017-08-24 08:00 | 2017-08-24 09:30 | LSST | Monthly maintenance for NPCF (includes patching to address CESA-2017:1789 and CESA-2017:1793) | adm01, backup01, bastion01, monitor01, object*, qserv*, sui*, verify-worker*, test0* | Maintenance was successfully completed.
2017-08-23 09:21 | 2017-08-23 16:50 | aForge/iForge | GPFS failed during an upgrade of GPFS on the iForge storage nodes. There was an IB hiccup at the time, but causality is unclear. | All jobs on iForge were aborted, GPFS clients needed to be upgraded, and all GPFS client nodes were rebooted. | iForge went production shortly before 5:12pm. aForge went "production" at ~1630.
2017-08-22 20:00 | 2017-08-22 30:00 | Patching DHCP service | Patching OS and services on DHCP1. | Will need to reboot the DHCP server a few times during this process. During this time DHCP will be unavailable. This is during the evening so I don't expect any direct issues from this. | Patching has been completed.
2017-08-16 08:00 | 2017-08-16 16:00 | Campus Cluster | August Maintenance | Scheduler and resource manager down | Upgraded Moab 9.1.1 and Torque 6.1.1.
2017-08-16 08:00 | 2017-08-16 09:15 | None | Replace Line Card in Core Switch | I believe all systems connected to this switch are multihomed and will not experience an outage. | The line card has been successfully replaced.
2017-08-16 00:30 | 2017-08-16 02:30 | Blue Waters | Two cabinets (c10 & c11) had EPO due to XDP control valve failure. | Scheduler was paused to isolate failing parts, resumed at 2:09. | Parts replaced and cabinets were returned to service.
2017-08-08 7:00 | 2017-08-09 3:00 | iforge/cfdforge/aforge | Update OS image to RH 6.9. Update GPFS to version 4.2.3-2. Redistribute power drops. | All four clusters were updated. | All items on checklist completed. (20170808 Maintenance for iforge)

2017-08-03 06:45 | 2017-08-03 07:35 | NCSA Jabber upgrade | Upgraded Openfire XMPP jabber software | NCSA Jabber was unavailable during the upgrade. | Jabber was upgraded to the latest version of Openfire.
2017-07-28 17:00 | 2017-07-31 evening (Update: All of the production data has been migrated except for the largest object table. That is loading now, then the user space will be loaded. Should all hopefully be done by this evening.) | Migration of operational database to new hardware happening during the weekend. | DES old operational database | Migration done successfully. Some other maintenance tasks that give DES additional disk space, plus some performance improvements, were done too.
2017-07-27 11:00 | 2017-07-28 15:00 | netact.ncsa.illinois.edu | The netact.ncsa.illinois.edu network activation server VM needed to be restored from backup | Network Activation service | The service has been fully restored.
2017-07-25 02:36 | 2017-07-25 18:00 | Campus Cluster / Scheduler down | Blip on mgmt1 causing GPFS drop and scheduler to crash | Scheduler offline | Scheduler still takes a long time to initialize, but jobs can start and run as usual. Opened case with Adaptive.
2017-07-20 09:00 | 2017-07-20 17:00 | ROGER Ambari and OpenStack | Updates to OpenStack control node and the Ambari cluster | Ambari nodes (cg-hm08 - cg-hm18), OpenStack instances and servers | OpenStack was back in service on time. Ambari had issues mounting HDFS and was held out of service. HDFS was remounted on 25 July.
2017-07-20 06:00 | 2017-07-20 10:00 | All NCSA hosted LSST resources | Monthly OS patches (addressing issues including CESA-2017:1615 and CESA-2017:1680). Roll-out of updated puppet modules. Batch nodes updated firmware. | All nodes in NCSA 3003 and NPCF (batch nodes) will reboot. | Overall success. Exceptions: verify-worker31 failed a firmware update and is out of commission (LSST-914), and there are connectivity issues for some VMs used by the NCSA DM team (IHS-365). adm01, backup01, and test[09-10] will be patched in the near future.
2017-07-19 08:00 | 2017-07-19 14:44 | Campus Cluster | July Maintenance (applied security patch) | Cluster wide, except mwt2 nodes | Applied new kernel, glibc, bind patches and newest NVIDIA driver.
2017-06-29 1800 | 2017-06-30 0000 | Blue Waters | Emergency maintenance to apply security patch addressing Stack Guard security vulnerability. | Compute, Login, Scheduler are offline. | Kernel and glibc library patched on all affected systems.
2017-06-22 0800 | 2017-06-22 1200 | All NCSA hosted LSST resources | CRITICAL kernel and package updates to address Stack Guard Page security vulnerability. | Systems will be patched and rebooted. | Outage was extended to last past 1000 until 1200. Systems were successfully patched as planned except for qserv-db12 and qserv-db27, which will not boot. We will follow up on those with a ticket.
2017-06-22 0800 | 2017-06-22 0930 | LSST cluster nodes (verify-worker*, qserv*, sui*, bastion01, test*, backup01) | Deploy Unbound (local caching DNS resolver) | DNS resolving may have a short (~30 mins) delay. | Successfully deployed and all tests (including reverse DNS and intra-cluster SSH) pass.
2017-06-20 0930 | 2017-06-20 1100 | Blue Waters | XDP shutting down causing EPO on cabinets c1-7 and c2-7. | Scheduler was paused to isolate the failing components, then resumed. | Warmswapped failing components and returned them to service.

2017-06-20 0900 | 2017-06-20 1000 | NCSA Open Source | Security upgrade needed for Bamboo, will also update the following components: Bamboo, JIRA, Confluence, BitBucket, FishEye | Most of the subcomponents of NCSA Open Source will be down for a short time when the software is updated. | Upgraded Bamboo, JIRA, Confluence, BitBucket, FishEye to latest versions.

2017-06-16 0900 | 2017-06-16 1100 | ROGER OpenStack NFS backend failed and was restarted | The primary CES server for the OpenStack backend failed and tried to fail over to the secondary server, which also failed. SET was notified and they had the CES NFS service back up by 1100. | The ROGER OpenStack dashboard went down and needed a restart. Several VMs experienced "virtual drive errors" and will need to be restarted. | SET is still investigating the cause of the GPFS CES service failover. CyberGIS is working with their users to get the affected VMs restarted.
2017-06-15 0800 | 2017-06-15 0930 | LSST cluster nodes (verify-worker*, qserv*, sui*, bastion01, test*, backup01) | Deploy unbound | DNS resolving may have a short (~30 mins) delay. | Updates deployed successfully via new puppet module. All tests passed. EDIT 2017-06-15 1500: Reverse DNS not working, which broke ssh to qserv* nodes. Disabled unbound.

6/14/2017 8:00 a.m. | 6/14/2017 10:00 p.m. | Network Core Switch | Network Engineering will be replacing a line card in one of our Core switches due to a hardware issue. | All services should remain active. Any affected switch will have a second redundant link to the other core to pass traffic. | Line card was successfully replaced.
2017-06-08 12:00 | 2017-06-11 22:20 | Campus Cluster (scheduler paused) | Disk Enclosure 3 failure on DDN 10K. | Lost redundancy, which forced us to drain the cluster. | Repair/replacement of the controller can be time consuming, so we took action to rebalance data out of the failed enclosure. Scheduler was resumed as of 22:00.

2017-06-07 12:07 | 2017-06-07 12:42 | NCSA LDAP | The NCSA LDAP service crashed | NCSA LDAP service was unavailable | LDAP software and OS were updated and server rebooted. LDAP is working normally.
2017-05-31 20:06 | 2017-05-31 20:36 | NCSA LDAP | The NCSA LDAP service was timing out | NCSA LDAP service was unavailable | The root cause of LDAP timeouts is still being investigated.
2017-05-22 | 2017-05-26 | Campus Cluster VMs | Network issue on ESXi (hypervisor) boxes after maintenance | Could no longer log in to start VMs. License server, Nagios, and all MWT2 VMs were down. | The issue was fixed on 5/24. Restored license and Nagios service on 5/24. Moved MWT2 VMs to Campus Farm. All VMs returned to service as of noon 5/26.

5/12/2017 | 5/18/2017 | Condo/NFS partitions only | The NFS partition for the condo became extremely unstable after a replication (normal daily maintenance) was completed. Many iterations with FSCK and IBM on the phone got it resolved, followed by 1.5 days of restoring files that had been put in Lost and Found. | UofI Library was switched to the READONLY version on the ADS during this time | The root cause is still being investigated.
2017-05-23 14:05 | 2017-05-23 14:13 | NCSA LDAP | The NCSA LDAP service was timing out | NCSA LDAP service was unavailable | The issue is still being investigated, but the service seems to have been steadily available since the incident.
2017-05-22 15:41 | 2017-05-22 15:51 | idp.ncsa.illinois.edu, oa4mp.ncsa.illinois.edu | Apache Tomcat out of memory | InCommon/SAML IdP and OIDC authentication services were unavailable. | Service restored by failing over to secondary server while memory is being increased on primary server.
05/20/2017 21:09 | 05/20/2017 23:37 | DES nodes on Campus Cluster | Could not communicate outside the switch | All nodes connected to switch in POD22 Rack2 @ACB | Upgrading the code on the switch resolved the issue.
05/20/2017 05:00 | 05/20/2017 21:09 | Campus Cluster and Active Data Storage (ADS) | Total power outage at ACB | All systems currently residing at ACB | Power was restored around 13:00 hrs. We rotated the ADS rack to align with the Campus Cluster Storage Rack. Changed a couple of VLAN IDs to reflect campus for future merger. ESXi boxes are down due to a configuration error after reboot. No major issue from output of FSCK from scratch02.

05/17/2017 02:00 | 05/17/2017 10:45 | Internet2 WAN connectivity | Intermittent WAN connectivity. The outage was a result of Tech Services' DWDM system, which provides us with our physical optical path up to Chicago via the ICCN. Specifically, the Adva card that our 100G wave is on was seeing strange errors, which was causing input framing errors for traffic coming in on this interface. | General WAN connectivity to XSEDE sites, certain commodity routes, and other I2 AL2S connections. | The Adva card was rebooted and we stopped seeing the input framing errors. Tech Services is working with Adva to find the root cause of the issues on the card.
5/11/2017 | 5/12/2017 | ESnet 100G connection | NCSA and ESnet will be moving their 100G connection to a different location in Chicago. | We have several diverse high speed paths to ESnet and DOE; traffic will be redirected to a secondary path.
2017-05-11 06:45 | 2017-05-11 07:33 | NCSA Jabber upgrade | Upgraded Openfire XMPP jabber software | NCSA Jabber was unavailable during the upgrade. | Jabber was upgraded to the latest version of Openfire.

2017-05-09 07:00 | 2017-05-09 18:15 | iForge, GPFS, License Servers | iForge Planned Maintenance | iForge systems, including the ability to submit/run jobs. | PM was completed early at 1815.
2017-05-06 22:00 | 2017-05-06 23:00 | NCSA Open Source | Upgrades of Atlassian software | NCSA Open Source BitBucket | BitBucket is upgraded.
2017-05-06 09:00 | 2017-05-06 10:00 | NCSA Open Source | Upgrade of Atlassian Software | Most services hosted at NCSA Open Source were down for 5 minutes during rolling upgrades. | The following services were upgraded: HipChat, Bamboo, JIRA, Confluence, FishEye and CROWD.
2017-05-05 17:43 | 2017-05-05 20:02 | ITS vSphere | A VM node panicked | Several VMs died when the node panicked and were restarted on other VM nodes. This included LDAP, JIRA, Help/RT, SMTP, Identity, and others. | All affected VMs were restarted on other VM nodes. Most restarted automatically.
2017-04-27 18:10 | 2017-04-27 18:55 | Campus Cluster | Another GPFS interruption | Both Resource Manager and Scheduler went down along with a handful of compute nodes. | Restarted the RM and Scheduler and rebooted all down nodes.
2017-04-27 13:11 | 2017-04-27 14:20 | Nebula | glusterfs crashed due to this bug, so no instances could access their filesystems | All instances running on Nebula | Needed to reboot the node that systems were mounting from, but took the opportunity to upgrade all gluster clients on other systems while waiting for a reboot. Version 3.10.1 fixes the bug. All instances with errors in their logs were restarted.
2017-04-27 11:20 | 2017-04-27 12:45 | Campus Cluster | GPFS interruption | Both Resource Manager and Scheduler went down. | Torque serverdb file was corrupted. Restored the file from this morning's snapshot and modified the data to match the current state.
2017-04-26 12:00 | 2017-04-26 18:30 | Condo | A bug in deleting a disk partition from GPFS (a problem within GPFS). | DES, Condo partitions, and UofI Library. | Partitions had been up for 274 days, with many changes. The delete-partition bug caused us to stop ALL operations on the condo and repair each disk through GPFS. Must have quarterly maintenance; it is just too complicated to go a year without resetting things.
2017-04-19 16:54 | 2017-04-20 08:45 | gpfs01, iforge | Filled-up metadata disks on I/O servers caused failures on gpfs01. | iforge clusters, including all currently running jobs. | Scheduling on iForge was paused for the duration of the incident. Running jobs were killed. 13% metadata space was freed. Clusters were rebooted and scheduling resumed.

2017-04-19 08:00 | 2017-04-19 13:00 | Campus Cluster | Merging xpacc data and /usr/local back to data01 (April PM) | Resource manager and Scheduler were unavailable during the maintenance. | Once again, /usr/local, /projects/xpacc and /home/<xpacc users> are mounting from data01. No more split cluster.
2017-04-04 (1330) | 2017-04-04 (1600) | Networking | Some fiber cuts caused a routing loop inside one of the campus ISP's network. | Certain traffic that traversed this ISP would never make the final destination. Some DNS lookups would have also failed. | Campus was able to route around the problem, and the ISP also corrected their internal problem. The cut fiber was restored last night.
2017-03-28 (0000) | 2017-03-29 (1600) | LSST | NPCF Chilled Water Outage | LSST - Slurm cluster nodes will be offline during the outage. All other LSST systems are expected to remain operational. | No issues. Slurm nodes restarted.
2017-03-28 (0000) | 2017-03-29 (0230) | Blue Waters | NPCF Chilled Water Outage | Full system shutdown on Blue Waters (except Sonexion, which is needed for fsck) | FSCK done on all Lustre file systems, XDP piping work done (no leakage found), software updates (PE, darshan) completed.
2017-03-25 10:15PM | 2017-03-26 00:08AM | Blue Waters | BW scratch MDT failover, df hangs | BW scratch MDT failover; a load of 500+ on the MDS delayed failover. Post-failover there were some issues that delayed RTS. | Scheduler was paused.
2017-03-25 4pm | 2017-03-25 8pm | Blue Waters | BW login node ps hang | Rebooted h1-h3, lost bw/h2ologin DNS record, had neteng recreate the record. Had to rotate logins in and out of round-robins until all rebooted. User email sent (2). | Login nodes rebooted. DNS round-robin changes.
2017-03-23 (1000) | 2017-03-23 (1500) | Nebula | NCSA Nebula Outage | Nebula will take an outage to balance and build a more stable setup for the file system. This will require a pause of all instances, and Horizon being unavailable. | File system online and stable. At this time all blocks were balanced and healed.
2017-03-16 (0630) | 2017-03-16 (1130) | LSST | LSST monthly maintenance | GPFS filesystems will go offline for the entire duration of the outage. Some systems may be rebooted, especially those that mount one or more of the GPFS filesystems.
2017-03-15 15:11 | 2017-03-15 16:01 | Blue Waters | Failure on cabinet c9-7, affecting HSN. | Filesystem hung for several minutes. | Scheduler was paused for 50 minutes. Warmswap cabinet c9-7. Nodes on c9-7 are reserved for further diagnosis.
2017-03-15 09:00 | 2017-03-15 12:47 | Campus Cluster | UPS work at ACB. Reshuffling electrical drops on 10k controllers, storage IB switches and some servers. | Scheduler will be paused for regular jobs. MWT2 and DES will continue to run on their nodes. | UPS work at ACB: incomplete (required additional parts). Redistributing power work done. Scheduler was paused for 3 hrs 50 mins.
2017-03-10 13:00 | 2017-03-10 18:00 | Campus Cluster | ICCP - We lost 10K controllers due to some type of power disturbance at ACB. | ICCP - Lost all filesystems, resulting in a cluster-wide outage. | Recovered missing LUNs and rebooted the cluster. Cluster was back in service at 18:00.
2017-03-09 0900 | 2017-03-09 1500 | ROGER | ROGER planned PM | Batch, Hadoop, data transfer services & Ambari | System out for 6 hrs, DT services out until 0000.
2017-03-08 19:41 | 2017-03-08 22:41 | Blue Waters | The XDP that served four cabinets (c16-10, c17-10, c18-10, c19-10) powered off. | Scheduler paused, four racks power cycled. Moab required a restart: too many nodes were down and iterations were stuck. | Scheduler paused three hours.
2017-03-03 1700 | 2017-03-03 2200 | Blue Waters | BW hpss emergency outage to clean up db2 database | ncsa#nearline, stores are failing with cache full | Resolved cache full errors.
2017-02-28 1200 | 2017-02-28 1250 | Campus Cluster | ICC Resource Manager down | Users can't submit new jobs or start new jobs | Removed corrupted job file.
2017-02-22 1615 | 2017-02-22 1815 | Nebula | Nebula Gluster Issues | All Nebula instances paused while gluster repaired | Nebula is available.
2017-02-11 1900 | 2017-02-11 2359 | NPCF | NPCF Power Hit | BW Lustre was down, XDP heat issues. | RTS 2017-02-11 2359
2017-02-15 0800 | 2017-02-15 1800 | Campus Cluster | ICC Scheduled PM | Batch jobs and login nodes access