Datastore Location : /mnt/b/archive/monitoring_data/ovis

Derived datastore located at : /mnt/b/archive/monitoring_data/archive/isc/

List of Fields : Current Data Set

Current Data Set in Files

Link to dataset description with stats (note: stats currently cover a 1-month period; expansion to the full year of 2020 is in progress): https://docs.google.com/spreadsheets/d/1Qf268lKCZtdQ2wu_TFr3bA0TJrzSztAc1F2FxvG2YVc/edit?usp=sharing


Field | Type | Description | Included In Training Dataset | Reasoning For Inclusion (Or Not)
#Time | Unix epoch timestamp with a decimal microsecond fraction | Timestamp of when this data was recorded
Time_usec | Integer | Microsecond component of the timestamp
CompId | Integer | Numerical component of the node ID from the Torque data; can be used to link to the Torque data
Tesla_K20X.gpu_util_rate | Integer [0-100] | Percentage of the GPU in use. Equivalent to the GPU rate from Nvidia-SMI
Tesla_K20X.gpu_agg_dbl_ecc_total_errors | Integer | Aggregated double-bit ECC (Error Checking and Correcting) counter: total number of errors thrown
Tesla_K20X.gpu_agg_dbl_ecc_texture_memory | Integer | Aggregated double-bit ECC counter: number of texture_memory errors thrown
Tesla_K20X.gpu_agg_dbl_ecc_register_file | Integer | Aggregated double-bit ECC counter: number of register_file errors thrown
Tesla_K20X.gpu_agg_dbl_ecc_device_memory | Integer | Aggregated double-bit ECC counter: number of device_memory errors thrown
Tesla_K20X.gpu_agg_dbl_ecc_l2_cache | Integer | Aggregated double-bit ECC counter: number of L2_Cache errors thrown
Tesla_K20X.gpu_agg_dbl_ecc_l1_cache | Integer | Aggregated double-bit ECC counter: number of L1_Cache errors thrown
Tesla_K20X.gpu_memory_used | Integer | Number of bytes of GPU memory in use
Tesla_K20X.gpu_temp | Integer | GPU temperature in degrees Celsius, rounded to the nearest degree
Tesla_K20X.gpu_pstate | Integer | Power state the GPU is currently in; should be a number between 0 and 15
Tesla_K20X.gpu_power_limit | Integer | Maximum number of watts the GPU can use. This number should be static | no
Tesla_K20X.gpu_power_usage | Integer | Number of watts currently being used by the GPU

ipogif0_tx_bytes | Bytes | Number of bytes transmitted by the IP over Gemini Interface protocol
ipogif0_rx_bytes | Bytes | Number of bytes received by the IP over Gemini Interface protocol
RDMA_rx_bytes | Bytes | Number of bytes received by the Remote Direct Memory Access protocol
RDMA_nrx | Integer | Counter of receive requests made by the Remote Direct Memory Access protocol
RDMA_tx_bytes | Bytes | Number of bytes transmitted by the Remote Direct Memory Access protocol
RDMA_ntx | Integer | Counter of transmit requests made by the Remote Direct Memory Access protocol
SMSG_rx_bytes | Bytes | Number of bytes received by the Small Message protocol
SMSG_nrx | Integer | Counter of receive requests made by the Small Message protocol
SMSG_tx_bytes | Bytes | Number of bytes transmitted by the Small Message protocol
SMSG_ntx | Integer | Counter of transmit requests made by the Small Message protocol
current_freemem | Bytes | Number of bytes of RAM that are currently free
loadavg_total_processes | Integer | Counter of processes that are currently trying to run
loadavg_running_processes | Integer | Counter of currently running processes
loadavg_5min(x100) | Integer | Average count of total processes over the last 5 minutes
loadavg_latest(x100) | Integer |


nr_writeback | | Count of memory pages currently being written to disk
nr_dirty | | Count of memory pages waiting to be written to disk

lockless_write_bytes#stats.snx1100[123] | Integer | Count of lockless_write_bytes requests to the file system
lockless_read_bytes#stats.snx1100[123] | Integer | Count of lockless_read_bytes requests to the file system
direct_write#stats.snx1100[123] | Integer | Count of direct_write requests to the file system
direct_read#stats.snx1100[123] | Integer | Count of direct_read requests to the file system
inode_permission#stats.snx1100[123] | Integer | Count of inode_permission requests to the file system
removexattr#stats.snx1100[123] | Integer | Count of removexattr requests to the file system
listxattr#stats.snx1100[123] | Integer | Count of listxattr requests to the file system
getxattr#stats.snx1100[123] | Integer | Count of getxattr requests to the file system
setxattr#stats.snx1100[123] | Integer | Count of setxattr requests to the file system
alloc_inode#stats.snx1100[123] | Integer | Count of alloc_inode requests to the file system
statfs#stats.snx1100[123] | Integer | Count of statfs requests to the file system
getattr#stats.snx1100[123] | Integer | Count of getattr requests to the file system
flock#stats.snx1100[123] | Integer | Count of flock requests to the file system
lockless_truncate#stats.snx1100[123] | Integer | Count of lockless_truncate requests to the file system
truncate#stats.snx1100[123] | Integer | Count of truncate requests to the file system
setattr#stats.snx1100[123] | Integer | Count of setattr requests to the file system
fsync#stats.snx1100[123] | Integer | Count of fsync requests to the file system
seek#stats.snx1100[123] | Integer | Count of seek requests to the file system
mmap#stats.snx1100[123] | Integer | Count of mmap requests to the file system
close#stats.snx1100[123] | Integer | Count of close requests to the file system
open#stats.snx1100[123] | Integer | Count of open requests to the file system
ioctl#stats.snx1100[123] | Integer | Count of ioctl requests to the file system
brw_write#stats.snx1100[123] | Integer | Count of brw_write requests to the file system
brw_read#stats.snx1100[123] | Integer | Count of brw_read requests to the file system

write_bytes#stats.snx1100[123] | bytes? | Count of write_bytes requests to the file system?
read_bytes#stats.snx1100[123] | bytes? | Include cast rates can be higher than theoretical maximum

writeback_failed_pages#stats.snx1100[123] | Integer | Count of writeback_failed_pages requests to the file system
writeback_ok_pages#stats.snx1100[123] | Integer | Count of writeback_ok_pages requests to the file system
writeback_from_pressure#stats.snx1100[123] | Integer | Count of writeback_from_pressure requests to the file system
writeback_from_writepage#stats.snx1100[123] | Integer | Count of writeback_from_writepage requests to the file system
dirty_pages_misses#stats.snx1100[123] | Integer | Count of dirty_pages_misses requests to the file system
dirty_pages_hits#stats.snx1100[123] | Integer | Count of dirty_pages_hits requests to the file system

bteout_optA | Bytes | Number of bytes transmitted by the NIC using optA
SAMPLE_bteout_optA (B/s) | Bytes/second? | Rate in bytes/second of bteout_optA
bteout_optB | Bytes | Number of bytes transmitted by the NIC using optB
SAMPLE_bteout_optB (B/s) | Bytes/second? | Rate in bytes/second of bteout_optB
fmaout | Bytes | Number of bytes transmitted by the NIC using fmaout
SAMPLE_fmaout (B/s) | Bytes/second? | Rate in bytes/second of fmaout
totalinput | Bytes | Number of bytes received by the NIC
SAMPLE_totalinput (B/s) | Bytes/second? | Rate in bytes/second of totalinput
totaloutput_optA | Bytes | Number of bytes transmitted by the NIC using optA
SAMPLE_totaloutput_optA (B/s) | Bytes/second? | Rate in bytes/second of totaloutput_optA
totaloutput_optB | Bytes | Number of bytes transmitted by the NIC using optB
SAMPLE_totaloutput_optB (B/s) | Bytes/second? | Rate in bytes/second of totaloutput_optB

[XYZ][+-]_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6)

Z-_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6)
Z+_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6)
Y-_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6)
Y+_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6)
X-_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6)
X+_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6)


Derived metrics of link aggregated percent output stalls based on the current and previous sample

[XYZ][+-]_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6)

Z-_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6)
Z+_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6)
Y-_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6)
Y+_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6)
X-_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6)
X+_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6)


Derived metrics of link aggregated percent input stalls based on the current and previous sample

[XYZ][+-]_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B)

Z-_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B)
Z+_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B)
Y-_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B)
Y+_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B)
X-_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B)
X+_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B)

Bytes? | Derived metrics of link aggregated average packet size based on the current and previous sample

[XYZ][+-]_SAMPLE_GEMINI_LINK_USED_BW (% x1e6)

Z-_SAMPLE_GEMINI_LINK_USED_BW (% x1e6)
Z+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6)
Y-_SAMPLE_GEMINI_LINK_USED_BW (% x1e6)
Y+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6)
X-_SAMPLE_GEMINI_LINK_USED_BW (% x1e6)
X+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6)


Derived metrics of link aggregated bandwidth based on the current and previous sample and the estimated max BW for this link

[XYZ][+-]_SAMPLE_GEMINI_LINK_BW (B/s)

Z-_SAMPLE_GEMINI_LINK_BW (B/s)
Z+_SAMPLE_GEMINI_LINK_BW (B/s)
Y-_SAMPLE_GEMINI_LINK_BW (B/s)
Y+_SAMPLE_GEMINI_LINK_BW (B/s)
X-_SAMPLE_GEMINI_LINK_BW (B/s)
X+_SAMPLE_GEMINI_LINK_BW (B/s)

Bytes/second? | Derived metric of link aggregated bandwidth based on the current and previous sample

[XYZ][+-]_recvlinkstatus (1)

Z-_recvlinkstatus (1)
Z+_recvlinkstatus (1)
Y-_recvlinkstatus (1)
Y+_recvlinkstatus (1)
X-_recvlinkstatus (1)
X+_recvlinkstatus (1)

Integer? | Link aggregated status information (used to detect degraded links)

[XYZ][+-]_sendlinkstatus (1)

Z-_sendlinkstatus (1)
Z+_sendlinkstatus (1)
Y-_sendlinkstatus (1)
Y+_sendlinkstatus (1)
X-_sendlinkstatus (1)
X+_sendlinkstatus (1)

Integer? | Link aggregated status information (used to detect degraded links)

[XYZ][+-]_credit_stall (ns)

Z-_credit_stall (ns)
Z+_credit_stall (ns)
Y-_credit_stall (ns)
Y+_credit_stall (ns)
X-_credit_stall (ns)
X+_credit_stall (ns)

nanoseconds? | Link aggregated Gemini output stalls

[XYZ][+-]_inq_stall (ns)

Z-_inq_stall (ns)
Z+_inq_stall (ns)
Y-_inq_stall (ns)
Y+_inq_stall (ns)
X-_inq_stall (ns)
X+_inq_stall (ns)

nanoseconds? | Link aggregated Gemini input stalls

[XYZ][+-]_packets (1)

Z-_packets (1)
Z+_packets (1)
Y-_packets (1)
Y+_packets (1)
X-_packets (1)
X+_packets (1)


Link aggregated Gemini packet counter

[XYZ][+-]_traffic (B)

Z-_traffic (B)
Z+_traffic (B)
Y-_traffic (B)
Y+_traffic (B)
X-_traffic (B)
X+_traffic (B)

Bytes? | Link aggregated Gemini traffic counter in bytes

nettopo_mesh_coord_[XYZ] :

nettopo_mesh_coord_Z
nettopo_mesh_coord_Y
nettopo_mesh_coord_X

Integer | Location of the process on the 3D torus | no
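Several field names above use the shorthand `[XYZ][+-]` for the six link directions and carry a scale factor in the name, such as `(% x1e6)` or `(x100)`. The following is a minimal sketch of how that shorthand could be expanded and how raw values could be rescaled. The function names are hypothetical, and the reading of `(% x1e6)` as "percent multiplied by 1e6" and of `(x100)` as "value multiplied by 100" is an assumption taken from the field names, not a confirmed specification.

```python
from itertools import product

LINK_PREFIX = "[XYZ][+-]"


def expand_link_pattern(pattern):
    """Expand a leading '[XYZ][+-]' shorthand into six concrete column names.

    Names without the shorthand are returned unchanged in a one-item list.
    """
    if not pattern.startswith(LINK_PREFIX):
        return [pattern]
    suffix = pattern[len(LINK_PREFIX):]
    # product("XYZ", "+-") yields X+, X-, Y+, Y-, Z+, Z- in that order.
    return [f"{axis}{sign}{suffix}" for axis, sign in product("XYZ", "+-")]


def to_natural_units(column, raw):
    """Undo the scale factor encoded in a column name (assumed convention)."""
    if "(% x1e6)" in column:
        return raw / 1e6  # assumed: percent scaled by 1e6
    if "(x100)" in column:
        return raw / 100.0  # assumed: value scaled by 100
    return raw


if __name__ == "__main__":
    for name in expand_link_pattern("[XYZ][+-]_traffic (B)"):
        print(name)
    print(to_natural_units("X+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6)", 2500000))
```

Note that the file lists the directions in the order Z-, Z+, Y-, Y+, X-, X+, so columns should be selected by name rather than by position.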

Derived Dataset in Database

Same fields as the dataset above, but all values are per-second averages.
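As an illustration of what a per-second average means for a cumulative counter, here is a minimal sketch that divides the change between two consecutive samples by the elapsed time. It assumes the interval is given as whole seconds plus a microsecond component, in the manner of the DT and DT_usec fields; the function name and the handling of counter resets are hypothetical and may differ from how the derived dataset was actually produced.

```python
def per_second_rate(prev_count, curr_count, dt_sec, dt_usec=0):
    """Average rate per second between two samples of a cumulative counter.

    Returns None when the interval is not positive or the counter went
    backwards (for example after a reset), since no rate can be derived.
    """
    interval = dt_sec + dt_usec / 1_000_000
    if interval <= 0:
        return None
    delta = curr_count - prev_count
    if delta < 0:
        return None
    return delta / interval


if __name__ == "__main__":
    # 3000 bytes over a 60 second interval
    print(per_second_rate(1000, 4000, 60))
```

Returning None instead of a negative rate keeps resets from showing up as spurious values in later statistics.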

Field | Type | Description | Included In Training Dataset | Reasoning For Inclusion (Or Not)
#Time | | Timestamp of when the data was captured
Time_usec | | Microsecond component of the timestamp
DT | | Time since last datapoint
DT_usec | | Microsecond component of DT
CompId | |



nettopo_mesh_coord_[XYZ] (x 1.00e+00)

nettopo_mesh_coord_X (x 1.00e+00)
nettopo_mesh_coord_Y (x 1.00e+00)
nettopo_mesh_coord_Z (x 1.00e+00)

[XYZ][+-]_SAMPLE_GEMINI_LINK_BW (B/s) (x 1.00e+00)

X+_SAMPLE_GEMINI_LINK_BW (B/s) (x 1.00e+00)
X-_SAMPLE_GEMINI_LINK_BW (B/s) (x 1.00e+00)
Y+_SAMPLE_GEMINI_LINK_BW (B/s) (x 1.00e+00)
Y-_SAMPLE_GEMINI_LINK_BW (B/s) (x 1.00e+00)
Z+_SAMPLE_GEMINI_LINK_BW (B/s) (x 1.00e+00)
Z-_SAMPLE_GEMINI_LINK_BW (B/s) (x 1.00e+00)

[XYZ][+-]_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) (x 1.00e-06)

X+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) (x 1.00e-06)
X-_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) (x 1.00e-06)
Y+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) (x 1.00e-06)
Y-_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) (x 1.00e-06)
Z+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) (x 1.00e-06)
Z-_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) (x 1.00e-06)

[XYZ][+-]_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) (x 1.00e+00)

X+_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) (x 1.00e+00)
X-_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) (x 1.00e+00)
Y+_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) (x 1.00e+00)
Y-_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) (x 1.00e+00)
Z+_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) (x 1.00e+00)
Z-_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) (x 1.00e+00)

[XYZ][+-]_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) (x 1.00e-06)

X+_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) (x 1.00e-06)
X-_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) (x 1.00e-06)
Y+_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) (x 1.00e-06)
Y-_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) (x 1.00e-06)
Z+_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) (x 1.00e-06)
Z-_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) (x 1.00e-06)

[XYZ][+-]_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) (x 1.00e-06)

X+_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) (x 1.00e-06)
X-_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) (x 1.00e-06)
Y+_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) (x 1.00e-06)
Y-_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) (x 1.00e-06)
Z+_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) (x 1.00e-06)
Z-_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) (x 1.00e-06)

SAMPLE_totaloutput_optA (B/s) (x 1.00e+00)
SAMPLE_totalinput (B/s) (x 1.00e+00)
SAMPLE_fmaout (B/s) (x 1.00e+00)
SAMPLE_bteout_optA (B/s) (x 1.00e+00)
SAMPLE_bteout_optB (B/s) (x 1.00e+00)
SAMPLE_totaloutput_optB (B/s) (x 1.00e+00)
Rate_read_bytes#stats.snx11001 (x 1.00e+00)
Rate_write_bytes#stats.snx11001 (x 1.00e+00)
Rate_open#stats.snx11001 (x 1.00e+00)
Rate_close#stats.snx11001 (x 1.00e+00)
Rate_seek#stats.snx11001 (x 1.00e+00)
Rate_read_bytes#stats.snx11002 (x 1.00e+00)
Rate_write_bytes#stats.snx11002 (x 1.00e+00)
Rate_open#stats.snx11002 (x 1.00e+00)
Rate_close#stats.snx11002 (x 1.00e+00)
Rate_seek#stats.snx11002 (x 1.00e+00)
Rate_read_bytes#stats.snx11003 (x 1.00e+00)
Rate_write_bytes#stats.snx11003 (x 1.00e+00)
Rate_open#stats.snx11003 (x 1.00e+00)
Rate_close#stats.snx11003 (x 1.00e+00)
Rate_seek#stats.snx11003 (x 1.00e+00)
loadavg_latest(x100) (x 1.00e+00)
loadavg_5min(x100) (x 1.00e+00)
loadavg_running_processes (x 1.00e+00)
loadavg_total_processes (x 1.00e+00)
current_freemem (x 1.00e+00)
Rate_SMSG_ntx (x 1.00e+00)
Rate_SMSG_tx_bytes (x 1.00e+00)
Rate_SMSG_nrx (x 1.00e+00)
Rate_SMSG_rx_bytes (x 1.00e+00)
Rate_RDMA_ntx (x 1.00e+00)
Rate_RDMA_tx_bytes (x 1.00e+00)
Rate_RDMA_nrx (x 1.00e+00)
Rate_RDMA_rx_bytes (x 1.00e+00)
Rate_ipogif0_rx_bytes (x 1.00e+00)
Rate_ipogif0_tx_bytes (x 1.00e+00)
Tesla_K20X.gpu_util_rate (x 1.00e+00)
Tesla_K20X.gpu_memory_used (x 1.00e+00)
Tesla_K20X.gpu_temp (x 1.00e+00)
Tesla_K20X.gpu_pstate (x 1.00e+00)
Tesla_K20X.gpu_power_limit (x 1.00e+00)
Tesla_K20X.gpu_power_usage (x 1.00e+00)



Flag | | Indicates whether the data is valid
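When reading rows of the derived dataset, the time fields and the Flag field are usually handled first. The sketch below shows one way this could look, under loudly stated assumptions: rows are dictionaries keyed by field name (for example from `csv.DictReader`), `#Time` and `DT` are text of the form `seconds.microseconds`, and a nonzero Flag marks a valid row. The encoding of Flag is not documented here, so that last point in particular is a guess, and all function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def whole_seconds(field):
    """Integer part of a 'seconds.microseconds' text field."""
    return int(field.split(".")[0])


def sample_time(time_field, time_usec):
    """Build an exact UTC datetime from #Time and Time_usec.

    The whole seconds come from #Time and the fraction from Time_usec,
    which avoids float rounding of the microsecond part.
    """
    base = datetime.fromtimestamp(whole_seconds(time_field), tz=timezone.utc)
    return base + timedelta(microseconds=int(time_usec))


def interval_seconds(dt_field, dt_usec):
    """Elapsed seconds since the previous datapoint from DT and DT_usec."""
    return whole_seconds(dt_field) + int(dt_usec) / 1_000_000


def valid_rows(rows):
    """Keep rows whose Flag is nonzero (ASSUMPTION: nonzero means valid)."""
    return [row for row in rows if int(row["Flag"]) != 0]


if __name__ == "__main__":
    print(sample_time("1577836800.250000", 250000).isoformat())
    print(interval_seconds("60.000250", 250))
```

If Flag turns out to use a different encoding, only `valid_rows` needs to change.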
