Datastore Location : /mnt/b/archive/monitoring_data/ovis
Derived datastore located at : /mnt/b/archive/monitoring_data/archive/isc/
List of Fields : Current Data Set
Current Data Set in Files
Link to dataset description with stats (note: stats currently cover a 1-month period; expansion to the full year of 2020 is in progress): https://docs.google.com/spreadsheets/d/1Qf268lKCZtdQ2wu_TFr3bA0TJrzSztAc1F2FxvG2YVc/edit?usp=sharing
Field | Type | Description | Included In Training Dataset | Reasoning For Inclusion (Or Not) |
---|---|---|---|---|
#Time | Unix epoch timestamp with decimal for microseconds on the end. | The timestamp of when this data was recorded. | ||
Time_usec | Integer | Microsecond component of the timestamp | ||
CompId | Integer | Numerical component of the node id from the torque data, can be used to link to the torque data. | ||
Tesla_K20X.gpu_util_rate | Integer [0-100] | Percentage of GPU used. Equivalent to the GPU rate from Nvidia-SMI. | ||
Tesla_K20X.gpu_agg_dbl_ecc_total_errors | Integer | Aggregate double-bit ECC (error-correcting code) errors. Counter of how many errors have been recorded in total. | ||
Tesla_K20X.gpu_agg_dbl_ecc_texture_memory | Integer | Aggregate double-bit ECC (error-correcting code) errors. Counter of how many texture_memory errors have been recorded. | ||
Tesla_K20X.gpu_agg_dbl_ecc_register_file | Integer | Aggregate double-bit ECC (error-correcting code) errors. Counter of how many register_file errors have been recorded. | ||
Tesla_K20X.gpu_agg_dbl_ecc_device_memory | Integer | Aggregate double-bit ECC (error-correcting code) errors. Counter of how many device_memory errors have been recorded. | ||
Tesla_K20X.gpu_agg_dbl_ecc_l2_cache | Integer | Aggregate double-bit ECC (error-correcting code) errors. Counter of how many L2_Cache errors have been recorded. | ||
Tesla_K20X.gpu_agg_dbl_ecc_l1_cache | Integer | Aggregate double-bit ECC (error-correcting code) errors. Counter of how many L1_Cache errors have been recorded. | ||
Tesla_K20X.gpu_memory_used | Integer | Number of bytes of memory that are in use. | ||
Tesla_K20X.gpu_temp | Integer | GPU temperature in degrees Celsius, rounded to the nearest degree | ||
Tesla_K20X.gpu_pstate | Integer | The performance state the GPU is currently in; should be a number between 0 and 15 | ||
Tesla_K20X.gpu_power_limit | Integer | Maximum number of watts the GPU can use. This number should be static | no | |
Tesla_K20X.gpu_power_usage | Integer | Number of watts currently being used by the GPU | ||
ipogif0_tx_bytes | Bytes | Number of bytes transmitted by the IP over Gemini Interface protocol | ||
ipogif0_rx_bytes | Bytes | Number of bytes received by the IP over Gemini Interface protocol | ||
RDMA_rx_bytes | Bytes | Number of bytes received by the Remote Direct Memory Access protocol | ||
RDMA_nrx | Integer | Counter of requests for receives by the Remote Direct Memory Access protocol | ||
RDMA_tx_bytes | Bytes | Number of bytes transmitted by the Remote Direct Memory Access protocol | ||
RDMA_ntx | Integer | Counter of requests for transmits by the Remote Direct Memory Access protocol | ||
SMSG_rx_bytes | Bytes | Number of bytes received by the Small Message protocol | ||
SMSG_nrx | Integer | Counter of requests for receives by the Small Message protocol | ||
SMSG_tx_bytes | Bytes | Number of bytes transmitted by the Small Message protocol | ||
SMSG_ntx | Integer | Counter of requests for transmits by the Small Message protocol | ||
current_freemem | Bytes | Number of bytes of memory that are currently free in RAM | ||
loadavg_total_processes | Integer | Counter of processes that are currently trying to run. | ||
loadavg_running_processes | Integer | Counter of currently running processes | ||
loadavg_5min(x100) | Integer | Average count of total processes over the last 5 minutes, multiplied by 100 | ||
loadavg_latest(x100) | Integer | |||
nr_writeback | | Count of memory pages currently being written to disk | ||
nr_dirty | | Count of memory pages waiting to be written to disk | ||
lockless_write_bytes#stats.snx1100[123] | Integer | Count of lockless_write_bytes requests to the file system | ||
lockless_read_bytes#stats.snx1100[123] | Integer | Count of lockless_read_bytes requests to the file system | ||
direct_write#stats.snx1100[123] | Integer | Count of direct_write requests to the file system | ||
direct_read#stats.snx1100[123] | Integer | Count of direct_read requests to the file system | ||
inode_permission#stats.snx1100[123] | Integer | Count of inode_permission requests to the file system | ||
removexattr#stats.snx1100[123] | Integer | Count of removexattr requests to the file system | ||
listxattr#stats.snx1100[123] | Integer | Count of listxattr requests to the file system | ||
getxattr#stats.snx1100[123] | Integer | Count of getxattr requests to the file system | ||
setxattr#stats.snx1100[123] | Integer | Count of setxattr requests to the file system | ||
alloc_inode#stats.snx1100[123] | Integer | Count of alloc_inode requests to the file system | ||
statfs#stats.snx1100[123] | Integer | Count of statfs requests to the file system | ||
getattr#stats.snx1100[123] | Integer | Count of getattr requests to the file system | ||
flock#stats.snx1100[123] | Integer | Count of flock requests to the file system | ||
lockless_truncate#stats.snx1100[123] | Integer | Count of lockless_truncate requests to the file system | ||
truncate#stats.snx1100[123] | Integer | Count of truncate requests to the file system | ||
setattr#stats.snx1100[123] | Integer | Count of setattr requests to the file system | ||
fsync#stats.snx1100[123] | Integer | Count of fsync requests to the file system | ||
seek#stats.snx1100[123] | Integer | Count of seek requests to the file system | ||
mmap#stats.snx1100[123] | Integer | Count of mmap requests to the file system | ||
close#stats.snx1100[123] | Integer | Count of close requests to the file system | ||
open#stats.snx1100[123] | Integer | Count of open requests to the file system | ||
ioctl#stats.snx1100[123] | Integer | Count of ioctl requests to the file system | ||
brw_write#stats.snx1100[123] | Integer | Count of brw_write requests to the file system | ||
brw_read#stats.snx1100[123] | Integer | Count of brw_read requests to the file system | ||
write_bytes#stats.snx1100[123] | bytes? | Count of write_bytes requests to the file system? | ||
read_bytes#stats.snx1100[123] | bytes? | Includes cache reads? Rates can be higher than the theoretical maximum | ||
writeback_failed_pages#stats.snx1100[123] | Integer | Count of writeback_failed_pages requests to the file system | ||
writeback_ok_pages#stats.snx1100[123] | Integer | Count of writeback_ok_pages requests to the file system | ||
writeback_from_pressure#stats.snx1100[123] | Integer | Count of writeback_from_pressure requests to the file system | ||
writeback_from_writepage#stats.snx1100[123] | Integer | Count of writeback_from_writepage requests to the file system | ||
dirty_pages_misses#stats.snx1100[123] | Integer | Count of dirty_pages_misses requests to the file system | ||
dirty_pages_hits#stats.snx1100[123] | Integer | Count of dirty_pages_hits requests to the file system | ||
bteout_optA | Bytes | Number of bytes transmitted by the NIC using optA | ||
SAMPLE_bteout_optA (B/s) | Bytes/second? | Rate in bytes/second of bteout_optA | ||
bteout_optB | Bytes | Number of bytes transmitted by the NIC using optB | ||
SAMPLE_bteout_optB (B/s) | Bytes/second? | Rate in bytes/second of bteout_optB | ||
fmaout | Bytes | Number of bytes transmitted by the NIC using fmaout | ||
SAMPLE_fmaout (B/s) | Bytes/second? | Rate in bytes/second of fmaout | ||
totalinput | Bytes | Number of bytes received by the NIC | ||
SAMPLE_totalinput (B/s) | Bytes/second? | Rate in bytes/second of totalinput | ||
totaloutput_optA | Bytes | Number of bytes transmitted by the NIC using optA | ||
SAMPLE_totaloutput_optA (B/s) | Bytes/second? | Rate in bytes/second of totaloutput_optA | ||
totaloutput_optB | Bytes | Number of bytes transmitted by the NIC using optB | ||
SAMPLE_totaloutput_optB (B/s) | Bytes/second? | Rate in bytes/second of totaloutput_optB | ||
[XYZ][+-]_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) Z-_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) | | Derived metrics of link aggregated percent output stalls based on the current and previous sample | ||
[XYZ][+-]_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) Z-_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) | | Derived metrics of link aggregated percent input stalls based on the current and previous sample | ||
[XYZ][+-]_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) Z-_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) | Bytes? | Derived metrics of link aggregated average packet size based on the current and previous sample | ||
[XYZ][+-]_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) Z-_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) | | Derived metrics of link aggregated percent bandwidth used, based on the current and previous sample and the estimated max BW for this link | ||
[XYZ][+-]_SAMPLE_GEMINI_LINK_BW (B/s) Z-_SAMPLE_GEMINI_LINK_BW (B/s) | Bytes/second? | Derived metric of link aggregated bandwidth based on the current and previous sample | ||
[XYZ][+-]_recvlinkstatus (1) Z-_recvlinkstatus (1) | Integer? | link aggregated status information (used to detect degraded links) | ||
[XYZ][+-]_sendlinkstatus (1) Z-_sendlinkstatus (1) | Integer? | link aggregated status information (used to detect degraded links) | ||
[XYZ][+-]_credit_stall (ns) Z-_credit_stall (ns) | nanoseconds? | Link aggregated Gemini output stalls | ||
[XYZ][+-]_inq_stall (ns) Z-_inq_stall (ns) | nanoseconds? | Link aggregated Gemini input stalls | ||
[XYZ][+-]_packets (1) Z-_packets (1) | | Link aggregated Gemini packet counter | ||
[XYZ][+-]_traffic (B) Z-_traffic (B) | Bytes? | Link aggregated Gemini traffic counter in Bytes | ||
nettopo_mesh_coord_[XYZ] : nettopo_mesh_coord_Z | Integer | Location of the process on the 3D Torus | no |
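
Several field names in the table above use bracket shorthand such as `[XYZ][+-]` or `snx1100[123]`. The following is a minimal sketch of expanding that shorthand into concrete column names, assuming each bracket group stands for one column per listed character (as the `Z-_...` examples in the table suggest). The function name `expand_pattern` is hypothetical and not part of the dataset tooling.

```python
import re
from itertools import product


def expand_pattern(pattern: str) -> list[str]:
    """Expand every [abc]-style group in `pattern` into concrete names."""
    # Split into literal text and bracket groups such as [XYZ] or [+-].
    parts = re.split(r"(\[[^\]]+\])", pattern)
    choices = [
        list(part[1:-1]) if part.startswith("[") and part.endswith("]") else [part]
        for part in parts
    ]
    # One output name per combination of bracket choices.
    return ["".join(combo) for combo in product(*choices)]


link_bw = expand_pattern("[XYZ][+-]_SAMPLE_GEMINI_LINK_BW (B/s)")
fs_open = expand_pattern("open#stats.snx1100[123]")

for name in link_bw + fs_open:
    print(name)
```

Under this assumption, each `[XYZ][+-]` entry corresponds to six columns (one per direction on each torus axis) and each `snx1100[123]` entry to three.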
Derived Dataset in Database
Same fields as the table above, but all values are per-second averages.
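
A minimal sketch of how a per-second average could be computed from a cumulative counter, assuming each `Rate_` value is the change in the raw counter between two consecutive samples divided by the elapsed time (`DT`). This is an assumption for illustration, not the documented derivation; the function name and the sample values are made up.

```python
def per_second_rates(times, counters):
    """Return (counter delta) / (time delta) for each consecutive pair of samples."""
    rates = []
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]  # seconds since the previous sample
        if dt <= 0:
            # Out-of-order or duplicate timestamp: no meaningful rate.
            rates.append(None)
        else:
            rates.append((counters[i] - counters[i - 1]) / dt)
    return rates


# Hypothetical samples: epoch seconds and a cumulative request counter.
times = [1577836800.0, 1577836860.0, 1577836920.0]
open_counts = [100, 160, 400]

print(per_second_rates(times, open_counts))  # [1.0, 4.0]
```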
Field | Type | Description | Included In Training Dataset | Reasoning For Inclusion (Or Not) |
---|---|---|---|---|
#Time | | Timestamp of when the data was captured | ||
Time_usec | | microsecond component | ||
DT | | Time since last datapoint | ||
DT_usec | | microsecond component | ||
CompId | ||||
nettopo_mesh_coord_[XYZ] (x 1.00e+00) nettopo_mesh_coord_X (x 1.00e+00) | ||||
[XYZ][+-]_SAMPLE_GEMINI_LINK_BW (B/s) (x 1.00e+00) X+_SAMPLE_GEMINI_LINK_BW (B/s) (x 1.00e+00) | ||||
[XYZ][+-]_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) (x 1.00e-06) X+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) (x 1.00e-06) | ||||
[XYZ][+-]_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) (x 1.00e+00) X+_SAMPLE_GEMINI_LINK_PACKETSIZE_AVE (B) (x 1.00e+00) | ||||
[XYZ][+-]_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) (x 1.00e-06) X+_SAMPLE_GEMINI_LINK_INQ_STALL (% x1e6) (x 1.00e-06) | ||||
[XYZ][+-]_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) (x 1.00e-06) X+_SAMPLE_GEMINI_LINK_CREDIT_STALL (% x1e6) (x 1.00e-06) | ||||
SAMPLE_totaloutput_optA (B/s) (x 1.00e+00) | ||||
SAMPLE_totalinput (B/s) (x 1.00e+00) | ||||
SAMPLE_fmaout (B/s) (x 1.00e+00) | ||||
SAMPLE_bteout_optA (B/s) (x 1.00e+00) | ||||
SAMPLE_bteout_optB (B/s) (x 1.00e+00) | ||||
SAMPLE_totaloutput_optB (B/s) (x 1.00e+00) | ||||
Rate_read_bytes#stats.snx11001 (x 1.00e+00) | ||||
Rate_write_bytes#stats.snx11001 (x 1.00e+00) | ||||
Rate_open#stats.snx11001 (x 1.00e+00) | ||||
Rate_close#stats.snx11001 (x 1.00e+00) | ||||
Rate_seek#stats.snx11001 (x 1.00e+00) | ||||
Rate_read_bytes#stats.snx11002 (x 1.00e+00) | ||||
Rate_write_bytes#stats.snx11002 (x 1.00e+00) | ||||
Rate_open#stats.snx11002 (x 1.00e+00) | ||||
Rate_close#stats.snx11002 (x 1.00e+00) | ||||
Rate_seek#stats.snx11002 (x 1.00e+00) | ||||
Rate_read_bytes#stats.snx11003 (x 1.00e+00) | ||||
Rate_write_bytes#stats.snx11003 (x 1.00e+00) | ||||
Rate_open#stats.snx11003 (x 1.00e+00) | ||||
Rate_close#stats.snx11003 (x 1.00e+00) | ||||
Rate_seek#stats.snx11003 (x 1.00e+00) | ||||
loadavg_latest(x100) (x 1.00e+00) | ||||
loadavg_5min(x100) (x 1.00e+00) | ||||
loadavg_running_processes (x 1.00e+00) | ||||
loadavg_total_processes (x 1.00e+00) | ||||
current_freemem (x 1.00e+00) | ||||
Rate_SMSG_ntx (x 1.00e+00) | ||||
Rate_SMSG_tx_bytes (x 1.00e+00) | ||||
Rate_SMSG_nrx (x 1.00e+00) | ||||
Rate_SMSG_rx_bytes (x 1.00e+00) | ||||
Rate_RDMA_ntx (x 1.00e+00) | ||||
Rate_RDMA_tx_bytes (x 1.00e+00) | ||||
Rate_RDMA_nrx (x 1.00e+00) | ||||
Rate_RDMA_rx_bytes (x 1.00e+00) | ||||
Rate_ipogif0_rx_bytes (x 1.00e+00) | ||||
Rate_ipogif0_tx_bytes (x 1.00e+00) | ||||
Tesla_K20X.gpu_util_rate (x 1.00e+00) | ||||
Tesla_K20X.gpu_memory_used (x 1.00e+00) | ||||
Tesla_K20X.gpu_temp (x 1.00e+00) | ||||
Tesla_K20X.gpu_pstate (x 1.00e+00) | ||||
Tesla_K20X.gpu_power_limit (x 1.00e+00) | ||||
Tesla_K20X.gpu_power_usage (x 1.00e+00) | ||||
Flag | | Is data valid | ||
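
Most column names in the derived table end with a suffix of the form `(x 1.00e+00)` or `(x 1.00e-06)`. The following is a minimal sketch for separating that suffix from the base field name, so derived columns can be matched against the field list in the first table. It only parses the suffix; what the factor means for the stored values is not stated on this page and is not assumed here. The function name `split_scale_suffix` is hypothetical.

```python
import re

# Matches a trailing "(x <number in e-notation>)" suffix, e.g. "(x 1.00e-06)".
_SCALE_SUFFIX = re.compile(
    r"^(?P<name>.*?)\s*\(x\s*(?P<scale>[0-9.]+e[+-]?[0-9]+)\)$"
)


def split_scale_suffix(column: str):
    """Return (base_name, scale), or (column, None) if there is no suffix."""
    column = column.strip()
    match = _SCALE_SUFFIX.match(column)
    if match is None:
        return column, None
    return match.group("name"), float(match.group("scale"))


for col in [
    "loadavg_5min(x100) (x 1.00e+00)",
    "X+_SAMPLE_GEMINI_LINK_USED_BW (% x1e6) (x 1.00e-06)",
    "Flag",
]:
    print(split_scale_suffix(col))
```

Units embedded in the base name, such as `(x100)` or `(% x1e6)`, are left untouched because they are part of the original field name.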