Differences

This shows you the differences between two versions of the page.

--- gpu_resources [2017/04/26 14:12] – llewis
+++ gpu_resources [2024/03/26 13:52] (current) – external edit 127.0.0.1
@@ Line 1: / Line 1: @@
 ====== GPU Resources ======
+This is a collaborative resource; please improve it. Log in using your MCIN user name and ID and add your discoveries.
+===== Items of Interest / for Discussion? =====
+==== Resources ====
+* [ OpenACC - Tutorial - Steps to More Science ]( https://developer.nvidia.com/openacc/3-steps-to-more-science )
+"Here are three simple steps to start accelerating your code with GPUs. We will be using PGI OpenACC compiler for C, C++, FORTRAN, along with tools from the PGI Community Edition."
+* [ Performance Portability from GPUs to CPUs with OpenACC ](https://devblogs.nvidia.com/parallelforall/performance-portability-gpus-cpus-openacc/)
+* [ Data Center Management Tools ]( http://www.nvidia.com/object/data-center-managment-tools.html )
+    * The GPU Deployment Kit
+    * Ganglia
+    * Slurm
+    * NVIDIA Docker
+    * Others???
+"...performance on multicore CPUs for HPC apps using MPI + OpenACC is equivalent to MPI + OpenMP code. Compiling and running the same code on a Tesla K80 GPU can provide large speedups."
+===== Preventing Job Clobbering =====
+There are currently 3 GPUs in ace-gpu-1. To select one of the three (0, 1, 2), set the CUDA_VISIBLE_DEVICES environment variable. You can do this by adding the following line to your ~/.bash_profile file on ace-gpu-1, where X is 0, 1 or 2:
+<code>
+export CUDA_VISIBLE_DEVICES=X
+</code>
+This only takes effect at login, so log out and back in, then run the following to confirm that it worked:
+<code>
+echo $CUDA_VISIBLE_DEVICES
+</code>
+If it outputs the ID that you selected, you're ready to use the GPU.
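The same check can be made from inside a Python job before any GPU work starts. The following is a minimal sketch, assuming the three GPU IDs listed above; `selected_gpu` and `VALID_GPU_IDS` are hypothetical names used for illustration, not part of any library.

```python
import os

# Hypothetical helper (not from any library): the GPU IDs on ace-gpu-1.
VALID_GPU_IDS = ("0", "1", "2")


def selected_gpu(environ=os.environ):
    """Return the selected GPU ID as an int, or raise if it is unset or invalid."""
    value = environ.get("CUDA_VISIBLE_DEVICES", "").strip()
    if value not in VALID_GPU_IDS:
        raise RuntimeError(
            "CUDA_VISIBLE_DEVICES must be 0, 1 or 2, got %r" % value
        )
    return int(value)
```

Calling `selected_gpu()` at the top of a script makes the job stop early with a clear message when the variable was never set, instead of starting on whichever GPUs happen to be visible.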
+==== Sharing a single GPU ====
+To configure TensorFlow to not pre-allocate all GPU memory you can use the following Python code:
+<code>
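+import tensorflow as tf  # TensorFlow 1.x API
+from keras import backend as K  # assumes the standalone Keras package
+
+# configures TensorFlow to not try to grab all the GPU memory
+config = tf.ConfigProto(allow_soft_placement=True)
+config.gpu_options.allow_growth = True
+session = tf.Session(config=config)
+K.set_session(session)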
+</code>
+This has been found to work only to a certain extent: when several jobs each use a significant amount of the GPU's resources, jobs can still be ruined even when using the above code.
 ===== GPU Info =====
@@ Line 31: / Line 84: @@
 nsight
 </code>
+NVIDIA Visual Profiler (https://developer.nvidia.com/nvidia-visual-profiler) would be useful for GPU monitoring, but it needs an X display, which we do not have:
+<code>
+/usr/local/cuda/bin/nvvp
+</code>
 ===== GPU Accounting =====
@@ Line 44: / Line 103: @@
 </code>
+Output example:
+<code>
+==============NVSMI LOG==============
+Timestamp                           : Thu Apr 27 09:09:50 2017
+Driver Version                      : 375.39
+Attached GPUs                       : 1
+GPU 0000:01:00.0
+    Accounting Mode                 : Enabled
+    Accounting Mode Buffer Size     : 1920
+    Accounted Processes
+        Process ID                  : 15819
+            GPU Utilization         : 100 %
+            Memory Utilization      : 6 %
+            Max memory usage        : 187 MiB
+            Time                    : 3769 ms
+            Is Running              : 0
+...
+</code>
 Users: to check GPU stats per process:
 <code>
 nvidia-smi -i 0 --query-accounted-apps=gpu_name,pid,gpu_util,max_memory_usage,time --format=csv
+</code>
+Output example:
+<code>
+gpu_name, pid, gpu_utilization [%], max_memory_usage [MiB], time [ms]
+TITAN X (Pascal), 15819, 100 %, 187 MiB, 3769 ms
+TITAN X (Pascal), 15633, 87 %, 8465 MiB, 200626 ms
+TITAN X (Pascal), 15944, 0 %, 153 MiB, 382 ms
+TITAN X (Pascal), 16000, 0 %, 155 MiB, 299 ms
+TITAN X (Pascal), 15862, 80 %, 8465 MiB, 215039 ms
+TITAN X (Pascal), 15842, 41 %, 425 MiB, 721223 ms
+TITAN X (Pascal), 16294, 74 %, 8465 MiB, 231517 ms
+TITAN X (Pascal), 16436, 70 %, 10425 MiB, 229470 ms
+TITAN X (Pascal), 16118, 40 %, 155 MiB, 1310156 ms
+TITAN X (Pascal), 16908, 72 %, 8465 MiB, 511122 ms
+TITAN X (Pascal), 17102, 73 %, 8465 MiB, 833806 ms
+TITAN X (Pascal), 17900, 0 %, 153 MiB, 358 ms
+TITAN X (Pascal), 18018, 0 %, 153 MiB, 235 ms
+TITAN X (Pascal), 17632, 75 %, 8465 MiB, 823193 ms
+TITAN X (Pascal), 18376, 74 %, 8529 MiB, 827336 ms
+TITAN X (Pascal), 18637, 74 %, 8465 MiB, 547161 ms
+TITAN X (Pascal), 16377, 54 %, 153 MiB, 0 ms
+TITAN X (Pascal), 18752, 55 %, 8465 MiB, 0 ms
 </code>
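To find which accounted process used the most memory, the CSV output above can be parsed with a short script. This is only a sketch: `parse_accounted_apps` is a hypothetical helper, the sample rows are copied from the output example, and it assumes every field is numeric as it is there.

```python
import csv
import io


def parse_accounted_apps(text):
    """Parse the CSV output shown above into a list of dicts with numeric fields."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(reader)  # skip the header row
    rows = []
    for fields in reader:
        if len(fields) != 5:
            continue  # ignore blank or malformed lines
        name, pid, util, mem, time_ms = fields
        rows.append({
            "gpu_name": name,
            "pid": int(pid),
            "gpu_util_pct": int(util.split()[0]),      # "100 %" -> 100
            "max_memory_mib": int(mem.split()[0]),     # "187 MiB" -> 187
            "time_ms": int(time_ms.split()[0]),        # "3769 ms" -> 3769
        })
    return rows


# Three rows copied from the output example above.
SAMPLE = """gpu_name, pid, gpu_utilization [%], max_memory_usage [MiB], time [ms]
TITAN X (Pascal), 15819, 100 %, 187 MiB, 3769 ms
TITAN X (Pascal), 15633, 87 %, 8465 MiB, 200626 ms
TITAN X (Pascal), 16436, 70 %, 10425 MiB, 229470 ms
"""

apps = parse_accounted_apps(SAMPLE)
heaviest = max(apps, key=lambda app: app["max_memory_mib"])
print(heaviest["pid"], heaviest["max_memory_mib"])  # prints: 16436 10425
```

On ace-gpu-1 the sample text would be replaced by the real output of the `nvidia-smi` query above, for example by redirecting it to a file and reading that file.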
@@ Line 54: / Line 158: @@
 </code>
+==== nvidia-smi flags used ====
+<code>
+    -i,   --id=                 Target a specific GPU.
+    -am   --accounting-mode=    Enable or disable Accounting Mode: 0/DISABLED, 1/ENABLED
+    -q,   --query               Display GPU or Unit info.
+    -d,   --display=            Display only selected information: MEMORY,
+                                    UTILIZATION, ECC, TEMPERATURE, POWER, CLOCK,
+                                    COMPUTE, PIDS, PERFORMANCE, SUPPORTED_CLOCKS,
+                                    PAGE_RETIREMENT, ACCOUNTING.
+                                Flags can be combined with comma e.g. ECC,POWER.
+                                Sampling data with max/min/avg is also returned
+                                for POWER, UTILIZATION and CLOCK display types.
+                                Doesn't work with -u or -x flags.
+</code>
+* [[http://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-mode]]
+* [[http://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon]]
 ===== Deep Learning =====