1 files changed, 86 insertions, 0 deletions
diff --git a/_posts/2024-10-03-python-woes-tensorflow.md b/_posts/2024-10-03-python-woes-tensorflow.md
new file mode 100644
index 0000000..000ce98
--- /dev/null
+++ b/_posts/2024-10-03-python-woes-tensorflow.md
@@ -0,0 +1,86 @@
+---
+layout: post
+title: "Python Woes (TensorFlow)"
+date: 2024-10-03 18:31 +0200
+lang: en
+categories: ["tech"]
+---
+
+In the last few days, I was experimenting with
+[ZenithO-o/fursuit-detection](https://github.com/ZenithO-o/fursuit-detection), after sorting some photos from
+a fursuit walk. I couldn't get it to work/run, no matter how hard I tried, with
+the system packages. (Debian stable is somewhat old, granted, which leads to
+some problems).
+
+So since venv wouldn't do its job, I thought about (ana|mini)conda again, only
+to find out that there's now also miniforge. And apparently a project with a
+faster dependency resolver, mamba, split up. Its dependency resolver has
+already been re-integrated into conda (?), and then there's micromamba, which is
+a standalone executable compiled from C++ (?)… I already wasted some hours in
+that rabbit hole. And apparently you can't use any of it without putting some
+setup code into your `.bashrc`; "activate only when needed" doesn't seem to be a
+supported use case? (I didn't want to put even more time into this, but yes,
+looking at what's inside `.bashrc`, I could simply do this manually…)
+
+Anyway, the next step was searching for the required packages on conda-forge. I
+found some very outdated guides on the internet, which installed some things
+manually. I simply went with `micromamba install tensorflow-gpu` - and hey, it
+worked! Or so I thought…
+
+Running the `run_on_images` script gave me a
+
+```
+tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
+
+Detected at node MultiscaleGridAnchorGenerator/GridAnchorGenerator/mul_3 defined at (most recent call last):
+<stack traces unavailable>
+Detected at node MultiscaleGridAnchorGenerator/GridAnchorGenerator/mul_3 defined at (most recent call last):
+<stack traces unavailable>
+2 root error(s) found.
+  (0) INTERNAL:  'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
+	 [[ { {node MultiscaleGridAnchorGenerator/GridAnchorGenerator/mul_3 } } ]]
+	 [[StatefulPartitionedCall/map/while/loop_body_control/_430/_23]]
+  (1) INTERNAL:  'cuLaunchKernel(function, gridX, gridY, gridZ, blockX, blockY, blockZ, 0, reinterpret_cast<CUstream>(stream), params, nullptr)' failed with 'CUDA_ERROR_INVALID_HANDLE'
+	 [[ { {node MultiscaleGridAnchorGenerator/GridAnchorGenerator/mul_3 } } ]]
+0 successful operations.
+0 derived errors ignored. [Op:__inference_restored_function_body_41075]
+```
+
+A very useless error message. I did a lot of fruitless internet searches. I then
+noticed that the author of the script had limited logging, so I removed that
+line. With that, I suddenly got a more promising
+
+```
+2024-10-03 16:04:42.816559: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleLoadData(&module, data)' failed with 'CUDA_ERROR_NO_BINARY_FOR_GPU'
+2024-10-03 16:04:42.816603: W tensorflow/compiler/mlir/tools/kernel_gen/tf_gpu_runtime_wrappers.cc:40] 'cuModuleGetFunction(&function, module, kernel_name)' failed with 'CUDA_ERROR_INVALID_HANDLE'
+```
+
+That didn't help me much by itself either. But then I spotted this at the
+beginning of the script's output:
+
+```
+2024-10-03 16:04:36.034469: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2432] TensorFlow was not built with CUDA kernel binaries compatible with compute capability 5.2. CUDA kernels will be jit-compiled from PTX, which could take 30 minutes or longer.
+```
+
+So… my GPU was too old. And apparently *something* went wrong with the JIT
+compilation. So I installed older `tensorflow-gpu` packages step by step until
+I arrived at `tensorflow-gpu~=2.14.0`. Mind you, this whole process took half a
+day. And even then, I wasn't spared:
+
+```
+2024-10-03 18:16:03.567311: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:559] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
+error: libdevice not found at ./libdevice.10.bc
+2024-10-03 18:16:03.568777: W tensorflow/core/framework/op_kernel.cc:1827] UNKNOWN: JIT compilation failed.
+2024-10-03 18:16:03.568846: W tensorflow/core/framework/op_kernel.cc:1827] UNKNOWN: JIT compilation failed.
+```
+
+Luckily, for that one, I found a solution pretty quickly: You have to copy a
+file to a subdirectory in the execution directory:
+
+```
+mkdir -p ./cuda_sdk_lib/nvvm/libdevice/ && cp ${PUT_ENV_PATH_HERE}/lib/libdevice.10.bc ./cuda_sdk_lib/nvvm/libdevice/
+```
+
+And then, FINALLY!!!, this sh** works. Half a day wasted, and a lot of angry
+shouts were emitted.
+