Sunday, September 8, 2024

Posit AI Blog: luz 0.4.0



A new version of luz is now available on CRAN. luz is a high-level interface for torch. It aims to reduce the boilerplate code necessary to train torch models while remaining as flexible as possible, so you can adapt it to run all kinds of deep learning models.

If you want to get started with luz, we recommend reading the previous release blog post as well as the ‘Training with luz’ chapter of the ‘Deep Learning and Scientific Computing with R torch’ book.

This release adds numerous smaller features; you can check the full changelog here. In this blog post, we highlight the features we are most excited about.

Support for Apple Silicon

Since torch v0.9.0, it’s possible to run computations on the GPU of Macs equipped with Apple Silicon. Until now, however, luz did not make use of the GPU automatically and ran models on the CPU instead.

Starting from this release, luz will automatically use the ‘mps’ device when running models on Apple Silicon computers, and thus let you benefit from the speedups of running models on the GPU.

To get an idea, running a simple CNN model on MNIST from this example for one epoch on an Apple M1 Pro chip takes about 24 seconds when using the GPU:

  user  system elapsed 
19.793   1.463  24.231 

On the CPU, the same epoch takes about 60 seconds:

  user  system elapsed 
83.783  40.196  60.253 

That is a nice speedup of roughly 2.5x in elapsed time!

Note that this feature is still somewhat experimental, and not every torch operation is supported on MPS. You will likely see a warning message explaining that torch needs to use the CPU fallback for some operator:

[W MPSFallback.mm:11] Warning: The operator 'at:****' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (function operator())
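
If you want to check which device torch can see on your machine, something like the following sketch may help. The helper `pick_device()` is hypothetical: it is not part of luz or torch, and luz's own device selection may differ in its details. It relies only on `cuda_is_available()` and `backends_mps_is_available()` from torch.

```r
# Hypothetical helper for illustration (not part of luz or torch):
# returns the name of the fastest device torch reports as available.
pick_device <- function() {
  if (torch::cuda_is_available()) {
    "cuda"
  } else if (torch::backends_mps_is_available()) {
    "mps"   # GPU on Apple Silicon Macs
  } else {
    "cpu"
  }
}

# Only run the demo when torch and its backend libraries are installed.
if (requireNamespace("torch", quietly = TRUE) && torch::torch_is_installed()) {
  device <- torch::torch_device(pick_device())
  x <- torch::torch_randn(2, 3, device = device)
  print(x$device)
}
```

If the fallback warnings become a problem for a particular model, one option is to ask luz to stay on the CPU by passing `accelerator = accelerator(cpu = TRUE)` to `fit()`.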

Checkpointing

The checkpointing functionality has been refactored in luz, and
it’s now easier to restart training runs if they crash for some
unexpected reason. All that’s needed is to add a resume callback
when training the model:

# ... model definition omitted
# ...
# ...
resume <- luz_callback_resume_from_checkpoint(path = "checkpoints/")

results <- model %>% fit(
  list(x, y),
  callbacks = list(resume),
  verbose = FALSE
)

It’s also easier now to save the model state at every epoch, or only when the model has obtained better validation results.
Learn more in the ‘Checkpointing’ article.
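
As a minimal sketch of those two saving strategies, the hypothetical wrapper `checkpoint_callback()` below builds a `luz_callback_model_checkpoint()` either for every epoch or for improvements only. The wrapper name, the default path, and the choice of monitored metrics are assumptions made for illustration; see the ‘Checkpointing’ article for the authoritative usage.

```r
# Hypothetical wrapper for illustration; only luz_callback_model_checkpoint()
# itself comes from luz.
checkpoint_callback <- function(best_only = FALSE, path = "checkpoints/") {
  luz::luz_callback_model_checkpoint(
    path = path,
    # Assumption: watch the validation loss when saving improvements only
    # (this requires validation data), otherwise the training loss.
    monitor = if (best_only) "valid_loss" else "train_loss",
    # FALSE: save after every epoch; TRUE: save only when `monitor` improves.
    save_best_only = best_only
  )
}

# Usage (model, x and y as in the example above):
# results <- model %>% fit(
#   list(x, y),
#   callbacks = list(checkpoint_callback(best_only = FALSE))
# )
```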

Bug fixes

This release also includes a few small bug fixes, such as respecting an explicit request to use the CPU (even when a faster device is available) and making the metrics environments more consistent.

There’s one bug fix, though, that we would like to highlight in particular. We found that the algorithm we were using to accumulate the loss during training had exponential complexity; thus, if you had many steps per epoch during model training, luz would be very slow.

For instance, with a dummy model running for 500 steps, luz would take 61 seconds for one epoch:

Epoch 1/1
Train metrics: Loss: 1.389                                                                
   user  system elapsed 
 35.533   8.686  61.201 

The same model with the bug fixed now takes 5 seconds:

Epoch 1/1
Train metrics: Loss: 1.2499                                                                                             
   user  system elapsed 
  4.801   0.469   5.209

This bug fix results in a speedup of more than 10x for this model (about 12x in elapsed time, from 61.2 to 5.2 seconds). However, the speedup will vary depending on the model type. Models that are faster per batch and have more iterations per epoch will benefit more from this fix.
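
To see why the cost of loss accumulation can depend so strongly on the number of steps, consider the following sketch. It is not luz's actual code, and the slow variant shown here is only quadratic rather than the complexity described above; both function names are made up for illustration. The slow variant keeps every per-step loss and recomputes the mean at each step, while the fast variant carries only a running sum.

```r
# Slow pattern (illustrative): step k copies and rescans k values,
# so n steps cost on the order of n^2 operations in total.
running_mean_naive <- function(losses) {
  history <- numeric(0)
  out <- numeric(length(losses))
  for (i in seq_along(losses)) {
    history <- c(history, losses[[i]])  # copies the whole vector
    out[[i]] <- mean(history)           # rescans the whole vector
  }
  out
}

# Fast pattern: constant work per step, using a running sum.
running_mean_fast <- function(losses) {
  total <- 0
  out <- numeric(length(losses))
  for (i in seq_along(losses)) {
    total <- total + losses[[i]]
    out[[i]] <- total / i
  }
  out
}

set.seed(1)
losses <- runif(500)  # stand-in for 500 per-step loss values

# Both variants report the same running loss; only the cost differs.
stopifnot(isTRUE(all.equal(running_mean_naive(losses),
                           running_mean_fast(losses))))
```

With the per-step bookkeeping this cheap, the time per epoch is dominated by the model itself, which is why fast models with many steps per epoch gain the most.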

Thank you very much for reading this blog post. As always, we welcome every contribution to the torch ecosystem. Feel free to open issues to suggest new features, improve documentation, or extend the code base.

Last week, we announced the torch v0.10.0 release – here’s a link to the release blog post, in case you missed it.

Photo by Peter John Maridable on Unsplash

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.

Citation

For attribution, please cite this work as

Falbel (2023, April 17). Posit AI Blog: luz 0.4.0. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2023-04-17-luz-0-4/

BibTeX citation

@misc{luz-0-4,
  author = {Falbel, Daniel},
  title = {Posit AI Blog: luz 0.4.0},
  url = {https://blogs.rstudio.com/tensorflow/posts/2023-04-17-luz-0-4/},
  year = {2023}
}
