A while back I messed up my RStudio installation by installing a development version of a package from GitHub, which upgraded some other packages, and before I knew it I was getting errors in existing code that weren’t there before. Around that time I had just gotten into containerisation with Docker, so I decided to try running RStudio from a Docker container. If you follow along, you’ll have an RStudio server running in a local container that you can access from your browser. It comes with Python and Keras, so you can do deep learning in RStudio too!
What is a container?
Docker is software to create and run containers. A container in this context is an isolated environment in which you can install and run software, so it does not interfere with anything installed locally on your machine. You can save state, so you can continue working in your container later without problems. A container can be shared with others and backed up. With Docker you create a container using a Dockerfile: a plain text file that defines your container, such as what operating system to use, what software to install, which R packages, and so on. The good thing is that this file can be shared and checked into git very easily. Docker creates an image from this file, and the image is then instantiated into new containers, as many as you need.
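The pipeline described above, Dockerfile to image and image to containers, can be sketched in a few commands. This is a dry-run sketch: the commands are collected and printed rather than executed (drop the `echo`/`printf` step and run them directly when you have Docker installed), and `myimage`, `c1`, and `c2` are placeholder names, not from this post.

```shell
#!/bin/sh
# Dockerfile -> image: build an image from the Dockerfile in the current folder.
build_cmd="docker build . -t myimage"
# Image -> containers: instantiate the same image as many times as you need.
run_cmd_1="docker run -d --name c1 myimage"
run_cmd_2="docker run -d --name c2 myimage"
printf '%s\n' "$build_cmd" "$run_cmd_1" "$run_cmd_2"
```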
Why run RStudio in a container?
For me, a major advantage is that I have a default installation of RStudio to fall back on. I can install packages or other stuff in my container just to try things out; then, by spinning up a new container from my RStudio template Dockerfile, I am back to my clean install. If a package is useful and I want to add it to my template, I simply update my Dockerfile and rebuild the image. An additional bonus is that RStudio runs in your browser, which means I have all my other resources close by. I usually have a browser window with all coding-related pages open in its tabs.
Certain machine learning packages also seem to automatically run in parallel over multiple cores, while the same code does not do so on my colleagues’ machines. This might have something to do with the fact that the Docker version runs the Linux builds of R, RStudio, and the packages, while our team works on Mac laptops.
Note that in general, everything will probably run a bit slower than a native installation on your own machine. Depending on what you want to do, this could be a problem. A bonus though (at least on Mac) is that with Docker Desktop you can choose how many cores and how much memory to assign. In this way you can limit the system resources and still have a workable machine for other tasks while your algorithm is training…
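Besides the Docker Desktop preferences, `docker run` itself accepts per-container limits through its `--cpus` and `--memory` flags. A dry-run sketch: the command is printed rather than executed so you can inspect it first, the values are examples, and `rstudio-dev` is the image built later in this post.

```shell
#!/bin/sh
# Cap a container at 4 CPUs and 8 GB of RAM (example values).
# The command is printed, not executed; drop the echo to run it for real.
limit_flags="--cpus=4 --memory=8g"
echo docker run --rm -d $limit_flags -p 127.0.0.1:9999:8787 rstudio-dev
```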
Going through my Dockerfile
In the code block below you will find the Dockerfile that I use to build my RStudio image. It is slightly cleaned up, with fewer packages etc., to make the overall structure easier to see. Let’s have a look!
The first step in a Dockerfile is to set up your base image. Because Docker images are layered, you can easily use an existing image as your base and go from there. For my RStudio container I use the verse image from the rocker project. It includes R, RStudio, the tidyverse, and TeX plus publishing-related packages. You can find the exact specifications here: https://hub.docker.com/r/rocker/verse. I specifically chose the Ubuntu version, even though its R version is slightly older, because on it packages can be installed in binary format. This means that you won’t have to wait endlessly for each package to be built from source.
The next step is to install system packages. Since this image is based on Ubuntu you can use `apt-get` to install things, which is great because it automatically installs the required dependencies for you. If I am totally honest, I can’t recall exactly what each system package listed below is for. Usually I have an R package that needs some system software to work, which leads me to update the Dockerfile, after which I might forget what it was for again. As you can see, the `RUN` command just runs system code, the same as from your terminal/bash.
```dockerfile
RUN apt-get update && apt-get install -y \
    libv8-dev \
    libudunits2-dev \
    liblzma-dev \
    libbz2-dev \
    libmariadb-dev \
    git-all \
    python3-dev \
    python3-pip
```
The v8 lib is needed for the V8 package in R, and libmariadb is to create connections with a MySQL database using the RMariaDB package in R. The others are, I think, for working with all sorts of zip files. Don’t forget to install Git and Python. For me these were the libs that worked, but if you have better suggestions, let me know!
Next up is the installation of some Python packages. I have TensorFlow and Keras installed for deep learning related things, and spaCy for natural language processing. The latter also needs to download language models, English and Dutch in my case. While this may not be useful for you, it nicely illustrates how you can basically run arbitrary Python code when creating an image. Use it wisely.
```dockerfile
RUN pip3 install -U virtualenv && \
    pip3 install -U spacy && \
    pip3 install -U tensorflow && \
    pip3 install -U keras
RUN python3 -m spacy download nl && python3 -m spacy download en
```
The last step in the image creation process is to install any R packages you know you will want to have. If you have very different projects, you can make project-specific images. I have a tendency to just put everything in one container, giving me a very long list of packages that I sometimes have to prune to keep it in check. Just like with Python, this is just running arbitrary R code, in this case an `install.packages()` call.
```dockerfile
RUN Rscript -e "install.packages(c('rjson', 'shinyjs', 'timeDate', 'V8', \
    'glmnet', 'coda', 'rlist', 'xgboost', 'pryr', 'tictoc', 'future', \
    'doFuture', 'furrr', 'rsdmx', 'forecast', 'ggraph', 'DBI', 'RMariaDB', \
    'pROC', 'glue', 'plotly', 'Matrix', 'cleanNLP', 'keras', 'dbplyr', \
    'httr', 'rvest', 'tidytext', 'tidymodels', 'broom.mixed', 'skimr', \
    'nycflights13', 'modeldata', 'vip'))"
```
Here is the whole file, so you can easily copy it and customize it to make it your own.
```dockerfile
FROM rocker/verse:4.0.0-ubuntu18.04

RUN apt-get update && apt-get install -y \
    libv8-dev \
    libudunits2-dev \
    liblzma-dev \
    libbz2-dev \
    libmariadb-dev \
    git-all \
    python3-dev \
    python3-pip

RUN pip3 install -U virtualenv && \
    pip3 install -U spacy && \
    pip3 install -U tensorflow && \
    pip3 install -U keras

RUN python3 -m spacy download nl && python3 -m spacy download en

RUN Rscript -e "install.packages(c('rjson', 'shinyjs', 'timeDate', 'V8', \
    'glmnet', 'coda', 'rlist', 'xgboost', 'pryr', 'tictoc', 'future', \
    'doFuture', 'furrr', 'rsdmx', 'forecast', 'ggraph', 'DBI', 'RMariaDB', \
    'pROC', 'glue', 'plotly', 'Matrix', 'cleanNLP', 'keras', 'dbplyr', \
    'httr', 'rvest', 'tidytext', 'tidymodels', 'broom.mixed', 'skimr', \
    'nycflights13', 'modeldata', 'vip'))"
```
Create the image
To have Docker create an image from this Dockerfile, you need to install Docker on your system (obviously…). You can find it here: https://docs.docker.com/get-docker/. Save the file above with the filename `Dockerfile`, without any extension. Then open a Terminal/Bash session, navigate to the folder containing the Dockerfile, and use the following command to build the image:
```bash
docker build . -t rstudio-dev
```
`rstudio-dev` is the tag you assign to this image so you can find it more easily later on. Feel free to name it whatever you want. If you are on Linux or Mac, you might need to use `sudo` to get this to work.
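One optional refinement, not from the original post: tag each build with the date as well as a moving `latest` tag, so you can roll back to an earlier image if an upgrade breaks something. Sketched as a dry run (the build command is printed, not executed):

```shell
#!/bin/sh
# Tag the build with today's date plus a moving 'latest' tag.
tag="rstudio-dev:$(date +%Y-%m-%d)"
# Drop the echo to actually build.
echo docker build . -t "$tag" -t rstudio-dev:latest
```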
Creating containers from this image
Now that you have an image you can start creating containers from it. This can be done from the Terminal/Bash as well. Run the following command to start a new RStudio container:
```bash
docker run --rm -d \
  -p 127.0.0.1:9999:8787 \
  -v $(pwd):/home/rstudio \
  -e ROOT=TRUE -e PASSWORD=123456 \
  --name rstudio9999 \
  rstudio-dev
```
Let’s unpack this rather long statement:
- `--rm` denotes that the container will be removed after you stop it, which I find helpful.
- `-d` denotes that this container will run detached from any Terminal session, so you can use your Terminal session for other things 🙂
- `-p 127.0.0.1:9999:8787` denotes what port you want to run this server on. Port 8787 is the port inside the container that RStudio is served on; this traffic is routed to your local port 9999. If you want to run multiple RStudio containers, you need to assign a different local port to each of them.
- `-v $(pwd):/home/rstudio` creates a mapping between the folder `/home/rstudio` inside the container and the working directory on your own file system (this works on Mac and Linux, but not on Windows). You can put the directory containing your R projects there; everything in that folder will be available in `/home/rstudio` when you run RStudio. This is very helpful, because the isolation of containers means you otherwise cannot access your system files from within a container in a regular fashion.
- `-e ROOT=TRUE -e PASSWORD=123456` sets two environment variables for the RStudio container so you have root access when running code from the Terminal tab in RStudio. This is helpful because that way you can temporarily install system packages to try stuff out.
- `--name rstudio9999` gives your container a name so you can find it in the list of running containers.
- `rstudio-dev` is the name of the image you created earlier.
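To run several containers side by side, only the port and the name need to change. Here is a small sketch that assembles the command per port; it prints the command rather than executing it, so you can inspect it first (pipe the output to `sh`, or drop the `echo`, to actually start the containers). `start_rstudio` is a hypothetical helper name, not part of Docker.

```shell
#!/bin/sh
# Assemble the docker run command for a given local port.
start_rstudio() {
  port="$1"
  echo docker run --rm -d \
    -p "127.0.0.1:${port}:8787" \
    -v "$(pwd):/home/rstudio" \
    -e ROOT=TRUE -e PASSWORD=123456 \
    --name "rstudio${port}" \
    rstudio-dev
}

start_rstudio 9999
start_rstudio 8888
```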
If you run this command, you should see your container in the list of running containers when you run `docker ps` from your Terminal.
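A few other standard Docker CLI commands are handy for housekeeping around a running container; these are not covered in this post, but they are plain Docker functionality. Again collected as a dry run (the commands are printed, not executed):

```shell
#!/bin/sh
# Housekeeping around the container started above.
list_cmd="docker ps"                          # list running containers
stop_cmd="docker stop rstudio9999"            # stop it; --rm then removes it too
shell_cmd="docker exec -it rstudio9999 bash"  # open a shell inside it
printf '%s\n' "$list_cmd" "$stop_cmd" "$shell_cmd"
```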
Give it a try!
Open your browser and go to http://localhost:9999/, replacing 9999 with another port number if you chose differently when starting the container. You should see the RStudio intro screen.
Log in with username `rstudio` and the password you provided when starting the container (`123456` in my case). As this is running locally, there is no real security issue. If you don’t need root access, you can leave out the password when starting the container, which will skip the login screen.
Hopefully this is of some use to you. If you encounter something that is not working, or you have a helpful addition, let me know in the comments below. I am not a Docker expert, so I’d be happy to improve this based on your contributions!