Compromising an Entire Julia Cluster

/ 05.18.16 / Brian Wallace

Vulnerabilities have been discovered in the Julia programming language that allow a remote attacker to take complete control of an entire Julia cluster, provided they have network access to at least one node. Distributed systems are enticing targets because compromising one commonly yields large amounts of computing power, the credentials available to the compute cluster, and the data being computed over, which in many cases is of great value.

The vulnerabilities have been mitigated in the most recent versions of Julia and patches have been back-ported to reduce the attack surface of earlier versions of Julia. The Julia developers had an excellent response to the disclosure and should be commended.

Julia Clusters and Machine Learning

Figure 1: The Julia REPL loading

Recently, I have been getting more and more into the world of machine learning. It’s a powerful tool to have available, but it can be quite difficult to use. Julia, a programming language, is geared towards mathematical computing such as machine learning. It allows data scientists to manipulate data with ease and at reasonably high speeds, as the language is compiled just in time through LLVM.

I started working with Julia as I was drawn to a specific feature: the ability to launch Julia clusters. Julia clusters allow for nearly transparent execution of Julia code on multiple cores across multiple networked computers, as long as the computers can communicate with each other over SSH (or other supported protocols). This keeps the development overhead low when converting a project to work on larger scales of data in parallel, which is often needed for machine learning.

In order to start Julia instances on computers in the cluster, the ‘master’ (the initial Julia instance) connects to all the remote workers via SSH and runs Julia with specific command line options. When these workers start, they listen on TCP port 9009 by default to receive commands. While debugging an issue with my Julia cluster setup, I noticed that this TCP port is bound to all interfaces, allowing any remote host to connect. At this point, my brain kicked into security research mode, and so began the bug hunting.
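The difference between a listener bound to all interfaces and one bound only to loopback can be sketched in a few lines of Python. This is a minimal illustration, not Julia's code: the helper names `is_wildcard_bind` and `open_listener` are made up for this example, and the listener asks the OS for a free port rather than using 9009.

```python
import ipaddress
import socket


def is_wildcard_bind(host: str) -> bool:
    """Return True if `host` is the unspecified address (0.0.0.0 or ::),
    meaning a listener bound to it accepts connections on every interface."""
    return ipaddress.ip_address(host).is_unspecified


def open_listener(host: str) -> socket.socket:
    """Open an IPv4 TCP listener on `host`, letting the OS pick a free port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, 0))
    listener.listen(1)
    return listener


if __name__ == "__main__":
    # A loopback-only listener is reachable only from the local machine.
    local_only = open_listener("127.0.0.1")
    bound_host, bound_port = local_only.getsockname()
    print(bound_host, bound_port, is_wildcard_bind(bound_host))
    local_only.close()
```

A service that only needs to talk to processes on the same machine can bind to `127.0.0.1`; binding to `0.0.0.0` is what exposes it to every network the host is attached to.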

In further testing, I observed that this port was opened to the world even when running multiple local instances of Julia on the same machine as the master. If this port provided any exploitable attack surface, it would leave Julia clusters extremely vulnerable to remote attackers. At this point, I started looking at the code behind this port, and since this portion is written in Julia, it allowed me to continue learning the language. My hopes of finding an exploitable attack surface skyrocketed as I found that data read from this socket was not authenticated at all and was instead passed straight to a function called ‘deserialize’.

Resisting the urge to delve deeper into the inner workings of the ‘deserialize’ function, I started to look at the actual protocol, because the point of said protocol is to remotely request that code be run. It doesn’t seem like much of a stretch to figure out that this behavior could be abused in some way. Given that you've read the title of this blog post, you already know what happens next.

Remote Code Execution

After conducting some testing with a Julia cluster and the useful PCAP library, I found what I was looking for. The ‘CallMsg’ request allows for the execution of a function defined in the message with arguments also supplied in the message. The CallMsg request is designed as follows:

 type CallMsg{Mode} <: AbstractMsg
     f::Function
     args::Tuple
     kwargs::Array
     response_oid::Tuple
 end
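To make the shape of this message concrete, here is a rough Python analogy. It is not Julia's implementation: the `handle_call` receiver and the `results` table are hypothetical, and the reading of `response_oid` as "where the result is stored" is an assumption made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Tuple


@dataclass
class CallMsg:
    """Python stand-in for the fields of the Julia CallMsg type (analogy only)."""
    f: Callable[..., Any]                                  # function the receiver is asked to run
    args: Tuple[Any, ...] = ()                             # positional arguments
    kwargs: Dict[str, Any] = field(default_factory=dict)   # keyword arguments
    response_oid: Tuple[int, int] = (0, 0)                 # assumed: key for storing the result


def handle_call(msg: CallMsg, results: Dict[Tuple[int, int], Any]) -> None:
    """Hypothetical receiver: call whatever the message carries and store the result."""
    results[msg.response_oid] = msg.f(*msg.args, **msg.kwargs)


results: Dict[Tuple[int, int], Any] = {}
handle_call(CallMsg(f=lambda a, b: a + b, args=(2, 3), response_oid=(1, 7)), results)
print(results)
```

The point of the analogy is that the receiver never chooses which function runs. Whoever builds the message does, which is why an unauthenticated channel carrying such messages is so dangerous.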

The ‘f’ member of the CallMsg type is a function definition. If we make sure this function is an anonymous function, we can send it over the wire serialized, and have it be defined as any code we wish, instead of already defined code on the target worker. In Julia, we can execute command line commands with the following code:

 run(`command`)

In order to avoid reimplementing the serialization method in another language for a proof of concept, I used the following Julia command to request that a worker execute a command while sniffing my network traffic, and identified how I could simply modify the packet to change the payload:

 remotecall(2, ()->run(`whoami`), ())

After accounting for a few values that change from packet to packet, I found that this could be done reliably from cluster to cluster with the following proof of concept written in Python 2.7:

https://gist.github.com/bwall/339c43f02669f3086709e041813171ed

Okay, so we have RCE on a worker. Good for us, right? Can't we do something cooler? Absolutely.

One is the Loneliest Number

Since we are able to execute arbitrary Julia code, we have a good number of options now available to us to "work inside the system". After some experimenting, I found that I could send requests from a worker node to execute on the master node. My initial test for this was the following code:

 remotecall(2, ()->remotecall(1, ()->run(`touch /tmp/victory`), ()), ())

What this effectively proves is that I could execute an anonymous function on a worker (the remote worker had an ID of 2), which then requested that the master (in every situation I've seen so far, the master node has an ID of 1) execute another anonymous function being sent from the worker to the master. The end result was simply the creation of an empty file at ‘/tmp/victory’ but it changed the game.

With this proof of concept, we can execute code on a worker which then executes code on the master, but that's just two computers. If we go another level deeper, we can then execute code on all nodes from the master node itself – and all this achieved whilst only having network access to a single node. We do this by adding another layer of 'remotecall' calls inside of a loop, executing the same code on every node. My test code was as follows:

 remotecall(2, ()->remotecall(1, ()->[remotecall(x, ()->run(`touch /tmp/victory`), ()) for x in procs()], ()), ())

To reiterate, this proof of concept sent a request to a worker, which then sent a request to the master node, which then sent out a request to every single node (including the master) to run the command ‘touch /tmp/victory’. One can imagine my excitement checking the temporary directory of all my nodes and finding victory at every turn.
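The relay pattern described above (worker, then master, then every node) can be modelled with a toy, in-process simulation. Nothing here opens a socket or runs a command: `ToyCluster` and its methods are invented for this sketch, and "touching" a file only appends a string to a per-node log.

```python
from typing import Callable, Dict, List


class ToyCluster:
    """In-process stand-in for a cluster: no sockets, no real remote execution."""

    def __init__(self, node_ids: List[int]) -> None:
        self.node_ids = node_ids
        self.log: Dict[int, List[str]] = {n: [] for n in node_ids}

    def procs(self) -> List[int]:
        """Return the ids of all nodes, master included."""
        return list(self.node_ids)

    def remotecall(self, node_id: int, f: Callable[[int], None]) -> None:
        """Run `f` "on" node_id; f receives the id of the node it runs on."""
        f(node_id)

    def touch(self, node_id: int, marker: str) -> None:
        """Record that `marker` was created on node_id."""
        self.log[node_id].append(marker)


cluster = ToyCluster([1, 2, 3, 4])


def fan_out(_master_id: int) -> None:
    """Runs on the master: ask every node, master included, to leave the marker."""
    for x in cluster.procs():
        cluster.remotecall(x, lambda nid: cluster.touch(nid, "/tmp/victory"))


# The outer request only reaches worker 2, which relays to master 1,
# which in turn fans the request out to every node.
cluster.remotecall(2, lambda _worker_id: cluster.remotecall(1, fan_out))
print(cluster.log)
```

Even in this toy form, the structure shows why access to a single worker is enough: each hop simply asks the next node to run a function on its behalf.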

The following proof of concept in Python 2.7 is capable of executing commands on every node in a Julia cluster, given network access to a single worker:

https://gist.github.com/bwall/cac8d0b3ee4e90eab805d3adce8ec628

Naturally, we want to do something more interesting than create empty temporary files on the targets. For instance, we might want a reverse shell from every single node in the cluster, as demonstrated in this short video:

Resolution

The Julia developers, after being contacted, acted with a high degree of responsibility in resolving these issues. They have taken multiple steps to reduce the attack surface by reducing the availability of the service running on port 9009. In addition, they have instituted a ‘cluster cookie’ which acts as an authentication value and is communicated over the encrypted SSH channel. This means that to gain control, an attacker would need this cookie value, which is randomly generated when the cluster starts. An attacker with an active man-in-the-middle position on the cluster network could still abuse the cluster, but defending against that level of compromise would add severe overhead to the communication channel.
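The idea behind a cluster cookie, checking a shared random value before anything else on the connection is parsed, might look roughly like the following in Python. This is a sketch under stated assumptions and not Julia's actual wire format: the 16-byte cookie length, the `new_cookie` and `handle_connection` helpers, and the use of JSON as a stand-in deserializer are all choices made for this example.

```python
import hmac
import json
import secrets
from typing import Any, Optional

COOKIE_LEN = 16  # assumed length for this sketch, not taken from Julia


def new_cookie() -> bytes:
    """Generate a random cookie when the (hypothetical) cluster starts."""
    return secrets.token_hex(COOKIE_LEN // 2).encode("ascii")


def handle_connection(stream: bytes, expected: bytes) -> Optional[Any]:
    """Check the leading cookie before deserializing anything else."""
    presented, payload = stream[:len(expected)], stream[len(expected):]
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(presented, expected):
        return None  # reject: the payload is never parsed
    return json.loads(payload.decode("utf-8"))


cookie = new_cookie()
accepted = handle_connection(cookie + b'{"op": "ping"}', cookie)
rejected = handle_connection(b"x" * COOKIE_LEN + b'{"op": "ping"}', cookie)
print(accepted, rejected)
```

The ordering is the important part: the deserializer only ever sees bytes from a peer that has already presented the cookie.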

Conclusion

Distributed computing systems require large amounts of traffic to transfer between the computers in the cluster, which often leads to unauthenticated information and code being sent between nodes. This communication channel provides a significant attack surface, and the same weakness has been identified in a number of distributed systems before Julia. The speed at which the data needs to be transferred complicates securing the communication channel, but adding security to the channel is necessary to ensure the integrity of the nodes, the data, and everything they touch. Julia had a severe security issue, but the developers resolved it in a timely and responsible manner.

Timeline

April 25th, 2016 - Initial disclosure of vulnerability
April 25th, 2016 - Initial response from development team
May 10th, 2016 - Pull request by Julia developer with solution to reported issues (https://github.com/JuliaLang/julia/pull/16292/files#diff-30dfa74d6bebe60797057605a98aeabbR932)
May 12th, 2016 - Pull request merged into master (patched builds available that night in nightly builds)

About Brian Wallace

Lead Security Data Scientist at Cylance

Brian Wallace is a data scientist, security researcher, malware analyst, threat actor investigator, cryptography enthusiast and software engineer. Brian acted as the leader and primary investigator for a deep investigation into Iranian offensive cyber activities which resulted in the Operation Cleaver report, coauthored with Stuart McClure.

Brian also authors the A Study in Bots blog series which covers malware families in depth providing novel research which benefits a wide audience.