
sbt bug fix – remote caching

This article is part 2, following on from this introductory article on remote caching, so do read that one first and come back!

In this second part I discuss a bug fix related to sbt’s remote caching. I’ll explain what the problem was and how it was fixed using two different implementations.

What was the problem?

Remote caching was failing to include the resources directory in the JAR file that is packaged up for caching. This meant that changes made in the resources directory would not be picked up by remote caching.

Here is the GitHub issue.

What is the resources directory?

The resources directory is the place sbt designates for resource files, which are usually static non-code files.
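By convention this is src/main/resources (and src/test/resources for tests). For illustration, here are the default settings behind that convention, as you could write them in build.sbt:

  // These are sbt’s defaults, so you never need to write them yourself.
  // Anything in these directories ends up on the classpath and in the JAR.
  Compile / resourceDirectory := baseDirectory.value / "src" / "main" / "resources"
  Test / resourceDirectory := baseDirectory.value / "src" / "test" / "resources"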

What is an example of this problem?

An example of this is my friend Ade and I working together on a project. She decides to rename an SQL file in the resources directory from sqlV1 to sqlV2. After doing this, she runs sbt pushRemoteCache to create a JAR file.

I pullRemoteCache and see that her new file is there, but the old file is there too. I’ve ended up with two files: the old version and the new one. sbt has failed to delete the old version of the file. Oh no!

Problem summary

In summary, the problem was that resource files were not tracked in the Analysis file, which is what Zinc uses to keep track of the files that have changed in our application code.

So, when we pushed to the remote cache, sbt didn’t realise that a resource file had been updated.

In the case highlighted by the GitHub issue, this means we end up with two resource files (one from the previous version of the application and the latest one), when really we just want the latest one.

How was this fixed (first time)?

This bug was fixed with two different implementations. I’ll explain the first implementation here and then the second afterwards.

The idea behind the first fix was this: given that the Analysis file does not include resource files, could we create our own file which records them? That file could act as a record of the state of the resources directory, so that when someone calls pullRemoteCache we can delete the stale files and leave only the correct ones.

This first fix has two parts: creating the file of resource file paths to be included in the JAR file for pushRemoteCache, and reading that file and deleting the stale files when we pullRemoteCache.

Here is the code for the first part.

It creates a new file and writes the resource files as virtual file paths.

  def getResourceFilePaths() = Def.task {
    import sbt.librarymanagement.LibraryManagementCodec._
    val t = classDirectory.value
    val dirs = resourceDirectories.value.toSet
    val flt: File => Option[File] = flat(t)
    val cacheDirectory = crossTarget.value / (prefix(configuration.value.name) + "caches")

    val converter = fileConverter.value
    // Rebase each resource from its resource directory into the classes
    // directory, falling back to a flat mapping for anything outside it.
    val transform: File => Option[File] = (f: File) => rebase(dirs, t)(f).orElse(flt(f))
    // Record each resource as a machine-independent virtual file path.
    val resourcesInClassesDir = resources.value
      .flatMap(x => transform(x).toList)
      .map(f => converter.toVirtualFile(f.toPath).toString)
    // Serialise the paths to JSON and write them to resources.json.
    val json = Converter.toJson[Seq[String]](resourcesInClassesDir).get
    val tmp = CompactPrinter(json)
    val file = cacheDirectory / "resources.json"
    IO.write(file, tmp)
    file
  }

Here is the code for the second part.

When pullRemoteCache is called, sbt reads the file with all the resource file paths, converts them from virtual to absolute paths and then deletes them.

  private def extractResourceList(output: File, converter: FileConverter): Unit = {
    import sbt.librarymanagement.LibraryManagementCodec._
    import sjsonnew.support.scalajson.unsafe.{ Converter, Parser }
    import xsbti.VirtualFileRef

    val resourceFilesToDelete = output / "META-INF" / "resources.json"
    if (resourceFilesToDelete.exists) {
      // Read and parse the JSON list of virtual paths written at push time.
      val readFile = IO.read(resourceFilesToDelete)
      val parseFile = Parser.parseUnsafe(readFile)
      val resourceFiles = Converter.fromJsonUnsafe[Seq[String]](parseFile)
      // Convert each virtual path back into an absolute path on this machine.
      val paths = resourceFiles.map(f => converter.toPath(VirtualFileRef.of(f)))
      val filesToDelete = paths.map(_.toFile)
      // As a safety check, only delete files inside the output directory.
      for (file <- filesToDelete if file.getAbsolutePath.startsWith(output.getAbsolutePath))
        IO.delete(file)
    }
  }

I haven’t included the code that adds this file to the JAR, but you can find it in the pull request: https://github.com/sbt/sbt/pull/6554

What did I learn from this?

I learnt about globs, which I had been using but had never known were called this! A glob is a pattern that matches sets of file paths. I love the name globs. It makes me feel like I’m in outer space! Here’s some nice documentation on sbt and globs.
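To give a flavour, here’s a minimal sketch using sbt’s glob API (available since sbt 1.3); the sqlFiles task key is just mine for illustration:

  // In build.sbt. `**` matches any number of directories, `*` matches one level.
  val sqlFiles = taskKey[Seq[java.nio.file.Path]]("All .sql files under the resources directory")

  sqlFiles := {
    val glob = (Compile / resourceDirectory).value.toGlob / ** / "*.sql"
    // FileTreeView walks the file system and returns the paths matching a glob.
    sbt.nio.file.FileTreeView.default.list(glob).map(_._1)
  }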

I also learnt how sbt does JSON parsing (documentation here and sjson-new) and in particular about codecs (contraband and an example of a generated codec). Maybe I could write another article on this specifically later!
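As a taster, here’s a minimal round-trip using the same sjson-new calls as the code above (BasicJsonProtocol stands in for the LibraryManagementCodec import):

  import sjsonnew.BasicJsonProtocol._
  import sjsonnew.support.scalajson.unsafe.{ CompactPrinter, Converter, Parser }

  // Serialise a Seq[String] to JSON text, as pushRemoteCache does...
  val json = Converter.toJson[Seq[String]](Seq("a.sql", "b.sql")).get
  val text = CompactPrinter(json) // ["a.sql","b.sql"]

  // ...and read it back, as pullRemoteCache does.
  val back = Converter.fromJsonUnsafe[Seq[String]](Parser.parseUnsafe(text))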

How was it fixed (second time)?

Here is the PR for the second implementation: https://github.com/sbt/sbt/pull/6611

After the first implementation, Eugene had another idea: why not use some existing code which already tracks resource files using absolute file paths, and change those paths into virtual ones for our purposes?

The code that does this in sbt is called Sync.scala. It keeps the files in two given directories in sync. It’s more efficient than a task which simply copies a directory, because if a file already exists in the target directory, it keeps it rather than copying it again.

In order to use Sync.scala for remote caching, virtual file paths would be required instead of absolute ones, so that it could work across different machines.

Here is the Sync.scala code: https://github.com/sbt/sbt/blob/develop/main-actions/src/main/scala/sbt/Sync.scala
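To illustrate the idea, here’s a rough conceptual sketch of a sync (mine, not sbt’s actual implementation):

  import java.nio.file.{ Files, Path, StandardCopyOption }

  // Copy only new or modified files, and delete targets that no longer
  // have a source; return the targets so the next run can diff against them.
  def syncDirs(mappings: Seq[(Path, Path)], previousTargets: Set[Path]): Set[Path] = {
    for ((src, dst) <- mappings) {
      val stale = !Files.exists(dst) ||
        Files.getLastModifiedTime(src).compareTo(Files.getLastModifiedTime(dst)) > 0
      if (stale) {
        Files.createDirectories(dst.getParent)
        Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING)
      }
    }
    val current = mappings.map(_._2).toSet
    (previousTargets -- current).foreach(Files.deleteIfExists(_))
    current
  }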

What challenges did I face?

I experienced a few challenges during this implementation, which taught me a lot. I’ll summarise three of them here.

Challenge 1: Implicit conversions

There were a number of compile errors when adding virtualisation to Sync.sync, related to implicits for JSON formatting, which I struggled to understand.

I learnt that when I use typeclasses like sbt’s sjsonnew/JsonFormat.scala, I need to provide type evidence to the compiler at the call site where I want to use them.

To borrow one of Eugene’s analogies, using an implicit as evidence for a typeclass is like having a driving licence that proves you can use your car. An implicit proves that we can use a certain typeclass.

In my case, the compiler was telling me that it couldn’t prove that VirtualFileRef could use the JsonFormat typeclass. Using IsoString told the compiler that VirtualFileRef is interchangeable with the String type, which it already knows how to format. It worked fine after that.

There are different ways of telling the compiler the type we need. We could:

  1. Declare an implicit val
  2. Write a converter function, which converts something into the type we want
  3. Or let the compiler do it (automatic derivation) 

The case I’ve described is number 2.
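Here’s a minimal sketch of that converter, assuming sjson-new is on the classpath (the exact code in sbt may differ):

  import sjsonnew.IsoString
  import xsbti.VirtualFileRef

  // Teach the compiler that a VirtualFileRef is interchangeable with its
  // String id, so the String JsonFormat it already knows can be reused.
  implicit val isoVirtualFileRef: IsoString[VirtualFileRef] =
    IsoString.iso(_.id, VirtualFileRef.of)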

Here is a nice explanation of typeclasses in Scala. Note that Scala does not have built-in typeclasses like Haskell, but it can simulate the pattern, as Heather Miller explains really well.

Here is a super explanation of implicits from one of my amazing colleagues at the Guardian, covering both implicit parameters and implicit conversions; we have discussed the latter here.

Challenge 2: Overloaded methods

I had some compile issues with overloaded methods that have default parameters.

An overloaded method is one of two or more methods that share the same name but take different parameters.

A default parameter is a parameter that is assigned a default value in the function signature, so that if the function is called without an argument for it, the default is used.

When I changed the Sync.sync function so that it was virtualised, I had to keep the original Sync.sync function (which takes a default parameter), because sbt must maintain backwards compatibility.

The compiler was really unhappy with my overloaded methods, one of which took a default parameter.

This is because the compiler encodes a default parameter as a specially named synthetic method. If two overloaded methods both declare a default in the same parameter position, the compiler rejects them, because it wouldn’t be able to tell those synthetic methods apart. This problem is explained on this Stack Overflow post, which Martin Odersky has also contributed to! In the end, this was solved by writing an additional overloaded method (so now we have three Sync.syncs), which had the same function signature as the original method, just without the default parameter.
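Here’s a tiny, hypothetical illustration of the rule (the names are mine, not sbt’s actual signatures):

  object Overloads {
    // These two overloads will NOT compile together, because both
    // declare a default argument:
    //   def sync(from: String, clean: Boolean = true): Unit = ()
    //   def sync(from: Int, clean: Boolean = true): Unit = ()
    // error: multiple overloaded alternatives of method sync define default arguments

    // This combination is fine: only one overload carries the default, and
    // callers of the other overload must pass the argument explicitly.
    def sync(from: String, clean: Boolean = true): Unit = ()
    def sync(from: Int, clean: Boolean): Unit = ()
  }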

Challenge 3: Putting the file in the right place

The final challenge wasn’t a challenge as such, just something new that I learnt about how remote caching works in sbt.

The problem was that Sync.sync previously put the information about the latest files in a directory which we didn’t have control over (the streams directory). This is important, because pullRemoteCache needs to find this information to know which files to persist.

The solution was to set the directory that we wanted the file to be saved in. We used the classes directory, because we know it will always be available.
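Conceptually, that looks something like this (an illustrative sketch with made-up names, not sbt’s exact code):

  import sbt.util.CacheStore

  // Inside a task definition: keep the sync record inside the classes
  // directory, which is packaged into the cached JAR (so pushRemoteCache and
  // pullRemoteCache both see it), rather than under the task-local streams
  // directory, which is not.
  val store = CacheStore((Compile / classDirectory).value / "META-INF" / "copy-resources.txt")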

Recommended documentation

https://www.scala-sbt.org/1.x/docs/Remote-Caching.html

https://eed3si9n.com/remote-caching-sbt-builds-with-bintray

https://eed3si9n.com/cached-compilation-for-sbt

https://www.youtube.com/watch?v=MyuJRUwT5LI (I really enjoyed this video on Bazel, which I think is a nice introduction to remote caching. Note that it’s really different from remote caching in sbt however, because everything is cached in Bazel and there’s no incremental compilation.)

Explains resources directory briefly: https://www.scala-sbt.org/1.x/docs/Howto-Customizing-Paths.html#Change+the+default+resource+directory