Flyway is a schema migration library that is, for example, used by Spring Boot to keep your database schema up to date. But... does Flyway support BigQuery? In this blog post, we work out 3 proof-of-concepts to add BigQuery support to Flyway and integrate it with Google Dataflow!
I'm a Solution Engineer in the Data team at ACA Group. Our cloud platform of choice is Google Cloud Platform (GCP). We’re currently using a subset of the services that are available on GCP.
Our company has historically used Java as its language of choice (Python has recently gained some traction and is what the Data team uses for writing Google Cloud Functions), so we chose to write our Dataflow pipelines in Java (Apache Beam also supports Python). As a company, we also have a lot of experience writing enterprise applications in Java with frameworks like Spring Boot, so we're very used to automating our database schema evolution. This is a best practice that we like to keep and would like to apply to our data pipelines in Dataflow as well.
So we decided to go on a little adventure and see if we could find something out there that could solve this need for us.
The initial approach was to do some research and see what Google could unearth. Looking around for information/tools/libraries/frameworks on schema evolution & migration for BigQuery, we found some options that we gave a deeper look:
While some of these options do cover some or most of our requirements, there wasn’t one that really was an ideal match.
A couple of these options were also mentioned in a Stack Overflow post that also contained a reference to Flyway… and the name Flyway rings a bell!
Flyway is the library that Spring Boot, for example, uses to provide schema migration functionality, and based on previous experience it should in theory cover all our requirements. That leaves only one big question: does Flyway have BigQuery support?
At the time I started looking into the whole Dataflow/BigQuery schema migration question, there was no official Flyway BigQuery support. (In the meantime, non-certified Beta support has been added.) Via the aforementioned Stack Overflow post, I did however find an issue in the Flyway GitHub repository about adding BigQuery support to Flyway. In that issue, I found a mention of a branch in a forked repository that should add some sort of BigQuery support to Flyway.
We were already familiar with Flyway due to our Spring Boot experience, and we’ve found some Flyway code that might actually add BigQuery support to Flyway. Time to do some proof of concepts, which will hopefully answer a bunch of questions:
The first proof of concept was to take the code from the forked repo as-is, clone it and try to get a simple migration to work against a BigQuery table in a dataset of a GCP test project.
There are 3 ways to run/use Flyway: via the command line tool, via the Maven/Gradle plugins, or by calling the Java API directly from your own code.
Because we want to integrate Flyway into our Java-based Dataflow pipelines, and because our Jenkins/Terraform-based deploy currently isn't well suited for the command line or Maven/Gradle options, we first looked at just calling the Flyway API. This was done by adding a simple Java class with a main method to the cloned repository branch. In this main method we needed to do a couple of things:
So the first thing we need to set up for a data source is a BigQuery JDBC driver. Luckily, the Google BigQuery documentation covers this. On this page is a link to a free download of the Google BigQuery Simba Data Connector made by Magnitude. Downloading the driver from this page will get you a ZIP file that contains the actual JDBC driver JAR file, GoogleBigQueryJDBC42.jar, but also all its dependencies.
In my case, I only added the driver JAR itself to our company's Maven repository; most of the driver's dependencies are already available in public Maven repositories, and it's quite a chore to check them all and upload the missing ones or the ones with differing versions.
For this first POC it was enough to add the following dependencies to the pom.xml of the project we cloned (the versions are only indicative for when I tested it, but can be replaced with newer ones):
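A minimal dependency set looked roughly like the snippet below. The Flyway version is only indicative, and the Simba driver's Maven coordinates are whatever you chose when uploading the JAR to your own repository, so treat those as placeholders:

```xml
<dependencies>
    <!-- Flyway itself (7.x at the time of the first POC) -->
    <dependency>
        <groupId>org.flywaydb</groupId>
        <artifactId>flyway-core</artifactId>
        <version>7.15.0</version>
    </dependency>
    <!-- The Simba BigQuery JDBC driver, uploaded to our internal repository.
         groupId/artifactId/version are placeholders: use whatever coordinates
         you uploaded GoogleBigQueryJDBC42.jar under. -->
    <dependency>
        <groupId>com.simba.googlebigquery</groupId>
        <artifactId>googlebigquery-jdbc42</artifactId>
        <version>1.2.x</version>
    </dependency>
</dependencies>
```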
With these dependencies in place, we can get the code below to work, provided you set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point at a service account credentials JSON file (which is needed to make the OAuthType=3 authentication mode work) and replace the <GCP project ID> and <a dataset ID> placeholders.
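The class itself stays small. This is a sketch of what that main method boils down to, assuming the dependencies above and the Simba JDBC URL format from Google's documentation (it needs the credentials environment variable and a live GCP project, so it won't run standalone):

```java
import org.flywaydb.core.Flyway;

public class BigQueryMigrationRunner {

    public static void main(String[] args) {
        // Simba JDBC URL; OAuthType=3 makes the driver pick up the service
        // account key file referenced by GOOGLE_APPLICATION_CREDENTIALS.
        String url = "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;"
                + "ProjectId=<GCP project ID>;"
                + "DefaultDataset=<a dataset ID>;"
                + "OAuthType=3;";

        Flyway flyway = Flyway.configure()
                .dataSource(url, null, null)   // no user/password: auth via the key file
                .locations("classpath:db/migration")
                .load();

        flyway.migrate();
    }
}
```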
I then added an SQL migration file to my src/main/resources/db/migration directory, executed the code and, to my surprise, Flyway was actually talking to my BigQuery dataset. There was, however, one small issue with the cloned Flyway BigQuery code that needed to be fixed: the INSERT statement in the BigQueryDatabase#getInsertStatement method, which Flyway uses to add migrations to its flyway_schema_history table, failed for 2 reasons:
After fixing the INSERT statement, I was able to see Flyway work correctly with BigQuery and verify that it could perform all the migration actions we defined. I even managed to get mixed SQL & Java migrations to work (using the Java BigQuery API to do things that can't be expressed in SQL). There was only one surprise: adding data to a table can't be done in the same SQL file that creates the table. Those kinds of actions can't be mixed in one file.
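To illustrate that surprise: instead of one migration file that creates a table and seeds it, you end up with two files along these lines (dataset, table and column names are just examples):

```sql
-- V1__create_customer_table.sql: DDL only
CREATE TABLE my_dataset.customer (
  id INT64,
  name STRING
);
```

```sql
-- V2__seed_customer_table.sql: the DML goes in its own file
INSERT INTO my_dataset.customer (id, name) VALUES (1, 'ACA Group');
```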
The output below is similar to what I got, but is from a more recent attempt with the current Flyway 8.x that has BigQuery Beta support:
The previous POC leaves us now with a new problem to solve: get this code working inside a Google Dataflow project. Taking inspiration from Spring Boot, which runs Flyway migrations during application startup, I had to find something in Beam/Dataflow that is similar and allows us to run arbitrary code during startup.
A first option that I discovered and investigated was a custom DataflowRunnerHooks implementation. While trying this out, I quickly discovered that it triggers at completely the wrong moment for what we want to achieve: it already executes while building the Dataflow code with the mvn compile exec:java command. Because we're building a common Dataflow artifact that is deployed to all environments and gets injected with runtime variables, triggering our custom Flyway code at build time doesn't achieve what we want.
So after looking around some more I found the JvmInitializer interface. This immediately looked more promising and a quick implementation showed that it was indeed usable, but that it does have a number of quirks/gotchas that we’ll cover in more detail in the lessons learned section.
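A minimal implementation sketch (the class name is mine; beforeProcessing is the hook that runs on a worker before it starts processing elements, and at that point the pipeline options are available):

```java
import org.apache.beam.sdk.harness.JvmInitializer;
import org.apache.beam.sdk.options.PipelineOptions;

// Registered via the Java Service Provider Interface mechanism,
// so Beam can discover it on worker startup.
public class FlywayJvmInitializer implements JvmInitializer {

    @Override
    public void beforeProcessing(PipelineOptions options) {
        // Read our custom options here and trigger the Flyway migration.
    }
}
```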
When adding this code to a Dataflow project, there is one more thing needed to actually make it work. The JvmInitializer system works via the Java Service Provider Interface mechanism. This means we need to create a file called org.apache.beam.sdk.harness.JvmInitializer in src/main/resources/META-INF/services that contains the FQCN of our JvmInitializer implementation.
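For example, if your implementation class is com.example.FlywayJvmInitializer (a hypothetical name), the file src/main/resources/META-INF/services/org.apache.beam.sdk.harness.JvmInitializer contains just that FQCN:

```
com.example.FlywayJvmInitializer
```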
When running a Dataflow pipeline, we can see the following logging (here again with the output of a more recent attempt with the Flyway version that has Beta support for BigQuery):
When starting to write the actual blog post, I checked out the Flyway GitHub repo again and spotted an interesting new module in their Maven multi-module project: flyway-gcp-bigquery (and also one for GCP Spanner). Looking at Maven Central, it seems they started releasing Beta versions of the BigQuery support in July 2021.
So I decided to check it out and see if I could remove the forked PR code from my codebase and replace it with this beta version dependency:
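The module lives under the regular Flyway groupId; the exact Beta version to use is whatever is current on Maven Central, so the version numbers below are only indicative:

```xml
<dependency>
    <groupId>org.flywaydb</groupId>
    <artifactId>flyway-core</artifactId>
    <version>8.0.0</version>
</dependency>
<dependency>
    <groupId>org.flywaydb</groupId>
    <artifactId>flyway-gcp-bigquery</artifactId>
    <version>8.0.0-beta2</version>
</dependency>
```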
After removing the code, adding the dependencies above (while also upgrading Flyway from 7.x to 8.x), recompiling and deploying, I was still able to run all the migrations successfully against an empty BigQuery environment.
The driver itself (as far as I can tell, the only JDBC BigQuery driver there is) does what it is supposed to do, but when it comes to logging it is a bit of a hot mess. Things I had to do to get the driver’s logging in Dataflow under some sort of control include:
There is no way to run something locally on your development machine to validate BigQuery behaviour: there is no BigQuery Docker image or emulator. So to test Flyway migrations against BigQuery, you actually need either a separate Google Cloud project to test against, or prefixed datasets in an existing project.
Due to certain limitations we had to go for the prefixed dataset approach, but managed to get it to work pretty transparently by using Dataflow runtime ValueProviders, the Flyway placeholder functionality and a custom utility that makes the dataset creation/deletion process easier.
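The core of that transparency is just a Flyway placeholder: the SQL migrations reference ${dataset} instead of a hard-coded dataset name, and we compute the placeholder value from an optional runtime prefix. A stripped-down sketch of that utility (class and method names are mine):

```java
import java.util.Map;

public class DatasetPlaceholders {

    // Resolve the dataset name, applying an optional prefix (e.g. a developer
    // or test-run prefix) so migrations target an isolated copy.
    static String resolveDataset(String prefix, String dataset) {
        return (prefix == null || prefix.isEmpty()) ? dataset : prefix + "_" + dataset;
    }

    // Build the placeholder map handed to Flyway via .placeholders(...);
    // SQL migrations then refer to ${dataset} instead of a literal name.
    static Map<String, String> placeholders(String prefix, String dataset) {
        return Map.of("dataset", resolveDataset(prefix, dataset));
    }

    public static void main(String[] args) {
        System.out.println(placeholders("test", "sales"));
    }
}
```

The map is passed to Flyway through Flyway.configure().placeholders(...), and with the prefix coming in as a runtime ValueProvider the pipelines themselves stay unaware of which copy of the dataset they migrate.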
BigQuery has a very interesting feature called Time Travel, which comes in very handy when Flyway migrations fail. Especially for the community edition of Flyway, which doesn’t have “undo” functionality, Time Travel is the easiest way to restore your database to how it was before the migration.
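For reference, restoring a table to its pre-migration state with Time Travel boils down to a statement like this (dataset/table names and the interval are examples; BigQuery keeps at most 7 days of history):

```sql
-- Recreate the table as it was one hour ago, before the failed migration ran
CREATE OR REPLACE TABLE my_dataset.customer AS
SELECT *
FROM my_dataset.customer
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
```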
I’m even wondering if you could somehow build “undo” functionality using BigQuery’s Time Travel and Flyway’s “callbacks” (the ones that are available in the community version)?
Time Travel also comes in handy because BigQuery has quotas on a lot of things. Manually reverting changes via SQL ALTER TABLE statements for example quickly makes you run into these.
We first had every Dataflow pipeline using the JvmInitializer to keep the database schema up to date, but noticed that rows in the Flyway history table were sometimes duplicated (or worse). As it turns out, every Dataflow worker that gets started by a pipeline goes through JVM initialization, and sometimes workers start close enough to each other that migrations run multiple times. Flyway usually uses some form of locking to prevent this, but in the cloned code this mechanism wasn't available for BigQuery. It seems some form of locking is available for this in the 8.x Beta, but I haven't been able to test whether it works yet.
To solve this issue, we made running the JvmInitializer configurable, turned it off by default for all pipelines, and created a specific dummy Flyway pipeline for which we turned it on and which runs before all other batch pipelines.
Worker initialization takes about 2 minutes before the worker actually starts doing work and we see Flyway kick into action. After that, every migration file seems to take at least 30 seconds to run (sometimes more, depending on the migration and the table contents). From the logging, it looks like this is partially due to how the SQL is run: each statement becomes a BigQuery job whose results you have to wait for.
Luckily, due to the previous issue/solution we’re only running it once every day for one dummy pipeline and not the rest of our pipelines. So the only time it is actually slow is when you’re testing and running the full set of migrations starting from an empty environment.
You will also need to set your Flyway timeout to a value long enough for bigger table manipulations to succeed without timing out. We're currently working with a value of 180 seconds.
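In our setup that timeout lives in the JDBC URL; the Simba driver has a Timeout connection property (in seconds) for this. Double-check the property name against the documentation of your driver version, and replace the placeholders as before:

```
jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=<GCP project ID>;DefaultDataset=<a dataset ID>;OAuthType=3;Timeout=180;
```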
For all the things that you want to do with BigQuery in the context of a migration that aren’t supported by BigQuery’s SQL implementation, you can fall back on Flyway’s Java migrations. In a Java migration, you can easily use the BigQuery Java API to do everything that the API allows you to do.
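A sketch of what such a Java migration can look like, assuming the google-cloud-bigquery client library is on the classpath (the dataset/table names are examples, and setting a label is just one instance of something you can't express in a migration SQL file):

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Table;
import java.util.Map;
import org.flywaydb.core.api.migration.BaseJavaMigration;
import org.flywaydb.core.api.migration.Context;

// Picked up by Flyway like any other versioned migration: the class name
// encodes the version, just like an SQL file name would.
public class V5__Label_customer_table extends BaseJavaMigration {

    @Override
    public void migrate(Context context) {
        BigQuery bigQuery = BigQueryOptions.getDefaultInstance().getService();
        Table table = bigQuery.getTable("my_dataset", "customer");
        bigQuery.update(table.toBuilder()
                .setLabels(Map.of("owner", "data-team"))
                .build());
    }
}
```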
In the end, we created a more advanced JvmInitializer that allows us to turn Flyway migrations/repair/baselining on and off, dynamically prefix datasets, and so on. For this we of course need to provide Dataflow pipeline options, and because we're building a pipeline artifact (JSON + JARs) in a central bucket that is used to start jobs in multiple environments, these options need to be runtime options. This is where we ran into an issue with Dataflow's options mechanism, especially when using the required/default mechanism. As it turns out, it doesn't really work like you'd expect: defaults seem to get lost when you don't provide a value for an option but try to access it in the JvmInitializer.
The solution was found by looking at the Dataflow worker logs. In them, we could see a JSON document being logged that contains most of the option info we need. On a worker, this JSON is available via the sdk_pipeline_options_file environment variable. Reading and parsing that file allowed us to get a more or less working custom options object. Together with some reflection to inspect the option annotations and their contents, we got it to work well enough for our purposes.
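To give an idea, here is a stripped-down version of that parsing. The option name, the JSON shape, and the regex-based extraction are all simplifications of what we actually do; a real implementation should use a proper JSON parser:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PipelineOptionsFile {

    // Pull a single string option out of the options JSON without a JSON
    // library; fine for flat string values, not a general JSON parser.
    static String extractOption(String json, String name) {
        Matcher m = Pattern
                .compile("\"" + Pattern.quote(name) + "\"\\s*:\\s*\"([^\"]*)\"")
                .matcher(json);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws Exception {
        // On a Dataflow worker, sdk_pipeline_options_file points at the JSON dump.
        String path = System.getenv("sdk_pipeline_options_file");
        if (path != null) {
            String json = Files.readString(Path.of(path));
            System.out.println(extractOption(json, "datasetPrefix"));
        }
    }
}
```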