Saturday, January 10, 2015

Treat Databases as Yet Another Network Service

Preamble

This occurred to me recently, though it has been brewing for a long time. I feel it's one of those things that goes unsaid, and that silence ultimately leads to many terrible things. So here goes:
The database is a network service, just like any other.

TL;DR

The database is a network service, just like any other. As such, it should be a good network citizen and provide a stable, backwards-compatible API as much as possible: an abstracted (logical) model over its internal (physical) model. The key takeaways:
  • Views, triggers, and stored procedures can provide backwards and forwards compatibility
  • Migrations, primarily DDL, and their deployment procedures should be separate from your application and its code base

Updates

  1. 2015-01-11 - Added changes based on feedback from Twitter. Specifically, addressed microservices, ownership of data, and some additional best practices.

Network Services

For the purposes of this post, network services are any services on your network(s) that accept connections and provide functionality to fulfill user needs. In my context, and I imagine many others', this means a web service and a database server, possibly many of each.

Database Subtleties

When I'm discussing databases I'm not referring to the RDBMS that is running; I'm talking about the databases and/or schemas held therein. I currently feel the RDBMS itself is what the configuration management system or golden image should provide, while the orchestration and deployment of the data/metadata held within it is what I'm primarily referring to in this post. I hope that's sufficiently clear from context, regardless.

How I Used to Think

The database is a "special thing" that is subordinate to, and somehow part of, the web application(s) it serves.
Presumably this is how many other people think too, as many ecosystems (Java, Ruby, Python, …) seem to push in this direction. Even "newer" thinking in microservices, which says each service should have its own data store, still treats the database as some sort of appendage. Does this all sound vague? It is.
Garbage in, garbage out. From there I, and it seems most everyone else, start putting database migrations into the application deployment process, or perhaps into orchestration (still a newer, less widely disseminated practice). The pain points are few at first, and then these are the things that run through my brain:
  • Run times of the application deployment process become unpredictable; rollbacks (or roll-forwards to roll back) are similarly affected
  • You can't make database changes (e.g. hotfixing additional indices) without doing a release
  • Or you cowboy it, but then those changes have a separate release process: the secret, hidden, and poorly documented one (or many), because this path is likely unaccounted for
  • There are ways to mitigate the long run times of migrations by running them on a slave, catching it up, failing over, then repeating the process. I can't imagine (I can, I'm trying not to) what this would do to the deployment complexity of an application
  • Or you take some lessons from Release It about doing zero-downtime migrations, yet adopting those techniques seems to invariably encounter friction
  • Blue/green deploys (slave/master failover, minimal data loss in case of rollback) can help somewhat, but that's still not a complete/ideal solution

My Ah-ha Moment

One day I woke up with database migrations on the brain and thought about them diffusely. A few hours later, "the database is just another network service, and as such should provide a stable API" popped into my head, and everything started falling into place after that.

Practical Consequences

Separate Project/Artifact

As the database is its own service, it should be treated as a first-class citizen like the rest. Regardless of whether you have one version control repository per service, or a top-level directory per service within one root, the database should be right there. Inside this project you could keep things as simple as a store of migration scripts, or go a more complex route and also keep the CLI tools that run various other operations, whatever that service needs.

Service Contracts: API

Good network services have stable APIs. In the case of a database, the API is the schema, and when treating the database as just another network service there are some subtleties. First, the logical model should be the schema exposed to clients, while the physical model is the private implementation (where changes for performance tuning, or changes supporting a new API, are made). To enforce this segregation, one possibility is to create a database user for the applications that does not have access to the non-public tables and views. You can use views to provide backwards compatibility, or triggers to bridge writing the same information to two places where required. Views are not always performant, so you might only shim them in while supporting old API versions, and older API versions should be deprecated in a timely fashion.
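To make this concrete, here is a minimal sketch of the kind of segregation described above: the physical table lives in a private database, and clients are only granted a view in a versioned "API" database. MySQL-flavoured SQL is assumed (the replication examples later on suggest it), and the database names, the customer table, and the app_user account are all hypothetical.

    -- Physical (private) model and logical (public) model kept apart.
    CREATE DATABASE app_internal;
    CREATE DATABASE app_api_v1;

    CREATE TABLE app_internal.customer (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        full_name  VARCHAR(255) NOT NULL,
        created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- The view is the published API; the table behind it can be
    -- refactored without breaking clients, and an app_api_v2 view can
    -- coexist with this one during a deprecation window.
    CREATE VIEW app_api_v1.customer AS
        SELECT id, full_name, created_at
          FROM app_internal.customer;

    -- The application user only ever sees the logical model.
    CREATE USER 'app_user'@'%' IDENTIFIED BY 'change-me';
    GRANT SELECT, INSERT, UPDATE, DELETE ON app_api_v1.* TO 'app_user'@'%';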

Service Contracts: SLA

Good network services have SLAs, and SLAs are what should govern deployments. Deployments should aim for zero downtime; without even necessarily having read Release It, you can probably start imagining fancy view and trigger tricks to achieve it, some of which I alluded to in the API section.

Migration Example

For instance, adding a column to a table simply requires a default value, and both the new and old API can be maintained simultaneously. On a large table this is problematic due to a) the long run time of ALTER TABLE and b) the heavy IO penalty.
One option is a database deployment that runs through a rolling schema upgrade. In case you're not familiar with rolling schema upgrades, I'll explain the procedure here, given at least one slave (a sketch follows the list):
  1. Stop the slave from replicating
  2. Alter the schema with the addition of the column with the default value
  3. Restart replication and allow the slave to catch up
  4. During a quiet period, turn off the master
  5. Promote the slave to master
  6. Now the old master can run the upgrade
  7. Start replication on the old master (now a slave of the new master)
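As a rough illustration of steps 1 through 3, here is what the slave-side portion might look like, assuming MySQL-style replication and a hypothetical orders table gaining a priority column. This is a sketch of the idea, not a drop-in script; the failover itself (steps 4 and 5) is environment specific.

    -- On the slave: pause replication, apply the DDL, then catch up.
    STOP SLAVE;

    ALTER TABLE orders
        ADD COLUMN priority TINYINT NOT NULL DEFAULT 0;

    START SLAVE;

    -- Watch Seconds_Behind_Master until the slave has caught up, then
    -- fail over and repeat the ALTER on the old master.
    SHOW SLAVE STATUS\G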
Alternatively, you can perform a similar maneuver at the table level (a sketch follows the list):
  1. Create a new table, table_v2, which is a structurally (DDL) updated version
  2. Create insert and update triggers on the original table to also copy data over to table_v2
  3. After the triggers are working note latest record in the original table
  4. Using an INSERT … SELECT, start sequentially copying the data into the new table in small chunks; chunking is done by limiting the SELECT
  5. Once the data has been copied over (surpassing the latest record), clients can start migrating over to the new table
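A sketch of the table-level variant, again assuming MySQL and the same hypothetical orders table. The matching AFTER UPDATE trigger, delete handling, and the eventual cut-over and cleanup are omitted for brevity.

    -- New structure alongside the old table.
    CREATE TABLE orders_v2 LIKE orders;
    ALTER TABLE orders_v2 ADD COLUMN priority TINYINT NOT NULL DEFAULT 0;

    -- Keep new writes flowing into both tables (an AFTER UPDATE trigger
    -- would be defined in the same fashion).
    CREATE TRIGGER orders_copy_insert AFTER INSERT ON orders
    FOR EACH ROW
        REPLACE INTO orders_v2 (id, customer_id, total, priority)
        VALUES (NEW.id, NEW.customer_id, NEW.total, 0);

    -- Backfill in small chunks; advance the lower bound to the last
    -- copied id on each round until the noted record is surpassed.
    INSERT IGNORE INTO orders_v2 (id, customer_id, total, priority)
    SELECT id, customer_id, total, 0
      FROM orders
     WHERE id > 0
     ORDER BY id
     LIMIT 10000;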
Excuse some of the complexity, but the point is to illustrate that there is a lot of flexibility in how one can rework the database with minimal interruption and, more specifically, minimal impact on the application deployment process. Lastly, I should mention that there are a number of existing tools that can help guide you through these steps; they are database dependent, however.

Effect on Consumers [Updated: 2015-01-11]

Greater stability in the public API of the database will have effects on your application code. Previously, one would have to update all possible clients (web service controllers, queue producers/consumers, etc.) simultaneously; now these updates can be done piecemeal. Your application will have to be able to deal with data potentially not being present; default values and the Option type help here.
Database access should be treated like a regular network call. Some APIs tend to paper over this, but retries, timeouts, and bulkheads all apply; libraries like Hystrix help here. I'd like to specifically mention LIMIT clauses in SELECT statements: they are underused, and unbounded collections are not sound engineering. I'm very guilty of having ignored this for a long time. We did some analysis of our data recently and found a number of people in our database with enormous collections of associated data, such as phone numbers, emails, and addresses, in the hundreds or thousands. Fun things happen: our export system crashes on these, random timeouts occur when the database is overloaded, and the UI is really slow to load. Honestly, does someone really need more than ten emails, phone numbers, or addresses associated with them? Start with a reasonable number and bump it up from there (a sketch follows below). I'm starting to do this, and it's stopping me from over-engineering and, most importantly, from under-engineering.
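As a trivial illustration of the LIMIT point, assuming a hypothetical customer_email table: cap the collection at query time instead of loading an unbounded list into the application.

    -- Bound the collection; raise the limit later if it proves too small.
    SELECT email
      FROM customer_email
     WHERE customer_id = 42
     ORDER BY created_at DESC
     LIMIT 10;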

Caution

What deployment and operational sanity giveth, inconsistent models and undisciplined software engineering taketh away. This approach is awesome in most contexts I'm in, but if I'm dealing with a tiny (in usage and code) intranet application, it's just more work. This is also where much of my old way of thinking came from; it's a matter of scale.

Microservices [Updated: 2015-01-11]

Omar asked whether microservices, because each has its own single data store, avoid these problems. Possibly, but I think the biggest problem is that microservices are poorly defined, so I'll try to tackle the two-ish definitions I know.

Pure Microservice

If your microservice is "truly" micro (1k-ish lines of code) and really only does one thing, then having a single subordinate data store doesn't really hurt, except for the unpredictable deployments. My worries can be summarized in the following example: consider an HTTP service that performs the usual CRUD operations on some data type Foo. Say you want to import Foos; this is a long-running process and really doesn't belong in your existing microservice. So do we create a new microservice? Making a separate data store for it doesn't sound right either: are we really going to fire off one HTTP/network call from the import service per item being imported? Now we might come around to shared data stores, but then we have more than one client. This is exactly where I would throw up my hands.

Aggregates/Bounded Contexts

This is where your service is kept as small as possible, acting only over a minimal set of related Aggregates forming a Bounded Context; this should be conceptually no larger than what one team can understand. This case is similar, and hopefully where you end up if you've tried the pure method. My current situation is: an HTTP service, long-running tasks (producers/consumers), replication into our reporting database, and a legacy application. We have at least four types of API clients. We're working on killing the legacy application, but that will only get us down to three. Coordinating a deployment and trying to keep a consistent (version-wise) system across that many moving parts has been painful.

Next Steps

This is still fresh for me. I'm going to start working on presenting this line of reasoning to others; I've already had the chance to do so with Jeremy, and you're seeing the distillation here. I think people who really "get it" knew this all along, but didn't necessarily convey it, because it seems so obvious. For me, it's been a revelation.

Conclusions

Ultimately, once you've done this database work enough times you probably arrive at the same conclusions, but from all accounts via a long, painful road. I feel that if people start off thinking about their databases correctly, they will naturally avoid much of that pain.

Acknowledgements

The content of this blog post started long ago, influenced by many things, but mostly by conversations with Jeremy around where database migrations should live and who/what should initiate them; none of this would have happened without him.

Reviewers

Inspiration/References