Saturday, January 10, 2015

Treat Databases as Yet Another Network Service

Preamble

This occurred to me recently, though it has been brewing for a long time. I feel it's one of those things that goes unsaid, and that silence ultimately leads to many terrible things. So here goes:
The database is a network service, just like any other.

TL;DR

The database is a network service, just like any other. As such, it should be a good network citizen and, as much as possible, provide a stable, backwards-compatible API: an abstracted (logical) model over its internal (physical) model. The key takeaways:
* Views, triggers, and stored procedures can provide backwards and forwards compatibility
* Migrations, primarily DDL, and their deployment procedures should be separate from your application and its code base

Updates

  1. 2015-01-11 - Added changes based on feedback from Twitter. Specifically, addressed microservices, ownership of data, and some additional best practices.

Network Services

For the purposes of this post, network services are any services on your network(s) that accept connections and provide functionality in order to fulfill user needs. In my context, and I imagine many others, this means a web service and a database server, possibly many of each.

Database Subtleties

When I discuss databases here, I'm not referring to the RDBMS that is running; I'm talking about the databases and/or schemas held therein. I currently feel the RDBMS is what the configuration management system or golden image should provide, while the orchestration and deployment of the data/metadata held therein is what I'm primarily referring to within this post. I hope it's sufficiently clear from context, regardless.

How I Used to Think

The database is a "special thing" that is subordinate to, and somehow part of, the web application(s) it serves.
Presumably this is how many other people think too, as many ecosystems (Java, Ruby, Python, …) seem to push in this direction. Even "newer" thinking in microservices, which says each service should have its own data store, still treats that store as some sort of appendage. Does this all sound vague? It is.
Garbage in, garbage out. From there I, and it seems most everyone else, start putting database migrations into application deployment processes, or perhaps into orchestration (still a newer, less widely disseminated practice). The pain points are few at first, and then these are the things that run through my brain:
  • Run times of the application deployment process become unpredictable; rollbacks (or roll-forwards to roll back) are similarly affected
  • You can't make database changes (e.g., hotfixing in additional indices) without doing a release
  • Or you cowboy it, but then those changes follow a separate release process: the secret, hidden, poorly documented one (or several), because this path was likely never accounted for
  • There are ways to mitigate long migration run times by running the migration on a slave, catching it up, failing over, then repeating the process. I can't imagine (I can, I'm trying not to) what this would do to the deployment complexity of an application
  • Or, take some lessons from Release It! about doing zero-downtime migrations, yet somehow adopting those techniques seems to invariably encounter friction
  • Blue/green deploys (slave/master failover, with minimal data loss in case of rollback) can help somewhat, but that's still not a complete/ideal solution

My Ah-ha Moment

One day I woke up with database migrations on the brain and thought about them diffusely. A few hours later, "the database is just another network service, and as such should provide a stable API" popped into my head, and everything started falling into place after that.

Practical Consequences

Separate Project/Artifact

As the database is its own service, it should be treated as a first-class citizen like the rest. Regardless of whether you have one version control repository per service, or a top-level directory per service within one root, the database should be right there. Inside this project you could keep things simple and store only migration scripts, or you could go a more complex route and also store the CLI tools that let you run various other operations: whatever that service needs.

Service Contracts: API

Good network services have stable APIs. The API, in the case of a database, is the schema, and when treating the database as just another network service there are some subtleties. First, the logical model should be the schema exposed to clients, while the physical model is the private implementation (where changes for performance tuning, or changes supporting a new API, are made). To enforce this segregation, one possibility is to create a database user for the applications that does not have access to the non-public tables and views. You can use views to provide backwards compatibility, or triggers to bridge writing the same information in two places where that is required. Views might not always be performant, so you might only shim them in while you're supporting old API versions, and older API versions should be deprecated in a timely fashion.
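
To make this concrete, here is a minimal sketch, assuming MySQL driven over plain JDBC, with hypothetical table, database, and user names: the physical table stays private, while a view preserves the old schema for existing clients.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class PublishPeopleApi {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection(
                    "jdbc:mysql://db.example.com/app", "dba", "secret");
                 Statement st = db.createStatement()) {
                // Physical model: private, free to change for tuning or new APIs.
                st.execute("CREATE TABLE people_v2 ("
                        + "id BIGINT PRIMARY KEY, "
                        + "given_name VARCHAR(100) NOT NULL, "
                        + "family_name VARCHAR(100) NOT NULL)");
                // Logical model: a view that still looks like the old `people`
                // table, so existing clients keep working while they migrate.
                st.execute("CREATE VIEW people AS "
                        + "SELECT id, CONCAT(given_name, ' ', family_name) AS name "
                        + "FROM people_v2");
                // The application user can see only the public view, never the
                // private table, which enforces the API boundary.
                st.execute("CREATE USER 'webapp'@'%' IDENTIFIED BY 'app-secret'");
                st.execute("GRANT SELECT ON app.people TO 'webapp'@'%'");
            }
        }
    }

Note that a derived view like this one is effectively read-only in MySQL; the write path for older clients is where the trigger tricks come in.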

Service Contracts: SLA

Good network services have SLAs, and SLAs are what should govern deployments. Deployments should aim for zero downtime; even without having read Release It!, you can probably start having some ideas about fancy view and trigger tricks to achieve it, some of which I alluded to in the API section.

Migration Example

For instance, adding a column to a table simply requires a default value, and both the new and the old API can be maintained simultaneously. On a large table this is problematic due to a) the long run time of ALTER TABLE and b) the heavy IO penalty.
One option is a database deployment that runs through a rolling schema upgrade. In case you're not aware of rolling schema upgrades, I'll explain one here, given at least one slave (a sketch of the first three steps follows the list):
  1. Stop the slave from replicating
  2. Alter the schema with the addition of the column with the default value
  3. Restart replication and allow the slave to catch up
  4. During a quiet period, turn off the master
  5. Promote the slave to master
  6. Now the old master can run the upgrade
  7. Start replication on the old master (now the new slave)
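
As promised, a minimal sketch of steps 1 through 3, assuming MySQL over plain JDBC and a hypothetical users table:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class RollingSchemaUpgrade {
        public static void main(String[] args) throws Exception {
            try (Connection replica = DriverManager.getConnection(
                    "jdbc:mysql://slave.example.com/app", "dba", "secret");
                 Statement st = replica.createStatement()) {
                st.execute("STOP SLAVE");                   // 1. pause replication
                st.execute("ALTER TABLE users "             // 2. only the slave pays
                        + "ADD COLUMN locale VARCHAR(10) "  //    the ALTER TABLE cost
                        + "NOT NULL DEFAULT 'en_US'");
                st.execute("START SLAVE");                  // 3. let it catch up
            }
            // Steps 4-7 (failover and repeating the ALTER on the old master)
            // belong to your orchestration tooling.
        }
    }
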
Alternatively, you can perform a similar maneuver at the table level (a sketch follows this list):
  1. Create a new table, table_v2, which is a structurally (DDL) updated version
  2. Create insert and update triggers on the original table to also copy data over to table_v2
  3. After the triggers are working, note the latest record in the original table
  4. Using an INSERT … SELECT, start sequentially copying the data over into the new table in small chunks. Chunking is done by bounding each SELECT, for example by key range
  5. Once the data has been copied over (surpassing the latest record), clients can start migrating over to the new table
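
Here is that maneuver as a minimal sketch, again assuming MySQL over JDBC, with a hypothetical two-column users table:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class TableLevelMigration {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection(
                    "jdbc:mysql://db.example.com/app", "dba", "secret");
                 Statement st = db.createStatement()) {
                // 1. Structurally updated version of the table.
                st.execute("CREATE TABLE users_v2 ("
                        + "id BIGINT PRIMARY KEY, "
                        + "email VARCHAR(255) NOT NULL, "
                        + "locale VARCHAR(10) NOT NULL DEFAULT 'en_US')");
                // 2. Triggers keep new writes flowing into users_v2.
                st.execute("CREATE TRIGGER users_ai AFTER INSERT ON users FOR EACH ROW "
                        + "REPLACE INTO users_v2 (id, email) VALUES (NEW.id, NEW.email)");
                st.execute("CREATE TRIGGER users_au AFTER UPDATE ON users FOR EACH ROW "
                        + "REPLACE INTO users_v2 (id, email) VALUES (NEW.id, NEW.email)");
                // 3. Note the high-water mark before backfilling.
                long maxId;
                try (ResultSet rs = st.executeQuery("SELECT COALESCE(MAX(id), 0) FROM users")) {
                    rs.next();
                    maxId = rs.getLong(1);
                }
                // 4. Backfill in small chunks; INSERT IGNORE skips rows the
                // triggers have already copied.
                for (long lo = 0; lo < maxId; lo += 1000) {
                    st.executeUpdate("INSERT IGNORE INTO users_v2 (id, email) "
                            + "SELECT id, email FROM users "
                            + "WHERE id > " + lo + " AND id <= " + (lo + 1000));
                }
                // 5. Clients can now migrate over to users_v2 at their own pace.
            }
        }
    }
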
Excuse some of the complexity, but the point is to illustrate that there is a lot of flexibility in how one can rework the database with minimal interruption and, more specifically, minimal impact on the application deployment processes. Lastly, I should mention that there are a number of existing tools that can help guide you through these steps; they are database dependent, however.

Effect on Consumers [Updated: 2015-01-11]

Greater stability in the public API of the database will have effects on your application code. Previously, one would have to update all possible clients (web service controllers, queue producers/consumers, etc.) simultaneously; now these updates can be done piecemeal. Your application will have to be able to deal with data potentially not being present; default values and the Option type help here.
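
For example, a minimal sketch of a client-side representation (hypothetical names) that tolerates a column that may not be populated yet:

    import java.util.Optional;

    public class PersonView {
        private final String name;
        private final Optional<String> locale;  // column added by a newer schema version

        public PersonView(String name, String locale) {
            this.name = name;
            this.locale = Optional.ofNullable(locale);  // null until the backfill completes
        }

        public String name() {
            return name;
        }

        public String locale() {
            return locale.orElse("en_US");  // a sane default while old rows linger
        }
    }
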
Database access should seem like a regular network call. Some APIs tend to paper over this, but retries, timeouts, and bulkheads all apply; libraries like Hystrix help here. I'd like to specifically mention LIMIT clauses in SELECT statements: they're underused, and unbounded collections are not sound engineering. I'm very guilty of having ignored this for a long time. We did some analysis of our data recently and found a number of people in our database who had enormous collections associated with them, such as phone numbers, emails, and addresses in the hundreds or thousands. Fun things happen: our export system crashes on these, timeouts appear at random when the database is overloaded, and the UI is really slow to load. Honestly, does someone really need more than ten emails, phones, or addresses associated with them? Start with a reasonable number and bump it from there. I'm starting to do this, and it's stopping me from over-engineering and, most importantly, from under-engineering.
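
A sketch of both ideas together, with hypothetical table and column names: a bounded query wrapped in a Hystrix command, so the database call gets a timeout, circuit breaking, and a fallback like any other network call.

    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class EmailsForPerson extends HystrixCommand<List<String>> {
        private final Connection db;
        private final long personId;

        public EmailsForPerson(Connection db, long personId) {
            super(HystrixCommandGroupKey.Factory.asKey("PersonDb"));
            this.db = db;
            this.personId = personId;
        }

        @Override
        protected List<String> run() throws Exception {
            // Bound the collection: ten emails is plenty to render.
            try (PreparedStatement ps = db.prepareStatement(
                    "SELECT address FROM emails WHERE person_id = ? LIMIT 10")) {
                ps.setLong(1, personId);
                List<String> addresses = new ArrayList<>();
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        addresses.add(rs.getString(1));
                    }
                }
                return addresses;
            }
        }

        @Override
        protected List<String> getFallback() {
            return Collections.emptyList();  // degrade gracefully instead of cascading
        }
    }

Callers then run new EmailsForPerson(connection, personId).execute() and get at most ten addresses, or an empty list when the database is struggling.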

Caution

What deployment and operational sanity giveth, inconsistent models and undisciplined software engineering taketh away. This is awesome in most contexts I find myself in, but if I am dealing with a tiny (in usage and code) intranet application, it is just more work. This is also where much of my old way of thinking came from; it's a matter of scale.

Microservices [Updated: 2015-01-11]

Omar asked whether microservices, due to their single datastore, avoid these problems. Possibly. I think the biggest problem is that microservices are poorly defined, so I'll try to tackle the two-ish definitions I know.

Pure Microservice

If your microservice is "truly" micro (1k-ish lines of code) and really only does one thing, then having a single subordinate data store doesn't really hurt, except for the whole unpredictable-deployment issue. My worries can be summarized in the following example: consider an HTTP service that performs the usual CRUD operations on some data type Foo. Say you want to import Foos; this is a long-running process and really doesn't belong in your existing microservice. So do we create a new microservice? But making a separate data store for it doesn't sound right either: are we really going to fire off one HTTP/network call from the import service per item being imported? Now we might come around to shared data stores, but then we have more than one client. This is exactly where I would throw up my hands.

Aggregates/Bounded Contexts

This is where your service is kept as small as possible, only acting over a minimal set of related Aggregates forming a Bounded Context, which should be conceptually no larger than what one team can understand. This case is similar, and hopefully where you end up if you've tried the pure method. My current situation is: an HTTP service, long-running tasks (producers/consumers), replication into our reporting database, and a legacy application. We have at least four types of API clients. We're working on killing the legacy application, but that will only get us down to three. Coordinating a deployment, and trying to get a consistent (in versions) system across that many moving parts, has been painful.

Next Steps

This is still fresh for me. I'm going to start working on presenting this line of reasoning to others; I've already had a chance to with Jeremy, and you're seeing the distillation here. I think people who really "get it" knew this all along but didn't necessarily convey it, because it seems so obvious. For me, it's been a revelation.

Conclusions

Ultimately, once you've done this database stuff long enough, you probably arrive at the same conclusions, but from all accounts via a long, painful road. I feel that if people start off thinking about their databases correctly, they will naturally avoid much of the pain.

Acknowledgements

The content of this blog post started long ago, influenced by many things, but mostly by conversations with Jeremy about where database migrations should live and who or what should initiate them; none of this would have happened without him.

Sunday, December 28, 2014

Comparing Dropwizard and Ratpack

This originally started as a reply to this email thread. As I started typing, it grew much longer, so I'm cross-posting it here.

I've been looking at Ratpack on and off for the last six months or so, and I've also been working heavily with Dropwizard in the meantime.

Dropwizard, like Ratpack, is small. The focus in Dropwizard is JSON HTTP services, with a strong emphasis on operations; here is a flavour of what it offers:

Authentication: Dropwizard offers basic HTTP auth and an OAuth2 client (you still need to set up something else to provide bearer tokens).

Deployment: A single fat jar that can be run as a server (given a configuration file), used to execute database migrations, and used to perform a variety of other tasks.

Metrics: The DataSource for database connections is the one from the metrics library, which means all queries are metered and can trivially be fired into something like Graphite. HTTP endpoints can also be easily metered, timed, and watched for exceptions. All of this means that your application provides a lot of operational observability. HTTP clients are also provided by the metrics library and will log any issues they encounter.
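
As an illustration, metering an endpoint is a one-annotation affair. A minimal sketch with a hypothetical resource class:

    import com.codahale.metrics.annotation.Timed;
    import java.util.Arrays;
    import java.util.List;
    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.MediaType;

    @Path("/people")
    @Produces(MediaType.APPLICATION_JSON)
    public class PeopleResource {
        @GET
        @Timed  // request rates and latencies show up via the admin port's metrics
        public List<String> list() {
            return Arrays.asList("example");
        }
    }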

Logging: Logging is sane and set up uniformly across the various subsystems, so during development you can configure it to append to the console, while elsewhere all logs can go somewhere sensible such as /var/log/yourapp.log.

Database: Along with the metrics, queries are executed with a comment identifying where in the code they came from, though this may be a Hibernate-specific feature.

Health Checks: A Dropwizard application without any configured health checks will emit a warning on startup each and every time; this encourages you to write at least minimal checks.
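
A minimal sketch of such a check, assuming a DataSource is on hand:

    import com.codahale.metrics.health.HealthCheck;
    import java.sql.Connection;
    import java.sql.Statement;
    import javax.sql.DataSource;

    public class DatabaseHealthCheck extends HealthCheck {
        private final DataSource dataSource;

        public DatabaseHealthCheck(DataSource dataSource) {
            this.dataSource = dataSource;
        }

        @Override
        protected Result check() throws Exception {
            try (Connection c = dataSource.getConnection();
                 Statement st = c.createStatement()) {
                st.execute("SELECT 1");  // a cheap round trip to the database
                return Result.healthy();
            }
        }
    }

It would then be registered at startup, along the lines of environment.healthChecks().register("database", new DatabaseHealthCheck(dataSource)).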

Banners: You can add startup ASCII art banners to your app. Say what you will, but I'd miss this if it were taken away.

Migrations: Integration with Liquibase is provided (I believe Flyway is available as well) by way of invoking commands on the final compiled fat jar.

Admin Port: Besides your application port, there is a separate admin port providing access to health checks, metrics, and various background tasks that one may run as part of operations. This is awesome until you're running on Heroku (we're not, but I've yet to see a reasonable solution, as Heroku only offers one routable port). For Ratpack, one could just create a branch in one's handlers at the root and require HTTP basic auth, a pre-shared key, or something similarly simple over HTTPS to get the same effect, but a default implementation would be nice.

Testing: It's nice to see lots of testing levels documented and supported.

Design Opinions:

1) Dropwizard's docs encourage you to split your project (service) into three Maven modules: one is the server, one is the client, and the last is the API (POJOs representing the JSON contract). This way you can trivially avoid dependency cycles.

2) Dropwizard heavily favours JSON (big fan here) and uses the Jackson library for encoding/decoding.

3) The page/static asset serving portion of Dropwizard is somewhat of an afterthought, though sufficient for most needs. Personally, this makes a lot of sense to me, as I think most apps (often single-page/fat-client) should just serve static assets. Most HTML/CSS/JS development should be supported by things like Node, which provide first-class integration with Less, linting, and all sorts of other tooling "native" to that environment.

Shortcomings, after using Dropwizard:

1) First-class integration of Hystrix or the like isn't there. I think that rather than providing HTTP clients, it should just have a factory/builder specifying the endpoint, the payload, and the nature of the call (can-fail-with-fallback, cannot-fail, etc.), and then let it handle the rest; then layer another level of metrics atop this to be able to see the overall health of requests. There is an "out of band" module for this, but again: first-class.

2) No trivial hot code reload, as builds are packaged via maven-shade; this is especially annoying when running multiple services simultaneously. This could be my fault.

3) A nice-to-have: something around using a Dropwizard service to wrap long-running worker supervision, basically long-running tasks in some thread pool, with their status/progress exposed.

4) Migrations can only be run from the command line. I'm not sure it would be a good idea to allow a web request to trigger them, but it might be easier for systems like Heroku, and generally for orchestration tools operating over the network.

5) No built-in/encouraged way to put your app into maintenance mode. This would be far easier in Ratpack, as you can go early in the handler chain and check for a maintenance flag, effectively making a simple decision tree.

First-blush shortcomings of Ratpack:

1) The operations emphasis isn't there (see above)

2) No persistence module with something like jOOQ

Lessons learned while working with Dropwizard:

0) Cap the memory usage of the JVM, constrain your various connection pools, and grow them as you need them; otherwise you'll be hurting. This was more of a "me being new to the JVM" thing, assuming/hoping more kid gloves were being employed.

1) Dependency injection is fantastic; IoC containers are the devil. Magic annotations and action at a distance are pure evil: it's so much easier to have constructor injection and plain old Java code creating objects. If something is requiring you to use these evil tools, there are other anti-patterns at play that are painting you into this corner. Many of our bugs, gotchas, etc. have been around action at a distance and dependency injection breaking the chain of type safety. Just don't do it. Ratpack seems better in this regard because it's easier to get at the lifecycle of the application programmatically (please make sure you never give this up). I can't emphasize this enough: plain old language-level composability is king.
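
A toy sketch of what I mean, with hypothetical classes: the entire object graph is wired by hand, so a missing dependency is a compile error rather than a runtime container exception.

    class Repository {
    }

    class Service {
        private final Repository repository;

        Service(Repository repository) {  // the dependency is explicit and type-checked
            this.repository = repository;
        }
    }

    public class Main {
        public static void main(String[] args) {
            // Plain old Java composition at startup: no container, no annotations,
            // no action at a distance.
            Service service = new Service(new Repository());
            System.out.println("wired: " + service);
        }
    }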

2) Separate your read and write representations, as even in trivial cases they're different. Don't share code here; it's a Pyrrhic victory. This simplifies a great deal by getting rid of spurious control flow, and it meshes really well with Java 8 and the streaming APIs. Note: I'm not advocating full-blown CQRS + event sourcing.

3) Avoid Hibernate: it conflates your read and write models, and it just isn't sufficiently typesafe. Java isn't Haskell, but it still has a good type system that you can easily push, even if it means the occasional superfluous object created just to make a type.

The Bigger Picture

This next part is very much my opinion, and I'm attempting to look at the bigger picture; I hope it proves useful, as these are the thoughts running through my head as I write this. I review the code of the majority of services where I work. Over and over again, the mistakes I see would only partially be covered by a better type system, a more expressive language, unit tests, integration tests, or pure functions. The vast majority of truly punishing mistakes are around unsuccessfully reasoning about failure. It's a skill not taught to the vast majority of programmers, and I'm only now learning it by bumping into walls and finding the occasional gem from which to learn. So after a language/framework/library helps you get rid of the obvious 'spelling' errors and the honestly trivial logic mistakes, we're left with the actual failures: the ones that are hard to test and, moreover, hard to think of.

As an example, let's say we have a monolithic app, without the distributed-systems headaches. We have hundreds or thousands of unique queries across our system, and the vast majority of them are without a LIMIT clause. All it would take is one writer bugging out, or one user importing a pile of data, and all readers of that data are in for a world of hurt. I'm cheating here, because even "monolithic" apps are distributed systems, but if said data explosion is near your core data model, I suspect cascading failures.

Frameworks like Dropwizard are really interesting to me in that they're taking very solid lessons from things like The Twelve-Factor App (http://12factor.net/) and Designing and Deploying Internet-Scale Services (http://mvdirona.com/jrh/talksandpapers/jamesrh_lisa.pdf), and distilling them. More generally, I think libraries/frameworks that create tools to support these as an "engineer's checklist" would dramatically improve our chances of building robust systems.

Thursday, October 21, 2010

[DEPRECATED] Interview Process

Update
I consider this horribly out of date and I'm only leaving it here for posterity; much of what I used to do has been dropped. I might write an update post one day, but my new process is more exploratory and less adversarial.

The Setup

This is an overview of the interviewing processes I've formulated over the years, along with some background. The post itself was inspired in part by a conversation, and by reading an incredulous look at claims of job-candidate failure rates on Hacker News.

Different Strokes for Different Folks

Primarily there are two formats I've run: one is a bog-standard interview (60 minutes); the other is much more involved, spanning multiple interviews/sessions. The former was simply me being asked to assess technical skills based on a resume, while the latter was a process designed with a dev team a couple of years back.

The standard interview, at least its time-box aspect, was ostensibly foisted upon me. My employer at the time walked up to me, handed me a resume, and told me I'd be interviewing someone the day after. All I had to go on was the resume, and I could only spare 30 minutes of prep time, so the resume was my focus. It was only in the latter part of that first interview, when I realised I had been grilling the candidate rather hard and wasn't feeling entirely convinced they were going to cut the mustard, that I felt sorry for them and tried to find out whether they might have other redeeming qualities. That's when I asked a few prying questions and stumbled upon the three that would become the core of this interview format going forward.

Hold the Cream and Sugar

The standard interview has a fairly simple premise: it's based on assuming the worst, that the person I'm interviewing is a flunky and a liar. The prep is simple: grab a pen and mark up their resume with incredulous comments about their accomplishments, plus a quick checklist of the major items you care about that are relevant to the position (TDD, CI, DDD, SOLID, design patterns, etc.). These act as a guide as I probe into the person's past and ensure that they're on the up and up. This deep dive into their professional background will end up taking the majority of the allotted time. If things are going well, I break into the actual interview questions, which, outside of the veracity of their resume, are what I've been after all along. The list is short, but given the time frame it makes sense. Here they are, along with my expectations and insights:

1) What are you excited about in regards to software engineering/computing?

Here I'd like the candidate to talk about software, technology, architecture, industry trends, process, etc.: anything to show that they're engaged, that they're passionate about computing, and that they are so now. This can also be an opportunity to get insight into how current they are; mind you, this is not going to be a definitive answer in that regard, as it entirely depends on why they're excited about the thing in question.

2) Do you have a reading list, informal or otherwise: Slashdot, blogs, mailing lists, books, documentation, papers?

When push comes to shove, anyone remotely worthwhile reads, and reads a lot, even if they don't think they do. It's a matter of life and death for one's career. I tend to afford a lot of leniency when getting answers to this question, prying and digging deeper, asking whether they: read any blogs regularly; have an aggregator, such as Google Reader; frequent sites like Hacker News or Slashdot; have read any academic or white papers; subscribe to mailing lists, such as a newsgroup for an OSS project. Quite literally anything will do, but nothing at all is a non-negotiable game over, in my mind; though I'm not convinced that's better or worse than, "If I get stuck, I Google around and find the answer." I believe what bothers me most about the latter is that the only time such a person seeks new knowledge is when they're observably stuck. It puts into question how proactive they are and, moreover, how capable they are of identifying that they're in trouble in the first place. In my view, as developers, our situation is as follows: we read code, we read emails, we read commit logs, we read documentation; we read all the time. We have to love reading; performing it as a mere job function doesn't suffice. Not only that, the inability, or lack of motivation, to stomach technical material is not an option. Lastly, and most importantly, reading is the gateway to learning, and a reading habit is a strong indication of the desire to educate oneself.

3) What would you like to tell me about yourself that I haven't asked you about, something that I should really know?

Here the candidate can talk about something that wasn't on the resume or didn't come up, or revisit a prior thread from the interview. It's open mic; they shouldn't let it go to waste. This is merely a last chance for them to make their 'pitch'.

As for the "plain Jane" interview, that's it. Those three questions are really what I'm after. If the resume has made it this far, and their story checks out, they're probably plenty qualified at a technical level.

How it happened

The lengthy and involved interview process was only run a few times and needs to be iterated upon further; moreover, the initial runs were not carried out entirely to my liking. It would be a disservice to simply render it here as the final product, since seeing its evolution is key to highlighting the impetus behind its design. So, in order to ensure that it can be taken and carried forward successfully, I'm providing a summary of its history.

The first step is filtering the resumes. Simply look for the major no-nos: typos, grammatical errors, etc. The only additional criteria were a Linux background, so we didn't have to teach them the basics (that would have been torture), and experience of a quality and quantity commensurate with the role. I could go into a rant about people lacking a Unix background; I won't, but I will leave you with this: of the people who jump from one of those environments to the other, I find one group has little issue gaining competency in its new environment, while the other is a liability.

The first interview was a short, 10-minute chat to make sure they were not crazy or smelly, and then they were handed the test. The test was as follows. The first page, with which I disagreed strongly but the rest of the team wanted, was 8 or 10 ridiculously esoteric PHP questions about obscure semantics of rarely used PHP functions. The next page was the famous, or infamous as the case may be, FizzBuzz (a reference implementation follows this paragraph), answerable in any language with which they were comfortable. I somewhat agree with the author that some might take FizzBuzz to be insulting. Perhaps we should have put up a warning, to the effect of, "Please don't be insulted, this is a simple litmus test", and another, "Please walk through your code; there are some very subtle but easy mistakes to make if you rush". The last page was three questions: 1) name two variables in my.cnf, which ensured that they had at some point configured MySQL; 2) design a normalized table structure for about 10 or 12 items of a person's bio: name, address1 and address2, phone1, phone2, etc.; 3) a simple SELECT statement. Afterwards the team reviewed the test outcome, primarily focusing on FizzBuzz, and we were very forgiving when marking it, making sure the logic was sound while treating syntax errors as perfectly fine. As for normalisation, we expected three tables, though we were willing to accept less normalized designs so long as there was a blurb showing the candidate recognized what they had done. If we were happy with the performance, on to the next step.
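
For reference, the classic FizzBuzz, sketched here in Java (candidates could use any language they liked):

    public class FizzBuzz {
        public static void main(String[] args) {
            for (int i = 1; i <= 100; i++) {
                if (i % 15 == 0) {
                    System.out.println("FizzBuzz");  // divisible by both 3 and 5
                } else if (i % 3 == 0) {
                    System.out.println("Fizz");
                } else if (i % 5 == 0) {
                    System.out.println("Buzz");
                } else {
                    System.out.println(i);
                }
            }
        }
    }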

Next, the candidate would be given a simple take-home: they were provided a sandbox and asked to read the data off the file system and output a web page with some formatting. Just some simple code; we wanted to see their approach. I think the test was too simple, though: it didn't allow anyone to flex any real OOAD skills without producing an embarrassingly obese solution. It's best done procedurally, and it was therefore largely a failure as an attempt to get at any design sense.

On to the real technical interview. Here I'm going to switch tenses; more to the point, I'm going to describe the intention. The problem is simple: give the candidate a whiteboard and the floor, in front of at least some, or all, of the team they'll be working with. Then ask them to design Netflix. Having them dive in head first will give you an idea of the things they care about and are confident with; see what sort of questions they have, and act more like the business owner at this point. Now structure it: ask them some questions about the areas they haven't covered, and expose the things they didn't think about or were trying to skirt. We had a list of things we were hoping for: something about the web presentation layer, something about the domain, something about code organisation, something about server architecture, and something about database design. These are simple expectations that the team can come up with according to the position, spanning the major concerns of the business at the time; if you haven't guessed it, you should be wearing the fellow-programmer hat at this stage. In the previous two stages it's important to help coach the candidate through any 'dead air'; make sure they don't feel they're being grilled, and help keep them moving. Otherwise, you're wasting everyone's time. Each of the stages should last approximately 20 minutes. Then let them ask the team questions: talk about fit, talk about culture, talk about inappropriate humour at the office. The team needs to open up and let the candidate know exactly what they're getting into; this is going to be a new family, after all.

It's My Party

Currently, this would be my game plan, at least for a product-oriented shop; services programming is another matter entirely.

Have a member of the team do an initial smell test, have the business evaluate the candidate's soft skills, and end off by giving them a programming test. The test should consist of some sort of FizzBuzz and whatever else is relevant to the industry.

If things are still green, have them do a take-home test, for which I would give them a large problem and ask them to get as far as they can -- an idea lifted from a colleague who did this with me and others he's interviewed. If time permits, you could do even better by having them come in and pair for an hour or two with a senior team member on a problem.

Satisfied with their performance on the take-home, ask them to come in. Review the test with them, and then do a shortened version of the incredulous look at their resume; in particular, incorporate the insights gained from reviewing their test. Finally, ask them the three questions to find out whether they're going to do more than contribute at the level they're currently at: are they going to grow?

Then set up the team interview, "The Netflix Design Session", or frankly any other large system they're familiar with that would be a good way to stress their design skills.

If they're good to go, then let the business work out the hiring details.

In the case where you have to turn the candidate away, don't waste their time, and don't help perpetuate the issue. Give them a 15-to-30-minute call and coach them: tell them their sore spots outright, and hopefully they'll attack them. You could even give them a timeline to shoot for in completing those objectives, and ask them to reapply when they feel they've had a chance to address those issues.

Perspective

All that said and done, you can be a vicious interviewer and cut apart everyone who comes before you, and I could have done that too. But I try not to: everyone has a path that they follow, and whether you like it or not, everyone is a work in progress, and you're going to be a part of finishing it up. Decide on a reasonable baseline, which is determined by a lot of things: the work, its quality and demands; the location; compensation; benefits; and the learning. I know many people want the best, but they fall short on two key points: first and foremost, chances are they themselves are not the best, and so the cream of the crop will not bother with them; secondly, they can't afford it. When push comes to shove, Google pays six figures, the stock is ridiculous, there are signing bonuses, etc. For the Vancouverites: I imagine California is a pretty awesome place to live, so the entire location thing isn't as big a deal. At the end of the day, you can be very stringent and only take the awesome candidates (read: no one, for the aforementioned reasons), or you can be realistic and realise that there is a middle ground: a minimum level of knowledge, and the character required to do the job well.