1298 points | jgrahamc

jgrahamc:
I posted this hours ago and then stepped away. The story captures so much about the Lee I knew so well. I'll add one piece of praise for Lee's early architecture of Cloudflare.

Everything was controlled by a single Postgres database that made very heavy use of stored procedures: procedures that called other procedures, which in turn called others. It was one giant program inside the database. It took me a while to comprehend what he'd done, but it was really great. The database ran everything, and all those functions made sure that audit logs were kept and that calls were authorized for the user ID being passed in; some of the procedures even made external API calls, for things like obtaining SSL certificates.

It was a magnificent monolith inside a database.
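To give a flavour of the shape (an invented sketch, not Lee's actual code): each entry point was a function that did its own authorization check, its work, and its audit logging, all atomically:

    CREATE OR REPLACE FUNCTION update_zone_setting(
        p_user_id int, p_zone_id int, p_value text
    ) RETURNS void AS $$
    BEGIN
        -- authorization, delegated to yet another stored function
        IF NOT user_may_edit_zone(p_user_id, p_zone_id) THEN
            RAISE EXCEPTION 'user % may not edit zone %', p_user_id, p_zone_id;
        END IF;

        UPDATE zone_settings SET value = p_value WHERE zone_id = p_zone_id;

        -- the audit row commits (or rolls back) with the change itself
        INSERT INTO audit_log (user_id, zone_id, action, at)
        VALUES (p_user_id, p_zone_id, 'update_zone_setting', now());
    END;
    $$ LANGUAGE plpgsql;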

I worked on the periphery of the database (it was truly Lee's domain): he'd tell me what output to expect or what API to create, I'd code to his spec, and we'd just hook it up.

If any single artefact represents what he did at Cloudflare, it's that database. And he used to code it on a laptop we called "The Beast" because it was so crazily heavy and overloaded with memory that it amounted to a miniature test Cloudflare he'd carry wherever he went.

mynameishere:
Working with stored procedures is just the pits. I wonder why you would describe that as "great". Maybe it's optimized, but so would be a giant executable written in assembly.
crazygringo:
As long as you have the necessary tooling for source control, versioning, code search, etc. -- why is it the pits?

I mean, if you stick to a rigorous procedure of automatically exporting your stored procedures to files and committing them to git with messages... how is that different from any other development?

Serious question -- I've done tons of database work but only a little bit with stored procedures, so wondering if I'm missing a crucial detail here.

ggreer:
Let's say that when you deploy, you want to direct 1% of traffic to the new code, then 5%, then 25%, then 50%, and finally 100%. And if monitoring metrics go red at any point, automatically roll back. A typical canary deployment.

How do you do that with a sproc? With normal application code, you can have the load balancer send an arbitrary percentage of traffic to the new instances of the service. A sproc binds the code and the data together. But you only have one database, so there's no way to do a gradual deploy of the code. That would be like doing a gradual deploy of a schema migration. It's only possible if you have a lot of databases.

Also, unless you're using MS SQL Server, debugging sprocs is a pain. The only workaround I've seen is to add a debug or log-level arg to the sproc. But actually pausing execution and inspecting data? Good luck.

dragonwriter:
> Let's say that when you deploy, you want to direct 1% of traffic to the new code, then 5%, then 25%, then 50%, and finally 100%. And if monitoring metrics go red at any point, automatically roll back.

> How do you do that with a sproc?

Wrap the real sproc (well, the set of real sprocs, since old and new will sometimes coexist) with a wrapper sproc that probabilistically directs to the appropriate working sproc based on a table value (or anything else the sproc can read) that is updated to reflect the current stage in the conversion process, including any monitoring-based reversals.
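A minimal sketch of that in PL/pgSQL (all names invented), assuming the rollout fraction lives in a config table that the deploy tooling updates:

    CREATE TABLE rollout_config (
        sproc_name   text PRIMARY KEY,
        new_fraction double precision NOT NULL DEFAULT 0  -- 0.0 .. 1.0
    );

    CREATE OR REPLACE FUNCTION some_procedure(p_arg int)
    RETURNS int AS $$
    DECLARE
        frac double precision;
    BEGIN
        SELECT new_fraction INTO frac
        FROM rollout_config
        WHERE sproc_name = 'some_procedure';

        IF random() < coalesce(frac, 0) THEN
            RETURN some_procedure_v2(p_arg);  -- canary path
        ELSE
            RETURN some_procedure_v1(p_arg);  -- stable path
        END IF;
    END;
    $$ LANGUAGE plpgsql;

Advancing the canary, or rolling it back, is then a single UPDATE on rollout_config.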

ggreer:
How do you correlate those errors at the application level? If only 1% of sproc calls are the new sproc, any elevated error rates in the application will be hard to see over typical noise.
dragonwriter:
> How do you correlate those errors at the application level? If only 1% of sproc calls are the new sproc, any elevated error rates in the application will be hard to see over typical noise.

By using the DB's error reporting features to report back errors to the application, including information about which ultimate implementation was handling the request.
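For example (a sketch building on the hypothetical wrapper above), the wrapper can return which implementation ran and tag any error with it:

    CREATE OR REPLACE FUNCTION some_procedure(p_arg int,
                                              OUT result int,
                                              OUT impl   text)
    AS $$
    BEGIN
        IF random() < 0.01 THEN
            impl := 'v2';
            result := some_procedure_v2(p_arg);
        ELSE
            impl := 'v1';
            result := some_procedure_v1(p_arg);
        END IF;
    EXCEPTION WHEN others THEN
        -- re-raise with the implementation attached, so the application
        -- can bucket its error metrics by sproc version
        RAISE EXCEPTION '% (impl=%)', SQLERRM, impl;
    END;
    $$ LANGUAGE plpgsql;

The application reads the impl column on success and parses it out of the error message on failure, so both error rates and latency can be sliced per version.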

ggreer:
That only covers errors that occur inside the sproc, not errors that are caused by the sproc. For example: What if the new version of the sproc succeeds, but the data it returns causes the application code to fail in some way? The sproc metrics would report 100% success and the application error rate wouldn't get above noise until a significant fraction of traffic was using the new buggy version of the sproc.

It's possible to add more epicycles to try and correct for these deficiencies, but the truth is that sprocs are dangerously double-edged. Yes, your code runs next to your data, but to get parity with application code (canary deploys, debugging, logging, etc) you must reinvent lots of tooling.

KajMagnus:
@ggreer, the grand-grand...parent, wrote:

> there's no way to do a gradual deploy of the code

There could be version numbers?

    some_procedure_v1()
    some_procedure_v2()
and the new v2 application server code calls the v2 stored procedures. And if the v2 app server code gets rolled back / cancelled, then the v2 stored procedures automatically won't be used any more.
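Concretely, in Postgres that might look like this (an invented sketch; table and columns are made up):

    -- v1 stays in place for the old application servers
    CREATE FUNCTION get_user_v1(p_id int) RETURNS text AS $$
        SELECT u.name FROM users u WHERE u.id = p_id;
    $$ LANGUAGE sql;

    -- v2 changes the contract (returns the email too); v1 callers are untouched
    CREATE FUNCTION get_user_v2(p_id int, OUT name text, OUT email text) AS $$
        SELECT u.name, u.email FROM users u WHERE u.id = p_id;
    $$ LANGUAGE sql;

Rolling the app servers back to v1 just leaves get_user_v2 sitting unused until it gets dropped.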
cpx86:
Exactly this. Treat the DB schema as you would any typical API schema. A lot of the techniques used for evolving application APIs can be used for sprocs and views as well, e.g. versioning for breaking changes, and adding optional parameters or new result fields for non-breaking changes. Fundamentally I don't think there's much difference between, say, a DB API defined with sprocs/views and an HTTP API defined with OpenAPI. Both describe an API contract between two remotely communicating processes; the former just happens to use a SQL dialect/transport to do it.
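For instance, in Postgres an optional parameter with a DEFAULT is a non-breaking change (invented names; note the old two-argument overload has to be dropped first, or two-argument calls become ambiguous):

    DROP FUNCTION IF EXISTS create_invoice(int, numeric);

    CREATE FUNCTION create_invoice(p_customer int,
                                   p_amount   numeric,
                                   p_currency text DEFAULT 'USD')
    RETURNS int AS $$
        INSERT INTO invoices (customer_id, amount, currency)
        VALUES (p_customer, p_amount, p_currency)
        RETURNING id;
    $$ LANGUAGE sql;

Existing callers of create_invoice(customer, amount) keep working unchanged.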
KajMagnus:
Interesting that you mention doing this (version numbers) with views too; I hadn't thought about that ...

... Maybe that could be a way to "preview" a database migration, before running the migration for real. There could be a view 'Some_table_view_v2' that shows how the table would look after an upcoming data migration, and the v2 app server code would use that new view (sketched below). Then one could be more certain that the data migration will work fine.

(At the same time, one might need to be careful to let just a small fraction of the requests use the new view, since the view could be somewhat, or a lot, slower than the real table.)
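As a rough sketch (invented schema), a view previewing a migration that splits a name column could be:

    -- how 'users' would look after the planned migration, computed on the fly
    CREATE VIEW users_v2 AS
    SELECT id,
           split_part(full_name, ' ', 1) AS first_name,
           split_part(full_name, ' ', 2) AS last_name,
           created_at
    FROM users;

and the v2 app server code would read from users_v2 until the real migration replaces it.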