626 points mrkurt | 4 comments | 18 Mar 20 14:15 UTC

Hello Hacker News! We're Kurt, Jerome, and Michael from fly.io (https://fly.io/). We're building a platform to run Docker applications close to end users. It's kind of like a content delivery network, but for backend servers.

I helped build Ars Technica and spent the majority of my time trying to make the site fast. We used a content delivery network to cache static content close to anonymous readers and it worked very well for them. But the most valuable readers weren't the anonymous ones; they were the ones who paid for subscriptions. They wanted personalized content and features for interacting with the community – and we couldn't make those fast. Content delivery networks don't work for Ars Technica's best customers.

Running Docker apps close to users helps get past the "slow" speed of light. Most interactions with an app server seem slow because of latency between the hardware it's running on (frequently in Virginia) and the end user (frequently not in Virginia). Moving server apps close to users is a simple way to decrease latency, sometimes by 80% or more.
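
To put rough numbers on the latency claim, here is a back-of-the-envelope sketch. It assumes light in fiber travels at about two thirds of its vacuum speed and ignores routing detours, queuing, and handshakes, so it gives a lower bound rather than a measurement. The distances in the usage note are made-up examples.

```python
# Lower-bound round-trip estimate from distance alone (illustrative only).
SPEED_OF_LIGHT_KM_S = 299_792.458
FIBER_FACTOR = 2 / 3  # assumption: light in fiber moves at roughly 2/3 of c


def min_rtt_ms(distance_km: float) -> float:
    """Best-case round-trip time over fiber, ignoring routing and queuing."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000


def reduction_pct(far_km: float, near_km: float) -> float:
    """Percent drop in best-case round-trip time when the server moves closer."""
    return (1 - min_rtt_ms(near_km) / min_rtt_ms(far_km)) * 100


if __name__ == "__main__":
    # Hypothetical user 6,000 km from the original server, 1,000 km from a closer one.
    print(f"far:  {min_rtt_ms(6000):.1f} ms")
    print(f"near: {min_rtt_ms(1000):.1f} ms")
    print(f"reduction: {reduction_pct(6000, 1000):.0f}%")
```

Under these assumptions the saving scales with the distance ratio, so moving a server from 6,000 km to 1,000 km away cuts the best-case round trip by a little over 80%. A real request usually needs several round trips (TCP and TLS handshakes, then the request itself), so the absolute saving is larger than a single round trip suggests.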

fly.io is really a way to run Docker images on servers in different cities and a global router to connect users to the nearest available instance. We convert your Docker image into a root filesystem, boot tiny VMs using a project called Firecracker (recently discussed here: https://news.ycombinator.com/item?id=22512196) and then proxy connections to them. As your app gets more traffic, we add VMs in the most popular locations.

We wrote a Rust-based router to distribute incoming connections from end users. The router terminates TLS when necessary (some customers handle their own TLS) and then hands the connection off to the best available Firecracker VM, which is frequently in a different city.

Networking took us a lot of time to get right. Applications get dedicated IP addresses from an Anycast block. Anycast is an internet routing feature that lets us "announce" from multiple datacenters, and then core routers pick the destination with the shortest route (mostly). We run a mesh WireGuard network for backhaul, so in-flight data is encrypted all the way into a user application. This is the same kind of network infrastructure the good content delivery networks use.

We got a handful of enterprise companies to pay for this, and spent almost a year making it simple to use. It takes 3 commands to deploy a Docker image and have it running in 17 cities: https://fly.io/docs/speedrun/. We also built "Turboku" to speed up Heroku apps. Pick a Heroku app and we deploy the slug on our infrastructure. Typical Heroku apps are 800ms faster on fly.io: https://fly.io/heroku/

We've also built some features based on Hacker News comments. When people launch container hosting on Hacker News, there's almost always a comment asking for:

1. gRPC support: apps deployed to fly.io can accept any kind of TCP connection. We kept seeing people say "hey I want to run gRPC servers on this shiny container runtime". So you can! You can specify if you want us to do TLS or HTTP for an app, or just do everything yourself.

2. Max monthly spend: unexpected traffic spikes happen, and the thought of spending an unbounded amount of money in a month is really uncomfortable. You can configure fly.io apps with a max monthly budget; we'll suspend them when they hit that budget and re-enable them at the beginning of the next month.
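
The budget behavior in item 2 can be sketched as a small state machine. This is a hypothetical illustration, not fly.io's billing code: the `App` class, its field names, and the dollar amounts are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class App:
    """Hypothetical app record with a monthly spending cap."""
    name: str
    monthly_budget_usd: float
    spent_usd: float = 0.0
    suspended: bool = False

    def record_usage(self, cost_usd: float) -> None:
        """Add usage cost; suspend the app once the budget is reached."""
        if self.suspended:
            return  # a suspended app accrues no further usage
        self.spent_usd += cost_usd
        if self.spent_usd >= self.monthly_budget_usd:
            self.suspended = True

    def start_new_month(self) -> None:
        """Reset spend and re-enable the app at the start of the next month."""
        self.spent_usd = 0.0
        self.suspended = False


if __name__ == "__main__":
    app = App(name="demo", monthly_budget_usd=10.0)
    app.record_usage(6.0)
    app.record_usage(5.0)   # crosses the cap, so the app is suspended
    print(app.suspended)
    app.start_new_month()   # next month: spend resets and the app runs again
    print(app.suspended)
```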

One of the best parts of building this has been seeing the problems that developers are trying to solve, often problems we didn't know about beforehand. My favorite is a project to re-encode MP3s at variable speeds for specific users (apparently the Apple Audiobook player has no option for playback speed). Another is "TensorFlow at the edge" — they trained a TensorFlow model to detect bots and run predictions before handling requests.

We're really happy we get to show this to you all, thank you for reading about it! Please let us know your thoughts and questions in the comments.

yingw787 ◴[18 Mar 20 16:05 UTC] No.22618138[source]▶

>>22616857 (OP) #

Actually, one more question...do you guys scale compute and data layers separately, or are they tightly coupled within the same container?

I was looking at containerized PostgreSQL on AWS because I want to colocate a job scheduling tool (pg_cron) with the database process, but RDS doesn't support that extension. Apparently (or at least I hope), ecs-cli compose supports docker volumes through EBS, which is the same base as EKS persistent volumes. There's next to no information for ECS + EBS though, everybody uses EC2 or full on EKS.

I was just thinking, if you needed to handle excessive read load on small quantities of data, having separate data layers would enable you to autoscale db instances while still having the same volumes, instead of using an entirely separate caching layer which could introduce bugs and increase maintenance overhead. If you guys had native HA with docker exec access and passed savings to consumers that would be huge for me and my use cases.

replies(2): >>22618330 #>>22618355 #

mrkurt ◴[18 Mar 20 16:23 UTC] No.22618330[source]▶

>>22618138 #

Right now we're best suited for app servers; databases won't (yet) run very well on fly.io. We are trying really hard to focus on what we have because it's so valuable, but we love DBs so much we might end up trying to "solve" them soon.

replies(2): >>22618406 #>>22618411 #

tarun_anand ◴[18 Mar 20 16:30 UTC] No.22618411[source]▶

>>22618330 #

But your most valuable customers will need to interact with an app server plus database for any real-life use case. Can you share some applications where only placing the app server close to the user works? Is the database back in Virginia?

replies(1): >>22618928 #

1. mrkurt ◴[18 Mar 20 17:16 UTC] No.22618928[source]▶

>>22618411 #

You are mostly right, but there are a surprising number of problems that don't need much database interaction. Lots of image generation, video workloads, game servers, etc.

One of the things we want to do, though, is make "boring" apps really fast. My heuristic for this is "can you put a Rails app on fly.io without a rewrite?".

Many of these applications add a caching layer. Normally if someone wants to make a Rails app fast, they'll start by minimizing database round trips and cache views or model data. If someone has already done this work, fly.io might just work for this app since we have a global Redis service (https://fly.io/docs/redis/).

We have experimented with using CockroachDB in place of Postgres to get us even farther, but it doesn't work with most frameworks' migration tools.

We're also thinking of running fast-to-boot read replicas for Postgres, so people could leave their DB in Virginia but bring up replicas alongside their app servers.

If you've seen anyone do anything clever to "globalize" their database we're all ears.

replies(2): >>22624965 #>>22644115 #

2. wasd ◴[19 Mar 20 06:06 UTC] No.22624965[source]▶

>>22618928 (TP) #

I’m extremely impressed with how slick your Heroku integration is. We thought about moving over to Render but the dev UX just isn’t there like Heroku. I would be fine with paying for a read replica on the west coast that was always running if you can make it as easy as the rest of your Heroku integration.

3. ignoramous ◴[21 Mar 20 03:22 UTC] No.22644115[source]▶

>>22618928 (TP) #

I've seen https://macrometa.co take a stab at an edge database, but their guarantees (consistency / correctness) don't really infuse any sort of confidence in me [0]. https://yugabyte.com is another global scale database that competes squarely with cockroach-db, though I haven't used either.

Cloudflare Workers KV has the simplest model: a central DB that transparently, and eventually, replicates only the read-only hot data specific to a DC, but writes continue to incur a heavy penalty in terms of operations per second, cost, and latency.

In our production setup, we back Workers KV with a single-region, source-of-truth DynamoDB [1] and employ DynamoDB Streams to push data to Workers KV [2], that is,

Writes (control-plane): clients -> (graphql) DynamoDB -> Streams -> Workers KV

Reads (data-plane): clients -> Workers KV

Reads (control-plane): clients -> (graphql) DynamoDB
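
The three paths above can be sketched with in-memory stand-ins. This is only an illustration of the data flow: `SourceOfTruth`, `EdgeKV`, and `drain_stream` are hypothetical names, and no real DynamoDB, DynamoDB Streams, or Workers KV API is used.

```python
from collections import deque


class SourceOfTruth:
    """Stand-in for the single-region database; each write emits a change record."""

    def __init__(self):
        self.items = {}
        self.stream = deque()  # pending change records, oldest first

    def put(self, key, value):
        self.items[key] = value
        self.stream.append((key, value))

    def get(self, key):
        return self.items.get(key)


class EdgeKV:
    """Stand-in for an eventually consistent edge key-value store."""

    def __init__(self):
        self.items = {}

    def get(self, key):
        return self.items.get(key)


def drain_stream(db: SourceOfTruth, kv: EdgeKV) -> int:
    """Apply pending change records to the edge store; return how many were applied."""
    applied = 0
    while db.stream:
        key, value = db.stream.popleft()
        kv.items[key] = value
        applied += 1
    return applied


if __name__ == "__main__":
    db, kv = SourceOfTruth(), EdgeKV()
    db.put("user:1", {"plan": "pro"})  # control-plane write
    print(kv.get("user:1"))            # data-plane read before replication
    drain_stream(db, kv)               # the stream consumer catches up
    print(kv.get("user:1"))            # data-plane read after replication
    print(db.get("user:1"))            # control-plane read, always current
```

The window between a write and the next drain is where data-plane reads can return stale data, which is the trade-off of keeping the source of truth in a single region.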

[0] https://news.ycombinator.com/item?id=19307122

[1] We really should switch to QLDB once it supports Triggers.

[2] We do so mainly because we do not want to be locked in to Workers KV, especially at its very nascent stage.

replies(1): >>22686494 #

4. ctesh ◴[25 Mar 20 17:10 UTC] No.22686494[source]▶

>>22644115 #

Hi ignoramous, founder and CEO of Macrometa here. I regret that our first attempt at explaining our consistency model caused confusion last year. Here's a link to the research paper that describes our architecture and consistency model.

https://bit.ly/HPTS-Macrometa

We got accepted to High Performance Transaction Systems (HPTS) last year for our innovations around CRDTs for strong eventual consistency (SEC) with low read and write latencies.
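
For readers new to CRDTs, here is a minimal sketch of one textbook example, a grow-only counter (G-Counter). It is a generic illustration of strong eventual consistency and says nothing about how Macrometa's database is built. The replica names are made up.

```python
class GCounter:
    """Grow-only counter CRDT: each replica only increments its own slot."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> highest count seen for that replica

    def increment(self, n: int = 1) -> None:
        """Local write: no coordination with other replicas is needed."""
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        """Take the per-replica maximum; commutative, associative, idempotent."""
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


if __name__ == "__main__":
    ams, syd = GCounter("ams"), GCounter("syd")
    ams.increment(2)   # writes land locally at each replica
    syd.increment(3)
    ams.merge(syd)     # replicas exchange state in any order
    syd.merge(ams)
    print(ams.value(), syd.value())
```

Because merging takes a per-replica maximum, replicas that have seen the same set of updates converge to the same value regardless of the order or number of merges. That is the property "strong eventual consistency" refers to.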

I'm trying to figure out how to provide a simple, lightweight way for fly.io users to use our global DB in their apps. It would allow a full stack to run at the edge, with the compute on fly.io and the data on Macrometa, either directly on fly.io or in a nearby PoP (same city). Will update.

Launch HN: Fly.io (YC W20) – Deploy app servers close to your users