626 points mrkurt | 4 comments | 18 Mar 20 14:15 UTC

Hello Hacker News! We're Kurt, Jerome, and Michael from fly.io (https://fly.io/). We're building a platform to run Docker applications close to end users. It's kind of like a content delivery network, but for backend servers.

I helped build Ars Technica and spent the majority of my time trying to make the site fast. We used a content delivery network to cache static content close to anonymous readers and it worked very well for them. But the most valuable readers weren't the anonymous ones; they were the ones who paid for subscriptions. They wanted personalized content and features for interacting with the community – and we couldn't make those fast. Content delivery networks don't work for Ars Technica's best customers.

Running Docker apps close to users helps get past the "slow" speed of light. Most interactions with an app server seem slow because of latency between the hardware it's running on (frequently in Virginia) and the end user (frequently not in Virginia). Moving server apps close to users is a simple way to decrease latency, sometimes by 80% or more.
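
A rough back-of-the-envelope sketch of that claim, assuming signals travel through fiber at about 200,000 km/s (roughly two thirds of the speed of light in a vacuum). The two distances are made-up illustrations, not measurements, and real round trips are slower because of routing and processing.

```python
# Approximate propagation speed in optical fiber: ~200,000 km/s = 200 km per ms.
FIBER_KM_PER_MS = 200.0


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber, ignoring routing and processing."""
    return 2 * distance_km / FIBER_KM_PER_MS


# Hypothetical distances, chosen only for illustration.
far_km = 8000.0   # user far away from a single region
near_km = 1000.0  # same user, with an instance in a nearby city

far_ms = min_round_trip_ms(far_km)
near_ms = min_round_trip_ms(near_km)
reduction = 1 - near_ms / far_ms

print(f"far: {far_ms} ms, near: {near_ms} ms, reduction: {reduction:.1%}")
```

With these assumed distances the floor on round-trip time drops from 80 ms to 10 ms, which is where a figure like "80% or more" can come from.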

fly.io is really a way to run Docker images on servers in different cities and a global router to connect users to the nearest available instance. We convert your Docker image into a root filesystem, boot tiny VMs using a project called Firecracker (recently discussed here: https://news.ycombinator.com/item?id=22512196) and then proxy connections to it. As your app gets more traffic, we add VMs in the most popular locations.

We wrote a Rust-based router to distribute incoming connections from end users. The router terminates TLS when necessary (some customers handle their own TLS) and then hands the connection off to the best available Firecracker VM, which is frequently in a different city.
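
To make "best available" concrete, here is a minimal sketch of one possible selection rule: pick the closest instance that is currently healthy. This is an illustration only, not the actual Rust router. The `Instance` type, the `healthy` flag, the city list, and the use of great-circle distance as the metric are all assumptions.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    """Hypothetical record for one running VM."""
    city: str
    lat: float
    lon: float
    healthy: bool


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points on Earth, in kilometers."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def pick_instance(user_lat: float, user_lon: float,
                  instances: List[Instance]) -> Optional[Instance]:
    """Return the nearest healthy instance, or None if nothing is available."""
    candidates = [i for i in instances if i.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda i: haversine_km(user_lat, user_lon, i.lat, i.lon))


# Hypothetical deployment; coordinates are approximate.
instances = [
    Instance("Ashburn", 39.04, -77.49, healthy=True),
    Instance("Amsterdam", 52.37, 4.90, healthy=True),
    Instance("Sydney", -33.87, 151.21, healthy=True),
]

# A user in Paris is routed to Amsterdam while it is healthy.
nearest = pick_instance(48.86, 2.35, instances)
print(nearest.city if nearest else "no instance available")
```

If the Amsterdam instance is marked unhealthy, the same call falls back to Ashburn, which matches the point that the chosen VM is often in a different city than the user.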

Networking took us a lot of time to get right. Applications get dedicated IP addresses from an Anycast block. Anycast is an internet routing feature that lets us "announce" the same addresses from multiple datacenters, and then core routers pick the destination with the shortest route (mostly). We run a mesh WireGuard network for backhaul, so in-flight data is encrypted all the way into a user application. This is the same kind of network infrastructure the good content delivery networks use.

We got a handful of enterprise companies to pay for this, and spent almost a year making it simple to use: it takes 3 commands to deploy a Docker image and have it running in 17 cities (https://fly.io/docs/speedrun/). We also built "Turboku" to speed up Heroku apps. Pick a Heroku app and we deploy the slug on our infrastructure. Typical Heroku apps are 800ms faster on fly.io: https://fly.io/heroku/

We've also built some features based on Hacker News comments. When people launch container hosting on Hacker News, there's almost always a comment asking for:

1. gRPC support: apps deployed to fly.io can accept any kind of TCP connection. We kept seeing people say "hey I want to run gRPC servers on this shiny container runtime". So you can! You can specify whether you want us to do TLS or HTTP for an app, or just do everything yourself.

2. Max monthly spend: unexpected traffic spikes happen, and the thought of spending an unbounded amount of money in a month is really uncomfortable. You can configure fly.io apps with a max monthly budget. We'll suspend them when they hit that budget, then re-enable them at the beginning of the next month.
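
The budget behavior in item 2 can be sketched as a small state machine. This is a hypothetical model for illustration, not fly.io's billing code: the `AppBudget` class, its field names, and the choice to stop accruing cost while suspended are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class AppBudget:
    """Hypothetical per-app spend tracker with a monthly cap."""
    max_monthly_usd: float
    spent_usd: float = 0.0
    suspended: bool = False

    def record_usage(self, cost_usd: float) -> None:
        """Add a charge; suspend the app once the cap is reached."""
        if self.suspended:
            return  # assumed: a suspended app is not running, so nothing accrues
        self.spent_usd += cost_usd
        if self.spent_usd >= self.max_monthly_usd:
            self.suspended = True

    def start_new_month(self) -> None:
        """Reset the counter and re-enable the app at the start of a month."""
        self.spent_usd = 0.0
        self.suspended = False


budget = AppBudget(max_monthly_usd=50.0)
budget.record_usage(30.0)   # under the cap, still running
budget.record_usage(25.0)   # crosses the cap, app is suspended
print(budget.suspended, budget.spent_usd)
budget.start_new_month()    # next month: counter reset, app re-enabled
print(budget.suspended, budget.spent_usd)
```

Note that the last charge can overshoot the cap in this sketch, since the check happens after the usage is recorded.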

One of the best parts of building this has been seeing the problems that developers are trying to solve, often problems we didn't know about beforehand. My favorite is a project to re-encode MP3s at variable speeds for specific users (apparently the Apple Audiobook player has no option for playback speed). Another is "TensorFlow at the edge" — they trained a TensorFlow model to detect bots and run predictions before handling requests.

We're really happy we get to show this to you all, thank you for reading about it! Please let us know your thoughts and questions in the comments.

a13n ◴[18 Mar 20 18:05 UTC] No.22619508[source]▶

>>22616857 (OP) #

If one API request makes on average 5-10 round trips to the database, and the database is in Virginia, this only makes the problem (much) worse. How do you solve this problem for this use case?
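
To put rough numbers on this concern, here is a sketch that assumes the database round trips happen one after another. All latencies are made-up illustrations, not measurements.

```python
def request_latency_ms(user_to_app_ms: int, app_to_db_ms: int, db_round_trips: int) -> int:
    """Total latency when the app makes its database round trips sequentially."""
    return user_to_app_ms + db_round_trips * app_to_db_ms


# Hypothetical numbers for a user far from Virginia.
# App and database both in Virginia: slow first hop, fast database hops.
central_ms = request_latency_ms(user_to_app_ms=80, app_to_db_ms=1, db_round_trips=10)

# App moved next to the user, database still in Virginia: the slow hop is now
# paid once per database round trip instead of once per request.
edge_ms = request_latency_ms(user_to_app_ms=10, app_to_db_ms=70, db_round_trips=10)

print(f"central: {central_ms} ms, edge: {edge_ms} ms")
```

Under these assumptions the request goes from 90 ms to 710 ms, so moving only the app server helps when database round trips are few or can be served from a nearby cache.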

replies(4): >>22619613 #>>22620066 #>>22620972 #>>22622289 #

michaeldwan ◴[18 Mar 20 18:15 UTC] No.22619613[source]▶

>>22619508 #

We're not solving db latency yet. A good place to start is aggressively caching at the edge. We offer an in-memory Redis cache for this that can replicate commands globally. Beyond that, you'd need read replicas, which will be possible once we launch persistent storage. That said, latency between data centers on the same continent is often less than I would have thought!
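
A minimal sketch of the "cache at the edge" idea, using a cache-aside pattern with a time-to-live. A plain in-process dictionary stands in for the Redis cache mentioned above. The `EdgeCache` class, the `slow_db_lookup` function, and the key format are hypothetical, and no fly.io or Redis API is used.

```python
import time
from typing import Callable, Dict, Tuple


class EdgeCache:
    """Tiny cache-aside helper: serve locally if fresh, otherwise load and store."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)

    def get_or_load(self, key: str, load: Callable[[str], str]) -> str:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]              # fresh hit: no trip to the far-away database
        value = load(key)                # miss or expired: pay the long round trip once
        self._store[key] = (now + self._ttl, value)
        return value


db_calls = []


def slow_db_lookup(key: str) -> str:
    """Hypothetical stand-in for a query against a distant database."""
    db_calls.append(key)
    return f"profile for {key}"


cache = EdgeCache(ttl_seconds=30.0)
first = cache.get_or_load("user:42", slow_db_lookup)
second = cache.get_or_load("user:42", slow_db_lookup)  # served from the local cache
print(first, second, len(db_calls))
```

The trade-off is staleness: within the TTL a reader can see data that has since changed in the database, so this suits data that tolerates being slightly out of date.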

replies(1): >>22619910 #

1. a13n ◴[18 Mar 20 18:45 UTC] No.22619910[source]▶

>>22619613 #

Gotcha, honestly that feels pretty niche to me, which might be a good place for a startup to start.

I can't think of many back-end applications that fall between purely static content (just use a CDN) and ones that need a database connection. Probably video game servers, where you don't need the game state to be (immediately) stored/accessed globally.

replies(2): >>22620433 #>>22621686 #

2. mrkurt ◴[18 Mar 20 19:40 UTC] No.22620433[source]▶

>>22619910 (TP) #

Game servers are a great example. What's interesting is how many different kinds of apps need game-server like infrastructure: https://www.figma.com/blog/how-figmas-multiplayer-technology...

We've been talking to a lot of startups doing communications tools, especially for remote work.

Lots of full-stack apps benefit from app servers + Redis cache in different regions. They need a database connection, but if they've already done the work to minimize DB round trips they might just work with no code changes.

There are also a bunch of folks doing really dynamic video and image delivery, where an individual user gets an entirely unique blob of binary data.

3. ShroudedNight ◴[18 Mar 20 21:59 UTC] No.22621686[source]▶

>>22619910 (TP) #

I could see IoT device 'acceleration' being a significant potential use-case. Something with a tiny bill of materials for the device itself, offloading any non-trivial processing to a virtual device on the closest 'real' infrastructure you can get. Especially for something human-interactive, you would want to be very aggressive about minimizing latency.

Also, depending on how tight the limits are for VM lifetime / bandwidth / outbound connections, I could see using these as a kind of virtual NIC / service mesh type thing for consumer-grade internet connections, to restore the inbound routing capabilities precluded by carrier-grade NAT and avoid their traffic discrimination, as well as potentially on-boarding to higher-quality transit as early as possible for use when accessing latency-sensitive services further 'interior' to the cloud.

replies(1): >>22621808 #

4. mrkurt ◴[18 Mar 20 22:09 UTC] No.22621808[source]▶

>>22621686 #

These are great. IoT seems like a thing you could do but that's a really specific use case I hadn't even considered.

The second example would be interesting to try. There's no real limit on VM lifetime or outbound connections; bandwidth is more of a budget problem. VMs are ephemeral, so they _can_ go away, but we're all happier if they just run forever.

Launch HN: Fly.io (YC W20) – Deploy app servers close to your users