Ask HN: How did the internet discover my subdomain?

I have a domain that is not live. As expected, loading the domain returns: Error 1016.

However...I have a subdomain with a not obvious name, like: userfileupload.sampledomain.com

This subdomain IS LIVE but has NOT been publicized/posted anywhere. It's a custom URL for authenticated users to upload media with a presigned URL to my Cloudflare R2 bucket.

I am using Cloudflare for my DNS.

How did the internet find my subdomain? Some sample user agents are: "Expanse, a Palo Alto Networks company, searches across the global IPv4 space multiple times per day to identify customers' presences on the Internet. If you would like to be excluded from our scans, please send IP addresses/domains to: scaninfo@paloaltonetworks.com", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_7; en-us) AppleWebKit/534.20.8 (KHTML, like Gecko) Version/5.1 Safari/534.20.8", "Mozilla/5.0 (Linux; Android 9; Redmi Note 5 Pro) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.89 Mobile Safari/537.36".

The bots are sending GET requests, which fail as designed, but I'm wondering how the bots even knew the subdomain existed?!

parliament32 ◴[07 Mar 25 00:15 UTC] No.43286370[source]▶

>>43285725 (OP) #

Certificate Transparency logs, or they don't actually know the domain name: just port-scanning[1] then making requests to open web ports.

[1] Turns out you can port-scan the entire internet in under 5 minutes: https://github.com/robertdavidgraham/masscan
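
To illustrate the Certificate Transparency route: crt.sh offers a JSON view of logged certificates (for example https://crt.sh/?q=%25.sampledomain.com&output=json) in which, as far as I know, each entry carries a newline-separated `name_value` field. The sketch below only parses such a response. The helper name `hostnames_from_ct` and the sample payload are made up for illustration, and the network fetch is deliberately left out.

```python
import json


def hostnames_from_ct(payload: str, domain: str) -> list[str]:
    """Collect unique hostnames under `domain` from a crt.sh-style JSON payload."""
    names = set()
    for entry in json.loads(payload):
        # name_value may hold several certificate names separated by newlines.
        for name in entry.get("name_value", "").splitlines():
            name = name.strip().lower()
            if name == domain or name.endswith("." + domain):
                names.add(name)
    return sorted(names)


# Hypothetical sample data, not real log output.
SAMPLE = json.dumps([
    {"name_value": "userfileupload.sampledomain.com"},
    {"name_value": "sampledomain.com\n*.sampledomain.com"},
    {"name_value": "unrelated.example.org"},
])

print(hostnames_from_ct(SAMPLE, "sampledomain.com"))
```

Any hostname that ever appeared on a publicly logged certificate would show up in a listing like this, whether or not it was linked anywhere.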

replies(3): >>43286494 #>>43287497 #>>43287503 #

1. andix ◴[07 Mar 25 00:37 UTC] No.43286494[source]▶

>>43286370 #

Port scanning usually can't discover subdomains. Most servers don't expose the names of the domains they serve content for. In the case of HTTP, they usually only serve the subdomain content if the Host: request header includes it.
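
A minimal sketch of that kind of Host-based dispatch, assuming a strict setup with no default site. The `VHOSTS` table and the `route` helper are hypothetical and not taken from any particular web server.

```python
from typing import Optional

# Hypothetical virtual-host table: hostname -> content served for it.
VHOSTS = {
    "userfileupload.sampledomain.com": "upload endpoint",
}


def route(host_header: Optional[str]) -> tuple[int, str]:
    """Return (status, body) based solely on the Host request header."""
    if host_header is None:
        return 400, "missing Host header"
    # Drop an optional ":port" suffix and normalise case.
    # (Simplification: IPv6 literals such as "[::1]:80" are not handled.)
    hostname = host_header.split(":")[0].lower()
    body = VHOSTS.get(hostname)
    if body is None:
        return 404, "unknown host"
    return 200, body


print(route("203.0.113.7"))                      # scanner that only knows the IP
print(route("userfileupload.sampledomain.com"))  # client that knows the name
```

With a table like this, a scanner that only has the IP address never reaches the upload endpoint; it has to learn the hostname some other way first.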

replies(4): >>43286515 #>>43286516 #>>43286524 #>>43292643 #

2. hombre_fatal ◴[07 Mar 25 00:43 UTC] No.43286515[source]▶

>>43286494 (TP) #

Most servers just listen on :80 and respond to all requests. Almost nobody checks the Host header intentionally; it's just a happy accident if they use a reverse proxy.

You can often decloak servers behind Cloudflare because of this.

But OP's post already answered their question: someone scanned the IPv4 space. What they mean is that a server they point to via DNS is receiving requests, but DNS is a red herring.

replies(1): >>43286535 #

3. benfortuna ◴[07 Mar 25 00:44 UTC] No.43286516[source]▶

>>43286494 (TP) #

I could be wrong, but the Palo Alto scanner says it's using global ipv4 space, so not using DNS at all. So actually the subdomain has not been discovered at all.

replies(1): >>43287624 #

4. parliament32 ◴[07 Mar 25 00:45 UTC] No.43286524[source]▶

>>43286494 (TP) #

How deep in the domain hierarchy you are doesn't matter at the network layer: a bare TLD (yes, this exists), a normal domain, a subdomain, a sub-subdomain, etc. can all be assigned different IPs and go different places. You can issue a GET against / for any IP you want (like we see in the logs OP posted). The only time this would actually matter is if a host at an address is serving content for multiple hostnames and depends on the Host header to figure out which one to serve -- but even those will almost always have a default.
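
For a concrete picture, this is roughly what such a request looks like on the wire when the sender only has an IP address. It is a sketch: `build_probe`, the documentation-range address 203.0.113.7, and the user agent string are all invented for the example.

```python
def build_probe(ip: str, path: str = "/") -> bytes:
    """Build the kind of plain HTTP/1.1 request a scanner could send to a bare IP."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {ip}",                      # no hostname known, so the IP is used
        "User-Agent: example-scanner/0.1",  # made-up user agent
        "Connection: close",
        "",                                 # blank line terminates the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


probe = build_probe("203.0.113.7")
print(probe.decode("ascii"))
```

Nothing in these bytes mentions a domain or subdomain; whichever site the server treats as its default for that address is what answers.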

replies(1): >>43286618 #

5. andix ◴[07 Mar 25 00:48 UTC] No.43286535[source]▶

>>43286515 #

This really depends on the setup. Most web servers host multiple virtual hosts. IP addresses are expensive.

If you're deploying a service behind a reverse proxy, it either must be only accessible from the reverse proxy via an internal network, or check the IP address of the reverse proxy. It absolutely must not trust X-Forwarded-For: headers from random IPs.
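
A rough sketch of the second option, checking the peer address before believing X-Forwarded-For. The 10.0.0.0/24 proxy range, the `client_ip` helper, and the plain-dict header lookup are assumptions made for the example, with a single proxy hop in front of the service.

```python
import ipaddress

# Assumed address range of the reverse proxy (illustrative only).
TRUSTED_PROXIES = [ipaddress.ip_network("10.0.0.0/24")]


def client_ip(peer_ip: str, headers: dict[str, str]) -> str:
    """Return the client address, honouring X-Forwarded-For only from trusted peers."""
    peer = ipaddress.ip_address(peer_ip)
    forwarded = headers.get("X-Forwarded-For")
    if forwarded and any(peer in net for net in TRUSTED_PROXIES):
        # With one trusted hop, the right-most entry is the address our own
        # proxy saw; anything further left was supplied by the client.
        return forwarded.split(",")[-1].strip()
    # Untrusted peer (or no header): ignore the header entirely.
    return peer_ip


print(client_ip("198.51.100.9", {"X-Forwarded-For": "1.2.3.4"}))
print(client_ip("10.0.0.5", {"X-Forwarded-For": "6.6.6.6, 203.0.113.50"}))
```

The first call comes from a random address, so its header is ignored; the second comes from the proxy range, so the entry the proxy appended is used.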

replies(1): >>43286553 #

6. hombre_fatal ◴[07 Mar 25 00:53 UTC] No.43286553{3}[source]▶

>>43286535 #

I just don't see how any of this matters. OP's server is reachable via ipv4 and someone sent an http request to it. Their post even says that this is the case.

replies(1): >>43286582 #

7. andix ◴[07 Mar 25 00:58 UTC] No.43286582{4}[source]▶

>>43286553 #

I'm guessing they meant it discovered a virtual host behind a subdomain.

8. andix ◴[07 Mar 25 01:08 UTC] No.43286618[source]▶

>>43286524 #

You can discover IP addresses, sure. Just enumerate them. But this doesn't give you the domain, as long as there is no reverse DNS record.
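
To make the reverse DNS point concrete, here is a small sketch using only the Python standard library. `ptr_name` shows which PTR record name would be queried for an address, and `reverse_lookup` performs the actual lookup; both helper names and the address 203.0.113.7 are illustrative.

```python
import ipaddress
import socket
from typing import Optional


def ptr_name(ip: str) -> str:
    """Return the DNS name under which a PTR record for `ip` would live."""
    return ipaddress.ip_address(ip).reverse_pointer


def reverse_lookup(ip: str) -> Optional[str]:
    """Ask the resolver for the PTR hostname of `ip`; None if there is none."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:
        # Covers socket.herror / socket.gaierror: no record or lookup failure.
        return None


print(ptr_name("203.0.113.7"))
```

Whatever a PTR record returns is set by whoever controls the address block, so an address at a large hosting provider typically maps to a generic provider name, if to anything, rather than to the hostnames served from it.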

I'm quite sure OP meant a virtual host only reachable with the correct Host: header.

9. reactordev ◴[07 Mar 25 05:46 UTC] No.43287624[source]▶

>>43286516 #

This is exactly what’s happening based on the log snippet posted. Has nothing to do with subdomains, has everything to do with it being on the internet.

10. cryptonector ◴[07 Mar 25 18:20 UTC] No.43292643[source]▶

>>43286494 (TP) #

And in the case of HTTPS they need to insist on SNI (and TLS 1.3 makes SNI support mandatory).
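
A sketch of what SNI looks like from the client side, using Python's `ssl` module with in-memory BIOs so no network connection is made. The `client_hello` helper is invented for the example; it captures the first handshake message and lets you check whether the hostname is in it.

```python
import ssl
from typing import Optional


def client_hello(server_hostname: Optional[str]) -> bytes:
    """Return the raw ClientHello bytes a TLS client would send first."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if server_hostname is None:
        # Without a hostname there is nothing to verify against, as is the
        # case for a scanner connecting to a bare IP.
        ctx.check_hostname = False
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=server_hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass  # expected: there is no server to answer
    return outgoing.read()


NAME = "userfileupload.sampledomain.com"
print(NAME.encode() in client_hello(NAME))   # hostname travels in the ClientHello
print(NAME.encode() in client_hello(None))   # bare-IP style connection: no SNI
```

A server that rejects handshakes lacking the expected server name therefore gives nothing useful to a client that only knows the IP.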