146 points FrasiertheLion | 2 comments | 15 May 25 16:19 UTC | HN request time: 0.408s | source

Hello HN! We’re Tanya, Sacha, Jules and Nate from Tinfoil: https://tinfoil.sh. We host models and AI workloads on the cloud while guaranteeing zero data access and retention. This lets us run open-source LLMs like Llama, or Deepseek R1 on cloud GPUs without you having to trust us—or any cloud provider—with private data.

Since AI performs better the more context you give it, we think solving AI privacy will unlock more valuable AI applications, just how TLS on the Internet enabled e-commerce to flourish knowing that your credit card info wouldn't be stolen by someone sniffing internet packets.

We come from backgrounds in cryptography, security, and infrastructure. Jules did his PhD in trusted hardware and confidential computing at MIT, and worked with NVIDIA and Microsoft Research on the same, Sacha did his PhD in privacy-preserving cryptography at MIT, Nate worked on privacy tech like Tor, and I (Tanya) was on Cloudflare's cryptography team. We were unsatisfied with band-aid techniques like PII redaction (which is actually undesirable in some cases like AI personal assistants) or “pinky promise” security through legal contracts like DPAs. We wanted a real solution that replaced trust with provable security.

Running models locally or on-prem is an option, but can be expensive and inconvenient. Fully Homomorphic Encryption (FHE) is not practical for LLM inference for the foreseeable future. The next best option is using secure enclaves: a secure environment on the chip that no other software running on the host machine can access. This lets us perform LLM inference in the cloud while being able to prove that no one, not even Tinfoil or the cloud provider, can access the data. And because these security mechanisms are implemented in hardware, there is minimal performance overhead.

Even though we (Tinfoil) control the host machine, we do not have any visibility into the data processed inside of the enclave. At a high level, a secure enclave is a set of cores that are reserved, isolated, and locked down to create a sectioned off area. Everything that comes out of the enclave is encrypted: memory and network traffic, but also peripheral (PCIe) traffic to other devices such as the GPU. These encryptions are performed using secret keys that are generated inside the enclave during setup, which never leave its boundaries. Additionally, a “hardware root of trust” baked into the chip lets clients check security claims and verify that all security mechanisms are in place.

Up until recently, secure enclaves were only available on CPUs. But NVIDIA confidential computing recently added these hardware-based capabilities to their latest GPUs, making it possible to run GPU-based workloads in a secure enclave.

Here’s how it works in a nutshell:

1. We publish the code that should run inside the secure enclave to Github, as well as a hash of the compiled binary to a transparency log called Sigstore

2. Before sending data to the enclave, the client fetches a signed document from the enclave which includes a hash of the running code signed by the CPU manufacturer. It then verifies the signature with the hardware manufacturer to prove the hardware is genuine. Then the client fetches a hash of the source code from a transparency log (Sigstore) and checks that the hash equals the one we got from the enclave. This lets the client get verifiable proof that the enclave is running the exact code we claim.

3. With the assurance that the enclave environment is what we expect, the client sends its data to the enclave, which travels encrypted (TLS) and is only decrypted inside the enclave.

4. Processing happens entirely within this protected environment. Even an attacker that controls the host machine can’t access this data. We believe making end-to-end verifiability a “first class citizen” is key. Secure enclaves have traditionally been used to remove trust from the cloud provider, not necessarily from the application provider. This is evidenced by confidential VM technologies such as Azure Confidential VM allowing ssh access by the host into the confidential VM. Our goal is to provably remove trust both from ourselves, aka the application provider, as well as the cloud provider.

We encourage you to be skeptical of our privacy claims. Verifiability is our answer. It’s not just us saying it’s private; the hardware and cryptography let you check. Here’s a guide that walks you through the verification process: https://docs.tinfoil.sh/verification/attestation-architectur....

People are using us for analyzing sensitive docs, building copilots for proprietary code, and processing user data in agentic AI applications without the privacy risks that previously blocked cloud AI adoption.

We’re excited to share Tinfoil with HN!

* Try the chat (https://tinfoil.sh/chat): It verifies attestation with an in-browser check. Free, limited messages, $20/month for unlimited messages and additional models

* Use the API (https://tinfoil.sh/inference): OpenAI API compatible interface. $2 / 1M tokens

* Take your existing Docker image and make it end to end confidential by deploying on Tinfoil. Here's a demo of how you could use Tinfoil to run a deepfake detection service that could run securely on people's private videos: https://www.youtube.com/watch?v=_8hLmqoutyk. Note: This feature is not currently self-serve.

* Reach out to us at contact@tinfoil.sh if you want to run a different model or want to deploy a custom application, or if you just want to learn more!

Let us know what you think, we’d love to hear about your experiences and ideas in this space!

Show context

interleave ◴[16 May 25 20:19 UTC] No.44009457[source]▶

>>43996555 (OP) #

Technically my wife would be a perfect customer because we literally just prototyped your solution at home. But I'm confused.

For context:

My wife does leadership coaching and recently used vanilla GPT-4o via ChatGPT to summarize a transcript of an hour-long conversation.

Then, last weekend we thought... "Hey, let's test local LLMs for more privacy control. The open source models must be pretty good in 2025."

So I installed Ollama + Open WebUI plus the models on a 128GB MacBook Pro.

I am genuinely dumbfounded about the actual results we got today of comparing ChatGPT/GPT-4o vs. Llama4, Llama3.3, Llama3.2, DeepSeekR1 and Gemma.

In short: Compared to our reference GPT-4o output, none (as in NONE, zero, zilch, nil) of the above-mentioned open source models were able to create even a basic summary based on the exact same prompt + text.

The open source summaries were offensively bad. It felt like reading the most bland, generic and idiotic SEO slop I've read since I last used Google. None of the obvious topics were part of the summary. Just blah. I tested this with 5 models to boot!

I'm not an OpenAI fan per se, but if this is truly OS/SOTA then, we shouldn't even mention Llama4 or the others in the same breath as the newer OpenAI models.

What do you think?

replies(1): >>44009673 #

1. FrasiertheLion ◴[16 May 25 20:45 UTC] No.44009673[source]▶

>>44009457 #

Ollama does heavily quantize models and has a very short context window by default, but this has not been my experience with unquantized, full context versions of Llama3.3 70B and particularly, Deepseek R1, and that is reflected in the benchmarks. For instance I used Deepseek R1 671B as my daily driver for several months, and it was at par with o1 and unquestionably better than GPT-4o (o3 is certainly better than all but typically we've seen opensource models catch up within 6-9 months).

Please shoot me an email at tanya@tinfoil.sh, would love to work through your use cases.

replies(1): >>44015529 #

2. interleave ◴[17 May 25 16:53 UTC] No.44015529[source]▶

>>44009673 (TP) #

Hey Tanya! Thank you for helping me understand the results better.

I just posted the results of another basic interview analysis (4o vs. Llama4) here: https://x.com/SpringStreetNYC/status/1923774145633849780

To your point: Do I understand correctly that, for example, by running the default model of Llama4 via ollama, the context window is very short even when the model's context is, like 10M. In order to "unlock" the full context version, I need to get the unquantized version.

For reference, here's what `ollama show llama4` returns: - parameters 108.6B # llama4:scount - context length 10485760 # 10M - embedding length 5120 - quantization Q4_K_M

↑

Launch HN: Tinfoil (YC X25): Verifiable Privacy for Cloud AI