That's gonna be a very, very long answer. What's funny is that not much has changed since late 2022, when the project started; the models just got better, but we've had a good chunk of the primitives since GPT-3.
What's more recent is the DbC (Design by Contract) contribution, which I think is unique. It has solved every agent-related problem I've thrown at it so far -- especially because I can chain contracts together and the guardrails propagate nicely.
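To make the contract idea concrete, here's a minimal sketch of what a single contracted step looks like in spirit. The names, models, and validators below are hypothetical and assume a pydantic-style validation layer; they are not the actual API from the DbC post linked at the end.

```python
# Illustrative sketch only: every name here is invented for the example,
# not taken from the library. The point is the shape of a contracted step:
# validate the input (precondition), call the model, validate the output
# (postcondition) before anything downstream ever sees it.
from pydantic import BaseModel, field_validator


class ResearchQuery(BaseModel):
    text: str

    @field_validator("text")
    @classmethod
    def not_empty(cls, v: str) -> str:
        # Precondition: refuse empty or whitespace-only queries up front.
        if not v.strip():
            raise ValueError("query must not be empty")
        return v


class ResearchPlan(BaseModel):
    questions: list[str]

    @field_validator("questions")
    @classmethod
    def enough_questions(cls, v: list[str]) -> list[str]:
        # Postcondition: require at least three concrete questions, so a lazy
        # generation fails here (and can be retried) instead of polluting the
        # next step in the chain.
        if len(v) < 3:
            raise ValueError("expected at least 3 specific questions")
        return v


def call_llm(prompt: str) -> dict:
    # Stub standing in for the actual model call.
    return {"questions": ["q1", "q2", "q3"]}


def plan_research(query: ResearchQuery) -> ResearchPlan:
    """Hypothetical contracted step: typed input validated on the way in,
    model output parsed into ResearchPlan and validated on the way out."""
    return ResearchPlan.model_validate(call_llm(query.text))
```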
I've built most of the custom tools myself. For instance, not only was Perplexity rendered useless by OpenAI's web search, but OpenAI's web search itself isn't good enough compared to what you can customize yourself. To this end, I built my own deep research agent. Here's a thread with some results from the first day it was working: https://x.com/futurisold/status/1931751644233945216
I'm also running a company, and we've built an e2e document generation pipeline just from contracts (three contracts chained together in this case; there's a rough sketch of what such a chain looks like after the example below). Here's an output (sorry about the PDF rendering, that's not what we serve, it's just something I quickly hacked together for local dev): https://drive.google.com/file/d/1Va7ALq_N-fTYeumKhH4jSxsTrWD...
This was the input:
---
Prompt:
> I want the files to be analyzed and I am interested in finding patterns; feel free to make suggestions as well. I want to understand how different providers use their system prompts, therefore things like: what kind of tags do they use – are they XML, markdown, etc, are they prone toward sycophancy or trying to manipulate the user, are they using tools and if so how, etc. I want the tech report to deconstruct and synthesize and compare the information, find interesting patterns that would be hard to spot.
Generated instructions:
(a) Query:
Conduct a comparative analysis of system prompts across major AI providers (OpenAI, Google, Anthropic, xAI, etc.) to identify structural patterns, linguistic frameworks, and operational constraints that shape AI behavior and responses.
(b) Specific Questions:
1. What syntactic structures and formatting conventions (XML, markdown, JSON, etc.) are employed across different AI system prompts, and how do these technical choices reflect different approaches to model instruction?
2. To what extent do system prompts encode instructions for deference, agreeability, or user manipulation, and how do these psychological frameworks vary between commercial and research-focused models?
3. How do AI providers implement and constrain tool usage in their system prompts, and what patterns emerge in permission structures, capability boundaries, and function calling conventions?
4. What ethical guardrails and content moderation approaches appear consistently across system prompts, and how do implementation details reveal different risk tolerance levels between major AI labs?
5. What unique architectural elements in specific providers' system prompts reveal distinctive engineering approaches to model alignment, and how might these design choices influence downstream user experiences?
---
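For the document-generation example above, the chain is conceptually just three contracted steps plugged into one another: each step's validated output type is the next step's input type, so a guardrail enforced early constrains everything downstream. Again, this is a hypothetical sketch of the shape with invented names, not the actual implementation:

```python
# Hypothetical three-contract chain (all names invented for illustration):
# prompt -> instructions -> findings -> report. Each arrow is a contracted
# step; its output model is the next step's input model, so a validation
# failure stops the chain at the exact boundary where it occurred.
from pydantic import BaseModel


class Instructions(BaseModel):
    query: str
    questions: list[str]


class Findings(BaseModel):
    sources: list[str]
    notes: str


class Report(BaseModel):
    title: str
    body_markdown: str


def generate_instructions(prompt: str) -> Instructions: ...
def run_research(instructions: Instructions) -> Findings: ...
def write_report(instructions: Instructions, findings: Findings) -> Report: ...


def document_pipeline(prompt: str) -> Report:
    # Because the types line up, each contract's postcondition doubles as the
    # next contract's precondition -- that's what "the guardrails propagate"
    # means in practice.
    instructions = generate_instructions(prompt)
    findings = run_research(instructions)
    return write_report(instructions, findings)
```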
Contracts were introduced in March in this post: https://futurisold.github.io/2025-03-01-dbc/
They've evolved a lot since then, but the foundation and motivation haven't changed.