Given domain expertise in the problem statement, we can apply the same context-engineering tactics at a higher level, in prompt engineering.
On top of that -- rebranding "prompt engineering" as "context engineering" and pretending it's anything different is ignorant at best and destructively dumb at worst.
Just as all squares are rectangles but not all rectangles are squares: prompt engineering is context engineering, but context engineering also includes other optimisations that are not prompt engineering.
That all said, I don’t disagree with your overall point regarding the state of AI these days. The industry is so full of smoke and mirrors that it’s really hard to separate the actual novel uses of “AI” from the bullshit.
This entire field is basically being built on quicksand. And it will stay like this until the bubble bursts.
The other is that they intentionally forced LLMs to do things we know they are bad at (following algorithms, tasks that require more context than is available, etc.) without allowing them to solve those tasks in the way they're optimized to: writing code that implements the algorithm.
A cynical read is that the paper is the only AI achievement Apple has managed to produce in the past few years.
(There is another: they managed not to lose MLX people to Meta)
One thing I’ve noticed about this AI bubble is just how much people are sharing and comparing notes. So I don’t think the issue is people being too arrogant (or whatever label you’d prefer to use) to agree on a way to use them.
From what I’ve seen, the problem is more technical in nature. People have built this insanely advanced thing (LLMs) and are now trying to hammer this square peg into a round hole.
The problem is that LLMs are an incredibly big breakthrough, but they’re still incredibly dumb technology in most ways. So 99% of the applications that people use them for are just a layering of hacks.
With an API, there’s generally only one way to call it. With a stick of RAM, there’s generally only one way to use it. But to make RAM and APIs useful, you need to call upon a whole plethora of other technologies too. With LLMs, it’s just hacks on top of hacks. And because it seemingly works, people move on before they question whether this hack will still work in a month’s time. Or a year’s time. Or a decade later. Because who cares when the technology will already be old next week anyway.
LLMs are extremely subtle; they are intellectual chameleons, which is enough to break many a person's brain. They respond in a reflection of how they were prompted, a subtlety that is lost on the majority. The key to them is approaching them as statistical language constructs, with mirroring behavior as the mechanism they use to generate their replies.
I am very successful with them, yet my techniques seem to trigger endless debate. I treat LLMs as method actors and they respond in character and with their expected skills and knowledge. Yet when I describe how I do this, I get unwanted emotional debate, as if I'm somehow insulting others through my methods.
It is different. There are usually two main parts to the prompt:
1. The context.
2. The instructions.
The context part has to be optimized to be as small as possible, while still including all the necessary information. It can also be compressed via, e.g., LLMLingua.
On the other hand, the instructions part must be optimized to be as detailed as possible, because otherwise the LLM will fill the gaps with possibly undesirable assumptions.
So "context engineering" refers to engineering the context part of the prompt, while "prompt engineering" could refer to either engineering of the whole prompt, or engineering of the instructions part of the prompt.
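To make that split concrete, here's a minimal sketch in Python of assembling such a two-part prompt. The function name, the character budget, and the section headings are illustrative assumptions, and the comment only marks where a compressor like LLMLingua could slot in:

    def build_prompt(context_chunks: list[str], instructions: str,
                     max_context_chars: int = 12_000) -> str:
        """Keep the context part as small as possible; keep the instructions detailed."""
        context = ""
        for chunk in context_chunks:
            if len(context) + len(chunk) > max_context_chars:
                break  # drop whatever doesn't fit the context budget
            context += chunk.strip() + "\n\n"
        # A compressor such as LLMLingua could be applied to `context` here.
        return (
            "## Context\n"
            f"{context}"
            "## Instructions\n"
            f"{instructions}\n"
            "If something is not covered by the context above, say so rather than assuming."
        )

    print(build_prompt(
        ["Customer is on the Pro plan.", "Ticket #4821: login fails after password reset."],
        "Draft a support reply. Cite the ticket number. Keep it under 100 words.",
    ))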
Early in the game when context windows were very small (8k, 16k, and then 32k), the team I was working with achieved fantastic results with very low incidence of hallucinations through deep "context engineering" (we didn't call it that but rather "indexing and retrieval").
We did a project for Alibaba and generated tens of thousands of pieces of output. They actually had human analysts review and grade each one for the first thousand. The errors they found? Always in the source material.
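For illustration, a stripped-down sketch of that kind of indexing and retrieval: rank stored chunks against the query and keep only what fits a small window. The keyword-overlap scoring here is just a stand-in for whatever retriever (BM25, embeddings, etc.) a real setup would use; the names and budget are assumptions:

    def top_chunks(query: str, chunks: list[str], budget_chars: int = 6_000) -> list[str]:
        """Rank indexed chunks by crude keyword overlap; keep only what fits the window."""
        query_terms = set(query.lower().split())
        ranked = sorted(chunks,
                        key=lambda c: len(query_terms & set(c.lower().split())),
                        reverse=True)
        picked, used = [], 0
        for chunk in ranked:
            if used + len(chunk) <= budget_chars:
                picked.append(chunk)
                used += len(chunk)
        return picked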
What I described not only applies to using AI for coding, but to most of the other use cases as well.
> Now I've never found typing to be the bottle neck in software development, even before modern IDEs, so I'm struggling to see where all the lift is meant to be with this tech.
There are many ways to use AI for coding. You could use something like Claude Code for more granular updates, or just copy and paste your entire code base into, e.g., Gemini, and have it oneshot a new feature (though I like to prompt it to make a checklist first, then generate step by step).
And it is not only about typing: it is also about debugging, refactoring, figuring out how a certain thing works, etc. Nowadays I barely write any code by hand, and I offload most of the debugging and other miscellaneous tasks to LLMs as well. They are simply much faster and more convenient at connecting all the dots, making sure nothing is missed, etc.
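As a rough sketch of the "paste the whole code base" workflow mentioned above (not any particular tool's behaviour), something like this builds the blob plus the checklist-first instruction; the paths, extensions, and wording are placeholders:

    from pathlib import Path

    def bundle_repo(root: str = ".", exts: tuple[str, ...] = (".py", ".md")) -> str:
        """Concatenate source files into one labelled blob to paste into the model."""
        parts = []
        for path in sorted(Path(root).rglob("*")):
            if path.is_file() and path.suffix in exts:
                parts.append(f"--- {path} ---\n{path.read_text(errors='ignore')}")
        return "\n\n".join(parts)

    prompt = (
        "First write a step-by-step checklist for adding the feature described below, "
        "then implement it one step at a time, showing full files for anything you change.\n\n"
        "Feature: <describe the feature here>\n\n"
        + bundle_repo()
    )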
If you assume any error rate of consequence, and you will get one, especially if temperature isn't zero, then the failures compound over a long run of moves (see the arithmetic sketch below), and at larger disk counts you'd start to hit context limits too.
Ask a human to repeatedly execute the Tower of Hanoi algorithm for a similar number of steps and see how many will do so flawlessly.
They didn't measure "the diminishing returns of 'deep learning'"- they measured limitations of asking a model to act as a dumb interpreter repeatedly with a parameter set that'd ensure errors over time.
That a paper this poor got released at all was shocking.
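To put a number on the compounding-error point: the chance of a flawless run over N dependent steps with per-step accuracy p is p**N, and Tower of Hanoi with n disks takes 2**n - 1 moves. The accuracy figures below are illustrative, not numbers from the paper:

    # Chance of a flawless run over N dependent steps with per-step accuracy p.
    for disks in (7, 10, 15):
        moves = 2 ** disks - 1  # Tower of Hanoi needs 2**n - 1 moves for n disks
        for p in (0.999, 0.9999):
            print(f"{disks} disks, {moves} moves, per-move accuracy {p}: "
                  f"P(flawless) = {p ** moves:.4f}")

Even at 99.99% per-move accuracy, a flawless 15-disk run (32,767 moves) is already very unlikely.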
Maybe we should look to science and start using the term pseudo-engineering to dismiss the frivolous terms. I don’t really like that though, since pseudoscience has an invalidating connotation, whereas, e.g., prompt engineering is not a lesser or invalid form of engineering - it’s simply not engineering at all, and no more or less "valid". It’s like calling yourself a "canine engineer" when teaching your dog to do tricks.
Context is very difficult for computers to acquire and understand. In some cases it requires knowing what a person is thinking, or their entire life history. The sensors currently available are very limited in their ability to gather context: for example, sensing mood, human relationships, intentions, tastes, fashion, local air temperature, knowledge about building layout, customs, norms, and a lot more. Context is a huge ontology problem and it's not going to be solved any time soon, so agents are going to be limited in what they can do for a long time.

At a minimum an agent probably needs to know your entire life history, the life history of everyone you know, and the past history of the area you are in. More limited ideas of context may be useful, but context as humans understand it is immensely complex. Even if you define context as just what a person supplies to a chatbot, the person may not be able to supply everything relevant to the question at hand, because context is difficult for people too. And everything relevant to a question is most certainly not always available on the web or in a database.
0: https://github.com/x1xhlol/system-prompts-and-models-of-ai-t...
What's really stopping you from parsing and prioritising CUSTOM CONTEXT if it's given as a text instruction in prompt engineering?