
254 points by mrlesk | 15 comments
1. mrlesk ◴[] No.44483531[source]
I threw Claude Code at an existing codebase a few months back and quickly quit: untangling its output was slower than writing from scratch. The fix turned out to be process, not model horsepower.

Iteration timeline
==================

• 50% task success - added README.md + CLAUDE.md so the model knew the project.

• 75% - wrote one markdown file per task; Codex plans, Claude codes.

• 95%+ - built Backlog.md, a CLI that turns a high-level spec into those task files automatically (yes, using Claude/Codex to build the tool).
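One markdown file per task might look roughly like this (an illustrative sketch; the field names are my assumptions, not Backlog.md's actual schema):

```markdown
---
id: task-12
title: Add CSV export to the reports page
status: todo
dependencies: [task-8]
---

## Description
Users need a "Download CSV" button on /reports that exports the current table.

## Acceptance criteria
- [ ] Button appears for logged-in users
- [ ] Exported file matches the on-screen table
```

Keeping each task this small and self-describing is what lets the agent work on it without the surrounding conversation context.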

Three-step loop that works for me:

1. Generate tasks - Codex / Claude Opus → self-review.

2. Generate plan - same agent, “plan” mode → tweak if needed.

3. Implement - Claude Sonnet / Codex → review & merge.

For simple features I can even run this from my phone: ChatGPT app (Codex) → GitHub app → ChatGPT app → GitHub merge.

Repo: https://github.com/MrLesk/Backlog.md

Would love feedback and happy to answer questions!

replies(6): >>44484317 #>>44484770 #>>44486442 #>>44487755 #>>44490648 #>>44559194 #
2. mitjam ◴[] No.44484317[source]
Really love this.

Would love to see an actual end to end example video of you creating, planning, and implementing a task using your preferred models and apps.

replies(1): >>44484679 #
3. mrlesk ◴[] No.44484679[source]
Will definitely do. I am also planning to run a benchmark with various models to see which one is more effective at building a full product, starting from a PRD and using Backlog.md for managing tasks.
replies(2): >>44484751 #>>44485226 #
4. bazooka5798 ◴[] No.44484751{3}[source]
I'd love to see OpenRouter connectivity to try non-Claude models for some of the planning parts of the cycle.
5. unshavedyak ◴[] No.44484770[source]
Would love more detail on your integration with Claude. Are you telling Claude to use Backlog to plan task X? Feels like some MCP integration or something might make it feel more native?

Though i've not had much luck in getting Claude to natively use MCPs, so maybe that's off base heh.

replies(1): >>44486820 #
6. westurner ◴[] No.44485226{3}[source]
Is there an established benchmark for building a full product?

- SWE-bench leaderboard: https://www.swebench.com/

- Which metrics for e.g. "SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork"? https://news.ycombinator.com/item?id=43101314

- MetaGPT, MGX: https://github.com/FoundationAgents/MetaGPT :

> Software Company as Multi-Agent System

> MetaGPT takes a one line requirement as input and outputs user stories / competitive analysis / requirements / data structures / APIs / documents, etc. Internally, MetaGPT includes product managers / architects / project managers / engineers. It provides the entire process of a software company along with carefully orchestrated SOPs.

- Mutation-Guided LLM-based Test Generation: https://news.ycombinator.com/item?id=42953885

- https://news.ycombinator.com/item?id=41333249 :

- codefuse-ai/Awesome-Code-LLM > Analysis of AI-Generated Code, Benchmarks: https://github.com/codefuse-ai/Awesome-Code-LLM :

> 8.2 Benchmarks: Integrated Benchmarks, Evaluation Metrics, Program Synthesis, Visually Grounded Program, Synthesis, Code Reasoning and QA, Text-to-SQL, Code Translation, Program Repair, Code Summarization, Defect/Vulnerability Detection, Code Retrieval, Type Inference, Commit Message Generation, Repo-Level Coding

- underlines/awesome-ml/tools.md > Benchmarking: https://github.com/underlines/awesome-ml/blob/master/llm-too...

- formal methods workflows, coverage-guided fuzzing: https://news.ycombinator.com/item?id=40884466

- "Large Language Models Based Fuzzing Techniques: A Survey" (2024) https://arxiv.org/abs/2402.00350

replies(1): >>44489740 #
7. thelittleone ◴[] No.44486442[source]
I've had the same experience. Taskmaster-ai was pretty good, but sometimes the agent ignored it as the project grew larger (that can probably be prevented now using Claude Code hooks).

Trying this project today; it looks nice. I see you have sub-tasks. Any thoughts on a 'dependency' relation? I.e., don't do task X if it depends on task A, which is not complete.

FYI, there is a 404 in AGENTS.md, GEMINI.md, etc., pointing to a non-existent README.md.

replies(1): >>44488727 #
8. mrlesk ◴[] No.44486820[source]
No MCP, just custom instructions.

When you initialize Backlog in a folder, it asks whether you want to set up agent instructions like CLAUDE.md. It is important to say yes here so that Claude knows how to use Backlog.md.

Afterwards you can just write something like: "Claude, please have a look at the @prd.md file and use ultrathink to create relevant tasks to implement it. Make sure you correctly identify dependencies between tasks and use sub-tasks when necessary."

Or you can just paste your feature request directly without using extra files.

Feels a bit like magic.

replies(1): >>44487796 #
9. jwpapi ◴[] No.44487755[source]
Hey man amazing work! You’re a legend
10. jwpapi ◴[] No.44487796{3}[source]
How can I change from Gemini to Claude?

Also, I'm not fully sure about your setup. Coming at it fresh, I would next set up agents that check my GitHub repo for backlog tasks and open pull requests for those tasks. If I write a good description, and ideally tests, I can optimize the results of these.

This creates the possibility of agents checking your backlog and preparing the work.

I usually work with aider every day and I'm quite fast at completing tasks; the next limitation would be the latency and some back and forth, since I have some dead time in between. I can definitely define tasks faster than working 1:1 with the AI.

Yeah, if you could share a bit more about how you do this with Claude, we would all be thankful. Also, I haven't seen anywhere to sponsor/tip you; would love to!

11. mrlesk ◴[] No.44488727[source]
Yep. Dependencies are supported via the --dep parameter.

Will check the 404 issues. Thanks for reporting it.
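The dependency gating described above (don't start a task until everything it depends on is done) can be sketched in a few lines. This is illustrative logic only, not Backlog.md's actual implementation:

```python
# Sketch of dependency gating: a task is "ready" only when it is not done
# and every task it depends on is done. Data shapes are assumptions.
def ready_tasks(tasks):
    """tasks: dict of task id -> {"deps": [task ids], "done": bool}"""
    return [
        tid for tid, t in tasks.items()
        if not t["done"] and all(tasks[d]["done"] for d in t["deps"])
    ]

tasks = {
    "task-1": {"deps": [], "done": True},
    "task-2": {"deps": ["task-1"], "done": False},  # ready: task-1 is done
    "task-3": {"deps": ["task-2"], "done": False},  # blocked by task-2
}
print(ready_tasks(tasks))  # → ['task-2']
```

An agent driving the loop would pick only from `ready_tasks`, which is what prevents it from implementing X before its prerequisite A lands.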

12. Leave_OAI_Alone ◴[] No.44489740{4}[source]
You have compiled an interesting list of benchmarks and adjacent research. The implicit question is whether an established benchmark for building a full product exists.

After reviewing all this, what is your actual conclusion, or are you asking? Is the takeaway that a comprehensive benchmark exists and we should be using it, or is the takeaway that the problem space is too multifaceted for any single benchmark to be meaningful?

replies(1): >>44496363 #
13. beef_rendang ◴[] No.44490648[source]
>ChatGPT app (Codex) → GitHub app → ChatGPT app → GitHub merge

I look forward to a future where we are reduced to rubberstamping fully-agentic-generated code on our glass slates for $0.01 eurodollars a PR.

14. westurner ◴[] No.44496363{5}[source]
The market - actual customers - is probably the best benchmark for a product.

But then outstanding liabilities due to code quality and technical debt aren't costed in by the market.

There are already code quality metrics.

SAST and DAST tools can score or fix code, as part of a LLM-driven development loop.

Formal verification is maybe the best code quality metric.

Is there more than Product-Market fit and infosec liabilities?

15. knownhoot ◴[] No.44559194[source]
Why Codex for planning?