Yuming — Portfolio

I asked Claude to write a summary. It was pretty good. I rewrote half of it. The next day I asked again — same summary, same rewrite, same half. A week later, nothing changed. The model was getting smarter every quarter. My revision burden stayed exactly the same.

So I did what any reasonable person would do: I went down the configuration rabbit hole. Hooks — shell commands that fire automatically after every tool call. Skills — structured instructions that encode how I want specific tasks done. Memory files, preference files, rule files. A CLAUDE.md so detailed it reads like a prenuptial agreement.

It helped. A little. The system stopped making some obvious mistakes — and started making new ones. Confidently applying rules in the wrong context. Following instructions so literally the output went rigid. False positives from my own configuration. I was debugging my own rules more than I was doing actual work.

Then the real realization hit: I had spent more time maintaining the system than doing the tasks the system was supposed to help with.

This is productivity theater — the seductive feeling of building infrastructure around work instead of doing the work. You tweak the config, you refine the prompt, you add another rule. Every step feels like progress. The actual output doesn't move.

The reason it doesn't move is structural. The system remembers everything I've ever told it. It remembers nothing about what I've done to its work.

The chef who never checks the plates

Imagine a chef who memorizes every order perfectly. Allergies, substitutions, special requests — all stored. But this chef never once looks at which dishes come back to the kitchen half-eaten. Never notices that table 4 always sends back the risotto. Never registers that the sauce gets scraped off every single time.

That's how AI memory works today. Perfect recall of what you asked for. Zero awareness of what happened after delivery.

Every memory startup right now is building better recall. Better retrieval. Longer context. More facts about you stored in more sophisticated graphs. The benchmarks tell the same story: LongMemEval, LoCoMo, STARK, HaluMem — the most cited evaluations in the field all measure one thing: can the system remember and retrieve? They test recall accuracy. They do not test procedural alignment. Even their definition of "preference" stops at what the user explicitly stated in conversation. Whether the system can learn from a pattern of corrections, whether it can infer standards from a sequence of edits — nobody is measuring that.

The entire field is optimizing for the easy problem.

The hard problem is procedural: not "what does this user like" but "how does this user want this type of work done." And the hardest version of that problem is generalization. The way you write a technical spec — structured, evidence-first, precise about scope — is different from how you write a blog post. But it should carry over when you write an implementation plan, a design doc, a project brief. Same discipline, different material. A system that learns your spec-writing procedure and recognizes when a new task demands the same rigor — that is memory that actually matters.

No system does this today. You either re-explain your standards every time, or you write them down in a config file and spend your weekends maintaining it.

You already have a trust map. You just never wrote it down.

Your AI writes an email draft. You glance at it, hit send. It writes a financial summary. You read every line, check two numbers against the source, rewrite the conclusion. It writes a meeting agenda. You don't even open the preview.

You've already decided — for each type of task — how much of your attention it deserves. There is a trust map in your head, precise down to the task type, shifting with context, updating constantly based on experience:

This kind of task? Just show me "done."

That kind of task? I need to see the evidence.

This other kind? Don't even start without checking with me first.

This is the most valuable data about how to build a better agent. And it evaporates completely. No system captures it. No system even tries.

How many agents can one person manage? Today, maybe five to ten — not because the agents are slow, but because every output demands the same undifferentiated attention. A calibrated trust map is what turns that from an O(n) review burden into something that actually scales.

I've corrected the same formatting issue maybe forty times now. The AI has apologized forty times. We are both very committed to this relationship.

What you do after delivery is the real feedback

People are terrible at describing what they want. Ask someone how they'd like a report structured and you'll get a vague answer. Watch them rewrite your report and you'll learn more in five minutes than in an hour of conversation.

People don't know what they think until they see what they don't want. Standards are never described — they are corrected into existence.

And yet the entire feedback surface in today's AI systems is conversational. The system captures what you type in the chat window. It captures nothing about what happens after delivery — the edits, the rewrites, the deletions, the takeovers. Whether you hit "accept" in two seconds or spent forty-five minutes rewriting, the system registers the same event. It cannot distinguish "this is perfect" from "this is so wrong I'm starting over."

It's like managing an employee who hands in work, watches you rewrite it at your desk every evening, and never once thinks: maybe I should do it differently next time.

The patterns are hiding in plain sight:

What do they always rewrite? — Procedural gap.

What do they accept without looking? — Earned trust.

What do they take over entirely? — Boundary.

What do they keep doing after the AI stops? — The part of the job the AI never knew existed.

When AI finishes a task and the human keeps working — restructuring, adding evidence, changing the framing — that's not extra work. That is the work. The AI just didn't know the job extended that far.

Why this matters now, and more every day

A year ago, AI output was bad enough that the bottleneck was obvious: make the model smarter. Today, the output is good enough that a subtler bottleneck has emerged. Not intelligence — fit.

The model writes something 80% right. You spend your time on the last 20%. That 20% is different for every person, every task, every context. No base model improvement will fix it. No longer context window will fix it. The only thing that fixes it is watching what humans actually do to the output.

The better models get, the more this matters. When output is terrible, you throw it away — there's nothing to learn from rejection. When output is 80% right, every edit is a precise, high-signal correction. Every rewrite is a lesson in taste. The better the AI gets, the more valuable the behavioral trace becomes.

We are building memory systems that get better at listening to what people say, at the exact moment when what people do is becoming the more important signal.

The last mile isn't intelligence. It's taste.

Every task that can be optimized by reinforcement learning will eventually be optimized. Coding is already there — code can be tested, so you can RL your way to better code. But most knowledge work has no test suite. The "test" is a human looking at the output and deciding whether it meets their standard.

That standard — taste, judgment, the instinct for "good enough" — is the last mile. You can't prompt your way to it. You can't configure your way to it. You can only learn it by watching what people do when they think nobody's watching.

The next generation of AI won't be the smartest. It'll be the first one that actually pays attention after it stops talking.

AI Remembers Everything You Say. It Learns Nothing From What You Do

The chef who never checks the plates

You already have a trust map. You just never wrote it down.

What you do after delivery is the real feedback

Why this matters now, and more every day

The last mile isn't intelligence. It's taste.