Problem Solving · Post-Silicon · Methodology · Yield
My Problem-Solving Playbook for Post-Silicon Failures
When a chip fails and millions of dollars are on the line, panic is not a strategy. Here is the structured approach I use to go from "something is wrong" to root cause.
3 min read
A production lot just failed. Yield dropped 8 points overnight. The program manager wants answers by end of day. The fab says nothing changed on their end. The test house says the testers are fine. Everyone is pointing fingers.
This is Tuesday.
Over the years I have developed a problem-solving framework that works reliably under this kind of pressure. It is not revolutionary. It is disciplined. And discipline is what separates a fast resolution from weeks of thrashing.
Step 1: Contain Before You Investigate
The first instinct is to start debugging. Resist it. Before you understand anything, you need to stop the bleeding. Put the affected lots on hold. Tighten the test limits if needed to protect outgoing quality. Notify the downstream stakeholders.
Containment buys you time to investigate properly instead of rushing to a wrong conclusion under pressure to release material.
Step 2: Define the Problem With Data, Not Opinions
"Yield dropped" is not a problem statement. "Bin 7 leakage failures increased from 0.3% to 4.1% starting on lot N, affecting only wafers from the center of the boat, across all three test insertions" is a problem statement.
The difference is specificity. A vague problem leads to a vague investigation. I pull the data immediately: yield trend by lot, bin pareto comparison (before vs. after), parametric distributions for the failing vs. passing populations, and wafer maps showing spatial patterns.
Eighty percent of the time, the data tells you where to look before you touch any hardware.
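The bin pareto comparison is the quickest of these pulls. A minimal sketch of it, using plain Python and illustrative data (bin 1 as the pass bin, other bins as failure categories, one entry per tested unit):

```python
from collections import Counter

def bin_pareto_shift(before_bins, after_bins):
    """Compare per-bin failure rates before vs. after a suspected change.

    before_bins / after_bins: lists of bin numbers, one entry per tested
    unit (bin 1 = pass). Returns (bin, before_rate, after_rate, delta)
    tuples sorted by largest rate increase first.
    """
    n_before, n_after = len(before_bins), len(after_bins)
    before = Counter(before_bins)
    after = Counter(after_bins)
    rows = []
    for b in sorted(set(before) | set(after)):
        if b == 1:  # skip the pass bin; we rank failure categories only
            continue
        r_before = before[b] / n_before
        r_after = after[b] / n_after
        rows.append((b, r_before, r_after, r_after - r_before))
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

Run against the scenario above (bin 7 going from 0.3% to 4.1%), the top row of the output is the problem statement's first half, already written for you.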
Step 3: Isolate Variables Systematically
Now comes the actual detective work. The goal is to isolate what changed. I think in terms of the 5M framework: Man, Machine, Material, Method, Measurement.
Was there a new operator or a shift change at the test house? (Man)
Did the tester or handler have a maintenance event? (Machine)
Did the fab change a process step or material supplier? (Material)
Was the test program updated or a new limit applied? (Method)
Is the measurement system itself drifting? (Measurement)
I do not brainstorm all possibilities and then try to prove each one. I use the data from Step 2 to eliminate categories. If the failure is on specific wafer positions, it is almost certainly Material or Machine, not Method. If it affects all products on one tester but not others, it is Machine or Measurement.
Each elimination narrows the search space. Within two or three rounds of data queries, you typically have one or two hypotheses left.
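The elimination logic itself is just set intersection. A toy sketch, where the observation names and the categories each one keeps plausible are illustrative assumptions mirroring the examples above:

```python
# The 5M framework: Man, Machine, Material, Method, Measurement.
FIVE_M = {"Man", "Machine", "Material", "Method", "Measurement"}

# Which categories each observation keeps plausible (illustrative mapping,
# not an exhaustive rulebook).
EVIDENCE_KEEPS = {
    "failure_tracks_wafer_position": {"Material", "Machine"},
    "failure_isolated_to_one_tester": {"Machine", "Measurement"},
    "failure_started_with_program_release": {"Method"},
}

def surviving_categories(observations):
    """Intersect the plausible-category set of every observation.

    Each observation rules out the categories it is inconsistent with;
    whatever survives the intersection is the hypothesis short list.
    """
    survivors = set(FIVE_M)
    for obs in observations:
        survivors &= EVIDENCE_KEEPS[obs]
    return survivors
```

Two observations (position-dependent failures, confined to one tester) already collapse the five categories to one: Machine. That is the "two or three rounds of data queries" in practice.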
Step 4: Verify With a Controlled Experiment
A hypothesis is not a root cause until you can demonstrate it. If I suspect a tester contact issue, I retest the failing devices on a known-good tester. If I suspect a fab process shift, I compare parametric data across wafer lots that straddled the suspected change date.
The experiment must be designed to either confirm or kill the hypothesis. If it is ambiguous, the experiment was poorly designed.
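The confirm-or-kill criterion can be made explicit before the experiment runs. A sketch for the tester-contact case (the 90% threshold is an assumption for illustration, not a standard):

```python
def retest_verdict(results, recover_threshold=0.9):
    """Interpret a retest of failing devices on a known-good tester.

    results: dict mapping device id -> True if the device passed on the
    known-good tester, False if it still failed. If most devices recover,
    the original tester's contact is implicated; if most still fail, the
    failures travel with the devices and the hypothesis is dead.
    """
    recovered = sum(results.values()) / len(results)
    if recovered >= recover_threshold:
        return "confirmed: failures stay with the original tester"
    if recovered <= 1 - recover_threshold:
        return "killed: failures travel with the devices"
    return "ambiguous: redesign the experiment"
```

Deciding the thresholds up front is the point: if you only interpret the numbers after the fact, every outcome looks ambiguous and every hypothesis survives.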
Step 5: Fix, Validate, and Prevent Recurrence
Finding the root cause is only half the job. The corrective action needs to address why the problem happened and why it was not caught sooner. I always ask two questions: "What do we fix?" and "What monitoring do we add so this never surprises us again?"
That monitoring usually becomes a new feature in our analytics platform or a new alert in our predictive model. Every failure is an opportunity to make the system smarter.
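The simplest version of such an alert is a control limit on the bin failure rate per lot. A minimal SPC-style sketch, assuming per-lot rates are available from known-good production history (the three-sigma limit is the conventional starting point, not a universal rule):

```python
import statistics

def bin_rate_alert(history, latest, sigma=3.0):
    """Flag a lot whose bin failure rate exceeds the historical mean
    plus `sigma` standard deviations (a basic SPC upper control limit).

    history: per-lot failure rates for this bin from healthy production
    latest:  the newest lot's failure rate for the same bin
    Returns (alert_fired, upper_control_limit).
    """
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    upper_limit = mean + sigma * stdev
    return latest > upper_limit, upper_limit
```

With a healthy baseline around 0.3%, the 4.1% lot from the opening scenario fires this alert on arrival instead of surprising anyone overnight.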
The Meta-Lesson
The framework works because it resists the most common failure mode in problem solving: jumping to conclusions. When yield drops, everyone has a theory. The fab suspects the test house. The test house suspects the design. The design team suspects the fab.
Data does not have opinions. Start there.
Updated Feb 28, 2026