Code First or Concept First? The Real Dilemma Stopping Your ML Career
Why knowing model.fit() isn't enough, and when reading the math is a waste of time.
Epoch 1/100: ... Loss: NaN

You’ve checked your code 20 times. The pipeline loads, the tensors have the right shape, the API calls are all correct. The bug, you slowly realize, isn't in your code. It's in your concept.
This is the central dilemma of our field. Every engineer lives on a spectrum between two extremes:
The Framework Specialist: This engineer knows every API call in Keras, PyTorch, and Scikit-learn. They can build and deploy a complex pipeline in a weekend. But ask them why the model is converging poorly, and their main tool is random hyperparameter tuning. They treat the model as a black box.
The Theoretical Purist: This engineer can derive backpropagation from first principles and explain the statistical proofs behind six different optimizers. But ask them to ship a production endpoint, and they get lost in data pipelines, environment configs, and API latency issues.
The secret? The best engineers aren't just one or the other. They live in the middle. They use code to build intuition and concepts to debug reality.
The "Code-First" Trap (And Why We All Fall For It)
When you're starting out, "code-first" is the only way to learn. It feels productive. Frameworks like PyTorch and Hugging Face are designed to abstract the math away so you can build amazing things fast.
This is a good thing... until it isn't.
This approach optimizes you for the "happy path": the 90% of the time when everything just works. The tutorial runs, the model trains, the loss goes down.
The problem comes when the abstraction leaks. When the model fails silently (like converging to 50% accuracy on a binary problem) or fails loudly (that dreaded NaN), API knowledge is suddenly useless. You're left staring at a black box, randomly tweaking learning rates, batch sizes, and activation functions, hoping something sticks.
This isn't engineering. It's gambling.
When "Concept" is Your Only Flashlight
The purpose of theory isn't to win arguments. The purpose of theory is to give you a flashlight and a map when you’re lost in the dark.
When your code breaks, the bug is almost never a syntax error. It's a violated mathematical assumption. Concept is what allows you to debug systematically instead of just guessing.
Example 1: The Exploding Loss (NaN)
Symptom: Your loss instantly goes to NaN or Inf.
Code-Only Debugging: "My learning rate is too high." You lower it. Still NaN. "Maybe I'll switch from Adam to SGD." Still NaN. "Maybe my data isn't normalized?" You add normalization. Still NaN.
Concept-Driven Debugging: "A NaN loss means an operation resulted in an undefined value, like a divide-by-zero or the log of zero. An Inf loss means the numbers got too big (an exploding gradient). This is a mathematical stability problem."
This conceptual understanding doesn't just give you a guess; it gives you a targeted menu of solutions:
The Concept: Poor weight initialization is causing activations to compound and explode during the forward pass, leading to massive gradients.
The Code Fix: You're using a ReLU activation. The "He" (or Kaiming) initialization concept was designed specifically for this. You add it to your layers:
# Instead of the default, you explicitly apply the right concept
for module in self.modules():
    if isinstance(module, torch.nn.Linear):
        torch.nn.init.kaiming_uniform_(module.weight, nonlinearity='relu')

The Concept: The gradients themselves are just too large on a specific batch.
The Code Fix: Implement gradient clipping. This is a purely conceptual fix for an unstable math problem. It puts a "ceiling" on the gradient vector's norm before the optimizer step.
# During the training loop
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # The "conceptual fix"
optimizer.step()
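Before reaching for either fix, you can let the concept guide your diagnostics. Here is a minimal sketch (mine, not from any framework's docs) that uses PyTorch forward hooks and torch.isfinite to report which module's activations stop being finite; the model and batch names are hypothetical stand-ins for yours.

# Diagnostic sketch: flag each module whose output contains NaN or Inf.
# Hooks fire in forward-execution order, so the first line printed
# points at where the math first broke.
import torch

def add_nan_hooks(model):
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                print(f"Non-finite activation in: {name}")
        return hook
    for name, module in model.named_modules():
        if name:  # skip the root module itself
            module.register_forward_hook(make_hook(name))

# Usage (hypothetical names):
# add_nan_hooks(model)
# model(batch)  # now you know *where* the numbers blew up, not just that they did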
Example 2: The "Good Training, Bad Val" Mystery
Symptom: Your training accuracy hits 99%, but your validation accuracy is stuck at 55% (barely better than a coin flip).
Code-Only Debugging: "It's classic overfitting!" You add more nn.Dropout layers. You increase weight_decay. It helps a little, but the gap is still huge.
Concept-Driven Debugging: "Such a massive gap isn't just overfitting; it's a data distribution mismatch. The model learned rules that only apply to the training set's unique quirks. This means my validation set doesn't look like my training set."
The solution isn't in your model.py file. The concept of statistical drift tells you to stop debugging the model architecture and start investigating the data pipeline. Is your validation split truly random? Are you leaking future information into your training data? Is your training data from one source (e.g., California users) and your validation data from another (e.g., New York users)?
The concept told you exactly where the "bug" really was in the data, not the code.
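One cheap way to test the "my validation set doesn't look like my training set" hypothesis is adversarial validation: train a small classifier to tell the two splits apart. Below is a minimal sketch under the assumption of tabular features in NumPy arrays (X_train and X_val are hypothetical names). If the cross-validated AUC sits well above 0.5, the splits are distinguishable and you have a drift problem, not an architecture problem.

# Adversarial validation: can a model tell training rows from validation rows?
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def adversarial_validation_auc(X_train, X_val):
    # Label each row by which split it came from, then try to predict that label.
    X = np.vstack([X_train, X_val])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_val))])
    clf = GradientBoostingClassifier()
    scores = cross_val_score(clf, X, y, cv=3, scoring="roc_auc")
    return scores.mean()  # ~0.5: splits look alike; close to 1.0: severe mismatch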
But Theory Won't Write the Pipeline
Let's be clear: this isn't an argument for living in a library. The "Theoretical Purist" is just as stuck as the "Framework Specialist."
A paper presents a concept in a vacuum. A real-world project forces that concept to meet messy data, compute budgets, latency requirements, and code versioning.
You will never truly understand an attention head from just reading the "Attention Is All You Need" paper. You will only understand it when you try to code one from scratch in NumPy or PyTorch, watch your matrix dimensions mismatch 15 times, and finally realize why you need masking, why Q and K are multiplied, and what the softmax is actually "paying attention" to.
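For what it's worth, here is roughly what that from-scratch exercise looks like: a bare-bones NumPy sketch of a single attention head, not a drop-in replacement for any framework's implementation. The shapes and the causal mask are the parts that usually bite you.

# A single scaled dot-product attention head, from scratch.
# q, k, v: arrays of shape (seq_len, d_k). Causal masking is optional.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(q, k, v, causal=True):
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)            # why Q and K are multiplied: pairwise similarity scores
    if causal:
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)  # why you need masking: no peeking at future positions
    weights = softmax(scores, axis=-1)         # what softmax "pays attention" to: a distribution over keys
    return weights @ v                         # weighted sum of the values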
Code is the lab where theory gets proven or broken. It’s the sandpit that turns abstract knowledge into concrete intuition.
The Verdict: How to Balance Your Time
So, what's the answer? Stop asking "code or concept?"
Start practicing Concept-Driven Debugging.
Here is the simple, actionable framework I use:
When your code works, ship it. Don't fall into a theory rabbit hole you don't need. If the simple API call works and the metrics are good, move on. Your job is to deliver results.
When your code breaks (or underperforms), stop debugging the code. The bug is almost never in the syntax; it's in the concept. Go read the paper. Open the textbook. Ask why the math would fail. Find the concept you violated, then use that to find the specific line of code to fix.
Your goal isn't to be a mathematician or just a software engineer. Your goal is to build intuition. Intuition is the magic that connects theory to practice, and it’s the only thing that will solve that next, inevitable Loss: NaN.
