During a recent GTM audit, I deleted four tags in under five minutes. I also spent a full week deciding what to do about one.
The difference wasn't complexity. The dead scroll tracking library sending data to Universal Analytics endpoints that stopped processing in July 2023 was arguably more complex than the customer ID encoder that hadn't fired in a year. But the scroll library had zero downside to removal (GA4's native scroll tracking had recorded over 90,000 events in the same period), while the customer ID encoder pushed values to the data layer that another system might be reading. "Might" was enough to pause.
That gap showed up repeatedly through the session. By the end, I had a heuristic for it: three questions I've been applying to every GTM change since.
A three-question test for GTM changes
Every finding in a GTM audit sits somewhere on a confidence spectrum. At one end, you act immediately. At the other, you need to talk to people before touching anything. Three questions tell you where you are:
1. Can I verify the change is safe with data I have right now?
If the audit data, combined with GA4 event data and what you can see in Preview mode, gives you enough evidence to confirm a change is safe, you can move fast. Dead UA tags with no equivalent functionality elsewhere in the container are the cleanest example: the evidence is self-contained, and you don't need to check with anyone.
If verifying the change requires information you don't have (what another team's system reads from the data layer, whether a tag feeds a reporting dashboard someone depends on), you slow down.
2. Is the fix reversible?
GTM has version control built in. You can always publish a new version that restores previous behavior. But "reversible" in practice means something more specific: can you undo the change without downstream consequences?
Deleting a dead tag is reversible. Changing consent types on advertising tags is technically reversible, but if you switch 44 tags from one consent mode to another, the conversion volume in Google Ads changes immediately, and reverting a week later means a week of inconsistent reporting. The change is undoable. The data gap is not.
Pausing a tag (rather than deleting it) is the middle ground. It stops execution without destroying the configuration. When you're not sure, pause. You can always delete later.
3. Does the fix touch systems outside GTM?
A tag that only sends data to GA4 is contained within a system you can observe. A tag that writes to Salesforce form fields, triggers a webhook, or feeds a data layer that other scripts read has external dependencies that the audit can't fully map.
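The reason "who reads this?" can't be answered from inside GTM is that any script on the page can observe data layer pushes by wrapping `push()`. A minimal simulation of the problem (the event and field names here are illustrative, not from the audited container):

```javascript
// The data layer is just an array; GTM and third-party scripts alike
// can intercept writes to it by replacing its push method.
const dataLayer = [];

// Some other team's script, loaded anywhere on the page:
const originalPush = dataLayer.push.bind(dataLayer);
dataLayer.push = function (event) {
  if (event && event.encodedCustomerId) {
    // ...forward to a CRM, a webhook, another analytics tool, etc.
    console.log("external reader saw:", event.encodedCustomerId);
  }
  return originalPush(event);
};

// The GTM tag under audit, pushing as usual:
dataLayer.push({ event: "customer_id_ready", encodedCustomerId: "abc123" });
```

Nothing in the GTM container reveals that the wrapper exists, which is exactly why an audit can't fully map these dependencies.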
The customer ID encoder I paused for a week was a good example. It hadn't fired in over a year based on GA4 event data. But it pushed to the data layer, and another system could have been reading from that data layer push independently of GA4. The risk wasn't technical. It was organizational: I didn't know enough about the broader system to be sure.
When all three questions are favorable (you can verify, it's reversible, it's self-contained), action is fast. When any one is unfavorable, you slow down or stop.
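The test reduces to a small decision function. This is my own sketch of one reasonable mapping (the function and field names are mine, not from any GTM tooling): all three favorable means act; an irreversible or externally-coupled change means defer; a change that's safe except for missing evidence means one more verification step.

```javascript
// Triage a GTM audit finding with the three-question test.
// Returns "act" (high confidence), "verify" (medium), or "defer" (low).
function triage(finding) {
  const { verifiableNow, reversible, containedInGtm } = finding;
  if (verifiableNow && reversible && containedInGtm) return "act";
  // Irreversibility or external dependencies are organizational risks:
  // no amount of solo checking resolves them, so defer and document.
  if (!reversible || !containedInGtm) return "defer";
  // Only the evidence is missing: one quick check converts this to "act".
  return "verify";
}

console.log(triage({ verifiableNow: true, reversible: true, containedInGtm: true }));
console.log(triage({ verifiableNow: false, reversible: true, containedInGtm: true }));
console.log(triage({ verifiableNow: true, reversible: true, containedInGtm: false }));
```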
High confidence: act immediately
The dead scroll tracking library was the clearest case. A third-party JavaScript library was loading on every page to send scroll depth events to a UA property ID. The UA property stopped processing data in July 2023. GA4's native scroll tracking was active and had months of data. The library added page weight for zero functional value.
Verifiable with data on hand? Yes: the UA property was dead and GA4's native scroll tracking was active. Reversible? Yes: the tag could be restored from version history. Contained within GTM? Yes: scroll tracking goes to analytics only.
Deleted in minutes.
Consent timing was similar, once the correct configuration was understood. The CMP was set to a 0ms wait for consent, which meant tags could fire before the consent dialog loaded. The fix was a single setting change (0ms to 500ms), already validated on staging. The evidence was clear, the change was reversible, and it touched only the consent timing mechanism.
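For CMPs built on Google Consent Mode, this wait maps onto the `wait_for_update` parameter of the consent default command. A sketch of the corrected default (the consent states are illustrative; the snippet falls back to `globalThis` so it runs outside a browser, where production code targets `window`):

```javascript
// Consent Mode "default" command, set before any tags fire.
const w = typeof window !== "undefined" ? window : globalThis;
w.dataLayer = w.dataLayer || [];
function gtag() { w.dataLayer.push(arguments); }

gtag("consent", "default", {
  ad_storage: "denied",
  analytics_storage: "denied",
  // Hold tags for up to 500ms while the CMP loads and updates consent
  // state. At 0ms, tags could fire before the consent dialog loaded.
  wait_for_update: 500,
});
```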
Medium confidence: one more check
Medium confidence findings need one additional check to convert to high confidence. The check is usually fast, but skipping it would be reckless.
A YouTube iframe API loader tag was probably redundant. GTM has native YouTube video triggers. But "probably" isn't good enough when you're dealing with a client's production container. The check: open GTM settings, confirm that "Add JavaScript API support" is enabled for the YouTube trigger. Thirty seconds in the GTM interface converted medium confidence to high confidence. The tag was redundant. Deleted.
The consent mode fixes were medium confidence until an AI copilot explained the difference between GTM's Built-In and Additional Consent Checks. Six tags were set to NOT_NEEDED (bypass consent entirely) when they should have been respecting consent. The fix was straightforward, but only after understanding what each consent level actually does. Without that explanation, I would have either applied the wrong level or not made the change at all.
This is where having structured audit data alongside an AI copilot pays off. The data tells you something is wrong. The copilot tells you how to fix it correctly. I needed both.
Low confidence: defer and document
Low confidence findings aren't necessarily the most complex. They're the ones with unanswered organizational questions.
The customer ID encoder hadn't produced a GA4 event in a full year. By most metrics, it was dead. But it wrote to the data layer, and I couldn't confirm that no other system was reading from that push. Deleting it would be technically easy and operationally risky. I paused it instead and flagged it for the team to confirm.
A 44-tag consent type correction was technically clear: each tag needed a specific consent configuration based on its type. But changing consent types on advertising tags shifts conversion volume in ad platforms. Running 44 changes without telling the demand gen team that their Google Ads conversion numbers would look different for a few days is a communication problem, not a technical one. The fix was deferred until the client could be briefed.
The UTM attribution system (hundreds of lines of custom JavaScript) was the lowest confidence item. The code was parsing cookies that no longer existed, but three different fix paths were possible, and each one depended on answers the client hadn't provided yet: Which system should own attribution? Does HubSpot's native UTM handling cover the reporting needs? Is the Salesforce integration still active? No amount of technical analysis resolves questions about organizational intent.
Using the test in practice
The three-question test is most useful when applied to groups of findings, not individual ones. An audit that surfaces 50 findings will typically produce a distribution something like:
- 10 to 15 high confidence findings you can act on immediately (dead code, straightforward consent fixes, orphaned triggers)
- 15 to 20 medium confidence findings that need a quick verification step
- 10 to 20 low confidence findings that require organizational context
That distribution is useful in itself. It tells you how much you can do independently versus how much requires team coordination. It shapes your timeline and your expectations for what one session can accomplish.
If your audit data is already organized by effort (quick wins, focused work, structural projects), the planning falls into place. High-confidence quick wins first. Medium-confidence items with their verification steps next. Low-confidence items packaged as documented questions for the team.
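With both a confidence level and an effort bucket on each finding, that sequencing is just a sort. A sketch under those assumptions (the field names and bucket labels are mine, not from any audit tool):

```javascript
// Order audit findings for a working session: high-confidence quick wins
// first, then medium-confidence items with their verification steps, then
// low-confidence items packaged as documented questions rather than changes.
const CONFIDENCE_ORDER = { high: 0, medium: 1, low: 2 };
const EFFORT_ORDER = { "quick win": 0, "focused work": 1, "structural project": 2 };

function planSession(findings) {
  return [...findings].sort(
    (a, b) =>
      CONFIDENCE_ORDER[a.confidence] - CONFIDENCE_ORDER[b.confidence] ||
      EFFORT_ORDER[a.effort] - EFFORT_ORDER[b.effort]
  );
}

const plan = planSession([
  { name: "UTM attribution rewrite", confidence: "low", effort: "structural project" },
  { name: "delete dead scroll library", confidence: "high", effort: "quick win" },
  { name: "YouTube loader check", confidence: "medium", effort: "quick win" },
]);
console.log(plan.map((f) => f.name));
// → ["delete dead scroll library", "YouTube loader check", "UTM attribution rewrite"]
```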
The audit that prompted this framework started at roughly 30 out of 100 on functional health. After one session of working through the high and medium confidence findings, the compliance issues were resolved and the score improved. The low-confidence items became a concrete backlog with specific blocking questions, not a vague list of things to worry about.
I've found it useful to know not just what to fix, but why I'm choosing to wait on the rest. The team appreciates that too.