
What actually happens when you fix up a GTM container with an AI copilot

A real walkthrough of using an AI copilot to audit and fix a messy GTM container. What worked, what didn't, and what the process actually looks like when you sit down and do it.

The first time I used an LLM to help with a GTM container was in 2023, mid-scramble during the Universal Analytics sunset. I needed to verify that a client's GA4 migration was complete and nothing was still firing into a dead UA property. I pasted a chunk of container config into ChatGPT and asked it to flag anything still referencing UA. It worked. Rough, but it worked.

Since then, the capabilities have improved substantially. I use Claude Code for my GTM audits now. LLMs are strong at this kind of work because we're dealing with structured data that doesn't leave much room for interpretation. The copilot earns its keep on research, documentation, and keeping a running changelog as I go. I trust but verify throughout, because they hallucinate, and a production container is not the place for confident guesses.

This is a walkthrough of a recent audit. The container had about 150 tags across four management eras. If you've audited a GTM container before, you've probably found the same things I did:

  • Custom HTML tags doing things I couldn't immediately parse
  • Consent configurations that looked right on the surface but weren't applied consistently
  • Tags referencing cookies that hadn't existed since the Universal Analytics sunset in July 2023
  • Dead JavaScript loading on every page for no functional reason
  • Naming conventions from three different contributors, none of whom documented what they built

The kind of container where everything is probably fine, but nobody can tell you which parts are definitely fine and which parts are quietly wrong.

Starting with structured data

The container JSON export was close to 500KB. I've learned from working with copilots that there's a Goldilocks problem with context. Too much in one go and you're not going to get a useful result. A 500KB JSON file in a one-shot prompt asking the AI to "audit this container" produces surface-level observations: "you have many tags," "some triggers appear unused," "consider reviewing your consent setup." Technically true. Practically useless.

It's on the operator to chunk things into manageable pieces. So before I touched anything in GTM, I created a structured dataset from the container export. Parse the container. Categorize findings by type. Score functional health separately from container hygiene. Organize issues by effort, not severity. This gives the AI something it can actually work with, and it gives me a checklist I can move through systematically.
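
That first pass can start as a short script over the export. Here's a minimal sketch using an invented stand-in for the "containerVersion" object in a GTM export (exportFormatVersion 2); a real audit would load the full 500KB file:

```python
from collections import Counter

# Invented stand-in for the "containerVersion" object in a GTM export
# (exportFormatVersion 2); a real audit would json.load the full file.
cv = {
    "tag": [
        {"name": "GA4 - Page View", "type": "gaawe", "firingTriggerId": ["1"]},
        {"name": "Legacy UA pixel", "type": "html", "firingTriggerId": ["1"]},
    ],
    "trigger": [
        {"triggerId": "1", "name": "All Pages", "type": "pageview"},
        {"triggerId": "7", "name": "Old promo click", "type": "click"},
    ],
}

# Categorize tags by type ("html" is Custom HTML, "gaawe" is a GA4 event tag).
by_type = Counter(t["type"] for t in cv["tag"])

# Triggers no tag fires on are cleanup candidates.
used = {tid for t in cv["tag"] for tid in t.get("firingTriggerId", [])}
unused = [tr["name"] for tr in cv["trigger"] if tr["triggerId"] not in used]

print(dict(by_type))  # {'gaawe': 1, 'html': 1}
print(unused)         # ['Old promo click']
```

The same loop extends to variables, folders, and consent settings. The point is producing a categorized checklist before asking the copilot anything.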

If you're skimming this and want the takeaway: get your AI to categorize the container contents, flag the problems, and produce a structured doc you can use to parse each issue individually. Don't ask it to audit and fix in one breath. The separation matters.

Planning and documenting the session

The sheer volume was the first challenge. So many tags, triggers, variables, and scripts running that it was hard to see the shape of the container. I used the Claude Code instance to plan out my strategy and document the changelog as I went. This kept me in the loop on every change and gave me a record I could hand to the client afterward.

As I ran into things I didn't understand, the copilot became a research assistant. I Google things myself, I read Simo Ahava and Analytics Mania, I check vendor docs. But having all of that embedded in the same working session meant I was building a living document: part audit, part analysis, part research log. Everything in one place.

The prioritization step mattered more than I expected. Some findings in a typical GTM audit are the kind of thing you fix last. Naming conventions, folder organization. Not going to hurt anybody. Then there's stuff that needs dealing with immediately: consent gaps, broken attribution, dead code adding page weight.
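
Once findings are structured, the sorting itself is mechanical. A sketch with invented findings: compliance risk outranks hygiene, and within a bucket, quick wins come before long projects:

```python
# A sketch of the prioritization pass, with invented findings: each one
# gets an impact bucket and an effort estimate, compliance risk sorts
# first, and quick wins come before long projects within a bucket.
findings = [
    {"issue": "Inconsistent tag naming", "impact": "hygiene", "effort_hours": 2},
    {"issue": "CMP wait set to 0ms", "impact": "compliance", "effort_hours": 0.5},
    {"issue": "Dead UTM cookie parser", "impact": "performance", "effort_hours": 4},
    {"issue": "Ads tags bypass consent", "impact": "compliance", "effort_hours": 1},
]

rank = {"compliance": 0, "performance": 1, "hygiene": 2}
worklist = sorted(findings, key=lambda f: (rank[f["impact"]], f["effort_hours"]))

for f in worklist:
    print(f"{f['impact']:>11}  {f['effort_hours']:>4}h  {f['issue']}")
```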

Over a single session, the practical wins included:

  • Fixed consent mode timing (the CMP was set to a 0ms wait, meaning tags could fire before consent loaded)
  • Corrected six tags that were bypassing consent entirely
  • Deleted four dead Custom HTML tags (roughly 220 lines of JavaScript running on every page for no reason)
  • Paused one ambiguous tag for team confirmation instead of guessing
  • Fully analyzed a broken UTM attribution system (about 640 lines of dead code parsing cookies that no longer exist)
  • Produced documentation for five different audiences
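
Findings like the six consent-bypassing tags fall out of the same structured data. In the export, each tag carries a consentSettings block; here's a sketch flagging tags with no additional consent check configured (field names as they appear in my exports, so verify against your own; the tag data is invented):

```python
# Flag tags whose export shows no additional consent requirement: a
# consentStatus of "NOT_SET" in the consentSettings block. Field names
# follow the GTM container export format; the tags here are invented.
tags = [
    {"name": "Google Ads - Conversion", "type": "awct",
     "consentSettings": {"consentStatus": "NOT_SET"}},
    {"name": "GA4 - Page View", "type": "gaawe",
     "consentSettings": {"consentStatus": "NEEDED"}},
]

bypassing = [t["name"] for t in tags
             if t.get("consentSettings", {}).get("consentStatus") == "NOT_SET"]
print(bypassing)  # ['Google Ads - Conversion']
```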

Breaking things into small enough steps

There's an expectation that you feed a GTM container to an AI and it returns a perfect audit. I don't think that's realistic. This container had tags interacting with Salesforce form fields, HubSpot tracking, Google Ads conversion pixels, a custom attribution system built on UA-era cookies, and a Cookiebot consent implementation that was half-configured. No AI is synthesizing all of that in one pass.

What actually works: break the audit into the smallest possible steps and take them one at a time. Check the consent timing. Verify the consent types on advertising tags. Figure out which Custom HTML tags still do something useful. Research whether HubSpot handles UTM tracking natively before deciding whether to keep the custom attribution code.

Each step is small enough that you can verify the answer before moving on. And the small steps compound. The copilot looked up the GA4 channel grouping regex patterns and cross-referenced 16 UTM values against them in about three minutes. Finding the right Google support page and manually checking each value would have been a 30-minute detour. The consent mode distinction between Built-In and Additional Consent Checks stopped the session cold until the copilot explained it in a way that made the fix obvious.

Research, not code generation

The research was the surprise. I expected code generation to be the main value. It wasn't.

"Can we read GA4's source/medium the way we used to read Universal Analytics?" No. UA stored attribution in a browser cookie. GA4 sends data directly to Google's servers. There's nothing in the browser to read. That killed a whole line of investigation in seconds.

"What UTM medium values does GA4 actually recognize?" The copilot pulled documentation, extracted regex patterns, cross-referenced 16 values. Five work. Eleven would show as "Unassigned." That shaped the entire recommendation.
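
That cross-reference is easy to reproduce once the patterns are in hand. A sketch with a few of the default channel group medium rules, paraphrased from Google's documentation (verify the patterns against the current docs before relying on them; the example values are invented):

```python
import re

# A few of GA4's default channel group rules for utm_medium, paraphrased
# from Google's documentation; verify against the current docs. The real
# grouping also inspects source, so "Paid" here is a simplification.
medium_rules = [
    ("Paid", re.compile(r"^(.*cp.*|ppc|retargeting|paid.*)$")),
    ("Display", re.compile(r"^(display|banner|expandable|interstitial|cpm)$")),
    ("Organic Search", re.compile(r"^organic$")),
    ("Email", re.compile(r"^(email|e-mail|e_mail|e mail)$")),
    ("Referral", re.compile(r"^referral$")),
]

def classify(medium: str) -> str:
    for channel, pattern in medium_rules:
        if pattern.match(medium):
            return channel
    return "Unassigned"  # GA4's bucket for values no rule recognizes

for m in ["cpc", "email", "newsletter", "social-footer"]:
    print(m, "->", classify(m))
```

Running the client's 16 UTM values through a table like this is the three-minute version of the 30-minute manual check.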

"Can HubSpot handle UTM tracking natively?" Specific answer: custom properties must be created manually, added as hidden form fields, auto-populated from URL parameters, synced to Salesforce via field mapping. Not out of the box. That specificity is the difference between an honest recommendation and a guess.

AI copilots are at their best when you're adjacent to your expertise. I work with GTM regularly, but I don't have every GA4 spec memorized. The copilot fills that gap with speed and specificity.

Where the copilot got it wrong

On consent, the copilot consistently pushed toward a stricter interpretation than the situation called for. That's a sensible default (err on the side of compliance), but it would have meant over-blocking tags that were legitimately configured for the client's consent approach. I had to know enough about the actual consent model to catch that.

It also made assumptions about tag dependencies. It would suggest deleting a tag based on evidence that it was firing against a dead endpoint, without considering whether another system was reading from the same data layer push. So I built checkpoints into the workflow: verify before acting, check dependencies before deleting. LLMs hallucinate. They make confident leaps. Having the structured data to check against meant I could catch those leaps before they became mistakes.
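
The dependency checkpoint is also scriptable: before deleting a tag, serialize everything else in the container and search it for the identifiers the tag touches. A sketch with invented tags:

```python
import json

# "Check dependencies before deleting" as a script: before removing a
# tag that reads or writes a shared value, search the rest of the
# container for anything else referencing it. Tag data is invented.
container = {
    "tag": [
        {"name": "Dead endpoint beacon", "parameter": [
            {"key": "html", "value": "<script>send(window.utm_data)</script>"}]},
        {"name": "Salesforce field mapper", "parameter": [
            {"key": "html", "value": "<script>map(window.utm_data)</script>"}]},
    ],
}

def references(entity, needle):
    # Serialize the whole entity so nested parameters are searched too.
    return needle in json.dumps(entity)

needle = "utm_data"
dependents = [t["name"] for t in container["tag"]
              if references(t, needle) and t["name"] != "Dead endpoint beacon"]
print(dependents)  # ['Salesforce field mapper']
```

A non-empty dependents list is the signal to pause and investigate instead of delete.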

By the end of the session, the copilot had accumulated enough context about the container, the consent requirements, and the client's setup that its suggestions got noticeably better. It just took a few rounds of correction to get there.

The remaining work plan

The container improved meaningfully in one session. Consent timing fixed. Dead code removed. The most pressing compliance issues resolved. But the majority of the work plan is still ahead: a batch of advertising tags that need consent type correction (which will affect conversion volume reporting, so the client needs a heads-up first), the UTM attribution decision (three viable paths, all blocked on client answers), and a set of paused tags waiting on team confirmation.

That's what a real GTM audit looks like. The easy wins go fast. The structural work needs coordination and organizational context. The copilot got me to clarity on what needs doing, produced the documentation to communicate it, and handled research that would have taken days to compile on my own.

The container went from roughly 30 out of 100 on functional health to measurably better, with a clear path forward for everything that remains. For a container that had been accumulating complexity across four management eras with no documentation, I'll take that from a single session.

Audit your GTM container

TagManifest gives you an instant health score and prioritized fixes.

Scan Your Container