I have a theory about AI tools that I've confirmed across every project I've used them on: the quality of the output is directly proportional to the quality of the input. Garbage in, garbage out. Good data in, good data out. It sounds obvious, but most people working with LLMs spend their time on prompts when they should be spending their time on data.
This is especially true for GTM audits. A container JSON export is a configuration file designed for Google's systems. It describes what exists: tags, triggers, variables, folders, consent settings, all nested inside each other. When you load that into an AI and ask "what's wrong," you get observations that are technically true but operationally thin. "You have 47 Custom HTML tags." "Some triggers appear to have broad firing conditions." "Consider reviewing your consent setup." Sure, but which tags? What's wrong with them? What should they be instead?
The AI isn't the bottleneck. The input format is.
Components of a structured audit dataset
A structured audit dataset for a GTM container typically includes:
Categorized findings. Not "here's everything wrong" but "here are the consent issues, here are the dead code issues, here are the GA4 data quality issues." Categorization means you work through one area at a time, and the AI can focus on a specific domain instead of trying to process everything at once.
Scored impact. Not every finding matters equally. A tag bypassing consent entirely is a compliance risk. A tag with a non-standard name is a hygiene issue. Scoring tells you what matters now versus what can wait.
Affected tags. Each finding lists the specific tags involved. "Six tags bypass consent" is a start. Listing those six tags by name, type, and platform (Google Ads, Facebook, LinkedIn) is what you need to actually open GTM and fix things.
Effort classification. Some fixes take five minutes. Some take an afternoon. Some are projects that need scoping and coordination. Knowing which is which before you start lets you batch the quick wins and plan the bigger work separately.
Evidence. What triggered the finding? Which line of code, which configuration, which trigger condition? Evidence lets you verify the finding instead of trusting it on faith. It also builds confidence to act: when you can see why something was flagged, you're more likely to fix it instead of deferring it.
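Put together, a single finding record might look something like the sketch below. The field names and values here are illustrative, not a fixed schema; the point is that each component above maps to a concrete, machine-readable field.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """One structured audit finding (field names are illustrative)."""
    category: str        # e.g. "consent", "dead_code", "ga4_data_quality"
    impact: int          # scored severity: 1 (hygiene) up to 5 (compliance risk)
    effort: str          # "quick_win", "afternoon", or "project"
    affected_tags: list  # name, type, and platform for each tag involved
    evidence: str        # the configuration detail that triggered the rule

consent_bypass = Finding(
    category="consent",
    impact=5,
    effort="quick_win",
    affected_tags=[
        {"name": "Google Ads - Remarketing", "type": "sp", "platform": "Google Ads"},
    ],
    evidence='consentSettings.consentStatus == "NOT_NEEDED"',
)
```

A record like this is something an AI copilot can reason over directly: filter by category, sort by impact, batch by effort, verify against evidence.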
Raw JSON vs. structured data in practice
With raw JSON, a typical session starts with: "Analyze this container and tell me what needs fixing." The AI skims the file, identifies patterns, and returns a list mixing compliance risks with naming inconsistencies. You spend as much time triaging the output as you would have spent auditing manually.
With structured data, you can say: "Here are the consent findings. Walk me through each one. For each tag, tell me what the correct configuration should be." The AI spends its compute on interpretation and research instead of parsing.
I think of it as two layers. The deterministic layer (structured audit rules) handles work that rules do reliably: identifying misconfigured consent, flagging dead code, detecting duplicates, checking naming conventions. Rules don't hallucinate. They either match or they don't. The probabilistic layer (AI copilot) handles work that needs synthesis: explaining why a finding matters, researching the correct fix for a specific tag type, pulling platform documentation and interpreting it in context.
A practical example: the structured data tells me that six advertising tags fire without consent. That's a rule match. Clear, verifiable, no ambiguity. I then ask the copilot what the correct consent configuration should be for each tag type, and it researches the specific enforcement levels based on each platform's requirements. The rule caught the problem. The AI figured out the fix. Each layer does what it's good at.
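The consent-bypass rule from this example can be sketched in a few lines. The key names follow the GTM container export schema as I understand it (a top-level `containerVersion` holding a `tag` array, with per-tag `consentSettings`); verify them against your own export before relying on this.

```python
def find_consent_bypass(container_export: dict) -> list[str]:
    """Return names of tags whose consent status is NOT_NEEDED.

    Deterministic rule: it either matches or it doesn't. Key names mirror
    the GTM container export format; treat them as assumptions to verify.
    """
    tags = container_export.get("containerVersion", {}).get("tag", [])
    return [
        t["name"]
        for t in tags
        if t.get("consentSettings", {}).get("consentStatus") == "NOT_NEEDED"
    ]

# Minimal fixture standing in for a real export
export = {"containerVersion": {"tag": [
    {"name": "Google Ads - Remarketing", "type": "sp",
     "consentSettings": {"consentStatus": "NOT_NEEDED"}},
    {"name": "GA4 - Page View", "type": "gaawe",
     "consentSettings": {"consentStatus": "NEEDED"}},
]}}

print(find_consent_bypass(export))  # → ['Google Ads - Remarketing']
```

Everything past this point, such as what the correct enforcement level should be for a Google Ads tag versus a Facebook tag, is the copilot's job, not the rule's.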
Data doesn't lie, but it also doesn't tell you much on its own. It needs interpretation. When the data going in is clean, categorized, and specific, the interpretation that comes out is noticeably better.
A real friction point: findings-first vs. tags-first
One thing I ran into during my own audits: structured data is typically organized by finding ("here are all the consent issues"), but the actual work happens by tag ("I'm in GTM editing tag 65, what's wrong with it?"). When you're in the GTM interface, you want everything about that tag in one place.
The copilot can bridge this. You ask "tag 65 appears in three findings: consent bypass, jQuery dependency, dead UA reference. Give me the full picture." That works, but it's an extra step every time you switch tags. Ideally your structured data supports both directions from the start: "show me all consent issues" and "show me everything about tag 65." If you're building your own audit process, a tag-level profile alongside the category-level report gives you that flexibility.
This came up repeatedly: mentally reassembling a tag's findings from different sections of the report added friction every time I switched tags. A format that answers both "what's wrong in this category?" and "what's wrong with this tag?" without reformulating removes that step entirely.
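Supporting both directions is a small pivot over the same data. A minimal sketch, assuming findings shaped roughly like the records described earlier (the exact fields are illustrative):

```python
from collections import defaultdict

# Findings-first input: each finding lists the tags it affects
findings = [
    {"category": "consent", "issue": "bypasses consent",
     "tags": ["Tag 65", "Tag 12"]},
    {"category": "dead_code", "issue": "references retired UA property",
     "tags": ["Tag 65"]},
    {"category": "dependencies", "issue": "relies on jQuery",
     "tags": ["Tag 65"]},
]

# Tags-first index: everything known about one tag, in one place
by_tag = defaultdict(list)
for f in findings:
    for tag in f["tags"]:
        by_tag[tag].append(f"{f['category']}: {f['issue']}")

print(by_tag["Tag 65"])
# → ['consent: bypasses consent',
#    'dead_code: references retired UA property',
#    'dependencies: relies on jQuery']
```

Generating this index once, alongside the category report, means neither you nor the copilot has to reassemble a tag's full picture on every switch.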
You can build this yourself
The structured audit layer doesn't have to come from any specific tool. You could export the container JSON, set up a Claude project with your own rules, and produce a dataset that works exactly how you want. I know consultants who have done exactly this: defined their own categories, their own scoring, their own output format, and standardized it across every client container they manage. The upfront work pays off when you're auditing your twentieth container and the process is the same every time.
The approach that works well: start with the categories that matter most for your clients. Consent compliance. Dead code. Measurement accuracy. Define what "bad" looks like for each one (a tag set to NOT_NEEDED that should respect consent, a UA property ID that stopped processing two years ago, duplicate event tags for the same conversion). Write those as rules. Tell the AI to check each rule against the container JSON and return findings with the specific tags involved. Build it up over time. Each audit teaches you something new to check for, and the rule set grows.
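A rule set built this way can stay very simple: a list of named predicates run against every tag in the export. The sketch below assumes the GTM export layout described earlier and the conventional type ids for Universal Analytics tags; both are assumptions to check against your own containers.

```python
# Each rule is a name plus a predicate over one tag. Adding a rule after
# an audit teaches you something new is one line in RULES.

def is_ua_tag(tag: dict) -> bool:
    # Universal Analytics tags use type "ua" in exports (verify on your data)
    return tag.get("type") == "ua"

def bypasses_consent(tag: dict) -> bool:
    return tag.get("consentSettings", {}).get("consentStatus") == "NOT_NEEDED"

RULES = [
    ("dead UA tag", is_ua_tag),
    ("consent bypass", bypasses_consent),
]

def run_rules(container_export: dict) -> list[tuple[str, str]]:
    """Return (rule name, tag name) pairs for every rule match."""
    tags = container_export.get("containerVersion", {}).get("tag", [])
    return [(rule_name, tag["name"])
            for rule_name, predicate in RULES
            for tag in tags if predicate(tag)]

export = {"containerVersion": {"tag": [
    {"name": "UA - Pageview", "type": "ua",
     "consentSettings": {"consentStatus": "NOT_SET"}},
    {"name": "Meta Pixel", "type": "html",
     "consentSettings": {"consentStatus": "NOT_NEEDED"}},
]}}

print(run_rules(export))
# → [('dead UA tag', 'UA - Pageview'), ('consent bypass', 'Meta Pixel')]
```

Whether the predicates live in code like this or as written rules a Claude project applies to the JSON, the output is the same kind of structured dataset.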
The key is that the output needs to be structured enough for the AI to work from later. A paragraph of prose about consent issues is harder to work with than a list of six tag names with their current consent configuration and what it should be instead. The more specific the dataset, the more specific the copilot's responses when you start the interpretation phase.
TagManifest is how I've codified my own approach. It runs 85 rules across 10 categories and produces the structured dataset in about two minutes, with scoring, effort classification, and evidence already assembled. Free, browser-based, nothing leaves your machine. That saves the 15 to 30 minutes of manual assembly per container and adds rules refined across dozens of them. But the principle holds regardless of how you get the structured data.
The investment in structuring your data pays off every time you use it. The first audit takes longer because you're building the process. The fifth audit takes a fraction of the time because the structured approach is established and the AI knows exactly what to do with it.
When you sit down with an AI copilot to work through a GTM audit, the copilot needs something structured to work from. The container JSON is the raw material. The structured audit is where the real work starts.