Container Size and Health: The Entropy Curve Across 2,000 GTM Containers
The relationship between container size and health score is the most consistent pattern in our State of GTM dataset. Every additional 50 tags costs roughly 3 health points. The degradation is linear. There's no cliff, no magic number where containers break. Entropy accumulates gradually.
But the story underneath that line is more interesting than the line itself. Errors don't scale with size. Optimization findings do. The largest containers in the dataset aren't more dangerous than small ones. They're messier.
Score by container size
| Size | Containers | Mean | Median | P10 | P90 |
|---|---|---|---|---|---|
| 1-10 | 301 | 90.1 | 91 | 82 | 96 |
| 11-25 | 376 | 80.4 | 80 | 73 | 91 |
| 26-50 | 455 | 75.9 | 76 | 68 | 84 |
| 51-100 | 489 | 71.8 | 73 | 63 | 80 |
| 101-200 | 283 | 69.7 | 71 | 60 | 78 |
| 200+ | 86 | 67.3 | 69 | 60 | 75 |
The P90 column shows what "good" looks like at each size. A 200+ tag container at P90 (75) scores lower than the median 26-50 tag container (76). Size imposes a ceiling that even well-maintained containers can't fully escape.
The range doesn't widen as containers grow; it shifts down. A 1-10 tag container spans 14 points between P10 and P90 (82 to 96). A 200+ tag container spans a similar 15 points (60 to 75), but the entire range sits lower. The ceiling drops below the old floor: the best-maintained large containers (P90: 75) score below even the worst-maintained small ones (P10: 82).
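The "roughly 3 points per 50 tags" slope can be recovered from the size table itself. A quick least-squares fit of median score against bracket midpoints is one way to check it; the midpoints below are assumptions (including 250 as a stand-in for the open-ended 200+ bracket), so treat this as an illustrative sketch, not the dataset's actual regression:

```python
# Illustrative check of the "~3 points per 50 tags" claim using the
# medians from the size table. Bracket midpoints are assumptions; 250
# is an arbitrary stand-in for the open-ended 200+ bracket.
midpoints = [5.5, 18, 38, 75.5, 150, 250]
medians = [91, 80, 76, 73, 71, 69]

n = len(midpoints)
mx = sum(midpoints) / n
my = sum(medians) / n

# Ordinary least-squares slope: cov(x, y) / var(x)
slope = sum((x - mx) * (y - my) for x, y in zip(midpoints, medians)) \
        / sum((x - mx) ** 2 for x in midpoints)

print(f"{slope * 50:.1f} points per 50 tags")  # roughly -3.3
```

Even with crude midpoint assumptions, the fit lands near the stated slope, though the small brackets sit above the line, which is why the curve reads as flattening rather than strictly linear.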
Grade distribution by size
| Size | A | B | C | D/F |
|---|---|---|---|---|
| 1-10 | 58% | 41% | 1% | 0% |
| 11-25 | 11% | 73% | 16% | 0% |
| 26-50 | 2% | 59% | 37% | 2% |
| 51-100 | 1% | 34% | 59% | 6% |
| 101+ | 0% | 22% | 68% | 9% |
58% of containers with 1-10 tags score an A. Among containers with 101+ tags, zero percent achieve an A, and 68% are C grade.
This is why 93% of all A grades in the dataset come from containers with fewer than 25 tags. The A grade reflects simplicity, not quality. A 7-tag container with a Google Tag config and two conversion tags has almost nothing to misconfigure. It earns an A by having little surface area, not by being well-maintained.
We label containers with fewer than 10 active tags as "Emerging" rather than assigning a letter grade. With so few configuration decisions to evaluate, a score describes simplicity, not health.
What drives the degradation
The finding breakdown by container size reveals the mechanism:
| Size | Mean errors | Mean warnings | Mean optimizations | Mean total (all finding types) |
|---|---|---|---|---|
| 1-25 | 0.0 | 1.4 | 3.0 | 6.0 |
| 26-50 | 0.1 | 1.9 | 6.8 | 11.8 |
| 51-100 | 0.2 | 2.1 | 9.6 | 15.8 |
| 101+ | 0.2 | 2.1 | 12.8 | 20.1 |
Errors don't scale with size. Mean errors are 0.0 to 0.2 across all size brackets. A 200-tag container doesn't have more PII exposure, more dead code executing, or more security patterns than a 25-tag container. The genuinely dangerous findings (dataLayer PII pushes, eval usage, HTTP script loading) occur at roughly the same rate regardless of container complexity.
Warnings barely move. 1.4 to 2.1 across the range. Consent infrastructure gaps and missing configurations are present-or-absent findings that don't multiply with tag count.
Optimizations drive the difference. 3.0 for small containers, 12.8 for large. More ad platforms means more configuration to review. More Custom HTML means more vendor scripts to evaluate. More triggers means more potential for overloaded or duplicated firing logic. More GA4 events means more naming inconsistencies and Enhanced Measurement overlap.
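The contrast shows up directly in the bracket means. A minimal comparison of the small (1-25) and large (101+) rows from the table above, looking at how much each finding type contributes to the growth:

```python
# Where the extra findings come from as containers grow.
# Means copied from the finding-breakdown table (1-25 vs 101+ brackets).
small = {"errors": 0.0, "warnings": 1.4, "optimizations": 3.0}
large = {"errors": 0.2, "warnings": 2.1, "optimizations": 12.8}

delta = {kind: large[kind] - small[kind] for kind in small}
growth = sum(delta.values())

for kind, d in delta.items():
    print(f"{kind}: +{d:.1f} ({d / growth:.0%} of the growth)")
```

Optimization findings account for over 90% of the growth across these three types; errors and warnings are nearly flat.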
The practical implication: a large container scoring 69 (C) probably has the same error profile as a small container scoring 91 (A). The 22-point gap is almost entirely optimization findings, not danger. A 200-tag container at the C/B boundary isn't unsafe. It has configuration debt.
Which rules scale most with container size
The rules that fire dramatically more often in large containers reveal what complexity produces:
| Rule | Small (1-25) | Large (101+) | Gap |
|---|---|---|---|
| trigger-overloaded | 8% | 83% | +75 |
| ad-tag-proliferation | 1% | 68% | +66 |
| trigger-duplicate-custom-events | 8% | 69% | +61 |
| ad-id-hardcoded | 15% | 75% | +60 |
| stale-paused-tags | 35% | 92% | +57 |
| replaceable-custom-html | 23% | 74% | +51 |
| ga4-duplicate-event-names | 4% | 56% | +51 |
| ga4-event-name-casing | 23% | 72% | +49 |
trigger-overloaded goes from 8% to 83%. That's 10x. But a trigger firing 10+ tags isn't a misconfiguration in a container with 4 ad platforms. One trigger firing a GA4 event, a Google Ads conversion, a Meta Pixel event, and a LinkedIn tag is standard multi-platform architecture. The finding is real (the trigger is complex) but the interpretation depends on context.
ga4-event-name-casing at 72% in large containers reflects what happens when multiple people contribute to a container over time. One person uses camelCase, another uses snake_case, a third uses PascalCase. Each naming choice was reasonable in isolation. The inconsistency accumulates without anyone noticing because each event works regardless of casing.
These are complexity findings. They emerge as containers grow, not because anyone made a mistake.
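Inconsistencies like these are easy to detect mechanically. A rough sketch of the kind of check a rule such as ga4-event-name-casing might perform; the function name and regex patterns here are illustrative assumptions, not TagManifest's actual implementation:

```python
import re

# Illustrative casing classifier -- not TagManifest's actual rule logic.
CONVENTIONS = {
    "snake_case": re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$"),
    "camelCase": re.compile(r"^[a-z][a-z0-9]*([A-Z][a-z0-9]*)+$"),
    "PascalCase": re.compile(r"^([A-Z][a-z0-9]*){2,}$"),
    "lowercase": re.compile(r"^[a-z][a-z0-9]*$"),
}

def casing_conventions(event_names):
    """Return the set of naming conventions used across GA4 event names."""
    found = set()
    for name in event_names:
        for convention, pattern in CONVENTIONS.items():
            if pattern.match(name):
                found.add(convention)
                break
    return found

# Three contributors, four conventions -- a casing rule would flag this mix.
events = ["purchaseComplete", "form_submit", "DemoRequested", "signup"]
print(casing_conventions(events))
```

A container where `casing_conventions` returns more than one convention has exactly the accumulated inconsistency described above: each name is individually valid, and the mix only appears when you look at the full set.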
What the largest containers look like
86 containers have 200+ tags. Notable examples:
| Container | Tags | Score | Paused | Custom HTML |
|---|---|---|---|---|
| Squarespace | 508 | 70 (C) | 205 | 109 |
| Slack | 484 | 63 (C) | 301 | 65 |
| ADP | 315 | 76 (B) | 85 | 109 |
| Klaviyo | 240 | 79 (B) | 29 | 71 |
| Deloitte | 222 | 77 (B) | 19 | 1 |
| Snowflake | 208 | 75 (B) | 2 | 148 |
| HubSpot | 205 | 35 (F) | 4 | 68 |
Slack has 301 paused tags out of 484 total. 62% of the container is disabled-but-present weight. The active container is closer to 183 tags.
Deloitte (222 tags, 1 Custom HTML tag, score 77) shows what a well-maintained large container looks like: almost everything runs through native tag types with minimal Custom HTML. It reaches B grade despite its size.
Snowflake (208 tags, 148 Custom HTML, score 75) takes the opposite approach: heavy Custom HTML but still maintaining a B. This is possible with active consent management and consistent naming across the Custom HTML tags.
HubSpot's own container scores 35 (F), the lowest of any large container in the dataset. The marketing automation platform that powers a significant portion of the B2B web has one of the lowest-scoring GTM containers in the sample.
The "Emerging" containers
279 containers (14% of the dataset) have fewer than 10 tags. These average 4.3 tags and score 90.4.
| Metric | Value |
|---|---|
| Count | 279 |
| Mean tags | 4.3 |
| Mean score | 90.4 |
| Has GA4 | 66% |
| Has ad platforms | 34% |
| Has Custom HTML | 47% |
66% have GA4 but only 34% have ad platforms. These are likely early-stage implementations, single-purpose containers, or containers where the primary measurement happens through other channels (direct gtag.js, Segment, server-side). They're too small to grade meaningfully. A score of 91 on a container with 3 tags says "nothing is wrong with your 3 tags," which is true but not informative.
The entropy curve
The linear relationship has no cliff. No threshold where containers degrade suddenly. The curve is smooth:
- 1-10 tags: Median 91. Simple enough that almost nothing goes wrong.
- 26-50 tags: Median 76. The typical B2B SaaS container. Enough complexity for patterns to emerge, not enough to overwhelm.
- 101-200 tags: Median 71. Enterprise-scale containers. The floor is 60 (P10), the ceiling is 78 (P90). An 18-point range, all within C/B territory.
- 200+ tags: Median 69. The ceiling compresses to 75 (P90). Even well-maintained containers of this size can't escape the optimization finding accumulation.
A 200-tag container scoring 67 may be better maintained than a 10-tag container scoring 91. The smaller container has less surface area for patterns to emerge. The larger container has more things configured, more platforms integrated, more measurement running. Its lower score reflects the natural entropy of a complex system, not neglect.
The useful question isn't "what's my score?" but "given my container's size and complexity, where do I sit relative to similar containers?" A 200-tag container at the 75th percentile (score 71) is outperforming most containers its size. The same score on a 25-tag container (75th percentile: 84) would indicate significant issues.
Companion to the State of GTM in B2B SaaS. 1,990 containers analyzed. April 2026. Scoring: TagManifest health metric (not an industry standard).