Evaluation framework for B2B agencies
Good evaluation frameworks are pragmatic checklists, not vendor bingo cards. You want to know what they can actually do for you, fast.
Core capability checklist
Don’t accept vague claims. For each capability ask for a deliverable and the named person who will deliver it.
- Strategy: documented 90-day plan with hypotheses and KPIs.
- Demand generation: repeatable playbook for at least two channels.
- Content: buyer-stage mapping, sample pieces, distribution plan.
- Paid media: channel mix, audience lists, bid logic.
- SEO: prioritized technical backlog and content calendar.
- Martech: connectors, data flows, and ownership.
- Analytics: attribution model, raw event schema, dashboard.
Score each capability 1-5 and demand a concrete sample or artifact. If an agency can’t show a deliverable, they don’t have the capability.
Proofs of performance
Numbers without context are useless.
Ask for:
- Raw dashboards or exports showing baseline, test, and outcome.
- What the starting conversion rates were and how they shifted.
- CAC and lead quality at multiple funnel stages, not headline MQL counts.
- A breakdown of which tactics drove which results.
Red flags: anonymized case studies with no timelines, no conversions shown, or clients who refuse to let you speak to the actual campaign owner. Quick test: ask for the SQL conversion rate for a named campaign. If the answer is “it varies”, press for the export.
Tech stack and integrations
Most failures happen at the seams.
Require:
- An architecture diagram for your integration points: CRM, MA, CDP, analytics, website.
- A list of maintained connectors and whether they are first party or custom.
- How they handle identity resolution and event fidelity.
- Data residency and transfer flows, especially for regulated buyers.
Ask whether tracking is server-side or browser-side, how they version tracking, and whether they have a backfill plan for lost events. If they say “we just use the platform default”, expect surprises.
Team structure and roles
Headcount matters more than brand.
Typical composition for a mid-size account:
- 1 engagement lead (owns outcomes)
- 1 channel lead per major channel
- 1 data/analytics lead
- 1 content producer
- Design resources (shared or dedicated)
Insist on named roles and percent allocation. Part-time “shared” leads are fine for small pilots; for scaling, require dedicated channel owners with shadow coverage. If their pitch includes a rotating pool of juniors, push back.
Risk and compliance indicators
Compliance is operational, not legal theater.
Ask for:
- A signed Data Processing Agreement.
- Details on subcontractors and where work is done.
- Security posture: SOC or equivalent, how credentials are stored.
- Incident response commitments and notification timelines.
Immediate deal-breakers: no DPA, inability to name subprocessors, or refusal to support audits. Include a right-to-audit clause and a clear deletion certificate on termination.
Choosing by company stage
Different stages need different priorities. Stop treating agencies like a one-size-fits-all tool.
Early-stage priorities
You need speed and learning, not scale.
Priorities:
- Cheap, fast tests with clear hypothesis and kill criteria.
- Channel mixes that fit tight budgets.
- Deliverables that directly inform product-market fit: messaging tests, one-page landing validations, demo-to-meeting conversion tracking.
Contracting tip: short, timeboxed pilots with daily access to the team. Expect high iteration and messy creative. That’s OK. You want learnings, not a polished brochure.
Scale-up priorities
Now you need predictability and handoff to internal ops.
Focus on:
- Repeatable campaigns that can be automated.
- Clean tracking and attribution so revenue ops can own numbers.
- Playbooks and training to transfer knowledge.
Structure: hybrid retainers where a portion pays for delivery and a portion is earmarked for transition and training. Your success metrics should include immediate pipeline plus a measure of internal enablement, such as the number of playbooks transferred.
Enterprise priorities
Procurement and security will drive the bulk of the decision.
Expect:
- Detailed SLAs and legal negotiation.
- Integration with internal platforms and enterprise SSO.
- Multi-stakeholder governance and audit trails.
Don’t let procurement trade capability for a lower price. If an agency can’t operate inside your security requirements, it doesn’t matter how good their creative is.
RFP, pilots, and shortlisting
RFPs can kill good vendors. Make yours useful.
Must-ask RFP items
Skip marketing fluff. Ask for:
- Explicit deliverables and timelines.
- Named team and CVs.
- Pricing broken down by deliverable and by hour.
- Three client references that match your company stage.
- Security and data handling documentation.
- Example of a failed test and what was learned.
If a vendor refuses to show a failed test, they are selling a highlight reel.
Pilot scope and success metrics
A pilot should be a scientific test, not an extended trial.
Design it with:
- Hypotheses and stop/go criteria.
- Fixed timebox, typically 60 to 90 days.
- Minimum viable integrations completed in week one.
- Clear success thresholds: minimum lead volume, SQL rate, or cost per SQL.
Example pilot: run two channel tests for 90 days. Success = at least 30 qualified leads with SQL conversion above your historical baseline by X percent, plus handover docs and dashboard.
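The stop/go decision above is easy to encode so nobody relitigates it at the review. A minimal sketch: the 30-lead floor comes from the example, while the uplift percentage (the “X percent”) stays a parameter you set in the pilot agreement.

```python
def pilot_passes(qualified_leads: int, sql_rate: float,
                 baseline_sql_rate: float, uplift_pct: float,
                 handover_done: bool, dashboard_live: bool) -> bool:
    """Stop/go check for the example pilot above.

    uplift_pct is the "X percent" from your pilot agreement; every
    criterion must hold, including the non-numeric handover items.
    """
    return (
        qualified_leads >= 30
        and sql_rate >= baseline_sql_rate * (1 + uplift_pct / 100)
        and handover_done
        and dashboard_live
    )
```

For example, with a 0.20 historical SQL rate and a 10 percent agreed uplift, a pilot delivering 35 qualified leads at a 0.25 SQL rate passes; 25 leads at the same rate does not.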
Scoring matrix template
Simple, transparent, repeatable.
| Criterion | Weight |
|---|---:|
| Capability fit | 25 |
| Proofs of performance | 20 |
| Technical fit | 15 |
| Cultural / communication | 10 |
| Price | 15 |
| Risk & compliance | 15 |
Score each criterion 1-5, multiply by its weight, and divide the total by 100 (the sum of the weights) to get a weighted average on the same 1-5 scale. Passing threshold: weighted average above 3.6. Use this to shortlist two finalists.
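The matrix can be applied mechanically. A minimal sketch: the weights mirror the table above, and the example vendor scores are hypothetical.

```python
# Weighted scoring for agency shortlisting. Weights mirror the matrix
# above and sum to 100; scores are on the 1-5 scale.
WEIGHTS = {
    "capability_fit": 25,
    "proofs_of_performance": 20,
    "technical_fit": 15,
    "cultural_communication": 10,
    "price": 15,
    "risk_compliance": 15,
}

def weighted_score(scores: dict[str, int]) -> float:
    """Return the weighted average on the 1-5 scale."""
    total_weight = sum(WEIGHTS.values())  # 100
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS) / total_weight

def passes(scores: dict[str, int], threshold: float = 3.6) -> bool:
    return weighted_score(scores) > threshold

# Hypothetical vendor: strong on capability, weak on price.
vendor = {
    "capability_fit": 5,
    "proofs_of_performance": 4,
    "technical_fit": 4,
    "cultural_communication": 3,
    "price": 2,
    "risk_compliance": 4,
}
# weighted_score(vendor) == 3.85, so this vendor clears the 3.6 bar.
```

Keeping the weights in one place makes the trade-offs explicit: here a weak price score is offset by strong capability fit, which is exactly the discussion the matrix should force.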
Reference and work-sample checks
Go beyond “Would you hire them again?”
Ask references:
- Who on the agency did the work and what percent of time did they allocate?
- Did results match the dashboard exports?
- Why did the engagement end or continue?
- How was handover executed?
Ask for raw work samples: full analytics exports, unredacted creative tests, and the original brief. If they only give polished PDFs, push for the raw files.
Contracting, pricing, SLAs (Sweden)
Local practice matters. Don’t assume global templates fit.
Common pricing models
You will see:
- Monthly retainer for ongoing management.
- Project fees for discrete work.
- Performance fees tied to leads or revenue.
- Hybrid models that mix a base retainer and a bonus.
In Sweden, expect VAT on top of fees. Also expect procurement processes that prioritize clarity over flexibility. Put hard caps on monthly spend and clear change-order rules to avoid budget surprises.
Budget benchmark ranges
Expect wide variance, but rough ranges:
- Early-stage pilots: 10,000 to 50,000 EUR total.
- Small monthly retainer: 3,000 to 8,000 EUR per month.
- Scale-up retainers: 10,000 to 30,000 EUR per month.
- Enterprise engagements: 30,000 EUR per month and up, plus project fees.
These are directional. Always tie budgets to expected outputs, not hours.
Service-level and reporting clauses
What to put in the SOW:
- Data tracking uptime, for example 99.5 percent for key events.
- Lead delivery and notification times, e.g., leads posted to CRM within 30 minutes of capture.
- Reporting cadence and raw data access rights.
- Credits or termination rights for repeated SLA failures.
Don’t rely on “we’ll report weekly”. Insist on the exact fields, formats, and access method for data exports.
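The 30-minute lead-delivery clause is only enforceable if you can measure it. A minimal sketch, assuming your tracking gives you a capture timestamp and a CRM-posting timestamp per lead:

```python
from datetime import datetime, timedelta

SLA_WINDOW = timedelta(minutes=30)  # mirrors the lead-delivery clause above

def sla_breaches(leads):
    """Return indices of leads that missed the CRM-posting SLA.

    Each lead is a (captured_at, posted_to_crm_at) datetime pair; the
    timestamp source is an assumption about your tracking setup.
    """
    return [
        i for i, (captured, posted) in enumerate(leads)
        if posted - captured > SLA_WINDOW
    ]
```

Run this against the raw export each week; a breach list you compute yourself is harder to dispute than a vendor-built dashboard.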
IP, data, and termination terms
Negotiate:
- Ownership of creative: the agency should assign you full rights to paid creative and content after payment.
- Data portability: provide exports of all lead and performance data in machine readable form on termination.
- Transition support: minimum 30 days of support with named personnel, and a handover schedule.
- Reuse of learnings: agency may keep anonymized learnings, but not your proprietary playbooks.
Put code and scripts in escrow or require documentation for any bespoke integrations.
Onboarding and performance governance
Bad onboarding burns months of runway. Make the first 90 days count.
30/60/90-day launch plan
30 days:
- Kick-off, stakeholder map, and access list.
- Tracking map, basic dashboards, and first creative drafts.
- Two live micro-tests.
60 days:
- Scale successful micro-tests.
- Automated reporting and weekly cadence established.
- Playbook drafts for repeatable campaigns.
90 days:
- Handover of playbooks, training sessions, and performance review against hypothesis.
- Decision: scale, re-scope, or stop.
Make the plan a living checklist and assign owners.
Reporting cadence and dashboards
Match cadence to decision speed.
- Weekly: tactical metrics and anomalies; raw lead exports.
- Monthly: performance versus pipeline targets; deep-dive on spend efficiency.
- Quarterly: strategy review, budget adjustments, channel roadmap.
Dashboards must include raw data access. If you can’t run SQL against the dataset yourself, you don’t own the truth.
Optimization and testing rhythm
Testing is how you improve, not an appendix.
- Weekly micro-tests: subject lines, landing variations, audiences.
- Monthly experiments: creative concepts, funnel flows.
- Quarterly bets: new channels or major landing redesigns.
Define stop rules. For example: if a test spends X and conversion is below Y for Z days, stop and reallocate.
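A stop rule reads more cleanly as code than as prose, and encoding it removes mid-test arguments. A minimal sketch: X, Y, and Z stay parameters you set per channel.

```python
from dataclasses import dataclass

@dataclass
class StopRule:
    """Kill criteria for a running test; thresholds are placeholders."""
    max_spend: float        # X: spend ceiling in your currency
    min_conversion: float   # Y: minimum acceptable conversion rate
    grace_days: int         # Z: days below threshold before stopping

def should_stop(rule: StopRule, spend: float,
                daily_conversion: list[float]) -> bool:
    """Stop when spend has hit the cap AND the last Z days all sit below Y."""
    if spend < rule.max_spend:
        return False
    recent = daily_conversion[-rule.grace_days:]
    return len(recent) >= rule.grace_days and all(
        c < rule.min_conversion for c in recent
    )
```

For example, `StopRule(max_spend=5000, min_conversion=0.02, grace_days=7)` stops a test that has spent its budget with a week of sub-2-percent conversion, and leaves a still-converting test running.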
Stakeholder roles and escalation
Clear governance avoids meetings that go nowhere.
- Tactical owner: agency day-to-day contact.
- Business owner: internal sponsor accountable for results.
- Data steward: owns integrations and schema.
- Executive sponsor: resolves budget or strategic conflicts.
Escalation ladder: tactical lead > engagement director > your sponsor > contract executive. Embed explicit SLA breach triggers for escalation, including response windows and corrective actions.