Domain-specific AI infrastructure / Harvey analysis

Jun 26, 2025

Harvey, the legal AI company, just closed $300M at a $5B valuation - up from $3B only four months ago. Yes, Harvey is the poster child for applied AI, but those numbers are almost incomprehensible for a 3 y/o app-layer startup.

After the Scale AI x Meta announcement last week, I’ve gone from rabbit hole to rabbit hole. I knew data labeling mattered to the foundation model companies, but frankly I had no concept of the scale of, or the dependency created by, these data specialists. Imagine holding OpenAI in a chokehold with five years of experience and no outside capital.

Through some chain-of-thought reasoning of my own (funny, right?) and a pile of open questions, those developments led me to a full-scale Harvey analysis. Harvey is a legal AI platform, not an infrastructure company. However, Harvey’s success has revealed many adjacent opportunities in domain-specific applied AI.

Quick note - all observations are framed around my expertise as a domain-specific consultant.

The process data goldmine nobody's mining

The most valuable data for case-based AI doesn't exist in any usable form yet - it's informal, poorly documented, and intuitive. It separates good teams from great ones; it justifies high bill rates and resume credibility; and it exists primarily in the brains of true subject matter experts (SMEs).

Consider your last major engagement. How did it start vs. how did it end? What path did you take to reach your final observations? What analysis did you perform that you didn’t anticipate? I'm as critical of consultants as anyone - there’s something satisfying about scrutinizing the unreliability of ‘actionable insights’. Yet, consultancies keep growing and consultants keep getting hired. Why? Just watch a McKinsey onboarding video. Consultants ask the right questions.

There's a massive gap between written procedures and how experienced consultants and SMEs actually operate within engagements. Judgment calls, unexpected data sources, the ability to dive deep or take a step back - this is what makes a project successful. While companies like Scale AI and Surge AI continue to make headlines, they will never be able to provide what matters most: domain-specific process data.
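
To make "process data" less abstract, here's a minimal sketch of what capturing it could look like. Everything below - the ProcessEvent structure, the event kinds, the field names - is my own assumption about what such a record might contain, not anything Harvey or the data vendors actually ship.

```python
# A minimal sketch of what "process data" capture could look like.
# All names (ProcessEvent, EventKind, field names) are hypothetical; the point
# is that the record captures the judgment call, not just the final artifact.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class EventKind(Enum):
    SCOPING_QUESTION = "scoping_question"     # what the SME asked before touching data
    UNEXPECTED_SOURCE = "unexpected_source"   # a source the written procedure never mentions
    JUDGMENT_CALL = "judgment_call"           # dive deeper vs. step back, and why
    COURSE_CORRECTION = "course_correction"   # the plan changed mid-engagement


@dataclass
class ProcessEvent:
    engagement_id: str
    kind: EventKind
    description: str                # the informal, intuitive part, written down for once
    rationale: str                  # why the SME made this call
    artifacts: list[str] = field(default_factory=list)  # docs, queries, models touched
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# The kind of moment that never makes it into the final deliverable:
event = ProcessEvent(
    engagement_id="ENG-2025-014",
    kind=EventKind.UNEXPECTED_SOURCE,
    description="Pulled vendor master data after AP anomalies didn't explain the variance.",
    rationale="Invoice-level analysis alone couldn't distinguish error from intent.",
    artifacts=["ap_anomaly_query.sql", "vendor_master_extract.csv"],
)
```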

Evaluation infrastructure as a competitive moat

Academic benchmarks are useless for real project work. Harvey found this out the hard way and had to build its own eval system, BigLaw Bench. Turns out it paid off big-time - the workflows Harvey’s evals measure don’t just produce higher-quality results, they produce them at 10x the speed.

In financial crimes, current AI vendors optimize for generic fraud-detection metrics while missing investigation-specific quality measures. False-positive rates above 95% overwhelm compliance teams because evaluation frameworks don't match real investigation complexity. Everything relies on post-hoc metrics, when multi-step workflows - evidence gathering → framing → analysis → reporting → recycle - need component-level evaluation. This will only get more complex as the industry shifts to multi-agent frameworks and ‘human-in-the-loop’ evaluation systems.
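
As a rough illustration of what component-level evaluation could mean in practice - as opposed to one post-hoc score on the final report - here's a minimal sketch. The stage names mirror the workflow above; the evaluator functions and their scoring logic are placeholders I invented, not any vendor's framework.

```python
# Sketch: score each stage of an investigation workflow separately, instead of
# one post-hoc metric on the final report. All evaluators are placeholders.
from typing import Callable

StageEval = Callable[[dict], float]  # takes a stage's output, returns a 0.0-1.0 score


def eval_evidence_gathering(output: dict) -> float:
    # e.g., coverage of required sources against a curated SME checklist
    required = set(output.get("required_sources", []))
    found = set(output.get("sources_pulled", []))
    return len(required & found) / len(required) if required else 0.0


def eval_reporting(output: dict) -> float:
    # e.g., every claim in the report traces back to gathered evidence
    claims = output.get("claims", [])
    supported = [c for c in claims if c.get("evidence_ids")]
    return len(supported) / len(claims) if claims else 0.0


STAGE_EVALS: dict[str, StageEval] = {
    "evidence_gathering": eval_evidence_gathering,
    "reporting": eval_reporting,
    # "framing", "analysis", and "recycle" would get their own domain-specific evals
}


def evaluate_workflow(stage_outputs: dict[str, dict]) -> dict[str, float]:
    """Return a per-stage score so failures are localized, not averaged away."""
    return {stage: STAGE_EVALS[stage](out)
            for stage, out in stage_outputs.items() if stage in STAGE_EVALS}
```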

Most teams think they need better AI. What they actually need is better ways to measure whether their AI is working. General eval solutions won’t cut it for industries that heavily rely on SMEs. While competitors build better models and more advanced frameworks, domain-specific evaluation infrastructure becomes the sustainable differentiator.

Operational visibility is the missing layer

No black boxes - operational visibility will separate useful from useless:

(1) Project teams need real-time visibility into workflows, decision paths, and model behavior.

(2) Project decisions need documented reasoning chains for legal defensibility.

(3) Audit trails must be purpose-built for both AI-native systems and human reviewers.

Real-time visibility changes how teams operate. Quality control is a substantial part of any case-based work, and agent-to-agent communication only amplifies this. Applied AI platform users need to understand the decisions being made, the prompts leveraged, the assumptions relied upon, and the code executed. Every step needs to be easily located, reproducible, and adjustable. The fact is: LLMs are probabilistic models - errors cascade by nature and need to be continuously monitored.
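
For illustration, here's a minimal sketch of a step-level trace that would make each decision locatable and reproducible. The StepTrace fields and the example values are assumptions on my part, not any platform's actual schema.

```python
# Sketch: a step-level trace that makes each decision locatable and reproducible.
# Field names are assumptions; the idea is that prompts, assumptions, and executed
# code become first-class records instead of being buried in application logs.
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass
class StepTrace:
    run_id: str
    step: str                 # e.g. "entity_resolution"
    prompt: str               # the exact prompt sent to the model
    model: str                # model and version actually used
    assumptions: list[str]    # what the step took as given
    code_executed: str        # the code or query the agent ran, verbatim
    output_summary: str

    def fingerprint(self) -> str:
        # stable hash so a step can be located and re-run with identical inputs
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]


# Hypothetical example values:
trace = StepTrace(
    run_id="run-0142",
    step="entity_resolution",
    prompt="Match counterparties across the wire and ACH extracts...",
    model="some-llm-v1",  # placeholder model identifier
    assumptions=["wire extract is complete through Q1"],
    code_executed="SELECT ... FROM wires JOIN ach ON ...",
    output_summary="87 matched entities, 12 flagged for review",
)
print(trace.fingerprint())
```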

Legal defensibility is self-explanatory.

Purpose-built audit infrastructure handles AI complexity. Most applied AI platforms have some version of this - Harvey dedicates roughly 10% of its organization to the security and compliance side alone. However, multi-step investigations, for example, need audit trails that capture both deterministic and probabilistic decision elements, and generic application logs simply fail to capture these nuances. If a probabilistic model makes a decision inside a deterministic process, how do we verify its accuracy? Not only do audit trails need to support ‘human-in-the-loop’ evaluation and decisions, they also need to be AI-native for agent-to-agent communication and adjustments.
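
As a sketch of what 'purpose-built' could mean, the audit entry below distinguishes deterministic from probabilistic decisions and routes low-confidence probabilistic ones to a human reviewer. The schema, enum values, and threshold are my own assumptions, purely illustrative.

```python
# Sketch: an audit entry that distinguishes deterministic from probabilistic
# decisions and records how each gets verified. Illustrative only.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DecisionSource(Enum):
    DETERMINISTIC_RULE = "deterministic_rule"     # threshold, lookup, hard-coded logic
    PROBABILISTIC_MODEL = "probabilistic_model"   # LLM or statistical model output
    HUMAN_OVERRIDE = "human_override"             # an SME changed the outcome


@dataclass
class AuditEntry:
    decision_id: str
    source: DecisionSource
    inputs_ref: str                  # pointer to the exact inputs used
    outcome: str
    confidence: Optional[float]      # populated only for probabilistic decisions
    verified_by: Optional[str]       # human reviewer or downstream deterministic check
    agent_chain: list[str]           # which agents touched this decision, in order


def needs_human_review(entry: AuditEntry, threshold: float = 0.9) -> bool:
    """Probabilistic decisions feeding deterministic processes get routed to a
    reviewer when confidence is missing or below the threshold."""
    if entry.source is not DecisionSource.PROBABILISTIC_MODEL:
        return False
    return entry.confidence is None or entry.confidence < threshold
```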

Don’t replace every system

The value isn't in replacing every system - it's in making AI integrate seamlessly with existing workflows. Harvey took this approach early on, but now clearly wants to become the one-stop shop for case management, document storage, document review, analysis, workflows, reporting, and more. I think this is the wrong approach - Harvey may have enough momentum to pull it off, but it isn't a sustainable model.

Legacy industries love what they know, and they don’t change quickly. Hell, even FORTRAN is still used (don’t come at me, I know the performance arguments). Convincing eDiscovery professionals to stop using Relativity or technical analysts to stop using their favorite SQL server is a fool’s errand.

Compound value

Process data capture feeds evaluation systems, which generate audit trails. Sounds simple enough.
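
A toy sketch of that loop, with every function a stand-in: captured process data grounds the evals, and each evaluation becomes another record in the audit trail.

```python
# Sketch of the compounding loop: captured process data drives evaluation,
# evaluation results feed the audit trail, and the audit trail surfaces new
# process data worth capturing. Every function here is a stand-in.
def capture_process_data(engagement: dict) -> list[dict]:
    return engagement.get("events", [])             # SME judgment calls, sources, corrections


def evaluate_against_process_data(workflow_output: dict, events: list[dict]) -> dict:
    # domain-specific evals grounded in how experts actually worked the case
    return {"stage_scores": {}, "events_used": len(events)}


def append_to_audit_trail(evaluation: dict, trail: list[dict]) -> list[dict]:
    trail.append(evaluation)                        # each eval becomes a defensible record
    return trail


audit_trail: list[dict] = []
engagement = {"events": [{"kind": "judgment_call"}]}
events = capture_process_data(engagement)
evaluation = evaluate_against_process_data({"report": "..."}, events)
audit_trail = append_to_audit_trail(evaluation, audit_trail)
```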

AI adoption is accelerating faster than infrastructure development. Regulatory pressure and quality obsessiveness increase the need for transparent systems. And case-based teams have the budget for risk-reduction infrastructure, not just pure efficiency tools.

While current vendors compete on AI capabilities, domain-specific AI infrastructure will separate the winners from the losers. Everyone is looking for a competitive advantage - consultancies need to leverage internal tooling and proprietary solutions to justify their rates. General observability and eval platforms exist, but without domain specificity, they lack the context to surface useful metrics or catch detrimental decisions.

A friend asked me a question the other day that stuck with me: assuming most consultancies have internal AI dev teams focused on industry-specific solutions, what can you build that improves their solutions?

While everyone builds better applied AI, the real opportunity is building the domain-specific infrastructure that makes AI tools actually valuable.