We Built a Domain-Specific LLM for HR

Chih-Po Wen, CTO and Co-Founder, 7 min read

Discover why Wisq built HRLM, the first large language model for HR, and Hurdle, the industry’s first benchmark for HR-specific AI reasoning and compliance tasks.

There’s a reason HR leaders often describe their work as both art and science. Every day, they’re making decisions grounded in empathy and governed by regulation: balancing people, policy, and the business.

At Wisq, we believe AI can help HR leaders do their best work faster, with greater confidence and precision. But general-purpose large language models (LLMs) aren’t cutting it, particularly for the kinds of cases Harper, our AI HR generalist, encounters. Harper needs deep regulatory knowledge and a very specific set of reasoning capabilities.

That’s why we built HRLM: a domain-specific reasoning language model trained to handle the complexities of HR operations. Our goal was to build an LLM just for HR that could increase accuracy and speed while reducing cost.

And we didn’t stop there. We also created Hurdle, an industry-first benchmark designed to test and improve performance on real-world HR scenarios, especially ones that demand regulatory fluency and contextual judgment.

Why Now?

The timing was critical. Several shifts made this the right moment to build an HR-specific LLM:

  1. The problem is complex: HR decisions often require interpreting overlapping federal and state regulations while keeping in mind policy and precedent. It’s easy to miss a critical nuance or misjudge applicability. Existing LLMs weren’t reliable in these high-stakes scenarios.
  2. Generic models aren’t designed for HR: Off-the-shelf LLMs are powerful, but they’re trained mostly on math, code, and general web content. They’re not optimized for policy-heavy, compliance-sensitive domains like HR, and they hallucinate more than compliance work can tolerate.
  3. Leading reasoning models are expensive and slow: Models like OpenAI’s o-series reasoning LLMs can deliver solid results, but their cost and latency make them impractical, even prohibitive, for enterprise-scale HR operations.
  4. Open-source models have matured: The gap between open-weight and closed models has closed dramatically. According to Stanford’s 2025 AI Index Report, the performance difference on some benchmarks shrank from 8% to just 1.7% in a single year. With the right tuning, open models can now deliver near parity at a fraction of the cost.
  5. We’ve developed a way to train and test efficiently: We used distillation, a technique in which a larger, more capable model (the “teacher”) transfers its knowledge to a smaller, more efficient model (the “student”). A key component of our approach is a proprietary test-time compute algorithm, which optimizes how the model reasons in real time. Research shows this method can enable smaller models to outperform significantly larger ones on certain tasks. “On problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14× larger model,” wrote Google DeepMind researchers in the 2024 paper Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. Using Hurdle as our benchmark, we trained HRLM to reason like a top-tier model without massive data collection or a huge infrastructure bill. The s1: Simple test-time scaling paper reports the same pattern: “training on only 1,000 samples with next-token prediction and controlling thinking duration via a simple test-time technique we refer to as budget forcing leads to a strong reasoning model that scales in performance with more test-time compute.” It’s a strong signal that we’re on the right track: by pairing smart fine-tuning with our custom test-time compute algorithm, we get performance that stands up to much larger (and more expensive) models (see the sketch after this list).
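For readers curious about the mechanics, here is a minimal sketch of the two published techniques behind point 5: a standard teacher-student distillation loss, and a budget-forcing decode loop in the spirit of s1. It assumes PyTorch and a Hugging Face-style causal LM whose chat format wraps reasoning in <think>…</think> tags; it is an illustration of the general techniques, not Wisq’s proprietary training or inference code.

```python
# Illustrative only: vanilla knowledge distillation and a "budget forcing"
# decode loop (after s1: Simple test-time scaling). Not Wisq's actual code.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend the teacher's soft targets with the usual hard-label loss."""
    # Soft targets: the student learns to match the teacher's distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: ordinary next-token cross-entropy on the ground truth.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft + (1 - alpha) * hard

@torch.no_grad()
def generate_with_budget(model, tok, prompt, think_budget=512, max_new=1024):
    """Cap the model's 'thinking' tokens, forcing an answer at the budget."""
    ids = tok(prompt, return_tensors="pt").input_ids
    end_think = tok.convert_tokens_to_ids("</think>")  # assumed delimiter
    thinking, spent = True, 0
    for _ in range(max_new):
        next_id = model(ids).logits[:, -1, :].argmax(-1, keepdim=True)  # greedy
        if thinking:
            spent += 1
            if spent >= think_budget:          # budget exhausted:
                next_id = torch.tensor([[end_think]])  # force end of thinking
            if next_id.item() == end_think:
                thinking = False
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)
```

Note that the full s1 recipe also extends thinking by appending a token like “Wait” when the model tries to stop early; the sketch shows only the capping side, and the </think> delimiter is an assumption about the tokenizer.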

Introducing Hurdle: Our AI Benchmark for HR

Benchmarks are how we measure and improve model performance, and they provide the foundation for verifiable tasks (problems with clearly correct answers), which are essential for training a reasoning model to perform well in a specific domain. But existing “reasoning” benchmarks focus mostly on math and coding, and they don’t reflect HR’s real-world challenges. So we built Hurdle, a regulatory proficiency benchmark rooted in HR-specific tasks like leave eligibility, workplace accommodations, and policy interpretation.

Hurdle 1.0 focuses on HR issues related to federal and state regulations. Each question in Hurdle mirrors the format of certification exams like those from SHRM or HRCI, with scenarios that test the model’s ability to apply relevant laws and policies.
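To make that format concrete, here is a hypothetical item in the same style, written in Python to match the sketches elsewhere in this post. The scenario, field names, and answer key are illustrative only, not actual Hurdle content.

```python
# A hypothetical Hurdle-style item, illustrative only. One verifiably
# correct choice per scenario is what makes the task automatically checkable.
hurdle_item = {
    "id": "leave-0042",
    "jurisdiction": ["US-federal"],
    "scenario": (
        "An employee at a 45-person company, employed for 14 months with "
        "1,300 hours worked in the past 12 months, requests leave to care "
        "for a spouse recovering from surgery."
    ),
    "question": "Is the employee eligible for FMLA leave?",
    "choices": {
        "A": "Yes; the tenure and hours requirements are both met.",
        "B": "No; the employer is below the 50-employee coverage threshold.",
        "C": "No; the employee has not worked the required 1,250 hours.",
        "D": "Yes; hours worked alone determine eligibility.",
    },
    "answer": "B",
}
```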

On Hurdle, HRLM outperforms many general-purpose models, delivering accuracy at parity with leading reasoning LLMs at 1/60th the cost.

We evaluated each major LLM using our benchmark. To generate trustworthy results, our benchmark test suite was designed to meet key criteria, including validity, complexity, and consistency. We found that HRLM matches the performance of models that are over 100 times larger, proving that with the right training data and test-time strategies, you don’t need massive infrastructure to achieve state-of-the-art results.
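As a rough illustration of how a suite like this can be graded consistently, here is a minimal scoring loop over items in the format above. The `ask_model` function is a placeholder for any LLM call (API or local), and the regex grading is a simplification of a real harness; exact-match grading works because each item has exactly one verifiably correct answer.

```python
# A minimal sketch of scoring a model on multiple-choice items like the one
# above. `ask_model` is a placeholder, not a real client library.
import re

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in an API or local-model call here")

def score(items: list[dict]) -> float:
    correct = 0
    for item in items:
        choices = "\n".join(f"{k}. {v}" for k, v in item["choices"].items())
        prompt = (
            f"{item['scenario']}\n\n{item['question']}\n{choices}\n"
            "Answer with a single letter."
        )
        reply = ask_model(prompt)
        match = re.search(r"\b([A-D])\b", reply)  # pull out the letter choice
        correct += bool(match and match.group(1) == item["answer"])
    return correct / len(items)  # accuracy in [0, 1]
```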

Accuracy Where It Counts

One natural question: how much accuracy do we give up compared to the premium models? The answer: very little. The top commercial models score within the margin of error of HRLM’s performance, but at roughly 60 times the cost they are impractical for enterprise-scale HR work. In practice the trade-off clearly favors HRLM: for most HR workflows, it delivers the precision needed without the price tag or the latency.

We are constantly improving HRLM. Our internal testing shows it is on par with the leading reasoning models from OpenAI, Anthropic, and others, and in many cases better suited to HR-specific reasoning, thanks to our tailored training data and test-time inference algorithms that manage how the model “thinks” so it reaches the right answer efficiently.

Built for HR, Built for the Future

Creating HRLM reflects a commitment to the future of HR, one where AI agents are fast, accurate, and grounded. We know it matters to HR leaders that their technology understands the nuance and regulations they navigate every day. That’s why HRLM is domain-aware and deeply trustworthy, so HR leaders can rely on it for the moments that matter.

Our patent-pending approach combines fine-tuning with advanced control over how the model processes and outputs answers. It’s more than just training; it’s orchestration. And it’s what allows us to deliver intelligent, cost-effective results across high-volume, high-risk HR tasks.

This is the beginning of the Agentic Era in HR, and Wisq’s HRLM is at the center of it.

If this kind of work excites you, we are growing our team. Check out our Platform Engineering role.

Sources: 

The 2025 AI Index Report

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

s1: Simple test-time scaling
