GPU-native compiler infrastructure that makes AI inference orders of magnitude faster. No code changes. No compromises.
Trusted by teams in AI infrastructure, autonomous systems, and frontier model development
Traditional compilers weren't designed for the irregular, dynamic computation graphs of modern AI. We built one from the ground up—a compiler that reasons about tensor shapes, memory hierarchies, and GPU microarchitecture at compile time.
import ahri

# Load any model, any framework
model = load_model("my-model-70b")

# One call. GPU-native compiled output.
model = ahri.compile(model, target="cuda")

# Orders of magnitude faster. Zero code changes.
output = model.generate(prompt, max_tokens=4096)
One function call replaces months of kernel engineering. Ahri ingests your model directly, applies 200+ optimization passes, and emits machine code tuned to your exact GPU.
Fuses operations across attention, MLP, and normalization layers at compile time. Eliminates the memory round-trips that dominate GPU latency.
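To see why fusion matters, here is a minimal sketch, in plain Python rather than GPU code, of the idea: an unfused pipeline materializes an intermediate result after every op (each one a memory round-trip on a GPU), while a fused version does all the work in a single pass. The function names are illustrative, not Ahri's API.

```python
def unfused(x, w, b):
    # Each step writes a full intermediate back to memory.
    t1 = [xi * w for xi in x]            # round-trip 1: multiply
    t2 = [ti + b for ti in t1]           # round-trip 2: add bias
    return [max(ti, 0.0) for ti in t2]   # round-trip 3: ReLU

def fused(x, w, b):
    # One pass: multiply, add, and ReLU per element, no intermediates.
    return [max(xi * w + b, 0.0) for xi in x]
```

Both produce identical results; the fused form simply never touches memory between ops, which is exactly the traffic that dominates GPU latency.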
Profiles SM occupancy, memory hierarchy, and tensor core availability. Generates execution schedules that saturate every compute unit on your specific GPU.
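The core of occupancy reasoning can be sketched with a simplified calculation (the limits below are representative of recent NVIDIA SMs but vary by architecture; this is an illustration, not Ahri's internal model): how many thread blocks fit on one SM given its thread, register, and block limits.

```python
def blocks_per_sm(threads_per_block, regs_per_thread,
                  max_threads=2048, max_regs=65536, max_blocks=32):
    # Occupancy is capped by whichever resource runs out first.
    by_threads = max_threads // threads_per_block
    by_regs = max_regs // (threads_per_block * regs_per_thread)
    return min(by_threads, by_regs, max_blocks)
```

A schedule that picks block sizes and register budgets to maximize this number keeps every compute unit busy; a register-hungry kernel can halve occupancy even when thread counts look fine.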
Context-sensitive precision scaling that adapts per-layer, per-head, per-token. Preserves quality with provable bounds—not a static INT8 hammer.
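The decision behind adaptive precision can be sketched as follows (the format error magnitudes and the `choose_precision` helper are illustrative assumptions, not Ahri's actual bounds): pick the cheapest format whose worst-case error, scaled by a layer's sensitivity, still fits inside the quality budget.

```python
# Illustrative per-format relative error scales (not real Ahri constants).
FORMAT_ERROR = {"int8": 3.9e-3, "fp16": 4.9e-4, "fp32": 6.0e-8}

def choose_precision(sensitivity, bound=1e-3):
    # Cheapest format first; keep the first whose propagated
    # error stays inside the quality bound.
    for fmt in ("int8", "fp16", "fp32"):
        if sensitivity * FORMAT_ERROR[fmt] <= bound:
            return fmt
    return "fp32"
```

Running this per layer (or per head, per token) is what distinguishes context-sensitive scaling from applying one static format everywhere.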
Automatic sharding and pipeline parallelism. The compiler reasons about inter-GPU topology and minimizes PCIe/NVLink traffic at the IR level.
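A toy version of pipeline partitioning, assuming a simple greedy cut at equal-work boundaries (real topology-aware sharding weighs NVLink/PCIe costs too; this helper is purely illustrative):

```python
def pipeline_split(layer_params, n_gpus):
    """Assign contiguous layers to GPUs, cutting near equal-work points."""
    target = sum(layer_params) / n_gpus
    stages, current, acc = [], [], 0.0
    for i, p in enumerate(layer_params):
        current.append(i)
        acc += p
        # Cut a stage once it carries its share of the work.
        if acc >= target and len(stages) < n_gpus - 1:
            stages.append(current)
            current, acc = [], 0.0
    stages.append(current)
    return stages
```

Keeping stages contiguous means only activations at stage boundaries cross GPUs, which is the traffic a topology-aware compiler then tries to minimize.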
Internal benchmarks across model architectures. Full methodology available under NDA.
Point Ahri at any model from any major framework. 200+ optimization passes. GPU-native machine code out.
Cycle-accurate visibility into kernel execution, memory bandwidth, and compute bottlenecks. Down to individual tensor ops.
Self-contained binaries. Zero runtime dependencies. Native runtime, REST API, or gRPC service with automatic batching.
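Automatic batching in a serving layer typically works like this sketch (illustrative only, not Ahri's runtime): collect requests until a batch fills or a short deadline passes, then execute them together.

```python
import queue
import time

def collect_batch(q, max_batch=8, timeout_s=0.01):
    # Block for the first request, then greedily fill the batch
    # until it is full or the latency deadline expires.
    batch = [q.get()]
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

The `timeout_s` knob trades tail latency for throughput: a longer window packs larger batches onto the GPU at the cost of a slightly later first token.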
Multi-GPU orchestration, auto-scaling, intelligent routing. Pay per token. We handle the infrastructure.
Our team has published extensively at top systems and ML venues. We don't just use the state of the art—we advance it. Select papers from our research agenda:
Compiler engineers, GPU architects, and ML researchers with decades of combined experience at leading chip companies, hyperscalers, and research labs.
10+ years leading GPU compiler teams at a top-3 chip company. Deep expertise in CUDA-level optimization and instruction scheduling.
Open-source compiler infrastructure contributor. Previously architected ML compiler backends serving hundreds of millions of users.
Scaled inference infrastructure from zero to billions of daily requests at two of the largest AI labs. Distributed systems specialist.
Quantization and numerical methods researcher. 30+ publications at top-tier systems and ML conferences.
We're onboarding design partners for private beta. Drop your email and we'll be in touch within 24 hours.
No credit card required · Enterprise-grade security