Agentic Humanitarian Data Analyst — a small model that writes the plan before it runs the numbers

Community Article
Published June 14, 2026

Drop in a humanitarian survey, ask a question, and get back a reviewable data-analysis plan — which indicators this dataset can measure, which it can only proxy, and which it can't support at all — every verdict traced to a published standard, not the model's memory.

Live Space: build-small-hackathon/agentic-humanitarian-data-analyst The open-source skill: yannsay/humanitarian-data-analyst Demo: youtu.be/q2qjPJakLGk

The problem

Humanitarian analysis is a specialist domain. Ask for "food security" and you've invoked a specific, named set of indicators — the Food Consumption Score (FCS), the reduced Coping Strategies Index (rCSI), the Household Hunger Scale (HHS) — each with an exact definition, a required set of survey questions, and a documented list of ways people get them wrong. The expertise isn't vague; it's a catalog.

Hand that work to a general LLM and a standard indicator gets computed from questions that don't actually support it — producing a result that looks plausible but isn't an indicator the sector recognises, or one that doesn't exist. Not hypothetical: the test case for this project is a real rapid needs assessment that shipped four documented indicator errors — an rCSI from the wrong columns, a misread JMP water ladder, a misapplied Sphere threshold, and an FCS reported even though the survey had no dietary-recall question to build it from.

And every survey is built differently, so the same indicator maps to different questions on every new form. There's no fixed lookup — the mapping has to be redone each time. That's exactly the messy, judgement-heavy work you'd want an LLM for.

From skill to small model

This hackathon entry comes out of a longer project: a skill that brings two ideas from software engineering — a semantic layer (a governed catalog of indicator definitions) and spec-driven development (write a reviewable plan, get sign-off, then execute) — to humanitarian data analysis.

The conclusion that kept surfacing while building it: the errors aren't knowledge gaps, they're attention failures. A capable model already knows the methodology — but handing it the catalog as passive context changes nothing. It only works when the prompt forces the consult: read the definition and its known errors before writing a verdict.

That finding sets up the question this hackathon entry tests:

If the methodology is supplied in the prompt rather than recalled from weights, how small can the model be?

Once the definition is handed to the model, the task shrinks to a single per-indicator verdict — and that's a job a small model does reliably.

How it works

analyst question + Kobo survey
        │
   ROUTE   question → sectors        (LLM: translate fuzzy input)
        │
   SELECT  sectors → indicators      (deterministic script — same in, same out)
        │
   MAP     indicator → survey qs     (LLM: propose candidate variables, one at a time)
        │              → verdict: Measurable / Proxy / Not measurable
        │
   PLAN    code assembles the data-analysis plan = the spec
        │              ← HARD STOP: human analyst reviews & approves
        │
   ANALYSE  agent runs the analysis against the approved plan   (next; not in this demo)

The LLM runs at exactly two points — translating the question into sectors (ROUTE) and proposing which survey questions measure each indicator (MAP). Everything else is deterministic code over a governed catalog: selecting indicators, assembling the plan, rendering it. The model never decides which tool to run; the orchestrator does.

We stop the demo at PLAN — the reviewable data-analysis plan a human signs off on. The actual agent analysis (running numbers against the approved plan) comes next; it's out of scope here.

The semantic layer

Three governed layers, vendored from the skill repo:

  • Framework — an analytical ontology of 11 humanitarian sectors (derived from HumSet/DEEP). ROUTE resolves the question into it.
  • Indicators — ~41 indicators across WASH, Food Security, and CCCM, from authoritative sources (the JMP, the Global Food Security Cluster handbook, Sphere, CCCM standards). Each carries its definition, thresholds, common implementation errors, and what a key-informant survey can and can't assess.
  • Binding — the live MAP from this survey's questions to those indicators, with coverage gaps surfaced.

Every verdict in the plan points back to a published standard — the line between something an analyst can defend in a report and something they can't.

The model

Inference runs on Qwen/Qwen2.5-32B-Instruct — at the 32B cap — behind an OpenAI-compatible endpoint, served on Modal (vLLM, A100-80GB). Because the per-indicator task is so narrow — one definition, its candidate variables, one verdict — the prompt is small and the model can be too; that's the whole bet of the "small question" above.

Try it

⏳ Heads up: inference runs on Modal's on-demand GPUs, so the first run after the Space has been idle takes ~5 minutes to spin up. Runs after that are fast.


Built for the Hugging Face × Gradio Build Small hackathon — Backyard AI track. Part of a longer project on bringing a semantic layer and spec-driven development to AI analysis.

Community

Sign up or log in to comment

Free AI Image Generator No sign-up. Instant results. Open Now