Why Foundation Models Need ASL Training Data
Every major foundation model lab is working on multimodal AI. Text, vision, audio — these modalities are increasingly well-covered. But there is a gap that none of them have addressed, and it is not a small one.
Sign language.
Not as an edge case. Not as an accessibility add-on. As a primary human language modality used by at least 70 million people worldwide — and one with structural properties that no existing training dataset captures correctly.
This is not a niche problem. It is the next fundamental gap in multimodal AI.
What "multimodal" actually means
When researchers talk about multimodal foundation models, they typically mean models that can handle text and images — or text, images, and audio. GPT-4o, Gemini, Claude. These are genuinely impressive achievements.
But "multimodal" means more than that. It means covering the full range of human communication. And the full range of human communication includes languages that are not spoken or written. It includes sign languages — visual-spatial languages with their own grammar, morphology, and phonology, expressed through movement in three-dimensional space.
A model that cannot process sign language is not fully multimodal. It has covered the modalities that were easy to collect data for. Sign language was harder. So it was left for later.
Later is now.
Why sign language is not a vision problem
The first instinct of most AI teams encountering sign language is to treat it as a video understanding problem. The model already handles video — why not just fine-tune it on signing footage?
This intuition is wrong, and it fails in production every time it is tried.
Sign language is not gesture. It is not motion. It is not a visual encoding of spoken language. It is a complete, independent linguistic system with its own grammar that operates in a fundamentally different way from any language these models were trained on.
Consider what a single sign actually encodes:
Handshape is a phonological parameter, comparable to the consonants and vowels that build spoken words. Stokoe's original analysis of ASL identified 19 handshapes, and later inventories list more. Substituting one handshape for another can change the meaning of a sign completely, the same way changing a consonant changes a spoken word.
Location relative to the body is just as contrastive. The same handshape and movement made in front of the chest versus at the temple are different signs with different meanings.
Movement encodes both lexical content and grammatical structure. The path, direction, and quality of movement are all linguistically significant. Speed and repetition change aspect — whether something happened once, repeatedly, or is ongoing.
Non-manual markers (facial expressions, eyebrow position, mouth shape, head tilt, gaze direction) are not emotional tone layered on top of the message. They are grammatical markers. Raised eyebrows mark a yes/no question in ASL; furrowed brows mark a wh-question. Puffed cheeks indicate size or effort. These are obligatory parts of the grammar. Without them, the utterance is incomplete or ungrammatical.
Spatial grammar means that signers establish referents in the space around them and then use that space to encode grammatical relationships between those referents — who did what to whom, temporal relationships, conditionals. This spatial syntax has no equivalent in text or spoken language.
A video model trained on general video data has learned none of this. It has learned what movements look like. It has not learned what movements mean. The gap between those two things is the entire field of sign language linguistics.
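To make the parameters above concrete, here is a minimal sketch of how one sign's form could be recorded and compared. The class name, field names, and values are hypothetical and chosen only for illustration. Palm orientation, which the list above does not cover, is included as an assumed extra field because it is commonly treated as a further parameter.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SignForm:
    """Hypothetical record of the formational parameters of one sign."""
    gloss: str                         # English label, for human reference only
    handshape: str
    location: str
    movement: str
    orientation: str = "unspecified"   # palm orientation (assumed extra field)
    non_manual: tuple[str, ...] = ()   # e.g. ("brows-raised",)


def differing_parameters(a: SignForm, b: SignForm) -> list[str]:
    """Names of the formational parameters on which two signs differ."""
    return [
        f.name
        for f in fields(SignForm)
        if f.name != "gloss" and getattr(a, f.name) != getattr(b, f.name)
    ]


def is_minimal_pair(a: SignForm, b: SignForm) -> bool:
    """True when exactly one parameter separates the two signs."""
    return len(differing_parameters(a, b)) == 1


# Placeholder values, not transcriptions of real ASL signs.
sign_a = SignForm("SIGN-A", handshape="flat", location="chest", movement="tap")
sign_b = SignForm("SIGN-B", handshape="flat", location="temple", movement="tap")

print(differing_parameters(sign_a, sign_b))  # ['location']
print(is_minimal_pair(sign_a, sign_b))       # True
```

The two placeholder signs share a handshape and a movement and differ only in location, which is the chest-versus-temple contrast described above. A model that only tracks what the hands look like has no slot for that distinction.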
Why existing datasets do not work
The datasets that exist for sign language AI are, with almost no exceptions, inadequate for foundation model training. This is not a controversial claim among researchers in the field — it is the consensus.
Scale is insufficient. The largest publicly available sign language datasets contain tens of thousands of clips. ImageNet, for comparison, has more than 14 million images. Common Crawl contains petabytes of web data. The data infrastructure for sign language does not exist at the scale that makes foundation model training viable.
Annotation quality is inconsistent. Most existing datasets were labeled by hearing researchers using English glosses: written approximations of signs that flatten the spatial grammar, drop non-manual markers, and miss regional variation. Models trained on these annotations learn a degraded version of the language. They perform well in controlled demos. They collapse in production.
Signer diversity is too narrow. Models need data from signers across age, region, dialect, and skill level. Most existing datasets were collected in university labs with a small number of signers. The variation in natural, community signing is not represented.
Linguistic depth is missing. Raw video annotated with English translations is not a sign language dataset in any useful sense. It is a translation dataset. For a foundation model to understand sign language — not just translate it — it needs access to the phonological, morphological, and syntactic structure of the language. That requires annotation by trained sign language linguists.
What ASL training data actually requires
Building ASL training data that can support foundation model training is not a video collection problem. It is a linguistics infrastructure problem.
The data needs to capture all five parameters of sign formation simultaneously: handshape, location, movement, palm orientation, and non-manual markers. It needs non-manual markers annotated as grammatical features, not discarded as noise. It needs spatial grammar documented in a way that lets a model learn the productive rules of the system, not just pattern-match on examples it has seen before.
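One way to picture "simultaneously" is as time-aligned annotation tiers, where manual signs and non-manual markers each get their own track. The sketch below is an assumption about how such tiers might be represented, not a description of any existing corpus format. The tier names, labels, and timings are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One labeled stretch of time on one annotation tier (hypothetical)."""
    tier: str       # e.g. "manual", "brow", "mouth"
    label: str
    start_ms: int
    end_ms: int


def co_occurring(target: Span, spans: list[Span]) -> list[Span]:
    """Spans on other tiers that overlap the target span in time."""
    return [
        s for s in spans
        if s.tier != target.tier
        and s.start_ms < target.end_ms
        and target.start_ms < s.end_ms
    ]


# Invented example: one manual sign produced under a raised-brow marker.
manual = Span("manual", "SIGN-A", 0, 600)
spans = [
    manual,
    Span("brow", "brows-raised", 0, 1500),
    Span("mouth", "puffed-cheeks", 2000, 2500),
]

print([s.label for s in co_occurring(manual, spans)])  # ['brows-raised']
```

Keeping the brow tier as its own track means the question marker stays attached to the signs it scopes over, instead of being dropped when the clip is reduced to a gloss.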
It needs provenance. Every clip needs a record of who signed it, in what dialect, in what register, and in what context. Signers need to have consented to their data being used for AI training. The Deaf community, which is the source of this linguistic wealth, needs to be a named stakeholder in how that data is used, not a pool of unlabeled subjects.
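As a rough sketch of what that could mean in practice, a provenance record can travel with every clip, and consent can be checked before a clip enters a training set. The field names and the "ai-training" consent label below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipProvenance:
    """Hypothetical provenance record attached to one clip."""
    clip_id: str
    signer_id: str                  # pseudonymous ID, not a name
    dialect: str
    register: str
    context: str
    consented_uses: frozenset[str]  # uses the signer explicitly agreed to


def usable_for(clips: list[ClipProvenance], use: str) -> list[ClipProvenance]:
    """Keep only clips whose signer consented to the given use."""
    return [c for c in clips if use in c.consented_uses]


# Invented records for illustration.
clips = [
    ClipProvenance("clip-001", "signer-01", "dialect-x", "casual",
                   "conversation", frozenset({"research", "ai-training"})),
    ClipProvenance("clip-002", "signer-02", "dialect-y", "formal",
                   "lecture", frozenset({"research"})),
]

print([c.clip_id for c in usable_for(clips, "ai-training")])  # ['clip-001']
```

The design choice here is that consent is an allow-list per clip. A clip with no recorded consent for a use is excluded by default.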
It needs versioning. Sign language evolves. New signs emerge. Regional variation shifts. A dataset built once and never updated is a snapshot of a living language at a single moment. Foundation models need data infrastructure that can grow with the language it represents.
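A minimal sketch of versioned releases, assuming each release is a mapping from a clip ID to a hash of that clip's annotations. The release contents and hash strings are invented, and a real system would need far more than this.

```python
def diff_releases(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two releases, each mapping clip ID to an annotation hash."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Invented releases: one clip re-annotated, one clip added.
release_1 = {"clip-001": "h1", "clip-002": "h2"}
release_2 = {"clip-001": "h1", "clip-002": "h2b", "clip-003": "h3"}

print(diff_releases(release_1, release_2))
# {'added': ['clip-003'], 'removed': [], 'changed': ['clip-002']}
```

A diff like this lets a lab see which clips were re-annotated or withdrawn between releases, for example after a signer revokes consent, and retrain accordingly.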
And it needs to be built by people who know the language. This is the constraint that is hardest to work around. You cannot hire hearing annotators, give them a labeling interface, and produce linguistically valid sign language data. Every annotation decision requires knowledge of the grammar. That knowledge lives in the Deaf community.
Where this leaves foundation model labs
The labs building the next generation of multimodal AI are facing a choice. They can treat sign language as a problem for later — and continue producing models that are not genuinely multimodal. Or they can address the data gap now, before their competitors do, and establish a position in a modality that 70 million people rely on as their primary language.
The data infrastructure does not exist yet at the scale needed. That is the constraint. Not the models. Not the compute. Not the research interest. The data.
CLERC is building that infrastructure: a native ASL video corpus, expert-annotated to linguistic standards, with provenance tracking, consent compliance, and versioned releases designed for foundation model training. The architecture is built to scale to every sign language, starting with ASL.
If you are working on multimodal AI and you do not have a sign language strategy, the gap in your model is known, it is measurable, and it is closing — with or without you.