From Raw Video to Labeled Signs

Raw footage in, production-ready sign language data out.

Sign language has never had a clean dataset, not in the way computer vision has ImageNet or NLP has Common Crawl. What exists today are fragmented corpora, small and inconsistent, built by hearing teams who label what they see rather than what is actually being said.

Raw video of someone signing is not data. It is footage, and the gap between the two is exactly where every Sign Language AI model breaks down. CLERC exists to close that gap.

The problem with raw

A video of a Deaf signer contains a dense, three-dimensional linguistic signal where handshape, movement, location, facial grammar, and spatial reference all happen simultaneously at conversational speed.

Most existing datasets flatten all of this into a single English word per clip and call it done. That is not structure; it is a rough approximation at best. Models trained on these datasets perform well in controlled demos and collapse in production because they cannot handle signer variation, register shifts, or open vocabulary. The architecture is rarely the problem. The training data almost always is.

What CLERC does differently

We built a pipeline that transforms raw signing footage into structured, linguistically grounded data in three stages, with no shortcuts along the way.

Stage 01 / Source. Every video is recorded with native Deaf signers at their natural signing pace, without scripts adapted from English sentence structure and without hearing actors. The source material is authentic because the people producing it are native signers of the language.

Stage 02 / Structure. We extract the motion signal frame by frame across body, hands, and face, capturing and timestamping every micro-movement. This is not pose estimation for its own sake but the raw spatial grammar of the language, preserved at full resolution.
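To make the stage concrete, here is a minimal sketch of what a timestamped frame-level record in such a pipeline could look like. Every field name and landmark count below is an illustrative assumption, not CLERC's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Keypoint:
    """A single tracked landmark in normalized image coordinates."""
    x: float
    y: float
    confidence: float

@dataclass
class Frame:
    """One timestamped slice of the motion signal across body, hands, and face."""
    timestamp_ms: int
    body: list        # pose landmarks: shoulders, elbows, wrists, ...
    left_hand: list   # 21 landmarks per hand is a common convention
    right_hand: list
    face: list        # facial landmarks carry grammatical markers

def frame_times_ms(duration_s: float, fps: float = 30.0) -> list:
    """Millisecond timestamps for every frame in a clip of the given
    duration; at 30 fps a frame lands roughly every 33 ms."""
    n = int(duration_s * fps)
    return [round(i * 1000 / fps) for i in range(n)]
```

Timestamping at the frame level is what later makes millisecond segment boundaries meaningful: a boundary can only be as precise as the clock attached to the frames underneath it.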

Stage 03 / Meaning. Each sign is mapped to its gloss, its linguistic label, in context. The question is not just "what handshape is this" but "what does this sign mean in this sentence, signed by this person, at this moment." Temporal boundaries are marked at the millisecond, and the data carries meaning rather than just motion.
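One way to represent such a gloss-in-context mapping, sketched here under assumed field names (not CLERC's actual schema), is a segment record plus a boundary check:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GlossSegment:
    """One sign mapped to its gloss, with millisecond boundaries."""
    gloss: str       # the linguistic label, conventionally uppercase
    start_ms: int    # temporal boundary marked at the millisecond
    end_ms: int
    signer_id: str   # the same gloss can vary by signer and context

def validate(segments: list) -> None:
    """Boundaries must be well-formed and non-overlapping in time."""
    for seg in segments:
        if seg.end_ms <= seg.start_ms:
            raise ValueError(f"{seg.gloss}: end must come after start")
    ordered = sorted(segments, key=lambda s: s.start_ms)
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt.start_ms < prev.end_ms:
            raise ValueError(f"{prev.gloss} and {nxt.gloss} overlap")
```

Checks like this are cheap, but they are exactly what a single-label-per-clip dataset cannot even express: there are no boundaries to validate.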

The output is production-ready sign language data where every sign has been captured, segmented, and mapped to context.

What this looks like in practice

Take a simple sentence: "Are you tired?"

In most datasets, this would be a single video file tagged with an English translation, with no internal structure, no way to isolate individual signs, and no temporal information whatsoever.

In the CLERC pipeline, the same sentence becomes two distinct segments: YOU from 0.10 to 0.30s and TIRED starting at 0.60s. Each gloss carries its own temporal boundary, and each segment can be replayed, analyzed, and trained on in isolation. That level of granularity is what separates footage from data, and it does not exist in most sign language datasets today.
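The example above can be sketched as a structured annotation record. Field names are illustrative assumptions, not CLERC's schema, and TIRED's end boundary is left open here because the example does not state it:

```python
# Hypothetical annotation record for "Are you tired?"
annotation = {
    "translation": "Are you tired?",
    "segments": [
        {"gloss": "YOU",   "start_ms": 100, "end_ms": 300},
        {"gloss": "TIRED", "start_ms": 600, "end_ms": None},  # end left open
    ],
}

def frames_for_segment(segment, frame_timestamps_ms):
    """Isolate the frames belonging to one sign, so it can be replayed,
    analyzed, or trained on independently of the rest of the sentence."""
    start = segment["start_ms"]
    end = segment["end_ms"] if segment["end_ms"] is not None else float("inf")
    return [t for t in frame_timestamps_ms if start <= t <= end]

# One second of 30 fps video: a frame roughly every 33 ms.
timestamps = [round(i * 1000 / 30) for i in range(30)]
you_frames = frames_for_segment(annotation["segments"][0], timestamps)
```

With this structure, slicing out YOU is a filter over timestamps rather than a manual scrub through the video, which is the practical difference between footage and data.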

Why this matters

If you are training a sign language recognition model, your performance ceiling is determined by the quality of your data rather than the sophistication of your architecture. A transformer trained on poorly segmented, inconsistently glossed video will plateau early and fail on every edge case. Every researcher in the field already knows this; very few have the infrastructure to actually fix it.

If you are building a product that relies on sign language understanding, whether that is a recognition system, an avatar, a search engine, or something entirely new, the same constraint applies. Your model is only as good as the data underneath it, and right now that data barely exists in a structured form.

CLERC is building that infrastructure: structured glosses from native signers, millisecond-level segmentation, and a foundation that AI and ML teams can actually build on.

Built by the Deaf community, for everyone

This is not a dataset built about Deaf people but a dataset built by Deaf people.

The founder is Deaf, and the signers are native, which is not a value statement but a data quality requirement. A hearing team making judgment calls on gloss boundaries, regional variants, and prosodic features introduces systematic error that compounds silently through every downstream model. We eliminated that error at the source by making sure the people who build the data are the same people who live the language.

See it in action

The video below shows the pipeline in action, from raw footage to structured, labeled sign language data. This is what the foundation looks like when it is built correctly.

If you are working on Sign Language AI and want to dig into the technical details, we are happy to walk you through it. Reach out at florian@clerc.io.