ArXiv

Accurate and Efficient Statistical Testing for Word Semantic Breadth

Authors
Yo Ehara
Categories
cs.CL
arXiv
https://arxiv.org/abs/2605.08048v1
PDF
https://arxiv.org/pdf/2605.08048v1

Brief

Yo Ehara proposes a Householder-aligned permutation test for comparing word semantic breadth from contextualized token embeddings more accurately. By reflecting one token cloud so that the two mean directions coincide before permutation testing, the method prevents directional differences from masquerading as dispersion effects. Empirically, the alignment reduced Type-I error by 32.5% while preserving sensitivity to genuine breadth differences, and a GPU-batched implementation ran 23× faster than the CPU baseline. The paper has been accepted to the ACL 2026 Main Conference.

Why it matters

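Naive hypothesis tests on dispersion can report "statistically significant" breadth differences that are really differences in semantic direction, which inflates Type-I error. The paper introduces a Householder-aligned permutation test that applies a single Householder reflection to align the mean directions of the two token-vector clouds before nonparametric permutation testing, isolating dispersion (semantic breadth) from directional differences in contextualized embeddings.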
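
The alignment step can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's code: the function name `householder_align` is hypothetical, each cloud is assumed to be an (n, d) array with a nonzero mean, and the reflection is built from the difference of the two unit mean directions, which is one standard way to construct a Householder reflection that maps one unit vector onto another.

```python
import numpy as np

def householder_align(x, y, eps=1e-12):
    """Reflect cloud x so that its mean direction matches that of cloud y.

    x, y: arrays of shape (n_x, d) and (n_y, d), one token vector per row.
    Assumes both cloud means are nonzero (hypothetical helper, not from the paper).
    """
    a = x.mean(axis=0)
    b = y.mean(axis=0)
    a = a / np.linalg.norm(a)  # unit mean direction of x
    b = b / np.linalg.norm(b)  # unit mean direction of y
    v = a - b
    norm_v = np.linalg.norm(v)
    if norm_v < eps:
        # Mean directions already coincide; nothing to reflect.
        return x.copy()
    v = v / norm_v
    # Apply H = I - 2 v v^T to every row: x_i -> x_i - 2 (x_i . v) v
    return x - 2.0 * np.outer(x @ v, v)
```

Because a reflection is an orthogonal map, it leaves every vector norm and every pairwise distance within the reflected cloud unchanged, so only the cloud's direction moves and its internal spread is left intact.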

Key details

  • Compared with naive tests that confound direction and dispersion, the alignment reduced Type-I error by 32.5% in the reported experiments while preserving sensitivity to genuine breadth differences.
  • A GPU-oriented implementation batches the permutations and linear algebra operations, achieving a 23× speedup over the CPU baseline.
  • The paper, by Yo Ehara, was posted to arXiv on 2026-05-08 and accepted to the ACL 2026 Main Conference.
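
The batching idea can be sketched as follows. This is a rough sketch under several assumptions, not the paper's implementation: it runs in NumPy on the CPU as a stand-in for a GPU array library, it assumes the two clouds have already been aligned, and it uses mean squared distance to the centroid as the dispersion statistic (the digest does not say which statistic the paper uses). The names `dispersion` and `batched_permutation_pvalue` are hypothetical.

```python
import numpy as np

def dispersion(clouds):
    """Mean squared distance to the centroid, per cloud.

    clouds: array of shape (..., n, d); returns an array of shape (...).
    Assumed statistic for illustration only.
    """
    centered = clouds - clouds.mean(axis=-2, keepdims=True)
    return (centered ** 2).sum(axis=-1).mean(axis=-1)

def batched_permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Two-sided permutation p-value for a dispersion difference.

    x, y: already-aligned clouds of shape (n_x, d) and (n_y, d).
    All permutations are evaluated in one batch of array operations.
    """
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y], axis=0)
    n_x = x.shape[0]
    observed = abs(dispersion(x) - dispersion(y))
    # One random ordering of the pooled tokens per row, drawn in a single call.
    order = np.argsort(rng.random((n_perm, pooled.shape[0])), axis=1)
    shuffled = pooled[order]  # shape (n_perm, n_x + n_y, d)
    stats = np.abs(dispersion(shuffled[:, :n_x]) - dispersion(shuffled[:, n_x:]))
    # Add-one correction keeps the p-value strictly positive.
    return (1 + np.sum(stats >= observed)) / (1 + n_perm)
```

Holding every permutation in one (n_perm, n, d) array trades memory for throughput, so for large clouds the permutations would need to be processed in chunks.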

Source evidence

Abstract

Measuring the breadth of a word's meaning, or its spread across contexts, has become feasible with contextualized token embeddings. A word type can be represented as a cloud of token vectors, with dispersion-based statistics serving as proxies for contextual diversity (Nagata and Tanaka-Ishii, ACL2025). These measurements are useful for deciding appropriate sense distinctions when constructing thesauri and domain-specific dictionaries. However, when comparing the breadth of two word types, naive hypothesis testing on dispersion can be misleading: differences in semantic direction can masquerade as dispersion differences, inflating Type-I error and yielding "statistically significant" outcomes even when there is no true breadth difference. This is problematic because significance testing should distinguish genuine effects from incidental fluctuations in small-difference regimes. We propose a Householder-aligned permutation test to isolate dispersion differences from directional differences. Our method applies a single Householder reflection to align the mean directions of the two word types and then performs a permutation test on the aligned token clouds, yielding calibrated, non-parametric p-values. For practicality, we introduce a GPU-oriented implementation that batches permutations and linear algebra operations. Empirically, our alignment reduced Type-I error by 32.5% while preserving sensitivity to genuine breadth differences, and achieved a 23x speedup over the CPU baseline.

Comment: Accepted to ACL 2026 Main Conference