Classification Pipeline: All 17 labor themes are classified using English-language keyword regex patterns applied to title + full extract body (up to 10,000 characters). Non-English documents are systematically disadvantaged, contributing to higher cross-cutting rates in regions with lower English proficiency.
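A minimal sketch of keyword-regex classification of this kind. The theme names, patterns, and the classify_themes helper are hypothetical and chosen only for illustration; the actual 17 patterns are not reproduced here. Only the title + capped-body scan and the 10,000-character cap come from the description above.

```python
import re

EXTRACT_CAP = 10_000  # character cap on the extract body, per the pipeline description

# Hypothetical patterns for two themes; the real pipeline covers 17 themes.
THEME_PATTERNS = {
    "gig_work": re.compile(r"\b(gig work|platform work(ers?)?)\b", re.IGNORECASE),
    "just_transition": re.compile(r"\bjust transition\b", re.IGNORECASE),
}


def classify_themes(title: str, extract: str) -> list[str]:
    """Return themes whose English keyword pattern matches title + capped body."""
    text = f"{title}\n{extract[:EXTRACT_CAP]}"
    return sorted(name for name, pat in THEME_PATTERNS.items() if pat.search(text))


english = classify_themes(
    "Statement on gig work",
    "Platform workers need a just transition.",
)
spanish = classify_themes(
    "Declaración sobre el trabajo en plataformas",
    "Una transición justa para todos.",
)
print(english)  # ['gig_work', 'just_transition']
print(spanish)  # []
```

The Spanish statement covers the same two topics but matches neither pattern, which is the under-classification effect described above.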
Language Bias: The theme classifier uses English keyword patterns exclusively. Statements whose primary language is not English are almost certainly under-classified on all themes, not just surveillance. The limitation is epistemological as well as technical: concepts like “gig work” and “just transition” carry different meanings across linguistic traditions.
Text Extraction: Document extracts are capped at 10,000 characters (EXTRACT_CAP), creating position-dependent classification bias for longer documents. If surveillance context first appears beyond the 10,000-character cap, it will not be detected.
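The position dependence can be illustrated with a small sketch. The keyword pattern and the match_visibility helper are hypothetical; only EXTRACT_CAP and its value come from the text above.

```python
import re

EXTRACT_CAP = 10_000  # cap named in the text above

# Hypothetical keyword pattern, for illustration only.
PATTERN = re.compile(r"\bworkplace monitoring\b", re.IGNORECASE)


def match_visibility(body: str, cap: int = EXTRACT_CAP) -> str:
    """Say whether a keyword is seen within the cap, only beyond it, or not at all."""
    if PATTERN.search(body[:cap]):
        return "detected"
    if PATTERN.search(body):
        return "missed_beyond_cap"
    return "absent"


filler = "x " * 6_000  # 12,000 characters of padding
early = match_visibility("Workplace monitoring is rising. " + filler)
late = match_visibility(filler + "Workplace monitoring is rising.")
print(early)  # detected
print(late)   # missed_beyond_cap
```

Both documents contain the same sentence; only its position relative to the cap changes the outcome.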
Region Assignment: Items without geographic metadata are assigned to “Global/International” by default via normalize_region(), inflating that category.
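A sketch of how a defaulting normalizer of this kind behaves. Only the function name normalize_region() and the “Global/International” default come from the text above; the alias table and the sample inputs are hypothetical, and the real implementation may differ.

```python
from collections import Counter

DEFAULT_REGION = "Global/International"  # default named in the text above

# Hypothetical alias table; the real mapping is not documented here.
REGION_ALIASES = {
    "latam": "Latin America",
    "latin america": "Latin America",
    "europe": "Europe",
}


def normalize_region(raw):
    """Map raw region metadata to a canonical label, defaulting when it is missing."""
    if raw is None or not raw.strip():
        return DEFAULT_REGION
    cleaned = raw.strip()
    return REGION_ALIASES.get(cleaned.lower(), cleaned)


# Hypothetical sample: two items lack metadata, one is genuinely global.
raw_regions = ["Europe", None, "", "LATAM", "Global/International"]
counts = Counter(normalize_region(r) for r in raw_regions)
defaulted = sum(1 for r in raw_regions if r is None or not r.strip())
print(counts[DEFAULT_REGION], defaulted)  # 3 2
```

In this sample, two of the three “Global/International” items are there only because metadata was missing, which is the inflation the note describes.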
Surveillance Theme (v13 update): The surveillance theme uses a two-pass keyword gate with co-occurrence requirements. Three items (STMT-0496, STMT-1551, STMT-2040) are excluded by hard-coded rules. The v12 dashboard reported a heuristic-precision of 51.3% based on a 1,500-character scan window. That figure was a methodological artifact of the window cap. The window-cap fix, identified post hoc after inspecting false positives, widened the scan to the full 10,000-character extract body and corrected heuristic-precision to 87.2% (n=117, census, no CI). The 87.2% figure is a corrected-pipeline estimate, not a replication. Fifteen residual candidate false positives remain for human adjudication. The surveillance theme is composite, encompassing algorithmic management, biometric monitoring, bossware, and workplace monitoring.
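A rough sketch of a two-pass gate with a co-occurrence requirement and a configurable scan window. The keyword lists, the is_surveillance helper, the item ID STMT-9999, and the rule that both passes must match anywhere in the scanned text are all assumptions made for illustration; only the three excluded IDs and the 1,500 and 10,000 character windows come from the text above.

```python
import re

EXTRACT_CAP = 10_000
EXCLUDED_IDS = frozenset({"STMT-0496", "STMT-1551", "STMT-2040"})  # from the text above

# Hypothetical keyword lists; the real gate's patterns are not shown in this document.
GATE = re.compile(r"\b(surveillance|bossware|biometric|monitoring)\b", re.IGNORECASE)
CONTEXT = re.compile(r"\b(workers?|employees?|workplace)\b", re.IGNORECASE)


def is_surveillance(item_id: str, title: str, extract: str,
                    window: int = EXTRACT_CAP) -> bool:
    """Pass 1: a surveillance keyword; pass 2: a co-occurring labor-context term."""
    if item_id in EXCLUDED_IDS:
        return False  # hard-coded exclusion
    text = f"{title}\n{extract[:window]}"
    if not GATE.search(text):
        return False
    return CONTEXT.search(text) is not None


# Hypothetical document whose relevant sentence sits about 2,800 characters in.
body = ("Union statement on wages. " + "filler " * 400
        + "Employers deploy bossware to track workers.")
narrow = is_surveillance("STMT-9999", "Statement", body, window=1_500)
wide = is_surveillance("STMT-9999", "Statement", body)
excluded = is_surveillance("STMT-0496", "Statement", body)
print(narrow, wide, excluded)  # False True False
```

The same document fails under the 1,500-character window and passes under the full extract, which illustrates how a narrow window can depress a heuristic figure without any change in the underlying text.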
Religious Ethics Tracker: Originally developed to measure Catholic Social Teaching (CST) citations. Expanded in v12 to include Islamic, Protestant/Ecumenical, Buddhist, Jewish, and Gandhian/Indian ethical economics sources. Detection rates reflect both corpus composition and keyword specificity; absence of matches does not indicate absence of discourse in a tradition.
Collection Bias: The Tapestry database conducted 12+ targeted ingestion waves that systematically prioritized 2023–2026 documents. Temporal concentration figures reflect collection strategy alongside genuine discourse growth.
Heuristic vs. Human Judgment: All precision figures in this dashboard are heuristic-precision: the fraction of keyword-flagged items judged relevant by the heuristic rules themselves. Only manual domain-expert review can establish true classification precision. Heuristic-precision is a lower bound on true precision only if the heuristic’s errors are biased toward false negatives, that is, if the rules wrongly reject relevant items at least as often as they wrongly accept irrelevant ones.
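The arithmetic behind these figures can be sketched as follows. The function names are hypothetical. The sketch assumes that the 15 residual candidate false positives are exactly the flagged items not judged relevant (which is consistent with the reported 87.2% on n=117). The error counts passed to true_precision are invented purely to show the direction of the bound and are not measurements.

```python
def heuristic_precision(flagged: int, judged_relevant: int) -> float:
    """Fraction of keyword-flagged items that the heuristic rules judge relevant."""
    if flagged <= 0 or not 0 <= judged_relevant <= flagged:
        raise ValueError("need 0 <= judged_relevant <= flagged and flagged > 0")
    return judged_relevant / flagged


def true_precision(flagged: int, judged_relevant: int,
                   wrongly_accepted: int, wrongly_rejected: int) -> float:
    """Precision after a human review corrects the heuristic's two error types."""
    return (judged_relevant - wrongly_accepted + wrongly_rejected) / flagged


# 117 flagged items, 15 not judged relevant (assumption stated in the lead-in).
p = heuristic_precision(117, 117 - 15)
print(f"{p:.1%}")  # 87.2%

# Hypothetical review outcomes, for illustration only.
more_rejections = true_precision(117, 102, wrongly_accepted=2, wrongly_rejected=5)
more_acceptances = true_precision(117, 102, wrongly_accepted=5, wrongly_rejected=2)
print(more_rejections > p, more_acceptances > p)  # True False
```

Heuristic-precision sits below true precision only in the first case, where wrongly rejected items outnumber wrongly accepted ones.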