AI Training Datasets: Comparative Analysis
This analysis examines four major AI training datasets using ORP v0.2 post-hoc reconstruction to reveal patterns of constitutive opacity, systemic harm, and accountability gaps across the AI ecosystem.
Overview
We reconstructed four foundational AI training datasets using the OpenReason Protocol v0.2:
- ImageNet ILSVRC-2012 - Computer vision dataset (1.28M images)
- LAION-5B - Web-scale image-text pairs (5.85B)
- Common Crawl - Web text corpus (250TB, 250B pages)
- GitHub Copilot - Code training dataset (100M+ repositories)
Total documentation: 8,328 lines of ORP YAML capturing constitutive decisions, accountability gaps, and systemic harms.
Why These Four Datasets?
These datasets represent the foundational infrastructure of modern AI:
- ImageNet enabled the deep learning revolution (AlexNet 2012 → ResNet 2015 → Vision Transformers 2020)
- LAION-5B powers generative AI (Stable Diffusion, open DALL-E reproductions)
- Common Crawl trains virtually all large language models (GPT-3, GPT-4, LLaMA, Claude, Gemini)
- GitHub Copilot represents code AI (1M+ users, $100M+ revenue)
Together, they demonstrate how constitutive decisions made between 2008 and 2022 shape AI systems deployed in 2024.
Comparative Table
| Dataset | Size | Domain | Key Harm | Tier | Affected Population | Remediation |
|---|---|---|---|---|---|---|
| ImageNet | 1.28M images | Computer vision | Taxonomic bias, offensive person categories | 3-4 | Billions (downstream model bias) | Possible (categories deprecated 2021) |
| LAION-5B | 5.85B pairs | Multi-modal | 3,226 CSAM URLs, medical privacy violations | 1 | CSAM victims, medical patients | Possible (URLs removed Aug 2024) |
| Common Crawl | 250TB, 250B pages | NLP | 90%+ English hegemony, linguistic extinction | 1 | 4-5B people (low-resource languages) | Impossible (extinction irreversible) |
| GitHub Copilot | 100M+ repos | Code | GPL violations, OSS labor exploitation | 2-3 | 10M+ OSS maintainers | Impossible (baked into model weights) |
Harm Tiers:
- Tier 1: Existential (CSAM, language extinction)
- Tier 2: Legal ($1-10B lawsuits)
- Tier 3: Economic (labor exploitation, bias)
- Tier 4: Ecosystem (sustainability crises)
Common Patterns Across All Four
1. Constitutive Opacity
None of the four datasets documented constitutive decisions prospectively.
- ImageNet (2009): WordNet taxonomy selection rationale not documented
- LAION-5B (2022): No safety filtering justification documented
- Common Crawl (2008): Universal crawl decision (no linguistic targets) not documented
- GitHub Copilot (2021): License non-compliance rationale not documented
All four relied on academic papers (Deng et al. 2009 for ImageNet; Schuhmann et al. 2022 for LAION-5B; Chen et al. 2021 for Copilot/Codex) or, in Common Crawl's case, informal technical documentation, none of which captured constitutive-layer decisions.
2. Absent Nodes (Layer 3)
All four datasets had absent nodes - stakeholders with zero representation in decision-making:
| Dataset | Absent Node | Impact |
|---|---|---|
| ImageNet | 49,000 Amazon Mechanical Turk workers | Shaped annotations but had no governance voice |
| LAION-5B | CSAM victims, medical patients | 3,226 CSAM URLs, medical privacy violations |
| Common Crawl | 4-5B low-resource language speakers | 90%+ English hegemony, linguistic extinction |
| GitHub Copilot | 10M+ open source maintainers | $0 compensation despite $100M+ revenue |
Pattern: Vulnerable populations systematically excluded from governance, then disproportionately harmed.
3. Accountability Gaps (Layer 4)
All four datasets have critical accountability gaps:
- Alternatives never evaluated: License-compliant training, linguistic diversity targets, safety filtering, worker compensation
- Cost-benefit never documented: Hash-matching ($50-100K), revenue sharing ($10M/year), linguistic diversity ($1-50M/year)
- Governance mechanisms absent: No process to address harms discovered years later
- Downstream use untracked: No audit trail for models trained on these datasets
4. Systemic Harm Propagation
Downstream cascade: Dataset biases propagate to models, then to deployed systems, affecting billions of people.
Example cascade:
- ImageNet (2009): 80% North American/European images
- → AlexNet/ResNet/Vision Transformers trained on ImageNet
- → Facial recognition systems deployed globally
- → Error rates up to 34.7% for darker-skinned women vs. under 1% for lighter-skinned men (Buolamwini & Gebru 2018)
- → Millions misidentified, some wrongfully arrested
Similar cascades for LAION-5B → Stable Diffusion, Common Crawl → GPT-4, GitHub Copilot → 1M+ developers.
Unique Insights by Dataset
ImageNet: Taxonomic Decisions Have Decade-Long Impact
Key insight: WordNet taxonomy selection (2007-2009) embedded English-language Western categorization that persisted for 12 years (2009-2021). 54% of person categories later found offensive, but taxonomy decisions were never documented or governable.
What ORP reveals: Layer 1 data provenance would have documented taxonomy selection rationale. Layer 3 empathy mapping would have identified affected communities (people in offensive categories). Layer 4 accountability ledger would have enabled governance for deprecation decisions.
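The mechanism behind this decade-long impact can be sketched in a few lines: once a taxonomy is selected, its leaf nodes become the dataset's label space wholesale, sub-categories and all. The toy hierarchy below is hypothetical, for illustration only; it is not the actual WordNet structure.

```python
# Illustrative sketch: a dataset's label space is fixed by its taxonomy choice.
# This toy hierarchy is hypothetical, not the actual WordNet taxonomy.

TAXONOMY = {
    "entity": ["organism", "artifact"],
    "organism": ["person", "animal"],
    "person": ["worker", "athlete"],   # person sub-categories become labels too
    "animal": ["dog", "cat"],
    "artifact": ["vehicle"],
}

def leaf_categories(node, taxonomy):
    """Return the leaf nodes under `node` -- these become dataset labels."""
    children = taxonomy.get(node)
    if not children:
        return [node]
    leaves = []
    for child in children:
        leaves.extend(leaf_categories(child, taxonomy))
    return leaves

# Every leaf under "person" is inherited into the label space automatically,
# which is why taxonomy selection is a constitutive decision.
print(leaf_categories("person", TAXONOMY))  # ['worker', 'athlete']
```

Because label inclusion follows mechanically from the taxonomy, no later annotation step re-examines whether individual person categories belong in the dataset at all.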
LAION-5B: Safety-By-Design vs Post-Hoc Filtering
Key insight: CLIP threshold (0.28) prioritized dataset size over safety. No hash-matching against NCMEC PhotoDNA despite being standard practice for user-generated content platforms. Result: 3,226 CSAM URLs distributed for 21 months (federal crime, 18 U.S.C. § 2252).
What ORP reveals: Layer 2 consequence simulation would have forced cost-benefit analysis: hash-matching ($50-100K, 2-3 months delay) vs Tier 1 harm (child exploitation). Layer 3 empathy mapping would have identified CSAM victims as vulnerable population requiring absolute protection.
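The hash-matching step the analysis says was skipped is mechanically simple. The sketch below uses exact SHA-256 matching as a simplified stand-in; production systems use perceptual hashes (e.g., NCMEC PhotoDNA) that also match near-duplicates, and the blocklist contents here are hypothetical placeholders.

```python
import hashlib

# Minimal sketch of hash-matching against a known-harm blocklist before
# inclusion in a dataset. Exact SHA-256 matching is a simplified stand-in
# for perceptual hashing (e.g., PhotoDNA); blocklist entries are hypothetical.

BLOCKLIST = {
    hashlib.sha256(b"known-harmful-image-bytes").hexdigest(),
}

def passes_hash_check(image_bytes: bytes) -> bool:
    """Return False if the image matches the blocklist; run before inclusion."""
    return hashlib.sha256(image_bytes).hexdigest() not in BLOCKLIST

candidates = [b"benign-image-bytes", b"known-harmful-image-bytes"]
kept = [img for img in candidates if passes_hash_check(img)]
print(len(kept))  # 1
```

The point of the cost-benefit framing above is that this check runs in constant time per image; the $50-100K cost is in licensing the hash database and integrating perceptual matching, not in the filtering itself.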
Common Crawl: Irreversible Harm at Civilizational Scale
Key insight: Constitutive decision in 2008 (universal crawl, no linguistic targets) never revisited despite AI training becoming primary use case by 2018. Result: 90%+ English hegemony baked into GPT-3, GPT-4, LLaMA, Claude, Gemini, and 100+ LLMs.
What ORP reveals: Layer 3 empathy mapping would have identified 4-5B low-resource language speakers as vulnerable population bearing Tier 1 irreversible harm (language extinction). Unlike LAION-5B (post-hoc remediation possible), extinction cannot be undone.
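A prospective linguistic-diversity check of the kind the analysis argues for could be run cheaply at crawl time. The threshold and language tags below are illustrative assumptions, not Common Crawl policy.

```python
from collections import Counter

# Sketch of a prospective linguistic-diversity check a crawl pipeline could
# run per snapshot. The 60% max-share threshold is an illustrative assumption.

def language_shares(page_langs):
    """Map each language tag to its share of the corpus."""
    counts = Counter(page_langs)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}

def violates_diversity_target(page_langs, max_share=0.60):
    """Flag the corpus if any single language exceeds the target share."""
    return max(language_shares(page_langs).values()) > max_share

sample = ["en"] * 9 + ["sw"]              # 90% English, 10% Swahili
print(violates_diversity_target(sample))  # True
```

A check like this makes the constitutive decision visible: either the crawl meets the target, or the decision to ship anyway must be documented in the accountability ledger.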
GitHub Copilot: Legal Violations Baked Into Model Weights
Key insight: GPL copyleft potentially violated, MIT/Apache attribution stripped, DMCA § 1202 violations (removing copyright management information). Result: $1-10B lawsuit pending, 10M+ OSS maintainers receive $0 compensation despite $100M+ revenue.
What ORP reveals: Layer 1 data provenance would have documented license distribution. Layer 3 empathy mapping would have identified OSS maintainers as absent node bearing disproportionate harm (labor exploitation, license violations). Layer 2 consequence simulation would have evaluated alternatives (license-compliant training, revenue sharing).
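The Layer 1 license-distribution step described above amounts to tallying licenses across the training corpus and flagging those that impose obligations on derived works. The copyleft list below is illustrative, not a legal determination.

```python
from collections import Counter

# Sketch of a Layer 1 provenance check: tally repository licenses before
# training and flag copyleft licenses that impose obligations (attribution,
# share-alike) on derived works. The COPYLEFT set is illustrative only.

COPYLEFT = {"GPL-2.0", "GPL-3.0", "AGPL-3.0"}

def license_report(repo_licenses):
    """Return (full license distribution, copyleft licenses flagged)."""
    counts = Counter(repo_licenses)
    flagged = {lic: n for lic, n in counts.items() if lic in COPYLEFT}
    return counts, flagged

repos = ["MIT", "GPL-3.0", "MIT", "Apache-2.0", "GPL-3.0"]
counts, flagged = license_report(repos)
print(flagged)  # {'GPL-3.0': 2}
```

With such a report in hand before training, excluding or separately handling copyleft repositories becomes a documented decision rather than an omission discovered in litigation.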
What ORP Enables: Prospective Governance
Key finding: All four datasets demonstrate that constitutive opacity creates systemic harm.
ORP enables prospective governance:
- Layer 1 (Data Provenance): Document constitutive decisions transparently
- Layer 2 (Consequence Simulation): Simulate harms before dataset release
- Layer 3 (Empathy Mapping): Identify vulnerable populations and absent nodes
- Layer 4 (Accountability Ledger): Create governance mechanisms for course correction
- Layer 5 (Fork Registry): Enable alternative datasets when harm discovered
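The five layers above can be treated as a completeness contract on a dataset's documentation. The sketch below models an ORP document as a plain Python dict; the field names inside each layer are assumptions for illustration, not the actual ORP v0.2 schema - only the five layer names come from the protocol as described here.

```python
# Illustrative skeleton of a prospective ORP document as a Python dict.
# Field names inside each layer are hypothetical; only the five layer names
# follow the protocol as described in this analysis.

REQUIRED_LAYERS = [
    "data_provenance",         # Layer 1
    "consequence_simulation",  # Layer 2
    "empathy_mapping",         # Layer 3
    "accountability_ledger",   # Layer 4
    "fork_registry",           # Layer 5
]

def missing_layers(orp_doc: dict) -> list:
    """Return the layers a dataset's ORP document has not yet filled in."""
    return [layer for layer in REQUIRED_LAYERS if not orp_doc.get(layer)]

draft = {
    "data_provenance": {"constitutive_decisions": ["universal crawl"]},
    "empathy_mapping": {"absent_nodes": ["low-resource language speakers"]},
}
print(missing_layers(draft))
# ['consequence_simulation', 'accountability_ledger', 'fork_registry']
```

A release gate as simple as "no missing layers" would have forced each of the four datasets to document the decisions this analysis had to reconstruct post hoc.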
Implications
For AI Developers
Current practice: Maximize training data size for best model performance, address harms post-hoc (if discovered).
ORP recommendation: Conduct prospective governance analysis before training. Layer 2 consequence simulation and Layer 3 empathy mapping would have prevented:
- LAION-5B: 3,226 CSAM URLs (Tier 1 harm)
- GitHub Copilot: $1-10B lawsuit (Tier 2 harm)
- Common Crawl: Linguistic extinction acceleration (Tier 1 harm)
Cost: $50K-50M depending on dataset scale (hash-matching, linguistic diversity, license compliance). Benefit: Prevent Tier 1-2 harms, avoid lawsuits, improve ecosystem sustainability.
For Policymakers
Current regulation: GDPR and the EU AI Act focus on downstream use (deployed models); they do not address the constitutive layer (dataset creation).
ORP recommendation: Require prospective ORP documentation for AI training datasets:
- Layer 1: Data provenance and constitutive decisions
- Layer 3: Vulnerable population assessment
- Layer 4: Accountability mechanisms
Precedent: GDPR Article 35 (Data Protection Impact Assessment) for high-risk processing. ORP extends to constitutive-layer impact assessment.
For Researchers
Current academic documentation: Papers focus on technical methods (CLIP filtering, WordNet taxonomy) but do not capture constitutive decisions (why these choices? what alternatives were considered?).
ORP recommendation: Publish ORP documents alongside academic papers. Example: Schuhmann et al. (2022) LAION-5B paper + ORP document would have documented safety filtering decisions, enabling community governance.
For the Public
Current transparency: Academic papers (technical), terms of service (legal), model cards (limited). No unified framework for constitutive-layer transparency.
ORP enables: Cross-dataset analysis (like this page), identifying systemic patterns, holding creators accountable for constitutive decisions.
Next Steps
Immediate Actions
- Validate these examples: Use the online validator to check ORP compliance
- Fork and improve: All four examples are open for community improvement via GitLab forks
- Contribute more examples: We need ORP reconstructions of more AI training datasets
Long-term Goals
- Prospective ORP adoption: Next AI training dataset documented using ORP before release
- Regulatory integration: ORP becomes standard for AI training data impact assessment
- Community governance: ORP documents enable stakeholder participation in dataset governance
- Ecosystem sustainability: Revenue sharing, attribution, consent mechanisms become standard practice
Download All Examples
- ImageNet (YAML) - 1,513 lines, 88KB
- LAION-5B (YAML) - 2,162 lines, 123KB
- Common Crawl (YAML) - 2,232 lines, 123KB
- GitHub Copilot (YAML) - 2,421 lines, 132KB
Methodology note: These four post-hoc reconstructions are based on published papers, technical reports, legal filings, and external research. They represent external analysts’ best efforts to document constitutive decisions using available information. Original dataset creators were not involved in producing these ORP documents.
Validation: All four examples validate successfully as ORP-PostHoc (ORP v0.2 compliance level).