Data Governance
Article 10 requirements for training, validation, and testing data.
Data Governance (Article 10)
Learning Objectives
By the end of this chapter, you will be able to:
- Implement comprehensive data governance for AI training, validation, and testing
- Apply Article 10 data quality requirements in practice
- Conduct systematic bias examination across protected characteristics
- Navigate the GDPR intersection for processing sensitive personal data
- Document data governance processes for compliance demonstration
Article 10 establishes mandatory data governance practices for high-risk AI systems. Since AI systems are fundamentally shaped by their training data, poor data governance leads to unreliable, biased, or unsafe AI. This article ensures data quality from collection through model deployment.
Scope: What Data is Covered
Article 10 applies to three categories of datasets:
| Dataset Type | Purpose | Governance Requirement |
|---|---|---|
| Training Data | Model learning and development | Full Article 10 requirements |
| Validation Data | Model tuning and hyperparameter selection | Full Article 10 requirements |
| Testing Data | Performance evaluation and verification | Full Article 10 requirements |
Compliance Note
Per Article 10(6), for the development of high-risk AI systems not using techniques involving the training of AI models, paragraphs 2 to 5 apply only to the testing data sets.
The Data Governance Framework
Article 10(2): Mandatory Governance Practices
You must implement governance practices covering:
| Requirement | What It Means | Practical Implementation |
|---|---|---|
| (a) Design choices | Document why specific data was selected | Data selection criteria documentation |
| (b) Collection processes | Record how data was gathered and its origin | Data provenance tracking |
| (c) Preparation operations | Document annotation, labelling, cleaning | Data pipeline documentation |
| (d) Assumptions | State what the data is meant to measure | Data dictionary and metadata |
| (e) Availability/suitability | Assess if data is sufficient for purpose | Data adequacy assessment |
| (f) Bias examination | Check for discriminatory patterns | Bias audit processes |
| (g) Bias mitigation | Appropriate measures to detect, prevent and mitigate possible biases identified according to point (f) | Bias mitigation implementation |
| (h) Gaps/shortcomings | Identify what's missing or problematic | Data gap analysis |
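The eight practices above lend themselves to a structured record per dataset. A minimal sketch follows; the field names and example values are illustrative, not prescribed by the Act:

```python
from dataclasses import dataclass, field

@dataclass
class DataGovernanceRecord:
    """Illustrative record covering the Article 10(2)(a)-(h) practices."""
    dataset_name: str
    design_choices: str            # (a) why this data was selected
    collection_process: str        # (b) how and where the data was gathered
    preparation_operations: list[str] = field(default_factory=list)  # (c)
    assumptions: list[str] = field(default_factory=list)             # (d)
    adequacy_assessment: str = ""                                    # (e)
    bias_findings: list[str] = field(default_factory=list)           # (f)
    bias_mitigations: list[str] = field(default_factory=list)        # (g)
    known_gaps: list[str] = field(default_factory=list)              # (h)

record = DataGovernanceRecord(
    dataset_name="loan-applications-2020-2023",
    design_choices="Historical applications matching the deployment population",
    collection_process="Exported from core banking system; provenance logged",
)
record.known_gaps.append("Under-representation of applicants aged 18-25")
```

Keeping one such record per training, validation, and testing dataset makes the later Annex IV documentation largely a matter of export.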
Data Quality Requirements
Article 10(3): Quality Criteria
Datasets must meet these quality standards:
Relevant
- Data directly relates to the intended purpose
- Features are predictive of target outcomes
- Domain-appropriate data sources
Sufficiently Representative
- Covers the deployment population
- Includes edge cases and boundary conditions
- Geographic and demographic coverage
Free of Errors (to the best extent possible)
- Accurate labelling and annotation
- Correct data values
- Minimal measurement errors
Complete (in view of intended purpose)
- No critical missing data
- Sufficient sample sizes
- Temporal coverage as needed
Data Quality Checklist
| Quality Dimension | Assessment Questions |
|---|---|
| Accuracy | Are labels correct? Are measurements precise? |
| Completeness | Is required data present? Are there gaps? |
| Consistency | Is data formatted uniformly? Are definitions stable? |
| Timeliness | Is data current? Does it reflect deployment conditions? |
| Representativeness | Does data reflect the deployment population? |
| Relevance | Does data relate to the intended purpose? |
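Several of these dimensions can be checked automatically. A sketch for completeness and consistency, assuming records arrive as dictionaries (the sample data and thresholds are illustrative):

```python
def completeness(records, required_fields):
    """Fraction of records with all required fields present and non-empty."""
    ok = sum(
        1 for r in records
        if all(r.get(f) not in (None, "") for f in required_fields)
    )
    return ok / len(records)

def duplicate_rate(records, key_fields):
    """Fraction of records sharing a key with an earlier record."""
    seen, dupes = set(), 0
    for r in records:
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            dupes += 1
        seen.add(key)
    return dupes / len(records)

records = [
    {"id": 1, "label": "approve", "income": 42000},
    {"id": 2, "label": "", "income": 51000},      # missing label
    {"id": 2, "label": "deny", "income": 51000},  # duplicate id
    {"id": 3, "label": "deny", "income": None},   # missing income
]
print(completeness(records, ["label", "income"]))  # 0.5
print(duplicate_rate(records, ["id"]))             # 0.25
```

Running such checks at ingestion, and again before each retraining, turns the checklist into evidence you can attach to the data quality assessment.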
Bias Examination Requirements
Article 10(2)(f): Mandatory Bias Assessment
You must examine datasets for possible biases that are "likely to affect the health and safety of persons, have a negative impact on fundamental rights or lead to discrimination prohibited under Union law."
Types of Bias to Examine
| Bias Type | Description | Example |
|---|---|---|
| Selection Bias | Non-representative sampling | Recruiting AI trained only on tech workers |
| Measurement Bias | Inconsistent data collection | Different interview standards for groups |
| Label Bias | Discriminatory labelling patterns | Historical bias in performance ratings |
| Representation Bias | Under/over-representation of groups | Medical AI trained mostly on one gender |
| Aggregation Bias | Grouping hides disparities | One model for diverse populations |
| Historical Bias | Data reflects past discrimination | Credit data reflecting redlining |
Protected Characteristics to Assess
Under EU non-discrimination law, examine bias across:
- Sex/Gender
- Racial or ethnic origin
- Religion or belief
- Disability
- Age
- Sexual orientation
- Nationality
Bias Examination Process
Step 1: Demographic Analysis
- Analyse representation of protected groups
- Identify under/over-represented populations
- Document representation gaps
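Step 1 can be sketched as a comparison of group shares against a reference population; the attribute name, reference shares, and tolerance below are illustrative assumptions:

```python
from collections import Counter

def representation_gaps(records, attribute, reference_shares, tolerance=0.05):
    """Flag groups whose share in the dataset deviates from the
    reference population share by more than `tolerance`."""
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    gaps = {}
    for group, ref in reference_shares.items():
        share = counts.get(group, 0) / total
        if abs(share - ref) > tolerance:
            gaps[group] = round(share - ref, 3)
    return gaps

# Illustrative dataset: 30% female, 70% male against a 50/50 reference.
records = [{"sex": "F"}] * 30 + [{"sex": "M"}] * 70
print(representation_gaps(records, "sex", {"F": 0.5, "M": 0.5}))
# {'F': -0.2, 'M': 0.2}
```

The returned deviations are exactly the "representation gaps" to document; the appropriate tolerance depends on the intended purpose and deployment context.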
Step 2: Label Distribution Analysis
- Examine outcome labels across groups
- Identify historical discrimination patterns
- Assess label consistency across groups
Step 3: Feature Analysis
- Identify features correlated with protected characteristics
- Assess proxy discrimination risks
- Document feature selection rationale
Step 4: Subgroup Performance
- Test model performance across groups
- Identify disparate accuracy or error rates
- Document performance gaps
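Step 4 amounts to slicing an evaluation metric by group and reporting the worst-case gap. A sketch using accuracy (the labels and groups are illustrative; in practice you would also examine error types, not just accuracy):

```python
def subgroup_accuracy(y_true, y_pred, groups):
    """Accuracy per protected group plus the worst-case gap between groups."""
    per_group = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        correct = sum(1 for i in idx if y_true[i] == y_pred[i])
        per_group[g] = correct / len(idx)
    gap = max(per_group.values()) - min(per_group.values())
    return per_group, gap

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
per_group, gap = subgroup_accuracy(y_true, y_pred, groups)
print(per_group, gap)  # {'A': 1.0, 'B': 0.5} with a 0.5 accuracy gap
```

A documented gap of this size would normally trigger the mitigation measures required under Article 10(2)(g).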
Processing Sensitive Personal Data
Article 10(5): Special Category Data Exception
Processing special category data (Article 9 GDPR) is exceptionally permitted where strictly necessary for bias detection and correction, and only when all six conditions set out in Article 10(5)(a)-(f) are met:
Conditions (all must be met):
- (a) The bias detection and correction cannot be effectively fulfilled by processing other data, including synthetic or anonymised data
- (b) The special categories of personal data are subject to technical limitations on re-use and state-of-the-art security and privacy-preserving measures, including pseudonymisation
- (c) The special categories of personal data are subject to measures to ensure that the personal data processed are secured, protected, subject to suitable safeguards, including strict controls and documentation of the access
- (d) The special categories of personal data are not to be transmitted, transferred or otherwise accessed by other parties
- (e) The special categories of personal data are deleted once the bias has been corrected or the personal data has reached the end of its retention period, whichever comes first
- (f) The records of processing activities pursuant to Regulation (EU) 2016/679 (GDPR) include the reasons why the processing of special categories of personal data was strictly necessary to detect and correct biases, and why that objective could not be achieved by processing other data
Required Safeguards:
- Technical measures (pseudonymisation, access controls)
- Organisational measures (policies, training)
- Prohibition of processing for any other purpose
- Deletion after bias monitoring complete
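Condition (b) expressly names pseudonymisation. One common technique is a keyed hash: the same identifier always maps to the same token, so records remain linkable for subgroup analysis without exposing identities. A sketch (the key value and record layout are illustrative; the key must be stored separately from the dataset and rotated per your security policy):

```python
import hmac
import hashlib

# Illustrative secret; in production this lives in a key vault,
# never alongside the pseudonymised dataset.
SECRET_KEY = b"example-key-stored-separately"

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a keyed, non-reversible token."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

record = {"name": "Jane Example", "ethnicity": "X", "outcome": "approved"}
record["name"] = pseudonymise(record["name"])
```

Note that pseudonymised data is still personal data under the GDPR, so the remaining Article 10(5) conditions, including deletion under point (e), continue to apply.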
| GDPR Article 9 Category | AI Act Treatment |
|---|---|
| Racial/ethnic origin | May process for bias monitoring with safeguards |
| Political opinions | May process for bias monitoring with safeguards |
| Religious beliefs | May process for bias monitoring with safeguards |
| Health data | May process for bias monitoring with safeguards |
| Sex life/orientation | May process for bias monitoring with safeguards |
| Biometric data | May process for bias monitoring with safeguards |
Expert Insight
The AI Act creates a specific legal basis for processing sensitive data to prevent AI discrimination. This is a significant departure from GDPR's otherwise restrictive approach to special category data. Document your justification carefully.
Data Provenance and Lineage
Tracking Data Origins
For each dataset, document:
| Element | Required Information |
|---|---|
| Source | Where data originated |
| Collection method | How data was gathered |
| Collection date | When data was collected |
| Legal basis | Lawful basis for collection |
| Transformations | How data was processed |
| Chain of custody | Who handled the data |
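The lineage elements above map naturally onto a per-dataset record. A minimal sketch mirroring the table (field names and example values are illustrative):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ProvenanceRecord:
    """Illustrative lineage entry for one dataset."""
    source: str
    collection_method: str
    collection_date: date
    legal_basis: str
    transformations: list[str] = field(default_factory=list)
    chain_of_custody: list[str] = field(default_factory=list)

rec = ProvenanceRecord(
    source="CRM export, EU region",
    collection_method="Batch export via internal API",
    collection_date=date(2024, 3, 1),
    legal_basis="GDPR Art. 6(1)(f) legitimate interest",
)
rec.transformations.append("Deduplicated on customer_id")
rec.chain_of_custody.append("2024-03-01 data-engineering: ingest and validation")
```

Appending to `transformations` and `chain_of_custody` at each pipeline stage yields an audit trail without separate bookkeeping.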
Third-Party Data Considerations
When using external data:
- Verify provider's data governance practices
- Obtain representations about data quality
- Conduct independent quality assessment
- Document due diligence process
Documentation Requirements
Data Governance Documentation
Your technical documentation (Annex IV) must include:
| Document | Content |
|---|---|
| Data Documentation | Description of datasets, collection, preparation |
| Bias Assessment Report | Methods and findings of bias examination |
| Data Quality Assessment | Evidence of quality criteria compliance |
| Sensitive Data Justification | If applicable, justification for Article 10(5) |
| Gap Analysis | Identified shortcomings and mitigation |
Integration with Other Requirements
| Requirement | Data Governance Connection |
|---|---|
| Risk Management (Art. 9) | Data risks feed into risk assessment |
| Technical Documentation (Art. 11) | Data governance is mandatory documentation content |
| Accuracy (Art. 15) | Data quality directly affects accuracy |
| Post-Market Monitoring (Art. 72) | Monitor for data drift and degradation |
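Data drift under post-market monitoring can be quantified in several ways; one widely used metric (not mandated by the Act) is the population stability index over a binned feature distribution. A sketch, with illustrative bin shares and the common 0.2 rule-of-thumb threshold:

```python
import math

def population_stability_index(expected, actual):
    """PSI between two binned distributions (shares summing to 1).
    A common rule of thumb treats PSI > 0.2 as significant drift."""
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, 1e-6), max(a, 1e-6)  # avoid log(0) on empty bins
        psi += (a - e) * math.log(a / e)
    return psi

baseline = [0.25, 0.25, 0.25, 0.25]  # feature distribution at training time
current = [0.10, 0.20, 0.30, 0.40]   # distribution observed in production
print(round(population_stability_index(baseline, current), 4))  # 0.2282
```

A value above the chosen threshold would feed back into the risk management process and may signal that the training data no longer satisfies the representativeness criterion.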
Data Governance Compliance Checklist
- Training, validation, and testing data identified
- Data design choices documented
- Collection processes and origins recorded
- Preparation operations documented
- Data assumptions stated
- Availability and suitability assessed
- Bias examination completed across protected characteristics
- Data gaps and shortcomings identified
- Quality criteria (relevant, representative, error-free, complete) assessed
- Sensitive data processing justified (if applicable)
- Data governance documentation complete
What You Learned
Key concepts from this chapter
Article 10 applies to **training, validation, AND testing data**
Data must be **relevant, sufficiently representative, and (to the best extent possible) free of errors and complete** in view of the intended purpose
**Bias examination is mandatory** across protected characteristics
**Sensitive personal data** may be processed for bias monitoring under strict conditions
**Document everything**—data governance is core to technical documentation