Training and Data Mining Assertion

The C2PA technical specification allows actors in a workflow to make cryptographically signed assertions about the produced C2PA asset.

The training and data mining assertion enables a human actor to provide information to a C2PA Manifest Consumer about whether an asset with C2PA metadata may be used as part of a data mining or AI/ML training workflow.

Version 1.1 Draft 22 July 2024 · Version history

The Creator Assertions Working Group expects to release an update to this specification late in 2024. This page is the working draft of that update. Until that time, implementers should refer to the 1.0 version of this specification.

Maintainers:

License

This specification is subject to the Community Specification License 1.0.

Additional information about this specification’s scope and governance can be found at the project’s GitHub repository (creator-assertions/training-and-data-mining-assertion). The Community Specification License documents at the root of that repository are the authoritative governance documents for this specification.

Contributing

This section is non-normative.

This specification is an active working draft. If you wish to contribute to its development, you are invited to:

Foreword

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. No party shall be held responsible for identifying any or all such patent rights.

Any trade name used in this document is information given for the convenience of users and does not constitute an endorsement.

This document was prepared by the Creator Assertions Working Group.

Known patent licensing exclusions are available in the specification’s notices.md file.

Any feedback or questions on this document should be directed to the specifications repository (GitHub: creator-assertions/training-and-data-mining-assertion).

THESE MATERIALS ARE PROVIDED “AS IS.” The Contributors and Licensees expressly disclaim any warranties (express, implied, or otherwise), including implied warranties of merchantability, non-infringement, fitness for a particular purpose, or title, related to the materials. The entire risk as to implementing or otherwise using the materials is assumed by the implementer and user. IN NO EVENT WILL THE CONTRIBUTORS OR LICENSEES BE LIABLE TO ANY OTHER PARTY FOR LOST PROFITS OR ANY FORM OF INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY CHARACTER FROM ANY CAUSES OF ACTION OF ANY KIND WITH RESPECT TO THIS DELIVERABLE OR ITS GOVERNING AGREEMENT, WHETHER BASED ON BREACH OF CONTRACT, TORT (INCLUDING NEGLIGENCE), OR OTHERWISE, AND WHETHER OR NOT THE OTHER MEMBER HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

1. Introduction

This section is non-normative.

1.1. Scope

For purposes of the Community Specification License, the scope.md document at the root of this project’s GitHub repository is the governing document of this specification’s scope.

3. Assertion definition

3.1. Overview

This assertion enables a human actor to provide information to a Manifest Consumer about whether the asset may be used as part of a data mining or AI/ML training workflow. This is expressed in the assertion through a map of one or more training-mining-entries. Each entry states whether the use it names is allowed, notAllowed, or constrained.

There are four pre-defined entries:

cawg.data_mining

Can any text or data content be extracted from the asset for the purpose of determining “patterns, trends, and correlations”?

This would include images containing text, where the text could be extracted via OCR.

cawg.ai_inference

Can the asset be used as input to a trained AI/ML model for the purpose of inferring a result?

cawg.ai_generative_training

Can the asset be used as training data for an AI/ML model that could generate assets?

cawg.ai_training

Can the asset be used as data to train non-generative AI/ML models, such as those used for classification or object detection?

cawg.ai_generative_training and cawg.ai_training are separate values because generative AI training enables new assets to be created from training assets, while other types, such as object detection, do not.

In addition to the pre-defined entries, a claim generator may also add its own custom keys, provided that they conform to the same syntax for custom labels as defined in Section 6.2, “Labels,” of the C2PA Technical Specification. Labels beginning with the prefix cawg. are reserved for use in future versions of this specification and MUST NOT be assigned by any other claim generator.
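This example is non-normative. The reserved-prefix rule above can be sketched in Python as follows. The function name `is_assignable_entry_label` is hypothetical. The sketch checks only the cawg. prefix rule and does not validate the custom-label syntax from the C2PA Technical Specification.

```python
# Entry labels pre-defined by this specification (taken from the list above).
PREDEFINED_ENTRY_LABELS = frozenset({
    "cawg.data_mining",
    "cawg.ai_inference",
    "cawg.ai_generative_training",
    "cawg.ai_training",
})

# Prefix reserved for future versions of this specification.
RESERVED_PREFIX = "cawg."


def is_assignable_entry_label(label: str) -> bool:
    """Return True if a claim generator may use `label` as an entry key.

    Hypothetical helper: only the reserved-prefix rule is checked here.
    The full custom-label syntax defined by the C2PA Technical
    Specification is NOT validated.
    """
    if label in PREDEFINED_ENTRY_LABELS:
        return True
    return not label.startswith(RESERVED_PREFIX)
```

Under these assumptions, a pre-defined label such as cawg.ai_training is accepted, an invented label such as cawg.my_custom_use is rejected, and a label outside the reserved prefix (for example, the made-up com.example.archival) passes this particular check.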

The value constrained means that permission is not unconditionally granted for this usage. Consumers that wish to use the content in this way may wish to contact the actor who is the rights holder, author, or signer to obtain more information or permission. In the absence of additional information, constrained shall be treated as equivalent to notAllowed. More details on the constraints may be provided in the constraint_info text field.

Possible values for constraint_info include a well-known description of a license (e.g., Creative Commons), a URL to a policy file, or free text.
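This example is non-normative. The rule that constrained falls back to notAllowed can be sketched in Python as follows. The function `effective_choice` and its `constraints_satisfied` flag are hypothetical: the flag stands for whatever out-of-band process a consumer uses to confirm that it has obtained permission or met the stated constraints. Returning `None` for an absent entry is also an assumption of this sketch, since this section does not say how a missing entry is to be interpreted.

```python
from typing import Optional


def effective_choice(
    entries: dict,
    label: str,
    constraints_satisfied: bool = False,
) -> Optional[str]:
    """Resolve the entry for `label` to "allowed" or "notAllowed".

    `entries` maps entry labels to entry maps, each with a "use" key.
    Returns None when no entry exists for `label` (an assumption of this
    sketch; the assertion simply makes no statement about that use).
    """
    entry = entries.get(label)
    if entry is None:
        return None

    use = entry["use"]
    if use == "constrained":
        # Without additional information, constrained is treated as
        # equivalent to notAllowed.
        return "allowed" if constraints_satisfied else "notAllowed"
    return use


example_entries = {
    "cawg.ai_training": {"use": "allowed"},
    "cawg.data_mining": {"use": "constrained"},
}
```

With `example_entries`, a consumer that has no further information resolves cawg.data_mining to notAllowed, while cawg.ai_training resolves to allowed.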

A training and data mining assertion SHALL have a label of cawg.training-mining.

Notice to implementers of previous (C2PA 1.x) definition of this assertion

Implementers who are transitioning from the earlier definition of this assertion should pay special attention to label names.

The training and data mining assertion as defined in version 1.4 of the C2PA technical specification used labels with the prefix c2pa. for the assertion itself and for the pre-defined training-mining-map entries.

This specification is not a product of the C2PA itself, so it cannot use the c2pa. prefix. Therefore, though structurally similar to the C2PA 1.x definition, the labels have been changed to cawg. in this specification.
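This example is non-normative. A claim generator that is moving to this specification might map the earlier label names to the new ones roughly as follows. The sketch assumes, as the notice above describes, that the earlier labels differ from the new ones only in their prefix. The names `LEGACY_TO_CAWG`, `to_cawg_label`, and `relabel_entries` are hypothetical. The mapping is meant for producing new assertions; relabeling the content of an assertion that has already been signed would invalidate it.

```python
# Labels defined by this specification: the assertion label itself and
# the four pre-defined entry labels.
CAWG_LABELS = (
    "cawg.training-mining",
    "cawg.data_mining",
    "cawg.ai_inference",
    "cawg.ai_generative_training",
    "cawg.ai_training",
)

# Assumption: each earlier label is the same name with a "c2pa." prefix.
LEGACY_TO_CAWG = {
    "c2pa." + label[len("cawg."):]: label for label in CAWG_LABELS
}


def to_cawg_label(label: str) -> str:
    """Map an earlier c2pa.-prefixed label to its cawg. equivalent.

    Labels that are not part of the earlier definition of this assertion
    (including unrelated c2pa. labels) are returned unchanged.
    """
    return LEGACY_TO_CAWG.get(label, label)


def relabel_entries(entries: dict) -> dict:
    """Return a copy of `entries` with earlier entry labels renamed."""
    return {to_cawg_label(key): value for key, value in entries.items()}
```

An explicit lookup table is used instead of a blanket prefix swap so that other c2pa. labels, which have no cawg. counterpart in this specification, are left alone.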

3.2. Schema and example

The CDDL Definition for this type is:

; Assertion for specifying whether the associated asset and its data
; may be used for training an AI/ML model or mined for its data (or both).


; Possible values
$training-mining-choice /=  "allowed"
$training-mining-choice /=  "notAllowed"
$training-mining-choice /=  "constrained"

; Description of the data structure
training-mining-map-entry = {
	"use": $training-mining-choice,
	? "constraint_info": tstr .size (1..max-tstr-length) ; information about the use of `constrained`
}

training-mining-map = {
	? "cawg.data_mining"				: training-mining-map-entry,
	? "cawg.ai_inference"				: training-mining-map-entry,
	? "cawg.ai_training"				: training-mining-map-entry,
	? "cawg.ai_generative_training"  	: training-mining-map-entry,
	* tstr => any	; allow for any other custom use case
	? "metadata": $assertion-metadata-map    	; additional information about the assertion
}

An example in CBOR Diagnostic Format (.cbordiag) is shown below:

{
  "entries": {
    "cawg.ai_training": {
      "use": "allowed"
    },
    "cawg.ai_generative_training": {
      "use": "notAllowed"
    },
    "cawg.data_mining": {
      "use": "constrained",
      "constraint_info": "may only be mined on days whose names end in 'y'"
    }
  }
}
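This example is non-normative. A consumer might check a decoded entry against the training-mining-map-entry rule with a sketch like the one below. It operates on a plain Python dict, on the assumption that the CBOR has already been decoded by some library. The function `validate_entry` is hypothetical, and so is the value of `MAX_TSTR_LENGTH`: max-tstr-length is not defined in this section, so a placeholder is used. Because $training-mining-choice is an extensible CDDL type socket, other specifications could add further values; this sketch accepts only the three listed here.

```python
# The three choices defined in this section.
KNOWN_CHOICES = ("allowed", "notAllowed", "constrained")

# Placeholder only: max-tstr-length is not defined in this section.
MAX_TSTR_LENGTH = 1024

# Keys permitted in a training-mining-map-entry.
ENTRY_KEYS = ("use", "constraint_info")


def validate_entry(entry) -> list:
    """Return a list of problems found in one entry (empty if none)."""
    if not isinstance(entry, dict):
        return ["entry must be a map"]

    errors = []
    for key in entry:
        if key not in ENTRY_KEYS:
            errors.append(f"unexpected key: {key!r}")

    # "use" is required and must be one of the known choices.
    if "use" not in entry:
        errors.append('missing required key "use"')
    elif entry["use"] not in KNOWN_CHOICES:
        errors.append(f'unknown "use" value: {entry["use"]!r}')

    # "constraint_info" is optional; when present it must be non-empty
    # text. As we understand RFC 8610, .size on a text string counts
    # bytes, so the UTF-8 encoded length is checked.
    if "constraint_info" in entry:
        info = entry["constraint_info"]
        if not isinstance(info, str):
            errors.append('"constraint_info" must be a text string')
        elif not 1 <= len(info.encode("utf-8")) <= MAX_TSTR_LENGTH:
            errors.append('"constraint_info" length out of range')

    return errors


# Entries mirroring the example above.
example_entries = {
    "cawg.ai_training": {"use": "allowed"},
    "cawg.ai_generative_training": {"use": "notAllowed"},
    "cawg.data_mining": {
        "use": "constrained",
        "constraint_info": "may only be mined on days whose names end in 'y'",
    },
}

problems = {
    label: validate_entry(entry) for label, entry in example_entries.items()
}
```

Every list in `problems` is empty for the example entries. An entry such as {"use": "maybe"} or one with an empty constraint_info string would produce a non-empty list.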

Appendix A: Version history

This section is non-normative.

17 July 2024

22 July 2024

  • Promoted from pre-draft to draft status.