These questions came up in a FESCo ticket [1] recently and the primary purpose of this thread is to have some public record of the conversation around the handling of pre-trained weights for AI/ML models as packaged for Fedora.
[1] https://pagure.io/fesco/issue/3175
Intro and Definitions
=====================
Previous conversations have involved a decent amount of confusion around terminology and I want to be clear about what I'm asking so I'm starting with a few definitions in the context of my questions.
Artificial Neural Network (ANN) - effectively structured data consisting of neurons (nodes containing some value) organized into layers, with various connections between the neurons that control the flow of data through the entire network. The exact strength of each connection is found through the training process, and these values are generally referred to as weights.
Model - A model by itself is a description of a specific ANN - how layers are configured, how they interact with each other, how model training is done, how data needs to be structured for using a trained model, and so on. A model by itself is rarely, if ever, useful. Models generally need to be trained on data before they can be used, but many models offer a mechanism through which weights can be loaded from a model which has already been trained. An untrained model without pre-trained weights is pretty much just code.
Pre-Trained Weights - Pre-trained weights are essentially the data contained in a model after training it on some input data. Training modern ANN models is a very expensive and time-consuming process; pre-trained weights allow people to use models without having to train them locally or even have access to the data needed to train them.
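To make the distinction concrete, here is a minimal, purely hypothetical sketch (plain Python, no ML library) of the sense in which weights are data rather than code: they can be serialized to disk and loaded back into an untrained model structure later.

```python
import io
import pickle

# Hypothetical illustration: "weights" are just numeric data produced by
# training, separate from the model code that consumes them.
weights = {"layer1": [0.12, -0.4, 0.8],
           "layer2": [[0.5, -0.1], [0.3, 0.9]]}

# Serialize to bytes, much as a .pth/.pt file sits on disk ...
buf = io.BytesIO()
pickle.dump(weights, buf)

# ... and load them back into an (untrained) model structure later.
buf.seek(0)
restored = pickle.load(buf)
assert restored == weights
```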
Questions
=========
1. Are pre-trained weights considered to be normal non-code content/data or do they require special handling?
2. If an upstream offers pre-trained weights and indicates that those weights are available under a license which is acceptable for non-code content in Fedora, can those pre-trained weights be included in Fedora packages?
3. Extending question 2, is it considered sufficient for an upstream to have a license on pre-trained weights or would a packager/reviewer need to verify that the data used to train those weights is acceptable?
4. Is it acceptable to package code which downloads pre-trained weights from a non-Fedora source upon first use post-installation by a user, if that model and its associated weights are:
   a. For a specific model?
   b. For a user-defined model which may or may not exist at the time of packaging?
I can provide examples of any of these situations if that would be helpful.
Thanks,
Tim
On Mon, Feb 26, 2024 at 6:32 PM Tim Flink tflink@fedoraproject.org wrote:
- Are pre-trained weights considered to be normal non-code content/data or do they require special handling?
My thought is that they should be considered "content" for Fedora packaging purposes. The legal docs say:
"For purposes of Fedora license classification, “content” means any material that is not clearly code, documentation, fonts or firmware. Here are some examples of content:
- graphic image files
- audio files
- nonfunctional data sets
- AppStream metainfo.xml files
- standards documents
- certain files relating to functionality and management of markup languages, including XML schema files and resource resolution files, XSL files, SGML declaration files, and ancillary informal documentation accompanying such files"
First, aren't trained model weights a kind of "nonfunctional data set"? (We don't define what "nonfunctional" means -- I'm pretty sure we copied that phrase from the old wiki documentation -- and frankly I'm not sure what it means, but I think it goes to the nonexecutable nature of the data. Weights don't function by themselves.)
Second, it seems to me that pretrained model weights are "not clearly code, documentation, fonts or firmware". One of the purposes of the content category is to allow relaxed license criteria for noncode/non-documentation things needed by Fedora packages. Note though that the current relaxed criteria only extend to two features:

"The license may restrict or prohibit modification
The license may say that it does not cover patents or grant any patent licenses"

(the latter being a reference to CC0)
However, there is a reason why I felt it was important to bump this issue to FESCo. I thought FESCo might wish to take a position that pretrained weights, being the result of a training process on some training data, are analogous to object code (even if not "code" for Fedora license classification purposes). It sounds like they don't want to take a position on this.
This topic relates very closely to certain current issues of interest in the wider world, for example the Open Source Initiative's effort to define "Open Source AI" (see: https://discuss.opensource.org/). There is definitely some sentiment among some participants in that effort that, for a so-called "AI system" to be "open source", training data must be "open", largely because it is thought that this is necessary for users to exercise rights of modification. I don't think that debate is dispositive of the Fedora question. If a Fedora package contains pretrained weights, it is not necessarily an assertion that such a package is "open source" in a precise sense, any more than Fedora is asserting that firmware packages are "open source". It is true that Fedora cannot reasonably claim to be 100% FOSS if it packages stuff like firmware, or "content" under licenses that prohibit modification.
You might say that holders of those viewpoints in the OSI effort are adopting a view that model weights are "code", if you map things to Fedora license approval concepts.
Anyway, I'm struggling to see a justification for not classifying pretrained weights as "content". I am not sure it is of much practical significance though.
- If an upstream offers pre-trained weights and indicates that those weights are available under a license which is acceptable for non-code content in Fedora, can those pre-trained weights be included in Fedora packages?
This is what I thought ought to be a FESCo question. If FESCo doesn't actually care and sees this as a Fedora legal question, then this question is really equivalent to the first question, isn't it? If it's "content", and it's under a license acceptable for "content", then as far as Fedora legal is concerned it can be included in Fedora packages.
- Extending question 2, is it considered sufficient for an upstream to have a license on pre-trained weights or would a packager/reviewer need to verify that the data used to train those weights is acceptable?
So this is where I think we should initially be a little cautious and look at these things on a case-by-case basis, perhaps until we get more experience with handling this topic. Maybe there could be circumstances where given what is disclosed, or not disclosed, about how a model was trained, we might want to not package the pretrained weights in Fedora. I think that is unlikely, but not impossible.
- Is it acceptable to package code which downloads pre-trained weights from a non-Fedora source upon first use post-installation by a user if that model and its associated weights are a. For a specific model? b. For a user-defined model which may or may not exist at the time of packaging?
I can provide examples of any of these situations if that would be helpful.
Can you elaborate on 4a/4b with examples?
Richard
On 2/26/24 19:06, Richard Fontana wrote:
<snip>
- Is it acceptable to package code which downloads pre-trained weights from a non-Fedora source upon first use post-installation by a user if that model and its associated weights are a. For a specific model? b. For a user-defined model which may or may not exist at the time of packaging?
I can provide examples of any of these situations if that would be helpful.
Can you elaborate on 4a/4b with examples?
There are 2 simple examples for the two cases I mentioned (4a and 4b) at the bottom of this email
Tim
-----------------------------------------------------------------
4a - code that downloads pre-trained weights for a specific model
-----------------------------------------------------------------
torchvision [1] is a pytorch-adjacent library which contains "Datasets, Transforms and Models specific to Computer Vision". torchvision contains code implementing several pre-defined model structures which can be used with or without pre-trained weights [2]. torchvision is distributed under a BSD 3-clause license [3] and is currently packaged in Fedora as python-torchvision, but all of the specific model code is removed at package build time and is not distributed as a Fedora package.
As an example, to instantiate a vision transformer (ViT) base model variant with 16x16 input patch size and download pre-trained weights, the following python code could be used:
```
import torchvision

vitb16 = torchvision.models.vit_b_16()
```
The code describing the vit_b_16 model is included in torchvision but the weights are downloaded from an external site when the model is first used. At the time I write this, the weights are downloaded from https://download.pytorch.org/models/vit_b_16-c867db91.pth
In this case and for all the other models contained in torchvision, the exact links to the pretrained weights are all contained within the torchvision code.
Something worthy of note is that the weights for vit_b_16 are from Facebook's SWAG project [4] which is distributed as CC-BY-NC-4.0 [5] and would not be acceptable for use in a Fedora package. For the other models in torchvision, some of the pre-trained weights have an explicit license (like ViT) but many of them are not distributed under any explicit license (ResNet[6] as an example).
[1] https://github.com/pytorch/vision [2] https://github.com/pytorch/vision/tree/main/torchvision/models [3] https://github.com/pytorch/vision/blob/main/LICENSE [4] https://github.com/facebookresearch/SWAG [5] https://github.com/facebookresearch/SWAG/blob/main/LICENSE [6] https://pytorch.org/hub/pytorch_vision_resnet/
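For illustration, here is a rough sketch of the download-on-first-use pattern described above, in plain Python with a simulated fetch function so it runs offline. The registry and function names are hypothetical stand-ins, not torchvision's actual internals; the URL is the one quoted earlier in this email.

```python
import os
import tempfile

# Hypothetical registry mirroring the idea that the exact weight URLs
# are hard-coded inside the library's source.
WEIGHT_URLS = {
    "vit_b_16": "https://download.pytorch.org/models/vit_b_16-c867db91.pth",
}

def load_weights(model_name, cache_dir, fetch):
    """Return the cached weights path, downloading via fetch() if missing."""
    url = WEIGHT_URLS[model_name]
    path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(path):        # first use: transparent download
        with open(path, "wb") as f:
            f.write(fetch(url))
    return path

# Simulated fetch so the sketch runs without network access.
calls = []
def fake_fetch(url):
    calls.append(url)
    return b"fake-weight-bytes"

with tempfile.TemporaryDirectory() as cache:
    load_weights("vit_b_16", cache, fake_fetch)
    load_weights("vit_b_16", cache, fake_fetch)  # cache hit: no re-download

assert len(calls) == 1
```

The point of the sketch is that the user triggers the download simply by instantiating the model; nothing in the packaged code asks first.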
----------------------------------------------------
4b - code that downloads a somewhat arbitrary model
----------------------------------------------------
One of the newer features of pytorch (which is still considered to be in beta) is the ability to interface with "PyTorch Hub" [7] to use pre-defined and pre-trained models which have been uploaded by other users. At the time of this writing, the PyTorch Hub appears to be moderated by the pytorch team, but the underlying code supports loading semi-arbitrary models from user-defined locations at runtime.
As an example, this code loads a MiDaS v3 large model with pre-trained weights directly from Intel's GitHub repo [8]:

```
import torch

model_type = "DPT_Large"
midas = torch.hub.load("intel-isl/MiDaS", model_type)
```
Similar to the ViT example above, this model will download weights from a URL (https://github.com/isl-org/MiDaS/releases/download/v3/dpt_large_384.pt at the time of this writing), but unlike the ViT example, the model definition and the location of the weights are determined by code contained in the GitHub repository specified by the user [9], which is downloaded at runtime to determine the exact link to any code and pre-trained weights. The MiDaS repository is distributed under an MIT license [10].
[7] https://pytorch.org/hub/ [8] https://github.com/isl-org/MiDaS [9] https://github.com/isl-org/MiDaS/blob/master/hubconf.py#L218 [10] https://github.com/isl-org/MiDaS/blob/master/LICENSE
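To illustrate the 4b mechanism, here is a hedged sketch of hub-style loading in plain Python. The hubconf.py content and helper names are hypothetical stand-ins for what PyTorch Hub actually does, but the key point is the same: the code that picks the model and the weight location comes from the user-named repository, not from the packaged library.

```python
import importlib.util
import os
import tempfile

# Stand-in for a hubconf.py that a third-party repository would supply.
# In a real hubconf, the entry point would build the model and download
# weights from a URL chosen by the repo author, not by the distribution.
HUBCONF = '''
def DPT_Large():
    return {"model": "DPT_Large",
            "weights_url": "https://example.invalid/dpt_large.pt"}
'''

def hub_load(repo_dir, entrypoint):
    """Import <repo_dir>/hubconf.py and call the named entry point."""
    spec = importlib.util.spec_from_file_location(
        "hubconf", os.path.join(repo_dir, "hubconf.py"))
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)          # runs code fetched at runtime
    return getattr(mod, entrypoint)()

with tempfile.TemporaryDirectory() as repo:
    with open(os.path.join(repo, "hubconf.py"), "w") as f:
        f.write(HUBCONF)
    model = hub_load(repo, "DPT_Large")

assert model["model"] == "DPT_Large"
```

Note that `hub_load` executes code it did not ship with, which is the property that makes 4b different from 4a.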
On Tue, Feb 27, 2024 at 5:58 PM Tim Flink tflink@fedoraproject.org wrote:
On 2/26/24 19:06, Richard Fontana wrote:
<snip>
- Is it acceptable to package code which downloads pre-trained weights from a non-Fedora source upon first use post-installation by a user if that model and its associated weights are a. For a specific model?
What do you mean by "upon first use post-installation"? Does that mean I install the package, and the first time I launch it or whatever, it automatically downloads some set of pre-trained weights, or is this something that would be controlled by the user? The example you gave suggests the latter but I wasn't sure if I was misunderstanding.
Richard
<snip>
--
On 2/28/24 19:03, Richard Fontana wrote:
On Tue, Feb 27, 2024 at 5:58 PM Tim Flink tflink@fedoraproject.org wrote:
On 2/26/24 19:06, Richard Fontana wrote:
<snip>
- Is it acceptable to package code which downloads pre-trained weights from a non-Fedora source upon first use post-installation by a user if that model and its associated weights are a. For a specific model?
What do you mean by "upon first use post-installation"? Does that mean I install the package, and the first time I launch it or whatever, it automatically downloads some set of pre-trained weights, or is this something that would be controlled by the user? The example you gave suggests the latter but I wasn't sure if I was misunderstanding.
Once the package is installed, pre-trained weights would be downloaded if and only if code written to use a specific model with pre-trained weights is run. In the cases I'm aware of, code that would cause the weights to be downloaded is not directly part of the packaged libraries; anything that could trigger the download would have to be written by a user or contained in a separate package. If a specific model with pre-trained weights is not used and not executed by another library/application, the weights will not be downloaded. With the ViT example, the vitb16 weights would be downloaded when that code (not included in the package) is run, but the vitb32 weights would not be downloaded unless the example was changed or something else specified a pre-trained ViT model with the vitb32 weights. Similarly, the weights for other models (googlenet, as an example) would not be downloaded unless code that uses that specific model in its pre-trained form is executed post-installation.
The implementations that I'm familiar with will check for downloaded weights as the code is initialized. When done in this way, the download is transparent to the user and, unless code using these models/weights is written in such a way that it gives the user a choice, there is not much a user could do to change the download URL or prevent the weights from being downloaded. The only ways I can think of offhand would be to modify the underlying libraries to override the hard-coded URLs, or maybe to put identically named files in the cache location, but that would end up being dependent on the model implementation. For the specific libraries I used as examples, I don't know what the local download folder is off the top of my head, nor do I know if they do any verification of downloads, so putting files into the cache location may not work if they don't match the intended file contents.
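As a sketch of why pre-seeding the cache may fail, consider a loader that embeds a hash fragment in the filename and verifies it before use (the helper below is hypothetical, and I'm assuming such verification, which not every implementation does):

```python
import hashlib
import os
import tempfile

# Hypothetical verifying loader: the filename carries a hash fragment
# (torchvision-style names like vit_b_16-c867db91.pth suggest this
# pattern) which is checked against the file's actual digest.
def load_cached(path, expected_prefix):
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
    if not digest.startswith(expected_prefix):
        raise ValueError("cached file does not match expected hash")
    return path

with tempfile.TemporaryDirectory() as cache:
    # A user pre-seeds the cache with an identically named stand-in file.
    seeded = os.path.join(cache, "vit_b_16-c867db91.pth")
    with open(seeded, "wb") as f:
        f.write(b"user-supplied stand-in weights")
    try:
        load_cached(seeded, "c867db91")  # stand-in bytes won't match
        verified = True
    except ValueError:
        verified = False

assert verified is False
```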
This is just my opinion but I doubt that many people writing code that uses pre-trained models are going to go out of their way to help users avoid downloading pre-trained weights. I know that for code that I've written using pre-trained models, it might be able to execute without the pre-trained weights but the output would just be noise in that situation. I would have a hard time justifying the work needed to make those downloads optional since it would make the code useless for what it was intended to do.
It may also be worth noting that some models with pre-trained weights are almost useless without those weights. For some (mostly older) models, it's feasible to train a model from scratch but for many of the recent models, it's just not feasible. As an example, the weights for Meta's Llama 2 took 3.3 million hours of GPU time to train [1] with a cost into the millions of USD ignoring what it would take to obtain enough data to train a model that large.
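For a rough sanity check on that cost claim (the per-GPU-hour rates below are my own assumption, not a figure from Meta):

```python
# Back-of-the-envelope check on the training-cost claim above. The
# GPU-hour total is Meta's published figure; the hourly rates are an
# assumed range, not numbers from this thread.
gpu_hours = 3_300_000
low_rate, high_rate = 1.0, 2.0   # assumed USD per GPU-hour

low_cost = gpu_hours * low_rate
high_cost = gpu_hours * high_rate
print(f"${low_cost / 1e6:.1f}M - ${high_cost / 1e6:.1f}M")  # → $3.3M - $6.6M
```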
Apologies for my verbosity but I hope that I answered your question and the extra bits weren't entirely useless.
Tim
<snip>
--
legal mailing list -- legal@lists.fedoraproject.org To unsubscribe send an email to legal-leave@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/legal@lists.fedoraproject.org Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue
On Fri, Mar 1, 2024 at 4:52 PM Tim Flink tflink@fedoraproject.org wrote:
<snip>
This sounds like it falls in the same bucket as pip, snapd, gem, and other similar "package manager" functionality.
On Fri, Mar 1, 2024 at 4:54 PM Neal Gompa ngompa13@gmail.com wrote:
This sounds like it falls in the same bucket as pip, snapd, gem, and other similar "package manager" functionality.
I agree, that's a good analogy.
Richard
On 3/1/24 14:54, Neal Gompa wrote:
<snip>
This sounds like it falls in the same bucket as pip, snapd, gem, and other similar "package manager" functionality.
Yeah, the capabilities do overlap but in my opinion, the intended uses are different and that may be worth noting.
pip, as an example, is intended to allow users to install python packages sourced from outside Fedora repos. I don't believe that software which ran pip after installation with no direct user interaction would be allowed in Fedora.
The pre-trained models that I'm familiar with, however, download things transparently to the user with no warning outside of a log message when the weights are first downloaded.
As an example, I wrote some code called openqa_classifier [1] to test the possibility of identifying OpenQA [2] test failures as duplicates of a long-running issue. The code was written only to run an experiment, so I wouldn't package it in its current form, but for the sake of argument, let's say that I did. Only one of the experiments is relevant here - the one that looks at whether existing, more sophisticated pre-trained models can outperform a simple custom model trained from scratch.
If you installed openqa_classifier (pretending that the data was already available and that I created a sane entry point for the CLI) and ran 'openqa_classifier train torch', that command would almost immediately download pre-trained weights from a URL that's hardcoded in the torchvision module if those weights didn't already exist locally. The only user-facing indication that this had happened would be a few lines in the CLI output and some new files on disk.
I'm not arguing against including code which could download pre-trained weights but I do want to be reasonably sure that I've explained all this correctly.
Tim
[1] https://pagure.io/fedora-qa/openqa_classifier [2] https://openqa.fedoraproject.org/
On 3/1/24 15:32, Tim Flink wrote:
On 3/1/24 14:54, Neal Gompa wrote:
On Fri, Mar 1, 2024 at 4:52 PM Tim Flink tflink@fedoraproject.org wrote:
On 2/28/24 19:03, Richard Fontana wrote:
On Tue, Feb 27, 2024 at 5:58 PM Tim Flink tflink@fedoraproject.org wrote:
On 2/26/24 19:06, Richard Fontana wrote:
<snip>
4. Is it acceptable to package code which downloads pre-trained weights from a non-Fedora source upon first use post-installation by a user if that model and its associated weights are a. For a specific model?
What do you mean by "upon first use post-installation"? Does that mean I install the package, and the first time I launch it or whatever, it automatically downloads some set of pre-trained weights, or is this something that would be controlled by the user? The example you gave suggests the latter but I wasn't sure if I was misunderstanding.
Once the package is installed, pre-trained weights would be downloaded if and only if code written to use a specific model with pre-trained weights is run. In the cases I'm aware of, code that would cause the weights to be downloaded is not directly part of the packaged libraries, and anything that could trigger the downloading of pre-trained weights would have to be written by a user or contained in a separate package. If a specific model with pre-trained weights is not used and not executed by another library/application, the weights will not be downloaded. With the ViT example, the vitb16 weights would be downloaded when that code (not included in the package) is run, but the vitb32 weights would not be downloaded unless the example was changed or something else specified a pre-trained ViT model with the vitb32 weights. Similarly, the weights for other models (googlenet, as an example) would not be downloaded unless code that uses that specific model in its pre-trained form is executed post-installation.
Extending this example to the not-hardcoded-in-packaged-code variety, running 'openqa_classifier train huggingface' would almost immediately download model specifications from huggingface.co and whatever pre-trained weights those specifications currently point to.
An example of the pre-trained models that code uses is https://huggingface.co/microsoft/swinv2-large-patch4-window12-192-22k
Tim
On Fri, Mar 1, 2024 at 5:38 PM Tim Flink tflink@fedoraproject.org wrote:
pip, as an example is intended to allow users to install python packages sourced from outside Fedora repos. I don't believe that software which used pip after installation with no direct user interaction would be allowed in Fedora.
The pre-trained models that I'm familiar with, however, download things transparently to the user with no warning outside of a log message when the weights are first downloaded.
I feel like you're raising a more general issue here which I don't really know the answer to. This is not specific to pretrained models. Couldn't *any* Fedora package have behavior such that it "downloads things transparently to the user with no warning"? If so, what, if any, Fedora technical or packaging policy regulates this?
I can imagine a range of cases, such as:
1. Package provides a tool that can be used by a user to deliberately obtain arbitrary third-party content under the user's direction. This undoubtedly describes lots of existing Fedora packages and I think it's pretty clear that this should normally be okay. Otherwise we couldn't package firefox, wget, curl or pip.
2. Package causes the download (transparently to the user, unless you assume a sort of omniscient user) of some content that would not comply with default Fedora licensing policies if it were packaged directly in the package. I feel like there must already be examples of packages like this.
3. Package causes the download (transparently to the user ... ) of some third-party content that violates some non-license-related Fedora legal policy and which would not be permitted to be packaged directly.
4. Package causes the download of some third-party content that violates some non-legal Fedora policy (for example, some sort of content Fedora has deemed offensive).
5. Package causes the download of some third-party content that gives rise to a security issue, where knowledge of the security issue would have prevented direct packaging of the content.
I just skimmed through the Fedora packaging guidelines and the FESCo-related documentation and didn't seem to find anything on this sort of topic.
Richard
On Fri, Mar 1, 2024 at 10:08 PM Richard Fontana rfontana@redhat.com wrote:
At this point, this discussion is a bit much.
We have game engines with data file downloaders for demo content, we have web browsers that auto-download things on launch, and so on.
If you're really worried about it, tweak pytorch to require configuration or make a prompt when it triggers the first time or something.
We did this with gdb and debuginfod, and that's probably the closest pattern to go with for this.
But this is not a legal question per se, this is a functionality and philosophy question.
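A first-use consent gate along those lines could be sketched as follows. This is a hypothetical illustration, not gdb's actual debuginfod implementation; the `ask` parameter defaults to the interactive prompt and is injectable purely so the sketch can be exercised without a terminal:

```python
from pathlib import Path

def downloads_allowed(config_path: Path, ask=input) -> bool:
    """First-use consent gate, loosely modeled on gdb's debuginfod prompt.

    Ask the user once, persist the answer, and reuse the stored choice on
    every later call. Sketch only; a real version would live inside the
    library that performs the download.
    """
    if config_path.exists():
        # A stored answer exists, so never prompt again.
        return config_path.read_text().strip() == "yes"
    reply = ask("Allow downloading pre-trained weights from the internet? [y/N] ")
    answer = "yes" if reply.strip().lower().startswith("y") else "no"
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(answer + "\n")
    return answer == "yes"
```

The library's download path would then call this before fetching anything, turning the currently silent download into an informed, one-time choice.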
-- 真実はいつも一つ!/ Always, there's only one truth!
On Fri, Mar 1, 2024 at 10:25 PM Neal Gompa ngompa13@gmail.com wrote:
I mean, why isn't it a legal question in some cases? What if a package on first launch downloads an unauthorized copy of _Pirates of the Caribbean_? I might be okay with the answer that "this has never happened and we can deal with that problem if it ever arises".
It is a philosophical question, I think, whether Fedora would want to tolerate the possibility of a package circumventing Fedora licensing policy through post-installation downloads (leaving aside the issue of user agency in this). That is directly raised by this pretrained weights issue since at least some of the weights Tim is talking about wouldn't satisfy Fedora licensing guidelines if packaged directly. I don't think there's a right or wrong answer here, but I think Fedora ought to have a consistent position on this.
Richard
On Fri, Mar 1, 2024 at 10:40 PM Richard Fontana rfontana@redhat.com wrote:
Historically, the approach we've used is to make it interactive and allow the user to be informed about the nature of the thing. It's either that or decide to disable the functionality entirely. I would probably suggest making it interactive somehow.
On Fri, Mar 01, 2024 at 10:08:27PM -0500, Richard Fontana wrote:
I just skimmed through the Fedora packaging guidelines and the FESCo-related documentation and didn't seem to find anything on this sort of topic.
I think the relevant policy is under the FPC, in the packaging guidelines: https://docs.fedoraproject.org/en-US/packaging-guidelines/what-can-be-packag...
Some software is not functional or useful without the presence of external code dependencies in the runtime operating system environment. When those external code dependencies are non-free, legally unacceptable, or binary-only (with the exception of permissible firmware), then the dependent software is not acceptable for inclusion in Fedora.
If the code dependencies are acceptable for Fedora, then they should be packaged and included in Fedora as a pre-requisite for inclusion of the dependent software. Software which downloads code bundles from the internet in order to be functional or useful is not acceptable for inclusion in Fedora (regardless of whether the downloaded code would be acceptable to be packaged in Fedora as a proper dependency).
This specifically says "code" rather than content -- if we are comfortable defining models with weights as content (and that seemed to be the consensus), I think this is okay under the current guidelines.
I'm going to file a Fedora Council ticket to ask if we should ask FPC to add AI models (and perhaps some of Tim's helpful definitions) to the examples of "permissible content" higher on that same page.
But I am not going to do it today, because there is enough going on with xz and now the KDE change proposal that I can't handle it. :)
Following Tim's explanations of various things, here are revised answers to the questions:
On Mon, Feb 26, 2024 at 6:32 PM Tim Flink tflink@fedoraproject.org wrote:
Questions
1. Are pre-trained weights considered to be normal non-code content/data or do they require special handling?
For Fedora license classification purposes, they should be considered "content". However, I think for any specific pre-trained weights that will actually be included in Fedora packages, for some initial period I'd like to do some further review (as noted upthread, because this is an important policy area and we don't have a lot of prior experience in it). I don't really care how that's done, that could be through this list or a Bugzilla or whatever.
We'll add "pre-trained weights" to the list of examples of what "content" is in the Fedora legal docs.
2. If an upstream offers pre-trained weights and indicates that those weights are available under a license which is acceptable for non-code content in Fedora, can those pre-trained weights be included in Fedora packages?
Yes, subject to my answer to 1.
3. Extending question 2, is it considered sufficient for an upstream to have a license on pre-trained weights or would a packager/reviewer need to verify that the data used to train those weights is acceptable?
A packager/reviewer should not need to do that verification, which seems highly impractical (which is a point I think you may have previously made). However, that could be an aspect of the "initial legal review" I'm suggesting we may want to have for such cases.
4. Is it acceptable to package code which downloads pre-trained weights from a non-Fedora source upon first use post-installation by a user if that model and its associated weights are a. For a specific model? b. For a user-defined model which may or may not exist at the time of packaging?
Given your explanations of these cases, I think this is pretty straightforward. 4a: Yes 4b: Yes
These answers only go to matters of Fedora legal/licensing policy. If there are technical issues raised by these questions (for example, if there ought to be some standards around packaging of upstream pre-trained weights) I can't give guidance or informed opinions on that beyond my initial suggestion to raise this topic with FESCo which seems to have been unsuccessful.
Richard
On Fri, Mar 1, 2024 at 5:20 PM Richard Fontana rfontana@redhat.com wrote:
With my FESCo hat on, the main question to answer is how we classify and identify them for package reviews, which is largely a Fedora Legal question. Personally, it's basically content to me, but we probably do need some explicit documentation of this as guidance that the AI/ML SIG can use to write packaging guidelines for FPC to review.
If you want to apply additional review to neural net coefficients, I suppose you might as well start with those already packaged in stockfish[1]. (I CC’d the stockfish-maintainers email alias to loop in the primary maintainer. I am a co-maintainer, and I did the original package review.)
Stockfish is a state-of-the-art chess engine. The code is licensed GPL-3.0-or-later, but it requires two pre-trained neural network coefficient files to function. These coefficient files are selected from those at [2], all licensed CC0-1.0, and they are compiled into the binaries rather than shipped as separate files. This is quite consistent with treating them as content; there is a long history of including content – like graphics, audio, or text files – as data in compiled executables.
The Fedora package always uses the “default” coefficient sets for a particular release of Stockfish, as defined in [3].
– Ben Beasley (FAS: music)
[1] https://src.fedoraproject.org/rpms/stockfish
[2] https://tests.stockfishchess.org/nns
[3] https://github.com/official-stockfish/Stockfish/blob/e67cc979fd2c0e66dfc2b2f...
On 04/03/2024 18.44, Ben Beasley wrote:
On 3/1/24 5:19 PM, Richard Fontana wrote:
2. If an upstream offers pre-trained weights and indicates that those weights are available under a license which is acceptable for non-code content in Fedora, can those pre-trained weights be included in Fedora packages?
Yes subject to my answer to 1.
For data-driven models such as pre-trained weights, some knowledge about the data used for the training is required. It would be good to have some consideration of this; for example, is a neural network code generator which uses GPL code as training data also under the GPL?
* Tim Flink:
4. Is it acceptable to package code which downloads pre-trained weights from a non-Fedora source upon first use post-installation by a user if that model and its associated weights are a. For a specific model? b. For a user-defined model which may or may not exist at the time of packaging?
Note that Firefox in Fedora already does this for this feature:
Firefox Translation https://support.mozilla.org/en-US/kb/website-translation
I don't know if it's more like the (a) or (b) option.
Thanks, Florian