Well, Llama was famously trained on the Books3 dataset, which was full of stolen books.
You can’t even get that dataset anymore and the people who made the scripts that generated it got arrested.
Same goes for Facebook using the text data of almost all adults on its platform in Australia and the US. OpenAI seemingly used YouTube data, without permission, to train its Sora model.
Copilot was trained on all public GitHub repos, regardless of license.
If you don’t think there are ethical concerns there, then I think we have different definitions of “ethics”.