An old accusation (1 year = 1000 years in AI) that isn't relevant to the expected upcoming DeepSeek breakthrough model. Distillation is used to make smaller models, and those are always crap compared to models trained on open data. Distillation isn't a common technique anymore, though it's hard to prove that more tokens wouldn't be a "cheat code".
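For context, "distillation" here usually means the classic setup where a small student model is trained to imitate a large teacher's softened output distribution instead of (or alongside) the raw labels. A minimal sketch, assuming a PyTorch-style setup; the function name, temperature T, and mixing weight alpha are illustrative choices, not anything DeepSeek or the US labs have disclosed:

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft-target term: push the student toward the teacher's
        # temperature-softened distribution (scaled by T^2 per the usual recipe).
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard-target term: ordinary cross-entropy on the ground-truth labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1.0 - alpha) * hard

The point of the technique is compression: the student ends up smaller and cheaper, which is also why distilled models tend to lag models trained directly on large open corpora.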
This is more of a desperation play from the US model makers, even as YouTube is in full "buy $200/month subscriptions now or die" mode.