• Zarxrax@lemmy.world
    English
    arrow-up
    29
    ·
    24 hours ago

    Inaccurate headline. The bill doesn’t ban web scraping; it just requires that bots accurately identify themselves through the user agent string, and maybe adds some requirements to disclose the purpose of scraping the data.
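For anyone wondering what "identify themselves through the user agent string" looks like in practice, here is a minimal sketch using Python's standard library. The bot name, version, info URL, and the `purpose=` field are all made up for illustration; the bill's actual disclosure format isn't described in this thread, so treat the layout as an assumption, not a requirement.

```python
from urllib.request import Request


def build_identified_request(url, bot_name="ExampleNewsBot", version="1.0",
                             info_url="https://example.com/bot-info",
                             purpose="search-indexing"):
    """Build (but do not send) a request that openly identifies the crawler.

    All default values are hypothetical placeholders. The "+URL" comment
    follows a common crawler convention; the purpose field is an assumption.
    """
    user_agent = f"{bot_name}/{version} (+{info_url}; purpose={purpose})"
    return Request(url, headers={"User-Agent": user_agent})


req = build_identified_request("https://example.org/article")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

A "stealth" crawler, by contrast, would send a browser-like string here instead, which is why site operators can't tell it apart from a human visitor by the header alone.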

    • deliriousdreams@fedia.io
      arrow-up
      7
      ·
      21 hours ago

      And there’s a fine if the company doesn’t comply, which is basically now going to be considered the cost of doing business.

    • TryingToBeGood@reddthat.comOP
      English
      arrow-up
      18
      ·
      1 day ago

      The New York measure defines a stealth crawler as any software that retrieves, scrapes or otherwise accesses a website, including AI agents. Under the bill, the attorney general’s office would be able to sue companies that fail to disclose such activity. Violations could net civil penalties of up to $15,000 per day.

      🤔

    • Rob T Firefly@lemmy.world
      English
      arrow-up
      6
      ·
      23 hours ago

      The article is about “stealth bots” that don’t identify themselves as such. The Internet Archive bots have always been clearly identifiable.

  • Pika@sh.itjust.works
    English
    arrow-up
    9
    ·
    edit-2
    1 day ago

    Honestly, I would love it if forced identification were required, but archival services need a hard exemption from being blocked as well.

  • rob200@retrofed.com
    English
    arrow-up
    7
    arrow-down
    1
    ·
    24 hours ago

    Realistically, what good would it do? Once you have already scraped the pattern of news sites, it’s already over. All this actually does is prevent new startups from competing in the AI space. So really, this is a world record for the fastest enshittification of a medium. Whether you like or hate AI, this is actually an enshittification of it. (I hate AI.)

    • fonix232@fedia.io
      arrow-up
      1
      ·
      3 hours ago

      What are you on about?

      For AI purposes the really useful part of a news site is the actual news - you know, the stuff that changes practically every minute - not the “structure” of the site.

      These news sites aren’t being scraped for training data anymore, but to provide near-real-time, up-to-date information to the models.

      Meaning that, e.g., Gemini can scan your news article, extract the useful information, and deliver it to the user without them ever going to your news site and providing the interaction that, at the end of the day, is converted to money: money your site needs to run.

    • nullspace@lemmy.world
      English
      arrow-up
      5
      ·
      23 hours ago

      I’m guessing it’s to eliminate the issue of a site not getting clicks because the article you were about to read is already summarized for you. It also opens the door for revenue negotiations for allowing their content to be scraped for that purpose, as the scraper bots would now be identified.