A user on the online forum 4chan has leaked a massive 270GB of data purportedly belonging to The New York Times. This leak includes what is claimed to be the source code for the newspaper’s digital operations.

  • merthyr1831@lemmy.world
    English
    arrow-up
    56
    arrow-down
    1
    ·
    edit-2
    6 months ago

    270GB feels insane for the source code of a single organisation. Are there media assets or backups in there too?

    EDIT: yep, multiple subsidiaries and Slack comms, which could inflate it by a lot. We post a whole lot of uncompressed shit on our Slack.

  • daddy32@lemmy.world
    English
    arrow-up
    53
    ·
    6 months ago

    NY Times has freaking great data visualisations; they are (were?) employing a wizard in this space, doing custom extensions on d3.js.

      • DudeImMacGyver@sh.itjust.works
        English
        arrow-up
        19
        ·
        6 months ago

        Yeah, I guess I didn’t consider all the other operational shit that goes into providing content and funding for the website.

        • aStonedSanta@lemm.ee
          English
          arrow-up
          22
          ·
          6 months ago

          It’s why our PCs have gotten insanely fast but websites still load like fucking trash. All the back-end spying shit takes up a ton of CPU cycles. If you don’t already have em, run uBlock Origin and NoScript and the internet is so fucking speedy 😆

    • MacN'Cheezus@lemmy.today
      English
      arrow-up
      67
      arrow-down
      1
      ·
      edit-2
      6 months ago

      Anything more complicated than a static website is going to have a significant amount of server-side code.

      Also, the article explains that it’s not just the website, but ALL of their repos, which would include their smartphone apps, backend tools, etc.

  • Dark Arc@social.packetloss.gg
    English
    arrow-up
    41
    arrow-down
    1
    ·
    edit-2
    6 months ago

    I doubt this will affect much … that’s a lot more source code than I’d expect though, dang.

    Presumably a lot of it is for internal operations (custom editing software or something of that ilk).

    • General_Effort@lemmy.world
      English
      arrow-up
      8
      ·
      edit-2
      6 months ago

      In case anyone missed the hubbub: [ETA: This is from March 2024; unconnected to this hack/leak]

      https://apnews.com/article/new-york-times-wordle-clones-takedown-dmca-35d32b7548f7312ea74a2065b2cd31a6

      The Times has filed several Digital Millennium Copyright Act, or DMCA, takedown notices to developers of Wordle-inspired games, which cited infringement on the Times’ ownership of the Wordle name, as well as its look and feel — such as the layout and color scheme of green, gray and yellow tiles.

      Numerous impacted developers have also taken to social media to share their frustrations. Many said that their games, which range from Wordle-like offerings in other languages to more guessing games, would be taken down as a result.

      Still, Brauneis said he believes the Times’ arguments for Wordle copyright infringement are on “a little bit shaky ground” for several reasons. Rules of a game, for example, are not covered by copyright — and that can include the layout of the game itself, he said.

  • Autonomous User@lemmy.world
    English
    arrow-up
    29
    arrow-down
    6
    ·
    edit-2
    6 months ago

    We still have no legal right to use, change and share its source code, control it both ourselves and in groups. It’s still anti-libre software.

    • seathru@lemmy.sdf.org
      English
      arrow-up
      61
      arrow-down
      2
      ·
      edit-2
      6 months ago

      Anything that may help develop better adblockers/paywall bypasses or exposes how/what of our personal information is collected is a win in my book. And this may very well be none of those things.

      • Autonomous User@lemmy.world
        English
        arrow-up
        5
        arrow-down
        2
        ·
        edit-2
        6 months ago

        They only exist when we keep them relevant and we already know we can’t prove it’s private but if it helps some people, that’s good.

      • 0xD@infosec.pub
        English
        arrow-up
        8
        arrow-down
        23
        ·
        6 months ago

        Right, because fuck paying for proper journalism. Everything must be free!

        Remind me again, how does that work?

        • errer@lemmy.world
          English
          arrow-up
          19
          ·
          6 months ago

          I pay for the NYT, and yet every other screen is a fucking ad (often the same ad repeated over and over). You already have my subscription money, and unless they decide not to be so greedy (haha), their ads get shoved up my pihole.

        • noisefree@lemmy.world
          English
          arrow-up
          6
          ·
          6 months ago

          The inverse of this is where subscription services that previously had no ads for paying subscribers then add in ads on paid plans while also increasing the fees associated. It’s a pretty standard practice, NYT included. Adblocking is necessary.

    • magi@lemmy.blahaj.zone
      English
      arrow-up
      3
      arrow-down
      1
      ·
      6 months ago

      Very few care about licenses unless the use of such material can be proven, and good luck with that

  • Dogyote@slrpnk.net
    English
    arrow-up
    19
    arrow-down
    3
    ·
    6 months ago

    Did this leak happen before or after NYT published an investigation detailing how Israeli forces were raping and torturing defenseless Palestinian detainees brought in from the Gaza Strip?

  • skymtf@pricefield.org
    English
    arrow-up
    19
    arrow-down
    5
    ·
    6 months ago

    I have not read the news in a really long time just cause paywalls are annoying as frick.

          • Serinus@lemmy.world
            English
            arrow-up
            12
            arrow-down
            3
            ·
            6 months ago

            Pay for news if you want it to be independent, and not beholden to sponsors.

            I’d go as far as to say that paying for news (if you have the means to do so comfortably) is your duty as a commitment to democracy.

              • Serinus@lemmy.world
                English
                arrow-up
                4
                arrow-down
                2
                ·
                6 months ago

                It’s amazing the number of times on Lemmy that someone will come in with the completely opposite “explanation” for what I was saying. Almost like they have an agenda.

                It’s so weird to turn my statement of “support the news with money” into “the mainstream media can’t be trusted”.

                Maybe it’s only happened twice, but it’s still weird that it’s happened twice.

                • Dark Arc@social.packetloss.gg
                  English
                  arrow-up
                  2
                  arrow-down
                  1
                  ·
                  edit-2
                  6 months ago

                  I was wondering if that’s where you were going in part.

                  I think it’s a bit of the phrasing; you stated an opinion that’s vague to the point of tiptoeing towards the potentially loaded question: “who’s independent media?”

                  It’s not uncommon in the conservative media sphere to see a similar (typically a series of) leading, ambiguous questions. They’re never genuine; it’s always in the style of:

                  You know what the best operating system is? I’ll tell you what the best operating system is, it’s Linux. Do you know why Linux is the best operating system? It’s because it’s got penguins and penguins are great! Do you know why penguins are great? I mean, can you think of a more iconic bird? That’s why, that is why … and Big Microsoft is out to destroy your hopes and dreams aren’t they? Yes, yes they absolutely are, with their soulless Windows operating system that’s manufactured by the flying spaghetti monster. Now obviously folks, only use Linux if you support freedom not the unholy flying spaghetti monster. The flying spaghetti monster will destroy America. It’s its one true mission. Support freedom, support penguins, stop the flying spaghetti monster.

                  I think it’s made a bunch of us antsy lol

          • PrivateNoob@sopuli.xyz
            English
            arrow-up
            8
            arrow-down
            4
            ·
            6 months ago

            He probably means one of these (or both):

            1. New York Times is a huge corporation. The commenter would only support a site which is run by one creator, or with a genuine small team, which is transparent and not an asshole.

            2. New York Times is biased politically or accepting bribery attempts from other corpos to make them look in a better light.

            • Serinus@lemmy.world
              English
              arrow-up
              11
              arrow-down
              4
              ·
              6 months ago

              Jesus Christ, no. It’s almost like you’re trying to sow distrust in the news and facts.

              The NYT isn’t perfect, but it’s some of the most reliable news the world has.

              As of March 2023, The New York Times Company employs 5,800 individuals,[101] including 1,700 journalists according to deputy managing editor Sam Dolnick.[122] Journalists for The New York Times may not run for public office, provide financial support to political candidates or causes, endorse candidates, or demonstrate public support for causes or movements.[123] Journalists are subject to the guidelines established in “Ethical Journalism” and “Guidelines on Integrity”.[124] According to the former, Times journalists must abstain from using sources with a personal relationship to them and must not accept reimbursements or inducements from individuals who may be written about in The New York Times, with exceptions for gifts of nominal value.[125] The latter requires attribution and exact quotations, though exceptions are made for linguistic anomalies. Staff writers are expected to ensure the veracity of all written claims, but may delegate researching obscure facts to the research desk.[126] In March 2021, the Times established a committee to avoid journalistic conflicts of interest with work written for The New York Times, following columnist David Brooks’s resignation from the Aspen Institute for his undisclosed work on the initiative Weave.[127]

              • PrivateNoob@sopuli.xyz
                English
                arrow-up
                2
                ·
                6 months ago

                Well, it definitely seemed like that. Sorry, I was just assuming, since most Lemmy people are really anti-establishment on basically everything.

                • Serinus@lemmy.world
                  English
                  arrow-up
                  2
                  ·
                  6 months ago

                  Well, if you didn’t get it from me, you certainly would have gotten it from some of these responses.

              • OBJECTION!@lemmy.ml
                English
                arrow-up
                4
                arrow-down
                4
                ·
                edit-2
                6 months ago

                The New York Crimes is a garbage propaganda rag. They don’t deserve a red cent from anyone after pushing their transphobic agenda, (and responding to widespread criticism by publishing an article defending JK Rowling) or after they blatantly lied and published a fake news story about Hamas conducting mass rape in an attempt to sway public opinion to be in favor of Israel’s genocide. If you have a NYT subscription, you are paying people to lie to you.

              • Linkerbaan@lemmy.world
                English
                arrow-up
                3
                arrow-down
                6
                ·
                edit-2
                6 months ago

                Dear god, I can’t believe anyone still believes this shit after NYT hired an ex-IDF soldier without any prior journalistic experience to write a massive fake rape propaganda article for Israel.

                NYT is a state propaganda outlet.

            • This is fine🔥🐶☕🔥@lemmy.world
              English
              arrow-up
              7
              ·
              6 months ago

              “1. New York Times is a huge corporation. The commenter would only support a site which is run by one creator, or with a genuine small team, which is transparent and not an asshole.”

              Yeah but good luck chasing multiple stories across the world as a small team.

    • OBJECTION!@lemmy.ml
      English
      arrow-up
      5
      ·
      6 months ago

      You can go to archive.is and put the URL of a news story you want to read in the second box, and it will usually let you bypass the paywall.

  • 🇦🇺𝕄𝕦𝕟𝕥𝕖𝕕𝕔𝕣𝕠𝕔𝕕𝕚𝕝𝕖@lemm.ee
    English
    arrow-up
    6
    arrow-down
    17
    ·
    edit-2
    6 months ago

    That’s a lot of data, but surely it’s not all their articles, cos I’d very much like to train Mixtral 8x7B on it along with 4chan data and shit from the dark web. Surely there is a project where such a model is public and being trained on literally everything regardless of legality.

    EDIT: Why am I getting downvoted?

    • reddithalation@sopuli.xyz
      English
      arrow-up
      3
      arrow-down
      5
      ·
      edit-2
      6 months ago

      you’re getting downvoted because LLMs are simply not very good, they consume lots of energy (bad for climate), and seemingly most people involved in ai hype want to replace human creativity or something.

      how about instead of training a not very trustworthy or useful LLM on lots of nyt, 4chan, and “dark web”, you go read lots of nyt, 4chan, and dark web to train your own (much better) model (your brain).

      • They are very good; they exceed the capability of many humans in many tasks. If consuming energy = bad for the environment, then all electric vehicles are bullshit cos they have energy inefficiencies that petrol cars don’t (thermodynamics is a bitch). U do realise the argument about whether asking an AI to create an image is art is literally the same argument that was had about whether photography is art.

        LLMs are decently trustworthy, especially with chain-of-thought reasoning and tool capabilities. And they are extraordinarily useful; people wouldn’t be using them and creating a market for them if they weren’t. I can’t train my brain then share it for free with everyone on the internet to download; I can with an AI tho.

        • reddithalation@sopuli.xyz
          English
          arrow-up
          1
          ·
          edit-2
          6 months ago

          Have you seen that study about the accuracy of ChatGPT responding to programming questions? (here) It’s wrong 52% of the time, and I can say that I have personally experienced trying to use ChatGPT for programming and getting more confused rather than less. Maybe it is because I wasn’t using GPT-4, or Claude, or whatever new model is the best, but I’m just sharing my experience.

          Also I support electric vehicles because without them lots of energy (and emissions) is generated for critical infrastructure (we can’t ditch cars yet), and so replacing that with renewably generated energy is a good idea.

          LLMs consume lots of energy to train and use, but instead of literally moving millions of people around, they assist you in doing things you could have done without them, but with dubious accuracy. Look at the massive use of LLMs by students to cheat in school: yes, they may not get detected, but sometimes the output has noticeable flaws that get them in big trouble for being too lazy to actually learn anything.

          If you want to learn in-depth knowledge about a topic, just go look it up and learn there; it’s more helpful than an LLM.