The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems’ permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.

The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
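The excerpt doesn't include any code, but the failure mode it describes (a data file growing past an internal limit and the consumer failing outright) is easy to illustrate. The sketch below is a minimal, hypothetical Rust example; the MAX_FEATURES cap, the load_features function, and the one-feature-per-line format are illustrative assumptions, not details from Cloudflare's postmortem.

// Hypothetical sketch of a loader that rejects feature files larger than a
// preallocated limit. Names and the limit are illustrative, not Cloudflare's.

const MAX_FEATURES: usize = 200; // assumed hard cap, sized for "normal" files

#[derive(Debug)]
enum LoadError {
    TooManyFeatures { got: usize, max: usize },
}

// Parse one feature per non-empty line; bail out if the file is over the cap.
fn load_features(contents: &str) -> Result<Vec<String>, LoadError> {
    let features: Vec<String> = contents
        .lines()
        .filter(|l| !l.trim().is_empty())
        .map(|l| l.to_string())
        .collect();

    if features.len() > MAX_FEATURES {
        return Err(LoadError::TooManyFeatures {
            got: features.len(),
            max: MAX_FEATURES,
        });
    }
    Ok(features)
}

fn main() {
    // Simulate a "doubled" file: duplicate entries push it past the limit.
    let normal: String = (0..150).map(|i| format!("feature_{i}\n")).collect();
    let doubled = format!("{normal}{normal}");

    println!("normal file:  {:?}", load_features(&normal).map(|f| f.len()));

    // If a caller unwraps this Result instead of handling it, the process
    // aborts: the "file exceeded a limit, software failed" behaviour above.
    println!("doubled file: {:?}", load_features(&doubled).map(|f| f.len()));
}

The point is only that a consumer which treats an over-limit file as a fatal error will fail everywhere the oversized file is propagated, which matches the behaviour described in the excerpt.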

  • codemankey@programming.dev · 15 hours ago

    My assumption is that the pattern you describe is possible/doable at certain scales and with certain combinations of technologies. But doing it across a distributed system with as many nodes, and as many different kinds of nodes, as Cloudflare has, while still keeping a system that can be updated quickly (responding to DDoS attacks, for example), is a lot harder.

    If you really feel you have a better solution, please contact them and consult for them; the internet would thank you for it.

    • Echo Dot@feddit.uk · 11 hours ago

      They know this; it’s not like any of this is a revelation. But the company has been lazy and would rather just test in production, because that’s cheaper and most of the time perfectly fine.

        • floquant@lemmy.dbzer0.com · 6 hours ago

        It looks like you have never read their blog. They do a lot of research and make upstream contributions to improve their stack.