
Discussion (54 Comments)
We need to stop this treadmill of trying to "build reputation" and stop focusing on "symbolic capital" and "clout" and whatever else bloggers are going after. You're not going to get it, and even if you do, you're not going to be able to "monetize" it.
If you have a need to write, write. Maybe a handful of actual people will read it, maybe not. But I wouldn't try to do it for a living. The reward will have to be the cathartic process of writing itself, not how much attention it gets, how much it "blows up", or how viral it goes.
What I need is for my writing to spread enough that I can receive opportunities to have my programming ability evaluated.
The reason I write about programming is that, in the past, some readers found my programming essays interesting, and that led to chances for me to be tested. I had to leave graduate school because of financial problems, and I did not graduate from a prestigious university.
So this is not simply about monetizing writing. It is a struggle to receive opportunities. Those are fundamentally different things.
Some people may be happy writing things that nobody reads. But many people are happier when they can share their writing and let their values collide with those of others.
The dead internet theory, made manifest.
It has been virtually impossible to find real information in some areas unless you personally know which websites are reliable. That's why we devs used to go to StackOverflow, and why people use site:reddit.com when searching Google. LLMs just exacerbated all of that, but it was already happening.
I'm a copywriter and I used to get hired to write posts on behalf of founders on LinkedIn or for their company blog.
Now, the last three jobs I had were all focused on sending cold email.
It's going to be a serious problem, and I've already seen sites that are down 90% in traffic simply because AI is scraping them, answering the questions itself, and never providing a linkback.
Why do I care if I shave off 200ms from a crawler's request, instead of a human's?
Based on that, I think it's more about requests from bots/scrapers having the greatest possible chance of hitting a cache before reaching the blog's origin/real host. Bots will hit some layer of Cloudflare first, then Fastly, and only if the content isn't in Fastly's cache will they hit the Ghost blog's server.
To me, this makes a lot of sense if it's self-hosted, but I also thought it was already standard practice to shove your self-hosted blog behind a reverse proxy and cache as much as possible.
And I'm not a professional web developer, but all the extra caching layers for a static personal blog seem a bit overkill.
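A minimal sketch of how you could verify that layering from the outside, assuming a Cloudflare-then-Fastly stack like the one described above. The header names are the ones Cloudflare (cf-cache-status) and Fastly (x-cache, x-served-by) are documented to add; the URL is a placeholder.

```python
# Hedged sketch: print the cache-identifying headers each layer adds,
# to see which layer actually answered a request. Standard library only.
import urllib.request

def cache_trace(url: str) -> None:
    req = urllib.request.Request(url, headers={"User-Agent": "cache-probe/0.1"})
    with urllib.request.urlopen(req) as resp:
        for name in ("cf-cache-status", "x-cache", "x-served-by", "age"):
            value = resp.headers.get(name)
            if value:
                print(f"{name}: {value}")

# A HIT in cf-cache-status means Cloudflare answered from its edge; a
# Fastly x-cache HIT means the request fell through Cloudflare but never
# reached the blog's origin.
cache_trace("https://example.com/some-post/")
```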
Aside from the graphic, the article is a lot of words about engaging with an LLM to get a full understanding of how caching works for their blog hosting and how it enabled them to change their setup for the better.
It's kind of hard to follow because it never says what they actually did, or why what they did was better.
> If you care about how your content moves through the world now, including through AI systems, you have to care about caching. Not as a performance optimisation for human browsers, but as infrastructure for machine readership.
The screenshot says 3k req/day. That's about 2 requests per minute, amortized (3,000 ÷ 1,440 minutes ≈ 2.1). At that rate, you can serve it with CGI and Perl.
A cache is only relevant if you have a lot of traffic AND dynamic pages, or if you care about latency (which only matters for humans).
For the sort of thing you’re doing, it should be as simple as “throw it behind Cloudflare/Fastly/Bunny/whichever private CDN you like” and that’s it.
Also the diagram near the end is pretty much incoherent. GenAI, I presume.
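For what it's worth, a minimal sketch of what "throw it behind a CDN and that's it" amounts to, using only the Python standard library: the origin just has to send a Cache-Control header the CDN will honor. The port and max-age are illustrative, not from the article.

```python
# Sketch: a static-file origin that marks everything publicly cacheable,
# so a CDN in front (Cloudflare/Fastly/Bunny/...) absorbs nearly all traffic.
from http.server import HTTPServer, SimpleHTTPRequestHandler

class CachedHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # One hour of public cacheability: at ~2 req/min overall, the CDN
        # serves almost everything and the origin sees a trickle.
        self.send_header("Cache-Control", "public, max-age=3600")
        super().end_headers()

if __name__ == "__main__":
    # Serve the current directory; point the CDN's origin pull at this host.
    HTTPServer(("0.0.0.0", 8000), CachedHandler).serve_forever()
```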
Yes, the architecture diagram was generated by ChatGPT, but it says what it needs to say.
Maybe we're being trolled and the OP is crowdsourcing solutions by posting something ridiculous and getting us to put solutions up.
Now suddenly I have 10k visitors a month hammering my APIs and causing massive egress and CPU usage, so I had to get them behind Cloudflare and build everything statically. That cut the costs back down from 90+ CPU hours to about 0.2 CPU hours a month (roughly a 450x reduction).
Crazy times.
(Also, all done with Claude Code's help, or it would have taken me a week to figure out.)
A $4 Hetzner VPS can serve tons of requests if you put Cloudflare in front of it.
I host my own runners for CI and artifact building on a Hetzner VPS (spun up on demand).
People are easily lured by pay-as-you-go plans on serverless and other cheap-to-get-started managed services and end up racking up huge bills.
This is the same reason I don't use Stackdriver or Cloud Monitoring and prefer a Grafana + Loki + Prometheus setup.
My setup cannot be misconfigured in a way that racks up a huge bill.
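As a sketch of what that self-hosted stack looks like from the application side: the app exposes a /metrics endpoint via the real prometheus_client library, a local Prometheus scrapes it, and Grafana reads from Prometheus. The metric name and port below are made up for illustration; the point is that the only cost is the fixed VPS, never per-datapoint billing.

```python
# Minimal sketch of self-hosted metrics, assuming prometheus_client is
# installed (pip install prometheus_client).
import time
from prometheus_client import Counter, start_http_server

# Hypothetical metric; Prometheus pulls http://<vps>:9100/metrics on its
# own schedule, so nothing meters or bills the data volume.
REQUESTS = Counter("blog_requests_total", "Requests served by the blog")

if __name__ == "__main__":
    start_http_server(9100)   # exposes the /metrics endpoint
    while True:
        REQUESTS.inc()        # stand-in for real request handling
        time.sleep(1)
```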
Ideally I'd make the content available to crawlers for training open models, but that seems to be nearly impossible. It would be possible if other AI companies behaved.
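In principle that selective policy is expressible in robots.txt, e.g. allowing Common Crawl (whose dumps feed many open models) while opting out of proprietary training crawlers. A sketch using the publicly documented user-agent tokens; as another comment below notes, compliance is entirely voluntary, which is exactly the problem.

```
# Illustrative robots.txt: allow Common Crawl's CCBot, opt out the
# proprietary training crawlers. Honoring these rules is voluntary.
User-agent: CCBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: *
Allow: /
```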
That can’t block Grok, can it?
(You might have a fake iPhone or something visit your site if you ask Grok to retrieve information from it)
This is bad because there are fitness guides on my domain
https://macrocodex.app/guides which newbies often put into ChatGPT and ask it to simplify.
I enabled crawling for LLMs. There is a lot of misinformation in the fitness field, so it's better if LLMs get their content from people who at least have experience in the field.
I also used Claude to help me drill into what's going on. Bizarrely, about 80% of my traffic comes from Singapore, which the author mentioned too. I don't know why. A lot of the traffic looks real; it stays for a while and clicks different links in different orders. But as far as I can tell, no one in Singapore has ever read a thing I've written on my site.
I thought Cloudflare would help protect my site from bots, but it utterly fails. I'm not sure if the bots are too sophisticated or if people overestimate how well CF works for these things. I paid for advanced features for a while and reverted to the free plan once I realized it made no difference. It's a great platform in general, but it hasn't been great for letting me see how many humans actually read my content.
I know some do because they email me occasionally. If I had to guess, of the ~200 visits per week reported in analytics, around 15 are real.
From what I understand, Cloudflare is trying to create a way for agents to consume content in a more structured manner that allows for attribution to the author, and potentially payment along with it.
I don't want to be paid but I'd love to see how often context from my writing winds up in a session a human is actively using.
- My blog is static content and it costs me ~nothing to serve the requests.
- The bots were ignoring robots.txt anyway.
- If there's ultimately a human driving the bot (e.g. someone asking "summarise this article"), I don't mind.
- It's like trying to block search engines. Just as I want my blog to turn up in search results, I want agents etc. to know it exists, too.
My original motivation for denylisting, years ago, was that LLMs were simply not very good, so training-set scrapers seemed like all downside with no upside.