RU version is available. Content is displayed in original English for accuracy.
Previous thread: Incident Report: Railway Blocked by Google Cloud [resolved] - https://news.ycombinator.com/item?id=48201484
Discussion (196 Comments)
TK has a history of absolutely destroying the culture of the place, like in OCI, and has done something similar in GCP from what I've heard. GCP and Google are completely different entities in how they work. Don't expect Google quality from the name. It's just like those old brands which now have cheap licensed products, like Nokia (an exaggeration, I know, but not far from the truth).
Not only that, they are known to shut off their services randomly, giving you like 6 months to migrate. They have lots of engineers not doing anything, so they put them on migrating internal users off those services; most of their clients don't have that option. There was a brilliant article on this by an ex-GCP employee that I can't find right now.
Avoid GCP like the plague if you are serious about your business.
Edit: Gemini (unironically) found the article on this, a very good read: https://steve-yegge.medium.com/dear-google-cloud-your-deprec...
May 19, 22:10 UTC - Our automated monitoring detected API health check failures and paged our on-calls, who started investigating the issue.
May 19, 22:11 UTC - Dashboard returning 503 errors. Users unable to log in.
May 19, 22:19 UTC - Root cause identified: Google Cloud Platform has suspended Railway's production account.
May 19, 22:22 UTC - P0 ticket filed with Google Cloud. Railway's GCP account manager engaged directly.
May 19, 22:29 UTC - Incident declared.
May 19, 22:29 UTC - GCP account access restored. All compute instances remained stopped and persistent disks inaccessible.
They got TK to woo the enterprise customers who were forced to be hostage to OCI. But it seems they are still doing opposite of hostage here.
It sounds exactly like what I have experienced in terms of Google quality over the decades.
They said they are already using Gemini 3.5 Pro internally.
Other than that, Google prefers to act like "customers" are some kind of unfortunate rash they can't quite seem to get rid of, but would love to do so.
That's pretty clear. Google can no longer be trusted as a B2B service provider.
And as we know from the recent Gemini ban wave, you can get suspended just because.
Nearly all these linkages are due to people sharing recovery email addresses and phone numbers. Don't do that.
Google has always acted as if they have no obligations whatsoever to their paying customers.
The resulting action should be you have proper disaster recovery, failover, etc.
Not sure I would trust these folks if this is the conclusion they are coming to from this experience. Any cloud provider can/will do this to you.
Maybe I'm getting old but here[1] is a HN comment from 17 years ago complaining about Google banning accounts "by mistake" and having no recourse but to post on HN and hope Matt Cutts sees it and helps, and saying "there are literally 1000s of such stories for many years all over the blogoshphere and forums" which is something I remember from HN of years ago.
[1] https://news.ycombinator.com/item?id=791004
You're joking, right?
This is Google we're talking about. This absolutely happened many times in the past and will happen again.
If it were something out of Railway's hands, I think they would say something like "We have not yet identified the reason for the suspension, and are awaiting a response from Google".
> Google Cloud placed Railway’s production account into a suspended status incorrectly, as part of an automated action. This action extended to many accounts within Google Cloud. As this was a platform-wide action, there was no proactive outreach to individual customers prior to the restriction.
This might be 100% of what google told them.
If you're picking them instead of the underlying cloud provider, but you want all the knobs and dials the underlying provider has, you've made the wrong choice.
"However, in this ring, there was still a hard dependency on workload discoverability being tied to the network control plane API that was hosted on the machines running in Google Cloud."
They've gotta be kidding me that they deliberately left something so critical under the control of any entity other than themselves. That demonstrates a lack of critical planning and a failure to look at their configuration from a first-principles approach.
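To make the complaint concrete: a "hard dependency" on a discovery API means every lookup fails the moment that API is unreachable. A minimal sketch of one common way to soften that, assuming a hypothetical `fetch` callable that asks the control plane for a workload's endpoints, is to keep a last-known-good cache and serve from it while the control plane is down. Nothing here is Railway's actual design; `CachedDiscovery`, `fetch`, and `max_stale_seconds` are made-up names for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


class CachedDiscovery:
    """Resolve workload endpoints, falling back to a local cache on failure.

    `fetch` is a hypothetical stand-in for a call to the network control
    plane API; it is expected to raise when the control plane is unreachable.
    """

    def __init__(
        self,
        fetch: Callable[[str], List[str]],
        max_stale_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        # workload name -> (endpoints, time the entry was stored)
        self._cache: Dict[str, Tuple[List[str], float]] = {}

    def resolve(self, workload: str) -> List[str]:
        try:
            endpoints = self._fetch(workload)
        except Exception:
            # A real client would catch narrower, transport-level errors.
            cached = self._cache.get(workload)
            if cached is None:
                raise  # never seen this workload, nothing to fall back to
            endpoints, stored_at = cached
            if self._clock() - stored_at > self._max_stale:
                raise  # cached entry is too old to trust
            return endpoints
        # Control plane answered: refresh the cache and return fresh data.
        self._cache[workload] = (endpoints, self._clock())
        return endpoints
```

This only keeps already-known workloads reachable during an outage; new deployments would still need the control plane, so it reduces the blast radius rather than removing the dependency.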
I'd wait for more details before adjudicating.
In practice, Google has earned the way my priors are ready to believe it's 100% their fault with mighty and sustained effort. Or lack thereof, depending on your point of view.
So no, Google doesn't get the benefit of the doubt.
And in general Google lost any immediate benefit of the doubt status many years ago. Many such stories.
> Google Cloud placed Railway’s production account into a suspended status incorrectly, as part of an automated action. This action extended to many accounts within Google Cloud. As this was a platform-wide action, there was no proactive outreach to individual customers prior to the restriction.
Put all the timestamps you want in the post mortem about what you observed, but you haven't addressed the root cause.
The "this doesn't make sense" part of the story likely has a real explanation that nobody wants to reveal yet.
After about 8 hours, a random Google support tech said it was because we were mining bitcoin, which was laughably untrue. We had CPU usage graphs and logs for the whole time and there was no spike. At around 12 hours, they turned it back on, said it was "misconfiguration of our abuse detection" and gave us like $100 in credit.
Absurd. Say what you will about AWS, they would never do that to a customer without a rep reaching out to you first. I have not trusted GCP since.
Remember knowledge cards? Prior to the LLM AI revolution, they had an extraordinarily crappy AI system digest the entire internet to figure out the wrong facts about stuff and then present it to users as solid truth, with no human review and no way to report inaccuracies.
They just don't care. If the task requires a person to look at a thing and tell if it's right, they only do that for like 5 examples and then train a classifier, then deploy said classifier without thinking twice because "at internet scale" or whatever crap.
Now the lawyers are huddling. IMO there won't be a lot more said publicly by either side, at least until any threat of lawsuits for damages is settled.
They need to tell Railway and Railway needs to tell us, or Railway can tell us that Google is refusing to tell them.
Either way, we need to hear about this from Railway.
The moment GCP shut off without any forewarning, it's a done deal, no need to ask any further questions.
Giving reasons is putting accountability on Google and they don't want that.
Kudos to them for acknowledging it and not doing PR speak. It shows it was an architectural failure on their part in trusting GCP, and they are working to fix it. Should they have seen it coming? Yes. But better late than never.
I was going to talk to our google rep about their killing the Gemini cli but this is way more concerning.
Then they took no personal responsibility. That definitely damaged their reputation. Here, they are taking at least some responsibility. Props to them on improving.
Also, GCP does indeed have serious reliability issues, and Google does indeed have serious customer support issues.
EDIT: It has been brought to my attention below that the first 2 paragraphs are misattributed, and were not Railway, but rather a customer of theirs. Sorry, Railway!
My company used to use a hosting provider that was basically AWS plus some extra guarantees. We just finished migrating onto regular AWS because they now offer what we need directly.
As much as we loved the simplicity they provided us, there's just been too many mishaps and shortcomings for us to continue running a B2B enterprise app on their infrastructure.
Sad day :(
1.) Vercel - having a bad month
2.) Supabase - having a bad month
3.) Railway - now having a bad month
I'm sure there are plenty of the like 1,000 AWS products that DO has no viable competitor for, but for what they do offer, they're great.
Even if you use AWS and the like, if you aren't building your app with redundancy across multiple AZs, then you'll have some downtime occasionally.
And even if you do build redundancy across multiple AZs, some services might fail anyway, as AWS is not entirely isolated. So you might still have downtime.
So just accept downtimes and use the best tool for you (unless they are really bad, like GitHub level bad). If you cannot accept any downtime, you'll have to spend millions of dollars and months of work to have the confidence to expect no downtime. Something like Netflix's chaos monkey and infrastructure would be enough.
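A rough back-of-the-envelope version of that trade-off, assuming zones fail independently (which, as noted above, is not strictly true) and using made-up availability figures rather than any provider's published numbers:

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(
    zone_availability: float,
    zones: int,
    shared_availability: float = 1.0,
) -> float:
    """Availability of a service replicated across `zones` zones.

    Assumes zones fail independently. `shared_availability` models anything
    all zones depend on (a shared control plane, the account itself), which
    caps the result no matter how many zones are added.
    """
    all_zones_down = (1.0 - zone_availability) ** zones
    return shared_availability * (1.0 - all_zones_down)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical inputs: 99% per zone, 99.9% for the shared dependency.
    for zones in (1, 2, 3):
        a = combined_availability(0.99, zones, shared_availability=0.999)
        print(f"{zones} zone(s): {downtime_minutes_per_year(a):.0f} min/year")
```

The shared term is the point: once every zone hangs off the same account or control plane, adding zones stops helping, which is roughly what an account-level suspension looks like.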
My gut feeling is that the number of significant applications that have this capability can probably be counted on two hands. Especially since a lot of the largest footprints of software stacks running in the cloud belong to Google and Microsoft, who I'm pretty sure do not replicate their services into someone else's cloud.
As an example, I note that GCP responded within 7 minutes according to their timeline. If you’d been using Cloud Run, that would have reduced downtime by over 7 hours — and there’s a good chance that you never would have gone down in the first place if the unknown trigger event was related to other customer activity or something odd Railway did.
There’s also a complexity factor: note how much complex infrastructure they mentioned having to fix that you wouldn’t need for your own account. That code does useful things, I’m sure, but it’s also a lot of moving parts which a hosting provider needs and you don’t – this outage took everyone down, whereas individual AWS or bare metal users would’ve otherwise been unaffected. There isn’t a global optimum which is the same for everyone but I think developers are prone to wildly over-estimating how much time they save by removing a couple of deployment steps relative to the direct costs and the less obvious costs of working within someone else’s environment.
But really any service (or even on-site hosting) can have downtime, if that's not acceptable then I suppose building/using a tool that can be distributed between multiple hosts located in different geographical areas is the best option.
For Vercel, if your Next.js site can be compiled statically, you could probably throw it up on almost anything. We've self-hosted before, which is pretty straightforward, but you lose a lot of the image optimization stuff unless you go deep into setting up OpenNext.
Azure!
It’s the enterprise cloud with enterprise support. They won’t randomly pull the plug on your account, unlike companies that have a wildly different cultural background:
Google - ad tech (you’re the product)
Oracle - lawyers (you’re a future lawsuit for license extortion)
AWS - shop front (you’re a competitor)
Etc…
No code lock-in through SDKs and built on top of AWS with great DX for both developer and coding agents
It would seem that Google's counsel has deemed that whenever _____ is detected, the company must immediately and completely sever the business relationship. What is that driving concern? Is it sanctions enforcement? CSAM? Something else?
Please, someone that worked at Google, please comment.
I'm not a developer, just curious what this is.
Alternative to Fly or Heroku
Here is my source code
Run it on the cloud for me
I do not care how
In this case it looks like they also bundle together a bunch of the other services you would need to get code onto the platform, monitor it once it’s there and so on
> At 22:20 UTC on May 19, Google Cloud placed Railway’s production account into a suspended status incorrectly, as part of an automated action.
If the timestamps are accurate, what was causing the errors 10 minutes before the account was suspended?
The simplest explanation is just that one or the other of these timestamps is wrong, which wouldn't be a big deal. But if the timestamps aren't known with certainty, it seems very odd to include them in the writeup as though they are certain, even though they are very obviously inconsistent with each other.
Assuming the timestamps are accurate, Google probably started terminating resources while the account was not "suspended" and only completed that after all resources were disabled.
The problem with not having the data is that it’s easy to make assumptions.
> May 19, 22:19 UTC - Root cause identified: Google Cloud Platform has suspended Railway's production account.
They couldn't have identified the root cause before it happened.
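The inconsistency is easy to check mechanically. A small sketch using only the times quoted in this thread; the year is not given, so 2000 is an arbitrary placeholder, and the event labels are paraphrased:

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM time on May 19 as UTC (placeholder year 2000)."""
    parsed = datetime.strptime(f"2000-05-19 {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


# Times as quoted from the incident timeline.
timeline = [
    ("health check failures detected", utc("22:10")),
    ("dashboard returning 503 errors", utc("22:11")),
    ("root cause identified", utc("22:19")),
    ("P0 ticket filed", utc("22:22")),
    ("account access restored", utc("22:29")),
]

# Time the write-up gives for the suspension itself.
suspension = utc("22:20")


def events_before(events, cutoff):
    """Return (label, whole minutes before cutoff) for events preceding it."""
    return [
        (label, int((cutoff - when).total_seconds() // 60))
        for label, when in events
        if when < cutoff
    ]


early = events_before(timeline, suspension)
for label, minutes in early:
    print(f"{label}: {minutes} min before the stated suspension time")
```

Three of the five timeline entries land before 22:20, including the one that names the suspension as the root cause, so at least one of the stated times has to be off or the suspension was not the first thing that happened.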
* A Google employee messes up a setting (like one of the previous incidents), which triggers something that makes a suspension look warranted, and it takes 10 minutes to flow through the process to suspend.
* A Railway customer does something corrupt, or seemingly corrupt; Google's system starts limiting access and takes 10 minutes to decide it should be a suspension.
These are even more likely if there is a person in the loop to approve, who obviously did not dig deep enough to see that they should not have done so.
Don't get me wrong, the rest of this mess falls pretty clearly on Google Cloud, but this one feels like something Railway did to themselves.
https://news.ycombinator.com/item?id=48201484
Between my peer c-suites, the conversation is that GCP cannot even be in the consideration set until such a time as a several-year period has elapsed without this kind of incident.
Microsoft might have technical warts, but commercially they are strong. Azure is often bundled with other services, and you know you can get someone on the phone if needed.
Google has... ?
At least that's my understanding from discussing with people praising GCP.
Let's say you want a big cloud provider, but you don't want Azure because of Microslop's old and recent history, and you don't want AWS because it's the default cloud provider.
You're left with GCP. And many people are stuck in the 00's, and still believe Google is the cool kid crushing the boring old corporations.
former Oracle salespeople
There is no justification given on why this action was incorrect. It's possible they actually did something wrong.
They don't, because the allure of effortless scaling is hard to resist: everyone thinks of themselves as the next tech unicorn. And if you actually become a unicorn, you're already too dependent on AWS / Azure / GCP to easily move somewhere else. At best, your strategy is to become "multi-cloud".
>Railway’s network is a mesh ring, built up of high availability fiber interconnects between Metal <> GCP <> AWS. However, in this ring, there was still a hard dependency on workload discoverability being tied to the network control plane API that was hosted on the machines running in Google Cloud
What the hell is even that?
I doubt that will happen, because none of them want to stop the money-making machine they have! And if your thought after my comment is that all us techies are making a fuss, so the cloud providers and the businesses using them will hear our cries and that will trigger a backlash...? I doubt that too, because some senior business leaders that I see are bent on listening more to management consultants than to a balance of folks including their own internal experts. But, alas, maybe I'm just having too cynical a day today. :-)
Refreshing. So tired of businesses blaming their vendors. Oh it wasn't us spamming you text messages and emails, it was Shopify. Oh, our delivery guarantee said 2 days and it's been a week? That's not us, it's UPS.
I don't care. I didn't pay UPS or Shopify. I paid you.
Be it individuals or companies, now is the best time to ditch all dependence on anything cloud or SaaS. Since they are all using automated AI, more and more of these incidents will occur.