Building durable workflows on Postgres

235

KraftyOne•about 5 hours ago•101 comments•Article on dbos.dev

Discussion (101 Comments)

rkeene2•13 minutes ago

I have an implementation I use that has multiple drivers (PostgreSQL, Firestore, SQLite3, just a file, Redis, or an in-memory store) written in TypeScript and it's been working well for my low-scale needs. The interfaces could support interfacing with a dedicated queuing system if you needed to migrate over time.

It supports pipelines, batched pipelines, and basic runners, as well as idempotent keys (including batching them). It also lets you "partition" a queue into multiple sub-queues so that you can easily segregate your jobs within your application without a lot of setup on the outside. For example, you create a root queue talking to PostgreSQL and pass it around to subsystems that then each create their own sub-queue off that to enqueue entries into and their own workers that dequeue them.

It's only used internally right now but I've been thinking about creating a separate package (with documentation) with it for others to use as well. Any feedback or pull requests would be appreciated!

[0] https://github.com/KeetaNetwork/anchor/blob/main/src/lib/que...

[1] https://github.com/KeetaNetwork/anchor/blob/main/src/lib/que...

llimllib•about 4 hours ago

Armin Ronacher's `absurd` is an implementation of durable workflows for postgres:

https://lucumr.pocoo.org/2025/11/3/absurd-workflows/

https://github.com/earendil-works/absurd

https://earendil-works.github.io/absurd/

I've not used it, but it's worth comparing to other options

vrm•about 3 hours ago

If you don't need a ton of throughput I think `absurd` (and our Rust derivative `durable`) are very nice options that keep the client side extremely simple. It's also lightweight enough that a coding agent can keep the entire thing in its head easily and just run queries to look up state as needed.

llimllib•about 2 hours ago

cross-checking your profile suggests that https://github.com/tensorzero/durable is the repo you're referring to

You might consider another name for it, that one is wholly ungoogle-able! Looks neat though

vrm•about 2 hours ago

TBH it's intended only for internal use (we don't even publish it as a crate at this point) so I don't particularly mind it being low-key. But I appreciate it!

throwaw12•about 4 hours ago

Curious to know the experience of people using DBOS and Temporal.

I have used Temporal in the past and it works really well. My only problem with it was the limits on request payload and event sizes, which created some inconveniences for us when building solutions. It also enforces good engineering practices, but sometimes you don't want to write special logic so that, if your CSV file is larger than 2 MB, you upload it to S3, pass a link, and then download it in the workflow.
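
The workaround described here is often called the claim-check pattern. A minimal sketch, assuming a hypothetical in-memory `BlobStore` standing in for S3 and a 2 MB threshold taken from the comment (not a documented Temporal constant):

```python
import uuid
from dataclasses import dataclass, field

# Assumed limit, taken from the comment above; not a documented Temporal value.
MAX_INLINE_BYTES = 2 * 1024 * 1024


@dataclass
class BlobStore:
    """Hypothetical stand-in for S3 or any other object store."""
    blobs: dict = field(default_factory=dict)

    def put(self, data: bytes) -> str:
        key = f"blob/{uuid.uuid4()}"
        self.blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self.blobs[key]


def pack(data: bytes, store: BlobStore) -> dict:
    """Return a small payload: the bytes inline, or a reference to stored bytes."""
    if len(data) <= MAX_INLINE_BYTES:
        return {"inline": data}
    return {"ref": store.put(data)}


def unpack(payload: dict, store: BlobStore) -> bytes:
    """Resolve a payload produced by pack() back into the original bytes."""
    if "inline" in payload:
        return payload["inline"]
    return store.get(payload["ref"])
```

The workflow input then only ever carries `pack(...)` output, and the step that needs the file calls `unpack(...)`.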

What is your experience with DBOS? How does it compare to Temporal in terms of operational complexity, feature parity, and anything else?

jhot•about 2 hours ago

Haven't used DBOS but use Temporal at current job and used it at previous job as well so I have about 1.5 years under me now. I also run it at home to handle some home automation tasks that aren't super time sensitive (the latency of workflows isn't super bad, but I wouldn't use one for something that is triggered by a motion event in my house unless we're talking about a timeout to turn something off after inactivity).

I really like running a thin REST API in front of it inside your VPC or k8s cluster or whatever to help with event-driven triggers, so that they don't have to worry about Temporal auth or checking workflow status if there is any decision making around that. This helps keep your event handlers as logic-free as possible.

Let me give a vague example: you have some sort of DB trigger, and this trigger either acts directly or puts the event on a queue. Your handler calls the thin REST API with the necessary event details, and the REST API decides whether this starts a workflow, signals an existing one, or is ignored (the pattern for this can vary based on the situation, but SignalWithStart is common for me, or just dropping the event if it is not worthy of starting a workflow and no workflow for that <ItemYouCareAbout> exists).
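
The decision the thin API makes can be sketched as a pure function. This is only an illustration: the event keys (`item_id`, `starts_workflow`) and the `Action` enum are hypothetical names, and no Temporal SDK call is shown.

```python
from enum import Enum


class Action(Enum):
    START = "start"
    SIGNAL = "signal"
    IGNORE = "ignore"


def decide(event: dict, running_item_ids: set) -> Action:
    """Pick what the thin API does with one incoming event.

    The event keys "item_id" and "starts_workflow" are hypothetical.
    """
    if event["item_id"] in running_item_ids:
        # A workflow already owns this item: forward the event as a signal.
        return Action.SIGNAL
    if event.get("starts_workflow", False):
        # No workflow yet, and this event is worth starting one.
        return Action.START
    # No workflow exists and the event is not worth starting one: drop it.
    return Action.IGNORE
```

With SignalWithStart, the first two branches can collapse into a single call, which avoids a race between checking for a running workflow and starting one.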

Then the parent/child workflow ability is very valuable when you need to orchestrate different self-contained behaviors for a single object's lifecycle, with cancellability when an external factor changes the trajectory of an object.

Long, vague story short, I find it very powerful and easy to work with and has really helped move lifecycle logic out of APIs where things can easily become riddled with debt and precarious to manage. I agree with you that it helps follow more best-practices instead of just throwing logic some place that seems easy but becomes a hidden trap later.

pants2•about 4 hours ago

I thought Temporal was overly complex, but as you said the best part is it does enforce good engineering practices.

Then I tried their Cloud offering and was appalled at their pricing. I burned through the $1,000 free credits before I even got something to production. Didn't want to bother with running a local Temporal, either.

Best solution is to just take inspiration from their architecture and then do it yourself in Postgres, IMO.

switchbak•about 4 hours ago

They've just released an external storage approach to solve the large payload issue. I don't 100% love it (it's bolted on, not an intrinsic part), and it's an early release right now - but you can consider this effectively solved for now.

hilariously•about 4 hours ago

That's good because back in the day if you were putting entire documents in a message queue I would laugh people out the door, putting something in object storage + linking is much more useful (though the distributed system part/backup current state part can be annoying!)

quard8•about 4 hours ago

we're using dbos for ai gen workflows and processing video files. understanding how to migrate from celery took time, but for our case it was worth it.

devstein•about 1 hour ago

If anyone else is looking to migrate from Celery to DBOS, we also made this transition and wrote about it https://dosu.dev/blog/migrate-celery-to-dbos-dosu

Very happy we made the switch.

temporal_thr123•about 4 hours ago

I run a large on-prem temporal setup - throwaway acct as they will likely out me.

Temporal is, in my opinion, having run it in prod for over a year, poorly designed, slow, and ridiculously heavy infra-wise.

If you're doing anything non-trivial (say, 200+ events/workflow) and you need to run only a couple hundred of them concurrently all day, you're going to spend millions on infra, and it's still going to absolutely suck.

Try running their own benchmarks, the numbers are pathetic.

Their sales team is also absolutely appalling and desperate.

From a Developer standpoint, the SDK is quite nice though.

Don't get trapped into nexus, and if the sales team call you make sure legal is in the room.

quacker•about 1 hour ago

Honest question: Can you use Temporal Cloud? Have you evaluated Temporal Cloud pricing?

Ballparking: 200 events/workflow, 200 workflows per day, and assuming 1 event = 1 cloud action[1], that is 1.2M or so actions per month. The $100/month plan includes 1M actions each month, and even the pay-as-you-go pricing when you exceed that is $50 per 1M actions[2].
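
The ballpark can be reproduced in a few lines. The prices are the ones quoted in this comment, a 30-day month is an assumption, and treating overage as simply added on top of the plan price is also an assumption rather than a statement of how Temporal Cloud bills:

```python
EVENTS_PER_WORKFLOW = 200
WORKFLOWS_PER_DAY = 200
DAYS_PER_MONTH = 30            # assumption
INCLUDED_ACTIONS = 1_000_000   # included in the $100/month plan, per the comment
BASE_PRICE_USD = 100
OVERAGE_USD_PER_MILLION = 50   # pay-as-you-go rate, per the comment

# One event is assumed to be one billable action.
actions_per_month = EVENTS_PER_WORKFLOW * WORKFLOWS_PER_DAY * DAYS_PER_MONTH
overage_actions = max(0, actions_per_month - INCLUDED_ACTIONS)
monthly_cost_usd = BASE_PRICE_USD + overage_actions * OVERAGE_USD_PER_MILLION / 1_000_000

print(actions_per_month)  # 1200000
print(monthly_cost_usd)   # 110.0
```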

Temporal Cloud seems extremely cheap for your use case, even if I'm off by a factor of 10. Is there a catch? You still need infra to run your Temporal workers, and I assume there are storage and other costs, but I assume action usage is the majority of it.

1. Not sure exactly what constitutes an "Action". At a glance, seems like most events have a corresponding action(?) and a subset of those actions are actually billable(?)

2. https://docs.temporal.io/cloud/pricing#payg-action-pricing

temporal_thr321•about 1 hour ago

I was not clear; I did not mean 200 a day. It's 10s of thousands of concurrently running workflows, sometimes into the hundreds of thousands, each with 200 events. We run many hundreds of thousands of these a day.

Temporal was a bad fit for us, and we regret it deeply.

temporal_thr123•about 4 hours ago

Since I'm in a ranting mode -- here's a good example: you're limited to _ONE_ IO per shard in the history service:

https://github.com/temporalio/temporal/blob/e22e6304b3c4a409...

Temporal does a crazy amount of database operations and all of these are behind that mutex.

Oh, and you can't change the shard count on existing clusters.

Great stuff.

lll-o-lll•about 2 hours ago

> If you're doing anything non-trivial (say, 200+ events/workflow) and you need to run only a couple hundred of them concurrently all day, you're going to spend millions on infra, and it's still going to absolutely suck.

Where are the “millions” on infra going? It’s a handful of services and a Postgres?

> Their sales team is also absolutely appalling and desperate.

You said “on-prem”. It’s open source; why are you dealing with their sales team?

> If you're doing anything non-trivial (say, 200+ events/workflow) and you need to run only a couple hundred of them concurrently all day…

If “millions” were required to obtain such tiny scale, I’d agree there’d be a massive problem. No one would use Temporal; it would be a complete waste of resource. If this were true.

cyberpunk•about 1 hour ago

We also hit scaling problems with temporal.

Postgres doesn't scale at all for our workload, so you're into cassandra.

For a medium sized deployment, you're looking at 200+ vcpus, and then let's say standard dev/uat/prod. So now you're at 600 cpus. Now you need two geographic regions, dev can stay in one place, so now you're at 800. Want a failover cluster for prod? Have another 200 cpus.

and 200 CPUs is a medium deployment, assuming something like 36 cpus per cassandra node, then say 4-8 per instance of matching, worker, history, frontend. Then all your other components around it, ingress controller, service mesh, etc.

There's a million a year easy, for a small deployment.

Our prod one is 4x this size.

temporal_thr321•about 1 hour ago

Not a couple hundred in one day, a couple hundred being started, concurrently, every second in a day. Each with ~200 events.

We need a 12 node cassandra cluster for this, with 64cpu nodes. So no, it's not a couple of services and a postgres.

Sales team, as we are an enterprise, and they want to extract money from us.

turtlebits•about 2 hours ago

The same with any "open-source" enterprise ($$$) software. It sucks to run yourself. Docs on running/errors are non-existent. Their helm charts are broken. Instead of degraded performance, it just fails.

dakiol•about 4 hours ago

Agree. I have worked in a codebase using Temporal, and it is pretty much a nightmare. I don't know about the infra side, but from the developer side, all the abstractions they bring to the table are poorly designed. Wouldn't recommend.

stuartaxelowen•about 3 hours ago

My dream is, instead of separating data storage, state machines, valid state constraints, and the logic that transitions between valid states, we can actually unify these into some kernel of app state. Honestly, Postgres already has a lot of these capabilities, but I don’t see an obvious story on the app or product level, providing provably correct sets of states that apps can transition between, and which they can automatically expose to clients in informative ways (this user can like this post, but not edit). It looks colored Petri net shaped to me, but I don’t yet see a simple app state paradigm in the same way that the database has obvious successful boundaries.

krashidov•8 minutes ago

this sounds like convex.dev or https://spacetimedb.com/ (full disclosure I don't use either)

tibbar•about 3 hours ago

This has been tried, but thousand-line stored procedures are truly a nightmare.

agumonkey•about 2 hours ago

was it due to the language expressiveness forcing too much verbosity ? (honest question)

tibbar•about 1 hour ago

lack of version control, clunky language mechanics, performance issues, etc.

opiniateddev•about 4 hours ago

Conductor OSS does this quite well https://docs.conductor-oss.org/devguide/ai/index.html

https://github.com/agentspan-ai/agentspan which is essentially an agentic SDK layer for Conductor can convert any of your langgraph, openAI, vercel, or ADK agent and makes it durable and adds orchestration with no code changes.

opiniateddev•about 2 hours ago

for our production we use Redis for queues but have seen users using both Postgres and MySQL for queues as well.

pragma_x•about 3 hours ago

I completely get the concept and agree - this is a great way to build this kind of durability in a workflow system.

That said, my gamer-brain wants to call this "Save-scumming at scale." Which is to say, a lot of people already know that this approach works, but maybe they haven't made the connection to abstract CS stuff.

Another strategy that can be used to build robustness is to build your workflow out of idempotent operations. That can be useful for situations where the workflow state is too large to back up. Instead, you just run the job from the top and it's a bunch of no-ops until you start making progress again.
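
A minimal sketch of the rerun-from-the-top idea, assuming each step's output is a file and the step names are made up for illustration. A finished step detects its own output and becomes a no-op:

```python
import tempfile
from pathlib import Path


def write_once(path: Path, produce) -> bool:
    """Run `produce` and write its output only if `path` does not exist yet.

    Returns True if work was done, False if the step was a no-op.
    """
    if path.exists():
        return False
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(produce())
    # Rename into place last, so a crash mid-write leaves no final file behind.
    tmp.replace(path)
    return True


def run_job(workdir: Path) -> list:
    """Run every step from the top; finished steps skip themselves."""
    steps = [
        ("extract.txt", lambda: "raw"),
        ("transform.txt", lambda: (workdir / "extract.txt").read_text().upper()),
    ]
    return [write_once(workdir / name, fn) for name, fn in steps]
```

Running `run_job` twice on the same directory does the work once; the second pass only checks for existing outputs.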

joshka•about 4 hours ago

This feels like the sort of architecture that starts clean and then gradually grows most of the things a workflow-native system already has. I've seen systems like this, seen companies that are built out of this idea, and built small systems like this over time.

Once you need retries, backoff, timeouts, cancellation, versioning, visibility, task routing, rate limits, leases, heartbeats, stuck-worker detection, replay/debugging semantics, workflow migration, fanout/fanin, long timers, audit trails, and operator tooling, the “just use a database” story becomes “build a poor copy of a workflow engine plus a bunch of workers” pretty quickly.

That may still be a good tradeoff for many applications, especially if Postgres is already the core operational dependency. But the comparison shouldn’t be “database vs overcomplicated orchestrator.” It’s more like “what complexity do you want to own, and what do you want to buy / offload to a professional system?”

hmaxdml•20 minutes ago

Yeah, we've observed that too: people start implementing their own retry logic, idempotency, etc. But then they grow a hard to maintain, complex stack that's not their core business logic. There's a reason why there is a dedicated team building DBOS, every day. Because it's not that easy to build a solid durable workflows engine on Postgres.

epolanski•8 minutes ago

Bingo, not even mentioning the blog post assumes all steps to be serializable.

I feel like this is the usual "just use postgres" garbage post that lacks any kind of nuance.

In fact you could replace that post with any other db and the statements keep being true, and naive.

hedora•36 minutes ago

I want to dig into this "free" workflow_error.sql. I'll assume 1024 byte workflow job descriptors, and the article's steady state of 10,000 jobs per second.

Possibility one: There is one index on the table, and it is the created_at TS. This query has to scan 10,000 jobs/sec * 60 seconds * 60 minutes * 24 hours * 31 days * 1024 bytes / job = 25,543 GB.

A KV store would scan exactly that much.

Possibility two: The primary key is refined to (state, timestamp). Assume a 1% failure rate. Now, we "only" scan and return 255 GB. A key value store would scan exactly that much. (This is probably the right physical design).

Possibility three: The primary key is (timestamp), and there's a secondary index on state. I guess we do an index join, where one side of the join is 25,543 GB, and the other side is one unsorted bucket with 255GB * number of months the system has been in operation in it.

A KV store wouldn't let you express that.

Now, what other ad hoc queries are we supposed to efficiently support over a one month lookback? Also, what does PG do if you tell it to scan 25TB at the same time as it's inserting 10MB/sec at 10K TPS? How is vacuuming configured?
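
The figures above can be checked directly. They match if "GB" is read as 2^30 bytes; the 1024-byte row size and the 1% failure rate are the comment's own assumptions:

```python
JOBS_PER_SEC = 10_000
ROW_BYTES = 1024
SECONDS_PER_MONTH = 60 * 60 * 24 * 31
FAILURE_RATE_PERCENT = 1
GIB = 1024 ** 3

rows_per_month = JOBS_PER_SEC * SECONDS_PER_MONTH

# Possibility one: scan every row written in the month.
full_scan_gib = rows_per_month * ROW_BYTES / GIB

# Possibility two: scan only the failed rows.
failed_scan_gib = full_scan_gib * FAILURE_RATE_PERCENT / 100

# Sustained insert volume at steady state, in decimal megabytes per second.
insert_mb_per_sec = JOBS_PER_SEC * ROW_BYTES / 1_000_000

print(round(full_scan_gib))    # 25543
print(round(failed_scan_gib))  # 255
print(insert_mb_per_sec)       # 10.24
```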

vrm•about 4 hours ago

Since DBOS doesn't support Rust, we implemented a very minimal Rust version of this at https://github.com/tensorzero/durable. It has been quite stable and extensible but of course you need to be very careful with the SQL implementations. Hope this is interesting to readers here.

sgt•about 5 hours ago

Continuously amazed by what you can do with few tools, as long as Postgres is a part of your toolkit.

I recently developed a distributed queue and it works really great - benchmarks great too, with no race conditions or conflicts. I used SKIP LOCKED so that workers can compete safely.

You can also have multiple workers across nodes avoid conflict by using session wide mutexes i.e. pg advisory lock.
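
A sketch of both techniques, assuming a hypothetical `jobs` table with `id`, `status`, and `payload` columns and any DB-API connection to Postgres (for example psycopg). `FOR UPDATE SKIP LOCKED` and `pg_try_advisory_lock` are standard Postgres features; everything else is illustrative.

```python
CLAIM_SQL = """
UPDATE jobs
   SET status = 'running'
 WHERE id = (
        SELECT id
          FROM jobs
         WHERE status = 'pending'
         ORDER BY id
         LIMIT 1
           FOR UPDATE SKIP LOCKED
       )
RETURNING id, payload
"""

LOCK_SQL = "SELECT pg_try_advisory_lock(%s)"


def claim_job(conn):
    """Claim one pending job, or return None if none is free.

    Rows already locked by other workers are skipped instead of waited on,
    so competing workers never block each other or claim the same row.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
        conn.commit()
        return row
    finally:
        cur.close()


def try_session_lock(conn, key: int) -> bool:
    """Try to take a session-level advisory lock without blocking."""
    cur = conn.cursor()
    try:
        cur.execute(LOCK_SQL, (key,))
        return bool(cur.fetchone()[0])
    finally:
        cur.close()
```

A worker loop would call `claim_job` until it returns `None`, then sleep or wait for a notification.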

bootsmann•about 3 hours ago

Advisory locks are preferred for this anyways because holding a lot of SELECT FOR UPDATE doesn’t scale too well.

Edit: Actually I checked this again and apparently the advice has now changed to the inverse.

jgraettinger1•about 2 hours ago

At Estuary, we have an in-house Rust crate [1] for building scale-out durable actors / FSMs in Postgres. It powers all async activity in our control plane -- slews of fine-grain scheduled actions, complex change propagation through data-flow topologies, reliable alert and email delivery, and more -- at hundreds to thousands of state transitions per second (today). It's been a wonderful pattern to build on, and is all of three source files.

Here's an example computing a Fibonacci sequence (very inefficiently, with lots of spawned sub-tasks and message passing) [2]

[1] https://github.com/estuary/flow/tree/master/crates/automatio...

[2] https://github.com/estuary/flow/blob/master/crates/automatio...

nzoschke•about 1 hour ago

I do love Postgres and DBOS.

I also recently started experimenting with https://github.com/earendil-works/absurd which is also Postgres and even simpler than DBOS. Their comparison is a great read:

https://earendil-works.github.io/absurd/comparison/

But for operational reasons I've started using sqlite for durable workflows instead. Porting the database concepts from either DBOS or absurd PG to SQLite is remarkably easy these days. A small polling loop instead of notify/listen feels fine for smaller workloads.

saxenaabhi•about 2 hours ago

As someone who uses dbos.dev, restate.dev, cf workflows here is a snippet from our Agents.md:

  Restate.dev:
    for payment integrations on northflank since its faster than cf workflows, independent of cf and its downtime and self-hostable vendor-lock-in free,
  Cloudflare workflows:
    for non critical stuff like csv/pdf report generations since it's very cheap.
  DBOS.dev:
    for workflows that need atomic messaging tied to a postgres db transaction for 100% reliabilty/durabilty(for example populating a materialized row or sending out critical email/push to a merchant).

DBOS and Restate are similar on the surface, but Restate requires a central "orchestrator", which has pros and cons but makes it easy to build with serverless workers on cf/vercel.

It also has VirtualObject which is a nice vendor-lock-in-free OSS alternative to CF's single threaded DurableObject.

Where DBOS absolutely shines is

1) Atomic messaging in the same db tx as your business logic via dbos.enqueue_workflow! This is often the most brittle part of any solution and doing it atomically and durably with same tx that ran your business logic drastically reduces lots of complexity.

2) Since DBOS stores workflow state in the db, it should be easy to build a dashboard for observability from metabase/looker (I wish restate exposed its rocksdb instance so it could be hooked up to metabase).

buremba•about 4 hours ago

All you need is Postgres until you scale into TBs of data. We use Postgresql as a durable workflow engine, vector search, time-series data, BM25 search, OLTP/OLAP engine, and a queue. It's basically the only dependency we have for https://lobu.ai

The main benefit is centralizing all the data in one place so we don't need to worry about copying data between multiple systems. Once something becomes the bottleneck, you can eventually migrate to a purpose-specific tool to scale out. To be honest, LISTEN/NOTIFY is in my opinion the most fragile part of PG, but it's fine as a start until you scale out.

tibbon•about 3 hours ago

But when you hit that wall, it is hard to stop and convince people to use different patterns and systems. I've seen so many tables go from "it will only be a few thousand rows" to suddenly several TB and then people are looking confused when performance and db admin tasks get really difficult.

I'm working at a scale where almost every day I have to ask people "are you use you need to treat that as relational data? It doesn't seem relational"

alexwennerberg•about 1 hour ago

> But when you hit that wall, it is hard to stop and convince people to use different patterns and systems. I've seen so many tables go from "it will only be a few thousand rows" to suddenly several TB and then people are looking confused when performance and db admin tasks get really difficult.

It's much, much worse in my experience to have to develop for the opposite -- working on a system that was designed for an imagined "infinite" scale that in reality is like 100GB and a few transactions a minute.

dieselgate•about 2 hours ago

> are you use you need to treat that as relational data?

Is this intended to be "you sure you need..."?

turkeyboi•about 1 hour ago

Obviously, yes

sroussey•about 2 hours ago

Use different “databases” besides public at the very start. No joins between them. You will be in a good position to just split the postgres instance by those at a later date. They will have different usage patterns than the merged version you have now, and will be easier to optimize and will buy you some time. And time is all you need.

gjvc•about 1 hour ago

"public" is not a database, it is a schema within a database.

apropos bad naming, postgresql authors are not forgiven for naming all the databases on a single host a "cluster". I mean __really__.

hmaxdml•about 4 hours ago

Listen/notify is poised to become much better in PG 18 and 19

stuartaxelowen•about 3 hours ago

Why’s that?

TkTech•about 3 hours ago

In pg19 https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit... will land, which significantly improves NOTIFY performance. Right now LISTEN/NOTIFY doesn't scale to very busy instances because a `NOTIFY` within a transaction takes a global lock.

ceres•about 2 hours ago

Just an fyi, when I try to sign in with google for your app I get the message: "The app is requesting access to sensitive info in your Google Account. Until the developer (*reka*kc*@gmail.com) verifies this app with Google, you shouldn't use it."

buremba•about 2 hours ago

Ahh, sorry about that. It should be fixed in an hour, looks like we mixed the permissions. I just tried and confirmed other login methods work if you would like to try out.

throwaway7783•about 4 hours ago

I'm in the same camp. Do you use any specific extensions? Especially for OLAP and time series (partitioned tables + related extensions work fine, but curious if you use anything else)

osigurdson•about 1 hour ago

From experience, I'd suggest using ClickHouse beyond a few billion rows of timeseries data in Postgres.

throwaway7783•about 1 hour ago

Nice thing about our use case is that it's not strictly analytics, but looking at the most recent raw data. ClickHouse is definitely the powerhouse for analytics.

buremba•about 3 hours ago

The native extensions are fine but I don't have good experience with any third party extensions, so far tried Timescale, pg_lake, citus, and pgvectorscale. They look very appealing but it's usually a trap as you can't get the value without using the vendor's cloud offerings.

I think if you grow enough to look for these extensions, it's usually better to bet on purpose-specific tooling. For example, I use DuckDB/Iceberg combination extensively for columnar data and connect DuckDB to PG when I need it.

throwaway7783•about 1 hour ago

Fair enough. How do you do BM25?

cultofmetatron•about 2 hours ago

Conversely, startups that start scaling for TBs of data never make it to needing TBs of data. They burn too much energy scaling when they don't yet have a product people want.

nicoburns•43 minutes ago

Yep. I've also seen systems that were slow with <10GB of data because of bad application of patterns that were supposedly "scalable" (pulling entire tables out of the database to implement joins in application code because "nosql is faster" is not actually fast).

pphysch•about 4 hours ago

I don't see logs mentioned. I agree with most of those applications but would keep my OLAP stuff (metrics, logs, traces) in a separate store like VictoriaMetrics, both for capacity and read activity.

TkTech•about 3 hours ago

pg_timescale can take you pretty far for metrics and would be Good Enough for almost all users. Totally agree on raw, high-volume logs though.

buremba•about 3 hours ago

Yeah I have logs in Sentry, which also uses Postgresql.

valyala•about 1 hour ago

Sentry stores logs in ClickHouse - https://blog.sentry.io/how-sentry-queries-unstructured-data-...

epolanski•11 minutes ago

I don't get how any of the points made in this blog post would not work if you replaced postgres with MySQL or cosmosdb.

In any case there can be more to durable workflows than just saving the current step, and not all intermediate steps are serializable thus I don't get where's the postgres magic that more mature solutions don't have.

switchbak•about 4 hours ago

Having inherited a few of these - you tend to home-grow an ad-hoc version of many of the existing OSS tools, but with less of the patterns baked in.

Not sure where the NIH ends and where you're actually better off with a supported orchestration approach. I suppose if you expect your program to be around a while (or need advanced features), maybe think about using something a bit more battle tested?

pirsquare•about 4 hours ago

I feel it's way too hand-wavy on consistency and correctness. That's my opinion as someone who's implemented marketing workflows that break all the time (with tons of painful lessons).

A strong correctness guarantee is something that should not be undermined. It's even more important than availability.

The example on the website is simple but heavily undermines the importance of correctness. Anyone who implements similar pseudo-code directly will eventually suffer from data correctness issues in crashes.

  @DBOS.workflow()
  def checkout_workflow(items: Items):
      order = create_order()
      reserve_inventory(order, items)
      payment_status = process_payment(order, items)

      if payment_status == 'paid':
          fulfill_order(order)
      else:
          undo_reserve_inventory(order, items)
          cancel_order(order)
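
To make the crash concern concrete: if the process dies after `process_payment` runs and the workflow is simply rerun, the payment is attempted again. A minimal sketch of step-level checkpointing that avoids repeating recorded steps, using `sqlite3` as a stand-in for Postgres and a hypothetical `run_step` helper. This is an illustration of the general technique, not DBOS's actual implementation:

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a checkpoint store with one row per (workflow, step)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS steps ("
        "workflow_id TEXT, name TEXT, result TEXT, "
        "PRIMARY KEY (workflow_id, name))"
    )
    conn.commit()
    return conn


def run_step(conn, workflow_id: str, name: str, fn):
    """Return the recorded result of a step, running `fn` only if none exists."""
    row = conn.execute(
        "SELECT result FROM steps WHERE workflow_id = ? AND name = ?",
        (workflow_id, name),
    ).fetchone()
    if row is not None:
        # Replay after a crash: reuse the stored result, skip the side effect.
        return json.loads(row[0])
    result = fn()
    with conn:  # commit the checkpoint, or roll back if the insert fails
        conn.execute(
            "INSERT INTO steps (workflow_id, name, result) VALUES (?, ?, ?)",
            (workflow_id, name, json.dumps(result)),
        )
    return result
```

A crash between `fn()` returning and the checkpoint committing would still rerun the step, so steps with external side effects such as a payment also need an idempotency key on the remote side. Results must also be serializable, which is a real constraint on what a step can return.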

hmaxdml•about 4 hours ago

As you said, the example is simple and it might not be obvious to people without prod experience what the problems can be. Postgres can give you all the primitives you need to solve this at the application layer. Durable workflows on Postgres is an effective way to access these primitives.

munk-a•about 4 hours ago

We have a durable queue built into postgres to handle some complex notification-ish logic. It's worked excellently and while there are services various cloud providers would love to sell us to do that it's extremely cheap to run.

For that particular usage, the volume we process and business criticality make it a good choice for inventing here - but for other durable processes we just use off the shelf tools since the cost of maintenance would quickly outstrip the value.

Postgres is a great tool to use and far more powerful than most people give it credit for - but there's always the balance of in-house maintenance vs. paying rent for someone else's solution.

PunchyHamster•about 4 hours ago

what's "maintenance" here ? If app is also using PostgreSQL it should be just initial effort of writing/importing code to run it, no ?

munk-a•about 4 hours ago

You pay for everything you build - the more complexity you put into it the more that costs over time. Dependencies need to be updated, language/framework upgrades usually break something, new features/requirements introduce additional complexity and code to manage. Software just costs money every day - not a lot, our industry is much lower margin than, say, stamping sheets of metal into tools - but it still has operational costs beyond just the money to operate the hardware we run our products on.

PunchyHamster•about 4 hours ago

I know that. This looks like some lib you update once a year or on every new CVE, and it is being compared to a lib from a cloud vendor that you also update once a year or on every new CVE, which is why I asked what it cost YOU in this particular case.

grahac•about 3 hours ago

Isn't this Just Oban from elixir? :)

senderista•about 5 hours ago

Citing CockroachDB as an example of scaling Postgres made me spit out coffee. Was this LLM-written?

sorentwo•about 4 hours ago

The efforts we've undergone to make Oban (and Pro) work with CRDB have been ridiculous. Feature detection all over because of a lack of common operators and functions that can't be used in indexes. The worst is the rampant "serialization_failure" errors that force continual transaction retries. Not how I'd suggest scaling Postgres.

That said, as a predecessor to dbos in building durable workflows just using Postgres, I concur with the overall sentiment.

bcooke•about 3 hours ago

Can you expand on why you chose to use CRDB with Oban? I have no opinion here, I’m genuinely curious as someone using Oban myself (with Postgres). I haven’t hit the point of really needing to scale it out yet and I’d rather avoid the traps others have figured out.

TkTech•about 3 hours ago

sorentwo is the author of Oban. He's not using CockroachDB, he's supporting it as a valid Oban target.

Reubend•about 3 hours ago

Yeah that seems off to me too. But I guess they meant that since CockroachDB is compatible with Pg, it would also serve the same purpose?

halamadrid•about 3 hours ago

We work on disk log based architecture for workflows at Unmeshed (https://unmeshed.io/) which helps it to scale at a fraction of the cost of traditional workflow systems that are based on expensive databases.

Postgres is not cheap to run in the cloud at scale. We went for the cheapest infra, which is basically the disk storage.

iwwff•about 2 hours ago

Every time, I am surprised to see promises of cheap durability that don't mention the costs of running durable Postgres, which might not be easy or cheap.

everforward•about 1 hour ago

Postgres is durable by default, it's ACID compliant. It is not reliable in the HA sense by default.

Either way, I'd bet a hosted Postgres with HA is cheaper than whatever PaaS you're thinking of.

magicseth•about 4 hours ago

Convex has a workpool component that gives the ability to compose big complicated flows in an understandable way, and give you realtime updates on status of various pieces: https://www.convex.dev/components/workflow

doginasuit•about 2 hours ago

I have only used two databases, SQLite and Postgres, depending on if the database needs to be bundled with the application. They both feel like magic. Even though I am not a religious person, I recognize the value in acknowledging a higher power.

hbarka•about 4 hours ago

How do you incorporate secrets in this kind of implementation? Stored in db?

KraftyOne•about 3 hours ago

Secrets are orthogonal to durable execution--what are your concerns about using them together?

mrits•about 3 hours ago

Unless you have a very specific use case, you wouldn't want to store in db or in any message you use in any workflow like this. Usually whatever does the actual work has a way to get the secret.

rafael-lua•about 3 hours ago

The "everything can be done in Postgres" crowd is crazy. It is like a religion at this point.

cpursley•about 2 hours ago

But that's not what we are saying; we're suggesting use Postgres until you truly need something else. 90% of applications aren't "web scale", keep the stack simple and portable. There's no good reason to slap in a ton of moving parts until they are truly needed.

elliot07•about 4 hours ago

how is this compared to hatchet?

OutOfHere•about 4 hours ago

I am not convinced that using a special software for "durable workflows" is necessary. If one has a stateful message queue or job task queue, e.g. RabbitMQ or Celery, one can use it. Irrespective, many jobs can be made idempotent. The most that you ought to residually need is a column in an existing table of your own database which keeps track of what remains to be done.

Given the above, it would seem that durable workflow software is pushed forward by those who have a surplus of VC money to spend. As for the vendors, there is no shortage of people trying to sell you things that you don't need.

llmslave•about 4 hours ago

Temporal is an insane piece of software; I'm always surprised people don't know about it. You could replace almost your whole AWS stack with Temporal.

temporal_thr123•about 4 hours ago

Sure, if you wanna run a 48 node cassandra cluster...

cpursley•about 4 hours ago

I find it strange that some think in terms of AWS architecture as the default. You could replace nearly the entire AWS stack with an Elixir (Erlang) monolith + Postgres.

cpursley•about 4 hours ago

PgFlow is pretty awesome for DAG workflows - it's built on pgmq (which does the heavy lifting, making it backend agnostic).

Typescript: https://www.pgflow.dev

Elixir: https://github.com/agoodway/pgflow/blob/main/docs/COMPARISON...