hugocorreia90 | about 23 hours ago | 20 comments

Hi HN, I'm Hugo. I've been building Rocky over the past month, shipping fast in the open. The binary is on GitHub Releases, `dagster-rocky` on PyPI, and the VS Code extension on the Marketplace. I held off on a broader announcement until the trust-system surface was coherent enough to talk about as one thing. The governance waveplan — column classification, per-env masking, 8-field audit trail on every run, `rocky compliance` rollup, role-graph reconciliation, retention policies — landed end-to-end last week in engine-v1.16.0 and rounded out in v1.17.4 (tagged 2026-04-26). That's the milestone I'd been waiting for.

The pitch: keep Databricks or Snowflake. Bring Rocky for the DAG. Rocky is a Rust-based control plane for warehouse pipelines. Storage and compute stay with your warehouse. Rocky owns the graph — dependencies, compile-time types, drift, incremental logic, cost, lineage, governance. The things your current stack can't give you because it doesn't own the DAG.

A few things I think are interesting:

- Branches + replay. `rocky branch create stg` gives you a logical copy of a pipeline's tables (schema-prefix today; native Delta SHALLOW CLONE and Snowflake zero-copy are next). `rocky replay <run_id>` reconstructs which SQL ran against which inputs. Git-grade workflow on a warehouse.

- Column-level lineage from the compiler, not a post-hoc graph crawl. The type checker traces columns through joins, CTEs, and windows. VS Code surfaces it inline via LSP.

- Governance as a first-class surface. Column classification tags plus per-env masking policies, applied to the warehouse via Unity Catalog (Databricks) or masking policies (Snowflake). 8-field audit trail on every run. `rocky compliance` rollup that CI can gate on. Role-graph reconciliation via SCIM + per-catalog GRANT. Retention policies with a warehouse-side drift probe.

- Cost attribution. Every run produces per-model cost (bytes, duration). `[budget]` blocks in `rocky.toml`; breaches fire a `budget_breach` hook event.

- Compile-time portability + blast radius. Dialect-divergence lint across Databricks / Snowflake / BigQuery / DuckDB (12 constructs). `SELECT *` downstream-impact lint.

- Schema-grounded AI. Generated SQL goes through the compiler — AI suggestions type-check before they can land.
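
To tie the branches/replay and governance bullets together, here is a rough command sketch. Only the subcommand names come from this post; the ordering, the run-id variable, and any flags you might need are assumptions rather than the exact CLI contract.

```bash
# Sketch of the branch/replay + governance workflow from the bullets above.
# Only the subcommand names are taken from the post; everything else is assumed.
rocky branch create stg     # logical copy of the pipeline's tables (schema-prefix today)
rocky compile               # resolves column classification tags into per-env masking policies
rocky compliance            # governance rollup; a CI job can gate on this
rocky replay "$run_id"      # reconstructs which SQL ran against which inputs for a past run
```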

What Rocky isn't:

- Not a warehouse — it's the control plane on top.

- Not a Fivetran replacement. `rocky load` handles files (CSV/Parquet/JSONL); for SaaS sources use Fivetran, Airbyte, or warehouse-native CDC.

- Not dbt Cloud — no hosted UI, no managed scheduler. First-class Dagster integration if you need orchestration.

Adapters: Databricks (GA), Snowflake (Beta), BigQuery (Beta), DuckDB (local dev / playground). Apache 2.0.

I'd love feedback on the trust-system framing, the governance surface (particularly classification-to-masking resolution in `rocky compile` and the `rocky compliance` CI gate), the branches/replay design, the cost-attribution primitives, or anything else that catches your eye. Happy to go deep in the thread.

Discussion (20 Comments)

Xiaoher-C 12 minutes ago
The compile-time lineage part is the most interesting bit to me. A lot of “data lineage” tools feel like archaeology after the fact: parse logs, reconstruct what probably happened, then hope it matches reality.

Having the compiler know “this column flows into these downstream models” before execution changes the workflow quite a bit. It makes refactors and masking policies much less scary.

Do you expose any kind of “lineage diff” between branches? For example: this PR changes the downstream impact of `customer.email` from A/B/C to A/B/D. That would be useful in code review.

ramon156 about 4 hours ago
If your introduction message already includes a bunch of uncurated claims and LLM smells, then what does that say about the code I'm about to run?
hugocorreia90 about 3 hours ago
Yeah, fair pushback, and yes, the intro was AI-assisted. Marketing is not my strength, nor am I a native English speaker. I built this in about a month with heavy LLM tooling, and the seed comment is part of that. I'm not going to pretend otherwise.

The code is what it is. `cargo test --workspace` runs across 19 crates. CI on 5 platforms (macOS ARM/Intel, Linux x86/ARM, Windows). JSON output schemas are codegen-checked in CI so docs can't drift from the binary.

If you want to skip the marketing copy and look at engine reasoning instead: PR #240 (audit trail), #241 (column classification + masking), #270 (failed-source surfacing in discover).

I'd rather hear "the code is bad" than "the post sounds AI-written".

DoctorOW 18 minutes ago
> I'd rather hear "the code is bad" than "the post sounds AI-written".

Of course you would. Reading through and judging the quality of AI output is the largest amount of effort in a world where you can get everything else by prompting. Please internalize this: If you want to be respected you will have to put in effort yourself. There is no way around this.

austinthetaco about 1 hour ago
This comment itself is likely written by AI by the sounds of it. It may be worth your time writing it out in your own words in your native language and then finding a competent translation tool to translate your words.
FrustratedMonky about 1 hour ago
Not sure why you are downvoted here.

A lot of side projects, hobby projects, etc. are using AI tools now. Also for marketing: every sales/marketing firm is using AI. So why criticize this guy in particular?

AI is pervasive, the train has left the station. So that is not a reason to criticize this project. There might be other reasons, I'm not sure, but not that an AI was used.

ModernMech 37 minutes ago
Because "Yeah, fair pushback" is AI smell. Either everything this person does is passed through an AI from code to blogs to even their HN comments and submissions; or they use AI so much they're starting to talk like it colloquially. Either way no one has time for that.
cmrdporcupine about 1 hour ago
It's really a weird world now.

I do think the author is doing a disservice to themselves by writing the post and comments with an LLM, even if the code is mostly agent-built. People can tell right away, all the LLM shibboleths are there... it feels cheap. Just write naturally and then run it through Google Translate; don't let the LLM speak on your behalf.

What's going to distinguish projects that are built this way is the ability to explain, document, support, and maintain said projects over the long term. That will be the crucible. Gone are the days of "build it and they will come", and I feel a bit sad about that.

It's so easy to let the code grow under you beyond what you have the capacity to do the above for.

I've got the same thing going on. Eschewing paid work and grinding 16, 17 hours a day boiling the sea to build the whole universe from scratch (also a database, but of a different sort than this project) integrating all my favourite DB research papers and ideas that I've accumulated over the last 30 years. Outperforms postgres 2-4x or more, has a battery of correctness tests, Lean proofs, benchmarks, etc. etc.

But frankly I'd be nervous to share. Especially here. I don't even know where it ends up. Not least because if I'm doing it, so are 50 other people, probably.

PeterWhittaker 41 minutes ago
Congrats on the work, but have you considered another name? Naming is hard and always will be: When I first scanned the headline, my initial thought was "that's an interesting area for the Rocky Linux team to explore". After a moment, "wait, no, that's confusing, it's some other Rocky".
hugocorreia90 15 minutes ago
Thanks Peter. All my side projects are named after my pets. I had a dog named Rocky, and since this project is also an underdog competing with well-established tools such as dbt and sqlmesh, I decided to keep the name when opening it to the public. But I'm happy to hear suggestions for a better name for this tool :)
mollerhoj about 4 hours ago
It's a bit confusing to claim "the things your current stack can't give you because it doesn't own the DAG" and use Databricks as your example: Databricks includes jobs and pipelines, so it very much owns the DAG, no?
hugocorreia90 about 3 hours ago
Fair point. Databricks owns a scheduling DAG (Workflows, DLT). What I meant by "owns the DAG" is the semantic DAG: model-to-model dependencies with column-level types that the compiler builds.

Workflows knows task A runs before task B. Rocky knows `dim_customer.email` flows from `raw_users.email_address` through three CTEs in `stg_customers`. Different layer, same word.

I'll be more careful with that framing.
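
To make the "semantic DAG" distinction concrete, here is an invented `stg_customers` model: the table and column names come from this reply, but the SQL itself and the `models/` path are made up. The point is that the compiler can see the `email` column at the bottom still originates from `raw_users.email_address` two CTEs up, which a task-level scheduler has no way of knowing.

```bash
# Invented example model; only the table/column names come from the reply above.
mkdir -p models
cat > models/stg_customers.sql <<'SQL'
WITH cleaned AS (
    -- the rename happens here: raw_users.email_address -> email
    SELECT id, lower(email_address) AS email
    FROM raw_users
),
deduped AS (
    -- window function; email is still traceable through this CTE
    SELECT id, email,
           row_number() OVER (PARTITION BY id ORDER BY email) AS rn
    FROM cleaned
)
SELECT id, email   -- what dim_customer eventually reads
FROM deduped
WHERE rn = 1
SQL
```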

hasyimibhar about 5 hours ago
Looks cool, I've been waiting for someone to build this since the dbt and SQLMesh acquisitions. It would be great to have model versioning and support for ClickHouse SQL.
hugocorreia90 about 3 hours ago
Thanks. On model versioning — what's the use case you have in mind? A few options that map to different designs:

- dbt-style semantic-layer versions (v1/v2 of a model)
- schema migration history
- branch-based (Rocky already has branches + replay)

Different design choice for each, so it helps to know which problem you're trying to solve.

ClickHouse is tractable through the Adapter SDK without engine patching. If you can share roughly your model count and workload shape, I can put a real timeline on it. Open to community PRs too.

kjuulh about 2 hours ago
fyi, llm written comments are discouraged on hackernews.

https://news.ycombinator.com/item?id=47340079

Not saying yours are, but them -- dashes certainly looks like it ;)

hugocorreia90 about 2 hours ago
Fair. I just use it for tidying up my replies as I'm not a native English speaker.
mergisi about 5 hours ago
* * *
hugocorreia90 about 3 hours ago
Thanks for the careful read. The "what breaks if I rename this column" question is exactly what column lineage from the compiler is meant to answer, and you said it better than I did in the post.

On the schema-grounded AI angle: agreed. The failure mode you describe — structurally valid SQL that joins on the wrong key or aggregates at the wrong grain because the model hallucinated a relationship — is exactly what the compiler is positioned to catch. AI-generated SQL runs through the type checker before it can land, so suggestions that don't validate against the actual DAG never reach the user. NL-to-SQL tools that integrate a compile step would close exactly the gap you're pointing at.

On your two questions:

1. Branch isolation for stateful models — mixed answer, and worth being honest about:

   - Incremental: isolated. The watermark `state_key` includes the resolved schema, and `rocky branch create` swaps the schema prefix. So a branch run reads/writes a different redb key than main and they don't advance each other.

   - Snapshot: not yet. Today `rocky branch create` only writes a branch record; it doesn't copy warehouse tables. A snapshot model on a branch starts with an empty table (CREATE TABLE IF NOT EXISTS in the branch schema) and accumulates from the first branch run, with no inherited history from main. That's the gap. The fix is the next wave: native Delta SHALLOW CLONE / Snowflake zero-copy at branch creation, which gives point-in-time snapshot semantics without copy-on-write overhead.
2. Cost attribution. Both bytes scanned and duration are captured per-model in the run record (`bytes_scanned` and duration on `RunRecord`). Budget gating today is on cost (USD) and duration — `max_usd` and `max_duration_ms` in `[budget]` blocks in `rocky.toml`, as independent thresholds. A direct bytes-scanned budget threshold isn't gateable today; the bytes are in the run record for analysis but you can't currently fail CI on "this run scanned more than N TB". Reasonable extension if there's demand.

   To your Snowflake point: the warehouse-size × duration credit model and the scan volume tell genuinely different stories, so they're tracked separately rather than rolled into a single number.
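
As a purely illustrative example of the thresholds described in point 2: a minimal `[budget]` block using only the `max_usd` and `max_duration_ms` keys named above. The values and the way it is appended to `rocky.toml` are assumptions, not the documented format.

```bash
# Minimal [budget] sketch; the keys are the ones named in the reply above,
# the values and file layout are illustrative assumptions.
cat >> rocky.toml <<'TOML'
[budget]
max_usd = 25.0              # independent threshold on per-run cost (USD)
max_duration_ms = 600000    # independent threshold on duration (10 minutes)
TOML
```

Per the post, a breach of either threshold fires a `budget_breach` hook event, which is what CI or alerting would hang off.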