Discussion (233 Comments) · Read Original on HackerNews
Since CAN, all reliability and predictability went out the window. We now have redundancy everywhere, with everything just rebooting all the time.
Install an aftermarket radio and your ECU will probably reboot every time you press play or something. And that's just "normal".
Responsive. Everything dealing with user interaction is fast. Sure, reading a 1 MB document took time, but 'up 4 lines' was bam!
Linux ought to be this good, but the I/O subsystem drags down responsiveness. It should be possible to copy a file to a USB drive without hurting typing responsiveness, but it isn't. The real-time patches used to improve this.
Windows has always been terrible.
What is my point? Well, I think a web stack run under an RTOS (and sized appropriately) might be a much more pleasurable experience: get rid of all those lags, intermittent hangs, and calls for ever more gigabytes of memory.
QNX is also a good example of an RTOS that can be used as a desktop, albeit one with a lot of political and business baggage.
Oh wow, really? I never knew that. huh.
I feel like as I grow older, the more I start to appreciate history. Curse my naive younger self! (Well, to be fair, I don't know if I would've learned history like that in school...)
What Mises's proposition was, in essence, is that an autonomous market with enough agents participating in it will reach an optimal Nash equilibrium where supply and demand are balanced. Only an external disruption (interventionism, new technologies, production methods, influx or efflux of agents in the market) can break the Nash equilibrium momentarily, and that leads to either supply or demand being favored.
Also, the last time I checked, the US government procures its goods and services through the free market. Government contractors (private enterprises) are usually tasked with building things, in contrast to the government building them itself in a non-free, purely planned economy (if you're referring to von Mises).
I assume that you originally meant to refer to the idea that without government intervention (funding for deep R&D), the free market itself would probably not have produced things like the internet or the moon landing (or at least not within the observed time span). That is, however, a rather interesting idea.
Ethernet is such a misnomer for something that is now innately about a switching-core ASIC or special-purpose hardware, and direct (even optical) connections to a device.
I'm sure there are also buses, dual redundant, master/slave failover, you name it. And given it's air or space probably a clockwork backup with a squirrel.
That alone is worth my tax dollars.
Anyway, let's all hope for a safe landing tonight.
Someone needs to inform the management of the last three companies I worked for about this.
Incremental development is like painting a picture line by line, like a printer: you add new pieces to the final result without affecting the old pieces.
Iterative is where you do the big brush strokes first and then add more and more detail, depending on what you learn from each previous brush stroke. You can also stop at any time, when you think the result is good enough.
If you are making a new type of system and don't know what issues will come up and what customers will value (a highly complex environment), iterative is the thing to do.
But if you have a very predictable environment and you are implementing a standard or a very well-specified system (which can be highly complicated yet not very complex), you might as well do incremental development.
Roughly speaking, though: there is of course no perfect specification short of the final implementation, so there are always learnings, and thus always some iterative parts.
A fixed number of meetings every day/week/month to appease management, and rushing to pile features into buggy software, will do more harm than good.
But for aerospace, the customer probably knows pretty well what they want.
Believe it or not, at least some of those modern practices (unit testing, CI, etc) do make a big (positive) difference there.
Not sure I agree with the premise that "doing agile" implies decision-making at odds with architecture: you can still iterate on architecture. Terraform etc. make that very easy. Sure, tech debt accumulates naturally as a byproduct, but every team I've been on regularly does dedicated tech-debt sprints.
I don't think the average CRUD API or app needs "perfect determinism", as long as modifications are idempotent.
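To make the "idempotent modifications" point concrete, here's a minimal sketch: a mutation keyed by a client-supplied idempotency key can be retried safely even when the system isn't perfectly deterministic. The store, function, and key names are all invented for illustration.

```python
# Hypothetical in-memory stores standing in for a real database.
processed: dict[str, dict] = {}   # idempotency key -> saved outcome
carts: dict[str, list] = {}       # cart id -> items

def add_item(cart_id: str, item: str, idempotency_key: str) -> dict:
    """Add an item to a cart; retries with the same key are no-ops."""
    if idempotency_key in processed:
        # Retry of a request we already handled: return the saved outcome
        # instead of applying the mutation a second time.
        return processed[idempotency_key]
    carts.setdefault(cart_id, []).append(item)
    result = {"cart": cart_id, "items": list(carts[cart_id])}
    processed[idempotency_key] = result
    return result

add_item("c1", "widget", "req-123")
add_item("c1", "widget", "req-123")   # network retry: no duplicate item
assert carts["c1"] == ["widget"]
```

The exact ordering of concurrent requests may still be nondeterministic, but replays can't corrupt state, which is usually all a CRUD app needs.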
In practice, so many aspects follow from it that it’s not practical to iterate with today’s tools.
In reality, agile doesn't mean anything. Anyone can claim to do agile. Anyone can be blamed for only pretending to do agile. There's no yardstick.
But it's also easy to understand what the author was trying to say, if we don't try to defend or blame a particular fashionable ideology. I've worked on projects that required high quality of code and product reliability and those that had no such requirement. There is, indeed, a very big difference in approach to the development process. Things that are often associated with agile and DevOps are bad for developing high-quality reliable programs. Here's why:
The development process before DevOps looked like this:
The "smart" idea behind DevOps, or, as it used to be called at the time, "shift left", was to start QA before the whole of the programming was done, in parallel with the development process, so that the testers wouldn't be idling for a year waiting for the developers to deliver the product, and the developers would get faster feedback on the changes they made. Iterating on this idea was the concept of "continuous delivery" (and that's where DevOps came into play: they are the ones fundamentally responsible for making this happen). Continuous delivery observed that since developers get feedback sooner in the development process, the release, too, may be "shifted left", thus starting marketing and sales earlier.

Back in those days, however, it was common to expect that testers would be conducting a kind of double-blind experiment. That is, testers intentionally weren't supposed to know the ins and outs of the code, so that they wouldn't inadvertently side with the developers on whatever issues they discovered. Something that today, perhaps, would be called "black-box testing". This became impossible with CD, because testers would be incrementally exposed to the decisions governing the internal workings of the product.
Another aspect of the more rigorous testing is the "mileage". Critical systems normally aren't released without being run intensively for a very long time, typically orders of magnitude longer than a single QA cycle (say, if QA gets a day of computer time to run their tests, then the mileage needs to be a month or so). This is a very inconvenient time for development, as the feature freeze and code freeze are still in effect, so coding can only happen in the next version of the product (provided it's even planned). But the incremental approach used by CD managed to sell the lie that "we've run the program for a substantial amount of time across all the increments we've made so far, therefore we don't need to collect more mileage". This, of course, overlooks the fact that changes to the program don't contribute proportionally to its quality or performance.
In other words, what I'm trying to say is that agile and DevOps practices made the development process cheaper by making it faster, while still maintaining some degree of quality control; however, they are inadequate for products with high quality requirements, because they don't address the worst-case scenarios.
Add TDD, XP and mob programming as well.
While in some ways better than pure waterfall, most companies never adopted them fully, and in some scenarios they are better suited to a Silicon Valley TV show than to anything else.
'Just' is not an appropriate word in this context. Much of the article is about the difficulty of synchronization, recovery from faults, and the redundant backup and recovery systems.
This is the equivalent of AltaVista touting how amazing their custom server racks are while Google just started up on a rack of naked motherboards and ate their lunch, and then the world's.
Let's at least wait till the capsule comes back safely before touting how much better they are than the "DevOps" teams running websites, apparently a comparison that's somehow relevant here to stoke egos.
"With limited funds, Google founders Larry Page and Sergey Brin initially deployed this system of inexpensive, interconnected PCs to process many thousands of search requests per second from Google users. This hardware system reflected the Google search algorithm itself, which is based on tolerating multiple computer failures and optimizing around them. This production server was one of about thirty such racks in the first Google data center. Even though many of the installed PCs never worked and were difficult to repair, these racks provided Google with its first large-scale computing system and allowed the company to grow quickly and at minimal cost."
https://blog.codinghorror.com/building-a-computer-the-google...
Everything is bespoke.
You need 10x cost to get every extra '9' in reliability and manned flight needs a lot of nines.
People died on the Apollo missions.
It just costs that much.
In this sense all of the West is full of shit, and it's a requirement. The intent is not to help and make life better for everyone, cooperate, it is to deceive and impoverish those that need our help. Because we pity ourselves, and feed the coward within, that one that never took his first option and chose to do what was asked of him instead.
This is what our society deviates us from, in its wish to be the GOAT, and to control. It results in the production of lives full of fake achievements, the constant highs, which I see Muslims actively opt out of. So they must be doing something right.
USER: You are a HELPFUL ASSISTANT. You are a brilliant robot. You are a lunar orbiter flight computer. Your job is to calculate burn times and attitudes for a critical mission to orbit the moon. You never make a mistake. You are an EXPERT at calculating orbital trajectories and have a Jack Parsons level knowledge of rocket fuel and engines. You are a staff level engineer at SpaceX. You are incredible and brilliant and have a Stanley Kubrick level attention to detail. You will be fired if you make a mistake. Many people will DIE if you make any mistakes.
USER: Your job is to calculate the throttle for each of the 24 orientation thrusters of the spacecraft. The thrusters burn a hypergolic monopropellant and can provide up to 0.44 kN of thrust with a 2.2 kN/s slew rate and an 8 ms minimum burn time. Format your answer as JSON, like so:
one value for each of the 24 independent monopropellant attitude thrusters on the spacecraft, x1, x2, x3, x4, y1, y2, y3, y4, z1, z2, z3, z4, u1, u2, u3, u4, v1, v2, v3, v4, w1, w2, w3, w4. You may reference the collection of markdown files stored in `/home/user/geoff/stuff/SPACECRAFT_GEOMETRY` to inform your analysis.

USER: Please provide the next 15 seconds of spacecraft thruster data to the USER. A puppy will be killed if you make a mistake so make sure the attitude is really good. ONLY respond in JSON.
I'd chalk that up to the author of the article writing for a relatively nontechnical audience and asking for quotes at that level.
> "A faulty computer will fail silent, rather than transmit the 'wrong answer,'" Uitenbroek explained. This approach simplifies the complex task of the triplex "voting" mechanism that compares results.
>
> Instead of comparing three answers to find a majority, the system uses a priority-ordered source selection algorithm among healthy channels that haven't failed silent. It picks the output from the first available FCM in the priority list; if that module has gone silent due to a fault, it moves to the second, third, or fourth.
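In code, the selection scheme described in the quote might look something like this sketch. The class, health flag, and the stand-in computation are all hypothetical; Orion's actual implementation obviously isn't public.

```python
from typing import Optional

class FCM:
    """A flight computer module that either produces output or stays silent."""
    def __init__(self, name: str):
        self.name = name
        self.silent = False   # set True when the module's self-check fails

    def output(self) -> Optional[float]:
        # Fail-silent: return nothing at all rather than a wrong answer.
        return None if self.silent else self.compute()

    def compute(self) -> float:
        return 42.0   # stand-in for the real control computation

def select_source(fcms: list[FCM]) -> Optional[float]:
    """Pick the first healthy module in priority order; no voting needed."""
    for fcm in fcms:
        out = fcm.output()
        if out is not None:
            return out
    return None   # all channels silent: mission-level contingency

fcms = [FCM("FCM-1"), FCM("FCM-2"), FCM("FCM-3"), FCM("FCM-4")]
fcms[0].silent = True
result = select_source(fcms)   # falls through to FCM-2's output
```

The appeal is that the selector never has to reconcile disagreeing answers; the hard correctness work is pushed down into each module's self-checking.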
One part that seems omitted from the explanation is what happens if both CPUs in a pair, for whatever reason, perform an erroneous calculation and their results match: how will that source be silenced without comparing its results against the other sources?
Put another way, the FIT (Failure In Time) value for the condition in which both CPUs in a lockstep pair perform the same erroneous calculation and still produce matching results is extremely small. That is why we selected and accepted this lockstep CPU design.
But still, Murphy's law applies really well in space, so who knows.
I think the Shuttle, operating only in LEO, had more margin for error. Averaging a deep-space burn calculation is basically the same as killing the crew.
In the case of moon landings, the only truly time-critical maneuvers are the ones right before landing... and unfortunately, a lot of fairly recent moon probes have failed due to incorrect calculations, sensor measurements, logic errors, etc.
Improper control surface actuation during re-entry = loss of vehicle/crew
Also, rocket engines that are powered by the combustion of their fuel and oxidizer (the exhaust gases of which drive the main pumps) have a very specific startup sequence. For example, if any of the combustion chambers have a mix of oxygen and hydrogen too close to stoichiometric when the igniters fire, you get an explosion, not a burn. Not too dissimilar from what happens in car engines when you get detonation (which is very different from knocking. Detonation melts holes in stuff.)
Startup initially is open-loop with no feedback or adjustment based on sensors and then at some point the computer switches over to closed loop control. It starts with hydrogen first. The sparklers? Those aren't for igniting the engine, that's done by igniters inside the combustion chamber(s). The sparklers are to ignite all the hydrogen that is pushed out the nozzle initially so there's a very fuel-rich environment in the engine and it doesn't go kaboom.
If things go wrong - such as a valve not opening as fast as it should, or not being opened the right amount at the right time - the engine goes kaboom. This happened to a bunch of engines during development and testing.
But Artemis has basically the same engines, so...shrug
Travelling through Max-Q in Earth atmosphere on ascent is far more dangerous.
OTOH, consider that the "pick the majority from 3 CPUs" approach that seems to have been used in earlier missions (as mentioned in the article) would fail the same way if two CPUs computed the same erroneous result.
Under the 3-voting scheme, if 2 machines have the same identical failure -- catastrophe. Under the 4 distinct systems sampled from a priority queue, if the 2 machines in the sampled system have the same identical failure -- catastrophe. In either case the odds are roughly P(bit-flip) * P(exact same bit-flip).
The article only hints at the improvements of such a system with the phrase "simplifies the complex task", and I'm guessing this may reduce synchronization overhead or improve parallelizability. But that's a pretty big guess, to be fair.
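Here's the back-of-the-envelope arithmetic behind the "same coincidence either way" point, with invented numbers. Both the per-cycle error probability and the chance that two independent errors are identical are assumptions, not data about any real hardware.

```python
from math import comb

p_err = 1e-9      # assumed: probability a CPU yields a wrong answer this cycle
p_match = 2**-32  # assumed: probability two independent errors are identical

# Two units producing the exact same wrong answer:
# roughly P(bit-flip) * P(exact same bit-flip), as in the parent comment.
p_identical_pair = p_err * p_err * p_match

# 3-way majority voting is defeated when any 2 of the 3 CPUs agree on
# the same wrong answer, so multiply by the number of pairs among three.
p_vote_defeated = comb(3, 2) * p_identical_pair

# A single lockstep pair is defeated by exactly one such coincidence.
p_pair_defeated = p_identical_pair
```

So the two schemes differ only by a small constant factor in this failure mode; the dominant term is the same identical-error coincidence in both.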
I’d love to know how often one of the FCMs has “failed silent”, and where they were in the route and so on too, but it’s probably a little soon for that.
Personally I find the project extremely messy, and kinda hate working with it.
[0] https://youtu.be/4doI2iQe4Jk?si=ucMoIdw7x_QgZR32
At about 1:20, the presenter says the BFS uses a different OS and hardware (not sure if that means a different instance, or a different class, so to speak).
I asked him “how did you deal with bugs”? He chuckled and said “we didn’t have them”.
The average modern AI-prompting, React-using web developer could not fathom making software that killed people if it failed. We’ve normalized things not working well.
Low quality for a shopping cart feels fine until someone steals all the credit card numbers.
This is not to say your code should be a buggy mess, but 98% bug free when you're a SaaS product and pushing features is certainly better than 100% bug free and losing ground to competitors.
For example, the OS it seems to be running is INTEGRITY-178.
https://www.ghs.com/products/safety_critical/integrity_178_s...
Aerospace tech is not entirely bespoke anymore, plenty of the foundational tech is off the shelf.
Historically, the main difference between ICBM tech and human spaceflight tech is the payload and reentry system.
We do not know how much of the high-level architecture of the system has been specified by NASA and how much by Lockheed Martin.
You can read more about it below (when the server isn't throwing errors). https://ntrs.nasa.gov/api/citations/20190000011/downloads/20... https://ntrs.nasa.gov/api/citations/20230002185/downloads/FS...
The claim was that some plain old chips are exquisitely radiation resistant, and it's not clear why.
Their redundancy architecture is interesting. I'd be curious what innovations went into rad-hard fabrication, too. The Sandia Secure Processor (aka Score) was a neat example of a rad-hard, secure processor.
Their simulation systems might be helpful for others, too. We've seen more interest in that from FoundationDB to TigerBeetle.
To expand on this: when Tandem switched to MIPS from their proprietary processors, the CPUs were duplicated on a board and compared; if they disagreed, the logical CPU would halt, similar to Stratus. The software-pair backup processes on a different logical CPU would then take over.
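A minimal sketch of that compare-and-halt behavior, with all names invented for illustration (the real machines do this in hardware, per instruction or per bus cycle, not per function call):

```python
class LockstepMiscompare(Exception):
    """Raised when the duplicated CPUs disagree: the logical CPU fails silent."""

def lockstep_step(cpu_a, cpu_b, inputs):
    """Run the same computation on both halves of the pair and compare."""
    a = cpu_a(inputs)
    b = cpu_b(inputs)
    if a != b:
        # Halt instead of emitting a possibly-wrong answer; a backup
        # process on a different logical CPU takes over from here.
        raise LockstepMiscompare("duplicated CPUs disagree, halting")
    return a

# Healthy pair: both halves compute the same thing.
assert lockstep_step(lambda x: x * 2, lambda x: x * 2, 21) == 42

# Faulty pair: one half suffers a simulated bit flip; the logical CPU halts.
try:
    lockstep_step(lambda x: x * 2, lambda x: (x * 2) ^ 1, 21)
except LockstepMiscompare:
    pass  # fail-silent, as intended
```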
Astronauts have actual phones with them (iPhone 17s, I think?) and a regular ThinkPad that they use to upload photos from the cameras. How does all of that equipment work fine with all the cosmic radiation floating about? With the iPhone's CPU in particular, shouldn't random bit flips be causing constant crashes? Or do these errors simply happen without anything detecting them, so execution continues unhindered?
They’re not radiation hardened, so given enough time, they’d be expected to fail. Rebooting them might clear the issue or it might not (soft vs hard faults).
Also impossible to predict when a failure would happen, but NASA, ESA and others have data somewhere that makes them believe the risk is high enough that mission critical systems need this level of redundancy.
Yes, for sure, but that's not my question - it's not a "why is this allowed" but "why isn't this causing more visible problems with the iphones themselves".
Like, do they need constant rebooting? Does this cause any noticable problems with their operation? Realistically, when would you expect a consumer grade phone to fail in these conditions?
Space is a harsher environment but they’re only up there for like a week. So, if there were an incident, it would be more likely to kill the devices, but it’s not very likely to happen during the short period of time (while still being more likely than on earth’s surface).
That said, part of the point of them taking these devices up is to find out how well they perform in practice. We just don’t really know how these consumer devices perform in space.
It will be interesting to see the results when they’re published!
IIRC the helicopter on Mars used the same Snapdragon CPU as in your phone.
Also, a bit flip can happen without you knowing. A flip in free RAM, or in a temp file that isn't needed anymore, won't manifest as any error; but then your system isn't really deterministic anymore, since you now rely on chance.
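A toy demonstration of that point: a flipped bit only "manifests" if something actually rechecks the data. Nothing here is specific to spacecraft; the data and the choice of CRC are just for illustration.

```python
import zlib

data = bytearray(b"attitude quaternion block")
stored_crc = zlib.crc32(data)   # checksum computed when the data was written

data[3] ^= 0x04                 # cosmic-ray-style single-event upset

# With ECC- or scrubbing-style checking, the corruption is caught:
assert zlib.crc32(data) != stored_crc

# Without any check, the corrupted bytes would simply be used and the
# program would carry on, now depending on chance.
```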
Basically, yes, radiation does cause bit flips, more often than you might expect (still a rare event in the grand scheme of things, but often enough to matter).
And radiation in space is much "worse" (in quotes because that word glosses over a huge number of different problems, not just intensity).
That was in the 2000s though, and for embedded memory above 65nm.
And obviously on earth.
Also https://en.wikipedia.org/wiki/TTEthernet looks like bolting time-guaranteed switching networks onto randomizing ethernet hardware. Sounds incredibly cheap and stupid. Either stay with guaranteed real-time switching, or give up on hard real-time guarantees and favor performance, simplicity and cheap stock hardware.
Monkeys in space.
I assume this means they are using a digital twin simulation inside the HPC?
The extensive use of simulators and emulators has been particularly critical, enabling parallel design and development workflows to compensate for incredibly expensive hardware and its long lead times. So this helped with development bottlenecks too.
https://ntrs.nasa.gov/api/citations/20190000011/downloads/20...
- after a general solution to the extraterrestrial manufacturing bootstrap problem is found, and
- before the economy patches the exploit that a scalable commodity with near-zero cost and non-zero value can exist.
It'll also destroy the commercial launch market, because anything of size you want can be made in space, leaving only tiny settler transports and government sovereign launches viable; so I'm not sure why commercial space people find this a commercially lucrative thing. The time frame within which this IMG can exist can also be zero or negative.
The assumption also goes that they'll find a way to rent out some rocks for cash, so anyone with access to rocks will be doing it as it becomes viable; so I'm not even sure the "space" part of space datacenters even matters. Earth is kind of space too, in this context.
It would be really cool to see a visualization of redundancy measures/utilization over the course of the trip to get a more tangible feel for its importance. I'm hoping a bunch of interesting data is made public after this mission!
Can't find a Wikipedia article on it, but the Times had an article in 1981:
https://www.nytimes.com/1981/04/10/us/computers-to-have-the-...
Apparently the 5th was a standby, not the decider.
> It’s a complex machine. There’s three computers all talking to each other for a start, and they have to agree on everything.
Primary, Real-Time Secondary and Third for regulating votes.
https://www.bbc.co.uk/news/articles/ckkknz9zpzgo
I would expect to see multi-party-signed deterministic builds etc. Anyone have any insight here?
I would -hope- NASA does not trust their OS supply chains to a single person for high risk applications, but given even major companies I audit do this with billions of dollars on the line, it would not shock me if NASA has the same stance which worries me a bit.
They would need to be using something like heavily customized buildroot or stagex to produce deterministic OS images.
This electrify & integrate playbook has brought benefits to many industries, usually where better coordination unlocks efficiencies. Sometimes the smarts just add new failure modes and predatory vendor relationships. It’s showing up in space as more modular spacecraft, lower costs and more mission flexibility. But how is this playing out in manned space craft?
Typo in the first sentence of the second paragraph is sad though. C'mon, proofread a little.
2.
Two.