The surprisingly complex journey to text-selectable client-side generated PDFs

FFailMore 2 days ago 56 commentsRead Article on sdocs.dev

HI version is available. Content is displayed in original English for accuracy.

⚡ Community Insights

Discussion Sentiment

71% Positive

Analyzed from 3150 words in the discussion.

Discussion (56 Comments)Read Original on HackerNews

pimlottc•20 minutes ago

I would look for any other well-known formats that already have a dependable pipeline for rendering to a selectable text PDF. Perhaps LaTeX? Then make a converter to covert from markdown to that format.

mvdwoord•about 15 hours ago

"everyone hates PDFs where you can't reliably select and copy text!"

Boy do I. One of my biggest annoyances is receiving an invoice in pdf format, where I can either not select the text at all, or where you cannot cleanly select text, i.e. when you try to select something it somehow half highlights the line above as well and I am not sure what is on my clipboard, and need to paste temporarily in a text editor, then select what I need ... etc

Super nice when the list IBAN numbers for payment in a tiny font size as well.

Maybe I should vibecode a little helper. tool to visually select a rectangle and perform OCR and detect IBAN numbers or show a popup with proper text to do my subselect.

13324•about 14 hours ago

Personal lifehack: Use the address bar of your browser to view the clipboard content quickly and to omit any formatting from it.

cluckindan•about 8 hours ago

This also conveniently sends it to your search provider, and possibly to the browser vendor for analytics.

pimlottc•25 minutes ago

These days it’s hard to be sure of anywhere you can paste a piece of text and be certain it’s not being sent to a server somewhere.

illusive4•about 5 hours ago

Depends on your settings.

Steve44•about 14 hours ago

We have a couple of large customers who will only send remittance advices as a PDF, the are several pages and a couple of hundred rows. Apparently their system can not send XLSX or any other format.

I've been a happy user of Tabula[1] for a few years and it works really well, for my needs anyway.

I just import, auto-detect tables, select "Stream", and then export to a CSV.

[1] https://tabula.technology/ [1] https://github.com/tabulapdf/tabula

ezst•about 9 hours ago

> select a rectangle and perform OCR

You get that out of the box on an up to date KDE, from the screenshoting application Spectacle.

pegasus•about 14 hours ago

For the mac there's TextSniper which does just that.

seemaze•about 9 hours ago

On mac, I just do a quick screenshot and use the builtin OCR in Preview to select and copy text all the time.

mock-possum•about 14 hours ago

There’s a Power Toys util for windows that does the same - draw a rectangle around the text you want, and it OCRs it to your clipboard.

ak217•about 9 hours ago

Tangentially related, one of the most underappreciated projects for print media and PDF generation is paged.js. It goes down the rabbit hole of paged media and the surprising complexity of it (have you ever thought about what it takes to render a table across multiple pages?) and provides a great foundation for solving these problems with sanity and using open web standards. It's a project that deserves more support.

FailMore•about 9 hours ago

Nice shout - I will check it out - this I assume: https://github.com/pagedjs/pagedjs/

000000000001•about 15 hours ago

I vibecoded a pdf replacement at work, sort of.

I wanted a way to make submitting Inventory Changes at work easier, so I took the pdf, used StirlingPDF to convert to an html bundled zip, I converted the .png with the form border and symbols to base64, and then wrote a powershell script to replace <p> tags with variable data from a csv export of our inv data(I tried to use odbc to extract it but once a dev showed me the logical, physical, and views that made up our Inventory lookup I went back to using a xlsx export that is built into our environment and letting the ps1 trim and sanitatize the input). As the conversion places text with absolute positioning, I was able to fine tune the layout and spacing. I then used my local AI qwen3.6-27b to convert my ps1 to a single html binary webapp with html/css/js, no external framework, two js scripts are loaded via cdn for now.

Inspired by how well that worked I vibecoded a drag and drop editor to build forms for other processes, I upload a png w which gets converted to base64 and then I can drag and drop text elements to where they need to be and export.

I know how many people feel about AI coded projects so these are really only for me, I didn't expect my coworkers to adopt it or anything but they did.

gadders•about 15 hours ago

While I think you would be stupid to try and vibe-code EG a banking platform or SAP, I do think there is a lot of scope for small apps that can solve a particular issue.

I've just vibe-coded a mobile app, and I'm comfortable doing that because this particular app can't update data or send data anywhere. It's damage blast radius is pretty low.

FailMore•about 15 hours ago

Sounds pretty interesting. Got a video of it? Sometimes our vibe coded tools are pretty useful... but sometimes we can be a bit shy to share them given the vibeness...

lavrton•about 13 hours ago

Just was solving the exact same issue.

Recently I released https://polotno.com/render-tag/ library to render rich text into 2d canvas context. And it turns out it was very easy to adapt it to work with pdflib library (via 2d canvas <-> pdf context) proxy. I was able to render good set of rich text features. Thinking to make that bridge open source as well. Maybe you will be interested in that?

FailMore•about 12 hours ago

Yep sounds good!

josefrichter•about 18 hours ago

It’s not that surprising. It’s one of those well known pandora boxes of web development: email templates, PDFs, printing,…

FailMore•about 17 hours ago

Ah, I didn't know that. It's not something I had worked on before, and the file format is highly prevalent (so I assumed things would be easy), so it was surprising to me

SirHumphrey•about 17 hours ago

Nothing about PDF is easy. Similarly to what once Tom Scott said about time zones, every time I must deal with PDFs I pray that PDF.js can be hacked in to doing it instead, otherwise I just don’t bother.

It’s on of the few examples when converting it in to picture and chucking it in a multimodal llm is a more sensible solution than trying to parse it.

caspper69•about 15 hours ago

You would think that, but PDF is not really a format for text. It's a format that describes typography and graphics layout & formatting. It's not uncommon for a text pdf to not contain all of the text it renders (due to ligatures).

ashishb•about 16 hours ago

Software engineers drastically underestimates GUI - Web layouts, mobile app layouts, and even PDF layouts are non-trivial pieces of work to get right in all circumstances.

freedomben•about 16 hours ago

Nobody who has actually worked on those things think that. You might want to qualify if you're only talking about people who have never worked in this area.

In my experience it's the NON software engineers who tend to underestimate the complexity

FailMore•about 16 hours ago

Yep, they (can) rarely enter your domain... so it's easy to assume its going to be trivial (maybe because things like .md or .txt files are trivial, so it's easy to think there's not much of a delta)

evolve2k•about 3 hours ago

The second part of the article moves into a very confusing to read user experience.

That said, maybe I missed it, but I didn’t see any mention of pandoc that is known to do markdown > pdf rendering “client side”.

This wasn’t AI written by chance?

gobdovan•about 16 hours ago

Thanks, this puts into perspective why copy-paste from PDFs is so bad.

I months into building a pasteboard transform library that normalises VS Code, Google Docs, PDFs and a bunch of Chromium apps provider-specific data so I can start pasting everything everywhere exactly how I want it. It's much, much messier than I expected.

Apps put different UTTypes on the pasteboard that are not really compatible with each other. Usually there's a plain text fallback, then rich text/HTML, then provider-specific data. You show how much insane work is needed just to make text selectable with glyph mappings, layout, links, code blocks, rendered styles, etc. But once you copy from that PDF, most viewers still only expose raw text, and often broken raw text at that...

FailMore•about 16 hours ago

Yep, it is a very interesting space for improvements imo. Kind of broadly speaking copy and paste is so central to working with a computer in a smooth way it should probably have more power / quality built into it (e.g. not having to install some random plug in to get clipboard history, etc.)

gobdovan•about 16 hours ago

Part of it is security/privacy and providers avoiding liability. People constantly copy passwords, tokens, personal data, etc., so clipboard history is risky by default. Apple probably does not want to expose a rich API here and then be responsible for securing that surface forever.

So macOS does not really give you a clean "this app copied this semantic object" API. Clipboard-history apps generally poll NSPasteboard.changeCount, which already makes provenance fuzzy, since you can observe that the pasteboard changed, but not reliably know the source app.

Pasting is fuzzy too. You know what representations were available, but not what the destination app actually accepted, because that decision happens inside the app and is generally opaque for the OS. So what even is history? Is it the raw object, the fallback text, the richest representation, the thing you intended to paste, or the thing the target app consumed? And even if you define history as "the observed events", polling can also miss states. And once you add transforms (like I want to), you basically have to define your own history model. A coherent OS clipboard-history API probably will never happen without big effort and liability policy changes from providers.

Worf•about 16 hours ago

PDFs should be only for printing or maybe for keeping scanned versions of things. For anything else they're just not the right tool for the job. Not for things meant to be accessed on a computer like books, scientific papers or, for some weird reason, catalogs and price lists from websites.

We have responsive and open standards like HTML and EPUB (zipped XTML) and they work great. arXiv has HTML papers, and libgen and anna's archive often have EPUB versions of books. The issue for me with EPUB is the lack of good readers now.

jkscm•about 15 hours ago

slighlty disagree with this. A fixed page layout has it's own advantages. The reason we have more high quality pdf readers than epub readers is probrably connected to the format itself. PDF readers usally are more more feature complete when it comes to stuff like annotations too.

Worf•about 15 hours ago

My issue with EPUB readers is about features a PDF reader wouldn't even have. Small and annoying things like how much freedom I get when changing fonts and whatnot.

I haven't had a need to use annotations. I guess that could be solved by EPUB editors, but I haven't tested any, apart from any text editors after unzipping the EPUB.

kuekacang•about 14 hours ago

Somewhat agree!. I think my main gripe is that there are no good-enough reader that: - without account integration - has good ux

Even some of the okay ones, most are, as tantacurl may say, janky.

gf000•about 16 hours ago

I don't know, I really love a well-typeset books/papers. Especially when they feature figures that are deliberately placed close to the relevant section in the text, it's just not something we can replicate with HTML, that can barely do proper justified text.

Sure, I would like that beautifully designed page to magically become a single column beautiful document on my phone, but I will take the former over a badly designed text extract where the relevant figure is 10 pages away.

Epub (=html) is good for novels, but there is nothing replacing PDF for science papers. If anything, the latex (or ideally typst) source would come the closest, if properly written (not absolute offsets). That could be used to produce different page sized versions.

Worf•about 15 hours ago

The "figures that are deliberately placed close to the relevant section in the text" is something I've heard often, and I'd agree to an extent. But the figure is never 10 pages away (unless you have a tiny screen or something). It's easy to put an image inbetween 2 paragraphs. With PDF papers 1 figure is often referenced in several places throughout the paper so I just open 2 windows with the paper anyway.

For justified text - what's the point of stretching each line artificially just so they align at the end? It looks awful to me even when done "correctly". Having uneven spaces makes it harder to read. Having every line align on the right also makes it harder to read. When you have uneven lines, I subconsciously use the different at the end as an anchor for where I am in the text or where a certain phrase was. Hyphenating words is another thing that doesn't make a lot of sense nowadays - we have enough words with a hyphen naturally in them, so reading a broken up word is mentally taxing as I have to figure out if it's a normal word with a hyphen or a broken up one.

All the arXiv HTML papers are much better to read in the browser, IMO. And they'll only get better. PDF will likely stay the same.

For small screens like phones or tablets, having to constantly scroll up and down and left and right for a 2-column paper is just painful. PDF is much better on a big screen.

jkscm•about 15 hours ago

Just look at examples like this: https://www.google.de/books/edition/Exploration_Map_by_Map/s...

You will realize that saying "PDFs should be only for printing" is a vast oversimplifcation for the requirements people have for different kinds of documents.

mcdonje•about 16 hours ago

EPUB is the ebook standard, outside of Amazon-land, so it has staying power in its space. I think it would be good for the ecosystem if it broke containment and got tooling in enough places to challenge PDF.

FailMore•about 16 hours ago

Interesting point. What do you feel about the "business world"'s heavy use of PDFs? There is something to be said about the file format being trusted/so dominant now... probably some random sequence of events led to this happening... but perhaps hard to shift

Sharlin•about 16 hours ago

Because the business world used to run on paper, and pdf became the de facto standard desktop publishing file format because Adobe became the de facto king of desktop publishing. Storing, transferring, and reading documents on paper has given way for doing all of that digitally, but path dependency guarantees that there’s no way of getting rid of PDF now.

Purely psychologically, I think there’s something that feels more "secure" or long-lasting about PDF’s perceived quasi-immutability compared to formats designed to be edited.

FailMore•about 16 hours ago

Yeah, I think the point about editing is a very good one. There is something comforting about them and perhaps that's it (+ maybe we are used to them being A4 pages, so you know what to expect). I think also the lack of flexibility with rendering is good, if you see it on one device you know exactly how it will look on another device.

danhor•about 16 hours ago

A PDF of a long document such as a standard or reference manual is almost always preferable to an HTML version. HTML versions have issues with formatting, searching (as browsers struggle with multi-thousand page documents and non-native search document search implementations almost always suck), indexing, correct behavior on windows size change (especially a side-by-side pdf view is almost unheard of for webpages), ... . Some vendors have switched to online-only for some documents and it always annoys me.

Worf•about 15 hours ago

> correct behavior on windows size change

Except the PDF is not responsive at all and you can't increase or decrease the font size without increasing the whole width of page.

> Some vendors have switched to online-only for some documents and it always annoys me.

HTML shouldn't mean online-only. If the vendor isn't trying to make it hard to download, you should always be able to convert to PDF. But PDF to HTML is very hard or impossible.

adrian_b•about 14 hours ago

In any technical/scientific document I do not want to increase or decrease the size of any element, e.g. of one of the fonts.

You only want to do an overall proportional zoom, when needed.

A well-designed document page has appropriate size ratios between various kinds of texts, formulae, tables and images, which should not be corrupted by changing the size of a single element.

The pages where the author has not formatted them adequately are ugly and hard to understand, which is what you typically see when this kind of content is written as HTML/EPUB documents, which are rendered non-deterministically.

Lazy writers may like HTML, but readers who must read and search through vast amounts of technical documentation do not like it.

There are many good PDF readers that are adequate for reading and searching even huge documents, but I have never seen any tool that works acceptably for EPUB/HTML big documents, which is not surprising because no tool can compensate the fact that the writer of the document did not design the layout of the pages carefully.

adrian_b•about 15 hours ago

HTML and EPUB work not great, but very badly for scientific or technical papers or books.

No two readers render them alike, and they typically are much uglier and more difficult to use than books (sometimes even the same book) in PDF, DJVU or ODT formats.

I read a very large quantity of technical documentation and I always avoid EPUB and HTML like the plague. I use such formats only when there is no alternative.

On Linux, mupdf is a decent EPUB reader, which is very fast and it usually does a better job at formatting pages than most other EPUB readers that I have tried on Linux.

For fast navigation and searching, especially in technical documentation with hundreds or thousands of pages, it is very useful for the document to be well partitioned into pages and the page layouts to be well designed, like for a printed book, even if this may seem unnecessary for a document stored in a computer.

HTML and EPUB documents are seldom divided in uniform pages and the position of various elements, like tables or figures can vary between readers or even with the same reader in different circumstances, so when you search various things you are slowed down in recognizing them, because they may not be in the position where you have seen them previously. Moreover, in HTML and EPUB documents, depending on the reader, the size ratios between various elements may be inappropriate, making the pages ugly and/or hard to understand.

All the defects of HTML and EPUB documents are caused by the fact that the writer of the document normally does not take full responsibility for the appearance of the pages, delegating this to the browsers/readers, which seldom do a good job for scientific/technical documents full of formulae, tables and figures.

This may be fine for normal Web pages, but it is not acceptable for technical and scientific documents.

In theory, one could design carefully HTML pages and the associated CSS files, to be rendered deterministically, but I have encountered very rarely such documents.

Worf•about 14 hours ago

I understand your point that you want a fixed presentation layout and pagination. I prefer to be able to responsively resize the document and to follow the ToC instead of pages. I've yet to see documentation that's thousands of pages long that doesn't include a very detailed ToC. For me remembering "section 8.3.16.2" is better and makes more sense than remembering "page 1292". I've had to read scanned math books years ago and I remember using some PDF reader to put bookmarks that correspond to the ToC, so I'd put "2.5.1.2 - Theorem about X" in the sidebar. That's how I was able to actually go back and forth easily. With just pages it would've taken me tens of seconds to locate a theorem (or lemma, proof, definition, whatever). And it was a dense book, so I had to constantly go back and reread stuff. But I agree that PDFs can, and often do, have ToCs, too.

> the size ratios between various elements may be inappropriate

I can't recall having this issue on websites or on EPUBs. What kinds of elements are we talking about? HTML and CSS are pretty good at keeping sizes from what I've seen. I agree that there are many EPUB readers, most of them very unpolished. And perhaps there aren't EPUB readers that are good at everything, yet.

For formulas, MathML and other tech has been satisfactory. I was able to find this basic math paper arXiv uses as a demo for their HTML papers:

https://ar5iv.labs.arxiv.org/html/1910.06709

It doesn't have figures, but the math is rendered perfectly. I can easily remove the "justify" style and increase and decrease the letters. If it was a long paper, it would've been nice to have a clickable ToC, but most EPUBs have one.

I think that right now most EPUB readers and some HTML renderings are bad, but I believe they'll get better.

adrian_b•about 14 hours ago

While browsers like Chrome and Firefox normally render very well on-screen any Web page, it is enough to give them the "Print" command for that page, to see in most cases a badly rendered page, where the size ratios of various elements are bad and they overlap or are mis-positioned so that the "printed" page is completely unlike what the browser shows on screen.

The way how the "printed" pages look in Firefox and Chrome demonstrates the same rendering problems that appear in most EPUB readers.

I have no idea which is the cause of this, but the bad behavior of "printing" in Firefox and Chrome has existed for years. Not all browsers behave the same, e.g. Vivaldi usually is much better at generating "printed" pages, than Chrome, despite being derived from the same code base.

Perhaps the great differences between on-screen rendering and "printed" rendering is caused by the fact that badly designed HTML/CSS might specify some sizes in "pixels" or other such inappropriate units, instead of using length units, like points, inches or millimeters. Then when rendering on different media the size ratios are corrupted.

yolo3000•about 14 hours ago

Today I had to edit in MacOS a pdf which had some text fields. It had 3 places which resembled checkboxes, so in those I was able to place an `x` character. When I saved it and previewed it, the `x` was in 2 places, but not the third. I tried several times but couldn't make it work. I gave up, started a Claude session, and asked it to fill in the `x`. 4k tokens of Python later it managed to place the `x` correctly.

jiehong•about 9 hours ago

This page made me discover the tool, and I find sdocs.dev pretty cool!

FailMore•about 9 hours ago

Thank you very much! I launched via a Show HN a few weeks ago [1] which goes into quite a few details which might not be obvious.

Also one thing that probably isn't clear with everything I've written so far is that by far the most enjoyable use case I've found with SmallDocs (or the `sdoc` cli command) is telling your CLI-based agent to "sdoc" you things. E.g.:

* "write up the plan and sdoc it to me"

* "explain async/await to me simply in a sdoc"

* "draft the release notes as a sdoc I can send to Ben for feedback"

* etc.

[1] https://news.ycombinator.com/item?id=47777633

cbolton•about 16 hours ago

I wonder if using Typst would be a viable solution: the compiler can be built into a wasm component that runs locally in the browser (that's what the Typst webapp does) and it generates good PDFs with working selection/copy/paste.

There's even a package (cmarker) than can translate Markdown to Typst which could be enough for a MVP.

LuttelBurchtje•about 14 hours ago

The Swiss army knife for document conversion, Pandoc, supports compiling to WASM since 3.9 [1]. It supports Markdown, in a wide variety of flavours, and Typst. Their official demo page provides a PDF output via Typst, all done client-side [2]. Furthermore, you get .docx and other output formats as well

[1]: https://github.com/jgm/pandoc/releases/tag/3.9

[2]: https://pandoc.org/app/

alansaber•about 16 hours ago

You don't know the hell of trawling through PDF XML and HTML construction until you've done it