pleroma.debian.social

A discussion about why mmap is bad, and why storing binary data directly (without explicit serialization), as LMDB encourages, is even worse: https://news.ycombinator.com/item?id=47214413

The main argument against is this "portability" bugbear, which is entirely irrelevant. An embedded DB is meant to operate within the context of a single machine, so portability to multiple architectures is irrelevant. Nobody shares raw database files concurrently across multiple machines, that's a complete non-issue.

As usual, the HN crowd is a bunch of clueless clowns. These folks get it now, though https://github.com/kermitt2/delft

Storing float32 values directly in LMDB *enhanced* their interoperability between their Python code and Java libraries. When all your usage is on a single machine, other serialization methods just get in the way. They're not just unnecessary, they're actively harmful to performance and interoperability.

(Screenshot of a github page)
Changes in DelFt 0.4.1:

LMDB embedding format changed: Embeddings are now stored as raw float32 bytes instead of pickle-serialized objects. This enables Java interoperability (used by GROBID) and improves performance.
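
A minimal sketch of the difference the changelog describes, using only the Python standard library. The vector values are made up for illustration, and this is not DeLFT's actual code; it only contrasts raw float32 bytes with a pickled Python object.

```python
import array
import pickle

# Hypothetical 4-dimensional embedding; real embeddings are much longer.
vector = [0.5, -1.25, 3.0, 0.0]

# Raw encoding: native-endian IEEE-754 float32, 4 bytes per value.
raw = array.array("f", vector).tobytes()

# Pickle encoding: Python-specific framing around the same values.
pickled = pickle.dumps(vector)


def decode_raw(buf):
    """Reinterpret a bytes buffer as float32 values; no parsing step."""
    out = array.array("f")
    out.frombytes(buf)
    return out.tolist()


decoded = decode_raw(raw)
print(len(raw), len(pickled), decoded)
```

The raw form is just the values laid end to end, so any language that can read a float32 array can consume it, provided the reader knows the element type and byte order. The pickled form can only be decoded by something that implements the pickle protocol.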

@hyc oh wow, that guy seems to be afraid of programs

In contrast, a network protocol's entire purpose is communication and interoperability between different machines. Portability is paramount there. But portability is completely irrelevant within a single machine. A single CPU operates on the integers and floats, regardless of what language a program was written in.

Having spent a lifetime implementing communication protocols, I'm keenly aware of the importance of portability. It just doesn't apply here.

Sidenote: I tried to correct their misunderstanding there, and my comment was downvoted. As Isaac Asimov said, "There is a cult of ignorance in the United States, and there has always been. The strain of anti-intellectualism has been a constant thread winding its way through our political and cultural life, nurtured by the false notion that democracy means that 'my ignorance is just as good as your knowledge.'"

Nowhere is it more rampant than on forums like HackerNews.

@hyc wait, do these dim-witted animals think that IEEE-754 has portability issues, or are they too daft to either not ship embedded databases over the network or, if they really want to, document which endianness their project uses?
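
As a rough illustration of the endianness point: the IEEE-754 bit pattern of a float32 is the same everywhere, and only the byte order of each 4-byte value differs between machines. The sketch below uses Python's `struct` module with explicit byte-order prefixes; the sample values and the `swap32` helper are hypothetical.

```python
import struct
import sys

values = [1.0, -2.5, 0.15625]

little = struct.pack("<3f", *values)  # explicitly little-endian
big = struct.pack(">3f", *values)     # explicitly big-endian
native = struct.pack("=3f", *values)  # whatever this machine uses


def swap32(buf):
    """Reverse the byte order of every 4-byte value in buf."""
    return b"".join(buf[i:i + 4][::-1] for i in range(0, len(buf), 4))


# A file documented as little-endian float32 decodes the same on any host.
restored = struct.unpack("<3f", little)
print(sys.byteorder, restored)
```

A project that documents "little-endian float32" has said everything a reader on another architecture needs; at worst that reader pays for one byte swap per value.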

@Archivist well, tbf, you can ship SQLite DB files across the network. They decided machine portability was a priority. LMDB prioritizes efficiency instead.

@hyc
To be completely fair, SQL requires you to convert to and from string representation the whole time. If performance *really* matters, anything SQL-based is a mistake
@Archivist

@wouter @hyc @Archivist this is not necessarily true - sql is just fine with storing binary data, and for example sqlite offers you a fine way to transparently store binary data w/o ever converting. I do this all the time & it is plenty fast.

@bert_hubert
You mean like https://www.sqlitetutorial.net/sqlite-blob/, correct?

Your SQL engine would still be involved, and needs to do string parsing just to understand what it is you want to store in binary. Yes, it's faster than converting to base64 and back, but not as fast as a memcpy à la LMDB.

I said "*really*" for a reason 😉
@hyc @Archivist
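
A loose sketch of what "memcpy-style" access means here, assuming a plain file of float32 values rather than a real LMDB database. It uses Python's `mmap` and `memoryview.cast` to read the floats in place; it does not use or imitate the LMDB API, and the values are made up.

```python
import array
import mmap
import tempfile

values = array.array("f", [0.5, -1.25, 3.0, 8.0])

with tempfile.TemporaryFile() as f:
    # Write the raw float32 bytes, exactly as they sit in memory.
    f.write(values.tobytes())
    f.flush()

    # Map the file read-only and view the mapping as float32 values.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        with memoryview(mm) as raw_view:
            floats = raw_view.cast("f")   # reinterpret, no decoding step
            mapped = floats.tolist()      # copy out only for the demo
            floats.release()

print(mapped)
```

Until `tolist()` is called, nothing is parsed or converted: the view points straight at the mapped pages, which is the property the thread is arguing about.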

@wouter @hyc @Archivist I suggest trusting me on this one. There is no string parsing involved with a prepared query where the parser only sees '?' and that only once. Also I never stated anything about this being as fast as lmdb, please respond to things I actually said.
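
A small sketch of the prepared-query approach described above, using Python's built-in `sqlite3` module. The table and column names are hypothetical. The SQL text contains only `?` placeholders, and the bytes are bound as a BLOB parameter rather than embedded in the statement.

```python
import array
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (word TEXT PRIMARY KEY, vec BLOB)")

# Raw float32 bytes, native byte order.
raw = array.array("f", [0.5, -1.25, 3.0]).tobytes()

# The parser only ever sees the placeholders; the blob is bound as-is.
conn.execute("INSERT INTO embeddings (word, vec) VALUES (?, ?)", ("example", raw))

(stored,) = conn.execute(
    "SELECT vec FROM embeddings WHERE word = ?", ("example",)
).fetchone()

restored = array.array("f")
restored.frombytes(stored)
print(restored.tolist())
conn.close()
```

The bytes that come back are identical to the bytes that went in, with no text conversion along the way. This says nothing about how the speed compares with LMDB, only that the binary data itself is never turned into a string.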

@bert_hubert
I didn't mean to offend, but "sqlite can't be as fast as LMDB because of its design choices" is the statement *I* was trying (but perhaps failing) to make.

That doesn't make those design choices invalid -- in most cases I prefer SQL for ease of debugging -- but if performance *really* matters, LMDB is really the only option.

That doesn't mean 'everything else is slow', just that LMDB is faster
@hyc @Archivist

@wouter @bert_hubert @Archivist you won't say it but I will: everything else is slow.

;)