The first programming language I learned was Java. And for us non-native speakers who didn't know English very well at that point, "public static void" did indeed sound like a magic spell. It sat behind both an understanding barrier and a language barrier.
When I first saw Java, I had already seen multiple dialects of BASIC, plus Turing (a Pascal dialect), HyperTalk (the scripting language of HyperCard, and predecessor of AppleScript), J (an APL derivative), C and C++. I'm also a native speaker of English.
Your perception is still warranted. It was clear enough to me what all of that meant, but I was well aware that static is an awkward, highly overloaded term, and I already had the sense that all this boilerplate was a negative.
One of the problems is that a lot of bioinformatics formats nowadays have to hold so much data that most text editors stop working properly. For example, FASTA splits DNA data into lines of 50-80 characters for readability. But in FASTQ, where the '@' and '+' record markers collide with the quality scores, as far as I know, the DNA and the quality data are always put onto one line each. Trying to find a location in a 10,000-character line gets very awkward. And I'm sure some people can eyeball Phred scores from ASCII, but I think they are a minority, even among researchers.
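For reference, a FASTQ record is four lines; here's a toy example (real sequence and quality lines can run to thousands of characters):

    @read1
    GATTACAGATTACA
    +
    IIIIIHHHHGGGFF

And since Phred+33 quality strings can themselves begin with '@' or '+', a wrapped quality line would be indistinguishable from a record marker, hence the one-line convention.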
Similarly, NEXUS files are also human-readable, but it'd be tough to discern the shape of an inlined 200-node Newick tree.
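To illustrate, here's a Newick tree with just four leaves; now picture a 200-node version of this on a single line:

    ((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.06);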
When I asked people who do actual bioinformatics (well, genomics) what some of their annoyances with bioinformatics software were, having to do a bunch of busywork on files in between pipeline steps (compressing/decompressing, indexing) was one of the complaints mentioned.
I think there's a place in bioinformatics for a unified format which can take care of compression, indexing, and metadata, and with that list of requirements it'd have to be binary. Data analysis moved from CSVs and Excel files to Parquet, and I think there's a similar transition waiting to happen here.
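If it helps make that concrete, here's a rough sketch of what reads-in-Parquet could look like with the Rust arrow/parquet crates (the column layout is my invention, not a proposal for a real spec):

    use std::fs::File;
    use std::sync::Arc;

    use arrow::array::{ArrayRef, StringArray};
    use arrow::datatypes::{DataType, Field, Schema};
    use arrow::record_batch::RecordBatch;
    use parquet::arrow::ArrowWriter;
    use parquet::basic::Compression;
    use parquet::file::properties::WriterProperties;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // A made-up columnar layout for sequencing reads.
        let schema = Arc::new(Schema::new(vec![
            Field::new("id", DataType::Utf8, false),
            Field::new("sequence", DataType::Utf8, false),
            Field::new("quality", DataType::Utf8, false),
        ]));
        let batch = RecordBatch::try_new(
            schema.clone(),
            vec![
                Arc::new(StringArray::from(vec!["read1"])) as ArrayRef,
                Arc::new(StringArray::from(vec!["GATTACA"])) as ArrayRef,
                Arc::new(StringArray::from(vec!["IIIIIII"])) as ArrayRef,
            ],
        )?;
        // Compression is a property of the format, not a separate gzip step.
        let props = WriterProperties::builder()
            .set_compression(Compression::SNAPPY)
            .build();
        let file = File::create("reads.parquet")?;
        let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
        writer.write(&batch)?;
        writer.close()?;
        Ok(())
    }

Indexing is the one item on the list Parquet only partially covers (row-group statistics rather than, say, genomic-coordinate indexes), so a real format would need more than this.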
My hypothesis is that bioinformatics favors text files, because open source tools usually start as research code.
That means two things. First, the initial developers are rarely software engineers, and they have limited experience developing software. They use text files because they are not familiar with the alternatives.
Second, the tools are usually intended to solve research problems. The developers rarely have a good idea of what the tools will eventually end up doing and what data the files will need to store. Text-based formats are a convenient choice, as it's easy to extend and change them. By the time anyone understands the problem well enough to write a useful specification, the existing file format may already be popular, and it's difficult to convince people to switch to a new one.
Yes, most bioinformatics tools are the result of research projects.
However, the most common bioinformatics file formats have actually been devised by excellent software engineers (e.g. SAM/BAM, VCF, BED).
I think it is just very convenient to have text-based formats as you don't need any special libraries to read/modify the files and can reach for basic Unix text-processing tools instead. Such modifications are often needed in a research context.
Also, space-efficient file formats (e.g. CRAM) are often within reach once disk space becomes a pressing issue. Now you only need to convince the team to use them. :)
Totally. A good chunk of the formats are just TSV files with some metadata in the header. Setting aside the drawbacks, this approach is both straightforward and flexible.
I think we're seeing some change in that regard, though. VCF got BCF, and SAM got BAM.
Yeah, I found `anyhow`'s `Context` to be a great way of annotating
bubbled-up errors. The only problem is that using the lazy `with_context`
can get somewhat unwieldy. For all the grief people give Go's
`if err != nil`, Rust's method chaining can get out of hand too. One
particular offender I wrote:
    match operator.propose(py).with_context(|| {
        anyhow!(
            "Operator {} failed while generating a proposal",
            operator.repr(py).unwrap()
        )
    })? {
This is a combination of `rustfmt` giving up on long lines and also not
formatting macro invocations as well as it does function calls.
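One workaround, sketched on the same call, is to hoist the message out and use the eager `context` instead, at the price of formatting the message even on the success path (which is what `with_context` exists to avoid):

    // Hypothetical refactor: build the message first so the chain stays short.
    let what = format!(
        "Operator {} failed while generating a proposal",
        operator.repr(py).unwrap()
    );
    match operator.propose(py).context(what)? {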
Thanks! So I guess the best recourse then is to resize the table? It seems like that should be part of the analysis, even if the probability of it happening is low. I haven't read the paper, though, so no strong opinion here...
(By the way, the text fragment does work somewhat in Firefox. Not on the first load, but it works if you load the page, then focus the URL field and press Enter.)
Yeah, I presume so. At least that's what Swiss Tables do. The paper is focused more on the asymptotics than on real-world hardware performance, so I can see why they chose not to handle such edge cases.
This bothered me too when reading it; the sample implementations I've found so far just bail out. I thought one of the benefits of hash tables was that they don't have a predefined size?
The hash tables a programmer interacts with do generally have a fixed size at any given moment, but they resize on demand. A fixed size is very much part of the open-addressing style of hash table; how else could they even talk about how full a hash table is?
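The distinction is easier to see in code. Here's a hypothetical open-addressing table with linear probing that doubles its slot array when the load factor would pass 7/8 (hashing elided; the key stands in for its own hash):

    // Sketch: the slot array has a fixed size at any instant, which is
    // what makes "how full is the table" (len / slots.len()) meaningful.
    struct Table<V> {
        slots: Vec<Option<(u64, V)>>,
        len: usize,
    }

    impl<V> Table<V> {
        fn new() -> Self {
            Table { slots: (0..8).map(|_| None).collect(), len: 0 }
        }

        fn insert(&mut self, key: u64, value: V) {
            // Resize on demand: grow before the load factor exceeds 7/8.
            if 8 * (self.len + 1) > 7 * self.slots.len() {
                self.grow();
            }
            let mut i = key as usize % self.slots.len();
            // Linear probing: walk until we hit the key or a free slot.
            while let Some((k, _)) = &self.slots[i] {
                if *k == key {
                    break;
                }
                i = (i + 1) % self.slots.len();
            }
            if self.slots[i].is_none() {
                self.len += 1;
            }
            self.slots[i] = Some((key, value));
        }

        fn grow(&mut self) {
            // Double the capacity and re-insert every existing entry.
            let bigger = (0..self.slots.len() * 2).map(|_| None).collect();
            let old = std::mem::replace(&mut self.slots, bigger);
            self.len = 0;
            for (k, v) in old.into_iter().flatten() {
                self.insert(k, v);
            }
        }
    }

Everything inside `insert` assumes a fixed capacity; the resize is bolted on around it, which is how "fixed size" and "grows on demand" coexist.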
A spreadsheet engine. It's a React app with a Rust backend, but it impressed me with how snappy it is [0]. Of course, it's not nearly as feature-rich as Google Sheets, not to mention Excel.
"backend" seemed to imply it was contacting some server, but https://github.com/ironcalc/ironcalc#early-testing claims (and the network tab confirms) it is just Rust compiled to wasm, no "backend" required
MIT or Apache 2 (player's choice) if anyone else has grown deeply suspicious about any "open source" HN headlines of late
Right, I made a mistake! I keep getting surprised by the fact that it's
possible to simply compile a Rust crate with a WASM target and run it in
the browser.
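In case it helps anyone else, the minimal shape of it is roughly this (a sketch with a made-up function; built with `wasm-pack build --target web` and imported from JS as the generated module):

    // Sketch: expose one Rust function to JavaScript via wasm-bindgen.
    use wasm_bindgen::prelude::*;

    #[wasm_bindgen]
    pub fn greet(name: &str) -> String {
        format!("Hello from Rust, {name}!")
    }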
Backend is a general word, not limited to client-server or the web. You can have a rendering backend with various configurable choices, like in Matplotlib (https://matplotlib.org/stable/users/explain/figure/backends....), or the deep learning library Keras has a choice between PyTorch, JAX and TensorFlow backends.
what's the backend of a spreadsheet engine going to be doing? updating the data structures of the spreadsheet.
is it going to be local or remote? that's not part of the question.
is it foreground or background? that's an implementation choice. Apple II, yeah, everything freezes while it recalcs. Windows? recalcs when it can, don't let the mouse freeze.
Yep, I misunderstood; I realized it after seeing mdaniel's comment.
Thanks for making this in the first place! I saw IronCalc in the list
of projects supported by NLnet and it grabbed my attention.
By the way, if you don't mind me asking, how did Tuta end up sponsoring
IronCalc? It seems that lately they and Proton have been trying to
expand their business beyond just email. The fact that Tuta is
interested in IronCalc makes me think they want to have an office-like
offering.
Tuta sponsors by providing us with free email accounts, that's all. I reached out months ago, they liked the project and were kind enough to help us out with the email.
I haven't had talks with them about integrating IronCalc, but it is something that's on my mind.
There are a few projects where I'd love to see a modern spreadsheet
implementation. CryptPad comes to mind. They use OnlyOffice, which is
quite featureful, but it takes a while to load and isn't as responsive.
Contrary to the name, I don't think it's very good. The whole thing is
a canvas rendered via WASM, so scrolling isn't smooth, selection doesn't
work, and accessibility is seemingly non-existent.
But I think the technology itself is interesting. While most modern UI
toolkits use HTML or React-like components, this one uses a set of JSON
documents that describe the page.
I was searching for a Meilisearch alternative (Meilisearch sends out
telemetry by default) and found Tantivy. It's more of a search-engine
builder, but the setup looks pretty simple [0].
Hm, I am interested, but I would love to use it as a Rust lib and just have Rust types instead of some JSON config...
The Java SDK of Meilisearch was also nice; same thing: no need for a CLI or manual configuration. I just pointed it at a DB entity and indexed whole tables...
But instead of this, I would prefer some way to just hand it JSON and have it index all the fields...
For comparison, this is my Meilisearch SDK code:
    // Assumed imports (Meilisearch Java SDK, kotlinx.serialization, Exposed):
    import com.meilisearch.sdk.Client
    import com.meilisearch.sdk.Config
    import kotlinx.serialization.builtins.ListSerializer
    import kotlinx.serialization.json.Json
    import org.jetbrains.exposed.sql.transactions.transaction

    fun createCustomers() {
        val client = Client(Config("http://localhost:7700", "password"))
        val index = client.index("customers")
        // Serialize all Customer rows to a JSON array string.
        val customers = transaction {
            val customers = Customer.all()
            val json = customers.map { CustomerJson.from(it) }
            Json.encodeToString(ListSerializer(CustomerJson.serializer()), json)
        }
        // "id" is the primary-key field Meilisearch will use.
        index.addDocuments(customers, "id")
    }
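For what it's worth, using Tantivy directly as a crate, rather than through a CLI's JSON config, does give you typed Rust; a minimal sketch (the field name and memory budget are made up):

    use tantivy::schema::{Schema, STORED, TEXT};
    use tantivy::{doc, Index};

    fn main() -> tantivy::Result<()> {
        // The schema is ordinary Rust values, no JSON config file.
        let mut builder = Schema::builder();
        let name = builder.add_text_field("name", TEXT | STORED);
        let schema = builder.build();

        let index = Index::create_in_ram(schema);
        let mut writer: tantivy::IndexWriter = index.writer(50_000_000)?;
        writer.add_document(doc!(name => "Ada Lovelace"))?;
        writer.commit()?;
        Ok(())
    }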
OP is entitled to make political choices when selecting software.
Some of us have specific principles of which things like opt-out telemetry might run afoul.
OP will choose their software, I choose mine and you choose yours; none of us need to call each other petty or otherwise cast such negative judgement; a free market is a free market.
Suggesting you should be less judgemental is not white-knighting, nor is it irrational. Sorry bud, but not everyone thinks the way you do; different people have different principles.
Feel free to explain how either of the two comments of yours I've replied to represent principled discussion or added value, because I'm not seeing it.
It's a minor complaint, but I'm also evaluating it for a minor project.
I just don't like the fact that I can forget to add a flag once and, oh,
now I'm sending telemetry on my personal medical documents.
Meilisearch only sends anonymized telemetry events. We only send API endpoint usage; nothing like raw documents goes over the wire. You can look at the exhaustive list of all collected data on our website [1].
Hey PSeitz, Meilisearch CEO here. Sorry to hear that you failed to index a low volume of data. When did you last try Meilisearch? We have made significant improvements to indexing speed. We have a customer with hundreds of gigabytes of raw data on our cloud, and it scales amazingly well. https://x.com/Kerollmops/status/1772575242885484864
Frankly, I'm okay with Meilisearch for instant search because y'all are clear about analytics choices, offer understandable FOSS Rust, and have a non-AGPL license. If/when we make some money, I'm in favor of $upporting and consulting on the tools we use, out of self-interest in keeping them alive.
It's an old project, based on Sphinx [0]. But unlike many other code searches,
this one indexes Codeberg, SourceHut, and a number of other non-GitHub
forges.