Yearning for the old Apple and its sense of order; current times and Gen Z feel more chaotic. Not sure if it's generational: old Apple was obsessed with design, whereas now the HIG is mostly optional. They even use hamburger menus on websites now, which used to be a big no.
This is the thing that's often missed in recent conversations. An H-1B not only lets you hire more cheaply, it also gives you much more leverage and power over the worker. Maybe there are plenty of candidates on the market now, but having an immigrant on a visa for less money is still, cynically, a much better deal for corporations.
Just so I understand: you’re talking about setting up an FTP account, using curlftpfs, and SVN/CVS for Linux users? And even with all these, you’d still need USB drives for connectivity issues? Plus, you're naming it Dropbox? Is there more?
There’s a lot of data that we should have programmatic access to that we don’t.
The fact that I can’t get my own receipt data from online retailers is unacceptable. I built a CLI Puppeteer scraper for sites like Target, Amazon, Walmart, and Kroger for precisely this reason.
Any website that has my data and doesn’t give me access to it is a great target for scraping.
I'd say scrapers have always been popular, but I imagine they're even more popular nowadays with all the tools (AI but also non-AI) readily available to do cool stuff on a lot of data.
Bingo. During the pandemic, I started a project to keep myself busy by trying to scrape stock market ticker data and then do some analysis and make some pretty graphs out of it. I know there are paid services for this, but I wanted to pull it from various websites for free. It took me a couple months to get it right. There are so many corner cases to deal with if the pages aren't exactly the same each time you load them. Now with the help of AI, you can slap together a scraping program in a couple of hours.
I'm sure it was profitable in keeping him busy during the pandemic. Not everything has to have monetary value; you can do something for the experience, for fun, to kick the tyres, or for open-source and/or philanthropic reasons.
Besides, it's a low-margin, heavily capitalized, and heavily crowded market you'd be entering, and not worth the negative monetary investment in the short or medium term (unless you wrote AI in the title and we're going to the mooooooon baby).
Scraping generally isn't hard because it's conceptually difficult or because it requires extremely high-level reasoning.
It sucks because when someone changes "<section class='bio'>" to "<div class='section bio'>" your scraper breaks. I just want the bio and it's obvious what to grab, but machines have no nuance.
LLMs have enough common sense to deal with these things, and they take almost no time to work with. I can throw HTML at one with a vague description and pull out structured data with no engineer required, and it'll probably keep working when the page changes (rough sketch below).
There's a huge number of one-off jobs people will do where perfect isn't the goal, and a fast solution + a bit of cleanup is hugely beneficial.
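A minimal sketch of that kind of LLM extraction, in Python (this assumes the OpenAI SDK and an API key in the environment; the model name, prompt, and output keys are placeholders, not anyone's actual setup):

    # Sketch: pull structured fields out of arbitrary HTML with an LLM.
    # Assumes `pip install openai` and OPENAI_API_KEY set; model name,
    # prompt, and output keys are illustrative placeholders.
    import json
    from openai import OpenAI

    client = OpenAI()

    def extract_bio(html: str) -> dict:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            response_format={"type": "json_object"},
            messages=[
                {"role": "system",
                 "content": "You extract data from web pages. Respond with JSON only."},
                {"role": "user",
                 "content": "From this HTML, return JSON with keys 'name' and 'bio':\n\n" + html},
            ],
        )
        return json.loads(response.choices[0].message.content)

    print(extract_bio("<div class='section bio'><h2>Ada</h2><p>Mathematician.</p></div>"))

The prompt describes intent ("the bio"), so a class rename or layout shuffle on the page usually doesn't matter.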
Another approach is to use a regexp scraper. These are very "loose" and tolerant of changes. For example, RNSAFFN.com uses regular expressions to scrape the Commitments of Traders report from the Commodity Futures Trading Commission every week.
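As a rough illustration of the "loose" style (the label text and regex here are made up for the example, not how RNSAFFN.com or the CFTC report actually look):

    # Regex scraping sketch: anchor on stable visible text, not on markup,
    # so tag or class changes don't break anything.
    import re
    import urllib.request

    def scrape_open_interest(url: str):
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        # Allow any non-digit junk (tags, whitespace) between label and number.
        match = re.search(r"Open\s+Interest\D{0,200}?([\d,]+)", html, re.IGNORECASE)
        return int(match.group(1).replace(",", "")) if match else None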
My experience has been the opposite: regex scrapers are usually incredibly brittle, and also harder to debug when something DOES change.
My preferred approach for scraping these days is Playwright Python and CSS selectors to select things from the DOM. Still prone to breakage, but reasonably pleasant to debug using browser DevTools.
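Roughly what that looks like (the URL and selectors below are placeholders; requires `pip install playwright` plus `playwright install chromium`):

    # Playwright + CSS selectors sketch: drive a real browser, wait for the
    # JS-rendered content, then pick elements out of the DOM.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com/orders")
        page.wait_for_selector(".order-row")
        for row in page.query_selector_all(".order-row"):
            date = row.query_selector(".order-date")
            total = row.query_selector(".order-total")
            print(date.inner_text() if date else "?",
                  total.inner_text() if total else "?")
        browser.close()

Launching with headless=False lets you poke at the same selectors in the browser's DevTools while you debug.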
I don't know if many have the same use case, but... I'm relying on this heavily right now because my daughter started school. The school board, the school, and the teacher each use a different app to communicate important information to parents. I'm just trying to build one feed with all of them. Before AI it would have been hell to scrape, because, as you can imagine, those apps are terrible.
Fun aside: The worst one of them is a public Facebook page. The school board is making it their official communication channel, which I find horrible. Facebook is making it so hard to scrape. And if you don't know, you can't even use Facebook's API for this anymore, unless you have a business verified account and go through a review just for this permission.
Scrapers have always been notoriously brittle and prone to breaking completely when pages make even the smallest of structural changes.
Scraping with LLMs bypasses that pitfall because it's more of a summarization task on the whole document, rather than working specifically on a hard-coded document structure to extract specific data.
Personally, I find it most useful for archiving, since most sites don't provide a convenient way to save their content directly. Occasionally I do it just to build a better interface over the data.
There's been a large push toward server-side rendering for web pages, which means companies no longer expose a public-facing API for fetching the data they display on their websites.
Parsing the rendered HTML is the only way to extract the data you need.
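When the page really is server-rendered, the fetch-and-parse step can be as plain as this (hypothetical URL and class names; assumes requests and beautifulsoup4 are installed):

    # Sketch: the data is baked into the server-rendered HTML, so a plain
    # HTTP GET plus an HTML parser is all the "API" there is.
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/products", timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    for card in soup.select("div.product-card"):
        name = card.select_one(".product-name")
        price = card.select_one(".product-price")
        print(name.get_text(strip=True) if name else "?",
              price.get_text(strip=True) if price else "?")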
I've had good success running Playwright screenshots through EasyOCR, so parsing the DOM isn't the only way to do it. Granted, tables end up pretty messy...
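Roughly what that pipeline looks like (the URL is a placeholder; EasyOCR downloads its recognition model on first run):

    # Screenshot-then-OCR sketch: render the page with Playwright, then read
    # the pixels with EasyOCR instead of touching the DOM at all.
    import easyocr
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com/dashboard")
        page.screenshot(path="page.png", full_page=True)
        browser.close()

    reader = easyocr.Reader(["en"])
    # readtext returns (bounding_box, text, confidence) tuples; the layout is
    # discarded, which is exactly why tables come out messy.
    for _box, text, confidence in reader.readtext("page.png"):
        if confidence > 0.5:
            print(text)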
We've been doing something similar for VLM Run [1]. A lot of websites with obfuscated HTML/JS or rendered charts and tables tend to be hard to parse via the DOM. Taking screenshots is definitely more reliable and future-proof, since these web pages are built for humans to interact with.
That said, the costs can be high, as the OP says, but we're building cheaper and more specialized models for web screenshot -> JSON parsing (rough sketch of the general idea below).
Also, it turns out you can do a lot more than just web-scraping [2].
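This isn't VLM Run's API, but as a generic sketch of the screenshot -> JSON idea using a vision-capable chat model (model name, prompt, and schema are all assumptions):

    # Generic screenshot -> JSON sketch with a vision-capable chat model.
    # Not VLM Run's API; model, prompt, and schema are illustrative only.
    import base64
    import json
    from openai import OpenAI

    client = OpenAI()

    def screenshot_to_json(path: str) -> dict:
        with open(path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            response_format={"type": "json_object"},
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Extract every row of the table in this screenshot "
                             "as JSON: {\"rows\": [{\"label\": ..., \"value\": ...}]}"},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }],
        )
        return json.loads(response.choices[0].message.content)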
What do you think all this LLM stuff will evolve into? Of course it's moving on from chitchat over stale information and into an "automate the web" kind of phase, like it or not.
It's a series of strawmen that make up the blog post I'm replying to; no counterweight was given to the benefits of "founder mode", so I'm making those arguments.
No, they just mentioned it because they're aware of it but can't control it, and it might affect the end result. Makes sense: it used to be underdiagnosed, and now, with a bigger push for prevention, this might introduce more false positives.
Or even short-term, for that matter: even for localized pancreatic adenocarcinoma (not metastasized to the lymph nodes or beyond), only 50% of patients survive their first year. For glioblastoma it's even worse: only a 25% one-year survival rate.
CAR-T therapies and mRNA vaccines are showing breakthrough-level results in recent studies.
It’s an amazing time in that space. My poor late wife succumbed to metastatic melanoma last year. In 2010, the chance of living a year was zero. Now the 5-year survival rate is 65% thanks to immunotherapy. Unfortunately, complications delayed treatment for my wife, and she was one of the 35%.
In the next decade, many brain cancers will be curable. Unfortunately, those breakthroughs are built on the shoulders of those who came before us.