Yearning for the old Apple and its sense of order; current times and Gen Z feel more chaotic. Not sure if it's generational: old Apple was obsessed with design, whereas now the HIG is mostly optional. They even use hamburger menus on websites now, which used to be a big no.
This is the thing that's often missed in recent conversations. An H-1B not only lets you hire more cheaply, it also gives you much more leverage and power over the worker. Maybe there are plenty of candidates on the market now, but having an immigrant on a visa for less money is still, cynically, a much better deal for corporations.
Just so I understand: you’re talking about setting up an FTP account, using curlftpfs, and SVN/CVS for Linux users? And even with all these, you’d still need USB drives for connectivity issues? Plus, you're naming it Dropbox? Is there more?
There’s a lot of data that we should have programmatic access to that we don’t.
The fact that I can’t get my own receipt data from online retailers is unacceptable. I built a CLI Puppeteer scraper for sites like Target, Amazon, Walmart, and Kroger for precisely this reason.
Any website that has my data and doesn’t give me access to it is a great target for scraping.
I'd say scrapers have always been popular, but I imagine they're even more popular nowadays with all the tools (AI but also non-AI) readily available to do cool stuff on a lot of data.
Bingo. During the pandemic, I started a project to keep myself busy by trying to scrape stock market ticker data and then do some analysis and make some pretty graphs out of it. I know there are paid services for this, but I wanted to pull it from various websites for free. It took me a couple months to get it right. There are so many corner cases to deal with if the pages aren't exactly the same each time you load them. Now with the help of AI, you can slap together a scraping program in a couple of hours.
I'm sure it was profitable in keeping him busy during the pandemic. Not everything has to have monetary value; you can do something for the experience, for fun, to kick the tyres, or for open-source and/or philanthropic reasons.
Besides, it's a low-margin, heavily capitalized, and heavily crowded market you'd be entering, and not worth the negative monetary investment in the short or medium term (unless you wrote AI in the title and we're going to the mooooooon baby).
Scraping generally isn't hard because it's conceptually difficult or because it requires extremely high-level reasoning.
It sucks because when someone changes "<section class='bio'>" to "<div class='section bio'>" your scraper breaks. I just want the bio and it's obvious what to grab, but machines have no nuance.
LLMs have enough common sense to deal with these things, and they take almost no time to work with. I can throw HTML at one with a vague description and pull out structured data with no engineer required, and it'll probably keep working when the page changes (rough sketch below).
There's a huge number of one-off jobs people will do where perfect isn't the goal, and a fast solution + a bit of cleanup is hugely beneficial.
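A minimal sketch of that kind of LLM extraction, in Python (this assumes the OpenAI SDK and an API key in the environment; the model name, prompt, and output keys are placeholders, not anyone's actual setup):

    # Sketch: pull structured fields out of arbitrary HTML with an LLM.
    # Assumes `pip install openai` and OPENAI_API_KEY set; model name,
    # prompt, and output keys are illustrative placeholders.
    import json
    from openai import OpenAI

    client = OpenAI()

    def extract_bio(html: str) -> dict:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            response_format={"type": "json_object"},
            messages=[
                {"role": "system",
                 "content": "You extract data from web pages. Respond with JSON only."},
                {"role": "user",
                 "content": "From this HTML, return JSON with keys 'name' and 'bio':\n\n" + html},
            ],
        )
        return json.loads(response.choices[0].message.content)

    print(extract_bio("<div class='section bio'><h2>Ada</h2><p>Mathematician.</p></div>"))

The prompt describes intent ("the bio"), so a class rename or layout shuffle on the page usually doesn't matter.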
Another approach is to use a regexp scraper. These are very "loose" and tolerant of changes. For example, RNSAFFN.com uses regular expressions to scrape the Commitments of Traders report from the Commodity Futures Trading Commission every week.
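As a rough illustration of the "loose" style (the label text and regex here are made up for the example, not how RNSAFFN.com or the CFTC report actually look):

    # Regex scraping sketch: anchor on stable visible text, not on markup,
    # so tag or class changes don't break anything.
    import re
    import urllib.request

    def scrape_open_interest(url: str):
        html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        # Allow any non-digit junk (tags, whitespace) between label and number.
        match = re.search(r"Open\s+Interest\D{0,200}?([\d,]+)", html, re.IGNORECASE)
        return int(match.group(1).replace(",", "")) if match else None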
My experience has been the opposite: regex scrapers are usually incredibly brittle, and also harder to debug when something DOES change.
My preferred approach for scraping these days is Playwright Python and CSS selectors to select things from the DOM. Still prone to breakage, but reasonably pleasant to debug using browser DevTools.
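Roughly what that looks like (the URL and selectors below are placeholders; requires `pip install playwright` plus `playwright install chromium`):

    # Playwright + CSS selectors sketch: drive a real browser, wait for the
    # JS-rendered content, then pick elements out of the DOM.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com/orders")
        page.wait_for_selector(".order-row")
        for row in page.query_selector_all(".order-row"):
            date = row.query_selector(".order-date")
            total = row.query_selector(".order-total")
            print(date.inner_text() if date else "?",
                  total.inner_text() if total else "?")
        browser.close()

Launching with headless=False lets you poke at the same selectors in the browser's DevTools while you debug.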
I don't know if many have the same use case, but... I'm relying on this heavily right now because my daughter started school. The school board, the school, and the teacher each use a different app to communicate important information to parents. I'm just trying to build one feed with all of them. Before AI it would have been hell to scrape, because, as you can imagine, those apps are terrible.
Fun aside: The worst one of them is a public Facebook page. The school board is making it their official communication channel, which I find horrible. Facebook is making it so hard to scrape. And if you don't know, you can't even use Facebook's API for this anymore, unless you have a business verified account and go through a review just for this permission.
Scrapers have always been notoriously brittle and prone to breaking completely when pages make even the smallest of structural changes.
Scraping with LLMs bypasses that pitfall because it's more of a summarization task on the whole document, rather than working specifically on a hard-coded document structure to extract specific data.
Personally, I find it most useful for archiving, since most sites don't provide a convenient way to save their content directly. Occasionally I do it just to build a better interface over the data.
There's been a large push toward server-side rendering for web pages, which means companies no longer expose a public-facing API for fetching the data they display on their websites.
Parsing the rendered HTML is the only way to extract the data you need.
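When the page really is server-rendered, the fetch-and-parse step can be as plain as this (hypothetical URL and class names; assumes requests and beautifulsoup4 are installed):

    # Sketch: the data is baked into the server-rendered HTML, so a plain
    # HTTP GET plus an HTML parser is all the "API" there is.
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/products", timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    for card in soup.select("div.product-card"):
        name = card.select_one(".product-name")
        price = card.select_one(".product-price")
        print(name.get_text(strip=True) if name else "?",
              price.get_text(strip=True) if price else "?")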
I've had good success running Playwright screenshots through EasyOCR, so parsing the DOM isn't the only way to do it. Granted, tables end up pretty messy...
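Roughly what that pipeline looks like (the URL is a placeholder; EasyOCR downloads its recognition model on first run):

    # Screenshot-then-OCR sketch: render the page with Playwright, then read
    # the pixels with EasyOCR instead of touching the DOM at all.
    import easyocr
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com/dashboard")
        page.screenshot(path="page.png", full_page=True)
        browser.close()

    reader = easyocr.Reader(["en"])
    # readtext returns (bounding_box, text, confidence) tuples; the layout is
    # discarded, which is exactly why tables come out messy.
    for _box, text, confidence in reader.readtext("page.png"):
        if confidence > 0.5:
            print(text)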
We've been doing something similar for VLM Run [1]. A lot of websites with obfuscated HTML/JS or rendered charts and tables tend to be hard to parse via the DOM. Taking screenshots is definitely more reliable and future-proof, since these web pages are built for humans to interact with.
That said, the costs can be high, as the OP says, but we're building cheaper and more specialized models for web screenshot -> JSON parsing (rough sketch of the general idea below).
Also, it turns out you can do a lot more than just web-scraping [2].
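This isn't VLM Run's API, but as a generic sketch of the screenshot -> JSON idea using a vision-capable chat model (model name, prompt, and schema are all assumptions):

    # Generic screenshot -> JSON sketch with a vision-capable chat model.
    # Not VLM Run's API; model, prompt, and schema are illustrative only.
    import base64
    import json
    from openai import OpenAI

    client = OpenAI()

    def screenshot_to_json(path: str) -> dict:
        with open(path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode()
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            response_format={"type": "json_object"},
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Extract every row of the table in this screenshot "
                             "as JSON: {\"rows\": [{\"label\": ..., \"value\": ...}]}"},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }],
        )
        return json.loads(response.choices[0].message.content)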
What do you think all this LLM stuff will evolve into? Of course it's moving on from chitchat over stale information and into an "automate the web" kind of phase, like it or not.
It's a series of strawmen that make up the blog post I'm replying to; no counterweight was given to the benefits of "founder mode", so I'm making those arguments.
No, they just mentioned it because they're aware of it but can't control it, and it might affect the end result. Makes sense: it used to be underdiagnosed, and now, with a bigger push for prevention, this might introduce more false positives.
Or even short-term, for that matter: even for localized pancreatic adenocarcinoma (not metastasized to the lymph nodes or beyond), only 50% of patients survive their first year. For glioblastoma it's even worse: only a 25% one-year survival rate.
CAR-T therapies and mRNA vaccines are showing breakthrough-level results in recent studies.
It’s an amazing time in that space. My poor late wife succumbed to metastatic melanoma last year. In 2010, the chance of living a year was zero. Now the 5-year survival rate is 65% thanks to immunotherapy. Unfortunately, complications delayed treatment for my wife, and she was one of the 35%.
In the next decade, many brain cancers will be curable. Unfortunately, those breakthroughs are built on the shoulders of those who came before us.