Hacker News | Firefishy's comments

Disclosure: I am part of the mostly volunteer run OpenStreetMap ops team.

Technically we are able to block and restrict the scrapers after the initial request from an IP. We've seen 400,000 IPs in the last 24 hours; each IP only makes a few requests. Most are not very good at faking browsers (HTTP/1.1 vs HTTP/2, obviously faked headers, etc), but they are getting better.
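As a rough illustration of what such a "tell" check might look like (a hypothetical sketch, not our actual detection rules, which are considerably more involved):

```python
# Toy heuristic in the spirit of the "tells" above: a client whose HTTP
# version or headers are inconsistent with the browser it claims to be.

def looks_faked(http_version: str, headers: dict) -> bool:
    ua = headers.get("User-Agent", "")
    claims_modern_browser = "Chrome/" in ua or "Firefox/" in ua
    # Modern browsers negotiate HTTP/2 (or HTTP/3); a "Chrome" that speaks
    # HTTP/1.1 to an h2-capable server is a strong tell.
    if claims_modern_browser and http_version == "HTTP/1.1":
        return True
    # Chromium-based browsers send Sec-Fetch-* metadata headers on
    # navigations; many scrapers forget to fake them.
    if "Chrome/" in ua and "Sec-Fetch-Mode" not in headers:
        return True
    return False
```

Each individual check is easy to defeat once known, which is partly why the scrapers "getting better" is such a problem.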

The problem has been going on for over a year now. It isn't going away. We need journalists and others to help us push back.


Hey. I run a small community forum and I've been dealing with this exact same kind of behaviour where well over 99% of requests are bad crawlers. There used to be plenty of "tells" for the faked browsers, HTTP/1.1 being a huge one. As you said, however, they're getting a bit smarter about that and it's becoming increasingly difficult to differentiate it from legitimate traffic.

It's been getting worse over the past year, with the past few weeks in particular seeing a massive change literally overnight. I had to aggressively tune my WAF rules to even remotely get things under control. With Cloudflare I'm aggressively issuing browser challenges to any browser that looks remotely suspicious, and the pass rate is currently below 0.5%. For my users' sake, a successful browser challenge is "valid" for over a month, but this still feels like another thing that'll eventually be bypassed.

I'd be keen to know if you've found any other effective ways of mitigating these most recent aggressive scraping requests. Even a simple "yes" or "no" would be appreciated; I think it's fair to be apprehensive about sharing some specific details publicly since even a lot of folks here on HN seem to think it's their right to scrape content with orders of magnitude higher throughput than all users combined.

I really don't know how this is sustainable long-term. It's eaten up quite a lot of my personal time and effort just for the sake of a hobby that I otherwise greatly enjoy.


I hear ya. This is just my opinion, but I don't think journalists are going to be much help. The bots would have to be hurting something belonging to the government, or something the government is paying for, to really get them on it. E.g. some big government orgs embed your maps on their sites. Someone would have to create legislation, then trace the bots back to their operators for attribution, then file lawsuits against them once it is illegal. Or you could try using a ToS/AUP to go after them, assuming attribution. I am not a lawyer.

I think your only hope would be to either find subtle differences between the bots and real legit users, or change how your site works so that bots have to be authenticated unless they have a whitelisted IP/CIDR, or put your site behind something else that spots the bots. Beyond that, all anyone can do is beef up their infrastructure to handle much more than the bots can dish out.

Have you tried silly simple things like hidden javascript puzzles the browser has to solve?
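The idea behind such puzzles is proof-of-work: make the browser burn a little CPU before being served. A minimal sketch of the mechanism (in Python for readability; the solving loop would really be JavaScript in the visitor's browser):

```python
import hashlib
import secrets

DIFFICULTY = 3  # required number of leading zero hex digits in the hash

def make_challenge() -> str:
    """Server side: issue a random, per-session challenge string."""
    return secrets.token_hex(16)

def verify(challenge: str, nonce: int) -> bool:
    """Server side: checking a claimed solution costs a single hash."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce until the hash has enough
    leading zeros. This is the part that runs as JavaScript in a
    real deployment."""
    nonce = 0
    while not verify(challenge, nonce):
        nonce += 1
    return nonce
```

The asymmetry is the point: a legitimate visitor pays a few milliseconds once per session, while a scraper rotating through hundreds of thousands of IPs pays it on every fresh session.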


Disclosure: I am part of the OpenStreetMap mostly-volunteer sysadmin team fighting this.

The scrapers try hard to make themselves look like valid browsers, sending requests via residential IP addresses (400,000+ IPs at last count).

I reached out to journalists because despite strong technical measures, the abuse will not go away on its own.


The OpenStreetMap Slack https://slack.openstreetmap.us/ was forced to downgrade to the free edition earlier this year for similar reasons.


Join us on Signal or one of the many other options: https://wiki.openstreetmap.org/wiki/Contact_channels#Realtim...

I've never understood why a part of our community goes with this walled garden to host their chat. We're literally an open data project.

Edit: FWIW, I know that moving communities is extremely hard, if not impossible to achieve completely intact, but those who care could choose to join two chat systems; eventually, the one people gravitate towards will win. E.g. I'm still in the Telegram chats and use those on occasion (also because, as a moderator, I get regular pings), but I primarily share content on Signal or Matrix.


I help to run the OpenStreetMap infrastructure, and for many years the OSUOSL has generously hosted some of our servers.

Without OSUOSL, OpenStreetMap would be more difficult to host and significantly slower to access from North America.

I hope OSUOSL can get the financial support they rightfully deserve.


https://protectli.com/ Good-quality devices, with real serial consoles that allow recovery when you make a networking configuration mistake ;-)


Same here. Alpine Linux on top of that + Unbound DNS, dnsmasq for DHCP, netfilter, chronyd for time. I've never been able to make them break a sweat.


Curious: how did you set up firewall (nftables?), IPv6 delegation both ULA and public prefix? Happy to read if you have a write-up somewhere.


I disabled IPv6 as my little ISP has not yet figured out how they want to bill for or assign/segment it out for static assignment. I have multiple static IPv4 addresses. I only use static IPs, but that is a requirement specific to me. The firewall is very simple: it just forwards packets and uses a simple IPv4 SNAT. The only time I've had it set up more complicated was when a guest was abusing P2P, so I had to block it using string matches on the unencrypted commands.
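For the curious, a "forward + SNAT" nftables ruleset along those lines can be just a few lines. This is a generic sketch, not my actual config; interface names and the address are placeholders:

```
# /etc/nftables.conf -- minimal forward + SNAT sketch
# ("lan0", "wan0" and 203.0.113.10 are hypothetical)
table inet filter {
    chain forward {
        type filter hook forward priority filter; policy drop;
        ct state established,related accept
        iifname "lan0" oifname "wan0" accept
    }
}
table ip nat {
    chain postrouting {
        type nat hook postrouting priority srcnat;
        oifname "wan0" snat to 203.0.113.10
    }
}
```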

My setup is honestly simple enough that a write-up would not benefit many. My Unbound setup to block many malicious sites is also fairly well documented by others. The null routing of commonly used DoH servers is straightforward. My Chrony setup would just annoy people, as I only use stratum-1 servers and the options would just look like cargo-culting to some.
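Alongside null routing the server IPs, the DNS-side half of that approach in Unbound is a couple of local-zone lines. An illustrative snippet (not my actual config; the hostnames are just well-known examples):

```
# Refuse to resolve common DoH endpoints so clients fall back
# to the local resolver. Hostnames below are examples only.
server:
    local-zone: "dns.google." always_nxdomain
    local-zone: "cloudflare-dns.com." always_nxdomain
```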

About the only thing not commonly discussed is the combination of sch_cake and some sysctl options to keep bufferbloat low, but OpenWRT has its own take on that topic already.


Raspberry Pi's "Making a More Resilient File System" document https://pip.raspberrypi.com/categories/685-app-notes-guides-... has instructions on how to configure the eMMC on the CM4 and CM5 to run in pSLC mode, halving the storage capacity.


Yup. mmc-utils is the way to go. Note that this change is irreversible once you send the command to finalize the settings.

The single biggest thing you can do to improve the reliability of your embedded system is to use eMMC’s built-in hardware partitioning.

- Each hardware partition is a separate block device. A firmware update cannot corrupt the device by overwriting a partition table.

- There are two small boot partitions, and they can be made permanently read-only after programming, thus preventing corruption of your bootloader. You can also use the other one read-write for your u-boot environment.

- You can easily have two OS partitions for A/B firmware updates. In addition to mounting them read-only, temporary write protection can be enabled on a per-partition basis and disabled when needed for firmware updates.

- If you can’t afford the capacity hit from pSLC, I believe it can be enabled on a per-partition basis. (Don’t quote me on this, it could be wrong).

All these settings can be configured with either mmc-utils or u-boot. In volume, programming houses can take care of this for you. (You'll have to list all the register values out very specifically in an Excel spreadsheet.)

The downside is that calculating all the correct register values is not a simple process, and you’ll have to spend a bit of time reading the eMMC spec.


How come there is so much "permanent" config for SD cards?


I believe it's because SD/MMC was originally supposed to be a future medium for retail audio and the like. I had some read-only Palm Pilot cards like this; books etc. were also sold this way for a short period.


The spec was developed with DRM as one of its use cases.


Probably the same reason microcontrollers have OTP fuses: to prevent accidental corruption in the field or from buggy programming.


Could you also use ZFS or BTRFS with copies? I'm not sure I'd trust any of these drives.


OpenStreetMap is now back up and running :-)


Thanks Firefishy for working on it!


The new stack has some unique features, like the vector tiles being updated minutely directly from OSM mapping changes.

There are still issues to fix as it is still only a technical preview.


We don't offer 2x raster tiles because we simply don't have the resources to do it while keeping the tiles minutely updated and open access. We serve around 60k req/sec on a tiny donation-backed budget.

The raster tiles are primarily for our OSM mappers to see their map changes rendered quickly to improve the feedback loop.


OP here. The toot was my sarcastic response after having to rate-limit and block another set of abusive scrapers aggressively hitting our website and mapping API. robots.txt be damned.

OpenStreetMap data is free to download. We publish minutely updates on https://planet.openstreetmap.org/ and the data is also available via AWS S3 + torrent.

If you're just starting out, it's best to begin with a smaller regional extract: https://wiki.openstreetmap.org/wiki/Planet.osm

