Most of this makes sense, but ends up not being followed for non-technical reasons like the person's confidence, etc.
I'd say, just know two things:
1. (From the post) "Let your instincts guide you if you think something feels much more difficult than you’d expect", and talk to your team about it, and change it for them.
2. There is a LOT of dead (unreachable / redundant) code in any enterprise codebase. Especially recently launched flags that haven't been "cleaned up" yet. Just start cleaning this stuff up. Even the very obvious stuff gets you very well introduced to the codebase.
Where I disagree with the post:
> Taking notes is non-negotiable for me, not strictly as a memory aid, but as a means to communicate findings and verify them later.
If you need to take a 2 line note, you're better off having written a 2 line code review. Even something simple like "metric.add("DOES_THIS_HAPPEN?", 1);" can be shipped with zero confidence / testing; with minimal confidence / minimal testing you could instead write "Preconditions.checkState(..., "Will this be caught by our integration tests?");". The best notes are the ones that get pushed to production to validate assumptions.
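To make that concrete, here's a minimal sketch of both options, using Guava's Preconditions and a made-up Metrics / Order pair standing in for whatever you actually have (and you'd ship one of the two probes, not both):

    import com.google.common.base.Preconditions;

    // Sketch only: "Metrics" and "Order" are hypothetical stand-ins, not a specific library.
    final class LegacyPathProbe {
      interface Metrics { void add(String name, long delta); }
      interface Order { boolean isLegacyFormat(); }

      private final Metrics metrics;

      LegacyPathProbe(Metrics metrics) {
        this.metrics = metrics;
      }

      void handle(Order order) {
        // Option 1, zero confidence / zero testing: just count it and check the dashboard later.
        if (order.isLegacyFormat()) {
          metrics.add("DOES_THIS_HAPPEN", 1);
        }

        // Option 2, minimal confidence / minimal testing: fail loudly if the assumption is wrong,
        // ideally caught by integration tests or a canary before it hits real traffic.
        Preconditions.checkState(!order.isLegacyFormat(),
            "Thought the legacy format was dead - will this be caught by our integration tests?");
      }
    }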
> There is a LOT of dead (unreachable / redundant) code in any enterprise codebase. Especially recently launched flags that haven't been "cleaned up" yet. Just start cleaning this stuff up. Even the very obvious stuff gets you very well introduced to the codebase.
Ooof, good way to get well known in the team when the smoke comes out. :)
Lol yes agreed completely. "Cleaning things up" before you understand the domain, codebase, and gotchas sounds like a highly unproductive way to get onboarded. I think suggesting that a new hire start week 1 by adding a whole bunch of metrics that add unnecessary overhead for the team and all users just to see "if something happens" is pretty bad advice, to the point of being something I'd probably wish I had screened at the interviewing stage.
One thing I like to do is go through everything in the makefiles, and any commands in the readmes or documentation. In almost every case there's stuff that doesn't work: sometimes it's out of date and you can fix it; sometimes you'll be missing some one-time-only configuration that everybody else did so long ago they didn't remember to document it; sometimes the docs are just outdated, so you can ask if they're no longer relevant and clean them up. And if you mess anything up it usually breaks very early in the pipeline; the tests won't run against the dev branch, or something like that.
"Cleanup tasks" will always be appreciated I am sure, but again I would caution that there is a trap here. It's easy to clean up any number of things as you go, and construct an illusion that somehow, you are being productive.
When I hire people, aside from fixing mistakes in the documentation they encounter, I want them to have all the time they need to learn, ideally with a set of curated onboarding tasks that progressively give them expertise in the system they are intended to operate in. There is a time to "make things better" and, in my opinion, that's after getting up to speed in a way that lets them collaborate with the team productively. I won't begrudge someone the opportunity to clean things up, but each individual is the ultimate arbiter of how to be effective at the end of the day.
If you have the time to create a “curated onboarding experience that results in progressively more expertise”, yes I agree please do that instead.
I’ve never onboarded onto a team that had such an experience, and I’ve never heard of a team that maintains one consistently. Every time I’ve seen it proposed and tried, it takes time away from an experienced person (who could have done those same tasks faster) and goes out of date as soon as one new teammate has completed the curated set of tasks.
If you can make it durable, my utmost respect!
In the meanwhile, deleting obviously dead code has never failed me as a strategy for myself or a new hire (typically steered by a simple oncall-type task to start the investigation off).
And best part, there is always more to clean up for the next new hire lol.
I don't mean refactor everything or spend six months until everything is perfect. I've just found it's a pretty reliable way to find low hanging fruit and get an easy first commit in. "Get your feet wet" I believe is the English expression. You have to get familiar with the build process anyway, you might as well improve it while your attention is on it.
If you have a curated onboarding experience, that's great, but then you'll probably have good documentation and nice setups, so my suggestion obviously doesn't apply. I wish that were common, but in my experience it has been a rarity, not the common case.
In enterprise codebases where there is poor documentation and lots of tribal knowledge, noting down just those 2 lines for every new piece of info is a quick way to break down the knowledge gaps created by exactly that.
It is exactly due to such disdain for documentation that most people find it hard to navigate large codebases. Documentation is not just for noting things down pedantically but also a thinking tool and a temporary thought buffer.
And no one pushes code to production to validate assumptions. Not if you have 100 clients and you are not doing CD.
> And no one pushes code to production to validate assumptions.
I always have, with only the rare issue, and by and large I've been rewarded for it.
> Documentation is not just for noting things down pedantically but also a thinking tool and a temporary thought buffer.
Sure, but why not treat your codebase as a temporary thought buffer? I do, and it’s consistently worked and improved every system I’ve worked with. No teammate has ever complained about this strategy. If anything it’s typically adopted by teammates.
eg “oh, this list is never modified”: rather than taking a 2 line note, I’ll push a code change to use an ImmutableList.
The knowledge is now documented and enforced by the compiler and code reviews (if the type changes, people talk about it), and it allows me to keep improving that part of the codebase months later without code conflicts or needing to re-make my changes from a notes file.
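For concreteness, a sketch of that kind of change, assuming Guava's ImmutableList; SupportedRegions is a made-up example class:

    import com.google.common.collect.ImmutableList;
    import java.util.List;

    // Sketch only: "SupportedRegions" is a made-up example class.
    final class SupportedRegions {
      // Before: private final List<String> regions;  // note to self: "nothing ever modifies this"
      // After: the note is encoded in the type and enforced by the compiler.
      private final ImmutableList<String> regions;

      SupportedRegions(List<String> regions) {
        this.regions = ImmutableList.copyOf(regions);  // defensive copy; any mutation attempt now throws
      }

      ImmutableList<String> regions() {
        return regions;
      }
    }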
1-2 line refactors - exclusively better than 1-2 line notes. This scales to any number of lines where the size of the note is equivalent to the size of the possible code change.
Meanwhile, please do take notes and document when it’s at least 10x shorter to grok than the current code or possible code change.
Re dead code: unless you've got an expert on hand and they know what they're talking about (less likely than anyone wants to admit), I've had fantastic success simply emitting metrics on any line of code I think is unused.
And then ignore it for a month and go do other things.
If it hasn't been hit by then, you've got pretty good evidence that it's not very important. Start deleting.
You can always replace it with an alerting metric if you really suspect it'll matter months later, since you can restore from history when you see it + have better knowledge of the system in the future. And if volume is truly that low and the alert fires, just do it by hand - if it matters by that point it's probably worth a lot of money, and that's time well spent.
In minor ways, yes. In return, though, I've successfully used it to make some pretty significant cleanups, removing years-old abandonware that nobody was willing to touch. I consider it extremely worth the risk, and the cost is utterly inconsequential - once it's set up, it takes less than a minute to decide "nope, let's just wait and find out", write that one line of code, and make a calendar event a month or three later.
The main ways it has backfired have been:
1) Once, the removed thing was used by one team to produce an annual report that they didn't realize they were still running. It threw a pretty major wrench into that one report... but they were surprised too, so they quickly modernized it (the old version had known issues), and then that team was able to remove a bunch of dead code because we (the only recipients of its output) could show that X had not been used in a year.
2) And once it revealed bugs in metrics systems / dependency chains... by leading to deleting code that we thought was unused, but was in fact used semi-regularly. An unfortunately-less-than-quick rollback fixed that, and we then went on to find a number of other bugs because moderate chunks of the codebase were receiving no-op metrics emitters for some reason.
Frankly I consider having only ^ those two cases to be luck, but they were both definitely worth it in aggregate. And 2 taught me to verify that surrounding metrics existed / correlated with logs before trusting a lack of signals. I've found multiple other gaps that way :|
3) More than once it has led to not removing code that everyone thought was unused (but we had developed a minor "check first" habit due to past successes), thus needing to maintain some convoluted garbage for almost zero benefit. Bleh, politics.
4) Others blindly applying the habit has led to some careless commits that added it to surprisingly hot loops, adding delay to the point that it caused significantly more timeouts (generally: rapidly filled buffers and paused to flush, then repeat 1000x). The good news is that once they've done this once, essentially every author immediately starts paying attention to their hot loops forever, and we moved more things to batch emissions rather than incremental. Also a net win, it's a pretty cheap lesson.
Depending on what your system does, there can be code that only gets run monthly, quarterly, or annually, or when the other thing that "never" changes gets changed.
There was recently a minor annoyance at my work where some firewall rules got deleted because they weren't used during an audit window, and then X months later a deploy got held up because the build server couldn't reach the target servers for that one rarely-changed system.
Agree on the first part, but there are going to be plenty of cases where running in a debugger isn't sufficient to satisfy oneself that something never happens. Unless we're talking about "does this happen under this specific scenario I know about".
If something never happens, that's not a part of the codebase you need to understand right now. If your task is cleaning up, sure, but your task here is learning what happens, not what doesn't happen.
And it's relevant in more important ways than "I'm cleaning up and this isn't used, so I should delete it". Quite often, when cleaning up, you find pieces of code that make assumptions that contradict each other. It's a real head scratcher until you figure out that one of these pieces of code is obsolete and never executed anymore (or has never been executed), so its assumptions are irrelevant. The cleanup resulting from that might not look like much, but it removes a major trap when trying to make sense of the code.
This really depends, but for things that really are confusing you and that your team won't help with, why not add the metric? You add observability and can learn something about how that flow is utilized.
Imagine a metric firing on something that happens 1000 times a millisecond. Imagine a metric firing in a way that grabs a mutex at an inopportune time and creates a deadlock. Imagine a metric firing that suddenly spams your metrics server with data that nobody except you cares about in this one temporal moment. There are so many reasons this is a bad idea. If you must do it, do it locally; there are so many easier ways to determine "if something is used."
If adding a few counters to your codebase causes any of these issues, your current team is already dysfunctional and will get more productive by improving the system.
eg “the metrics server goes down”: metrics should be sent pre-aggregated and in batches, possibly even polled for (rough sketch below). Having 10 metrics that fire 10^N times a second shouldn’t impact the metrics server, ever! Metrics servers are impacted by cardinality, so yes, don’t create 10^N unique metrics.
eg “something that fires 1000 times a millisecond”: if the rest of the team doesn’t know where their hot loops are, and doesn’t have them commented or caught in code review, it’s very likely someone will add some logic to one that will hurt. The metric is just one such change, and you might as well find out sooner rather than later. If it would cripple your service, that is what canary deployments and other automated performance testing in CI/CD are for.
eg “metric causes a deadlock”: umm, this is just too contrived. I’ve never used or heard of any metrics library built that way. That type of metrics client would cripple even a very experienced developer on that team.
Summary: if adding a metric can cripple your service or team, reconsider priorities. Fixing that will speed you up even in the medium term.
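The sketch I mentioned above: a rough illustration of pre-aggregated, batched counters, with a made-up MetricsSink interface standing in for whatever client you actually use:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.LongAdder;

    // Sketch only: pre-aggregate counters in-process and flush on a timer, so hot paths
    // only touch a LongAdder and the metrics server sees one batch per interval.
    final class BatchedCounters {
      interface MetricsSink { void emit(String name, long value); }  // hypothetical stand-in

      private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();
      private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

      BatchedCounters(MetricsSink sink) {
        // One batch every 10 seconds, regardless of how often add() is called.
        flusher.scheduleAtFixedRate(() -> flush(sink), 10, 10, TimeUnit.SECONDS);
      }

      void add(String name, long delta) {
        counters.computeIfAbsent(name, k -> new LongAdder()).add(delta);
      }

      private void flush(MetricsSink sink) {
        counters.forEach((name, adder) -> {
          long value = adder.sumThenReset();
          if (value > 0) {
            sink.emit(name, value);  // one call per counter per interval, not per increment
          }
        });
      }
    }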
I don't mean to diminish any of your achievements. Shipping amazing products is orthogonal to creating a robust codebase/system.
I also don't mean to be negative towards you or your experience. Instead, I'll pose the question to you: What would you call a team/system that describes itself as "at risk of being crippled by a single line change like adding a metric"?
At the end of the day, the product is the only thing that matters and you can be and should be proud of the products you've launched. That said, are you proud of the way you and your team built them? Have you or anyone on your team proclaimed it was a joy to build/improve that product?
idk... after all I'm just some guy on the internet. Best wishes to you! I truly hope you did not, do not, and never do work on a dysfunctional team/system.
Recording a metric is not a cheap operation in many contexts. I work with code we typically instrument at a microsecond to millisecond granularity. Sampling profilers typically record instruction samples at most every 200 us or so, so adding instrumentation at a finer granularity than this has a dramatic effect on performance. If a game operates at 60 fps, adding a metric at an inopportune time will render the game unplayable.
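To put rough, assumed numbers on it: at 60 fps you have about 16.7 ms per frame. If an instrumented path runs on the order of 1000 times per millisecond and each emission costs even ~1 µs, that's roughly 1 ms of overhead for every millisecond that loop runs, i.e. you've about doubled its cost, and the frame budget evaporates.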
No, I haven't been on teams that dysfunctional. These restrictions are born out of necessity. My point was that you seem to have a somewhat narrow view on software as a whole, and you are extrapolating from your personal experience general advice that simply doesn't apply.
This is not what the post was talking about, though. If you have a service where something truly fires 1000 times a millisecond, I can almost guarantee you already have other metrics up, and thus won't need anything else.
I work with realtime graphics. Thousands of things are indeed happening every millisecond and I sure wouldn't want a new hire blindly instrumenting those things just to "see if they happen." There are appropriate tools to measure performance at this granularity.