jparkie's comments

jparkie · on April 14, 2020

+1 for Modern Metaprogramming.

I know some people are against metaprogramming because they believe the abstractions hide the intrinsic of how the underlying code will execute, but I would love to write substantial tests in C without relying on FFI to Python or C++ to perform property-based testing, complex fuzzing, and whatever. I feel metaprogramming would be a huge boon for C tooling and developer productivity.

oldiob · on April 14, 2020

In my point of view, there's a difference between abstraction created by the language, e.g. lambdas or virtual table in C++, and abstraction created by the programmers via the CPP.

The former is compiler dependent and you cannot know how it's implemented. The former is simple text substitution and you're the one implementing it. I often find myself creating small embedded languages in CPP for making abstraction, and I know exactly what C code it's going to generate and thus the penalty if there's any.

People that are afraid of the preprocessor simply don't understand how powerful it's in good hands.

jparkie · on May 8, 2019

I get it a lot, so I set a precommit hook to render all my notebooks to HTML. With GitHub Pages, I can access them easily.

pimlottc · on May 9, 2019

Sounds useful, care to share that hook?

jparkie · on March 15, 2019

I agree with other's opinions that asking a Dynamic Programming question is a weak signal for a candidate's competencies in general software development work like frontend, backend, mobile applications, and the like.

However, I would like to counter a common opinion that eventually follows in similar threads and some of my social circles: "Algorithms is an undergraduate course in which students learn specialized solutions to esoteric math problems, all of which they ultimately forget when they spend real time working in the industry, so the knowledge shouldn't be relevant in an interview."

It is fair that if you don't exercise what you learned, you will gradually forget it, but I believe its still important for a candidate to cherish algorithm design and analysis because I consider it a great toolbox of the trade.

For all the concepts and the techniques I learned from my undergraduate course in data structures and algorithms, I utilized them as the basis of my Software Engineering Toolbox.

What is my Software Engineering Toolbox? It is a collection of algorithm design concepts and techniques that I can employ anytime I am faced with a novel problem or a problem whose standard Stack Overflow solutions are inadequate.

The Software Engineering Toolbox contains the following: Arrays, Linked List, Stack, Queue, Hash Table, Binary Search Tree, Priority Queue, Set, Trie, Sorting, Binary Search, Divide-and-Conquer, Backtracking, Dynamic Programming, Range Query Trees, Graph Algorithms, Bit Mask Optimizations, Square Root Bucket Optimizations, and Multiple Pointer Optimizations.

First, I rarely implement my own data structures from scratch; all the programming languages that I use provide great standard libraries. Yet, I always remind myself the use of these data structures, because you would be surprised with the amount of people you can meet who tries to answer a problem that boils down to a set membership with a HashMap<Integer, Boolean> when they can just use a HashSet<Integer> or with the amount of people who manually treat an Array as a Stack or a Queue when those data structures are readily available.

Second, I rarely implement my own sort or search functions from scratch; again, all the programming languages that I use provide great optimized functions. I treat Sorting and Binary Search as techniques that lend themselves to optimizing the locality of a data set such that you can easily answer basic statistics, find the bucket for a token in a ring, or merge data sets. These are simple techniques developers should readily know to exist when optimizing their code.

Third, why do I have Divide-and-Conquer and Backtracking in my toolbox? I believe that no matter what problem you face, you should be able to bruteforce it. You can't always tell someone that you can't implement something because you didn't find a Stack Overflow answer or you didn't deduce a collage of standard library functions or third-party libraries to solve your problem. Using these techniques, you can at least arrive at a pretty weak solution which is still a solution. To actually address Divide-and-Conquer and Backtracking in relation to bruteforcing, these techniques allow you to easily traverse through a search space to filter for a certain combination or permutation of items that satisfy a customer's constraints. Furthermore, Backtracking is a relatively easy to medium difficulty technique that is the basis for a lot of the Graph Algorithms people keep balking at!

Fourth, Dynamic Programming. To be honest, I rarely utilize it, but I appreciate it because the common subproblem types of 1D, 2D, Range, and Sub-Trees taught me how to order subproblems successively to solve other problems, which applies beyond DP. I discourage people from trying to pattern match Dynamic Programming problems and solutions, and I encourage them to truly digest CLRS and understand its 4 rules for Dynamic Programming to consider possible dependencies and structures for various combinations and permutations of the problem parameters to identify what the optimal substructure really is.

Finally, the remaining things in my toolbox are included in my toolbox because they are useful in my work experience with real-time network anomaly detection and streaming analytics. For example, topological sorting distributed tracing events into a rooted tree that I encode into a bit vector using a left-child right sibling binary tree. Not everyone will do this, but with my toolbox I never worry much about facing new frontiers of problems or being tasked to create libraries and tools for myself and others to use instead of being at whims of someone else on the Internet.

Overall, I hope people can look back at their courses in algorithm design and analysis and say, "Yeah, the problems and the solutions were really weird, but the techniques hidden away within them are actually GENERALIZABLE and are a fundamental basis to build new things and solve complex problems!"

Nonetheless, I don't want anyone who is weak in algorithms design and analysis to feel discourage. Play to your own strengths whatever they maybe, or you can always strengthen them; it's never too late.

Finally, my Software Engineering Toolbox has way more stuff like actual "engineering" stuff like automatic formatters, linters, fuzzers, automation, tests, mocks, coverage, "Infrastructure as Code", and blah blah blah. :")

I would like to close by saying that a good engineer knows the right tools for the job. :)

jparkie · on Dec 6, 2016

I'm thinking of building a test suite which takes a distribution of distinct elements and run a test for FPP and FNP. Then it asserts whether the the actual FPP and FNP is within an epsilon of the calculated probability.

I want to add statistics for mine to hopefully be able to monitor them at least by some estimate.

orlp · on Dec 6, 2016

I would like to add to this that it's incredibly important you use fixed seed, or at least logged seed tests for this.

If you fail to do this you get unreproducable ghost tests that you can never investigate if they happen to fail rarely.

jparkie · on Dec 6, 2016

I haven't looked into the relationship with ROC curves.

I'm not sure of an efficient general method for generating statistics. I know empirically testing the structures is the easy way. For Bloom Filters which evict old data, you can calculate the probability of FN by calculating the probability that a given element is a duplicate but reported as distinct.

jparkie · on Dec 6, 2016

This library provide probabilistic set membership test on an individual element basis as the stream progresses with a fixed bound on memory. Every call to classifyDistinct() progresses the internal history of a ProbabilisticDeDuplicator.

Traditionally, probabilistic set membership tests were accomplished by using Bloom Filters. However, Bloom Filters only work well on finite sets because as Bloom Filters age (fill up), the false positive rate approaches 1. This issue was addressed by Stable Bloom Filters which frees up space for inserts; however, this introduces false negatives.

This library contributes three implementations of de-duplication algorithms centered around Bloom Filter variants whose design and replacement strategy allows it to reach stability faster than a Stable Bloom Filter while reducing the false FNR by even several orders of magnitude.

jparkie · on Dec 5, 2016

My bad. README/Documentation error. The tests are fine though.