
I manage a data science team and revamped the hiring process pretty substantially about a year ago, to good results. Nothing in here is particularly original, but here's what we do:

1. Break down "data science" into several different roles–in our case, Analyst (business-oriented), Scientist (stats-heavy), Engineer (software-heavy). Turns out that what we mostly want are Engineers-Analysts, so our process screens heavily for those.

2. Figure out which types of people can be trained to be good at those roles, given the team's current skillset. I opted to look primarily for people with strong analysis skills and some engineering.

3. Design interview tasks/questions that screen for those abilities. In my case, the main thing I did was make sure that the interviews depended very little on pre-existing knowledge, and a lot on resourcefulness/creativity/etc. E.g. the (2-hour) takehome is explicitly designed to be heavily googleable.

4. Develop phone screens that are very good at filtering people quickly, so that we don't waste candidates' time. By the time someone gets to an onsite interview on our team there's something like a 50% chance they'll get an offer.

On the candidate side, when I'm applying I try to figure out first and foremost what a company means by "data scientist", usually by networking & talking to someone who already works there. This filters out maybe 90% of jobs with that title, and then I put more serious effort into the rest.



I've been looking for a job, and I've found how vaguely organizations define their data scientist and analyst roles in their job postings really frustrating. They tend to have a short description of the role, which is generally filled with buzzwords, followed by a list of requirements. I wish organizations would talk about what they wanted to do with their data instead.

For instance, a common description might say the candidate will be working with "big data" to help with "data-driven initiatives" and the requirements will be something like "knowledge of Excel, with a Masters in Statistics, or equivalent experience".

It's really hard to tailor a cover-letter or a resume to a job posting like that. For one thing, I can't even imagine what kind of work they are doing if they are using Excel for "big data". Second of all, I currently have a job, and writing cover letters and creating resumes takes a lot of time. By the time I get to the phone screen I've probably already spent at least a couple of hours applying. Plus, in the interest of keeping my cover letter and resume short, I have to leave off a fair amount of my experience and performance metrics.

Honestly, at this point I think I'm just going to start reaching out to people in the fields I'm interested in and asking them if they know of any roles that would fit my skill-set. The way I see it, I'd at least have a chance of getting feedback from someone who can view my skill-set holistically, rather than HR, who will let me know that I don't tick all their boxes (or vice versa).


>>I've been looking for a job, and I've found how vaguely organizations define their data scientist and analyst roles in their job postings really frustrating.

I lead a Data Science team, and part of the struggle with writing sensible job descriptions is that there are too many people providing input into the job description. HR can also put their hand in the pot when they try to use buzzwords (e.g. Hadoop) to internally justify why a role with 2 years of experience needs to be paid like other roles (e.g. traditional Excel-based Analyst) with 5-10 years of experience.

>>They tend to have a short description of the role, which is generally filled with buzzwords, followed by a list of requirements. I wish organizations would talk about what they wanted to do with their data instead.

One major challenge for Data Scientists is how hyped the role is, leading to people in an organization believing whatever they want about Data Scientists. Are you a leader who wants a business analyst who can use software and interface with IT? Data Scientist. Are you an engineering manager who wants a person who can interface with the business and use machine learning? Data Scientist. Are you a VP who thinks big data and ML are the solution to your bad or non-existent data? Data Scientist. Do you want somebody who can exhale the maximum amount of hot air while still sounding like a tech and math genius? Data Scientist.

Also add in that business people with minimal experience in modern Analytics are trying to build up Data Science and Analytics capabilities in their own part of the organization because they realize Excel is not the answer to every question. I've spent a lot of time speaking with people to help them understand the type of people they need to hire. Sometimes people are sensible, and sensible job descriptions and expectations come from that. Other times they are adamant about what they need (even if they are wrong) and the end result is convoluted job descriptions that are either never filled or filled with the wrong person.


How do you choose candidates for an interview?

I can't even get a call back to line up an interview. I have done NLP, have a master's degree in computer science from Penn, have plenty of experience with big data tools such as HDFS and Hive, and spend my free time doing whatever data science I can, but I'm obviously doing something wrong.

Any suggestions? Here is my LinkedIn account: https://www.linkedin.com/in/karl-dailey-02557b65


May I offer a couple of quick suggestions? I've never hired for Data Scientist positions, but I've hired for plenty of other ones.

1) Change your profile picture to something serious. Get a collared shirt and a nice background outdoors. No tie. People will unconsciously judge you on your picture, so you want something that shows you're a professional, but you're confident and happy in life.

2) Think hard about your job titles. Your latest job is far more than just a "Data Analyst". It seems closer to a Scientist or Engineer role, even if your company doesn't call it that. "Analyst" makes people think of entry-level positions. Your DBA-Programmer position is more like a Software Engineer/DBA/System Admin position. If you can't pick one, generally those all-in-one positions can be known as Systems Engineers. Whatever you do, make sure that DBA isn't the primary thing people see. In general, sell your previous positions more. Like "IT". That's not a title, that's a department.

3) Expand on your projects section more.

4) Make sure your resume matches your LinkedIn. People absolutely look you up on there. I did it all the time. When the two didn't match, I was suspicious.

Good luck.


I've got some feedback and you might see it as mean and brutal. But what I'm doing is being honest about how I would evaluate a resume that looked like your LinkedIn profile. I'm just telling you what I'm thinking.

The first thing I did was look at your current position and I immediately became skeptical. You have been at your position for 10 months, yet you have done a lot of fancy-sounding things that seem to have no connection to each other. To me this is a huge red flag that you're just grabbing some data, churning through some code you found on StackOverflow or in a book/documentation, and then making grand claims about the work you are doing at Comcast.

Your use of buzzwords only makes me think this more.

For example "Churn forecasting: probabilistic modeling of deletion rates on dVR with a beta binomial distribution to forecast the number of devices that carried a show over 300 days. Maximum Likelihood Estimation was used to derive distribution parameters."

This honestly looks like you saw a tutorial about an R-library and then you copy/pasted the documentation for one of the functions into your profile. By using so many buzzwords (including the term deletion rates), you've left me wondering if this is something you actually did or just something you made up. If you actually defined and derived the Likelihood function you should state that. Otherwise saying you used MLE means nothing since many functions in R use MLE under the hood.

Another one is "Built node.js/angularjs integrated tool to allow analysis team to test the sensitivity of KNIME workflows for forecasting." In my mind I'm thinking, what exactly does web programming have to do with KNIME? What exactly do you mean by sensitivity? How are KNIME workflows sensitive? Do you mean you are checking the accuracy of forecasting models built in KNIME? Lots of Data Scientists will have no idea what node.js and angular are and many will not care when they find out. Data Scientists may build charts using JS, but very few will be building Web based UIs themselves (I assume this is what you are trying to say?)

To be honest, I have no idea what you do in your job. Are you actually a Data Scientist as part of your job duties or did you develop an interest in Data Science and you have access to Comcast data so you've been playing around on your own?

As a hiring manager, I want to know the person I'm evaluating has spent time working on a business problem from start to finish. This means they thought about the problem, defined how to answer it (or figured out how the business owners want them to define it), figured out which data was relevant, and thought through a model if it's a business problem that can be addressed by a statistical or ML method. That includes thinking through how the output of the model will be used by the larger business, and making a lot of mistakes and changes along the way.

If the person is just getting into Data Science, I want to know the person has the analytical and metacognition skills to think through a business problem. If they have that then I evaluate their thinking with models.

I'd recommend you trim down your LinkedIn profile and focus on the projects which took you some time (e.g. a month or longer) and where you had to iterate a solution. Use these projects to illustrate what you actually do at your job.

It's possible that you've done everything you state. You may be working with a team of people. At a large company you'll have other teams supporting you with data collection, cleaning, and putting the data into a shape that makes it amenable to machine learning and joining with other relevant sources. If that's the case, focus on projects where you were leading the project in some way and emphasize how you worked with and led the team.

But based on your LinkedIn profile I'm skeptical and a resume like this would not pass my filter. The HR person I work with would see your profile as impenetrable. If they can't figure out what you do and what you have done they are not going to send your resume to me.

Finally, can you move into a role at Comcast that gives you an official "Data Scientist" title? From what I've heard Comcast has a solid Data Science practice and it seems like they have some really interesting data. Finding that combination is really hard. Many good or potential Data Scientists end up at jobs where they are unhappy because the Data Science culture and/or data is awful.


> I wish organizations would talk about what they wanted to do with their data instead.

It may seem like a daft question, but have you tried asking them? They may clam up and refuse to give you anything, but I would imagine most small companies would be willing to talk about what the role involves. It's a bit late at the end of the interview to be asking "So what would I be doing?".

You may find out that they're working on some data which is e.g. image-heavy and you happen to be an image processing expert.


This is a great breakdown. Data Science can mean a lot of different things and I think the three roles you defined are spot on. At LinkedIn, what used to be called Data Science is now exclusively analyst roles. We have a separate org (hundreds of people) called Relevance which is part of the engineering org and is composed of scientists and engineers. Everyone in that org falls somewhere on the scientist to engineer spectrum with ML / Stats backgrounds and a bias towards ML engineers. Most of the people who interview would consider the relevance roles "Data Science" but the term within the company is still tightly coupled with pure analyst functions.


I wish all interviewing was like that above. Typically it's: create some incoherent job description, run through a bunch of semi-random candidates, pick a random guy that fits the budget, rinse/repeat...

Candidates know that a typical in-person interview has a 1/10 chance (or worse) of leading to an offer, so most candidates, including good ones, don't care to spend much time interviewing with any one employer or do much to increase their unknown chances.

As a result employers don't get the right candidates, and then whine about lack of talent and pay for crappy tools that just sort resumes in a half-assed way... good grief.


Could you talk more about your phone screens? What criteria are you filtering on? What sorts of questions have you found to be effective at stratifying candidates?


I'm mostly looking for people who want to learn about engineering with largeish datasets (~100gb/day for us), and have some of the prerequisite skills. Our codebase is mostly in Spark/Scala and uses functional programming idioms, so I'm looking for people who either know or want to learn how to use those. I'm also specifically trying to filter out people who mostly want a stats-heavy, machine learning heavy job, since that's not what we do.

An engineer who wants to learn data science is a great fit for us, an academic who wants to write R all day is not (though an academic who wants to learn engineering/functional programming is fine!)

Beyond that, I ask some questions about projects they've worked on, and in particular, how their approach would change if assumptions were different. Here I'm looking for the ability to reason backwards from a business goal, as opposed to somewhat blindly applying statistical techniques.

If they do well on these, we send the take-home exam. As previously noted, this is specifically designed to require relatively little knowledge but heavily test analysis skills, and lightly test programming skills. It's almost impossible to complete this exam without using Google effectively, so that's another thing I'm testing.


Can I ask why functional programming in particular? I can see why you might want to avoid Java for big data, but isn't the average ML algo more in the procedural mould?

Would Python with NumPy not be a better fit? Or Fortran with some handwave interface code?

Back in the early '80s, when I did map reduce, we used PL1/G.


The most direct reason is because the current team enjoys functional programming.

From a business standpoint though, there are a few main reasons:

–Data pipelines are well modeled as functions: they take a few input datasets, return a few outputs at the end, and do a ton of processing in between

–FP idioms generally make parallelization easier, and this is very important for the datasets we're dealing with

–A strong type system like Scala's lets us prevent many runtime errors, which is quite important when your pipelines can take several hours

–It's fairly trivial to wrap a statistical/ML algorithm in a pure functional interface, even if the algorithm itself is imperative
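
To illustrate the first and last points above, here is a minimal sketch (in Python rather than Scala, and not the team's actual code). The record shape, the stage names, and the `pipeline` helper are all hypothetical; the only idea carried over is that each stage takes a dataset and returns a new one, even when the stage is imperative inside.

```python
from functools import reduce
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, float]  # hypothetical record: (user_id, amount)


def drop_negative(rows: List[Row]) -> List[Row]:
    """Pure stage: returns a new list and leaves the input untouched."""
    return [row for row in rows if row[1] >= 0]


def total_per_user(rows: List[Row]) -> List[Row]:
    """Imperative inside (a local dict mutated in a loop), pure from outside."""
    totals: Dict[str, float] = {}
    for user, amount in rows:
        totals[user] = totals.get(user, 0.0) + amount
    return sorted(totals.items())


def pipeline(*stages: Callable) -> Callable:
    """Compose stages left to right into a single function."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)


etl = pipeline(drop_negative, total_per_user)
raw = [("a", 1.0), ("b", -2.0), ("a", 2.5), ("b", 4.0)]
result = etl(raw)
```

Because no stage mutates its input, running `etl(raw)` twice gives the same answer, and independent inputs could be processed in parallel without coordination.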


Have you had performance issues getting things to conform to functional paradigms?

For example, I've found that as a pipeline gets optimized for production use, it needs to preallocate all of its output space and then modify things in place at each step (like a one-hot encoder flipping a few bits in specific rows of a zeroed array instead of allocating new ones and copying them in).

I find it difficult to reconcile this sort of code with a "pure functions without side effects" philosophy and still have it perform at an acceptable level.


We're mostly doing ETL on large datasets, so the code needs to parallelize well, but beyond that performance isn't really a big concern. We use ML in research, but no models in production, because the costs of increased maintenance/lost transparency generally outweigh the benefits in our use case.

In jobs that were heavy on ML, I would use high-performance tools for the models (imperative code, numeric computing packages etc.) and functional code for the ETL, which worked pretty well–no need to be dogmatic about it, a 70% pure codebase is still generally easier to reason about than a 20% pure codebase.
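
One way that mix can look for the one-hot encoder case raised in the question is sketched below. This is an illustration under assumptions, not code from either job: the function name and the flat row-major layout are hypothetical. The output buffer is preallocated once and mutated in place, but the mutation never escapes the function, so callers still see a pure interface.

```python
from typing import Sequence


def one_hot(labels: Sequence[int], num_classes: int) -> bytearray:
    """Return a flat, row-major one-hot encoding of `labels`.

    The zeroed output is allocated once; each row then has a single
    byte flipped in place. The input sequence is never modified.
    """
    out = bytearray(len(labels) * num_classes)  # all zeros
    for row, label in enumerate(labels):
        out[row * num_classes + label] = 1
    return out


labels = (2, 0, 1)
encoded = one_hot(labels, 3)
```

The in-place writes are confined to a buffer that no one else holds a reference to yet, which is usually the part of "no side effects" that matters for reasoning about the surrounding pipeline.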


Interesting. I will have to have a proper look at Scala when I get my baby cluster up and running.


Functional programming for a lot of numerical computing maps easier to mathematical notation. However, Scala is usually a worse choice than Java for numerical computing since everything is a boxed type.


This is straight up false, why do you think Scala doesn't have primitive values? Long will be either a value or reference type as needed, despite being spelled only one way instead of two different ways in java.


A 'primitive type' is one which can be directly operated on by intrinsic CPU instructions. My understanding of Scala was that all objects (such as Long, Int...) are encapsulated inside of an object.

Therefore an array of boxed types will not be memory aligned; and any vector instructions (which are very important to scientific computing) cannot be used.

Perhaps something has changed in Scala land since I last looked(??).


No, it has worked the way GP wrote since approximately forever.


Can I ask what are the rough percentages of the three types of data science jobs in the industry? And is there an easy way to search for only one type of jobs (such as using some particular keywords on indeed.com)? I have some advantage only in the scientist/stats-heavy type of data science jobs and would like to better focus my job-searching effort. Thanks.



