I'm the tech lead for the R&D team at Automated Insights, a company that turns raw data into human-sounding narratives. Our SaaS platform, Wordsmith, generated billions of articles last year for companies like the AP, Yahoo, and Activision.
The team's responsibility is mainly algorithm development (machine learning and natural language processing, as well as more traditional methods), with the goal of making Wordsmith more powerful and easier to use. We primarily develop in Python (spaCy, gensim, scikit-learn, TensorFlow) and NodeJS. We're looking for an engineer with a couple of years' experience and familiarity with machine learning. A research background and/or advanced degree is a plus.
Interview: Quick chat with HR, technical phone screen, at-home programming evaluation, on-site interview. No adversarial whiteboard sessions or trivia quizzes.
Many of your criticisms are totally valid. Lots of the phrasing is awkward - even the lede is really bad ("Tesla, Inc. (TSLA) has been having a set of eventful trading activity"...wat). And it feels really deceptive to put a human byline on an automated article.
We're pretty open about the fact that our solution to this problem is not "magical" at all [1, 2] - it's good, old-fashioned automation. This approach allows our customers to QA their content heavily before pushing it to production, which eliminates many of the problems with awkward/incorrect phrasing that people who rely more heavily on machine learning tend to run into. And the news articles we publish always have a note at the end saying that they were generated by Automated Insights, and don't include a human byline.
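To make "good, old-fashioned automation" a little more concrete, here's a toy sketch of the template approach in Python (made-up field names, nothing like our actual engine): every phrase is human-authored up front, so the whole space of possible output can be QA'd before anything ships.

    # Toy template-based narrative generation. Every phrase below is
    # human-written, so all possible output can be reviewed in advance.
    def earnings_sentence(row):
        """Pick a human-written template based on the data, then fill it in."""
        direction = "beat" if row["eps"] > row["eps_estimate"] else "missed"
        template = (
            "{name} ({ticker}) {direction} analyst expectations, reporting "
            "earnings of ${eps:.2f} per share against an estimate of "
            "${eps_estimate:.2f}."
        )
        return template.format(direction=direction, **row)

    print(earnings_sentence({
        "name": "Example Corp", "ticker": "EXMP",
        "eps": 1.42, "eps_estimate": 1.30,
    }))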
There is real value in this type of reporting - a recent study [3] found that the articles we produce for less well-known publicly-traded companies have increased the trading volume for those companies. The idea is that, yes, the content is fairly formulaic, but there's now reporting on companies that had very little coverage before we existed. There are similar arguments for the mass personalization work we've done for companies like Activision and Yahoo - having prose that describes raw data (even if it is formulaic to an extent) is often better than not having prose.
I don't understand what value the prose provides over spending the same amount of effort producing clear, easy to read infographics.
Instead of producing awkward and difficult-to-read English sentences, why not use the same content generator to produce completely accurate and easier to read dynamic data visualizations?
If you do automated content well, it's not awkward and difficult to read ;)
As far as visuals vs prose, I see it as "both-and" rather than "either-or". And in addition to our journalism and personalization work, we also integrate with interactive visualization tools like Tableau.
Increased the trading volume? Seriously, you call that value? What are you smoking? That's called pump and dump and is illegal, my friend, and even if nothing is actually being dumped it's downright sleazy-car-salesman to me. Was that recent study also automated?
Increased trading volume generally just means better price discovery. Why do you think increased trading volume means it's "pump and dump?" When I trade SPY, I increase the trading volume in the underlying S&P 500 components -- am I pumping and dumping then?
Increased trading volume driven by bot-written blogspam produced by the company PR department with the express intent of pumping their share price definitely isn't "better price discovery"
It's a good idea to follow the links before accusing someone of illegal behavior.
Here's the link to the study again: [1]. This is specifically in reference to the reporting on quarterly earnings reports that we automate for the Associated Press. It's an objective summary of the financial performance of these companies that appears in news outlets across the country (for example, [2,3,4,5,6]). The companies being reported on have no influence over the content of the articles.
From the summary:
>These articles synthesize information from firms’ press releases, analyst reports and stock performance, and are widely disseminated by major news outlets within hours of publication...This study found a positive effect between the public dissemination of objective information and market efficiency.
Fair response and I withdraw and apologise for the implicit accusation against your company specifically. I'm sure you would agree less benign actors exist, which is why I'm reflexively sceptical of the idea that a link between more reporting and more trading volume is an indication of its merit.
I guess if I'd taken the time to read your original link we could have a more interesting discussion on whether pretty basic earnings information in a format more friendly and available to non-professional investors was adding noise to the market or providing a useful counterweight to the amount of free publicity that more prominent companies' earnings get. But I've probably already poisoned the well on this one.
While I definitely agree that advances in ML in the last 20 years are extremely important (and potentially revolutionary), I think this article misses the mark in a few places.
>Now, instead of humans designing algorithms to be executed by a computer, the computer is designing the algorithms. (Albeit guided by human-devised algorithms)
This line is way off in both tone and substance. On tone, it really underplays the human effort involved in effective machine learning (as it is practiced in 2017) and anthropomorphizes "machines" to an unreasonable extent. In substance, I fail to see how a machine that "designs its own algorithms" according to an algorithm designed and implemented by a human is fundamentally different than an algorithm coded directly by a human. To use the author's example, machine learning allows humans to build complex software systems in less time just as a bicycle allows humans to cover more distance with less energy. It's a big improvement, but it's not, say, teleportation.
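To make that concrete, the "algorithm that designs the algorithm" is itself a few lines of ordinary, human-written code. A toy sketch (made-up data):

    # Fitting y = w*x by gradient descent. The learned value of w is
    # "designed by the computer" only in the sense that this entirely
    # human-designed procedure computes it.
    data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # noisy samples of y ~ 2x

    w, lr = 0.0, 0.05  # human-chosen initialization and learning rate
    for _ in range(200):
        # human-chosen loss (mean squared error) and update rule
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad

    print(f"learned w = {w:.2f}")  # close to 2.0

The human picks the model family, the loss, the optimizer, and the data; the machine fills in a few numbers.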
>it is only now that the machines are creating themselves, at least to a degree. (And, by extension, there is at least a plausible path to general intelligence)
I could not disagree more strongly with this addendum. Simply put, I fail to see any path from state-of-the-art ML/DL research today to AGI, and I would even go so far as to say that humans have made approximately zero progress on this task since it was first formulated in the 50s. I think we know about as much about "intelligence" (and consequently, what would constitute AGI) as star-gazers in ancient times knew about the universe. That's not to say that it will take millennia to invent AGI, but the path to get there is probably quite orthogonal to modern ML research.
>Simply put, I fail to see any path from state-of-the-art ML/DL research today to AGI
Before I really understood and worked with NNs, I felt the same way. I thought the atomspace computation approach and other similar granular computation paradigms were much more likely to make progress.
However, after watching my three kids learn through the infant -> toddler ages and seeing the striking similarities to how we build our convolutional neural nets at my company, it was like a light went on.
If you look at how relatively sparse and weak even the best deep nets are compared to human brains, especially considering their really narrow set of inputs, we are at the very early beginnings of mimicking the complexity of the human brain. It seems to me that the ANN approach is right; we now need to make it radically more efficient and give it better input sensors.
We need a nervous system for AGI (structured data acquisition) before the big brain tasks will be solved.
I think that when people talk about "AGI" what they often mean is artificial personality.
Sure, your NN learns facts and processes like your toddler learns facts and processes. Those are a tiny part of who your toddler is, though.
The essential component is their will. You don't have to set them up and feed them data. They don't sit quietly until you ask them to answer a question. Kids have distinct personalities from very early on, and demand input, and produce opinionated output (to put it mildly)--from day one.
Emotions are a huge part of that. But to my knowledge, we have less understanding of emotions, and spend less time trying to create them with computers, than conscious processes like "which picture has a car in it."
But there is evidence that if you take away a person's emotions, they have great trouble making decisions. They can consciously evaluate their options. They just struggle to pick one.
So how will AI research focused on replicating conscious thought result in AGI, if we don't know how to generate emotions? Is anyone even trying to do that?
My standard joke is that a lot of people are working to create a car that can drive itself, but who is investing to build a car that will tell its owner, "fuck off, I don't feel like driving today"?
But can a machine that always does exactly what it is told to do really be thought of as "intelligent" the way we think of human intelligence? Do smart people always do exactly what they are told?
What you call will is no different in my mind from any other thing we encode into a NN - it's just at a different level and depth.
Creating motivation in AI is an open area, and in fact is arguably the big hairy beast when it comes to the "Friendly AI" question or really the whole "General" part of it.
You do the same thing everyone else does in this debate, which is move the goalposts - we don't know how to build "emotions", we don't know how to build motivation - until we do, or until it turns out to be an emergent property of a sufficiently deep net.
Too many other strawmen in there to argue, e.g. the idea that we will always need to tell them what to do.
The point I am making is that because the reinforcement nature of biological systems is mimicked in the basic ANN structure, it's the strongest candidate (at scale) for the building blocks of an AGI.
At a guess there are two major schools of thought here. The first thinks that emotions, will, personality etc are much more complex than the way we think of neural nets today. The other thinks that what we are seeing is already much more like the brain than we were expecting and if we continue down this path we may discover those things are emergent aspects of much simpler behaviours at a small scale.
My slightly optimistic money is on the latter one.
I interpreted the point of the article as being that we tend to focus too much on AGI and not enough on how disruptive "narrow" AI may be. The only thing I disagree with there is that I think lots of people focus a lot on the potential issues of broadening use cases for ML. But I agree that AGI is mostly just a distraction to that discussion.
> I fail to see how a machine that "designs its own algorithms" according to an algorithm designed and implemented by a human is fundamentally different than an algorithm coded directly by a human
Human creativity can be empowered by optimization algorithms. It's a huge improvement over design by hand.
Not sure exactly why this was posted today, since spaCy has been around for at least a couple of years - but spaCy is a great tool, and I have a ton of respect for Matthew Honnibal, the main developer.
Coincidentally, I wrote a blog post [1] that went up just this morning that, in part, compares spaCy with the other giant in the Python NLP ecosystem, NLTK. TLDR - I think that, right now, the majority of users are better served by spaCy than NLTK.
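For anyone who hasn't tried it, a minimal sketch of what "batteries included" means in spaCy - one load call gives you tokenization, POS tags, dependencies, and entities (assumes the small English model has been installed via python -m spacy download en_core_web_sm):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Automated Insights generates earnings articles for the AP.")

    for token in doc:
        print(token.text, token.pos_, token.dep_)  # one pass, no extra setup

    for ent in doc.ents:
        print(ent.text, ent.label_)  # named entities come for free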
Could you talk more about your phone screens? What criteria are you filtering on? What sorts of questions have you found to be effective at stratifying candidates?
I'm mostly looking for people who want to learn about engineering with largeish datasets (~100 GB/day for us), and have some of the prerequisite skills. Our codebase is mostly in Spark/Scala and uses functional programming idioms, so I'm looking for people who either know or want to learn how to use those. I'm also specifically trying to filter out people who mostly want a stats-heavy, machine-learning-heavy job, since that's not what we do.
An engineer who wants to learn data science is a great fit for us, an academic who wants to write R all day is not (though an academic who wants to learn engineering/functional programming is fine!)
Beyond that, I ask some questions about projects they've worked on, and in particular, how their approach would change if assumptions were different. Here I'm looking for the ability to reason backwards from a business goal, as opposed to somewhat blindly applying statistical techniques.
If they do well on these, we send the take-home exam. As previously noted, this is specifically designed to require relatively little knowledge but heavily test analysis skills, and lightly test programming skills. It's almost impossible to complete this exam without using Google effectively, so that's another thing I'm testing.
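For a flavor of what that codebase style looks like, here's a sketch in PySpark rather than our actual Scala (same functional idioms - a chain of pure transformations over an immutable RDD, no shared mutable state; the file path and record format are made up):

    from pyspark import SparkContext

    sc = SparkContext(appName="etl-sketch")

    totals = (
        sc.textFile("hdfs:///logs/2017-05-01/*.tsv")   # hypothetical input
          .map(lambda line: line.split("\t"))
          .filter(lambda fields: len(fields) == 3)     # drop malformed rows
          .map(lambda f: (f[0], int(f[2])))            # (user_id, bytes)
          .reduceByKey(lambda a, b: a + b)             # total bytes per user
    )

    totals.saveAsTextFile("hdfs:///out/bytes-per-user")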
Can I ask why functional programming in particular? I can see why you might want to avoid Java for big data, but isn't the average ML algo more in the procedural mould?
Wouldn't Python with NumPy be a better fit? Or Fortran with some handwaved interface code?
Back in the early '80s when I did map-reduce, we used PL1/G.
Have you had performance issues getting things to conform to functional paradigms?
For example, I've found that as a pipeline gets optimized for production use, it needs to preallocate all of its output space and then modify it in place at each step (like a one-hot encoder flipping a few bits in specific rows of a zeroed array instead of allocating new arrays and copying them in).
I find it difficult to reconcile this sort of code with a "pure functions without side effects" philosophy and still have it perform at an acceptable level.
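Concretely, in NumPy terms (toy sizes):

    import numpy as np

    labels = np.array([2, 0, 1, 2])
    n_classes = 3

    # "Pure" style: np.eye allocates a fresh array on every call.
    one_hot_pure = np.eye(n_classes)[labels]

    # Production style: preallocate the output buffer once, then
    # mutate it in place by flipping the relevant bits.
    one_hot_inplace = np.zeros((labels.size, n_classes))
    one_hot_inplace[np.arange(labels.size), labels] = 1.0

    assert (one_hot_pure == one_hot_inplace).all()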
We're mostly doing ETL on large datasets, so the code needs to parallelize well, but beyond that performance isn't really a big concern. We use ML in research, but no models in production, because the costs of increased maintenance/lost transparency generally outweigh the benefits in our use case.
In jobs that were heavy on ML, I would use high-performance tools for the models (imperative code, numeric computing packages etc.) and functional code for the ETL, which worked pretty well–no need to be dogmatic about it, a 70% pure codebase is still generally easier to reason about than a 20% pure codebase.
Functional programming maps more easily to mathematical notation for a lot of numerical computing. However, Scala is usually a worse choice than Java for numerical computing since everything is a boxed type.
This is straight up false. Why do you think Scala doesn't have primitive values? Long will be either a value or reference type as needed, despite being spelled only one way instead of two different ways in Java.
A 'primitive type' is one which can be directly operated on by intrinsic CPU instructions. My understanding of Scala was that all values (such as Long, Int...) are encapsulated inside an object.
Therefore an array of boxed types will not be memory-aligned, and vector instructions (which are very important to scientific computing) cannot be used.
Perhaps something has changed in Scala land since I last looked(??).
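The underlying issue is easy to demonstrate by analogy in Python, where the same boxed-vs-unboxed split exists: a NumPy float64 array is one contiguous block of machine values (so it can be handed to vectorized kernels), while a plain list holds pointers to boxed objects scattered across the heap, much like an array of java.lang.Long.

    import numpy as np
    import timeit

    xs_unboxed = np.arange(1_000_000, dtype=np.float64)  # contiguous, aligned
    xs_boxed = list(range(1_000_000))                    # pointers to boxed ints

    print(timeit.timeit(lambda: xs_unboxed.sum(), number=10))  # vectorized
    print(timeit.timeit(lambda: sum(xs_boxed), number=10))     # much slower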
Call it "the kaggle effect" - once someone defines the problem and the metric you'll be graded on and gives you a relatively clean dataset, "solving the problem" is just as simple as importing xgboost and plowing away. But there is often an under-appreciation among people without much job experience how hard it is to get to that point. The OP article touched on it a bit, but really, the most difficult job a data scientist has is defining what problem they're trying to solve and getting buy-in from other business stakeholders. And frankly, no data science masters program or boot camp can teach those skills.