More

setra · on Oct 30, 2018

If anyone wants to jump straight to the code: https://spatial-lang.org/hello-spatial/

setra · on Sept 17, 2018

TDLR: This is a content addressed data store similar to IPFS (although this project is older). You can configure one of several backends such as local file storage, S3, SSH, etc. It includes an organization system based on tags, and other meta data. You can construct a fuse filesystem representation based on a query. A web UI exists allowing exploration of existing files, uploading, etc.

setra · on March 27, 2018

A fatal accident at 50x the rate of a sober human driver with a study size where N = 1.

pgreenwood · on March 27, 2018

I'll have a go at seeing what we can conclude from the data. Others, check my thinking please. Now we have 1 death in 3m miles for Uber, versus 1.18 deaths in 100m miles for sober drivers.

The expected rate for 100m miles for Uber is 33.333...

But how confident can we be? To answer that let's compute a poisson confidence interval around that rate, as in https://stats.stackexchange.com/questions/10926/how-to-calcu....

Let's see what a 95% confidence interval for 1 death in 3m miles looks like:

  > poisson.test(1,conf.level = 0.95)$conf.int
  [1] 0.02531781 5.57164339
  attr(,"conf.level")
  [1] 0.95

Multiply that by 33.333 to convert to deaths per 100m miles:

  > 33.333333*0.02531781
  [1] 0.843927
  > 33.333333*5.57164339
  [1] 185.7214

So 95% confidence that the rate per 100m miles is from 0.84 to 185.72. That's pretty wide! And since the lower bound crosses 1.18, the difference is not significant at the .05 level (if we must make that particular comparison). However, let's look at 90% CI:

  > poisson.test(1,conf.level = 0.9)$conf.int
  [1] 0.05129329 4.74386452
  attr(,"conf.level")
  [1] 0.9

Which gives a CI of 1.71 to 158.13. So with 90% confidence we can say Uber is less safe than sober drivers. Ok.

Now let's look at 93% CI:

  > poisson.test(1,conf.level = 0.93)$conf.int
 [1] 0.03562718 5.17251332
 attr(,"conf.level")
 [1] 0.93

That gives a CI of 1.188 to 172.417. The lower bound being just a bit worse than sober drivers.

So we can conclude with 93% certainty from this data that Uber is less safe than sober drivers. Probably a LOT less safe. Although the CI is really wide, this is shocking data for Uber, in my opinion.

davidgould · on March 27, 2018

> 1.18 deaths in 100m miles for sober drivers

> Uber is less safe than sober drivers

But the 1.18 deaths in 100m miles is for all drivers, not just the subset of sober drivers. Not quite sure why you are claiming it is only sober drivers.

jakobegger · on March 27, 2018

Erm... I don’t think statistics work like this. You can‘t go and pick a confidence level that „confirms“ your desired outcome.

People with more knowledge about statistics than me might be able to explain why.

Slartie · on March 27, 2018

Statistics works exactly like this. What doesn't work is saying "Okay, we have one death in 3 million miles, that extrapolates to 33 deaths in 100 million miles", because it implies a silent addition of "with nearly 100% certainty", which is the part that's wrong here.

But the poster did something different. He took it one level further and attempted to calculate this confidence number for different spans in which the actual "deaths per 100 million miles" number of Uber's current cars would fall into, given an ideal world (from a data perspective) in which they would have driven an infinite amount of miles. But he actually did it the other way round - he modified the confidence variable and calculated the spans, and then he adjusted the confidence until he arrived at a span that would put Uber's cars just on par with human driving in the best case.

The fact that a fatal incident happens that early (at 3 million, and not closer or past the 86 million that a statistical human drives on average until a fatal incident occurs) does not allow us to extrapolate a sound number per 100 million miles, but it tells us something about the probability by which the actual number of fatalities by 100 million miles that we'd get if Uber continued testing just like it did and racked up enough miles (and killed people) for a statistically sound calculation will fall into different margins. Sure, Uber could have been just very, very unlucky - but that's pretty unlikely, and the unlikeliness of Uber's bad luck (and conversely the likeliness of the fact that Uber's tech is just systematically deadly) is precisely what can be calculated with this single incident.

nmca · on March 27, 2018

The statement "with 95% confidence" is a classic misinterpretation of what a CI is, the assumption of Poisson is dubious but there's no obvious plausible alternative. Overall seems reasonable.

pgreenwood · on March 27, 2018

Hello! I'd be interested to hear what you think the correct interpretation of these CIs are in this case. Failing that can you explain what is wrong with saying something like "with xx% confidence we can conclude that the rate is within these bounds" is?

The assumption of using Poisson seems pretty solid to me, given we are talking about x events in some continuum (miles traveled in this case), but always happy to hear any cogent objections.

muraiki · on March 27, 2018

The Poisson distribution assumes equal probability of events occurring. That seems to me to be an oversimplification, given that AV performance varies over time as changes are made, and also given that terrain / environment plays a huge factor here, whether looking at one particular vehicle or comparing to vehicles across companies (and drivers in general). Since AV performance will hopefully be improved when an accident occurs, we also cannot meet the assumption of independence between events. Although if AVs are simply temporarily stopped after an accident, that also breaks the independence assumption as we'd have a time period of zero accidents.

The bigger problem though is what you are doing with your confidence interval. A CI is a statement about replication. A 95% confidence level means that in 100 replications of the experiment using similar data, 5 of the generated CIs -- which will all have different endpoints -- will _not_ contain the population parameter, although IIRC this math is more complicated in practice, meaning that the error rate is actually higher. As such, if you generate a CI and multiply the endpoints by some constant, that's a complete violation of what is being expressed: there is vastly more data with 100m driving miles than 3m miles, which will cause the CI to shrink and the estimate of the parameter to become more accurate. There is absolutely no basis for multiplying the endpoints of a CI!

Ultimately, given that the size of the sample has an effect on CI width, you need to conduct an appropriate statistical test to compare the estimated parameters between the 1 in 3m deaths for Uber and whatever data generated the 1.18 in 100m deaths for sober drivers. There's a lot more that needs to be taken into account here than what a simple Poisson test can do.

For an analysis of how AVs with various safety levels perform in terms of lives saved over time, I recommend https://www.rand.org/blog/articles/2017/11/why-waiting-for-p...

Edit: Note the default values of the T and r parameters when you run poisson.test(1, conf.level = 0.95), and also that the p-value of the one-sample exact test you performed is 1. Also, since this is an exact test, the rate of rejecting true null hypotheses at 0.95 is 0.05, but given my reservations about the use of a Poisson distribution here, I don't think that using an exact Poisson test is appropriate.

muraiki · on March 28, 2018

To be more clear, when you run poisson.test(1, conf.level = 0.95) with the default values of T and r (which are both 1) you are performing the following two-sided hypothesis test:

Null hypothesis: The true rate of events is 1 (r) with a time base of 1 (T).

Alternative hypothesis: The true rate is not equal to 1.

The reason that you end up with a p-value of 1 is because you've said that you've observed 1 event in a time base of 1 with a hypothesized rate of 1. So given this data, of course the probability of observing a rate equal to or more extreme than 1 is 1! As such, you're not actually testing anything about the data that you claim you are testing.

I'm not trying to be harsh here, but please be careful when using statistics!

pgreenwood · on March 28, 2018

Ok I re-ran setting T properly for both cases. The results were similar:

> poisson.test(c(1, 11800), c(3, 1000000), alternative = c("two.sided"),conf.level = .93)

Comparison of Poisson rates

  data:  c(1, 11800) time base: c(3, 1e+06)
  count1 = 1, expected count1 = 0.035403, p-value = 0.03478
  alternative hypothesis: true rate ratio is not equal to 1
  93 percent confidence interval:
     1.006334 146.142032
  sample estimates:
  rate ratio 
    28.24859

The lower bound of the CI approaches a rate ratio = 1 for a 93% confidence interval.

Interestingly, if you multiply the CI I claimed before by the rate ratio instead of the expected rate, you get almost exactly the same CI as here.

  > ci <- c(0.03562718, 5.17251332)
  > 28.24859 * ci
  [1]   1.006418 146.116208

* Note 11800 is about two years of pedestrian deaths and time units are in millions of miles. https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/...

pgreenwood · on March 28, 2018

Facinating, thank you. Particularly the part about multiplying the CI. I wonder if the analysis could be resuced to some extent? I feel there must be a way to use the information we have do draw some conclusions, at least relative to some explicit assumptions.

coinjobber · on March 27, 2018

No. 3 million miles of observation. You can get a pretty exact and conservative estimate with a bayesian poisson process model. I don't have the time to run the numbers right now, but my guess is the posterior estimate that Uber's fatal accident rate is higher than a human's is >90%, even if taking the human accident rate as a starting prior.

jfoutz · on March 27, 2018

I thought Uber had to have a human take over every 13 miles.

It’s more like 10 miles of observation 300,000 times. Or rather an attentive human can be 50x better than average.

btrettel · on March 27, 2018

I'd be very interested in seeing the math if you have the time later.

gpm · on March 27, 2018

95% - erring on assuming Uber has driven more miles than they probably have.

https://news.ycombinator.com/item?id=16621118

comex · on March 27, 2018

Hmm; if I understand correctly, in that link you show that if Uber’s AI has the same risk of killing people as a human driver, then the prior probability of an accident occurring when it did or earlier was 5%. That’s significant, but it’s not the same measure as the probability that the AI has a higher risk (which would require a prior distribution).

tedsanders · on March 27, 2018

It's a reasonable gut feeling to not generalize from n=1, but the numerical evidence - with either a Bayesian or frequentist approach - is actually quite strong and statistically significant. Math here: https://news.ycombinator.com/item?id=16655081

YeGoblynQueenne · on March 27, 2018

That's not right. You're setting your expectation for N = 100m miles, then updating it for N = 3 million miles?

That's like saying: "I rolled this red d20 twenty times before I rolled a 1, whereas I rolled a 1 the first time on this blue d20, so the red d20 is obviously better and I'm rolling all my saves on it".

Or, I don't know- "I rolled three 1s on this d20 in twenty rolls so it's obviously not a fair d20".

tedsanders · on March 29, 2018

Can you clarify? What do you believe to be wrong and why?

If you have a strong prior the dice are equivalent, then yes, the rolls shouldn't change your mind.

If you have a prior that the dice are weighted in an unknown way, then yes, the rolls really should change your mind.

setra · on March 19, 2018

IPFS is not an automatically distributed file store. Just like torrents if someone wants to host a petabyte they can. That does not mean anyone else will be mirroring it.

setra · on March 4, 2018

This was written by a western venture capitalist Michael Moritz of Sequoia Capital. He Invested in Google, LinkedIn, PayPal. etc.

setra · on Feb 27, 2018

In this article the author states: "The latter definition is important for developers. It includes things like IP addresses, mobile device IDs, browser fingerprints, RFID tags, MAC addresses, cookies, telemetry, user account IDs, and any other form of system-generated data which identifies a natural person.". This information does NOT automatically qualify as personal data. Information being unique is not the same as personally identifiable. A random cookie sent by the browser is not PII. A cookie stored in conjunction with say an email address could be.

Certain information can be classified as PII if it possible to cross reference it with other stored information to identity a user. For example a European court in a recent ruling stated that a full IP address could be considered PII because an ISP would have a record of IP address and time with a persons name.

robin_reala · on Feb 27, 2018

Are you mixing up ‘personal data’ and ‘personally identifiable information’ (a US legal concept that differs from the EU definition of personal data)?

setra · on Feb 27, 2018

No, I am simply using shortened text not the USA PII legal concept. GDPR has many more restrictions than the USA concept of PII.

_o_ · on Feb 27, 2018

To me it seems quite simple, if the information can be used to identify user it is personal information and you need explanation why you need it and opt in. If this is a problem for you, maybe avoid collecting what you don't need. The idea of "collect everything and audio & canvas fingerprint them, maybe I will need it later" wont pass, you will never get consent. Collect only what you really need.

gnfurlong · on Feb 27, 2018

This doesn't make sense to me. Wouldn't that make every id that is one to one or one to many with a customer PII? That seems absurd.

A user alias on some random site would meet that criteria, assuming they took name/address/etc when you signed up.

Unless PII has some other significance than I'm interpretting it to have?

freeone3000 · on Feb 27, 2018

Right. So you'd have to have a business case for the user to have a persistent login, if you want to offer login functionality, beyond simply "track the user to see what they want". It's ridiculous.

mark_edward · on Feb 28, 2018

Sounds good to me.

Screw you data vampires.

x0x0 · on Feb 27, 2018

cookies is about the only thing in there that may not qualify as personal data as defined by the gdpr.

setra · on Feb 15, 2018

A unique random cookie is not PII automatically. It is only PII if it can be associated with something like a name (not limited to a name). Anonymous data with unique identities is not under the consent requirements of the GDPR.

setra · on Feb 15, 2018

The data collected is probably not fit to be classified as personal data. GDPR does not automatically forbid collection of anything involved with a user otherwise things like page popularity ranking would not be possible. There was a court ruling stating that IP addresses could be considered personal data because an ISP would have a log that associated the IP with a person. As long as the are only collecting things like "what packages are installed", "how powerful is the system", "when did it last update", and anonymize the last section of the IP they should be fine.

setra · on Feb 12, 2018

A computer can understand that an optimal solution is not always needed. There is nothing about the problem being NP complete that means the computer HAS to find the optimal solution.

setra · on Feb 5, 2018

To keep this in perspective bitcoin was under $1000 at the start of 2017. It is still a 700% increase since just that recent time.

vinchuco · on Feb 5, 2018

How many data points to drop to gain perspective?

aviv · on Feb 5, 2018

At the start of 2017 Joe the cab driver did not put his money in after hearing this on CNN.