CSV as a Data Source (chartio.com)
74 points by thingsilearned on Aug 1, 2013 | 52 comments


MySQL has supported CSV as a data source for quite some time: http://dev.mysql.com/doc/refman/5.6/en/csv-storage-engine.ht...

You can run full SQL queries directly against a text file as if it were a table.
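For illustration, a minimal Python sketch of the same idea, assuming the pymysql driver and a throwaway local database (both are my assumptions, not anything from the MySQL docs above):

  # Minimal sketch, assuming the pymysql driver and a local test database.
  # The CSV engine keeps the table's rows in a plain .CSV file under the
  # data directory, so ordinary SQL runs against a text file.
  import pymysql

  conn = pymysql.connect(host="localhost", user="root", password="", database="test")
  with conn.cursor() as cur:
      # CSV tables can't have indexes, and every column must be NOT NULL.
      cur.execute("""
          CREATE TABLE visits (
              day  VARCHAR(10) NOT NULL,
              hits INT NOT NULL
          ) ENGINE=CSV
      """)
      cur.execute("INSERT INTO visits VALUES ('2013-08-01', 42)")
      cur.execute("SELECT day, SUM(hits) FROM visits GROUP BY day")
      print(cur.fetchall())
  conn.commit()
  conn.close()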


Yeah, and that's how we used to direct customers who wanted to upload CSV data. We even wrote some nice tutorials on how to set it up:

https://chartio.com/education/databases/excel-to-mysql

But we found that many people either aren't technical enough, or didn't want to go through the hassle of setting up a MySQL instance, defining a schema, and cleaning the data. So now we do that for them.


Good call. I'm a financial analyst who is becoming increasingly comfortable with code in general, but I can tell you 80-90% of business users lose interest as soon as anything SQL comes up. Heck, 70% of business users chicken out if an Access db is involved.

So this should really get a much, much bigger audience to use chartio.


Perl's DBI also has a driver for CSV as the data source:

https://metacpan.org/module/DBD::CSV


I haven't used it in ages, but Log Parser -- a fantastic old Microsoft skunkworks project -- lets you run full SQL directly against CSV files, Apache logs, the Windows Registry, Active Directory, file paths...

http://www.microsoft.com/en-us/download/details.aspx?id=2465...

Windows only, natch.


Microsoft SQL Server has a couple of methods for querying CSV/XLS(X) files (I remember at least OPENROWSET and BULK INSERT), but sometimes it's a bit tricky to use/configure (you need the right OLEDB driver, INI files for configuration, etc.), or the behaviour is not very consistent or well documented.

Edit: of course, with BULK INSERT you don't directly query the file; you must load it into a (temp) table.


Data isn't "locked up" in CSV, rather it is available in CSV.

CSV is one of the most open things possible.

As a programmer I do a lot of one-time makeshift data reports for other people, and I always use CSV (or, more precisely, tab-separated) because that's what every program happily emits and consumes. If it doesn't, it's trivial to transform thanks to UNIX sort, awk and uniq.


Once your CSV (or TSV) files start having quoted fields, they become very tricky to parse using standard multi-purpose tools like sort, awk, & uniq.

It's hard enough when you have delimiters in quoted fields, but dealing with quoted newlines starts to become unreasonable, especially for line-based tools.

CSV files, as you say, are absolutely wonderful to create. Problems come up when you try to parse files other people write. Not everyone follows RFC 4180.
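For what it's worth, a proper CSV parser copes with both cases. A small Python sketch (illustrative only) showing the csv module reading a record whose field contains a comma and a quoted newline, exactly the input that breaks line-oriented tools:

  import csv, io

  # One logical record spread over two physical lines: the second field
  # contains a comma and a quoted newline, which defeats line-based tools.
  data = 'id,comment\n1,"hello, world\nsecond line"\n'

  for row in csv.reader(io.StringIO(data)):
      print(row)
  # ['id', 'comment']
  # ['1', 'hello, world\nsecond line']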


> Problems come up when you try to parse files other people write. Not everyone follows RFC 4180.

Plus you've got encodings. If you're accepting CSVs from users, they'll generally come from Excel, which will produce different encodings in different circumstances.
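A common defensive pattern in Python (a sketch; the encoding list and its order are assumptions about typical Excel output, not a complete rule) is to try a BOM-aware UTF-8 decode first and fall back to a Windows codepage:

  def read_csv_text(path):
      # Try the encodings Excel most often produces; latin-1 is a
      # never-fails last resort, so the function always returns text.
      for enc in ("utf-8-sig", "cp1252", "latin-1"):
          try:
              with open(path, encoding=enc) as f:
                  return f.read()
          except UnicodeDecodeError:
              continue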


With tab separated values this is not a big problem in practice. On the other hand, you can sort tsv, but you can't sort quoted csv.


Why not? I ask in earnest.


sort -n gets confused by quotes.


If I were to build an MVP for any type of data charting, visualization, or analysis, the first data format I would think of supporting would be CSV.


You'd think so, but when the use case involves dynamically retrieving fresh data, it becomes one of the last things to work on. Lots of people have CSV data and it makes sense to support it, but since it's generally a manual process, it doesn't deliver a good "flow" for the customer. If they have to repeatedly export/import their data manually to have the chart data update, it'll seem like a chore.

Compare this to connecting to a customer's database directly. You can refresh your (cached) chart data on your own schedule and the customer doesn't need to be involved at all.

Oh and if anyone is thinking, "Yeah but I can schedule a cron job to extract the CSV file and publish it every X minutes", yes you could do that but again that's work for the customer. If the pitch is "sign up, plug in to your database, and boom charts!" there really should be a "schedule automated cron job" component.

All that aside, it is useful to be able to manually add data like this. Particularly for static data where the manual work is infrequent.


OK, here's one to do with "just" sort, awk and uniq ;)

http://naa.gov.au/naaresources/govhack-2013/PassengersArriva...


CSV is the biggest pile of nuisance you'd never expect from a seemingly simple data format.

- Header line or none?

- "\n" or "\r\n"?

- Is there a newline at the last line? How about two?

- Escape quotes with doubling or backslash? How about both in the same file? How about both, inconsistently, in different fields? How about quotes including a newline and commas?

- Strings always quoted? Only if necessary? Is ,, a null or an empty string, or an error?

- How about mixed line lengths? Are missing trailing entries nulls? How about multiple data types in a file, with the first field being type, and line length only fixed per type?

I have generally found "TSV with a rule that data cannot represent tabs or newlines, period" to be vastly superior.
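Under that rule a writer and reader stay trivial. A Python sketch (replacing the banned characters with spaces is my arbitrary choice):

  def to_strict_tsv_row(values):
      # Enforce the rule: a field may never contain a tab or newline.
      # Swapping them for spaces is an arbitrary choice for this sketch.
      clean = (str(v).replace("\t", " ").replace("\r", " ").replace("\n", " ")
               for v in values)
      return "\t".join(clean)

  line = to_strict_tsv_row(["2013-08-01", "hello, world", 42])
  # Reading back needs no quoting rules at all:
  fields = line.split("\t")   # ['2013-08-01', 'hello, world', '42']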


Chartio actually will accept TSV files because it will detect most of these settings. It'll detect the delimiter, the presence of a header line, file encoding, and newline character(s).
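For the delimiter and header parts of that detection, Python's csv.Sniffer gives a rough idea of how it can work (purely illustrative; this is not Chartio's implementation, and the filename is made up):

  import csv

  # "upload.txt" is a made-up filename for this sketch.
  sample = open("upload.txt", newline="").read(4096)
  dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
  has_header = csv.Sniffer().has_header(sample)
  print(dialect.delimiter, has_header)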

I agree TSV is a lot nicer, and has the bonus that Excel will open a TSV file with an .xls extension without any problems (great for sharing!).


Back in the '80s, TSV was the standard way to work with data for us - and for many years Lotus, Quattro Pro and Excel (when it came out) could easily save directly to TSV and open TSV without problems.

Don't know why Excel and Windows no longer recognize the TSV extension as a unique file format, but it is easily fixed without having to go through the .xls extension route [which can be a bit of a pain since it requires identifying delimiters every time one opens a file].

The quickest solution I found is a 2 line batch file for Windows described here [1]. I've used this solution without issues on multiple computers. TSV is my preferred file format for data work. [I generally analyze data in Python and R and use Excel for looking at results or formatting a pretty version to send to others that prefer Excel.]

[1] http://social.technet.microsoft.com/Forums/office/en-US/1890...


TSV has all those same issues. However, you just have less frequent need of a tab, so most of the escaping edge cases never come up.

Those aspects are defined in RFC 4180 - just a lot of systems don't bother. How would you define a simpler data format?


It's a lot easier to administratively ban tabs/newlines in your data than it is to ban commas, and TSV doesn't have escape mechanisms or quoting. So you actually can parse it with line.chomp().split("\t"), and that doesn't break horrendously.

TSV is streamable and minimally wasteful, I rather approve of it. Netstrings are better though if having sized data and nested data is needed. They are proof against all the ills of quoting and escapes.
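A netstring is just the payload's byte length, a colon, the payload, and a trailing comma, so a minimal Python sketch of encode/decode looks like this:

  def netstring_encode(payload: bytes) -> bytes:
      # "<length>:<payload>,"  -- no quoting or escaping is ever needed,
      # because the length says exactly where the payload ends.
      return b"%d:%s," % (len(payload), payload)

  def netstring_decode(data: bytes) -> bytes:
      length, _, rest = data.partition(b":")
      n = int(length)
      payload, trailer = rest[:n], rest[n:n + 1]
      if trailer != b",":
          raise ValueError("malformed netstring")
      return payload

  msg = netstring_encode(b'a field with\ttabs, "quotes"\nand newlines')
  print(netstring_decode(msg))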


Largely agree. I was just picking up on "vastly superior". All the issues of line breaks, nulls, column lengths, headers still remain.

You're adding the constraint that you can't use tabs or newlines in your data (to use a newline in your String, you'd need to escape it). In all other cases, you need escaping, and once you've assumed escaping then CSV and TSV aren't really any different.


Also, I should have known there was an RFC for CSV. Too bad that the main call for CSV data is from cranky old versions of Excel in the hands of non-technical customers.


Oh God, line endings created such havoc for me using Django's FileField model fields. I had to finally subclass it and do a replace of \r\n with \n to simulate opening in universal newline mode. Btw, that whole area in Django needs a good refactor.

I should probably package it up for submission upstream.


There is something called ARFF, which is basically a CSV with a header that defines how to interpret the data, plus a bit of documentation and a reference implementation.


Is it possible to set this up so that you can point it to a remote URL to fetch the CSV? For example, I'd love it if I could point it towards a Yahoo! Finance API call that returns a CSV file and have my dashboard auto-update. Or Dropbox URL. Or any remote URL. That would be awesome.


That exact use case is not currently possible, but it's something we've talked a bit about. We're still considering doing something like it in the future.


As someone who has worked in an industry where FTP'ing massive CSV files was the norm, I must say this looks really nice.


Thanks!


The biggest issue I have with csv files, is when the people sending them don't get the $%^&@ quotes right. The second biggest issue, is when they say "sorry, we're not able to do that" after I tell them exactly how to fix what they're sending.


Substringed JSON arrays work great for this as well. Basically you just convert each row to an array, convert the array to JSON, and then strip the leading and trailing brackets ("[" and "]"). Newlines will be properly represented as \n, as will other funky characters.
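In Python terms, a sketch of that idea (not anyone's production format) looks like:

  import json

  row = ["2013-08-01", 'a "quoted" value\nwith a newline', 42]

  # json.dumps escapes the newline as \n, so the result is one text line.
  line = json.dumps(row)[1:-1]              # strip the leading "[" and trailing "]"
  restored = json.loads("[" + line + "]")   # put the brackets back to parse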


I prefer DSV files for this reason: http://en.wikipedia.org/wiki/Delimiter-separated_values

I have a simple scheme where comma is the delimiter and , in content is escaped as \,. There are no quotes around values.
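A sketch of that scheme in Python (my own illustration of the escaping and splitting, not a reference implementation):

  def dsv_join(fields):
      # Escape backslashes first, then commas; no quotes anywhere.
      return ",".join(f.replace("\\", "\\\\").replace(",", "\\,") for f in fields)

  def dsv_split(line):
      fields, buf, i = [], [], 0
      while i < len(line):
          if line[i] == "\\" and i + 1 < len(line):
              buf.append(line[i + 1])       # unescape \, and \\
              i += 2
          elif line[i] == ",":
              fields.append("".join(buf))
              buf = []
              i += 1
          else:
              buf.append(line[i])
              i += 1
      fields.append("".join(buf))
      return fields

  print(dsv_split(dsv_join(["a,b", "c\\d", "plain"])))   # ['a,b', 'c\\d', 'plain']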


What is the proper way to encode the quotes? I'm playing around with writing a Livejournal data dump reader and will have to do the quote dance pretty soon. In LJ files they are encoded like so:

col1,col2,col3,"Longer data and somethin ""with"" quotes",col5


Yes, that's how they're supposed to be. Any field with quotes, or commas, or newlines in the data needs to have quotes around it. Any quotes in the data need to be escaped by preceding them with another quote.
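For example, Python's csv module, which implements the quote-doubling rule, reads that exact line as you'd hope:

  import csv, io

  line = 'col1,col2,col3,"Longer data and somethin ""with"" quotes",col5\n'
  print(next(csv.reader(io.StringIO(line))))
  # ['col1', 'col2', 'col3', 'Longer data and somethin "with" quotes', 'col5']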


From what I understand, you're basically right, but only in the sense that that's what most people do. There's no "CSV Data Format" spec. It's all just what most people agree on, most of the time. Unless someone has another idea.


text/csv is defined in RFC 4180: https://tools.ietf.org/html/rfc4180


Strictly, the definition in RFC 4180 mandates ASCII, which makes it unusable for many purposes. I guess there's nothing practical stopping you from using another encoding though.


  >      Common usage of CSV is US-ASCII, but other character sets defined
  >      by IANA for the "text" tree may be used in conjunction with the
  >      "charset" parameter.
http://www.rfc-editor.org/rfc/rfc6657.txt

https://www.iana.org/assignments/character-sets/character-se...


If you use SQL, check out http://github.com/dergachev/csv2sqlite

It's a little Ruby script that, in one command, takes one or several CSV files, parses their structure into simplistic SQLite table definitions, and then creates a new SQLite database file populated with the structure and data from those CSVs.
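The core of that approach fits in a few lines of Python; this is a sketch of the idea rather than the linked Ruby script, and it lazily types every column as TEXT:

  import csv, sqlite3, sys

  def csv_to_sqlite(csv_path, db_path, table):
      with open(csv_path, newline="") as f:
          reader = csv.reader(f)
          header = next(reader)
          cols = ", ".join('"%s" TEXT' % h for h in header)
          marks = ", ".join("?" * len(header))
          con = sqlite3.connect(db_path)
          con.execute('CREATE TABLE "%s" (%s)' % (table, cols))
          con.executemany('INSERT INTO "%s" VALUES (%s)' % (table, marks), reader)
          con.commit()
          con.close()

  csv_to_sqlite(sys.argv[1], sys.argv[2], "data")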


Hey neat. A csv2sqlite script is one of the first things I wrote in Python: https://github.com/elidickinson/csv-tools

It's a terrible hack, but I actually still use it pretty frequently.


That is really nice. I remember implementing something similar to steps 2 and 3 on App Engine a few years back for a project. It turned out quite nice, and non-tech-savvy users used it without much trouble. We managed to have something that could easily work with different CSV formats (MS Excel and others).


The type/error detection would be really cool paired with something like DataWrangler: http://vis.stanford.edu/wrangler/


Yeah, DataWrangler is an awesome project. There are actually a few great companies coming out with products based around it. We may just integrate with one of those.


Ooh, which companies? I haven't come across any yet.


Eg. http://www.trifacta.com/ founded by the guys who started DataWrangler :)


Thanks!


I'm rather fond of using csvkit [1] for doing my CSV wrangling. That and Google Refine (now OpenRefine).

[1] http://csvkit.readthedocs.org/


The CSV format does not define an encoding. Also, customers usually think of CSV as an export/import format from/to Microsoft Excel. Unfortunately, each Office localization uses a different output encoding and delimiter (and date format), and you cannot assume the encoding is UTF-8. The most complete CSV importer I have seen is in OpenOffice. It is also worth mentioning that Excel can import CSV with a non-system locale and different delimiters, but it just doesn't work through the open-file command.


This also detects the file encoding & delimiters.


Automatic file encoding detection is only heuristic and cannot in principle be exact. But I do not want to criticize the import functionality. I just wanted to say that CSV is an obsolete format and should not be used for data interchange.


This is really cool! Are there any open source libraries to do steps 2 and 3 - auto type detection and error correction/detection?


Not really. We utilize python-dateutil to help recognize the different date formats, but everything else we built ourselves.

http://labix.org/python-dateutil
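As a rough sketch of that kind of per-column guessing, with python-dateutil handling the date part (the rest is my own simplification, not Chartio's code):

  from dateutil import parser as dateparser

  def guess_type(values):
      # Try the narrowest type that every non-empty value satisfies.
      for cast, name in ((int, "integer"), (float, "float")):
          try:
              for v in values:
                  if v != "":
                      cast(v)
              return name
          except ValueError:
              continue
      try:
          for v in values:
              if v != "":
                  dateparser.parse(v)
          return "datetime"
      except (ValueError, OverflowError):
          return "text"

  print(guess_type(["1", "2", ""]))           # integer
  print(guess_type(["2013-08-01", "Aug 2"]))  # datetime
  print(guess_type(["hello", "42"]))          # text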


A vastly under-rated and under-appreciated art form, indeed.



