Miklós Koren
Professor
I sometimes get excited by binary file formats for storing data. A couple of years ago it was HDF5. Now Apache Parquet looks pretty promising. But most of my data work, especially if I share it with others, is stored in just simple, plain text.
Be it CSV, JSON or YAML, I love that I can peek into the data real quick.
head -n100 data.csv
wc -l data.csv
are commands I use quite often. And nothing beats the human readability of a nice YAML document.
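For instance, here is how a small record might look in YAML (the record itself is made up, just to illustrate):

```yaml
# a hypothetical product record, purely for illustration
product:
  name: widget
  price: 9.99
  tags:
    - hardware
    - bestseller
```

Anyone can open this in a text editor and understand it at a glance, no schema or special tooling required.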
Sure, performance is sometimes an issue. If you are regularly reading and writing tens of millions of rows, you probably don’t want to use plain text. But in most of our use cases, a data product is read and written maybe a couple of times a day by its developer and then shared with several users who read it once or twice. It is more important to facilitate sharing and discovery than to save some bytes. And you can always zip or gzip. (Never rar or 7z or the like. Do you really expect me to install an app just to read your data?)
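Gzipping does not even break the quick-peek workflow: the same pipelines work on the compressed file without ever decompressing it to disk. A small sketch (the file name is made up):

```shell
# create a small CSV to play with (hypothetical file)
printf 'id,name\n1,alice\n2,bob\n' > data.csv

# compress in place; data.csv becomes data.csv.gz
gzip -f data.csv

# peek at the first rows, straight from the compressed file
gunzip -c data.csv.gz | head -n 2

# count rows the same way
gunzip -c data.csv.gz | wc -l
```

So the cost of compression is essentially one extra `gunzip -c` (or `zcat`) at the front of the pipe.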
Besides size (big) and speed (slow), there are three issues with CSV files:
So why am I a big fan of plain text data despite all these problems? I believe portability and ease of exploration beat a tight schema-conforming database any time. (Mind you, I am not working in a bank. Or health care.) See your data for what it is and play with it.