CSV file format

Why haven't you mastered your CSV yet? It's the UNIX way after all.

It takes three seconds to master CSV. Why is this even a thread?

If you think you've mastered CSV in 3 seconds you don't understand it at all. Please remedy your ignorance: cat-v.org/

I really don't understand what there is to CSV except commas and values. It's a fucking table that uses commas.

"it takes a genius to understand it's simplicity"

That's exactly why it takes 3 seconds to master it. It's very simple.

I learned more productive things, like how to proofread. Bitch.

UNIX way is p shit tbh

tools.ietf.org/html/rfc4180

You're not saying anything. You're fucking retarded.

Take your meme back to /g/ where you tried to force it first.

OP, I've never read such a quote about CSV from Ritchie. I searched and found nothing about it; point to the source.
The one regarding Unix simplicity is true, though.

Now, CSV is a fine format. I've been using it for quite some time and it's really easy to parse, even with simple tools like 'cut'. As long as there's no comma in the value itself.
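
To make it concrete, here's a minimal Python sketch (made-up data) of where a naive comma split, which is effectively what 'cut -d,' does, falls over, and what a real parser does instead:

import csv

line = '"Smith, John",42,"1,234.56"'

# Naive split: the quoted commas break the field count (5 fields, not 3)
print(line.split(','))
# ['"Smith', ' John"', '42', '"1', '234.56"']

# The stdlib csv module honors the quoting: 3 fields
print(next(csv.reader([line])))
# ['Smith, John', '42', '1,234.56']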


Thanks for that RFC, I didn't know about it.

is this a meme or legit?

...

he's right though, if you have commas in your CSV file you are fucked

this is why it takes a genius to understand its simplicity apparently

Because strangely most people are incapable of doing so. Do you have any idea how fucked up the majority of CSV files are? I downloaded 600GB of dox and oh god let me tell you. You have no idea how many rows there are with variable numbers of columns, or files where there is a different delimiter for each row, or data cells which contain the delimiter but where the delimiter isn't escaped and the cell isn't wrapped in a character that identifies it as a string. For fuck's sake, I have files where the row delimiter is CRLF for 99% of the rows and only a CR or an LF for the others.
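
For what it's worth, the mixed CR/LF/CRLF problem is at least mechanically fixable before parsing; the variable column counts aren't. A rough Python sketch (the filename is hypothetical):

import csv
from collections import Counter

# str.splitlines() treats CR, LF, and CRLF all as line breaks,
# so mixed row delimiters come out uniform.
with open("dump.csv", newline="") as f:
    rows = list(csv.reader(f.read().splitlines()))

# Flag rows whose column count disagrees with the majority
expected, _ = Counter(len(r) for r in rows).most_common(1)[0]
bad = [i for i, r in enumerate(rows) if len(r) != expected]
print(f"{len(bad)} of {len(rows)} rows have a suspect column count")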


You're actually supposed to enclose the cell in a string identifier such as quotes (or whatever character(s) you want)
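
That's what RFC 4180 specifies: fields containing the delimiter, quotes, or line breaks get wrapped in double quotes, and embedded quotes are doubled. Any decent library handles it for you; a quick Python sketch:

import csv
import sys

writer = csv.writer(sys.stdout)
writer.writerow(["plain", "has, comma", 'has "quotes"'])
# Output: plain,"has, comma","has ""quotes"""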

and for the other 1%

I got a CSV file with dates in 21-Mar-16 format. Is this normal? I want to kill myself.

Yes, and if you import it into SQL Server it's easy to change. You just define it as a date column and then use the CONVERT function to change it to any format you want.
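
Outside SQL Server it's a one-liner too; day-Mon-year matches standard strptime codes (note %b is locale-dependent, so this assumes an English locale). A Python sketch:

from datetime import datetime

parsed = datetime.strptime("21-Mar-16", "%d-%b-%y")  # %y pivots: 00-68 -> 20xx
print(parsed.date().isoformat())                     # 2016-03-21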

Why did people go with CSV instead of tab-delimited? Tabs look prettier in plaintext editors, and they also occur less frequently, so you don't need to worry as much about escaping.

Because Excel? SQL Server defaults to tabs.

Two-digit years are sub-standard but as long as they're consistent it should be possible to restructure them.


That's just moving the problem from escaping commas to escaping quotation marks. Also does CSV ignore whitespace before/after a comma? It'd be a pain to structure for readability without that.


Because when a particular column is empty, consecutive tabs or tabs at the end of a line aren't readable. Tabs also align irregularly when column widths vary.

Garbage in, garbage out. If the data isn't structured correctly, then your options are
1) dismiss your data and do nothing
2) deal with it by studying the data piece by piece

I always use tabs in my own pet projects for flat data files. Can't be bothered to worry about escaping.
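
And you lose nothing tooling-wise: most CSV libraries take the delimiter as a parameter, Python's csv module included. A minimal sketch:

import csv
import sys

rows = [["name", "score"], ["alice", "10"], ["bob", "7"]]
csv.writer(sys.stdout, delimiter="\t").writerows(rows)
# Reading back is symmetric: csv.reader(f, delimiter="\t")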


It's readable enough for me in vim. I just keep this
highlight ExtraWhitespace ctermbg=4
match ExtraWhitespace /\s\+$/
in .vimrc to see white space at the end of a line, and use :set list when I want to see the actual tab characters.

Most people will use a simple text editor on CSV. Even though some of them let you display trailing whitespace (nano does it too), it's still dangerous.

Another thing I don't like about CSV is that you cannot comment it. In the field I've been learning (surveying), that's actually helpful: you can remove bad measurements without deleting them (deleting them is bad practice), or just annotate data files. The usual output format allows this, but CSV is occasionally used for specific types of data, like control points.

All you need in that case is a boolean column that says whether the data is valid or should be discarded.

You're actually supposed to use characters which won't exist inside the data cells, not go escaping everything.

Depends on the program, but SQL Server Integration Services won't ignore it; you'd have to set the delimiter to comma + space.

That requires special handling by the tool. You could do some sort of pre-process script before feeding it to CAD but that's annoying.


That's the idea, which is why it might've been a better idea to use something else for such a generalized format.

Comma + whitespace (if any), I hope. Indentation is the whole reason you'd do it: so it's easier to read without needing a specialized tool.

Still, maybe | would have been a better choice.

XML has the speed of CSV with the readability of binary
Any thoughts on this?

You should fix your tool then, if the procedures say some entries have to be kept but not processed.

The script would be like 10 lines: drop all marked rows, generate a new file in a "preprocessed" directory. You can easily batch it too: collect all your data, preprocess once, then only run your tool on the filtered datasets.
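
Something like this, say, assuming the flag lives in the last column and marked rows say "invalid" (both conventions are made up here):

import csv
import os
import sys

os.makedirs("preprocessed", exist_ok=True)
for path in sys.argv[1:]:
    with open(path, newline="") as src:
        # keep only rows not flagged as invalid
        rows = [r for r in csv.reader(src) if r and r[-1] != "invalid"]
    out = os.path.join("preprocessed", os.path.basename(path))
    with open(out, "w", newline="") as dst:
        csv.writer(dst).writerows(rows)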


I don't know why people complain about XML. It's mediocre, not terrible. At least the tags make it easy to parse; I've never had issues with XML parsers, and at least it's very clear what the data is. There are also nice schema visualizers. It's not that hard to read either, especially if your editor isn't shit and can fold/beautify.

Of course it's space-inefficient due to being verbose, but if you have such large data that it matters you can just compress it.

Although, all else being equal, it seems much better to just use something like JSON instead of XML. It's more readable in a text editor, doesn't use as much space, and is more straightforward.

Also yes, I think a sensible designer would pick the lowest-frequency character that renders properly and is easy to type. Seems like the CSV authors just picked the first thing that came to mind without giving it a modicum of thought. Probably made sense to them because who would store anything except numbers and the odd alphanumeric-only string anyway, right?

...

Fuck off. XML is the ugliest fucking piece of shit.

It's horrendous, not terrible.

I made a JSON parser in C that's less than 500 lines long, and it posed little challenge thanks to a golden tool no Pajeet YouTube video ever teaches: unions. JSON and CSV are enough for everything; CSV is just as easy (if not easier) to parse than JSON, and lighter.

With all these unreadable tags, you don't know where the data is.

You never tried JSON.

You're not supposed to read or write XML data by hand. It's designed to be human-readable if there's ever a need to do so. It's just as easy to make things unreadable in JSON.

Fuckers send me unescaped commas and it's a mess to process that shit. What a great format, fuck you Ritchie.

XML and JSON are both good in different respects, and both have their own shortcomings, but neither is inherently better or worse than the other; it depends entirely on what you're trying to use them for.

Having said that, I don't like XML for simple API responses or configuration files because it's almost always overkill.

I'm more pointing out a weakness in CSV, which is that it doesn't have many provisions for readability (no comments, and indentation may or may not work depending on how strings are processed). That's no issue when it's used as a pipe between two programs, but it is a problem for something meant to be hand-edited.


Bottom line is the comma is too common.


XML is its own mess of reserved characters and escaping whenever they occur. No thank you.