Thursday, November 29, 2007

Cheap Tricks XIII - Reading CSV Files using Header Keys

I read a lot of CSV files, most of which come with a one-line header naming the columns in the lines that follow. For example:
ID,TEXT,COST
1,Gumball,0.05
2,Pack of Gum,0.50
3,Pepsi,1.00
In Ruby there are a few ways to go about parsing this file. You can install a gem like fastercsv, but I prefer to parse it directly. You could also just skip the first line and assume a fixed column layout, but that's error-prone if anyone decides to add or remove a column. It's easy to build a header lookup from that first line, like so:
h = line.chomp.split(',')
headermap = Hash[*h.zip((0...h.length).to_a).flatten]
This splits the string on commas into an array, then uses the array's "zip" method to pair each element with its index (from the range 0 to length-1), and flattens the result, giving us an array that looks like this:
["ID",0,"TEXT",1,"COST",2]
Then we splat the array into Hash[], which pairs up the elements to create a map like this:
{"COST"=>2,"ID"=>0,"TEXT"=>1}
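(As an aside, the same lookup can be built with each_with_index, which some may find easier to read; a minimal sketch using the header line from above:)

```ruby
line = "ID,TEXT,COST\n"

# Same header-position map, built with each_with_index instead of zip/flatten.
headermap = {}
line.chomp.split(',').each_with_index { |name, i| headermap[name] = i }

p headermap #=> {"ID"=>0, "TEXT"=>1, "COST"=>2}
```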
Pretty nifty for two lines, eh? That way we can split each row on commas and access array positions by header name, rather than by magic number:
SEPARATOR = ','
data_ary = []
hm = nil
IO.foreach(filename) do |line|
  d = line.chomp.split(SEPARATOR)  # chomp, or the last column keeps its newline
  if hm.nil?
    # First line: build the header-position map.
    hm = Hash[*d.zip((0...d.length).to_a).flatten]
  else
    data_ary << {
      :id   => d[hm["ID"]].to_i,
      :text => d[hm["TEXT"]].to_s.downcase,
      :cost => d[hm["COST"]].to_f,
    }
  end
end

# An array of hashes
p data_ary #=> [{:id=>1, :text=>"gumball", :cost=>0.05}, ... ]
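If you do reach for a library instead, the CSV class that ships with modern Ruby (the descendant of fastercsv) will do the header bookkeeping for you; a sketch, assuming the data above and the same hash-of-symbols output:

```ruby
require 'csv'

csv_text = "ID,TEXT,COST\n1,Gumball,0.05\n2,Pack of Gum,0.50\n3,Pepsi,1.00\n"

# :headers => true consumes the first row and lets us index rows by column name.
data_ary = CSV.parse(csv_text, :headers => true).map do |row|
  {
    :id   => row["ID"].to_i,
    :text => row["TEXT"].downcase,
    :cost => row["COST"].to_f,
  }
end

p data_ary #=> [{:id=>1, :text=>"gumball", :cost=>0.05}, ... ]
```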

2 comments:

jimbojw said...

The problem with parsing CSV files directly is that it's not always as simple as the example in your article.

For example, fields that contain commas must be wrapped in quotes. And if a value contains a quote character, the field must be quoted and each embedded quote doubled.

So the value "hello there" (with quotes) becomes """hello there""" in proper CSV notation. Dealing with these oddities is (IMHO) the reason for using a third-party toolkit.

On the other hand, if you can guarantee that the data are all quoteless, then your solution is golden. ;)
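(A quick sketch of that failure mode, using a made-up quoted row: a naive split treats the comma inside the quoted field as a delimiter.)

```ruby
line = '2,"Pack, of Gum",0.50'

# The quoted field gets broken across two array slots.
p line.split(',') #=> ["2", "\"Pack", " of Gum\"", "0.50"]
```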

Eric said...

Yes. If you have this case, replace 'split' with a regexp 'scan' and 'collect'.
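A sketch of that suggestion, assuming no field contains doubled (escaped) quotes: the regexp matches either a whole quoted field or a run of non-comma characters, and collect strips the surrounding quotes.

```ruby
line = '2,"Pack, of Gum",0.50'

# Match a quoted field, or else a run of characters up to the next comma.
fields = line.scan(/"[^"]*"|[^,]+/).collect { |f| f.delete('"') }

p fields #=> ["2", "Pack, of Gum", "0.50"]
```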