Working with big CSVs can be a pain. You don't want to open Numbers or Excel because they are slow and limit the number of rows they can display. You need to be able to view and edit the data, and create shorter versions for testing. Sometimes there are formatting issues to fix. Read on for some quick and simple solutions to common problems.
Viewing Data
Reading top or bottom of file
$ head -n3 data.csv
or, to view the end of the file:
$ tail -n2 data.csv
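It also helps to know how many rows you're dealing with before deciding how to slice the file:
$ wc -l data.csv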
Looking at data in columns quickly
The above is handy, but the columns are all misaligned, and sometimes that's no good. It's better if the output is formatted.
Add the following to your ~/.zshrc file or equivalent.
function pretty_csv {
    perl -pe 's/((?<=,)|(?<=^)),/ ,/g;' "$@" | column -t -s, | less -F -S -X -K
}
And run:
pretty_csv data.csv
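Because the function passes "$@" straight to perl, which falls back to stdin when given no files, pretty_csv also works at the end of a pipe:
$ head -n100 data.csv | pretty_csv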
Ref http://stefaanlippens.net/pretty-csv.html
Creating Test Files
Sometimes you want to run something against a smaller set of the data, e.g. for local testing. You need to create an abbreviated version of the file:
head -n200 data.csv > ../testing/data.short.csv
And now you can open it with the Atom Tablr plugin: https://atom.io/packages/tablr
Tablr will open quite big CSVs, up to say 10MB on my machine. Ideally, create shorter test files of < 1000 rows to get a feel for the data in the columns.
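If the interesting rows aren't all at the top of the file, a random sample can give a better feel for the data. A rough sketch, assuming GNU shuf is available (macOS doesn't ship it; the coreutils Homebrew package provides it as gshuf):
(head -n1 data.csv && tail -n +2 data.csv | shuf -n 999) > ../testing/data.sample.csv
This keeps the header row and appends 999 randomly chosen data rows.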
Editing Headers
Opening massive CSV files in Mac apps like Atom doesn't work very well; the files are just too big. If you need to edit the headers (e.g. to match your db column names), open the file in Vim on the command line:
$ vim data.csv
You can get by in Vim with the following commands:
i - enter insert mode (Esc to leave it)
:w - write to file
:q - quit
:q! - quit without saving
:wq - save and quit
/pattern - search forward (?pattern searches backward)
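For example, to rename a header so it matches a schema attribute, run a substitute against line 1 only (the column names here are made up; use your own):
:1s/Cable Id/cable_id/
:wq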
Dealing with Problems
Checking file encoding
Sometimes there will be file encoding issues. Check the existing encoding like this:
$ file -I UgCable.csv
UgCable.csv: text/plain; charset=us-ascii
And then, if you need to change the file encoding, use iconv. This version of the command transliterates characters where it can (//TRANSLIT goes on the target encoding) and drops (-c) anything it can't convert:
iconv -c -f us-ascii -t utf-8//TRANSLIT data.csv > data.utf8.csv
And then check the differences to make sure it's OK:
diff data.csv data.utf8.csv
2850714c2850714
< 4001618,37488,2,"SP5946",0,459082.53,246820.37,198306,13,0,0,0,0,0,"¬",5090,3,1,1
---
> 4001618,37488,2,"SP5946",0,459082.53,246820.37,198306,13,0,0,0,0,0,"",5090,3,1,1
Importing Data in Rails
To import the data into the app (typically Rails for our projects), there are some things to remember to avoid blowing up your dev machine or server.
1. Download the file to disk for use
csv_body = Net::HTTP.get(URI(url))
file = Tempfile.new('data.csv')
file.write(csv_body)
file.flush # make sure the data is on disk before reading it back by path
2. Use CSV.foreach to stream through the rows one at a time, rather than trying to load all 6 million into a single variable
CSV.foreach(file.path, headers: true) do |row|
3. Save the hash of each row inside the loop to avoid memory build-up.
node = UgCable.new(row.to_hash)
node.save!
Remember to edit the headers (in Vim!) to match the attribute names in your schema, so that row.to_hash lines up with your model's attributes.
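Putting those steps together, here is a minimal sketch of the whole import. The URL is hypothetical and UgCable is just the example model from above, so swap in your own:
require 'csv'
require 'net/http'
require 'tempfile'

url = URI('https://example.com/UgCable.csv') # hypothetical source

# 1. Download the file to disk
file = Tempfile.new('data.csv')
file.write(Net::HTTP.get(url))
file.flush

# 2 & 3. Stream the rows, saving each one as we go
CSV.foreach(file.path, headers: true) do |row|
  node = UgCable.new(row.to_hash)
  node.save!
end

file.close
file.unlink # tidy up the temp file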
Outro
And that's a wrap: a whirlwind tour of how to view, edit, save, fix and import massive CSV files into your Rails app. Any quick tips or notes, please leave a comment.