Working with big CSVs can be a pain. You don't want to open Numbers or Excel because they are slow and limit the number of rows they can display. You need to be able to view and edit the data, and create shorter versions for testing. Sometimes there are formatting issues to fix. Read on for some quick and simple solutions to common problems.
Viewing Data
Reading top or bottom of file
$ head -n3 data.csv
or, to view the end of the file:
$ tail -n2 data.csv
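It also helps to know how many rows you're dealing with before deciding how to slice the file:
$ wc -l data.csv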
Looking at data in columns quickly
The above is handy, but the columns are all misaligned, and sometimes that's no good. It's better if the output is formatted.
Add the following to your ~/.zshrc file or equivalent.
function pretty_csv {
    perl -pe 's/((?<=,)|(?<=^)),/ ,/g;' "$@" | column -t -s, | less -F -S -X -K
}
And run:
pretty_csv data.csv
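Because the function passes "$@" straight to perl, which falls back to stdin when given no files, pretty_csv also works at the end of a pipe:
$ head -n100 data.csv | pretty_csv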
Ref http://stefaanlippens.net/pretty-csv.html
Creating Test Files
Sometimes you want to run something against a smaller set of the data, e.g. for local testing. You need to create an abbreviated version of the file:
head -n200 data.csv > ../testing/data.short.csv
And now you can open it with the Atom Tablr plugin: https://atom.io/packages/tablr
Tablr will open quite big CSVs, up to say 10MB on my machine. Ideally, create shorter test files of < 1000 rows to get a feel for the data in the columns.
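If the interesting rows aren't all at the top of the file, a random sample can give a better feel for the data. A rough sketch, assuming GNU shuf is available (macOS doesn't ship it; the coreutils Homebrew package provides it as gshuf):
(head -n1 data.csv && tail -n +2 data.csv | shuf -n 999) > ../testing/data.sample.csv
This keeps the header row and appends 999 randomly chosen data rows.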
Editing Headers
Opening massive CSV files in Mac apps like Atom doesn't work very well; the files are just too big. If you need to edit the headers (e.g. to match your db column names), open the file in Vim on the command line:
$ vim data.csv
You can get by in Vim with the following commands:
i - enter insert mode (Esc to leave it)
:w - write to file
:q - quit
:q! - quit without saving
:wq - save and quit
/pattern - search forward (?pattern searches backward)
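For example, to rename a header so it matches a schema attribute, run a substitute against line 1 only (the column names here are made up; use your own):
:1s/Cable Id/cable_id/
:wq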
Dealing with Problems
Checking file encoding
Sometimes there will be file encoding issues. Check the existing encoding like this:
$ file -I UgCable.csv
UgCable.csv: text/plain; charset=us-ascii
And then, if you need to change the file encoding, use iconv. This version of the command transliterates characters where it can (//TRANSLIT goes on the target encoding) and drops (-c) anything it can't convert:
iconv -c -f us-ascii -t utf-8//TRANSLIT data.csv > data.utf8.csv
And then check the differences to make sure it's OK:
diff data.csv data.utf8.csv
2850714c2850714
< 4001618,37488,2,"SP5946",0,459082.53,246820.37,198306,13,0,0,0,0,0,"¬",5090,3,1,1
---
> 4001618,37488,2,"SP5946",0,459082.53,246820.37,198306,13,0,0,0,0,0,"",5090,3,1,1
Importing Data in Rails
To import the data into the app (typically Rails for our projects), there are some things to remember to avoid blowing up your dev machine or server.
1. Download the file to disk for use
csv_body = Net::HTTP.get(URI(url))
file = Tempfile.new('data.csv')
file.write(csv_body)
file.flush # make sure the data is on disk before reading it back by path
2. Use CSV.foreach to stream through the rows one at a time, rather than trying to load all 6 million into a single variable
CSV.foreach(file.path, headers: true) do |row|
3. Save the hash of each row inside the loop to avoid memory build-up.
node = UgCable.new(row.to_hash)
node.save!
Remember to edit the headers (in Vim!) to match the attribute names in your schema, so that row.to_hash lines up with your model's attributes.
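Putting those steps together, here is a minimal sketch of the whole import. The URL is hypothetical and UgCable is just the example model from above, so swap in your own:
require 'csv'
require 'net/http'
require 'tempfile'

url = URI('https://example.com/UgCable.csv') # hypothetical source

# 1. Download the file to disk
file = Tempfile.new('data.csv')
file.write(Net::HTTP.get(url))
file.flush

# 2 & 3. Stream the rows, saving each one as we go
CSV.foreach(file.path, headers: true) do |row|
  node = UgCable.new(row.to_hash)
  node.save!
end

file.close
file.unlink # tidy up the temp file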
Outro
And that's a wrap: a whirlwind tour of how to view, edit, save, fix and import massive CSV files into your Rails app. Any quick tips or notes, please leave a comment.