Quick Ruby script to write file types report
April 27th, 2011 by walter
Today I had to evaluate whether it was worth it to use Kete’s bulk import facility to migrate an existing site’s content to Kete or whether to just have someone drag the content over page by page.
To figure this out, I wanted to know roughly how many pages and other files were on the site. I knew that it wasn’t going to be absolutely massive, so I started by grabbing all of the site’s public content with wget (details from http://linuxreviews.org/quicktips/wget/):
wget -p -r --wait=20 --limit-rate=20K -U Mozilla http://the_site/
I let that run in the background while I did other work.
When it finished, I wrote a little Ruby script (ugly, unDRY, but it took < 5 minutes) to give me a report by file type. I called it file_report.rb and based it on a skeleton grabbed from http://blogs.sourceallies.com/2009/12/word-counts-example-in-ruby-and-scala/:
require 'yaml'

# Change rootDir to the location of downloaded site
rootDir = "/path/to/roodDir/for/entire/downloaded/site"
raise rootDir + " does not exist" unless File.directory? rootDir

# recursively add files and directories to report_hash based on their type
def files(rootDir, report_hash)
  report_hash['directories'] = report_hash['directories'] || Array.new
  Dir.foreach(rootDir) do |dir|
    if dir != "." && dir != ".."
      dir_path = rootDir + "/" + dir
      if File.directory?(dir_path)
        puts "Processing " + dir
        report_hash['directories'] = report_hash['directories'] << dir_path
        Dir.foreach(dir_path) do |file|
          if file != "." && file != ".."
            file_path = rootDir + "/" + dir + "/" + file
            if File.directory?(file_path)
              report_hash['directories'] = report_hash['directories'] << file_path
              files(file_path, report_hash)
            else
              # add path to report_hash's entry for the file extension
              extension = File.extname(file).sub('.', '')
              report_hash[extension] = report_hash[extension] || Array.new
              report_hash[extension] = report_hash[extension] << file_path
            end
          end
        end
      else
        # add path to report_hash's entry for the file extension
        extension = File.extname(dir).sub('.', '')
        report_hash[extension] = report_hash[extension] || Array.new
        report_hash[extension] = report_hash[extension] << dir_path
      end
    end
  end
end

t1 = Time.now
report_hash = Hash.new
files(rootDir, report_hash)

puts "File type counts: "
report_hash.each do |k, v|
  puts "#{k} : #{v.size}"
end

puts "Writing complete report"
File.open('report.yml', "w") do |f|
  f.write(report_hash.to_yaml)
end

t2 = Time.now
puts "Finished in " + (t2 - t1).to_s + " seconds"
Finally, I called it with:
ruby file_report.rb
Nothing flash, but handy to have when you need it. I saved myself a lot of clicking around on their website.
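For comparison, here is a rough sketch of a DRYer take on the same report. It is not the script used above: the `file_report` method name and the command-line argument handling are made up for illustration, and it swaps the nested `Dir.foreach` loops for a single recursive `Dir.glob`. It should group paths the same way (by extension, with directories under their own key), though I have only reasoned that through on simple directory trees.

```ruby
require 'yaml'

# Hypothetical helper (not part of the original script): returns a hash that
# maps each file extension to the paths carrying it, with every directory
# collected under the 'directories' key.
def file_report(root_dir)
  report = { 'directories' => [] }
  # '**/*' walks the tree recursively; FNM_DOTMATCH includes dotfiles too.
  Dir.glob(File.join(root_dir, '**', '*'), File::FNM_DOTMATCH).sort.each do |path|
    # Some Ruby versions also return '.' entries with FNM_DOTMATCH; skip them.
    next if ['.', '..'].include?(File.basename(path))

    key = File.directory?(path) ? 'directories' : File.extname(path).sub('.', '')
    (report[key] ||= []) << path
  end
  report
end

# Only runs when called as a script with a directory argument (an assumption
# made for this sketch; the original hard-codes rootDir instead).
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  root_dir = ARGV[0]
  raise root_dir + " does not exist" unless File.directory?(root_dir)

  report = file_report(root_dir)
  puts "File type counts: "
  # Largest groups first, so the dominant file types are easy to spot.
  report.sort_by { |_, paths| -paths.size }.each do |ext, paths|
    puts "#{ext} : #{paths.size}"
  end
  File.open('report.yml', 'w') { |f| f.write(report.to_yaml) }
end
```

Saved under some name of your choosing, it would be called with the download directory as its only argument. Files with no extension end up under an empty-string key, as they do in the original.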