Arthur Chang

Entrepreneur, Software Engineer, and Photographer
Mar 12

Refreshing thumbnails over large datasets with Paperclip gem

The City Below

In a SQL query, a SELECT that returns a very large result set (thousands of rows or more) can be way too slow or completely time out.  The old school way of getting around this is to use limits and offsets and do your operations in batches.  This not only leans out the queries, it also uses much less memory since you don't hold every row at once.

Rails' ActiveRecord has a great tool to help out: "find_in_batches" or simply "find_each."  The former yields each batch of records to your block as an array, while find_each yields the records to your block one at a time, still loading them batch by batch.
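To make the difference concrete, here is a minimal plain-Ruby sketch of the two styles.  It does not touch ActiveRecord at all: ROWS, each_batch and each_row are made-up stand-ins that only mimic how batches and single records get handed to a block.

```ruby
# Stand-in for a table: 2,500 "rows", each just a hash with an id.
ROWS = (1..2500).map { |id| { :id => id } }

# Yields successive batches, the way find_in_batches hands each batch to the block.
def each_batch(rows, batch_size = 1000)
  rows.each_slice(batch_size) { |batch| yield batch }
end

# Yields one row at a time while still walking the rows batch by batch, like find_each.
def each_row(rows, batch_size = 1000, &block)
  each_batch(rows, batch_size) { |batch| batch.each(&block) }
end

batch_sizes = []
each_batch(ROWS) { |batch| batch_sizes << batch.size }

row_count = 0
each_row(ROWS) { |_row| row_count += 1 }
```

With 2,500 rows and the default batch size of 1000, the first loop sees three batches and the second sees every row exactly once.  In a Rails app the equivalent call would look something like User.find_each(:batch_size => 1000) { |user| ... }, where User is a hypothetical model.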

The Paperclip gem by thoughtbot has a rake task that they suggest using for refreshing your thumbnails, and for generating new thumbnails for new styles.  These rake tasks both call

Paperclip.each_instance_with_attachment

In this method, it does a find(:all) to get every record in your model, then iterates through them to regenerate the thumbnails.  As you would expect, this times out and breaks on any dataset that's too big.

The fix I've put together replaces the find(:all) with a simple find_each

https://github.com/kineticac/paperclip/commit/71f4a60f17d3b42b73bb5255b7c8e4b...

This adds a little overhead when you have fewer than 1000 records (the default batch size), but in a rake task that extra overhead should be negligible.

I've sent a pull request to the maintainers of the paperclip gem.  Hopefully it goes through.  In the meantime, if you'd like to use this version, you can edit your Gemfile like so:

# gem 'paperclip', '~> 2.4'
gem 'paperclip', :git => "git://github.com/kineticac/paperclip.git" 

 

Top Photo: Taken by myself at Twin Peaks in San Francisco.

Aug 16

Caching in development is important

Almost a year ago, I wrote about how to override caching when developing here, and only turning caching on when testing.  As it turns out that might not be a great idea.  Too many times I have had strange bugs on production that I could never figure out locally due to caching issues.  Only after a few hours of debugging did I realize it could have to do with caching.

There's also a really big reason why nobody's written an easy way to turn off caching in development: it's bad for you to see different behavior in development vs. production at any time, especially with query and fragment caching.

The kind of caching you do want to turn off is class/controller caching for the sake of avoiding restarting your server just so it will pick up your new code.

config.action_controller.perform_caching = false
config.cache_classes = false

So run memcached, do your fragment caching in development, and you should be good to go.  No reason not to cache queries or view renders.

Mar 17

Setting up Rails with Redis, Resque and Resque-Scheduler on Dotcloud

frozen fingertips

 

I learned a ton over the past week getting feedtopic.com's somewhat unique setup hosted on Dotcloud.  There are a few small caveats that you really have to pay attention to.

Before I get started, however, I need to give big ups to the Dotcloud team.  They are a fellow Y Combinator 2010 batch company, with amazing skill and genius, as well as incredible support ethics and all around amazing personalities.  I consider these guys my friends more than just tech support and company founders.

Now let's get down to what our backend is like.

Rails 3

Part of feedtopic.com is built on the Rails 3 framework.  This deals with our database, complicated machine learning / natural language processing, and custom algorithms we built on top.

Redis

We have decided to use Redis to deal with our queueing system.  We never need to store crazily marshalled objects, and json strings work great.

Resque and Resque-Scheduler

Naturally with redis, we use the resque gem to handle jobs in the redis queue, as well as resque-scheduler to basically do our cron for us.  It's slick because you can load up a front end on a subpath with Rack::URLMap and see how your queue and schedules are doing from a nice clean interface.  We originally used Redis To Go, so our configuration needed to be adapted when we setup the Dotcloud redis.

Postgresql

I want to mention we also chose postgres as our database.  We really don't need any fancy nosql setups or anything at this time.

Dotcloud

We chose Dotcloud for the job of hosting everything.  They are currently in "beta" mode, which means everything isn't as beautifully polished as it could be yet, but the support they provide, like REAL live support, is better than any docs you can get.  Because there's a few things to polish up, I decided to write this blog post with a few things we figured out along the way.  

Heroku (I still love Heroku though)

Another honest reason for choosing Dotcloud is that running this on Heroku in its minimum "development" state would cost at least $72 a month.  The reason is, we need to run two workers on Heroku, and each one is about $36 a month.  That's a lot for two small workers (one for resque, and one for resque-scheduler).  Also it's a bit hacked up on Heroku, because you need to run two apps, one for the resque job, and one for the resque-scheduler job.  Heroku only allows you to rake jobs:work, and that can only be mapped to one rake task.  Thus you need to push a whole new app just to run the scheduler.  It works OK if you connect to the same Redis server, but the Redis server will again cost money (they use Redis To Go).  With Dotcloud, it's set up nicely and you can even set up your own Redis.  I definitely think Heroku will put something in place for this as well.  You could even get away with it if you use the cron add-on, but with a little less flexibility.

I also don't want to knock Heroku.  These guys provide amazing service as well.  We've been using them successfully with Fanvibe.com for a year and a half now.  It's this special redis/resque setup that really prompted us to go with Dotcloud.  Fanvibe.com not only powers our web property, it also powers the backend of our iPhone app and the API behind all of the NBA's (National Basketball Association) properties too (iPhone app, Android app, iPad app, nba.com), so you can imagine how awesome Heroku can be.  Our backend doesn't just serve up pages.  It cranks through live stats in real time (sub-second), uses heuristic algorithms to create prediction questions on the fly, as well as ending them, answering them, and awarding people when the predictions are over, and it notifies people about news, stats, live scores, and what friends are watching and saying.  Anyway, I'll save Fanvibe talk for another post.  In a nutshell, I spend very minimal time ever worrying about Fanvibe on Heroku, so this in no way means Heroku isn't a great service.  It really is a great service.  The team over there is also very responsive and they are great guys.  I have no doubt they'll polish out a solution for resque/resque-scheduler soon enough.

Rails 3 setup

Going through the usual Dotcloud documentation, it's pretty straightforward.  When you first "deploy" your ruby app, you'll no doubt have a SystemTimer gem in your Gemfile.  Why?  Because resque gem docs tell you to put it in.  SystemTimer apparently fixes a crucial bug in Ruby 1.8.  I've heard this no longer exists on Ruby 1.9.  SystemTimer won't work on Ruby 1.9 without this fix being merged in, so you have two options:

  • deploy with dotcloud with the configuration of ruby 1.8, which is called "ree" in the config parameter.  ie. -c '{"ruby-version": "ree"}'
  • deploy without the SystemTimer gem and just let the default ruby 1.9 deal with it (this is what I ended up doing).

Another big problem was that, for some reason you'll need a nginx config fix on Dotcloud for some virtualhost problem.  This part I will need Dotcloud guys to step in.  I'm pretty sure they'll update the docs about this soon.  But if you're getting 404's on pages that you know work locally or on something like Heroku, ping them about the nginx config file and the vhost stuff.  More specifically, Sam over at Dotcloud put this together for me.

Postgresql Setup

I have our database setup on Dotcloud, just because it's easy.  One thing to note is that the password Dotcloud provides is pretty crazy, so feel free to wrap the password in double quotes, otherwise I believe it screws with the yaml.  Here's ours:
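The original database.yml is not reproduced here.  As a minimal sketch of why the quotes matter, the snippet below parses a made-up password with and without them.  The password is hypothetical; the point is only that a space followed by # starts a comment in YAML, so part of an unquoted password can silently vanish.

```ruby
require 'yaml'

# Hypothetical password containing " #", which YAML treats as the start of a comment.
unquoted = YAML.load("password: s3cret #42")
quoted   = YAML.load("password: \"s3cret #42\"")

unquoted_password = unquoted["password"]  # everything from " #" onward is dropped
quoted_password   = quoted["password"]    # the full string survives
```

Other characters (a leading @ or `, or a colon followed by a space) can trip up the parser in similar ways, so quoting the value in database.yml is the safe default.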

Redis Setup

The documentation is spot on here.  Note that I had previously setup a Redis To Go hosted redis, so these are more caveats on how to adapt that to your own redis setup on dotcloud. To connect it with your rails app, and your resque workers, you'll need to know a few things.

  • ENV["REDIS_URL"] doesn't really work on Dotcloud yet, so avoid that.  I would use a config/redis.yml file and load that in.  We had used an environment variable set in our development.rb / production.rb files per the instructions of Redis To Go.
  • The password again, is a bit crazy.  If you previously used Redis To Go, you'll see that they parse the URI, and that won't work with the Dotcloud super safe password.
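A minimal sketch of the config/redis.yml approach could look like the following.  The YAML contents, the host name, the password and the redis_options helper are all made up for illustration, and the final Resque line is left as a comment because it assumes the redis and resque gems are loaded.

```ruby
require 'yaml'

# Hypothetical contents of config/redis.yml; host, port and password are made up.
REDIS_CONFIG = <<-CONFIG
production:
  host: "redis.example.com"
  port: 6379
  password: "p@ss:w/rd#1"
CONFIG

# Returns the connection options for one environment, with symbol keys.
# No URI parsing is involved, so odd characters in the password are harmless.
def redis_options(yaml_text, env)
  settings = YAML.load(yaml_text).fetch(env)
  { :host => settings["host"], :port => settings["port"], :password => settings["password"] }
end

options = redis_options(REDIS_CONFIG, "production")
# In an initializer, assuming the redis and resque gems are loaded:
#   Resque.redis = Redis.new(options)
```

In a real app you would read the file with File.read(Rails.root.join("config", "redis.yml")) and pass Rails.env instead of the hard-coded "production".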

Resque and Resque Scheduler Configuration

I'm going to group both of these here because they're related on how to set them up.  This part was a bit trickier, but we figured out a nice way of doing it.  I'm not going to go into how to get a working resque worker / scheduler working here, I'm going to assume you have it all working locally already.

 

supervisord.conf: This file is required at the root level of your app.  The problem here for a rails 3 app and the resque gems is that you need two different supervisord.conf files for different workers.  The different workers being a pure resque worker, and a resque-scheduler worker.  You can't put them both in the same file, but you also don't want to refactor the entire structure of the app to work in different directories on the rails app.  So we came up with a cool solution:

  • Each ruby-worker deployment has a unique $HOSTNAME variable.  It's basically whatever namespace.name you decided on when deploying.
  • Create two files, supervisord.conf_namespace.resque, and supervisord.conf_namespace.scheduler for example.
  • Make sure you also set the RAILS_ENV in the environment section of the config file.
  • Create a post install hook file: "postinstall" that creates a link to the correct supervisord config file, based on the hostname.  This will basically ensure the correct config file is used on a certain host!  FREAKIN SWEET

This is the resque worker's supervisord config file.  Notice we need to also set the rails environment to production.  For some reason it kept running in development for me otherwise.

This is the resque-scheduler's supervisord config file.  You can ignore the fact I have the environment setup there, the queue isn't used, for some reason I just still have it sitting there.

And lastly, our postinstall file.  This is basically making a link.  What a sweet hack.  Remember to chmod +x
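The original postinstall file is not reproduced here.  The same idea, sketched in Ruby under the assumption that the config files are named "supervisord.conf_" plus the hostname (as in the list above), might look like this.  link_supervisord_config is a made-up helper name, and the demonstration runs in a temporary directory with a hypothetical "namespace.resque" host.

```ruby
require 'fileutils'
require 'tmpdir'

# Points supervisord.conf at the host-specific config file.
# Assumes files are named "supervisord.conf_<hostname>".
def link_supervisord_config(app_dir, hostname)
  source = File.join(app_dir, "supervisord.conf_#{hostname}")
  raise ArgumentError, "no supervisord config for #{hostname}" unless File.exist?(source)
  target = File.join(app_dir, "supervisord.conf")
  FileUtils.ln_sf(source, target)  # replace any existing link
  target
end

# Demonstration in a temporary directory with two hypothetical config files.
linked_contents = Dir.mktmpdir do |dir|
  File.write(File.join(dir, "supervisord.conf_namespace.resque"), "[program:resque]\n")
  File.write(File.join(dir, "supervisord.conf_namespace.scheduler"), "[program:scheduler]\n")
  File.read(link_supervisord_config(dir, "namespace.resque"))
end
```

On the real host the call would be along the lines of link_supervisord_config(Dir.pwd, ENV.fetch("HOSTNAME")), so each worker ends up with its own supervisord.conf without any change to the app layout.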

We also added in a require in our Rakefile to include resque/tasks

That's it

Wow, ok that was a lot.  But it really seemed like a lot more when we were working on it.  The above were just the final conclusions we came to.  I'm pretty sure I've documented everything, but there is a possibility I left a few things out.  Again, these are very specific to our setup on Dotcloud, so what you have may vary.  Dotcloud guys might send me a few clarifications and I'll update as that comes in.

If you guys want anymore information about any of the topics, feel free to ask me in the comments or ping the Dotcloud team.

 

Photo: So I like posting photos I've taken with every post I make, regardless of how much sense it makes.  This was one I took of Issa when we visited Lake Tahoe.

Nov 17

Run cron jobs or any process easily with delayed_job

last one

 

The delayed_job gem has been around for a little while.  It's a great way to run cron jobs as well as any other background running process.  Some people think it can only be spawned once in awhile, triggered by some other action, but it's a lot more flexible.

At Fanvibe, I've built a huge array of delayed_job workers that are helping us with jobs such as fetching new sport stats every half second, then crunching all that data and figuring out who needs to see that data, as well as grabbing cool heuristics to help us create awesome polls with a mashup of user activity, player and team stats, time, and more.  This is all getting crunched in the background with delayed_job.

The trick is quick re-queueing of the job in an ensure block.  Unlike a message broker system (like RabbitMQ), this stores jobs in the database as one row with a marshalled string to hold things like parameters.  The reads and writes are fast enough for things other than chat messaging and real-time collaborative stuff (I'm talking about actual real time within 20ms).

It also has total access to the rails environment, so you can do things like access ActiveRecord etc.

Below is a quick gist of how you would run the delayed_job worker:

Also note that you'll have to kick it off manually the first time.  Just go into your console and call Delayed::Job.enqueue with your job.
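The original gist is not reproduced here, so the following is only a sketch of the re-queue-in-ensure pattern.  FakeQueue and PollStatsJob are made-up names; FakeQueue stands in for the jobs table so the example runs without Rails or delayed_job, and in a real app the ensure block would call Delayed::Job.enqueue with a run-at time instead.

```ruby
# Stand-in for the jobs table; in a Rails app this role is played by
# Delayed::Job.enqueue, which is not loaded in this sketch.
class FakeQueue
  attr_reader :jobs

  def initialize
    @jobs = []
  end

  def enqueue(job, run_at)
    @jobs << { :job => job, :run_at => run_at }
  end
end

# A job is any object that responds to #perform.  The Struct holds its parameters.
class PollStatsJob < Struct.new(:queue, :interval)
  def perform
    fetch_stats
  ensure
    # Runs whether fetch_stats succeeded or raised, so the loop never dies.
    queue.enqueue(PollStatsJob.new(queue, interval), Time.now + interval)
  end

  def fetch_stats
    # Placeholder for the real work; it fails here to show the ensure path.
    raise "stats feed unavailable"
  end
end

queue = FakeQueue.new
begin
  PollStatsJob.new(queue, 0.5).perform
rescue RuntimeError
  # The job failed, but the ensure block has already queued the next run.
end
```

Because the re-queue lives in ensure, a failed run still schedules its successor half a second later.  A real delayed_job job would not carry the queue around as a parameter; that is only there to keep the sketch self-contained.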

 

(photo taken in San Diego a little while ago)

Sep 14

Olark jQuery Hack to load after DOM

Clouds stars and pillars

 

Official Update from Olark: "hey folks, just posted up some new information about how we load Olark after all other parts of the page, hope this helps explain how we improved times. if you grab the latest code from www.olark.com/install you should be all set :)"

 

Olark has stalled my document ready event from firing one too many times!  Yes, they're pretty fast most of the time, but still, it's time I could shave off.

The problem
Olark gives you one code snippet to add to the bottom of your dom, right before the end body tag. It creates a div with some info in it, then loads an external js library and initializes olark.  The whole time it does that, your document ready bound events are still waiting.  If olark takes some time, anything you setup in the ready event handler will not actually work, including more event binds.  Most use cases for the document ready event handler are to attach more event handlers to the dom that's now rendered out.  Without the dom finishing, there's no way to attach these js event handlers.  Since Olark could potentially block these from binding, you're left with a limbo stage where your users aren't firing off the correct js events.

The Goal
Render the divs Olark requires at the end of the dom, but loading the external olark js and initializing should come absolutely last, even after all other js in other document ready event handlers.

The Tools
I'm using jQuery to do some nice javascripting here, you can probably find equivalent ways to do it with your own javascript library or you can write them by hand.  I'm also using the Rails framework, so you'll see me say partials and the gists also show them, etc.

First step
Dissect the snippet of code Olark gives you into parts (in this case I'm using partials in Rails found in app/shared/_olark.html.erb and app/shared/_olark_script.html.erb).  Then use jQuery's awesome getScript function to actually fetch the script, and on success call the initialization function.

Second step
Put the script part right before the end head tag, basically after any other javascript that runs in the head.  It's just a bit of text, so it won't slow anything down on initial page render.  For Rails users, I have it in my application.html.erb layout file, after all the javascript_include_tags renders, right before the end head tag.  The purpose here is to make sure this runs after all other document ready event handlers.  Olark is probably the lowest priority for rendering on your page.  See gist after the third step for the code example.

Third step
Put the div part right before the end of the body tag as usual.  It's not going to be slow to just put in one div and one a tag here.  Example below 

Done
That should basically be it.  Olark js and initialization will now happen after all your other document ready event handlers.

(photo taken a few weeks ago when I went on a spontaneous shoot in San Francisco.  It was really windy so the cloud movement was perfect)

May 12

Rails.cache override in development

My friend Calvin (IntoMobile Developer) and I were just talking about keeping your app from caching with Rails.cache in development mode.  One nice way to do it is to wrap all your Rails.cache calls in your application_controller.rb.  In this wrapping call, you can determine whether or not to execute the caching or to just let it through in development.  Another benefit is being able to change your caching behavior in the future without needing to mess around with every single line of code that you've written for caching.

In my development.rb file, I define the following:

I am using the memcache gem, rather than the more popular memcache-client gem, and running memcached on my development box as well (I like testing with it running, rather than using the built in rails way of storing it in memory).  The reason for using memcache gem is because I'm currently hosting our app on Heroku, which requires this gem to work in tandem with their cloud solution for memcached.

The CACHE constant is setup to use this memcache gem's caching mechanism, which would be equivalent to the built in Rails.cache.  Setting this constant up allows me to change this in my environment files once, without ever needing to worry about all the places I set it up.

Then there's the cache_store, which I set correctly to use Memcached gem.  This will help me store the cache correctly.

CACHE_DEVELOPMENT constant is for me to turn caching on / off for development mode.  You'll see how it works in the next piece of code below.

The last two lines that are commented out basically control action and class caching, which doesn't really affect whether Rails.cache goes through.  So even with these two things set to false, Rails.cache will still try to cache.  In order to keep this from happening, we have to do some logic in our application controller wrapper, here it is:

The cache_fetch method will be called every time I want a cache call from a controller.  I pass in the cache_key as the first parameter to basically name this cache result, and I also accept the period of time the cache is kept before it expires.  0 means it never expires, so that is the default value.

The conditional statement basically allows me to turn caching on or off.  If I'm in the development environment and the constant CACHE_DEVELOPMENT is set to false, I just let the block through without hitting any caching.  Otherwise it goes into the caching logic.

Check out that # CACHE.fetch(key, time_expire){yield} block.  This is how it used to work, but recently a few things changed so that I needed to do a get, and if that didn't exist I'd have to manually set it.  I'm not sure why this happened, or if it's still needed, but you can see how I was able to change this detail in one place, rather than everywhere I tried to cache.

The rest shows you how I'm caching.  This would be ugly if I had to do all of that every time I wanted to cache a result, and even worse if something changed.  This way I can control the development environment for testing caching, as well as keeping code as DRY as possible.
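The original gists are not reproduced here, so this is only a sketch of the wrapper described above, not the exact code.  MemoryStore is a made-up in-memory stand-in for the memcached client behind CACHE, APP_ENV stands in for Rails.env, and the extra env and enabled parameters exist only so the example can show both paths without Rails.

```ruby
# In-memory stand-in for the memcached client behind the CACHE constant.
class MemoryStore
  def initialize
    @data = {}
  end

  def get(key)
    @data[key]
  end

  def set(key, value, _expires_in = 0)
    @data[key] = value
  end
end

CACHE = MemoryStore.new
CACHE_DEVELOPMENT = false   # flip to true to exercise caching in development
APP_ENV = "development"     # stands in for Rails.env

# Returns the cached value for key, computing and storing it on a miss.
# When caching is switched off in development, the block always runs.
def cache_fetch(key, time_expire = 0, env = APP_ENV, enabled = CACHE_DEVELOPMENT)
  return yield if env == "development" && !enabled
  value = CACHE.get(key)
  if value.nil?
    value = yield
    CACHE.set(key, value, time_expire)
  end
  value
end

calls = 0
2.times { cache_fetch("top_entries") { calls += 1 } }                   # caching off
dev_calls = calls

calls = 0
2.times { cache_fetch("top_entries", 0, "production") { calls += 1 } }  # caching on
prod_calls = calls
```

With caching off the block runs on both calls; with caching on it runs once and the second call is served from the store.  A controller would call it as something like @entries = cache_fetch("top_entries", 300) { ... }, with the expensive query inside the block.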

Here's a simple example of how I call the cache_fetch action from my controller:

Hope that helps!

 

Apr 8

Using Paperclip to save an image or file from a url

Paperclip is really good at saving images from local repositories or through forms, but there's little documentation on how to save an image from a remote location, say from a public URL somewhere.

The example for this post used the Flickraw plugin to grab the most interesting photo from Flickr and save it as a user's favorite photo.
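The original snippet is not reproduced here.  One common way to do this, sketched below, is to wrap the downloaded bytes in an IO object that also reports a file name and content type, since (as far as I know) that is what Paperclip 2.x looks for when you assign an attachment.  attachable_io is a made-up helper, the URL and bytes are placeholders, and the assignment itself is left as a comment because it assumes a model with has_attached_file :photo.

```ruby
require 'stringio'
require 'uri'

# Wraps downloaded bytes in an IO that also reports a file name and content type,
# mimicking what an uploaded file from a form would provide.
def attachable_io(data, url, content_type)
  filename = File.basename(URI.parse(url).path)
  io = StringIO.new(data)
  io.define_singleton_method(:original_filename) { filename }
  io.define_singleton_method(:content_type) { content_type }
  io
end

# Hypothetical URL and bytes; a real call would download the data first,
# for example with Net::HTTP.get(URI.parse(url)).
io = attachable_io("fake image bytes", "http://example.com/photos/sunset.jpg", "image/jpeg")
# user.photo = io   # assumes a model with has_attached_file :photo
# user.save
```

The file name is taken from the last segment of the URL path, which is good enough for direct image links but may need adjusting for URLs that end in a query string or a redirect.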

Feb 17

A few Superfeedr API tricks

Bloggers love being promoted inside apps like FanPulse!  We also love it when we can depend on some awesome people to write great sport posts to help people keep up with their favorite teams and games.  The problem is, lots of infrastructure needs to be put together on the backend to support the synchronization of our algorithms with the premium content from blogs and news sources.  Superfeedr fortunately has some tricks for me to automate the process of adding and removing sources on the fly.

When a new source is added to our database, I fire off an after filter that subscribes Superfeedr to that feed.  When the source is destroyed, I also unsubscribe with Superfeedr.  The API makes it super easy.  Here are some examples from FanPulse's Rails framework:

Adding a feed to Superfeedr to subscribe to:

I have a few constants in there that are specific to our app.  

  • SUPERFEEDR_URL = 'http://superfeedr.com/hubbub'
  • SUPERFEEDR_LOGIN = my_superfeedr_login_name
  • SUPERFEEDR_PASSWORD = my_superfeedr_password
  • SUPERFEEDR_CALLBACK = my_servers_callback_method

I'm using the RestClient gem by adamwiggins to make the calls, as you can see it's pretty darn easy.  "hub.topic" is the url for the feed,  "hub.callback" is the url I want future feeds to be pushed to using webhooks as well as subscribing internally, "hub.verify" should be set to sync since we're not doing an asynchronous call, and lastly "hub.mode" is set to subscribe since we want to add this new feed.
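The original RestClient call is not reproduced here.  As a rough sketch of the same request, the snippet below builds it with Ruby's standard Net::HTTP instead of RestClient and stops short of sending it.  superfeedr_request is a made-up helper, and the feed URL, callback URL and credentials are placeholders.

```ruby
require 'net/http'
require 'uri'

SUPERFEEDR_URL = 'http://superfeedr.com/hubbub'

# Builds (but does not send) a subscribe or unsubscribe request.
# mode is "subscribe" or "unsubscribe".
def superfeedr_request(mode, feed_url, callback_url, login, password)
  request = Net::HTTP::Post.new(URI.parse(SUPERFEEDR_URL).path)
  request.basic_auth(login, password)
  request.set_form_data(
    "hub.mode"     => mode,
    "hub.verify"   => "sync",
    "hub.topic"    => feed_url,
    "hub.callback" => callback_url
  )
  request
end

request = superfeedr_request("subscribe", "http://example.com/feed.xml",
                             "http://myapp.example.com/superfeedr/callback",
                             "my_login", "my_password")
# To send it: Net::HTTP.start(host, port) { |http| http.request(request) }
```

Unsubscribing when a source is destroyed is the same call with "unsubscribe" as the mode, which is why it fits nicely in a pair of model callbacks.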

There are a few slight subtleties I did not mention, including the fact that your callback has to implement the basic PubSubHubbub subscribe / unsubscribe spec as usual.  That will have to be in another discussion though.

Lastly, if you want to destroy the source from your database, just make sure you hit the Superfeedr API to unsubscribe the feed as well.

Hope that helps!  Big thanks to Julien for all the support and help getting FanPulse running smoothly with the awesome Superfeedr.

Jan 15

Ruby String Concatenation Escaping

Quick post about something strange I came across recently.  When you concat two ruby strings, it does an extra escape on each string.

The \r\n were double escaped so when you print the string, you don't actually get carriage return or newline characters, you literally get the slash r slash n.  That's all =)
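The original snippet is not shown here, so this is a guess at the cause rather than a diagnosis.  One common way to end up with a literal slash r slash n is to build the string from single-quoted literals, which keep the backslashes as-is, whereas concatenation itself leaves the characters untouched.  A small sketch of the difference:

```ruby
single = '\r\n'                # single quotes: backslash, r, backslash, n
double = "\r\n"                # double quotes: carriage return, newline
joined = "line one\r" + "\n"   # concatenation does not alter either side

single_length = single.length
double_length = double.length
joined_tail   = joined[-2, 2]
```

The single-quoted version is four characters long and the double-quoted one is two.  Also keep in mind that irb and inspect show escaped output, so a real carriage return can look like \r on screen even when the string is fine.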

Dec 28

Rails find all without associations

Finding all rows in one table that have a certain number of associations, or none at all, is a little tricky with ActiveRecord.  In fact, you just sort of have to hack it up or use find_by_sql instead.  Here's a smooth solution I used to find all entries without tags and notes (as an example):

Entry.find_in_batches(
  :select => "entries.*",
  :joins => ["LEFT OUTER JOIN tags ON entries.id = tags.entry_id", "LEFT OUTER JOIN notes ON entries.id = notes.entry_id"],
  :conditions => "entries.entry_type = 0 and tags.id is NULL and notes.id is NULL",
  :batch_size => 500
) do |batch|
  # do something
end

Above, I'm using find_in_batches, which is an awesome new Rails 2.3 feature that batches your finds for you.  No more manual limiting and offsetting needed!  It works great.  I'm using this in a daily cron to clean up weird stuff, so there are sometimes a lot of entries to play with.  Read more about the Rails 2.3 release.

 

About Arthur Chang

Life
I live in the San Francisco Bay Area and love to surround myself with friends and family. I'm a technology geek with an obsessive startup mentality, a photography nerd, and love to play sports (basketball, tennis, and more).

Startups
I am an entrepreneur with a background in software engineering. Most notably, I founded a company in 2009, Fanvibe.com, backed by investors including Y Combinator, which was acquired in 2011 by beRecruited.com. I am now the Lead of Product and Engineering (fancy title) of beRecruited.

Hacks
I graduated from UC Santa Barbara's College of Engineering with a B.S. in Computer Science in 2005. I've been developing and designing products in web and mobile platforms with large corporations and many of my own startups. I'm obsessed with disruptive apps, cutting edge tech, social game mechanics, social network development, software security, and all things code.

Photography
Photography is one of my biggest passions. Historically, it has been a hobby of capturing stories within still images. I photograph weddings, engagements, travel destinations, landscapes, various events, and many good cause events as a volunteer.

I shoot with an iPhone 4S and various Nikon SLR gear. I'm available to shoot events, weddings, and engagements. I am also always happy to volunteer my time to photograph good cause events.

