google app engine – a retrospective

Saturday, November 7th, 2009

The past few weeks I’ve immersed myself in the Google App Engine (GAE), porting a local weather app into a scalable directory of hundreds of forecasts. It’s always a bit frustrating to leave your comfort zone, but generally it’s rewarding to stick it out and pocket the experience – working with app engine was no different in that respect. There’s a lot to love about app engine. The cost of entry is nil and an overnight sensation should theoretically scale gracefully; furthermore, the cost of resources above and beyond the quotas is reasonable. But the benefits of the platform do little to conceal a series of hard limitations. Many times it feels like the framework is fighting against you, and unlike the world of VPS or quality commodity hosting there simply are no alternatives: no configuration files to override and no libraries to install to take care of a particular problem. GAE is a highly managed platform, and expecting wiggle room against some of the more draconian limitations is foolish.

The Django compatibility along with the app patch was what drew me in initially. Had I been starting development from scratch, I would never have been cornered by the framework as often as I was – but my goal was porting existing Django code. The promise of GAE is that you get Django with a different models API, which for the most part holds. My ported app relied heavily on SQL relationships without a direct GQL analogue and on sizable updates from a cron job that walks the locales. The GAE datastore isn’t a slouch, but side by side with a SQL database it feels like slow motion. Over time I’d expect the datastore to get more highly tuned, but it’s a rude awakening when you’ve become accustomed to running 300 updates in under a second on LAMP. On GAE this particular logic was my first introduction to the DeadlineError. Without exception, no request, task or cron job can take more than 30 seconds to complete, and my once speedy SQL update hit the deadline wall in a fiery crash. I looked at the problem from every angle and came to the conclusion that there was simply no way the amount of data being used could be managed in the datastore with the deadline in place. As I would learn later looking at the quota overview, this one particular script was eating CPU like it was Fat Tuesday. Just deleting 10,000 entities after I decided to back out of the implementation basically broke the bank in terms of quota-supplied CPU. If you take anything away from this post, remember that the GAE datastore is not a SQL database; if you go into GAE development thinking otherwise, you will get burned. Once you accept the datastore for what it is, the healing can begin. In my case, that healing would take the form of memcache, which in GAE is provided in abundance.
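To make the memcache point a little more concrete, here is a minimal cache-aside sketch in plain Python. It does not call the App Engine APIs: `FakeCache` is a hypothetical in-memory stand-in for a memcache-style client, and `load_forecast` is an invented placeholder for an expensive datastore query. The idea is only that the slow lookup runs on a miss and is skipped on a hit.

```python
import time


class FakeCache(object):
    """Hypothetical in-memory stand-in for a memcache-style client."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.time() >= expires_at:
            del self._data[key]  # expired entries behave like misses
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = time.time() + ttl if ttl else None
        self._data[key] = (value, expires_at)


def cached(cache, key, loader, ttl=300):
    """Return the cached value for key, calling loader() only on a miss."""
    value = cache.get(key)
    if value is None:
        value = loader()
        cache.set(key, value, ttl)
    return value


calls = []


def load_forecast():
    # Placeholder for a slow datastore query (assumption for illustration).
    calls.append(1)
    return {"locale": "example", "high": 61}


cache = FakeCache()
first = cached(cache, "forecast:example", load_forecast)
second = cached(cache, "forecast:example", load_forecast)  # served from cache
```

A cache like this can evict entries at any time, so the loader has to stay correct on its own; the cache only saves work, it never replaces the datastore.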

No critique would be complete without a mention of the other menacing limitation, the DownloadError. When your app is connected to a web service you should be prepared to encounter this. The stock fetch (a urllib2 wrapper) allows for 5 seconds to get in, out and onward with your HTTP request – pass the threshold and you have nothing. Thankfully, that deadline can be pushed to 10 seconds using the deadline argument, but I still found that I was hitting this limit occasionally when a slowdown on the service side occurred. As was the case with my datastore trouble, this was part of a background process. A 10 second limit on an HTTP request seems tasteful when a user is on hold at the other end of the request, but in the context of a background process it’s nothing more than a cruel mistress. I understand Google’s need to implement some level of control here, but with the DeadlineError in place I’m confused why the HTTP timeout could not sit neatly under the standard request timeout. The lesson is clear: if your app depends on a slow or occasionally hammered service, GAE may not be the right move. In the world of commodity hosting, this is a common and easily reconciled issue – but there is simply no real solution within the GAE framework; the 10 second wall is immovable.
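Since the wall itself cannot be moved, about the only thing left is to try again. The sketch below shows a small retry wrapper in plain Python. Everything in it is an assumption for illustration: `FetchTimeout` stands in for whatever timeout error the platform raises, and `fetch` is any callable taking a URL and a deadline in seconds, not the real App Engine fetch.

```python
class FetchTimeout(Exception):
    """Hypothetical stand-in for the platform's fetch timeout error."""


def fetch_with_retries(fetch, url, attempts=3, deadline=10):
    """Call fetch(url, deadline) up to `attempts` times.

    Returns the first successful result, or re-raises the last timeout.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(url, deadline)
        except FetchTimeout as err:
            last_error = err
    raise last_error


def make_flaky_fetch(failures):
    """Build a fake fetch that times out `failures` times, then succeeds."""
    state = {"remaining": failures, "calls": 0}

    def fetch(url, deadline):
        state["calls"] += 1
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise FetchTimeout("no response from %s within %ss" % (url, deadline))
        return "payload from %s" % url

    return fetch, state


fetch, state = make_flaky_fetch(failures=2)
result = fetch_with_retries(fetch, "http://example.com/feed", attempts=3)
```

Keep in mind that every retry still counts against the overall 30 second request deadline, so in a background job a single attempt per task run, with the task re-queued on failure, may be the safer variant.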

The timeouts wouldn’t have been so painful if my port had not been based on code that collected and stored a fair amount of data at once in the background. In my short time with Google App Engine, I’ve learned how to accommodate the limitations to some degree. With the use of task queues, cron, and breaking larger processes into smaller ones, most of the troubles seem to dissolve. It’s a good deal more work to break background processes out in this manner, and other than satisfying the GAE limitations it has no benefit whatsoever. In a more general sense, this is the common path of least resistance within the framework: no matter which limitation you face, the same concept can usually be applied. Break the work down into smaller processes and/or requests. In any case, let’s hope Google takes another look at background processing in future updates to GAE.
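The break-it-down idea can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not the task queue API: `process_in_slices` and the `locales` list are invented names, and the `while` loop stands in for what would be separate queued tasks on the platform, each picking up at the offset the previous one returned.

```python
def process_in_slices(items, handle, start=0, batch_size=25):
    """Handle at most batch_size items beginning at `start`.

    Returns the offset where the next run should resume,
    or None when every item has been handled.
    """
    end = min(start + batch_size, len(items))
    for item in items[start:end]:
        handle(item)
    return end if end < len(items) else None


locales = ["locale-%d" % n for n in range(60)]  # made-up work list
updated = []
runs = 0
offset = 0
while offset is not None:
    # On the real platform each iteration would be its own short task.
    offset = process_in_slices(locales, updated.append, start=offset, batch_size=25)
    runs += 1
```

The batch size is the tuning knob: it needs to be small enough that one slice finishes comfortably inside the deadline even on a slow day.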

There is one last caveat I should mention – though unlike the timeouts, it’s less likely to affect you. What Google has done is wrap specific image processing functionality that a Django developer would normally use PIL to accomplish. I would guess that it was developed with image thumbnail galleries in mind, because that is one of the only suitable uses I can imagine. If your source material is photographic JPEG, you may be in decent shape. I, however, found myself in the unfortunate position of trying to use PNG output in dealing with PNG/GIF inputs. The input images would come in with a small adaptive palette and a very small file size, but once transformed within the API they became PNG32 monstrosities. It was disheartening to see the same image come out of my development server as a 15 KB PNG8 while the app engine would blow it up to 200 KB+. The available transformations aren’t all that limiting, but the lack of output control combined with poor default behaviors (such as not preserving the input palette) is maddening. I expect I’m an outlier on this one, so I can only hope more use cases are considered for the images API. In my particular case, this was an insurmountable issue and I found myself moving image processing off the app engine and standing up the service off the cloud. I mention this not because you’ll hit this particular wall – but because you will likely hit a similar situation. Perhaps you’ll need to prefetch elsewhere to work around a timeout, or transform available data larger than the 1MB download threshold into smaller files and reassemble them in the cloud, or present a result set count greater than 1000. I can’t predict where or when it might happen to you – but the smart move would be to expect to do a dance or two for any moderately complex application. There are so many restrictions in play that the laws of probability are bound to kick in somewhere. I think of it as the cloud tax.
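As one example of such a dance, the split-and-reassemble workaround for the 1MB threshold looks roughly like this in plain Python. The chunk size is an assumed value chosen to sit under the limit, and `split_payload` and `reassemble` are invented helper names; how the pieces are actually stored and fetched is left out.

```python
MAX_CHUNK = 1000 * 1000  # assumed safe size, a little under the 1MB threshold


def split_payload(data, chunk_size=MAX_CHUNK):
    """Cut a byte string into ordered pieces of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def reassemble(chunks):
    """Join the pieces back together in their original order."""
    return b"".join(chunks)


payload = b"x" * (2 * MAX_CHUNK + 123)  # dummy data just over two chunks
chunks = split_payload(payload)
restored = reassemble(chunks)
```

The pieces only reassemble correctly if their order is preserved, so in practice each one would need an index in its file name or key.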

No matter the platform, some amount of frustration is inevitable. There wasn’t a whole lot I couldn’t make happen on GAE with a little more time or creativity. Coming out the other end with a site in place on GAE is where much of the appreciation of GAE begins. The logging and dashboard are well thought through. I found it was easy to isolate the CPU-intensive code based on the quotas and target optimizations effectively. The regex search of logs means looking up an error in a task queue of hundreds is relatively easy. If you’re used to grepping server logs, you’ll not want to go back to the stone age. In Django, I’ve always thought such a system would be a godsend – it’s really appealing to get such a rich overview of the site without lifting a finger. Another major advantage over commodity hosting is that outgoing bandwidth is fixed at an extremely reasonable cost ($0.12 per GB) past the 1GB per day free quota.

Google App Engine is not a platform you will want to adopt blindly. I would highly encourage anyone considering such a move to build a rapid prototype first to smoke out any showstopping limitations built into the platform. They are certainly not the limits you’ll want to encounter a month down the road, when you have virtually no control over them beyond a feature request. If you find you can work within the constraints of the platform, having the GAE team deal with the scaling and ops of your site more than offsets the less-than-attractive aspects of developing on the platform.