Using Scala to handle exponential growth at a startup (Part 2)

Brian Pugh

As a follow-up to our previous Scala post about migrating to Scala from a CakePHP setup, Typesafe reached out to us to put together a case study. We are releasing a portion of the interview transcript below to give more specifics about why we decided to make the switch.

What were the business drivers, objectives or initiatives that led you to talk to Typesafe?

We looked at quite a few options but quickly narrowed in on PHP, Java and Scala. In the end we decided to go with Scala and Play. The core drivers that led us to this point were scalability, reliability and performance. We realized our current code base and architecture wouldn’t cut it as a medium- to long-term solution. We needed to do a significant rewrite to support the growth we were seeing and get the performance we wanted. I don’t doubt that we could have made the system scale in PHP; we could have solved this problem in a variety of ways. But since we were going to be rewriting a significant amount of code no matter which technology we chose, we decided to go with the technology we thought would best match our needs and provide the least painful path to a more scalable situation. We looked at a variety of options and found that Scala and Play really met our needs best.

Performance and scalability are common reasons we hear for people moving to the Typesafe stack, and it sounds like that was part of your decision.

Right, and like I said before, I’m not a PHP basher. It’s not my favorite language, but I do think we could have scaled the system with PHP. Core to this decision was that we needed to re-architect the system no matter what language was used. Scala and Play gave us better tools to re-architect to a service-oriented architecture and provided great support for parallel processing which was important to us. It gave us the tools we needed to make the architectural changes we wanted more effectively than I felt we could have done in Java or PHP.

Are there other factors that led you to choose Typesafe?

Efficient development was definitely important to us. We were previously using PHP which is very efficient for development. You make a change, you hit refresh, and bam, there it is. So the hot reloading that Play provides was a big deal for us. I think there’s still room for improvement. Play’s hot reload still is not as quick as it could be. The Scala compiler could still stand to be improved in terms of compile times, but having automated hot reload is a good solution that just works. I don’t have to worry about including another product like JRebel or anything…out of the box it pretty much works. That was important to us because we need developer efficiency. We were used to that very efficient environment and I’ve been in the other kind of environment where you make a change, you restart some big JBoss server or something and it takes two minutes before you can test your change. It’s painful to get anything done because it’s just so inefficient.

Yeah I came from that world so I can appreciate that. Anything else?

So efficient development was one big piece for us. Runtime monitoring was another one. Getting insight into what’s happening at runtime is not easy with PHP. And the JVM is pretty good–I can easily dump threads, I can connect with VisualVM, run jmap, use JMX–there’s just a lot of things I can do–and this is more the JVM and less the specific Typesafe stack, but visibility into the runtime was another thing that was big on our list.
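
As an illustration of the kind of runtime visibility described above (this is not code from the interview), here is a minimal sketch that reads a thread dump and heap usage from inside the JVM using the standard java.lang.management API, the same data that tools such as VisualVM read over JMX. The names RuntimeSnapshot, threadDump and heapUsedMb are made up for this example.

```scala
import java.lang.management.ManagementFactory

object RuntimeSnapshot {
  // Builds a text dump with one entry per live JVM thread.
  def threadDump(): String = {
    val threads = ManagementFactory.getThreadMXBean
    // false, false: skip locked-monitor and synchronizer details to keep the call cheap
    threads.dumpAllThreads(false, false).map { info =>
      val frames = info.getStackTrace.map(frame => "    at " + frame).mkString("\n")
      "\"" + info.getThreadName + "\" state=" + info.getThreadState + "\n" + frames
    }.mkString("\n\n")
  }

  // Heap currently in use, in whole megabytes.
  def heapUsedMb(): Long =
    ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed / (1024L * 1024L)

  def main(args: Array[String]): Unit = {
    println("heap used (MB): " + heapUsedMb())
    println(threadDump())
  }
}
```

A service could expose output like this on an internal diagnostics endpoint, although the external tools mentioned in the interview usually need no code changes at all.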

Parallel processing was another one. PHP does not give you much help with parallel processing. You don’t have any notion of threads. You have no notion of Futures or Actors. I mean, you don’t have threads so you can’t even think about any of these higher level abstractions like executor services, Futures or Actors. Java obviously supports concurrency, but Scala with Akka gives us additional approaches to choose from when designing for concurrency like Actors and Futures. To be fair, you can do those things in Java, but it is not nearly as easy or clean. Most Java based concurrency is done with shared mutable state which can be hard to get right even for really good developers. So the concurrency support provided by Scala and Akka played into the decision.
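
To make the contrast with shared mutable state concrete, the following is a small sketch using only the Scala standard library (scala.concurrent); it is an assumption-laden illustration, not code from the team. The functions loadTitle and loadPageCount are hypothetical stand-ins for two independent slow lookups. Note that it targets the scala.concurrent API that became standard in Scala 2.10, which is newer than the Scala 2.9 setup mentioned later in the interview.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FuturesSketch {
  // Hypothetical stand-ins for two independent, slow lookups.
  def loadTitle(id: Int): Future[String] = Future { "diagram-" + id }
  def loadPageCount(id: Int): Future[Int] = Future { id % 5 + 1 }

  // Both futures are started before they are combined, so they run concurrently.
  // No locks or shared variables are needed: each future yields an immutable value.
  def summary(id: Int): Future[String] = {
    val title = loadTitle(id)
    val pages = loadPageCount(id)
    for {
      t <- title
      p <- pages
    } yield t + " (" + p + " pages)"
  }

  def main(args: Array[String]): Unit =
    println(Await.result(summary(7), 5.seconds))
}
```

Blocking with Await is only done at the edge of the program here; inside a service the combined Future would normally be returned to the caller.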

Availability of libraries was another factor. PHP, Java, and Scala were on our short list because of library support. We didn’t feel like some other languages had the big set of libraries like PHP and the JVM do. With both PHP and the JVM it seems like whatever you need to do, you can almost always find a library that will help you out. Those were some of the criteria that were significant in our decision.

What version of Play are you using?

We are on 2.0 right now. I know that 2.1 should come out in the next week or two and we are looking forward to that because there are some things that have been done with Anorm that we hope to benefit from. Some of the performance characteristics of Anorm haven’t been fantastic for us and we’ve had to resort to straight JDBC in a few places where larger datasets are returned because the performance is significantly better. I know a lot of the performance issues have been fixed in Play 2.1 or at least we’ve seen some of the commit logs indicating that, so we’re looking forward to upgrading to 2.1 in the near future, by which I mean sometime in the next few months.

Did you find it difficult for your team to come up to speed on the Typesafe stack?

I think it’s pretty clear that Scala has a higher learning curve than PHP. I can get somebody out of school and they can get going on PHP quicker than they can in Scala. There’s just more concepts to grok in Scala than there are in PHP. But I feel like Scala makes up for that in the capabilities that it has and the way it encourages writing quality code. I feel like PHP has the whole object-oriented notion kind of bolted on top, and it’s very hacky, and it pushes you towards doing things that you shouldn’t be doing, towards writing mediocre code. Scala pushes you towards writing very high-quality code. It pushes functional concepts and immutability. It has closures, pattern matching, options, all the standard things that I’m sure you’re familiar with. I do buy into the story that Scala helps developers write quality code even if there is a higher learning curve initially.

How are the roles and teams split up?

Currently we’re split into two teams, and that’s kind of flexible, we change every so often. We have one team that focuses on the client–meaning the JavaScript client–we have a substantial code base in JavaScript. If you’ve seen our app at all, you’ve seen that it’s a pretty heavy JavaScript application. We have north of 150K lines of code just in JavaScript. So we have one team that’s focused on that side of it, and another team that’s focused on the server side infrastructure and operations. The client team is about 5 people right now and the server team is 7. The client team was fairly unaffected by the switch to Scala. The server team is obviously where the work related to Scala is happening.

Are you familiar with the Typesafe Console?

I have heard it mentioned but I have not looked at it closely, no.

(inaudible)

I would be interested, and as long as we’re on this topic–another thing that’s interesting to me is support for New Relic. We’ve been using New Relic in order to monitor our production environment and it does a pretty good job with PHP. We tried using it a little bit with Play about five months ago, maybe six months ago, and it wasn’t great. They hadn’t really gotten it nailed down to where we could get enough useful information that I could justify paying for it. They’ve got some pretty great tools in general but it is not tuned for a Scala and Play application. I don’t think it would be a huge effort for either Typesafe or New Relic or a combination of the two, to get to the point where you could get all the features in New Relic that I’m getting from PHP in a Play application. I know they do have support for some Java apps. I think it’s just that Play is different enough, because it’s Scala not Java, from what they typically expect that it doesn’t do great and they don’t have specific instrumentation for Play like they do for other frameworks and servlet containers. It would be nice to see Typesafe and New Relic work together to make the integration really great. But you know, if I can get what I need out of the Typesafe console, that would be fantastic as well. Especially if the console can handle situations where I have an autoscale group. So servers are coming in and out of rotation and new servers are registering themselves with the console as they come up and the console is allowing me to see aggregated metrics across all my servers. All of that would be interesting to continue to talk through going forward.

(inaudible)

…you had asked about which technologies we were using and how.

So obviously Scala 2.9 with Play 2.0 and Akka. Currently we still do have a good sized PHP code base in CakePHP. Over time, we’re migrating that CakePHP code base to Play based services. So we pick out chunks of functionality, build it in Scala and Play, and then have the PHP call out to the new service to actually do the work. We are trying to avoid a one time huge move to an entirely new system and instead release early and often individual services to replace the PHP based code. One example is how we generate the image for a diagram. We need images so users can export a diagram to a PNG or to show a thumbnail in the list of diagrams for the user.

Here’s how we generate that thumbnail: The request initially goes to the PHP application which farms it out to a service whose purpose in life is to generate images. In the PHP, we pull the data representing the diagram, send the data over to a Play service, the Play service generates the image and returns it to the PHP which returns it down to the client. In generating an image for the diagram, there may be multiple external images that need to be downloaded from the internet or from another place inside our own system. They are images that the user uploaded and placed in their diagram and are required to generate the one single image that has everything that’s on the diagram. That downloading of external images can be done in parallel. Also, if a diagram has 5 pages in it, we can do each of the 5 pages in parallel. So the image generation task parallelizes quite well and in more than just those two ways I mentioned.

We use both Actors and Futures to handle that concurrency which is where we’re making use of Akka. We’re using it inside of Play and using Actors wherever it makes sense to parallelize things. Making use of concurrency is one place where you can see more easily the concrete business value. The number of failures we have in generating PDFs, PNGs or JPEGs has been reduced by almost an order of magnitude. Usually the failure would come because we’d run the PHP process out of memory or it would just take so long that it timed out. We have a 30 second timeout on our load balancers, so if a request takes longer than 30 seconds, we just cut it off.

The outliers are way, way down. Right now, we can do 99% of our requests in less than 20 seconds for a PDF or image generation and if you get down to the 95th percentile, we do it in well under 2 seconds. Previously, when we were doing that heavy processing of generating an image in PHP, it was much worse than that.
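
The two levels of parallelism described above (external image downloads within a page, and pages within a diagram) can be sketched with standard-library Futures. This is only a rough illustration under stated assumptions, not the team's implementation: Page, RenderedPage, download and renderPage are hypothetical names, the download is simulated rather than performed over HTTP, and the interview says the real service also uses Akka Actors, which this sketch leaves out.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ThumbnailSketch {
  // Hypothetical model: a diagram is a list of pages, each referencing external images.
  final case class Page(number: Int, imageUrls: List[String])
  final case class RenderedPage(number: Int, imageBytes: Int)

  // Stand-in for an HTTP download; a real service would fetch the image bytes here.
  def download(url: String): Future[Array[Byte]] = Future { url.getBytes("UTF-8") }

  // Downloads for one page run concurrently; "rendering" waits for all of them.
  def renderPage(page: Page): Future[RenderedPage] =
    Future.traverse(page.imageUrls)(download).map { images =>
      RenderedPage(page.number, images.map(_.length).sum)
    }

  // Pages are rendered concurrently; results keep the input order.
  def renderDiagram(pages: List[Page]): Future[List[RenderedPage]] =
    Future.traverse(pages)(renderPage)

  def main(args: Array[String]): Unit = {
    val pages = List(Page(1, List("a.png", "bb.png")), Page(2, Nil))
    // Bound the wait, mirroring the 30 second load balancer timeout from the interview.
    println(Await.result(renderDiagram(pages), 30.seconds))
  }
}
```

Because each step returns an immutable result, the fan-out needs no explicit thread management or locking.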

Also, with the services architecture, the service is on its own box. It doesn’t affect the rest of the site when it’s doing heavy processing. That’s another advantage. Each service can use its own resources and not starve other services. So the whole site doesn’t suffer if one service happens to be doing some intense processing or taking a lot of resources. It’s not really specific to the Typesafe stack so much as the architecture change we made, but within the services architecture, where the Typesafe stack does come in, is in the fact that I can make use of Actors in Akka in order to parallelize the processing and get responses much quicker. I can also make better use of the hardware because we’re on multi-core boxes. We’re making more efficient use of the cores that we have available to us by using Actors and the underlying use of multiple threads.

What kind of boxes do you use–Linux on Intel or something like that?

Ubuntu boxes, Ubuntu 12.04. We’re on Amazon so we use EC2 instances running Ubuntu.

Everybody and their dog is doing that lately.

It’s very easy for a startup to get going with Amazon and nice not to have to worry about a data center and hardware replacement and all that. Amazon is doing a great job at providing a lot of nice services as well.

Exactly. So, is there an anecdote that you can think of that totally was a no-brainer for you? Is there anything else that really affirms your choice?

What’s really hard for folks to measure is a decent ROI, so if you have any of that information, that would be great to have as well.

The most clear example of where we can measure the ROI is probably the image generation I mentioned before. It’s not just the performance I mentioned before, but also the quality and readability of the code. The features that Scala has as a language really lent themselves to giving us a good solution for this problem. Our proprietary format for representing a diagram has a hierarchical structure about it that lends itself to making an almost domain-specific language for writing out images. The code reads “select color here” and under that “draw line from x to y”. We can just nest things using closures and it starts to look like a domain-specific language for how we take our proprietary format for diagrams and turn it into a nice image–JPG, PNG, or PDF. I can send over a little code sample with some extra comments if that helps.
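
The code sample offered above is not part of this transcript. Purely as a hypothetical illustration of the idea (nesting closures so that drawing code reads like "select color here" and under that "draw line from x to y"), a sketch might look like the following. Canvas, withColor, line and render are invented names, and the canvas only records commands as text instead of producing an image.

```scala
object DrawingDsl {
  // Hypothetical rendering target that just records drawing commands as text.
  final class Canvas {
    private val log = scala.collection.mutable.ListBuffer.empty[String]
    private var color = "black"

    // Runs the nested block with the given color, then restores the previous one.
    def withColor(name: String)(body: => Unit): Unit = {
      val previous = color
      color = name
      try body finally color = previous
    }

    def line(from: (Int, Int), to: (Int, Int)): Unit =
      log += (color + " line " + from + " -> " + to)

    def commands: List[String] = log.toList
  }

  // Creates a canvas, runs the drawing closure against it, and returns the commands.
  def render(draw: Canvas => Unit): List[String] = {
    val canvas = new Canvas
    draw(canvas)
    canvas.commands
  }

  def main(args: Array[String]): Unit = {
    val out = render { c =>
      c.withColor("red") {
        c.line((0, 0), (10, 10))
        c.withColor("blue") { c.line((10, 10), (20, 0)) }
        c.line((20, 0), (0, 0))
      }
    }
    out.foreach(println)
  }
}
```

The by-name parameter on withColor is what lets a plain block of code nest under a setting, so the structure of the code can follow the hierarchical structure of the diagram format.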

That would be great if it’s convenient. Any other information related to ROI?

I can provide some concrete metrics related to the increases we saw in both reliability and performance for image and PDF generation. This is a big part of our system because every time someone goes to their list of diagrams, we generate a thumbnail. So it’s something that’s happening all the time. We do check a cache, but we still end up doing a lot of image generation for thumbnails as people are making changes to diagrams all the time. I have some metrics I can send you and will try to get as concrete as I can about how much improvement we saw, both in terms of failure rates and performance improvements.

That would be great, thank you so much.

As a next step, I’d like to take everything you gave me so far and create a first draft of the story. I’ll send that over to you. It should be painless for you: I’ll do the bulk of the writing and you can review and we’ll iterate until you’re happy.

Read the full Typesafe case study.

3 Comments

Ben • February 20, 2013 at 11:18 pm

Very interesting read! I would like to read the first part, but the link to it is broken.

Oliver Plow • February 21, 2013 at 3:16 am

Hello,

thanks for the interesting article. The link to previous Scala post does not work:

Marc Fleury • February 22, 2013 at 11:35 am

“you restart some big JBoss server or something and it takes two minutes before you can test your change.” I guess you never heard of hot deploy? JBoss, Glassfish, Tomcat, they all have it.