Varnish: where developers & sysadmins meet
Earlier this week I attended the Varnish Summit in Amsterdam. This event is organized by Varnish Software, the commercial entity behind the Varnish Cache project.
I’m a big fan of Varnish and I’ve done a fair number of presentations on the topic. I was honored when Varnish Software invited me to speak.
Preaching to the choir
I must admit I was a bit intimidated by the thought: speaking about Varnish in front of an audience of Varnish users and Varnish Software employees.
What topic should I choose? It doesn’t make sense to do a technical talk, because I’d be outdone by the Varnish Software folks. And I didn’t feel like doing a typical customer case or a sales pitch.
I felt I would be preaching to the choir, so I decided to tell a story about how Varnish is used by the people in my network. These are clients, colleagues and fellow community members.
The premise of the story sounds pretty corny:
There is still a gap to bridge between developers and sysadmins. Varnish is a very relevant case in point.
It’s basically a story about collaboration and empathy. When applied to the tech scene, this is pretty much what DevOps is all about.
The presentation
As I said before, it would be impossible for me to do a relevant technical talk at the Varnish Summit given the speaker lineup and the audience.
So I decided to focus on the performance and delivery of my presentation. Once I had people’s attention, I would have built up enough credit to convey my message. And it worked! I managed to change the energy in the room and a lot of people chuckled throughout the presentation.
A notable quote from Steve Raby, the host of the summit:
Where do developers & sysadmins meet? Probably in a pub!
The atmosphere was great, the energy levels were rising and people could relate. That’s the buy-in I needed to convince them.
The message
The message is a pretty simple one:
Developers blame sysadmins for poor web performance and assume the infrastructure is not good enough. Sysadmins on the other hand blame the developers and assume the code doesn’t perform under heavy load.
The reality is that the internet has changed and the expectations of our end-users have changed along the way. In the grand scheme of things these are somewhat recent changes and not all organizations have transformed towards the way of The Cloud. But then again: on the web every year is considered a decade.
It’s kind of weird that I’m using the term “Cloud”, but that’s exactly the term I need to describe the transformation of traditional IT on the web towards pure web-based IT. And web performance is an essential piece of the puzzle.
Web performance also matters to the end user. It contributes to the user experience. Slow websites are just as bad as websites that are completely down.
Time to team up with your sysadmins and developers and use your combined skills to integrate Varnish the right way.
Why slow?
Why is your website slow or is it slowing down over time? In a lot of cases the slowest part of the application is where there’s communication with the database or an external API. Typically the webservers and application servers are waiting for data to come back from these systems.
- The network overhead can sometimes cause delays
- Filesystem interaction can also cause delays
- Poor database design, bad queries and a lack of effective indices will definitely slow you down
But … it works on my machine.
Yes, the application performs pretty well on your machine, but that doesn’t mean there aren’t any delays. The scope and the scale of your initial tests are nowhere near the production volume, so you don’t feel the pain.
Multiply those tiny delays by the number of visitors you’ll have when you’re live and all of a sudden they can amount to a massive slowdown.
Who’s responsible? The developer? The sysadmin?
So your website, web app or API is slow. Who’s responsible? Easy one, right?
It’s the developer’s fault! He/she should have written better code, better queries and designed the database in a better way.
Not really! Maybe the ops people are to blame:
- Maybe the servers don’t have enough RAM and CPU.
- Maybe the disks are too slow
- Maybe there aren’t enough servers in the cluster to handle the load
- Maybe the Linux kernel wasn’t tuned to handle the traffic
- Maybe the database server is configured in a way that keeps queries from performing the way they usually do
- …
It’s safe to say that the blame game doesn’t really work. Sure, everyone is accountable for his/her actions, but the best strategy is to work together, look for a solution and keep the common goal in mind: stable and well-performing systems contribute to the end-user experience.
Optimize, cache, use Varnish
Optimization is good: if you can identify and remove a bottleneck, things will improve. But these are usually temporary wins. You will reach a point where you have optimized most of your code and most of your systems, and there will still be high load and a significant cost to manage and expand your infrastructure.
That’s when people look at caching to solve their problem. Even if you don’t have a problem, caching is still a good idea.
Why would you re-compute a piece of information that hasn’t changed? Why waste resources on it?
In essence with computers, it’s caches all the way down. In this case and for this specific audience, Varnish is the caching solution I advocate.
Integrating Varnish is pretty simple: you put Varnish in front of your webservers, point it at your backends, change the DNS record and you’re done. Sort of …
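To make that concrete, here’s a minimal VCL 4.0 sketch of that setup. The host and port are hypothetical: they assume your webserver was moved to port 8080 on the same machine so Varnish can take over port 80.

```vcl
vcl 4.0;

# Hypothetical backend: the webserver Varnish now sits in front of.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}
```

With this file loaded, Varnish listens for client traffic (e.g. `varnishd -a :80 -f /etc/varnish/default.vcl`) and the DNS record simply points at the Varnish machine instead of the webserver.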
Hit rate
Installing and configuring Varnish is easy. Achieving a decent hit rate can be difficult. And it all depends on your code.
There are a couple of things Varnish doesn’t cache by default:
- Requests that contain a Cookie header
- Requests that contain an Authorization header
- Requests that change the state of the requested resource (i.e., any request that isn’t a GET or HEAD)
- Responses that contain a Set-Cookie header
- Responses with a Cache-Control or Expires header that disables caching
So basically as soon as you use a cookie, you’re screwed. A lot of sites use cookies … think about it.
The only way to solve that is to write some VCL that sanitizes requests and removes cookies from requests that don’t need them. In a lot of cases you end up with a really complicated Varnish caching policy that needs frequent updates.
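A cookie-sanitizing policy typically looks something like the sketch below. The file extensions and the list of stateful paths are hypothetical examples; every site needs its own rules, which is exactly why these policies grow complicated.

```vcl
vcl 4.0;

backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_recv {
    # Static assets never need cookies: strip them so Varnish can cache.
    if (req.url ~ "\.(css|js|png|jpg|gif|ico|woff)(\?.*)?$") {
        unset req.http.Cookie;
    }

    # Hypothetical rule: keep cookies only on paths that genuinely
    # track user state, strip them everywhere else.
    if (req.url !~ "^/(login|cart|account)") {
        unset req.http.Cookie;
    }
}
```

Once the Cookie header is gone, the default caching logic kicks in again and those requests become cacheable.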
Architecture
Here’s the crux of the story: most of your caching problems will be solved if you have a decent caching strategy built into your software architecture.
That’s right: you have to think about caching before you run into performance issues.
Here’s some advice:
- Respect HTTP and HTTP will respect you
- Avoid keeping track of user state (read: cookies) whenever you can
- Use Cache-Control headers and assign the right value depending on the content
- Design your pages as a collection of independent content blocks
- Assemble your content blocks with Edge Side Includes or AJAX (if required)
- Assign the right Cache-Control headers to every content block
- Remove all the tracking cookies from your requests; they’re not processed by the webserver anyway
- Cookies aren’t always evil: sometimes you’ll use cookie values as a valid cache variation
- Try to keep your VCL as limited as possible and let HTTP do the heavy lifting
- Have a solid cache invalidation strategy. Caching too aggressively is a bad thing.
- Provide hooks in your insert/update/delete logic to facilitate cache invalidation
- Caching dynamic pages is your priority, caching static files not so much
The challenge is real
The story I told is based on my 11 years of experience at Combell, Unitt and the other brands in the Intelligent holding. I still see myself as a developer working in an infrastructure environment. But to the outside world I’m an infrastructure guy giving coding advice.
The reality is that I have to deal with other people’s code on a daily basis. It’s not always pretty and in a lot of cases the code doesn’t scale well. The problem is often related to poor architecture decisions. But instead of pointing the finger at the client, I try to be part of the solution.
This talk, this blog post and the fact that I like Varnish a lot are living proof that collaboration between infrastructure people and developers is the real solution. I learn something new every day, and as a developer in an infrastructure environment I have learned a lot from our sysadmins. In return I use my experience to try to teach something new every day. Hence the blog post, hence the talk.
I’m pretty sure web performance problems are easier to solve in 2015 than in, let’s say, 2005. Then again, the load on the average website has increased severely since 2005.
Luckily you’re not alone. If you use a framework or a CMS, chances are that there’s a good Varnish VCL and maybe even a module to use AJAX or ESI. There’s a community of people thinking about these problems and trying to find solutions. Events like the Varnish Summit, the Varnish User Group meeting, dotScale, Velocity and DevOpsDays do a great job of gathering people to talk about the challenges and possible solutions.
After my presentation I was approached by one of the attendees who was delighted with my presentation. He said he laughed during my talk, not because it was ridiculous, but because the story is so recognizable. He told me he’s experiencing all of these issues at work and that he traveled to Amsterdam to find solutions for his problem.
I’m not saying that my presentation solved these issues, but at least they were addressed and at least some possible solutions were suggested.
Yes, the challenge is real. Let’s put the right people at the table to overcome these challenges. This will probably be a complementary team with both developers and sysadmins. Heck, there will probably be managers and other stakeholders involved as well.
Keep the end goal in mind: web performance is a crucial aspect of the end-user experience.