This post is mostly about sharing, and partly about asking for improvements to the code below. It is not a novel idea, but I wanted to share it anyway because it has had a big impact on our server.
At the end of last year, a surge of bots from China and Singapore brought the server to its knees and I was forced to make some changes. I noticed that many requests had redundant query params, or params on pages where they did not belong. I could not figure out why this was the case.
Example 1: site and type did not belong to the news page.
/news?page=1&site=domain.org&type=page
Example 2: The same param repeated multiple times.
/list?page=1&page=15&page=120
Example 3: Broken query params
/news?;page=1
This prevented Varnish from caching content properly, and the cache filled up with duplicate content. So I decided to go with a query param whitelist in Varnish. I collected the places where query params were actually used and filtered out the rest. I used a custom header to temporarily store the accepted params.
sub vcl_recv {
    ...
    # Remove unnecessary query strings to improve caching.
    # Check if URL has query string
    if (req.url ~ "\?") {
        # JS: keep only the "v" param.
        if (req.url ~ "\.js\?v=") {
            # X-Query collects the accepted params as "&name=value" pairs.
            set req.http.X-Query = "";
            # Require a non-empty value. If regsub() does not match, it returns
            # the whole URL, which would then end up inside the query string.
            if (req.url ~ "\?v=[^&]") {
                set req.http.X-Query = req.http.X-Query + regsub(req.url, "^.+\?(v=[^&]+).*?$", "&\1");
            }
            # Replace the query string with the collected params (leading "&" becomes "?").
            set req.url = regsub(req.url, "\?.+", regsub(req.http.X-Query, "^&", "?"));
            unset req.http.X-Query;
        }
        # Image styles: keep only "itok".
        elseif (req.url ~ "(\?|&)itok=" && req.url ~ "\.(webp|jpg|png)") {
            set req.http.X-Query = "";
            if (req.url ~ "(\?|&)itok=[^&]") {
                set req.http.X-Query = req.http.X-Query + regsub(req.url, "^.+(\?|&)(itok=[^&]+).*?$", "&\2");
            }
            set req.url = regsub(req.url, "\?.+", regsub(req.http.X-Query, "^&", "?"));
            unset req.http.X-Query;
        }
        # Some page with multiple query params.
        # The trailing "=" keeps this from matching params that merely start with a listed name.
        elseif (req.url ~ "/some-page" && req.url ~ "(\?|&)(group|t1|t2|t3|t4n|a)=") {
            set req.http.X-Query = "";
            if (req.url ~ "(\?|&)group=[^&]") {
                set req.http.X-Query = req.http.X-Query + regsub(req.url, "^.+(\?|&)(group=[^&]+).*?$", "&\2");
            }
            if (req.url ~ "(\?|&)t1=[^&]") {
                set req.http.X-Query = req.http.X-Query + regsub(req.url, "^.+(\?|&)(t1=[^&]+).*?$", "&\2");
            }
            if (req.url ~ "(\?|&)t2=[^&]") {
                set req.http.X-Query = req.http.X-Query + regsub(req.url, "^.+(\?|&)(t2=[^&]+).*?$", "&\2");
            }
            if (req.url ~ "(\?|&)t3=[^&]") {
                set req.http.X-Query = req.http.X-Query + regsub(req.url, "^.+(\?|&)(t3=[^&]+).*?$", "&\2");
            }
            if (req.url ~ "(\?|&)t4n=[^&]") {
                set req.http.X-Query = req.http.X-Query + regsub(req.url, "^.+(\?|&)(t4n=[^&]+).*?$", "&\2");
            }
            if (req.url ~ "(\?|&)a=[^&]") {
                set req.http.X-Query = req.http.X-Query + regsub(req.url, "^.+(\?|&)(a=[^&]+).*?$", "&\2");
            }
            set req.url = regsub(req.url, "\?.+", regsub(req.http.X-Query, "^&", "?"));
            unset req.http.X-Query;
        }
        ...
    }
}
This worked very well and the server recovered quickly. Some query params, such as campaign params, were only needed by JS, so they could be stripped out here as well.
It was a bit tricky and I missed a couple of params in our setup, which caused some pages to not work properly, but those were quickly fixed and it was much better than a non-responsive server.
This setup has a couple of benefits:
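For pages that have no whitelist branch of their own, campaign params could also be removed explicitly. The following is only a rough sketch of that blacklist variant, not what we run: the param names (utm_*, gclid, fbclid) and the backend address are assumptions, so swap in whatever your JS actually reads. Since the JS reads the params from the browser's address bar, the backend never needs to see them.

```vcl
vcl 4.0;

# Placeholder backend so the file compiles on its own (assumed address).
backend default { .host = "127.0.0.1"; .port = "8080"; }

sub vcl_recv {
    # Assumed list of campaign/tracking params.
    if (req.url ~ "(\?|&)(utm_[a-z]+|gclid|fbclid)=") {
        # Drop tracking params that are not first in the query string.
        set req.url = regsuball(req.url, "&(utm_[a-z]+|gclid|fbclid)=[^&]*", "");
        # Drop a tracking param in first position, keeping the "?".
        set req.url = regsub(req.url, "\?(utm_[a-z]+|gclid|fbclid)=[^&]*", "?");
        # Tidy up a leftover "?&" and a trailing "?".
        set req.url = regsub(req.url, "\?&", "?");
        set req.url = regsub(req.url, "\?$", "");
    }
}
```

With this, a request for /news?utm_source=a&page=1 would be looked up in the cache as /news?page=1.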
Varnish needs less memory
More requests are already in cache as their canonical URL
The <link rel=canonical> points to a clean URL
Probably has positive security implications because it's stripping out unwanted query params on almost every page
Since then, we haven't had any more trouble, so maybe this idea can help others as well.
What do you think? Do you have any improvements that could be made here?
Ideally I'd like to use subroutines to avoid repeating the same code over and over, but as far as I can tell custom subroutines in VCL can't take arguments, so I don't see how to make the per-param part fully generic.
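For what it's worth, VCL does allow user-defined subroutines that are invoked with call, although they take no arguments and the regex passed to regsub has to be a literal. So a partial cleanup might look like the sketch below: one subroutine per whitelisted param, plus one shared subroutine for the final rewrite. The subroutine names (keep_itok, apply_query_whitelist) and the backend address are made up for illustration, and I have only shown the image styles branch.

```vcl
vcl 4.0;

# Placeholder backend so the file compiles on its own (assumed address).
backend default { .host = "127.0.0.1"; .port = "8080"; }

# Hypothetical helper: append "itok" to X-Query if it has a non-empty value.
sub keep_itok {
    if (req.url ~ "(\?|&)itok=[^&]") {
        set req.http.X-Query = req.http.X-Query + regsub(req.url, "^.+(\?|&)(itok=[^&]+).*?$", "&\2");
    }
}

# Hypothetical helper: replace the query string with the collected params.
sub apply_query_whitelist {
    set req.url = regsub(req.url, "\?.+", regsub(req.http.X-Query, "^&", "?"));
    unset req.http.X-Query;
}

sub vcl_recv {
    # Image styles
    if (req.url ~ "(\?|&)itok=" && req.url ~ "\.(webp|jpg|png)") {
        set req.http.X-Query = "";
        call keep_itok;
        call apply_query_whitelist;
    }
}
```

Each additional param still needs its own small subroutine, but a branch with six params shrinks to six call lines plus the shared rewrite.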