Dockerising a Contao website II
This article is based on Contao 3. There is a new version, see Dockerising Contao 4
A central idea of Docker is to install the application in an image and mount persistent files into a running container. Thus, you can just throw away an instance of the app and start a new one very quickly (e.g. with an updated version of the app). Unfortunately, with Contao it’s not that straightforward, at least when using the image described earlier.
Here I’m describing how I fought the issues:
Issues with Cron
The first issue was Contao’s Poor-Man-Cron. This cron works as follows:
- The browser requests a file cron.txt, which is supposed to contain the timestamp of the last cron run.
- If the timestamp is “too” old, the browser will also request cron.php, which then runs overdue jobs.
- If a job was run, the timestamp in cron.txt will be updated, so cron.php won’t be run every time.
Good, but that means cron.txt will only be written if a cron job gets executed.
But what if the next job is only due next weekend?
The time of the last cron run is stored in the database, but cron.txt won’t exist by default.
That means that even if cron.php is run, it will know that there is no cron job to execute and will, therefore, exit without creating/updating the file.
With Docker in particular, you will hit this scenario every time you start a new container.
Thus, every visitor triggers a 404 error (as there is no cron.txt), which is of course ugly and spams the logs.
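To make the failure mode concrete, here is a small model of the flow described above. It is a Python sketch of the logic only, not Contao's code: the function names, the 60-second threshold, and the assumption that a missing cron.txt also triggers a cron.php request are all illustrative.

```python
from pathlib import Path
from typing import Optional

CRON_INTERVAL = 60  # seconds; hypothetical threshold for "too old"


def read_cron_txt(path: Path) -> Optional[int]:
    """What the browser sees: the stored timestamp, or None for a 404."""
    if not path.exists():
        return None
    return int(path.read_text().strip() or 0)


def cron_php(path: Path, jobs_due: bool, now: int) -> None:
    """Server side: cron.txt is only written if a job actually ran."""
    if jobs_due:
        # ... the overdue jobs would run here ...
        path.write_text(str(now))


def page_view(path: Path, jobs_due: bool, now: int) -> str:
    """One visitor: fetch cron.txt, call cron.php if it is missing or stale.

    Returns the HTTP status the visitor got for cron.txt.
    """
    stamp = read_cron_txt(path)
    if stamp is None or now - stamp > CRON_INTERVAL:
        cron_php(path, jobs_due, now)
    return "404" if stamp is None else "200"
```

As long as no job is due, every call to `page_view` on a fresh container returns "404" and the file never appears; only the first visit after a job became due creates it.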
I fixed the issue by extending the Contao source code.
The patch is already merged into the official release of Contao 3.5.33.
In addition, I’m initialising cron.txt in my Docker image with a timestamp of 0, see the Dockerfile.
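In my image this happens at build time in the Dockerfile. The same idea, written as a start-up step, could look roughly like the following Python sketch; the function name is made up and the location of cron.txt is left to the caller, since it depends on your installation.

```python
from pathlib import Path


def seed_cron_txt(path: Path) -> bool:
    """Create cron.txt containing "0" unless it already exists.

    Returns True if the file was created.
    """
    if path.exists():
        return False  # never overwrite a real timestamp
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("0")  # 0 = "last run long ago", so the first check looks overdue
    return True
```

Seeding with 0 means the very first visitor gets a valid (if ancient) timestamp instead of a 404.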
Issues with Proxies
A typical Docker infrastructure (at least for me) consists of a bunch of containers orchestrated in various networks etc. Usually, you’ll have at least one (reverse) proxy, which distributes HTTP requests to the container in charge. However, I experienced a few issues with my proxy setup:
HTTPS vs HTTP
While the connection between client (user, web browser) and reverse proxy is SSL-encrypted, the proxy and the webserver talk plain HTTP.
As it’s the same machine, there is no big need to waste time on encryption.
But Contao has a problem with that setup.
Even though the reverse proxy properly sends the HTTP_X_FORWARDED_PROTO, Contao only sees incoming HTTP traffic and uses http://-URLs in all documents…
Even if you ignore the mixed-content issue and/or implement a rewrite of HTTP to HTTPS at the web-server layer, this will produce twice as many connections as necessary!
The solution is however not that difficult.
Contao does not understand
HTTP_X_FORWARDED_PROTO, but it recognises the
Thus, to fix that issue you just need to add the following to your
system/config/initconfig.php (see also Issue 7542):
As a side effect, this will generate URLs including the port number (e.g. https://example.com:443/etc), but they are perfectly valid. (Not like https://example.com:80/etc or something that I saw during my tests… ;-)
This workaround doesn’t work for Contao 4 anymore! To fix it see Dockerising Contao 4
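The idea behind the Contao 3 workaround is simply to translate the proxy's header into something the application already checks before it builds any URL. The following is a Python model of that idea, not the PHP that belongs in initconfig.php; in particular, the flag name `HTTPS` and both function names are assumptions made for illustration.

```python
def apply_forwarded_proto(server: dict) -> dict:
    """Return a copy of the server variables with the HTTPS flag set
    if the reverse proxy reported an HTTPS connection to the client."""
    env = dict(server)
    if env.get("HTTP_X_FORWARDED_PROTO", "").lower() == "https":
        env["HTTPS"] = "on"  # assumed name of the flag the application looks at
    return env


def base_url(env: dict, host: str) -> str:
    """How an application might pick the scheme for generated links."""
    scheme = "https" if env.get("HTTPS") == "on" else "http"
    return f"{scheme}://{host}"
```

Keep in mind that a client can send this header too, so trusting it is only reasonable if the web server is reachable exclusively through the proxy.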
URL encodings in the Sitemap
The previous fix brought up just another issue: The URL encoding in the sitemap breaks when using the port component (
rawurlencode to encode all URLs before writing them to the sitemap.
rawurlencode encodes quite a lot!
Among others, it converts
Thus, all URLs in my sitemap looked like this: https://example.com%3A443/etc, which is obviously invalid.
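The effect is easy to reproduce outside of Contao. In the Python sketch below, `urllib.parse.quote` stands in for PHP's rawurlencode; `encode_whole` is a made-up model that reproduces the symptom (it is not Contao's actual code), and `encode_path_only` shows one possible way around it by leaving scheme, host and port untouched.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_whole(url: str) -> str:
    """Percent-encode the entire URL, then repair only the '://' separator.

    The colon in front of the port stays encoded, which breaks the URL.
    """
    return quote(url, safe="/").replace("%3A//", "://", 1)


def encode_path_only(url: str) -> str:
    """Percent-encode just the path; scheme, host and port stay as they are."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=quote(parts.path, safe="/")))


print(encode_whole("https://example.com:443/etc"))
print(encode_path_only("https://example.com:443/etc/a b"))
```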
Issues with Cache and Assets etc
A more delicate issue is the cache, the assets, the sitemaps etc. Contao’s backend comes with convenient buttons to clear/regenerate these files and to create the search index. Yet, you don’t always want to log in to the backend when recreating the Docker container. Sometimes you simply can’t, for example if the container needs to be recreated overnight.
Basically, that is not a big issue. Assets and cache will be regenerated once they are needed. But the sitemaps, for instance, will only be generated when interacting with the backend.
Thus, we need a solution to create these files as soon as possible, preferably in the background after a container is created.
Most of the stuff can be done using the Automator tool, but I also have some personal scripts developed by a company that require other mechanisms and are unfortunately not properly integrated into Contao’s hooks landscape.
And if we need to touch code anyway, we can also generate all assets and rebuild the search index manually (pre-creating necessary assets will later on speed things up for users…).
To generate all assets (images and scripts etc), we just need to access every single page at the frontend.
This will then trigger Contao to create the assets and cache, and subsequent requests from real-life users will be much faster!
The best hack that I came up with so far looks like the following script, which I uploaded to /files/initialiser.php in the Contao instance:
The first 3 lines initialise the Contao environment.
Here I assume that
../system/initialize.php exists (i.e. the script is saved in the
The next few lines purge existing cache using the Automator tool and subsequently regenerate the cache – just to be clean ;-)
Finally, the script
(i) collects all “searchable pages” using the
(ii) enriches this set of pages with additional pages that may be hooked-in by plugins etc through
and then (iii) uses cURL to iteratively request each page.
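Step (iii) is independent of Contao and can be sketched on its own. The following Python version is a stand-in for the cURL loop, under the assumption that the list of page URLs has already been collected; `warm_up` and `fetch_status` are hypothetical names.

```python
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_status(url: str, timeout: float = 30.0) -> int:
    """Request one page and return its HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def warm_up(urls: Iterable[str],
            fetch: Callable[[str], int] = fetch_status) -> Dict[str, object]:
    """Request every page once so that cache and assets get generated.

    Returns a mapping of URL to status code, or to the error text if the
    request failed.
    """
    results: Dict[str, object] = {}
    for url in dict.fromkeys(urls):  # drop duplicates, keep the order
        try:
            results[url] = fetch(url)
        except Exception as exc:  # one broken page should not stop the run
            results[url] = repr(exc)
    return results
```

Passing the fetch function as a parameter keeps the loop testable without a network, and a failing page is recorded instead of aborting the whole warm-up.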
The first part should be reasonably fast, so clients may be willing to wait until the cache stuff is recreated. Accessing every frontend page, however, may require a significant amount of time, especially for larger websites. Thus, I embedded everything in the following skeleton, which advises the browser to close the connection before we start the time-consuming tasks:
Here, the browser is told to close the connection after a certain content size arrived.
I buffer the content that I want to transfer and flush it using ob_end_flush, so I know how big it is (using ob_get_length). Anything sent after that can safely be ignored by the client, and the connection can be closed.
(You cannot be sure that the browser really closes the connection. I saw curl doing it, but also some versions of Firefox still waiting for the script to finish… Nevertheless, the important content will be transferred quickly enough.)
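The ordering is the important part of this trick: measure the buffered body, announce exactly that size together with a close request, flush, and only then start the slow work. Here is a minimal Python sketch of that ordering; the callbacks `write`, `flush` and `work` are placeholders for whatever your server environment provides.

```python
from typing import Callable


def respond_then_work(body: bytes,
                      write: Callable[[bytes], None],
                      flush: Callable[[], None],
                      work: Callable[[], None]) -> None:
    """Send a complete response first, then run the time-consuming part."""
    head = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Type: text/plain; charset=utf-8\r\n"
        f"Content-Length: {len(body)}\r\n"  # client may stop reading after this many bytes
        "Connection: close\r\n"             # ask the client not to keep the connection open
        "\r\n"
    ).encode("ascii")
    write(head + body)
    flush()  # make sure everything has left our buffers ...
    work()   # ... before the slow part begins
```

Whether the client actually disconnects is still up to the client, as noted above.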
In addition, I created some mod_rewrite rules to automatically regenerate missing files.
For example, for the sitemaps I added the following to the vhost config (or
That means that if, for example, /share/sitemap.xml does not exist yet, the user gets automagically redirected to our script.
In addition, I added some request parameters (?target=sitemap&sitemap=$1), so that initialiser.php knows which file was requested.
It can then regenerate everything and immediately output the new content! :)
For example, my snippet to regenerate and serve the sitemap looks similar to this:
Thus, the request to /share/somesitemap.xml will never fail.
If the file does not exist, the client will be redirected to /files/initialiser.php?target=sitemap&sitemap=somesitemap, the file /share/somesitemap.xml will be regenerated, and the new contents will immediately be served.
So the client will eventually get the desired content :)
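Stripped of the Apache and Contao specifics, the behaviour is "serve the file if it is there, otherwise build it first". The Python sketch below models just that; `serve_sitemap` and the `regenerate` callback are hypothetical, and the name check is an addition of mine, because the name arrives as a request parameter and should not be able to point outside the share directory.

```python
import re
from pathlib import Path
from typing import Callable


def serve_sitemap(name: str, share_dir: Path,
                  regenerate: Callable[[Path], None]) -> bytes:
    """Return the contents of share_dir/<name>.xml, regenerating it if missing."""
    # Only allow plain names, so "?sitemap=../../something" cannot escape share_dir.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", name):
        raise ValueError("invalid sitemap name")
    target = share_dir / f"{name}.xml"
    if not target.exists():
        regenerate(target)  # expected to write the file to `target`
    return target.read_bytes()
```

The second request for the same sitemap finds the file in place and never reaches the regeneration step.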
Please be aware that this script is easily DOS-able! Attackers may produce a lot of load by accessing the file. Thus, I added some simple DOS protection to the beginning of the script, which makes sure the whole script is not run more than once per hour (3600 seconds):
true, it won’t regenerate cache etc, but still serve the sitemap and other files if requested..
However, if there is also no
$_GET['target'] defined, we don’t know what to serve anyway and can
You could include the script at the footer of your webpage, e.g. using
/*...*/ or something…)
This way you would make sure that every request produces a fully initialised system. However, this will probably also create unnecessary load every hour… You could increase the time span in the DOS-protection-hack, but I guess it should be sufficient to run the script only if a missing file is requested. Earlier requests then need to wait for pending assets etc, but to be honest, that should not be too long (or you have a different problem anyway…).
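The once-per-hour guard itself needs nothing more than a timestamp stored somewhere persistent. A minimal Python sketch of such a guard, assuming a small stamp file is acceptable (`may_run` and the file are hypothetical, and the check is deliberately simple rather than race-free):

```python
import time
from pathlib import Path
from typing import Optional

MIN_INTERVAL = 3600  # seconds between two full runs


def may_run(stamp_file: Path, now: Optional[float] = None,
            min_interval: int = MIN_INTERVAL) -> bool:
    """Return True (and record the time) if the last full run is old enough."""
    now = time.time() if now is None else now
    try:
        last = float(stamp_file.read_text())
    except (FileNotFoundError, ValueError):
        last = 0.0  # no or unreadable stamp: treat as "never run"
    if now - last < min_interval:
        return False  # too soon: skip the expensive regeneration
    stamp_file.write_text(str(now))
    return True
```

Increasing the time span then only means passing a larger `min_interval`. Two requests arriving at the same instant could both pass the check, which is fine for throttling load but would not do as a lock.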
And if your website provides an RSS feed, you could subscribe to it using your default reader, which will regularly make sure that the RSS feed is generated if missing (and thus trigger all the other stuff in our script). A feed reader as the poorest-man-cron ;-)
As I said earlier, my version of the script contains plenty of personalised stuff. That’s why I cannot easily share it with you. :(
However, if you have trouble implementing it yourself just let me know :)