Collecting Real User Metrics_

December 28, 2018 @10:37

I like metrics. I've been working lately to convert some aging parts of my monitoring and alerting infrastructure over to something slightly more modern (in this case Grafana, InfluxDB, Telegraf and collectd). As part of this project I'm looking broadly at my environment and trying to decide what else to monitor (and what I don't need to monitor anymore). One of the things that came up was website performance.

Challenges

Now if you go all the way back to my first post on this iteration of the blog you'll see that this website is static. In fact if you dig around a bit you'll see that I render all the pages from Jinja templates via a git post-receive hook so all my webserver is doing is sending you data off disk. Since server side metrics are practically useless in this case I naturally gravitated towards collecting RUM or Real User Metrics.

Privacy Goals

So now we enter what is (for me at least) the slippery slope. Once you decide to ship information back from the user agent to the server you open yourself up to a huge amount of options. Do I try to do cute stuff with cookies and LocalStorage to assign a user a unique ID (or maybe do some browser fingerprinting) to watch 'visits', do I try to gather detailed device information (ostensibly so I can test the site based on what people use), do I try to determine how people got to my site by history snooping? The possibilities for data collection in this fashion are pretty intense. For me though the answer is almost entirely a resounding no. As someone with a three tiered malware / tracking blocking infrastructure I don't want to be part of the problem. I firmly believe that privacy and security go hand in hand on the web so I refuse to collect any more information than what I need. I also think that this is just plain good security practice. Sadly this is in direct conflict with the urge and ability to just slurp everything into storage somewhere which seems to be a pretty successful business model (and generally horrific practice) these days. Realistically I don't think that someone's going to hack my site to get browsing details on the 20 or 30 visitors per day that I get but I see it as an excuse to lead by example here, to be able to say that you can collect useful real user metrics easily, fast and with an eye towards privacy.

Browser Side

If you view the source of this page, just before the footer you'll see a script tag loading 'timing.js'. It is pretty self explanatory, all this does is fire the sendLoadTiming() function once the DOM readyState hits the complete phase. This ensures that all the performance metrics are available in your browser. sendLoadTiming() then simply takes that PerformanceTiming data in your browser adds the current page location and your user agent string and sends it along to a Flask application.

Server Side

The Flask application receives the data (a JSON object) from timing.js and is responsible for sanitizing it and sending it off to InfluxDB so I can graph it. The first thing that happens to a request is that I compute 6 timing measurements from the data.

  1. How long DNS resolution took
  2. How long connection setup took
  3. How long the SSL handshake took
  4. How long the HTTP request took
  5. How long the HTTP response took
  6. How long the DOM took to load

I then sanitize the user agent string into browser and platform using the Request object provided by Flask which exposes the underlying Werkzeug useragents helper. Based on the platform name I (crudely) determine if the device is a mobile device or not. Finally I collect the URL, and HTTP version used in the request. All of this information becomes tags on the 6 measurements that get stored in InfluxDB.

Dashboard

Dashboard!

Once the data is in InfluxDB it is a pretty simple task to create a Grafana dashboard. The JSON for the dashboard I use is available here, though it will likely need some customization for your environment.

Conclusion

So far I think it's been a success. Leveraging JavaScript for this has kept all of the automated traffic out of the data set meaning that I'm really only looking at data with people on the other end (which is the point). It has enough information to be useful but not so much as to be a potential invasion of privacy. All of the parts are open source if you want to look at them or use them yourself.

I may add a measurement for DOM interactive time as well as the DOM complete time. It looks like a LOT of people have some strange onLoad crap happening in their browser.

Some strange timing from Android

Also, once the new PerformanceNavigationTiming API is broadly supported (once Safari supports it, at least) I'll need to migrate to that but I don't see that as a major change. I'd love to hear from you if you found this useful or if you have any suggestions on expanding on it.

🍻

Subscribe via RSS. Send me a comment.