
Who hosts the web I browse?

22 minute read (≈4224 words)

A few days ago I read a blog post by someone who wanted to find out which email services are most commonly used to send all the product emails he receives. I found that quite interesting, and it inspired me to ask a similar question:

What are the top services that host the websites I use?

Define “hosting”

Who hosts a website is not easy to answer. A modern website consists of multiple different components, ranging from images, stylesheets and JavaScript files up to ads, videos or custom fonts. They don’t all have to be served from a single domain, and therefore it is not necessary to use a single hosting company.

To make matters worse, a lot of websites use content delivery networks (CDNs) nowadays. A CDN consists of multiple geographically distributed servers which provide fast delivery of all kinds of content. It sits between the end user and the hosting provider. In some rare situations it can replace proper web hosting, but in the vast majority of cases it works like a cache, which means the end user never connects to the hosting network directly. Only in some cases is it still possible to find out who really hosts a website.

In the end, from my perspective the only thing that matters is whom I connect to. Be it a CDN or a hosting provider, a traditional hoster or a cloud platform. I just want to gather some napkin statistics about the web landscape my browser speaks to.

What’s the part of the web I use?

Before I can start creating statistics, I need to know which websites I visit. At first I thought about tracing all connections on my home router and gathering data about whom I connect to. This would work for all my internet traffic, at least at home, including the non-www parts. But firstly, I don’t really care about traffic outside of the world wide web, and secondly, I would need to wait at least a few weeks to gather any meaningful data.

There is another source of information built into everyone’s browser: the history feature. Every common browser keeps track of the websites you visited. Depending on your configuration, it lasts as long as your browser window is open or reaches up to several months into the past. I personally really like the awesome bar in Firefox to quickly recall already visited websites. The best results are of course achieved with a long history. That’s why I already have a fairly long history and, thanks to Firefox’s sync service, even from multiple devices including my PC, laptop and mobile phone[1].

Extracting history data

Info

What I describe here is only valid for Firefox. I’m certain that other browsers have similar capabilities. If you want to recreate this whole process with another browser, please consult your favorite search engine.

The history of all visited websites is stored in an SQLite database called places.sqlite. The file is located in your profile’s directory. In case you don’t already know where your profile folder lives on your hard drive, you can find out by opening about:profiles in Firefox.

There is a lot of interesting stuff in there. Feel free to poke around for a while if you are curious about your own browsing habits. To get started, you can get some helpful pointers from this gist from the user “olejorgenb”.
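If you just want a quick look around, the sqlite3 shell is enough. A small sketch (table and column names may differ slightly between Firefox versions):

sqlite3 places.sqlite '.tables'
sqlite3 places.sqlite \
  'select visit_count, url from moz_places order by visit_count desc limit 10'

The second query lists your ten most visited pages.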

For my evaluation I just need a list of all visited domains. There is a table called moz_origins which provides exactly that. I used the sqlite3 CLI tool to fetch the data like this:

sqlite3 -separator '' places.sqlite \
  'select prefix,host from moz_origins where prefix IN ("http://","https://")' \
  > domainlist.txt

This select query outputs all visited domains including the protocol. I used a where clause to filter out all unwanted protocols like ftp or file. The generated list contains around 6000 entries in my case. The oldest entry is from August 2017[2]. It’s a surprisingly short list for two and a half years of browsing history. If I had to guess, I would have expected a lot more. On the other hand, after taking a closer look, it seems to be complete. I’m not missing any entry.

Hosting service detection

Finding out which part of the web I browse was easy. Identifying the hosting company was way harder than I anticipated. As mentioned above, detecting the hosting company is almost impossible when a CDN is used. That’s why I’m satisfied when I can detect either a CDN or a hosting company. I will not distinguish between them.

Method #1: Using IP address information

I thought I could find a solution by using IP address information from ARIN or RIPE. After all, not all traffic goes through a CDN, so maybe there is a way to distinguish between hoster and CDN, at least in some cases. Most of the time, only static content is routed through a CDN (with exceptions like Cloudflare Workers). Maybe I could filter out the static part by seeing what belongs to a CDN and what is self-hosted. Then I would retrieve IP and AS information with RDAP to find out who hosts the sites.
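A rough sketch of what such a lookup could look like (example.org is just a placeholder, and the jq paths are only illustrative; the exact fields in an RDAP response vary between registries):

# Resolve a host, then ask a public RDAP bootstrap service (rdap.org) about the IP network.
IP=$(dig +short example.org A | tail -n1)
curl -sL "https://rdap.org/ip/$IP" \
  | jq '{network: .name, org_handles: [.entities[]?.handle]}'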

Sadly, parsing the IP address and AS information is way harder than I imagined. The data is far less structured than one would expect. Fields like the operator or company name appear under a lot of different labels and have rather vague definitions. I would need to build a complex heuristic to determine the correct hosting service name. Too much work for such a small experiment.

Method #2: See what academia has to say about it?

I’m sure I’m not the first one trying to find out who hosts a website. There is a paper out there from Matic et al.[3] about their approach to answering this question. Their results look very promising, even if they mostly focus on differentiating between self-hosted and CDN-hosted sites. This could be a good starting point for me. They even released their tool “PYTHIA” publicly. Great.

Unfortunately, I was not able to reproduce their results. It mostly just didn’t work correctly. Maybe I’m using it wrong, although in my opinion I followed the steps from their README correctly. The results were simply not very reliable. Even spot checks revealed many false findings.

The ideas themselves are good. I learned a lot by going through the source code and reading the paper. With a bit more time and knowledge it could probably be turned into a viable solution. But at the moment, that too is too much work.

Method #3: Using a SaaS service

After some searching I found some services that do exactly what I want.

Most of these services let you test a given domain directly on their website. If they provide an API, it is either restricted or costs money. I’m not willing to spend anything other than my free time on this experiment, so they are out of the question.

One of them, cdnfinder, is interesting because it links to a Github repository where we can find the source code for the project. I can use it from home without any other external dependency and without paying a cent. Bingo!

Method #4: Using an open-source tool

cdnfinder is an open source project from CDNPlanet. It uses a mix of CNAME suffix matching and HTTP header detection based on a simple heuristic.
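The CNAME part of that heuristic is easy to illustrate. A minimal sketch (not cdnfinder’s actual code; the hostname is made up and the suffix list is heavily shortened):

# Resolve the CNAME of an asset host and match it against well-known CDN suffixes.
# Note: dig's +short answer ends with a trailing dot.
host="static.example.com"
cname=$(dig +short "$host" CNAME)
case "$cname" in
  *.cloudfront.net.) echo "Amazon CloudFront" ;;
  *.fastly.net.)     echo "Fastly" ;;
  *.akamaiedge.net.) echo "Akamai" ;;
  *)                 echo "no known CDN suffix" ;;
esac

The HTTP header method works similarly, typically looking at response headers such as server or via instead of DNS records.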

After building the project, I used the one-off CLI command cdnfindercli to play around with the tool. The results were okay, but not great. Some smaller CDNs were completely missing. Even Cloudflare couldn’t be reliably detected. There are some open issues about this and even some pull requests to enhance the detection. Unfortunately, the maintainer seems to be inactive at the moment.

I found another open source project with a similar goal: True Sight. It’s a browser plugin, so not really suitable for my use case, but the detection rate seems to be better. They have a list of all supported CDNs with a link to the actual implementation. Even though it’s written in JavaScript instead of Go, I was easily able to port most of their detection code over to the cdnfinder project.

Info

Just for comparison: the original cdnfinder detected a CDN in only 34% of all my requests. The patched version with the information from True Sight raised this to 70%, with many more providers detected.

For the actual evaluation, I used cdnfinderserver. After all, I wanted to test 6000 domains; it would be a waste of resources to restart the same process again and again instead of starting the server once and using its REST API.

The server listens on port 1337 by default. It expects either just a domain name, to only use the CNAME method, or a full URL, to test each resource of the given website with both detection methods. It must be a POST request and the payload must be in JSON format. The output is also JSON. More details can be found in the README.
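For a single URL, a request looks like this (the same shape the script below uses); the response contains, among other things, an everything array with one entry per loaded resource:

curl -s localhost:1337 -X POST \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.org"}'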

I wrote a super small bash script to iterate over the list of all my domains. The output gets filtered by jq, because I don’t need all the data from the cdnfinder tool.

Help wanted

I was not able to find an elegant method to concatenate all the JSON objects that jq outputs. I just print out each object, which results in something like {object1}{object2}, which is not valid JSON. Using sed to replace }{ with },{ seems hacky. If you have a better solution, please let me know! I am curious.

# Input: one URL per line. Output: a single JSON array of {cdn, size} objects.
export IN_FILE="domainlist.txt"
export OUT_FILE="cdn_list.txt"

echo -n '[' > "$OUT_FILE"
while IFS= read -r DOMAIN; do
  # Log progress to stderr so it doesn't end up in the JSON output.
  echo "$DOMAIN" >&2
  # Ask cdnfinderserver to analyze the page; keep only CDN name and size per resource.
  curl -s localhost:1337 -X POST -d "{\"url\": \"$DOMAIN\"}" \
    -H "Content-Type: application/json" \
    | jq -cje '.everything[] | {cdn: .cdn, size: .bytes}'
done < "$IN_FILE" | sed 's/}{/},{/g' >> "$OUT_FILE"
echo -n ']' >> "$OUT_FILE"

This took longer than I expected: nearly 7 hours! To be fair, my connection at home is not that great at the moment. Apart from that, invoking headless Chromium for each site takes its toll, too. The resulting file can finally be used to generate some neat statistics.

Results

I’d like to be able to compare my results with those of others. Most market share reports use completely different metrics, like number of customers or revenue, which is nothing I can compare my data with. And those that compare the number of websites using CDNs differ widely in their findings. I will nevertheless compare individual values with those of Datanyze or, where appropriate, with some other anecdotal data sources.

… by count

A simple measurement is to count all requests to each CDN that was used. The percentage results are shown in Chart #1 (red upper bars). The green bars are for comparison with global data from Datanyze.

[Image: count.svg]

Chart #1: Percentage rate of all requests for each CDN (red). Global data from Datanyze is shown as a comparison (green).

As you can see, almost one third of all my requests either use no CDN or the CDN couldn’t be detected. This is consistent with the statement in True Sight’s README that 70% of all websites are using a CDN. My results don’t seem too far off from the truth. Great.

Google is by far the biggest CDN according to this measurement method. This is not at all surprising, since Google hosts a lot of different services like Google Fonts, Google Hosted Libraries or Google Ads. One could argue about whether Google Ads should be counted here. It’s a very specific niche, and ads aren’t really necessary from the user’s perspective (website owners might see this differently). To be honest, I would have preferred to distinguish between Google Ads, their own hosted content and their cloud platform GCP. Using some filter lists from adblockers could at least sort out the ad service, but I was too lazy to implement this. Maybe next time…

Cloudflare makes it to 2nd place, in my case and maybe even globally. Again, nothing surprising here. Cloudflare is indeed huge. According to W3Techs, Cloudflare is used by around 12% of the top 10 million websites. Datanyze even claims that more than 35% of the top 10 million websites are using Cloudflare, and Wappalyzer actually says 70% of all their tested websites are using Cloudflare! Unbelievable.

Amazon, the 3rd most used hoster for the websites I visit, actually consists of multiple different services, just like Google’s results. It includes Amazon CloudFront, Amazon’s dedicated CDN network, and resources like S3 object storage.

An outlier in my browsing history, in comparison with the global average, is (probably) Github. But I’m not sure how I should count this. Github is not a traditional CDN provider. I’m not even sure that Github itself is an independent hosting provider. You can use raw.githubusercontent.com to serve static content, but it uses a Content-Type of text/plain and is therefore not suitable for a lot of cases. Maybe it’s just because I browse a lot of github.io sites… Anyway, 1.24% seems a lot!

At first I thought the stats for Microsoft Azure were wrong. 0.31% doesn’t look like much, especially considering that Microsoft Azure is one of the largest cloud providers in the world. But after comparing it with the global average, it seems that this is not too far off from reality. Maybe detecting Azure is especially hard, or it’s actually really small. I’m personally not using any Microsoft tech; maybe it has something to do with that. Although I can find typical Microsoft domains like microsoft.com or office.com in my list.

Fastly is a surprise for me. I had never heard of them before. They seem to be quite big after all. A lot of reputable sites like airbnb.com, bit.ly, dropbox.com, imgur.com and github.com are using Fastly in one way or another.

… by traffic

Another interesting measurement method is sorting the results by traffic. Each request has a certain size; the summed-up values are shown in Chart #2.

[Image: size.svg]

Chart #2: Cumulative request size in MiB to each CDN provider (in red). Values from Datanyze (green) are not directly comparable, just purely informative as an analogy to Chart #1.

With this metric, about half of my traffic either comes from a non-CDN source or the CDN couldn’t be detected. I thought it would be more. But if you think about what kind of traffic was measured here, it makes sense. The scraper script just opens the main page and loads all external resources, like a normal browser would. That means that especially large files are not included in the statistics, because large files wouldn’t be loaded by just opening a main page. Besides, I’m not entirely sure a traditional CDN would be the best way to serve large files (unless you include something like Amazon S3, which I honestly do here).

The distribution shifts slightly with this metric, but the providers remain largely the same. Google drops from first to third place. I could imagine that ads, fonts, JavaScript files, etc. generate a lot of requests but not that much volume overall. Images and videos are way larger and are often delivered by CDNs.

Conclusion

It was a fun little experiment. Granted, I didn’t discover any groundbreaking insights. But I learned some interesting facts about CDNs, hosting companies and the cloud business in general.

It’s a shame to see that the modern web is so highly centralized and so dependent on only a few companies. I understand why so many companies and organizations are using these services. It is convenient. It is simpler than managing your own server fleet around the globe. I just wish it weren’t so.

CDNs can be dangerous. They are a very lucrative target for attacks. Even without attacks, you hand over control of your data to a third party. You never know exactly what a third-party provider does with your data; they have their own interests, and you can never be sure how they interact with (foreign) government departments.

In some cases the user also gets a false sense of security. End users think that their connection is secure because their browser shows a green lock symbol (an HTTPS connection, i.e. HTTP with TLS encryption). And in principle they are right. But what they don’t know is that the connection is only secured between their browser and the CDN. The TLS encryption is terminated at the CDN and, in the worst case, the traffic between the CDN and the actual backend server is plain, unencrypted HTTP.
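You can at least see where your TLS connection ends by looking at the certificate it presents. A small sketch (example.org is just a placeholder); with a CDN in front, the certificate your client receives is the one served by the CDN edge, and the issuer line sometimes hints at the provider:

# Print subject and issuer of the certificate presented by whoever terminates TLS.
echo | openssl s_client -connect example.org:443 -servername example.org 2>/dev/null \
  | openssl x509 -noout -subject -issuer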

Unfortunately, I don’t know a good alternative. Self-hosting is probably only an option for smaller websites or purely local businesses.

FAQ

Are these results correct?

Absolutely not! The CDN/hosting service detection algorithm is just a simple and incomplete heuristic. A lot of providers were not even considered, some were only partially recognized, and some were classified completely wrong.

How many domains did you check?

Around 6000 domains with more than 40,000 different resources in total.

How long did it take to gather the results?

After acquiring the list of domains, it took almost 7 hours to pull down each resource of the ~6000 entries and run the detection algorithm on each of them.

Can I have a list of your tested domains to reproduce your results?

Nope. You can’t. I’m sorry.

A list of domains I visited, regardless of how often and for what reasons, is highly private. I recommend that you also be careful with your own data and not give it away carelessly.


  1. Even though I am absolutely against the kind of (external) data collection that a lot of big companies do, I collect quite a lot of data about myself, for myself.

  2. select date(min(visit_date)/1000000, 'unixepoch') from moz_historyvisits; 

  3. Matic, Srdjan, Gareth Tyson, and Gianluca Stringhini. “PYTHIA: a Framework for the Automated Analysis of Web Hosting Environments.” WWW ‘19 (2019).



