Geocoding IP addresses in Q
Most communication between users and websites requires IP addresses. However, when profiling web traffic, it is more useful to know where users are physically located. In this context, geocoding is the process of translating an IP address to a physical location. In this post I describe how to geocode IPs in Q.
To demonstrate geocoding, we’ll use the IP addresses of the universities listed on this website. We use universities because each has a known location, which lets us check whether the geocoded country matches it. The first task is to convert URLs to IP addresses using DNS. The 25 IP addresses are listed below. If you are analyzing traffic from a website then you’ll already have a list of IP addresses (rather than URLs).
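If you do need to start from URLs, the DNS step can be done outside Q. The following is a minimal sketch in Python rather than the method used for this post: the function names `hostname_from_url` and `resolve_ipv4` are made up for illustration, and `www.example.edu` is a placeholder hostname, not one of the universities.

```python
import socket
from urllib.parse import urlparse


def hostname_from_url(url):
    """Return the host part of a URL; bare hostnames are accepted too."""
    # urlparse only recognises the host when a scheme or "//" prefix is present.
    parsed = urlparse(url if "://" in url else "//" + url)
    return parsed.hostname


def resolve_ipv4(urls, resolver=socket.gethostbyname):
    """Map each URL to an IPv4 address string, or None if the lookup fails.

    `resolver` defaults to the system DNS lookup; it is a parameter only so
    that the function can be tried out without network access.
    """
    results = {}
    for url in urls:
        host = hostname_from_url(url)
        try:
            results[url] = resolver(host) if host else None
        except OSError:  # socket.gaierror is a subclass of OSError
            results[url] = None
    return results
```

Calling `resolve_ipv4(["https://www.example.edu"])` returns a dictionary keyed by URL. Note that `socket.gethostbyname` returns only one IPv4 address per host, so sites served from several addresses are reduced to a single entry.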
Note that we are using IPv4 addresses but this works with IPv6 addresses as well.
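If your list mixes the two formats, or may contain malformed entries, it can help to check it before importing it into Q. This is a small sketch using Python's standard `ipaddress` module; the helper name `classify_ip` is hypothetical and the check is independent of anything Q does internally.

```python
import ipaddress


def classify_ip(text):
    """Return 'IPv4', 'IPv6', or None when text is not a valid IP address."""
    try:
        address = ipaddress.ip_address(text.strip())
    except ValueError:
        return None
    return "IPv4" if address.version == 4 else "IPv6"


# Addresses from the ranges reserved for documentation, used as placeholders.
for candidate in ["192.0.2.1", "2001:db8::1", "not an ip"]:
    print(candidate, classify_ip(candidate))
```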
Data input and output
To perform geocoding in Q, select the variable containing the IP addresses from the data tree (in the bottom-left of the screen) then navigate to Automate > Browse Online Library > Data > Geocode IPs or use the search box.
A new categorical variable containing the countries deduced from the IP addresses is added to the data set. Below is a table of counts of this new variable.
Imprecision of geocoding
Below I plot the countries, with correct geocodings in blue and wrong ones in red. The two wrong results are King Saud University in Saudi Arabia, which was geocoded to the Netherlands, and Utrecht University in the Netherlands, which was geocoded to Belgium.
When we say that some results are wrong, there are at least two possible explanations.
- The web servers for King Saud University genuinely are in the Netherlands. If so, the geocoding is correct but our assumption that universities host their websites in their home country is wrong.
- Geocoding is an imprecise science. It works by looking up an IP address in a database. Databases use a variety of information sources to link IPs and locations, such as tracing web traffic and ownership of IPs. However, there is no permanent mapping from an IP address to a location, so this is a “best effort” service. Across many IPs, it’s likely that several locations are inaccurate, so conclusions should be drawn from the overall distribution rather than from specific IPs.
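To make the database lookup described above concrete, here is a toy sketch in Python. It is an illustration of the general idea only, not the database or algorithm that Q uses: the table `RANGES` is hypothetical, it maps address ranges reserved for documentation to placeholder labels, and it handles IPv4 only.

```python
import bisect
import ipaddress
from collections import Counter

# Hypothetical lookup table (IPv4 only). Real databases hold far more rows
# and are updated regularly because assignments change over time.
RANGES = [
    ("192.0.2.0/24", "Country A"),
    ("198.51.100.0/24", "Country B"),
    ("203.0.113.0/24", "Country A"),
]

# Store each range as (first address, last address, label) using integers,
# sorted by first address so that a binary search can find the candidate row.
TABLE = []
for cidr, label in RANGES:
    network = ipaddress.ip_network(cidr)
    TABLE.append((int(network[0]), int(network[-1]), label))
TABLE.sort()
STARTS = [row[0] for row in TABLE]


def lookup(ip_text):
    """Return the label of the range containing ip_text, or None if unmapped."""
    value = int(ipaddress.ip_address(ip_text))
    index = bisect.bisect_right(STARTS, value) - 1
    if index >= 0 and TABLE[index][0] <= value <= TABLE[index][1]:
        return TABLE[index][2]
    return None


def country_counts(ips):
    """Tally labels across many IPs, which is the level at which to draw conclusions."""
    return Counter(lookup(ip) for ip in ips)
```

Because the table is just a snapshot, a stale or coarse row gives a confident but wrong answer for an individual address, which is why the tally from `country_counts` is more trustworthy than any single `lookup` result.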
You can follow my steps, or play around with your data in Q by downloading this QPack.