Using the Maxmind GeoIP API with R

IP-based Geolocation


There are many useful applications of IP-based geolocation, such as automatically selecting a user’s language or time zone, or displaying users’ locations on a collaborative map.

Maxmind provides free and commercial data to locate a user on the Earth’s surface:

  • GeoLite Country: determines the user’s country from the IP address.
  • GeoLite City: locates the user at city level, including latitude and longitude.

Applications can access these datasets by using the free Maxmind GeoIP API. The API is available for many languages, including PHP, Perl, Python, Ruby, Java, JavaScript, C#, VB.NET, and even Pascal.

 

R

But what about R? R offers comprehensive spatial data analysis and visualization functionality, can create publication-quality plots and reports, and can even power web-based applications for exploring data.

So to empower R wizards with geolocation abilities, we will show here how to access the Maxmind GeoIP API from R.

 

The Maxmind GeoIP API

To access the Maxmind data, we will use the Maxmind C API and write an R plugin to interface with it. The first step is to download and build the GeoIP C API. We will assume a local installation, which does not require administrator privileges.

Download the Maxmind C API here.

After downloading, create a folder, copy the package file there, then unpack and compile the library (your version of the library might be different):

tar xzf GeoIP-1.4.6.tar.gz
cd GeoIP-1.4.6/

./configure
make
make check

 

The R plugin

To use the Maxmind library, we need to include the relevant headers. Be sure to correct the path to the GeoIP library if necessary.

#include <R.h>        /* Rprintf */
#include <Rdefines.h> /* SEXP, NEW_NUMERIC, NUMERIC_POINTER */
#include "GeoIP-1.4.6/libGeoIP/GeoIP.h"
#include "GeoIP-1.4.6/libGeoIP/GeoIPCity.h"

Next, we include a short helper function from the Maxmind API that replaces NULL strings with “N/A”:

static const char * _mk_NA( const char * p ){
 return p ? p : "N/A";
}

That is already all we need. We can now focus on writing a function to process IP addresses that we can call from R. We use the .Call interface of R rather than the somewhat simpler .C interface. The .Call interface lets us work with internal R data structures directly, which eliminates the need to copy memory when calling the function. It also allows us to return R objects with the usual C return statement.

The internal R data structures are accessed through the SEXP type, which represents an S-expression (“symbolic expression”). Sidenote: although S-expressions are used here as internal R data structures and the name might sound related to S/S-PLUS, the term is independent of it and refers to the list-based recursive data structures best known from Lisp. As both code and data can be represented by S-expressions, they are useful for interfacing with R and can store not only data but also, for instance, R formulas.

The next thing to consider is that SEXPs are internal R structures and are therefore allocated, tracked, and garbage collected by R. To prevent garbage collection of objects that are still in use by the native code, they have to be placed on the R protection stack with the PROTECT macro. Before returning, every protected object has to be removed from the stack again with UNPROTECT to hand control back to R and keep the protection stack balanced.

Everything else is straightforward usage of the Maxmind C API; see their examples and adapt the function to your needs.

/* returns character vectors of geolocated countries based on
IP addresses in long integer format */

SEXP fetchMaxmindCountryName(SEXP ipVect) {

	unsigned int vectLen;
	vectLen = length(ipVect);

	SEXP countryStrs;
	PROTECT(countryStrs = allocVector(STRSXP, vectLen));

	/* Lookup Maxmind GeoIP  */
	const char *str = NULL;
	GeoIP *gi = NULL;

	/* change the path of the GeoIP.dat file, if necessary */
	gi = GeoIP_open("GeoIP.dat", 0);

	if (gi == NULL) {
		Rprintf("RMaxmind GeoIP Country - ERROR: "
			"Could not open GeoIP.dat!\n");
	} else {

		for (unsigned int i = 0; i < vectLen; i++) {
			str = GeoIP_country_name_by_ipnum(gi,
				(unsigned long)REAL(ipVect)[i]);

			SET_STRING_ELT(countryStrs, i,
					mkChar(_mk_NA(str)));
		}
	}

	GeoIP_delete(gi);

	UNPROTECT(1);

	return countryStrs;
}
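
A note on the cast in the lookup loop: R passes the IP numbers as doubles, which represent every integer up to 2^53 exactly, whereas R's own integer type is a signed 32-bit value and cannot hold addresses above 127.255.255.255. The following standalone sketch illustrates the conversion with a range check added. The function name ipnum_from_double is hypothetical and is not part of R or the Maxmind API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts an IP number stored in a double (the way
   R passes it) to a 32-bit unsigned integer. Returns 1 on success and 0
   if the value is NaN, negative, fractional, or larger than 2^32 - 1. */
static int ipnum_from_double(double x, uint32_t *out) {
    assert(out != NULL);
    if (!(x >= 0.0 && x <= 4294967295.0))   /* this test also rejects NaN */
        return 0;
    if ((double)(uint32_t)x != x)           /* fractional part present */
        return 0;
    *out = (uint32_t)x;
    return 1;
}
```

A check of this kind could be applied to each element before it is handed to the lookup function, so that malformed input yields a missing value instead of an arbitrary lookup.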

The GeoLite City database provides much richer data, including timezone and position coordinates useful for mapping applications. The Maxmind API can return all these data at once in a GeoIPRecord structure. An R list is then used to hold vectors containing that information:

/* returns a list of vectors with geolocation information (name
of country, country code, region, city, timezone, latitude,
and longitude) based on IP addresses in long integer format. */

SEXP fetchMaxmindGeoDataName(SEXP ipVect) {

	const char *time_zone = NULL;
	unsigned int vectLen;
	vectLen = length(ipVect);

	char *names[7] = {"country_name", "country_short",
				"region", "city", "timezone",
				"latitude", "longitude"};
	SEXP list, list_names;

	double *p_latitude, *p_longitude;

	SEXP latitudeVect;
	SEXP longitudeVect;
	SEXP countryStrs;
	SEXP countryCodes;
	SEXP regionStrs;
	SEXP cityStrs;
	SEXP timeZoneStrs;

	PROTECT(countryStrs = allocVector(STRSXP, vectLen));
	PROTECT(countryCodes = allocVector(STRSXP, vectLen));
	PROTECT(regionStrs = allocVector(STRSXP, vectLen));
	PROTECT(cityStrs = allocVector(STRSXP, vectLen));
	PROTECT(timeZoneStrs = allocVector(STRSXP, vectLen));
	PROTECT(latitudeVect = NEW_NUMERIC(vectLen));
	PROTECT(longitudeVect = NEW_NUMERIC(vectLen));

	p_latitude = NUMERIC_POINTER(latitudeVect);
	p_longitude = NUMERIC_POINTER(longitudeVect);

	/* set list names */
	PROTECT(list_names = allocVector(STRSXP,7));
	for (unsigned int i = 0; i < 7; i++)
		SET_STRING_ELT(list_names, i, mkChar(names[i]));

	/* create list*/
	PROTECT(list = allocVector(VECSXP, 7));

	/* Lookup Maxmind GeoIP */
	GeoIPRecord *gir = NULL;
	GeoIP *gi = NULL;

	gi = GeoIP_open("GeoIPCity.dat", 0);

	if (gi == NULL) {
		Rprintf("RMaxmind GeoIP Data - ERROR: "
			"Could not open GeoIPCity.dat!\n");
	} else {

	for (unsigned int i = 0; i < vectLen; i++) {
		gir = GeoIP_record_by_ipnum(gi,
			(unsigned long)REAL(ipVect)[i]);

		if (gir != NULL) {
			time_zone
			= GeoIP_time_zone_by_country_and_region(
				gir->country_code, gir->region);

			SET_STRING_ELT(countryStrs, i,
			mkChar(_mk_NA(gir->country_name)));	

			SET_STRING_ELT(countryCodes, i,
			mkChar(_mk_NA(gir->country_code)));	

			SET_STRING_ELT(regionStrs, i,
			mkChar(_mk_NA(
			GeoIP_region_name_by_code(
			gir->country_code, gir->region))));	

			SET_STRING_ELT(cityStrs, i,
			mkChar(_mk_NA(gir->city)));	

			SET_STRING_ELT(timeZoneStrs, i,
			mkChar(_mk_NA(time_zone)));	

			p_latitude[i]  = gir->latitude;
			p_longitude[i] = gir->longitude;

			GeoIPRecord_delete(gir);
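		} else {
			SET_STRING_ELT(countryStrs, i,
				mkChar(_mk_NA(NULL)));
			SET_STRING_ELT(countryCodes, i,
				mkChar(_mk_NA(NULL)));
			SET_STRING_ELT(regionStrs, i,
				mkChar(_mk_NA(NULL)));
			SET_STRING_ELT(cityStrs, i,
				mkChar(_mk_NA(NULL)));
			SET_STRING_ELT(timeZoneStrs, i,
				mkChar(_mk_NA(NULL)));

			/* no record found: mark the coordinates as
			   missing instead of leaving them uninitialized */
			p_latitude[i]  = NA_REAL;
			p_longitude[i] = NA_REAL;
		}
	}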

		/* attach vectors to list */
		SET_VECTOR_ELT(list, 0, countryStrs);
		SET_VECTOR_ELT(list, 1, countryCodes);
		SET_VECTOR_ELT(list, 2, regionStrs);
		SET_VECTOR_ELT(list, 3, cityStrs);
		SET_VECTOR_ELT(list, 4, timeZoneStrs);
		SET_VECTOR_ELT(list, 5, latitudeVect);
		SET_VECTOR_ELT(list, 6, longitudeVect);

		/* attach names to list */
		setAttrib(list, R_NamesSymbol, list_names);

	}

	GeoIP_delete(gi);

	UNPROTECT(9);

	return list;
}

 

Using the R plugin

You can download the code here.

Compile the C code, specifying the path to the compiled Maxmind library:

R CMD SHLIB RMaxmind.c -lGeoIP -LGeoIP-1.4.6/libGeoIP/.libs/

We can now run R and load the plugin using the dyn.load command:

dyn.load("RMaxmind.so")

As one usually obtains IP addresses in the form of strings, we also need a function to convert them into the long integer format, which simply places the four 8-bit parts of an IP address at the appropriate positions in a 32-bit integer:

ipAddressToIpnum <- function(x) {
	s <- data.frame(
		do.call(rbind,
			lapply(strsplit(x, ".", fixed=TRUE),
			"as.numeric")))
	s <- transform(s,
		ipnum = 16777216*X1 + 65536*X2 + 256*X3 + X4)

	return(s$ipnum)
}
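
The same conversion can also be done on the C side. The following is a minimal sketch, assuming the input is a plain dotted-quad IPv4 string; the function name ip_to_ipnum is hypothetical and is not part of the Maxmind API. As a worked example, 24.24.24.24 becomes 24*16777216 + 24*65536 + 24*256 + 24 = 404232216.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: parses "a.b.c.d" into a 32-bit IP number.
   Returns 1 on success, 0 if the string is not a valid dotted quad. */
static int ip_to_ipnum(const char *ip, uint32_t *out) {
    unsigned int a, b, c, d;
    char extra;

    assert(ip != NULL && out != NULL);

    /* the trailing %c only matches if there is garbage after the
       fourth octet, in which case sscanf reports 5 conversions */
    if (sscanf(ip, "%3u.%3u.%3u.%3u%c", &a, &b, &c, &d, &extra) != 4)
        return 0;
    if (a > 255 || b > 255 || c > 255 || d > 255)
        return 0;

    /* place each 8-bit part at its position in the 32-bit value */
    *out = ((uint32_t)a << 24) | ((uint32_t)b << 16)
         | ((uint32_t)c << 8)  |  (uint32_t)d;
    return 1;
}
```

Shifting by 24, 16, and 8 bits is equivalent to the multiplications by 16777216, 65536, and 256 in the R function above.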

Following the example from the Maxmind API, we can now query the databases:

ipAddresses <- c("24.24.24.24",
		 "80.24.24.24")	

geoCountryNames <- .Call("fetchMaxmindCountryName",
			ipAddressToIpnum(ipAddresses))
geoCityData <- .Call("fetchMaxmindGeoDataName",
			ipAddressToIpnum(ipAddresses))	

## print results
cbind(ipAddresses, geoCountryNames)

geoCityData$ipAddress <- ipAddresses
do.call(cbind, geoCityData)

It is now also possible to compute the great-circle distance on the Earth’s surface between the geolocations using the haversine formula. The haversine distance in kilometers is computed as follows:

toRad <- function(x) return(x * pi /180)

haversine <- function(lat1, long1, lat2, long2) {

	dLat  <- toRad(lat2-lat1)
	dLong <- toRad(long2-long1)

	## operators end each line so that R continues
	## the expression on the next one
	a <- sin(dLat/2)*sin(dLat/2) +
		cos(toRad(lat1))*cos(toRad(lat2)) *
		sin(dLong/2)*sin(dLong/2)
	c <- 2 * atan2(sqrt(a), sqrt(1-a))

	d <- 6371 * c  # km

	return(d)
}

Using the haversine function, we can now easily determine that the two IP addresses are located approximately 6023 km apart:

> haversine(geoCityData$latitude[1], geoCityData$longitude[1],
	    geoCityData$latitude[2], geoCityData$longitude[2])
[1] 6023.053

Another use case is to plot the geolocations of multiple IP addresses on a map. As an example, we plot the obtained locations on a Google Map. Let’s start by generating a URL for the Google Maps API:

googleMapsUrl <- function(geoCityDataset) {

	## center map view (operators end each line so that
	## R continues the expression on the next one)
	centerLatitude <- min(geoCityDataset$latitude) +
		(max(geoCityDataset$latitude) -
		min(geoCityDataset$latitude)) / 2

	centerLongitude <- min(geoCityDataset$longitude) +
		(max(geoCityDataset$longitude) -
		min(geoCityDataset$longitude)) / 2

	urlHead <- paste("http://maps.google.com/",
			"maps/api/staticmap?center=",
			centerLatitude, ",",
			centerLongitude,
			"&zoom=4&size=512x512&maptype=roadmap&",
			sep="")

	## add markers for each position
	geoPositions <-
		lapply(data.frame(rbind(geoCityDataset$latitude,
			geoCityDataset$longitude,
			geoCityDataset$city)),
		function (x)
			paste("markers=",
				## random color
				sprintf("color:0x%06X",
				round(runif(1)*((2^24)-1))),

				## use first letter
				## of city name as label
				"|label:", substr(x[3], 1, 1),

				## geolocation
				"|", x[1], ",", x[2], sep="")
		)

	geoPositions <- paste(geoPositions, collapse="&")

	return(paste(urlHead,
		geoPositions,
		"&sensor=false",
		sep=""))
}
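
The map center used in googleMapsUrl is simply the midpoint between the smallest and the largest coordinate. The sketch below shows that calculation in C for a single coordinate vector. The function name range_midpoint is hypothetical, and skipping NaN entries is an added assumption that the R function does not make; it is meant for addresses whose location could not be resolved.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: midpoint between the smallest and largest value
   in v, ignoring NaN entries. Returns NaN if no usable value exists. */
static double range_midpoint(const double *v, size_t n) {
    double lo = NAN, hi = NAN;

    assert(v != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (isnan(v[i]))
            continue;                   /* skip missing coordinates */
        if (isnan(lo) || v[i] < lo) lo = v[i];
        if (isnan(hi) || v[i] > hi) hi = v[i];
    }

    return lo + (hi - lo) / 2.0;        /* min + (max - min) / 2 */
}
```

For the latitudes 48.1, 52.5, and 41.9, the smallest value is 41.9 and the largest is 52.5, so the center latitude is 41.9 + 10.6 / 2 = 47.2.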

Now we can easily generate maps based on geolocated IP addresses:

## more IP addresses
ipAddressVector <- c("192.106.51.100", "147.251.48.1",
			"134.102.101.18", "193.75.148.28",
			"194.244.83.2", "151.28.39.114",
			"151.38.70.94", "193.56.4.124",
			"195.142.146.198", "139.20.112.104",
			"139.20.112.3", "145.236.125.211",
			"149.225.169.61")

## query Maxmind API
geoData <- .Call("fetchMaxmindGeoDataName",
			ipAddressToIpnum(ipAddressVector))	

## generate URL
googleMapsUrl(geoData)

 

Maps can of course also be created natively in R. Let’s see all the Maxmind test addresses on the world map:

library(maps)
library(ggplot2)

# read the IP address list included with
# the Maxmind API and query geolocations
ipAddressTable <- read.table("GeoIP-1.4.6/test/country_test.txt")
geoData <- .Call("fetchMaxmindGeoDataName",
			ipAddressToIpnum(
			as.character(ipAddressTable$V1)))

# extract polygon from map data
world.df <- map_data("world")

# create a plot of the world
worldmap <- ggplot(world.df, aes(long, lat)) +
		geom_polygon(aes(group = group),
		data = world.df,
		colour = "grey", fill = NA)

# add geolocations (the points get their own data frame,
# as they do not have the same length as world.df)
worldmap <- worldmap +
		geom_point(aes(
			x=longitude,
			y=latitude,
			colour=country_name),
			data = as.data.frame(geoData),
			size=3, alpha=0.5)

# draw the map
print(worldmap)

You can see the full size map here.

 
