20 Aug '10 Social Archipelago
I've been analysing Foursquare data for Paris, London and New York. In case you don't know, Foursquare is a location-based social network which users interact with on their 3G mobile phones.
Visit the project page for a number of visualisations and remarks on the data. What follows here is a discussion of the data and analytical techniques that form the basis of the content on the project page, so have a look at that page before reading on.
Data
I've used Foursquare data relating social venues to checkins (activity) at those venues. The data has been collected by a systematic crawl of the Foursquare Search API, which returns up to 50 nearby venues when supplied with a geolocation. The radius of this search is not explicitly documented by Foursquare. For each city, I constructed a lattice of search locations 2km apart and performed a search on each point of the grid. 2km was chosen as it produces some overlap in results, implying good coverage of the intervening space between search locations. This resulted in 200-400 searches per city, the exact number depending on the area covered for each city.
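As a rough illustration of the lattice described above, here is a minimal sketch that generates search locations about 2km apart over a bounding box. The function name, the bounding box and the spherical-Earth approximation are all assumptions made for the example, not the code used for the crawl.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; a spherical approximation

def search_grid(south, west, north, east, spacing_m=2000.0):
    """Return (lat, lon) points on a lattice roughly spacing_m apart."""
    # One step north is the same number of degrees at any latitude.
    dlat = math.degrees(spacing_m / EARTH_RADIUS_M)
    # One step east covers more degrees as latitude grows; use the mid-latitude.
    mid_lat = math.radians((south + north) / 2.0)
    dlon = math.degrees(spacing_m / (EARTH_RADIUS_M * math.cos(mid_lat)))
    rows = int(math.floor((north - south) / dlat)) + 1
    cols = int(math.floor((east - west) / dlon)) + 1
    return [(south + r * dlat, west + c * dlon)
            for r in range(rows) for c in range(cols)]

# Hypothetical bounding box loosely around central Paris.
points = search_grid(48.80, 2.25, 48.91, 2.42)
```

Each point would then be passed to the venue search, once without a keyword and once per keyword.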
The search API additionally takes keyword searches, and further passes of the grid were carried out using a number of keywords (bar, club, restaurant, cafe, museum, hall, food). The resulting venues list was then de-duplicated. As such, the data does not represent a comprehensive data dump, but sufficient venue data has been collected (in excess of 6,000 venues for each city) to assume a representative sample of Foursquare data for each city.
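The de-duplication step can be sketched as follows, assuming each venue record carries a unique identifier. The "id" field name and the record layout are assumptions for the example.

```python
def merge_passes(passes):
    """Merge venue lists from several search passes, one record per venue id."""
    venues = {}
    for results in passes:
        for venue in results:
            # setdefault keeps the first record seen for each id.
            venues.setdefault(venue["id"], venue)
    return list(venues.values())

# Two hypothetical passes whose results overlap on venue "b".
plain_pass = [{"id": "a", "name": "Bar A"}, {"id": "b", "name": "Cafe B"}]
keyword_pass = [{"id": "b", "name": "Cafe B"}, {"id": "c", "name": "Club C"}]
merged = merge_passes([plain_pass, keyword_pass])
```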
It should be noted that Foursquare produces data skewed towards the network's demographic: the portion of the population that owns a 3G mobile phone and engages in online social networking (typically under 35).
Venues are classified into parks, arts, shops, food and nightlife according to Foursquare's own classification scheme.
Analytics
The claim that Paris has a more contiguously walkable structure is based on a density-based clustering of the venue data, using the DBSCAN algorithm. With a threshold distance of 400m (chosen as a comfortable walking distance) and a minimum cluster size of 3 venues, Paris breaks down into far fewer, larger clusters than the other two cities (PAR,NYC,LDN = 254,394,439 clusters), generating roughly a quarter of the noise or less (PAR,NYC,LDN = 401,1795,1599). Noise in this case represents isolated venues that cannot be assigned to a cluster.
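For readers unfamiliar with DBSCAN, here is a minimal pure-Python sketch of the algorithm with the parameters above. It assumes venues are given as (x, y) pairs in a projection measured in metres, and it treats the minimum cluster size as DBSCAN's minimum-points parameter (counting the venue itself); both are assumptions, and this is not the implementation used for the figures quoted.

```python
import math

NOISE = -1

def dbscan(points, eps=400.0, min_pts=3):
    """Label each (x, y) point with a cluster index, or NOISE."""
    def neighbours(i):
        xi, yi = points[i]
        # Includes i itself, so min_pts counts the point being tested.
        return [j for j, (xj, yj) in enumerate(points)
                if math.hypot(xi - xj, yi - yj) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # core point: keep growing the cluster
    return labels

# Two tight groups of three venues and one isolated venue.
venues = [(0, 0), (100, 0), (200, 0), (5000, 5000),
          (10000, 0), (10100, 0), (10200, 100)]
labels = dbscan(venues)
```

The neighbour search here is quadratic in the number of venues, which is tolerable for a few thousand points; a spatial index would be the usual choice for anything larger.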
The claim that activity is less spatially dispersed in Paris is based on dispersion calculated for the 100 highest activity walkable cells using a weighted standard distance measure, where venue popularity (total number of checkins) is used as a weighting factor and Euclidean distance is used as the distance measure. This gives us a measure of standard deviation in space of all the points taken into consideration, measured in metres (PAR,NYC,LDN = 4965,6473,8657).
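A weighted standard distance is usually computed as the square root of the weighted mean squared distance from the weighted mean centre. The sketch below follows that common definition, assuming points are (x, y, checkins) triples in metric coordinates; the exact variant used for the figures above may differ.

```python
import math

def weighted_standard_distance(venues):
    """venues: list of (x, y, weight); returns dispersion in coordinate units."""
    total = sum(w for _, _, w in venues)
    # Weighted mean centre.
    cx = sum(x * w for x, _, w in venues) / total
    cy = sum(y * w for _, y, w in venues) / total
    # Weighted mean of squared Euclidean distances to that centre.
    spread = sum(w * ((x - cx) ** 2 + (y - cy) ** 2) for x, y, w in venues) / total
    return math.sqrt(spread)

# Two equally popular venues 2km apart: the centre is midway, dispersion 1000m.
even = weighted_standard_distance([(0, 0, 1), (2000, 0, 1)])
# A popular venue pulls the centre towards itself.
skewed = weighted_standard_distance([(0, 0, 3), (4000, 0, 1)])
```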
Walkable links per venue are calculated by constructing a network representation in which venues are nodes and edges are produced when any two nodes are within a walkable distance of 400m. This creates undirected graphs with K edges (PAR,NYC,LDN = 55478,39277,44977). Edges per venue gives a rudimentary expression of global connectivity (PAR,NYC,LDN = 9.64,5.64,6.30). The degree distributions of these networks and further network characteristics are outside the scope of this article, but can follow from these representations.
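The network construction can be sketched in a few lines, again assuming (x, y) venue positions in metres. This brute-force version is for illustration only and is not the igraph-based code used for the analysis.

```python
import math
from itertools import combinations

def walkable_edges(points, max_dist=400.0):
    """Return undirected edges (i, j) between points at most max_dist apart."""
    return [(i, j)
            for (i, p), (j, q) in combinations(enumerate(points), 2)
            if math.hypot(p[0] - q[0], p[1] - q[1]) <= max_dist]

# Three venues in a chain 300m apart, plus one far away.
venues = [(0, 0), (300, 0), (600, 0), (5000, 0)]
edges = walkable_edges(venues)
edges_per_venue = len(edges) / len(venues)
```

Since every edge touches two venues, the mean degree of such a graph is twice the edges-per-venue figure.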
Power law remarks relate to a regression analysis of venue popularity rank distributions for each city. Zipf's Law is only a fit for part of the distribution.
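One common way to run such a regression is an ordinary least-squares fit of log(checkins) against log(rank); a slope near -1 is what Zipf's Law predicts. The sketch below is an illustrative Python version under that assumption, not the R analysis itself.

```python
import math

def zipf_fit(checkins):
    """Fit log(count) = intercept + slope * log(rank); return (slope, intercept)."""
    # Rank venues by popularity; zero counts have no logarithm and are dropped.
    counts = sorted((c for c in checkins if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic, perfectly Zipfian popularity: count = 1000 / rank.
slope, intercept = zipf_fit([1000.0 / r for r in range(1, 51)])
```

Fitting only a range of ranks (for example the head of the distribution) is a simple way to see where the power law stops holding.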
Tools
Processing and Flash were used for visualisation, Proj4 and Proj4js for coordinate conversion, Tom Taylor's Boundaries for neighbourhood names, the iGraph Ruby extension for network representations, R for statistical analysis, the Google Maps Geocoding API for geocoding, and lastly the Foursquare API for all the venue and checkin data. Everything else was done by hijacking a language designed for Hypertext Preprocessing.
The data used to create the images is made available in full under the Foursquare API terms. The data consists of CSV files with aggregate checkin figures and geolocation data for each venue crawled. These files are broken down by city. Additionally, CSV files are included with venue locations converted to the so-called WebMercator/GoogleMercator (EPSG:900913) projection, to facilitate visualisation using metric coordinates.
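For reference, the spherical Mercator conversion behind EPSG:900913 can be written directly. This is a minimal sketch of the standard formula, not the Proj4 call used to produce the files; the function name is made up for the example.

```python
import math

SPHERE_RADIUS_M = 6378137.0  # sphere radius used by the Web Mercator projection

def to_web_mercator(lat, lon):
    """Convert WGS84 degrees to spherical Mercator (x, y) in projected metres."""
    x = SPHERE_RADIUS_M * math.radians(lon)
    y = SPHERE_RADIUS_M * math.log(math.tan(math.pi / 4.0 + math.radians(lat) / 2.0))
    return x, y

# Approximate coordinates for central Paris.
x, y = to_web_mercator(48.8566, 2.3522)
```

Projected units in this system match ground metres only at the equator; at latitude φ distances are stretched by a factor of about 1/cos φ, which is worth bearing in mind when comparing cities at different latitudes.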
This data represents a snapshot collected in mid-July 2010. In all there are over 800,000 checkins at over 20,000 venues. Checkins are expressed in summarised form for each venue as this is what is available via the public API. No raw checkins appear in the data.
Foursquare.zip (856 kB)