Social Data Analysis Final :: National Parks

As a final for Social Data Analysis at NYU ITP, I wanted to examine some of our National Parks. I have always had a great appreciation for the outdoors, and for the effort made by the US to establish and preserve these areas for generations to visit and experience. Considering these are typically very scenic areas, Instagram was a great source of data. People take a lot of pictures in these parks, and I wanted to see where in the parks people were actually taking photos. I also wanted to try to get a glimpse of who is coming to the parks. People from all social classes within the US visit the parks, but I wanted to see which international groups visit parks, and if there were differences in demographics between the parks.

I chose Yellowstone, the Grand Canyon, and the Great Smoky Mountains parks. Yellowstone was the first National Park established in the US, and the Grand Canyon is one of the most famous locations in the world (though not often thought of in terms of its park boundaries). I chose the Smoky Mountains after finding out they were the most-visited park in 2013, a fact that was pretty surprising to me.

After investigating the hashtags most used in each park, I ran a hashtag search through the Instagram API and pulled about 40,000 media items per park. That code is here. After perusing the data, I wanted to start with some timeseries analysis.


This is a graph of the frequency of posts per day using the hashtag #smokymountains. The frequency hovered around 30–40 posts per day, with an expected spike on the 4th of July (90 pics) and dips in the winter, even in the milder climate of Tennessee. The long tail represents some sort of data leak from Instagram, where several photos appeared intermittently, dated up to two years back. That caused some confusion in further timeseries analysis, which in hindsight could be remedied by collecting data further back in time. For comparison, the same graph from Yellowstone produces this:


Of particular note is the frequency: the 40,000 photos from the Smoky Mountains stretched over one full year, whereas the same number from Yellowstone covered only one month (and note the drop in frequency as the year gets later and temperatures drop). The Smokies may have been the most-visited park, but the rate at which photos are posted is drastically different.
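
The per-day counts behind graphs like these can be sketched roughly as follows. This is a minimal illustration, not the code linked above: it assumes each media item carries a Unix timestamp in seconds (as a string or number), and the sample values are made up.

```python
from collections import Counter
from datetime import datetime, timezone


def posts_per_day(timestamps):
    """Count posts per calendar day (UTC) from Unix timestamps in seconds."""
    days = (
        datetime.fromtimestamp(int(ts), tz=timezone.utc).date()
        for ts in timestamps
    )
    return Counter(days)


# Hypothetical sample: two posts on 2014-07-04 and one on 2014-07-05 (UTC).
sample = ["1404432000", "1404435600", "1404518400"]
counts = posts_per_day(sample)

for day, n in sorted(counts.items()):
    print(day.isoformat(), n)
```

Counting by UTC day is a simplification; bucketing by the park's local time zone would shift posts made near midnight.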

Location Analysis ::

Clearly people are taking a lot of photos related to the parks, but where in the parks are photos being taken? Using Python and the Basemap library, I plotted photos within geographic bounding boxes around the parks on top of historic maps.
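
The bounding-box step can be sketched like this. It is only an illustration under stated assumptions: the box coordinates are rough values chosen for the example, not the ones actually used, and the `lat`/`lon` field names and sample photos are hypothetical. The points that pass the filter are what would then be handed to the plotting library.

```python
def in_bounding_box(lat, lon, box):
    """Return True if (lat, lon) falls inside box = (south, west, north, east)."""
    south, west, north, east = box
    return south <= lat <= north and west <= lon <= east


# Rough, illustrative bounds around Yellowstone (an assumption for this sketch).
YELLOWSTONE_BOX = (44.1, -111.2, 45.1, -109.8)

# Hypothetical media items; untagged photos have no coordinates.
photos = [
    {"id": "a", "lat": 44.4605, "lon": -110.8281},  # near Old Faithful
    {"id": "b", "lat": 36.1699, "lon": -115.1398},  # Las Vegas
    {"id": "c", "lat": None, "lon": None},          # no location attached
]

inside = [
    p for p in photos
    if p["lat"] is not None
    and p["lon"] is not None
    and in_bounding_box(p["lat"], p["lon"], YELLOWSTONE_BOX)
]

print([p["id"] for p in inside])
```

A simple rectangle like this ignores the actual park boundary, so it will also keep photos taken just outside the park.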


This map shows activity around Grand Canyon National Park. The photos begin as Route 180 comes north into the park, clustering at the visitor center. The other large cluster is around the Lake Mead National Recreation Area, offset by various remote photos. Out of view, photos continue to trace along the highways, with a huge cluster in Las Vegas, directly left of the map.


The Smoky Mountains were somewhat of an anomaly. The photos concentrate in the center of the park, around the main visiting and camping areas, but the Appalachian Trail seems strangely underrepresented. It runs diagonally through the middle of the park, and I had expected to see photos all along the trail. It’s possible that this is not the most picturesque section of the trail, but I doubt that. It’s something I can’t currently explain from this map.


Yellowstone doesn’t offer a ton of surprises. It’s treated as a park many people drive through in a day, as there is a cluster of National Parks in the area. You can see photos come in straight up the road from Grand Teton National Park to the south. The photos continue to trace directly along the roads, especially the main loop in the middle of the park. Of note is Old Faithful, the dense cluster at the bottom-left corner of the loop.

Going Forward ::

One of the biggest things I had hoped to discover was the nationalities of park visitors. Through some manual examination of the data, I found groups visiting from places like Norway, Denmark, Spain, and Tibet, but I would like to take a more programmatic approach to uncovering that information. Collecting more data to perform comprehensive timeseries analysis is another goal, as is developing a more sophisticated method of data collection. Focusing on the geography from the beginning, instead of the hashtag, would produce more comprehensive results. Still, it is an exciting project, one which is near and dear to me, and I will continue to refine my techniques and move this project forward.

Social Bieber Analysis

An exploration into the world of Justin Bieber, on Instagram.  Posted media that included the tag #belieber was pulled, parsed and analyzed.  Our attempt at leapfrogging each other’s data requests was unsuccessful, and the result is shown below.

After that, an OpenOrd hashtag co-occurrence graph displays the many communities using the #belieber tag, grouped by the other hashtags they use.  Some were expected, like those posting about other teeny bopper bands and Justin Bieber.  Others were not expected, like someone trying to promote their jewelry brand, or the cast of the TV show Teen Wolf, or a group who mainly posted about guns and 2nd Amendment rights.
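
As a rough sketch of how a hashtag co-occurrence edge list like this can be built before layout: every pair of hashtags appearing on the same post becomes an edge, weighted by how many posts share that pair. The function name and sample posts here are hypothetical, not the data or code behind the graph.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(posts):
    """Count how often each unordered pair of hashtags appears on the same post."""
    edges = Counter()
    for tags in posts:
        # Lowercase and de-duplicate so "#Belieber" and "#belieber" count once per post.
        unique = sorted({t.lower() for t in tags})
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical posts, each given as a list of hashtags without the leading '#'.
posts = [
    ["belieber", "justinbieber", "onedirection"],
    ["Belieber", "justinbieber"],
    ["belieber", "teenwolf"],
]

edges = cooccurrence_edges(posts)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

The resulting weighted pairs can be written out as an edge list for a graph tool to lay out and cluster.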

The map shows the locations of the most frequent #belieber posters.  Made with David Tracy.








Analyzing OccupyCentral



An analysis of Twitter activity on the Occupy Central protests in Hong Kong.  Nodes are sized by weight, and colored by type [see legend].  The graph features 20052 nodes with 84157 edges, and a community of 891 users.  The average degree is 8.394, and the modularity is 0.339.  The main graph shows detail, including how the hashtags of the cause dominate the conversation.  The graph would probably tell some more interesting stories were those hashtags removed from the data.  The smaller version on the left demonstrates the long tail of tweets, primarily mentions of another Twitter user.  
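
As a quick sanity check on those figures: for an undirected graph, average degree is twice the edge count divided by the node count, since each edge touches two nodes. The reported numbers are consistent with that formula, assuming the graph was treated as undirected.

```python
nodes = 20052
edges = 84157

# Each undirected edge contributes one degree to each of its two endpoints.
average_degree = 2 * edges / nodes

print(round(average_degree, 3))  # 8.394
```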

Social Data Analysis update

Update from Social Data Analysis.  Below are three graphs produced in Gephi, of network relationships from Twitter data extracted using Python.  The first graph shows the community of Twitter users who tweet about Arduino.  Communities of users, based on who follows whom, are grouped by color, with the main cluster centering in the blue and green around the company Arduino, and things like Make magazine.  Smaller fringe communities lie on the outside.



Next is a graph of users tweeting about the flooding in Kashmir, India.  Certain figures like Swadeshi Vichar or Vinay Kumar Sahu proved to be influential nodes in the network.  Also of note is a small cluster of government Twitter bots to the right of the main cluster.  All the profiles appear to use randomly chosen profile pictures, follow only each other, and retweet only news articles from a government-run source.





The last is a graph of word co-occurrence with the hashtag #genderequity, sparked by Emma Watson’s speech at the UN on women’s rights.  Emma’s Twitter handle easily had the highest co-occurrence.  The hashtag #HeForShe, the movement her speech sought to launch, is also featured on the right. You can see a lot of other terms which fit the mold: understand, campaign, try, love, #humanist, someday.  The graph has 92 nodes and 182 edges, and needed to be run through a word stoplist, which unfortunately I didn’t get to.
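
The stoplist step that was skipped could look something like this minimal sketch. The stoplist here is a tiny illustrative set and the sample tweet is invented; a real run would use a fuller list of common words.

```python
# A deliberately tiny, illustrative stoplist (an assumption for this sketch).
STOPLIST = {"the", "a", "to", "of", "and", "is", "in", "for", "on", "at"}


def filter_terms(text, stoplist=STOPLIST):
    """Lowercase the text, trim surrounding punctuation, and drop stoplist words."""
    terms = []
    for raw in text.lower().split():
        term = raw.strip(".,!?:;\"'()")  # leaves '#' and '@' prefixes intact
        if term and term not in stoplist:
            terms.append(term)
    return terms


# Hypothetical tweet text.
print(filter_terms("Try to understand the campaign for #HeForShe."))
```

Filtering before counting co-occurrences keeps filler words from becoming nodes in the graph.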