Qwotd: Our prototype

The quarter is over and our project is wrapping up. We presented a demo of Qwotd on June 4th. We decided to start with a popular article to show how our browser extension will highlight quotes it finds on Twitter. We selected a McSweeney’s article titled “Client Feedback on the Creation of the Earth.”

qwotd_screen_shot

Without getting into the technical nitty-gritty, our browser extension interacts with an Application Programming Interface, which we built. Our API acts as an intermediary to the Twitter API, which is used to search for tweets about an article. Our API will store those tweets in a database, so future API calls are a lot faster.

For the next example in our demo, we clicked a random link on the New York Times homepage. While our system was processing the information from Twitter, we explained that the more popular an article is, the longer it takes to process. Twitter only returns 100 results at a time, so if an article has more than 1,000 tweets, our system has to call the Twitter API more than ten times.
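The arithmetic behind that wait is easy to sketch. The snippet below is a minimal illustration, assuming a fixed page size of 100 as described above; the helper name `calls_needed` is hypothetical and is not part of Qwotd's code.

```python
import math

PAGE_SIZE = 100  # results returned per request, per the description above


def calls_needed(total_tweets, page_size=PAGE_SIZE):
    """Return how many paged requests are needed to fetch every tweet."""
    if total_tweets <= 0:
        # At least one request is needed just to learn there are no results.
        return 1
    return math.ceil(total_tweets / page_size)
```

An article with exactly 1,000 tweets needs ten requests, and anything beyond that needs at least eleven, which is why popular articles take noticeably longer.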

Finally, we clicked on a random post from the NextDraft news digest email. Even though this had always worked while testing, this still made us nervous because if our system does not process the URL correctly then it will not get the right results from Twitter. For example, NextDraft adds an identifier to every link, “utm_source=nextdraft,” so that websites will know where the traffic came from. We strip things like this from the URL before searching Twitter to make sure we get the largest number of results possible.
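One way to strip identifiers like this is sketched below, using only Python's standard library. It assumes that every query parameter beginning with "utm_" is tracking noise; the function name `strip_tracking` is hypothetical, and the real extension may handle more cases than this.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking(url):
    """Return the URL with any utm_* query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")  # assumed tracking prefix
    ]
    # Rebuild the URL with the remaining parameters, leaving the rest intact.
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

Parameters that actually identify the article, such as an ID, are kept, so the cleaned URL still points to the same page.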

The greatest lesson we learned was that things always take longer than you want. We had to demo our system for five weeks in a row, and every week we wanted to show progress. Some weeks it came down to the wire. When we demoed the highlighting for the first time, we literally had it working about two minutes before class started. We built the extension in Kango, a cross-browser extension framework that you can program in JavaScript, HTML and CSS. A benefit is that since Kango is cross-browser, a single extension will work in Chrome, Firefox and Safari. However, development in Kango is different from creating a web page, and we had to go through some trial and a lot of error before getting each incremental step working.

This is an extremely innovative class that creates teams consisting of journalism and computer science students. However, I am the only person in my Masters of Journalism program with a computer science degree, so I was able to contribute on the programming side more than the other journalism students. On the other hand, some of my fellow journalism students have been avidly studying web design and front-end development, and it showed in their projects.

The main lessons we learned were to start development early to leave room for difficulties, and to take incremental steps. Sometimes it can be tempting to go for everything at once, but nothing beats the feeling of seeing incremental aspects of your system working.

Sourcerous: Our Prototype

Screen Shot 2014-06-09 at 11.26.55 AM

We’ve made it.  Sourcerous is out in the wild. No more late night tweaks from this team.

Sourcerous is a web-based application designed to locate and identify sources from online news organizations. Sourcerous was built from the ground up with the purpose of simplifying the process of locating sources for news articles. The current build of Sourcerous is designed specifically for business journalists, as it searches articles from primarily business-focused publications.

That’s the professional pitch.  Sourcerous is supposed to help reduce the time it takes for a journalist to find sources and instead let them focus on finding the facts and making the story the best it can be.

Sourcerous was built specifically for working journalists.  Any journalist at any point in their career should be able to use Sourcerous as a jumping off point for locating sources.

Sourcerous utilizes several key technologies in order to provide fast results.  It is primarily based on Django and HTML5.  It utilizes multi-threading to prevent timeouts, and utilizes the Alchemy API to fetch the relevant data.

While Sourcerous is currently a fantastic product, more time and work would improve the product.  Our team would want to enhance the search functionality and add more areas to search from.  We would also improve the front end to make it less cluttered and more attractive to the eye.

As we wind down with Sourcerous, it’s great to look back at what brought us to this point.  Our initial group meetings were actually fairly close to what the final product ended up being.  We had a vision from the start, and together we executed that vision.

The journalist on the team had a fast, easy-to-use product in mind and was reined in by the computer science majors.  But ultimately, the final product was very similar to the original design.  Based on feedback from our peers and professors, we were able to tweak and execute on our initial vision.

It’s been a pleasure sharing Sourcerous with everyone.  I hope you find our product worthwhile!  Take it for a spin at http://sourcerous.herokuapp.com/Sourcerous/

VizAnalytics: Our prototype

At the beginning of the quarter, we were tasked with creating a system that could provide accurate analytics benchmarks for hyperlocal websites. We believe that, by the end of the 11 weeks, we were able to provide a tool that could have immediate utility for publishers.

In case this is your first time reading about VizAnalytics, it is an analytics benchmarking service that uses a company’s Google Analytics data to compare that site to other similar sites on the metrics that matter for evaluating site performance. You can find a deeper explanation of VizAnalytics by clicking here.

Audience

In creating VizAnalytics, we had Web publishers in mind. We wanted to make VizAnalytics their one-stop shop for all of the metrics that matter most when it comes to evaluating the performance of their site. This is something that doesn’t currently exist.

As great as Google Analytics is, it can be a bit overwhelming at times. All of the data is sorted into various tabs throughout the screen with no road map to tell you, “This metric matters!”  Once you find the metrics that do matter to you, you are left without context. What is a good bounce rate? How many pages per session should my users be viewing? Are people coming back to my site at a good rate? With VizAnalytics, you can have all of these questions answered in a way that is both quick and easy.

In the most basic terms possible, VizAnalytics takes the data from your Google Analytics, tosses it into the pool of data from the analytics of other sites, and churns out an evaluation of where your site stands in relation to others. The picture below shows how this data is presented on the landing page after you log into VizAnalytics.

Screen-Grab-2
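To make the idea of "where your site stands in relation to others" concrete, here is a minimal sketch of one common way to benchmark a metric: a percentile rank against peer sites. This is an illustration only, not necessarily how VizAnalytics computes its evaluation; the function name and the sample figures are hypothetical.

```python
def percentile_rank(value, peer_values):
    """Return the percentage of peer sites whose metric is at or below `value`."""
    if not peer_values:
        raise ValueError("need at least one peer site to compare against")
    at_or_below = sum(1 for peer in peer_values if peer <= value)
    return 100.0 * at_or_below / len(peer_values)


# Hypothetical pages-per-session figures for five peer sites.
peers = [1.4, 1.8, 2.1, 2.6, 3.0]
your_rank = percentile_rank(2.2, peers)
```

A site averaging 2.2 pages per session would sit at or above three of these five peers. For a metric where lower is better, such as bounce rate, the comparison would be reversed.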

Demo

Ready to see this baby in action? The video below is a screencast of a demo of the site. I was apparently in a shouty mood when I recorded it, so I’d turn the volume down a bit before you press play.

One thing I forgot to mention within there is the question feature under the “more” button. Because we are using calculated metrics (and don’t insist you read our blog post about how these metrics are calculated – even though you should), we wanted a way people could get a quick and dirty explanation of what an “expected bounce rate” or the other calculated metrics meant. To do so, you can click on the question mark on the page, and a screen like the one you see below will pop up.

Screen-Grab-1

Key technologies used

In order to make VizAnalytics possible, we relied heavily upon the Google Analytics API, a data manager server, a getter server (which was in charge of communication between the data manager server and the Google Analytics API), and PostgreSQL (an open-source database). These technologies combined to create raw and calculated metrics, which were then cached to speed up future page loads.

Future work

There’s still plenty of additional work to be done with VizAnalytics to make it as great as it potentially could be.

First, we want to get access to the analytics of more sites so we can make our benchmarking more accurate. The more data we have, the better we can gauge how well each individual site is performing. If you’re interested in sharing your Google Analytics with us, feel free to email Professor Rich Gordon at richgor@northwestern.edu.

Second, we want to open VizAnalytics up to broader categories than just hyperlocal sites. This would require access to the analytics of other types of sites, but we believe that this service is useful to any kind of Web publisher. We just want to ensure we are always comparing apples to apples, so we would keep the various categories of sites separate from each other.

Finally, we want to show publishers how their data has changed over time. Currently VizAnalytics takes the data from the previous month and shows you only that. You can see changes over time in Google Analytics, but making them appear in VizAnalytics would be far easier for publishers. This would also allow you to see the evolution of your calculated metrics over time, which Google Analytics cannot show.

The experience behind creating VizAnalytics

Overall, the creation of VizAnalytics was an exciting process. Our team members (Liu Liu, Adrien Tateno and me, Jim Sannes) all came from different backgrounds in both academic and life experiences. Liu and Adrien were well-versed in the technological side of things while I had worked with/drooled over/stared endlessly at Google Analytics previously. In the end, it worked out to be a great combination all around.

The site evolved significantly from the start of the process to the end. The number of designs that we used throughout the quarter before ending up with the final one is nearly mind-boggling. However, I believe that the current design is one that could end up sticking with VizAnalytics as it moves beyond this class and, hopefully, into the hands of those for whom it could be most useful.

Tweet Talk: Our prototype

Tweet Talk is a browser extension that enhances the news reading experience. Designed to give the reader “more,” Tweet Talk allows readers, looking at an online article, to expand their knowledge of its topic by providing expert opinions, links to related articles, and information from experts talking about that topic instantaneously.

Screen Shot 2014-06-09 at 8.48.09 AM

Who wants to use Tweet Talk?

Tweet Talk was designed to appeal to a large audience and subsequently can be used by a multitude of people for a variety of reasons.

It could be used by a student as a research tool to gain a wider perspective on a topic, or by a sports enthusiast looking for varying viewpoints on last night’s game, or by a businessman looking to see what experts are saying about yesterday’s stock market activity. The possibilities for using Tweet Talk are endless, but the goal is the same: to instantaneously provide more information and opinions from the people who know best.

Using Tweet Talk

Our usability goal for Tweet Talk was that it require minimal effort on the part of the user. In order to meet this goal, we created a very simple, clean interface that does most of the work for the user.

To use Tweet Talk, the user needs to download the extension from the Google Chrome app store, available at bit.ly/tweettalk. While reading an article, the user activates Tweet Talk by simply clicking on the extension icon in their menu bar. The extension instantly presents the user with tweets related to the article from experts in the field.

If the user is curious about an expert they don’t see in the list of returned tweets, they have the option to search for that expert, or anyone else for that matter, by entering their Twitter handle into the search bar provided.

How it Works

  • Front End: HTML/JavaScript/JQuery/CSS
  • Server Side: Node.js
  • Backend: Firebase

When the user clicks on the Tweet Talk icon, the front end JavaScript makes an AJAX call to the ‘tweetResult’ RESTful method declared in the Node.js server. The RESTful method receives the URL of the current web page and searches for that URL in the Firebase database, which stores tweets related to articles already viewed by a Tweet Talk user and so acts like a cache. If the news article is found in Firebase and has been processed within the last 2 days, the stored tweets are returned to the front end, where the JavaScript modifies the HTML to display the tweets.
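The freshness check described above can be sketched as follows. This is a simplified illustration in Python rather than the project's Node.js code: a plain dictionary stands in for Firebase, and the function and field names (`cached_tweets`, `processed_at`, `tweets`) are hypothetical.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=2)  # freshness window described above


def cached_tweets(cache, url, now):
    """Return stored tweets for `url` if processed within MAX_AGE, else None."""
    entry = cache.get(url)
    if entry is None:
        return None  # article has never been processed
    if now - entry["processed_at"] > MAX_AGE:
        return None  # stored result is stale and should be rebuilt
    return entry["tweets"]
```

A return value of `None` signals that the server has to fall through to the slower path of extracting keywords and querying Twitter again.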

Tweettalk_Diagram

Otherwise, Alchemy API pulls the content of the Web page and designates a list of keywords, listed in terms of their relevance to the article. Then eight queries, using the top eight keywords, are used to pull tweets with Twitter’s Streaming API. The optional parameter ‘result-type’ is set as ‘popular’ to retrieve tweets which Twitter thinks to be important, based on number of retweets and ‘favorites’.

Once the tweets are pulled, they are immediately sent to a filtering process. First, we check if the tweet is from an organization by comparing names with a blacklist of organization-related words. We also filter out tweets from anyone with fewer than 10,000 followers or more than 1 million followers; the intent of this filter is to identify experts but not celebrities.

All of the tweets that pass those tests are then ranked based on their relevance to the article. We assign a weight based on several factors: the number of matching (non-stop listed) strings between the article and the tweet; the placement of each matching string (the higher the string, the more important it is); and the relevance value, between 0.0 and 1.0, assigned by Alchemy API. We also assign more weight for tweets from a news correspondent or ambassador, who tend to be more knowledgeable about their area of focus. Finally, the tweets are sorted by relevance and stored in Firebase along with the article’s url, and then returned to the front end to be displayed to the user.
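The filtering and ranking steps can be sketched roughly as below. This is an illustration in Python, not the project's Node.js code. The blacklist words, the dictionary field names, and the exact weighting formula are all assumptions; in particular, the position bonus is only one possible reading of "the higher the string, the more important it is".

```python
ORG_WORDS = {"news", "magazine", "official", "inc"}  # hypothetical blacklist
MIN_FOLLOWERS = 10_000
MAX_FOLLOWERS = 1_000_000


def is_expert(tweet):
    """Apply the organization and follower-count filters."""
    name_words = set(tweet["author_name"].lower().split())
    if name_words & ORG_WORDS:
        return False  # author name looks like an organization
    return MIN_FOLLOWERS <= tweet["followers"] <= MAX_FOLLOWERS


def score(tweet, keywords):
    """Score a tweet against (keyword, relevance) pairs ordered by rank.

    Each matching keyword adds its relevance value plus a bonus that
    shrinks for lower-ranked keywords (an assumed weighting).
    """
    text = tweet["text"].lower()
    total = 0.0
    for position, (word, relevance) in enumerate(keywords):
        if word.lower() in text:
            total += relevance + 1.0 / (position + 1)
    return total


def rank_tweets(tweets, keywords):
    """Drop non-experts, then sort the rest from most to least relevant."""
    experts = [tweet for tweet in tweets if is_expert(tweet)]
    return sorted(experts, key=lambda tweet: score(tweet, keywords), reverse=True)
```

Filtering before scoring keeps the ranking step cheap, since organizations and celebrity accounts never need a relevance score at all.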

When the user wants to search for relevant tweets from a particular expert, the user enters the Twitter handle of the expert in the search bar at the top of the extension. An AJAX call is made to the ‘expertSearch’ RESTful method declared in the Node.js server. The same mechanism uses Alchemy API to determine and rank keywords of the article. Then the tweets of the particular expert are retrieved using Twitter’s User Stream API. The resulting tweets then follow the same process as above to be filtered and ranked. After the tweets are ranked, they are returned in order of relevance to the extension to be displayed.

Enhancements/Improvements

In the future, we want the Tweet Talk extension to work not only on Google Chrome but also on all other major browsers: Safari, Firefox and IE. To work on Safari, we need to convert some JavaScript to AppleScript for all functions to initialize properly. Unlike Safari and Chrome, Firefox uses the browser engine Gecko instead of WebKit, which makes conversion much more difficult. These are some of the challenges we met.

We also want Tweet Talk’s searching functionality to allow Twitter names, rather than only handles, as well as an improved ability to display more than one relevant tweet from a searched expert.

To improve the user experience, it would be helpful to add the ability to create a custom list of experts — to be checked any time the extension is used on an article — or the ability to ‘favorite’ experts so they are more prominent in the list for all users.

For users unfazed by the friction of signing in, offering the tweets in a more Twitter-like format, with options to ‘favorite’, ‘reply’, or ‘follow’ the expert from within the extension, would create a much more interactive extension.

Finally, although we implemented a filter that catches many, if not most, organizations and non-experts, there is always room for improvement. Certain organizations are still missed and sometimes tweets appear from experts on topics other than the one given. Perhaps using Wikipedia to ascertain the background of the expert could be used as an alternative to finding non-expert indicators in the tweets.

Print Share: Our prototype

Print Share aims to solve a simple problem: it is far easier to share articles when you’re reading them on the Web than when you’re reading them in print. On your computer, phone, or tablet, a share is only a few clicks or taps away. But when you’re reading a newspaper or magazine, the process of finding the right link and sharing it completely disrupts your reading, and takes away from the experience of enjoying a print publication.

We saw that problem, and we found a solution. With Print Share, you can take a picture of a print article and let our app find the link to the corresponding web article for you. Then, you’ll be ready to share it and get right back to what you were reading. Admittedly, the system isn’t perfect, so we included some error-reducing functionality, for the cases when the right link isn’t found on the first try. You can crop a picture to select only the headline or body text, and you can input keywords for the article’s publication, author, date, and anything else you see fit.

Print Share in action

Print Share owes its functionality to a few different pieces of software. First off is Google’s Tesseract OCR code, which converts pictures of text into text that a computer can actually use. That data then gets sent to a Google Custom Search engine, whose results are shown to the user in our front end built with Flask – a tiny yet powerful Python web framework. In addition, we made use of Jcrop, a jQuery cropping plugin that allows us to select parts of the picture that Tesseract will be more likely to make sense of. Finally, the live app is currently running on Heroku’s platform, which lets us actually see what the thing looks like in the wild.

Print Share is great, but we’re willing to admit that there’s room for improvement. One thing we’d love to see added is more social media integration by means of the social network sites’ APIs. This would allow users to have more sharing options, as well as to share the picture they took along with a link to the corresponding article. On top of that, we’d love to see the OCR get built upon, including more versatile response to varying text sizes and fonts, and, if possible, recognition and reverse-searching of images, so that users can find articles by their photos instead of just their headlines.

This quarter has been an excellent learning experience, and it’s been downright crazy to watch Print Share grow from a pipe dream to a fully-functioning web app. From day one it was clear that we had a unique challenge, in that we would have less to focus on from the journalistic side of the course and more technical hurdles to handle. In spite of that, we managed to not only jump those hurdles, but also to keep the original idea and motivation in sight the entire time. We learned a lot along the way about what good teamwork looks like and the different forms it can take, from dividing and conquering, to sitting and staring at something together until it starts to make sense.

Another huge lesson we learned is the value of making do with something that works, even if it didn’t do so in exactly the way we had originally envisioned. We had many pieces of the app that came together from very disparate discoveries and strategies, and even if it wasn’t always the most elegant smattering of code, it came together to make something that works. Print Share was an absolute blast to make, and it feels great to be able to show it off with the knowledge that it could very well blossom into something far more robust.

 

The Weekender: Our prototype

The big day has come and gone and The Weekender is now out in the world. Our presentation in the McCormick Tribune Center on Wednesday night may have won us a few fans for our coordinated dress sense, but it also gave the world a chance to see what we’ve been working on.

The Weekender is a travel suggestion site with a difference. Our aim was to make planning a vacation to a new city simpler. Our inspiration came from The New York Times Travel section’s “36 Hours” column. The Times columns provide a curated list of things to do and places to eat and drink during a short break to towns across the world. We wanted to provide a similar experience – something that gives people looking to get away an easy-to-follow and inspiring itinerary – for any city. We also wanted to take into consideration something that almost no travel site provides – ideas actually relevant to the person taking the trip.

There are many useful resources on the web for people looking to plan a short trip. Sites like Trip Advisor and Lonely Planet have rich content on almost every city across the globe, and sites like Yelp and Google+ Local have encyclopedic lists of restaurants, bars and events worldwide. A lot of these sites have user reviews and some even make suggestions on where to go when you’re on vacation, but none of them really get to know what you like before they make suggestions.

Weekender Itinerary page
A results page in The Weekender for a trip in Chicago

We decided to take a lot of the good information on those sites, aggregate it all, and create a recommendation engine that shows you places based on your interests and situation. If you’re looking for a quick trip with your significant other on the cheap in Chicago, we can show you where you might like to go.  If you’re looking for a lavish four-day weekend trip to the Big Apple with your buddies, no expense spared, we can give you that too.

We take information from Google Places, Lonely Planet, Foursquare, Yelp and Google Maps to help design a simple, easy-to-follow itinerary for your next getaway. We also included social links for the major social networks so you can share your trip with your friends and family.

We started with a few key cities to make sure our concept worked – Chicago, New York, Omaha and Milwaukee – but given the time, we would build this out for every city in the States, or even worldwide.

We’d also really like to make this a mobile-friendly site, where users can pull up itineraries on the go when they’re actually on their vacations. We would’ve liked to add restaurant and hotel booking capability to the app, and the ability to update old itineraries, but there’s only so much you can do in ten weeks.

We’re pretty proud of what we managed to achieve in this short time, and we hope you enjoy The Weekender. We really enjoyed working together to build this – three computer scientists and a journalist – and we feel that we learned a few things along the way. We learned that it’s good to have big ideas, but you’ve got to start small. We learned that matching the product to the audience is just as important as the product itself. We learned that we can do a lot in a short period of time, and that ideas for innovation can come from anywhere.

Give The Weekender a try today – maybe you’ll feel inspired to do a little exploring.

TED 2.0: Our prototype

TED 2.0 aims to automatically generate a compelling TED talk on a given topic. Inspired by the TED A.I. XPrize, an upcoming competition focused on the automatic generation of a TED Talk, our main goal was to design and build a proof of concept that can eventually compete for the XPrize.  Our system is meant to accommodate a wide range of users. For example, it can help a user prepare for a speech or debate on a certain issue or topic, assist a user in improving her public speaking skills, inspire a user, and even provide a TED Talk enthusiast with a TED Talk-like speech to read or listen to.

To use, simply visit our website, select a topic, hit submit, and in under two minutes, the speech will be generated. We have also included a few extra features to enhance the user experience, including news links and TED talks based on the user selected topic, an in-app dictionary and thesaurus tool, and an avatar which can deliver the speech. Unfortunately, the avatar is only available for a limited amount of time because we are using a free trial subscription.

How it works

Speech
Sample speech generated

To give the talk a clear structure, we first generate a thesis statement that is a road map for the rest of the talk. We form the thesis by picking a debate from debate.org and combining the opinionated arguments presented by commentators on that debate. Next, we find related text by extracting keywords from the thesis and then using Google’s related search to find relevant content for the other sections of the talk such as “Importance”, “Problem”, “Solution”, and “Impact”. Once we have a list of possible paragraphs for different sections, we apply a series of filters to keep only paragraphs that have a reasonable length and match the keywords, and to discard duplicate text, code and embedded links. We then add connectives (i.e. short phrases) that improve the flow of the talk. Lastly, we use a taxonomy filter to find a relevant quote from BrainyQuote to end the talk with.
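The paragraph filters described above might look something like the sketch below. It covers the length, keyword, duplicate and link checks; the word-count thresholds and the function name `filter_paragraphs` are hypothetical, and the code check is left out because the post does not say how code was detected.

```python
def filter_paragraphs(paragraphs, keywords, min_words=40, max_words=150):
    """Keep paragraphs of reasonable length that mention a keyword,
    skipping duplicates and text that contains links."""
    seen = set()
    kept = []
    for para in paragraphs:
        # Normalize case and whitespace so near-identical copies compare equal.
        normalized = " ".join(para.lower().split())
        n_words = len(normalized.split())
        if not (min_words <= n_words <= max_words):
            continue  # too short or too long
        if "http://" in normalized or "https://" in normalized:
            continue  # contains an embedded link
        if not any(keyword.lower() in normalized for keyword in keywords):
            continue  # off topic
        if normalized in seen:
            continue  # duplicate of an earlier paragraph
        seen.add(normalized)
        kept.append(para)
    return kept
```

Running the cheap checks (length, links) before the keyword scan keeps the filter fast even when the scraper returns many candidate paragraphs.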

Technologies used

Avatar
An avatar reads out the talk

We programmed TED 2.0 mainly in Python and used HTML/CSS, JavaScript, jQuery and Flask for the front end. To scrape content from different websites, we used Python’s built-in package urllib2 along with external libraries like Beautiful Soup and Google search. Despite what the name suggests, the Google search library is not endorsed by Google. It is possible to exceed Google’s frequency limits for searches, so we had to be careful and add pause time between searches so that Google does not block us. Besides scraping, we used NLTK and Alchemy API for natural language processing on the text. For the extra plug-ins we used the TED API, Bing News API, SitePal for the talking avatar, and Pearson and Altervista.org for the dictionary and thesaurus.
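The pause between searches can be sketched with a small helper like the one below. It is a generic illustration, not the team's code: the name `throttled`, the two-second default, and the `fetch` callback (standing in for whatever function performs one search) are all assumptions.

```python
import time


def throttled(items, fetch, pause_seconds=2.0, sleep=time.sleep):
    """Call `fetch` on each item, pausing between consecutive calls."""
    results = []
    for index, item in enumerate(items):
        if index > 0:
            sleep(pause_seconds)  # wait before every call except the first
        results.append(fetch(item))
    return results
```

Passing `sleep` as a parameter makes the helper easy to test without actually waiting, and it keeps the pause length in one place if the limits change.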

Future work

Looking ahead, we can develop an active supervised learning framework to improve the quality of the talks by learning to distinguish between good and bad TED talks. We can also try to make the talk more TED-like by incorporating personal anecdotes. Additionally, we need to find a better way of pulling content from the web so we can remain within Google’s search limits.

Reflections

Looking back, we feel that we learned a lot through this project, not just through the technical skills that we developed, like scraping the web and using web frameworks, but also through collaborating and communicating with the other team members. We each had unique insights and skills to offer and together we found ways to leverage each other’s strengths. Sure, we had disagreements, but we settled them through clear communication and cooperation. Without doubt this project was tough and required a lot of effort, but through consistent hard work and perseverance, we achieved results we are proud of.

Trendable: Our prototype

Trendable began the quarter as Datatrack, an easy way for journalists to check in on key trends from automatically updating yearly crime and transportation datasets in Chicago. While limited in scope, it was an exciting concept. With Datatrack anyone could be a data expert. Take this hypothetical, but realistic, example of what a journalist might find with the original version: a journalist doing a story on crime in Chicago clicks on Datatrack’s crime statistics and is immediately alerted that there has been a 500% increase in reported murders from the prior year. Our journalist hero wouldn’t have to be a genius to figure out there might be a story there.

With our new update, Trendable, we looked to maintain this functionality while also making the app more useful for how journalists work in the real world. This meant we needed to allow for journalists to add their own data to explore and analyze. The result is a system that allows journalists to easily add their own CSV data, stitch together files which might be separated by quarter, month or year, and automatically generate that same useful analysis built with Datatrack. With Trendable, journalists can upload whatever monthly or yearly data they’re interested in and instantly spot key trends and basic stats.

How it works for the user

Trendable is targeted at journalists interested in working with data that’s split into multiple files based on time periods (typically years or months). This is a fairly common practice in the way governments and other organizations report their data, and it can lead to journalists wasting hours trying to copy-paste it all into one usable file.  It’s also no secret that many journalists aren’t numbers experts, so we wanted to make the system as easy to use as possible. The result is a simple interface that requires minimal thinking from the user. Here’s the basic workflow:

1) Call to action at homepage:

Screen Shot 2014-06-06 at 1.22.42 PM

When a journalist arrives at the Trendable site, he or she is greeted with a clean, inviting and friendly interface. Our tagline, “Easily import, knit together and analyze time series data,” is accentuated in large text, and makes the point of Trendable clear to a first-time user. The image of the results displayed on an iPad also serves to show the user the kind of results they can expect from Trendable. By pressing either of the available “Import Data” buttons, the journalist is brought to the next step.

2) Select data to use

Screen Shot 2014-06-06 at 1.27.53 PM

Here the “Import Data Tool” is accentuated. By selecting “Choose File” and specifying the type of data to be analyzed (yearly or monthly, plus the specific month or year), the journalist can get the process underway. The journalist may also specify whether he or she would like to add another file, which would later be stitched together with this data. Instructions explain this for the less tech-savvy user, and the steps in the process are highlighted near the top of the window to ensure the user understands where they are in the process.

3) Check Data

Screen Shot 2014-06-06 at 1.32.20 PM

As the journalist moves through the process, he or she is asked simple questions which are used to help with the analysis process. If there is a major, correctable problem in the data, Trendable will highlight where it is located and give the journalist the option to stop the process and correct it. This is faster than what a journalist would usually have to do, moving through each spreadsheet in search of problems without any guidance.

4) Analysis

Screen Shot 2014-06-06 at 1.35.09 PM

Once the journalist has moved through the process, they’re brought to an analysis screen. This provides access to key trends, graphs of the data and a data table.

How it works behind the scenes

So that’s the magic, but what’s behind the curtain? Trendable uses a PHP backend to do data import and analysis. After the user has uploaded a dataset, Trendable searches for errors in the data. If it finds one, it tells the user about it. Otherwise, we ask the user a few simple questions to help format the data the way he or she wants it. After merging the data into a single table, we perform a variety of statistical and trend analyses on it and make observations, which are presented to the user. Finally, we save the dataset to a database, so that the user can access it later.
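The merge-then-analyze step can be illustrated with a short sketch. Trendable itself is written in PHP; the Python below is only meant to show the idea. The function names, the column names and the sample figures are hypothetical, and real uploads would need the error checks described above.

```python
import csv
import io


def stitch(files_by_year, category_col, count_col):
    """Merge per-year CSV texts into {category: {year: count}}."""
    table = {}
    for year, text in files_by_year.items():
        for row in csv.DictReader(io.StringIO(text)):
            table.setdefault(row[category_col], {})[year] = float(row[count_col])
    return table


def percent_change(old, new):
    """Year-over-year change as a percentage; None if the old value is zero."""
    if old == 0:
        return None  # change from zero is undefined
    return 100.0 * (new - old) / old


# Hypothetical yearly files, each with a category column and a count column.
files = {
    2012: "type,count\ntheft,200\nassault,50\n",
    2013: "type,count\ntheft,150\nassault,300\n",
}
merged = stitch(files, "type", "count")
change = percent_change(merged["assault"][2012], merged["assault"][2013])
```

In this made-up data the second category rises from 50 to 300, a 500% increase, which is the kind of jump the analysis screen is meant to surface automatically.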

Key Technologies

– PHP

– MySQL

– Javascript/jQuery/Bootstrap

Future possibilities

There is still a lot of work to be done on future iterations of Trendable. While we’re happy with the basic functionality, branding and usability, Trendable is tackling a big task, and making it an effective solution for all the time series data a journalist might be using isn’t easy. First we’d like to expand functionality to quarterly datasets, which would open up possibilities for business reporters.

On the import side, we have three specific improvements that could be made. First, by allowing users to edit “bad” data values right in the application, we would speed up the process considerably, since the user would no longer have to ignore the bad values or restart the process altogether. We would also like to investigate further improvements to the error detection and handling process. Finally, we would like to allow users to upload monthly data where each file is a year containing rows or columns of months (or similar), making Trendable even more flexible.

VizAnalytics: Comparing site analytics in a relevant way

In our last blog post, we discussed the various metrics that VizAnalytics will be using to show you how your website is performing. If you haven’t read that yet, I strongly recommend it because it will help the rest of this be less of a garbled mess (and because it will increase our pages per visit, duh).

In my opinion, one of the coolest aspects of VizAnalytics is the calculated metrics that we will be presenting within the broader raw metrics. I’d like to take some time to explain what those calculated metrics are, how we calculated them, and why they’re relevant to you and your website so that it doesn’t just seem like we’re making stuff up on the fly (okay, maybe a little bit).

Expected Bounce Rate

When you click on the option to “see additional engagement metrics,” one of the metrics that will come up will be the deviation of your site’s “expected” bounce rate from your actual one. Confused yet? Let me explain a bit more.

When we were looking at the analytics of the websites, we noticed something: as the percentage of referrals via social media increased, so did their bounce rate. If you think about this, it makes sense. When users are sent to a site via Twitter, they are most often sent to a specific article. Once they’ve read the article that their friend recommended, they typically click away from the site. That’s a “bounce” (a visit consisting of just one page view).

However, this does not necessarily mean that your page is designed poorly. It could just mean that you’re getting a lot of social media traffic, and your bounce rate is being punished for it. We wanted to try to account for this, so we created the expected bounce rate.

In order to create an equation that would (as well as we could) remove the negative effects of a high social referral rate, we charted the bounce rate and percentage of sessions via social referral for each company. The correlation coefficient for this data was -0.38, which is statistically significant.

Next, we created a line-of-best-fit for the data. We took the equation of this line to give an expected bounce rate based on the percentage of sessions via social referral. We hope that this will give publishers a better glimpse into the design of their page and how that may be helping or hurting their bounce rate as opposed to how the referral traffic is affecting it.
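For readers who want to see what a line of best fit looks like in practice, here is a minimal sketch in Python. It assumes an ordinary least-squares fit, which the post does not confirm, and the function names and data points are invented for illustration. They are not the actual site analytics.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def expected_value(slope, intercept, x):
    """Read the expected metric off the fitted line."""
    return slope * x + intercept


# Made-up sample: percentage of sessions via social referral and
# bounce rate (both in percent) for four hypothetical sites.
social_share = [10.0, 20.0, 30.0, 40.0]
bounce_rate = [50.0, 55.0, 60.0, 65.0]

slope, intercept = fit_line(social_share, bounce_rate)
expected = expected_value(slope, intercept, 25.0)
print(slope, intercept, expected)
```

Once the slope and intercept are known, any site's expected bounce rate comes from plugging its own social referral percentage into the line.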

Expected Pages Per Visit

Much like expected bounce rate, the expected pages per visit had a high correlation with one of the other metrics. In this instance, the higher your percentage of direct referrals, the higher your pages per visit. This makes sense because a direct referral means that a person is typing your URL and going to the site directly. What’s the most common landing page if a person is typing out an actual URL? The home page. How many times is a person going to go to the home page and then just scamper away? Hopefully not very often.

A relatively high number of pages per visit could create a false sense of security that a site’s page design is superb. A high direct referral rate inflates your pages per visit, but it doesn’t mean that you can’t do better.

As we did with the bounce rate, we charted the percentage of direct referrals and pages per visit of each company. The correlation coefficient of these two metrics was 0.47. We used this to create a line-of-best-fit to give us an equation to provide an expected pages per visit for each company.

After calculating these expected rates, we show each site its deviation from each. If your deviation (the difference between your actual total and your expected total) for pages per visit is positive, that means you’re doing things well. Your actual pages per visit were higher than what we would have expected based on your direct traffic. Conversely, if your deviation for bounce rate is positive, it means that your bounce rate is higher than we expected based on your referrals via social media. That’s a bad thing, and it means the design of your page could be optimized to get readers to click on additional articles and stay on the site longer.
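The deviation rule above is easy to get backwards, so here is a small Python sketch of it. The metric keys, function names and sample numbers are hypothetical. The only parts taken from the post are the definition (actual minus expected) and the direction in which each metric is "good."

```python
# For pages per visit a higher number is better; for bounce rate it is worse.
HIGHER_IS_BETTER = {"pages_per_visit": True, "bounce_rate": False}


def deviation(actual, expected):
    """Actual minus expected, as defined in the post."""
    return actual - expected


def is_good_sign(metric, dev):
    """True when the deviation points in the favorable direction.

    A deviation of exactly zero is treated as not favorable.
    """
    if HIGHER_IS_BETTER[metric]:
        return dev > 0
    return dev < 0


# Invented numbers for one hypothetical site.
pages_dev = deviation(3.2, 2.8)     # more pages per visit than expected
bounce_dev = deviation(62.0, 57.5)  # higher bounce rate than expected

print(is_good_sign("pages_per_visit", pages_dev))
print(is_good_sign("bounce_rate", bounce_dev))
```

Keeping the "which direction is good" table separate from the subtraction means the same deviation function can serve both metrics.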

Percentage of Loyal Users

For this metric, we are looking at the percentage of users that come to the site, on average, more than once per week. Our algorithm simply takes the number of users that had been to the website five or more times in the past month and divides that by the total number of users.

The utility of this one should be pretty obvious. You want to hook your users and make your site a place they stop every time they hit the Web. This metric should be able to show you whether or not you are accomplishing exactly that.
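The loyal-user calculation described above can be sketched in a few lines of Python. The input format (one user ID per visit over the past month) and the function name are assumptions made for the example. The threshold of five visits comes from the post.

```python
from collections import Counter


def loyal_user_percentage(visits, threshold=5):
    """Percentage of users with at least `threshold` visits.

    `visits` is an iterable of user IDs, one entry per visit
    during the past month.
    """
    counts = Counter(visits)
    if not counts:
        return 0.0
    loyal = sum(1 for n in counts.values() if n >= threshold)
    return 100.0 * loyal / len(counts)


# Made-up visit log: users "a" and "c" reach five visits, "b" and "d" do not.
visit_log = ["a"] * 6 + ["b"] * 2 + ["c"] * 5 + ["d"]
share = loyal_user_percentage(visit_log)
print(share)
```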

If you have any feedback or thoughts on these metrics, feel free to send a tweet to @VizAnalytics1 or leave a comment below. We’d love to hear what y’all have to say.

WikiNow: Our prototype

Wikipedia is a huge internet resource and articles get millions of views per week. Whether it is to look up a celebrity’s biography or to learn about a historical event’s anniversary, we all use Wikipedia to find information.  But instead of visiting Wikipedia’s front page, we usually search for the particular article directly from a search engine.  There is no central point for Wikipedians to gather and discover Internet trends.

Here’s where WikiNow comes in. WikiNow is the new front page for Wikipedia: it displays a list of popular Wikipedia articles, categorizes them and relates them to the news in a visually pleasing way. The site aims to entice news junkies and regular Wikipedia viewers, providing a summary of the hottest Internet trends of the day. News junkies can compare the popularity of certain news trends, and Wikipedians can discover news articles related to their favorite Wikipedia pages, making WikiNow a central location for both groups to gather.

Our site works as described by our architecture diagram below.

Architecture Diagram

WikiNow pulls the lists of articles from the website Wikitrends, namely the daily rising, weekly most visited and the weekly “downtrends.”  For each rising article, we also find relevant news articles from Google News and a corresponding image from Google Images.  The Wikipedia entry is also categorized using the Knight Lab’s news categorizer.  Once all of this data has been pulled, it is displayed on WikiNow as shown below.

Screen Shot 2014-06-03 at 8.01.23 PM

WikiNow was built with the Django web framework, and the bulk of the code that pulls data from Wikitrends was written in Python.  To parse through the top lists on Wikitrends, we used the Beautiful Soup Python library.  For the Google News articles, we used the Google News RSS Feed and used the XML Tree Python library to parse through the output data.  And as mentioned before, we used the Knight Lab’s news categorizer to identify the categories for each article.
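As a rough illustration of the RSS step, the sketch below pulls headlines and links out of a feed with Python's standard xml.etree.ElementTree module. It assumes that this is the library meant by "XML Tree," which the post does not spell out. The feed text is a made-up sample rather than real Google News output, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up RSS sample standing in for a news feed response.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example news feed</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def parse_news_items(rss_text):
    """Return a list of {"title", "link"} dicts, one per <item> in the feed."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return items


news = parse_news_items(SAMPLE_RSS)
for entry in news:
    print(entry["title"], entry["link"])
```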

In the future we hope to add more content, provide more meaningful categorization and add a people dimension to the page.  Because we currently rely on data from the Wikitrends website, we can only pull 10 articles from each list.  We would like to be able to create our own rising lists using hourly page view statistics for each article and to use our own algorithms to decide what is ‘rising’ and ‘downtrending.’
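We have not settled on an algorithm for this, but one simple possibility is to compare an article's page views in the most recent window with the window before it. The Python sketch below is purely hypothetical: the scoring rule, the function names, the window size and the sample counts are all assumptions, not something WikiNow currently does.

```python
def rising_score(hourly_views, window=24):
    """Ratio of views in the latest window to views in the one before it.

    Adding 1 to both sums avoids dividing by zero for brand-new articles.
    A score above 1 suggests 'rising'; below 1 suggests 'downtrending.'
    """
    recent = sum(hourly_views[-window:])
    previous = sum(hourly_views[-2 * window:-window])
    return (recent + 1) / (previous + 1)


def top_rising(pageviews, limit=10, window=24):
    """Return up to `limit` article titles, highest rising score first."""
    scored = [(rising_score(views, window), title)
              for title, views in pageviews.items()]
    scored.sort(reverse=True)
    return [title for score, title in scored[:limit]]


# Invented hourly counts, using a two-hour window to keep the example short.
sample = {
    "Article A": [10, 10, 50, 60],
    "Article B": [100, 100, 90, 95],
    "Article C": [5, 5, 5, 6],
}
rising = top_rising(sample, limit=2, window=2)
print(rising)
```

A ratio rewards small articles that suddenly spike, which fits the idea of "rising" better than raw view counts would.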

We would also like to provide more news-like categories.  Since the concept of our application follows the design of a newspaper, we would like to be able to show rising lists in each of the typical news categories.  That way, someone interested in sports can look at popular Wikipedia articles in sports while a technologist could look at the trending Wikipedia articles about technology.

Finally, Wikipedia was built by thousands of editors.  There is a huge Wikipedia community and we would like to be able to highlight those who have contributed to the site.

We consider the project a minor success since our team collaborated very well together. The biggest lesson and the most helpful resource throughout the quarter was user testing. The project went through multiple iterations in design that greatly benefited from user feedback.

The news dimension was added after users told us the first iteration of the design lacked context. The categorization was added since users were really longing for some sort of direction reminiscent of news sites.

We also learned important lessons in collaboration. Our team consisted of two computer scientists and a journalist, and collaboration was possible due to an understanding of technology and a passion for storytelling on both parts. We delegated based on each person’s strength and used git for collaboration and version control.

While there were still communication gaps, we were able to have fruitful discussions and achieve a successful project because of a certain level of knowledge and understanding on both ends.