Predicting the Winner of the FIFA World Cup with Data and Visualization

published on May 23, 2014 in Storytelling

Andrew Yuan is a software engineer, project manager and data science student from Brazil, currently living in New York City. In his latest project, “2014 Fifa World Cup Brazil Predictions,” he combined his love of soccer with his passion for making concepts and data visually comprehensible in a simple, smart way. The interactive graphic, which you can explore by clicking the image below, recently caught our eye, and we asked Yuan to share the inspiration for this project and his creative process with the Visually community.

With the kickoff of the 2014 FIFA World Cup fast approaching, every soccer fan in the world is dying to know: Who will capture the coveted trophy? The 2014 FIFA World Cup is scheduled to take place June 12 to July 13, and will be hosted in 12 different cities across Brazil. As blogger NewsJuice nicely summarized on CNN, “The 2014 FIFA World Cup in Brazil is shaping up to possibly be the most watched sporting event in history. Both in terms of participants and viewers, soccer is hands down the world’s most popular sport.”

In the spring of 2014, I took an Exploratory Data Analysis and Visualization class, as part of a Data Science program at Columbia University. I am originally from Brazil and very passionate about our national soccer team, as most Brazilians are. So I figured, “wouldn’t it be awesome if my final project presented the predictions of the World Cup that starts in 6 months?” Quoting Barney Stinson from one of my favorite TV shows, my answer was: “Challenge accepted!”

And it was challenging indeed. How do you define a predictive model for a tournament with so many factors involved, and even some randomness? Many of those factors are either subjective or hard to measure (e.g. player and coach skill, or team reputation). Also, for most of these factors, there is no historical data that could be used to train my model. Therefore, my first decision for the project was to keep it simple and investigate only factors that are measurable, available, and good indicators of a match outcome.

Exploring the results of every official FIFA match since 1993, I found that the proportion of matches won correlates strongly with two factors: the teams’ relative positions in the ranking table and the location where the match took place (home, away or neutral field). A logistic regression over those data points was enough to model the probability of each match outcome. With that, I simulated every possible outcome of the World Cup and calculated their probabilities, coming up with the numbers seen in my visualization. (For further details on this methodology, check http://andrewyuan.github.io/methodology.html.) As you can see, Brazil does not have the highest probability just because I am Brazilian. The numbers speak for themselves!
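To make the idea concrete, here is a minimal Python sketch of the general technique described above: a logistic function turns ranking difference and venue into a win probability, and those probabilities are then propagated exactly through a knockout bracket. It is an illustration under stated assumptions, not the project's actual code. The coefficients `BETA_RANK` and `BETA_HOME`, the function names, the team labels and their rankings are all hypothetical, draws are ignored, and the group stage is left out.

```python
import math

# Hypothetical coefficients for illustration only; the fitted values
# from the project are not given in this article.
BETA_RANK = 0.03  # weight of each ranking position of advantage
BETA_HOME = 0.5   # weight of the venue term


def win_probability(rank_a, rank_b, venue=0):
    """Probability that team A beats team B (draws are ignored).

    rank_a, rank_b: ranking positions, where 1 is the best.
    venue: +1 if A plays at home, -1 if A plays away, 0 if neutral.
    """
    z = BETA_RANK * (rank_b - rank_a) + BETA_HOME * venue
    return 1.0 / (1.0 + math.exp(-z))


def title_probabilities(bracket, ranks, host=None):
    """Exact title probabilities for a single-elimination bracket.

    bracket: team names in bracket order; the length must be a power
             of two, and adjacent entries meet in the first round.
    ranks:   dict mapping team name to ranking position.
    host:    optional team that plays every match at home.
    """
    def venue(a, b):
        if a == host:
            return 1
        if b == host:
            return -1
        return 0

    def solve(teams):
        # Base case: a lone team "wins" its sub-bracket with certainty.
        if len(teams) == 1:
            return {teams[0]: 1.0}
        half = len(teams) // 2
        left, right = solve(teams[:half]), solve(teams[half:])
        result = {}
        # A team wins this sub-bracket if it wins its own half and then
        # beats whichever opponent emerges from the other half.
        for a, pa in left.items():
            result[a] = pa * sum(
                pb * win_probability(ranks[a], ranks[b], venue(a, b))
                for b, pb in right.items()
            )
        for b, pb in right.items():
            result[b] = pb * sum(
                pa * win_probability(ranks[b], ranks[a], venue(b, a))
                for a, pa in left.items()
            )
        return result

    return solve(list(bracket))


if __name__ == "__main__":
    # Made-up teams and rankings, not real FIFA data.
    ranks = {"A": 1, "B": 10, "C": 20, "D": 40}
    probs = title_probabilities(["A", "B", "C", "D"], ranks, host="D")
    for team, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        print(f"{team}: {p:.3f}")
```

Because every bracket path is weighted by its probability rather than sampled at random, the resulting title probabilities always sum to 1. A real model would estimate the coefficients from historical match results instead of hard-coding them.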

Here are some of the sketches I made before getting to the final visualization:

I was (pleasantly) surprised by the amount of positive feedback I have been receiving on this project. It really has motivated me to learn more about data exploration and visualization in the future. I am grateful to my professors and classmates, whose encouragement made this project possible. It wouldn’t have become as popular as it did without their support.

You can contact Yuan through his website or LinkedIn.