CD Studio Project 3 — Data Visualization

Nov 12, 2020

Analyzing the data visualization

The visualization, which focuses on cash crops, can be found here.

What data is introduced?

The dataset is titled “What is the world’s biggest cash crop?” but biggest could be interpreted in different ways. That is why the graphic first introduces four columns characterizing crops — most planted, most fecund, most popular, most revenue. Subtitles further explain the measurement that determines these titles, and horizontal bars provide comparisons between crops. 21 cash crops are introduced, ordered by most planted.

Below, a fifth dimension is shown — most lucrative — but this time, it is shown as comparatively sized circles, along with dollar amounts for all crops and icons for the top five crops.

How would you characterize the steps in the story?

Overall, there seem to be two steps: comparing most planted/fecund/popular/revenue, followed by most lucrative. Within the first step, most planted is the most obvious first step since it is ordered vertically, followed by the other three columns from left to right.

There are a couple of visual components I don’t really understand within most lucrative. One is the decision to reintroduce a background color in blocks at the bottom of the visualization. I can’t figure out how it relates to the data itself, since the most lucrative crops are not organized by amount horizontally or vertically, so it is a bit distracting. Additionally, why introduce icons for grapes and tomatoes, but not other foods? If only the three most lucrative crops had icons, the icons would be another point of visual emphasis around the lucrativeness of drug-related crops. However, the introduction of additional icons makes me wonder what the value per km² cutoff was for an icon to be included, and why the makers didn’t just make all the circles big enough to include an icon, and scale up accordingly.

Why are these background color blocks there? Why put icons on grapes and tomatoes, but not other foods?

What relationships emerge from the visualization?

The four columns at the beginning are sorted by most planted, and notably, this dimension does not always align with the three others. For example, wheat is the most planted worldwide but ranks low on most fecund and somewhere in the middle on most revenue/most popular. Cannabis is one of the least popular, fecund, and planted on the list, but produces the most revenue by far.

In the visualization of most lucrative crops, the circles representing drugs (cocaine, cannabis, and opium) dwarf the lucrativeness of all the other cash crops, whereas all three were the least planted on the list.

What do you believe the maker wants you to see?

The maker seems to place the most emphasis on how lucrative each crop is by giving lucrativeness its own section on the page with the most visually striking discrepancies between amounts. A clear trend emerges here that the drugs included in this list of cash crops are the most lucrative.

Why is their stance important?

The explanation below the graphic mentions the decision several states have made to legalize cannabis, and posits that the decisions have been influenced by the amount of revenue states stand to gain from legalizing cannabis and other drugs. They insinuate that these incentives might be enough in the future for other governments to follow suit. The implications of the lucrativeness of drugs seem to be the driving factor for creating the visualization.

Their stance is also important because it is not the only stance or trend they could have highlighted; for example, sugar cane ranks high on fecundity and popularity and plenty has been written about its barbaric history, detriments to our body, and more. But instead of choosing to focus on sugar or the dimension of popularity, they took another route.

Analyzing the dataset

What other relationships might be inherent in the data and of value to highlight?

As mentioned above, the popularity of sugar is notable. Another important relationship or question to address might be why we dedicate the most area to crops that don’t have the highest yield per area — does this have environmental implications? Is it driven by nutritional or financial value?

Additionally, the dataset is not explicitly categorized in the existing visualization to compare produce vs grains vs oils and spices vs legal stimulants vs illegal stimulants (or however else we might categorize the content/use of these crops); I found myself searching for similarities and groupings among the crops produced as I went through the graphic.

My mind is already wandering to what is not included in the dataset but might be influential to these trends, too. Where are these crops grown? Are some more amenable to a range of climates than others (and does this influence fecundity or widespread-ness)? Do some have a higher or lower impact on the climate?

Moving forward with our own visualization

What have you gathered from the readings and class activities to date?

What facets of your data are you considering using in your project and why?

At first I tried to find information on greenhouse gas production by crop, but it was more difficult than I anticipated. The original dataset was taken largely from datasets collected by the Food and Agriculture Organization of the United Nations (FAO), which separates aspects of agricultural production that contribute to greenhouse gas emissions, but not crops. I saw that various reports and research addressed individual crops like sugar, wheat, or corn, to try and estimate their carbon footprints, but since the main point of the project was not gathering data in itself, I shied away from having to try and reconcile different measurements, formats, and possible holes in what I was trying to find.

I looked next at sugar production, but got bogged down in the classifications of sugar in the FAO data tools. Some metrics differentiated between sugar beet and sugar cane, while others had many more classifications.

All the definitions of sugar in the import/export dataset. I wanted to use Sugar, Total, but it’s one of two entries without a definition so I wasn’t sure which of the other line items it might include…

If I used the dataset for imports and exports or production alone, I could simply choose which measurement of sugar would make the most sense, and stick to it. Part of what I wanted to show was an increase in sugar consumption over time, but I was also interested in geographic comparisons in agriculture. I quickly realized sugar might not be as important a crop to many countries, and while I had data on production, I didn’t have as much on consumption, so if I wanted to incorporate a geographic element, there might be more interesting areas to explore.

So I turned to look at data by country that was not confined to one type of crop. Our food travels all around the world (which still relates to emissions and global warming), so I was curious what it might show us about where our food comes from. I got a little overwhelmed in trying to parse all of the various metrics available. Should I look at which crops countries were using the most land to produce? How imports and exports vary? FAOStat has such extensive data on imports and exports that it took a long time to download and clean the dataset (I wound up downloading the pre-existing datasets by continent and combining them).

There is a lot there to explore in terms of layering data, so I am planning to stick with exploring how countries vary in their imports and exports.

I was proud of the Excel hack I used to pull out the “top 10” exports and imports for each country (I couldn’t find a built-in way to do this) so I’m outlining my process below for future reference:

Sort the rows first by “Element” (export quantity, export value, import quantity, import value), then by country, then by the value/amount for the most recent year of data available, largest first.

Insert three columns. In the first column, set up an IF formula that returns 0 when the country in the current row differs from the country in the row above (a new country), and 1 otherwise.

In the second column, set up an IF formula that returns 1 when the first column is 0, and otherwise adds 1 to the value in the row above. Copy both formulas down their entire columns.

Copy the second column and paste it into the third column as values. Then filter the third column to hide the rows numbered 1 through 10, delete all the rows that remain visible, and clear the filter. You are left with the top 10 values for each country.
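For future reference, the same ranking idea can be sketched outside of Excel. This is a minimal Python sketch, not the process I actually used: the row layout (element, country, item, value), the function name top_n_per_group, and the sample numbers are all made up for illustration. It restarts the counter for each element/country pair, which is what the sort order above produces in practice.

```python
# Hypothetical rows: (element, country, item, value). All numbers are made up.
rows = [
    ("Export Quantity", "Mexico", "Tomatoes", 900),
    ("Export Quantity", "Mexico", "Avocados", 700),
    ("Export Quantity", "Mexico", "Maize", 300),
    ("Export Quantity", "Canada", "Wheat", 800),
    ("Import Quantity", "Mexico", "Maize", 500),
]


def top_n_per_group(rows, n=10):
    """Keep the n largest-value rows for each (element, country) pair."""
    # Same sort as the spreadsheet: element, then country, then value (largest first).
    ordered = sorted(rows, key=lambda r: (r[0], r[1], -r[3]))
    kept = []
    previous_group = None
    rank = 0
    for row in ordered:
        group = (row[0], row[1])
        # Helper-column logic: restart at 1 for a new group, otherwise add 1.
        rank = 1 if group != previous_group else rank + 1
        previous_group = group
        # Equivalent of deleting every row ranked above n.
        if rank <= n:
            kept.append(row)
    return kept


top_two = top_n_per_group(rows, n=2)
```

With n=2, the third-ranked Mexican export (maize) is dropped while every other sample row survives.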

After looking at the resulting dataset, I realized that some of the “top” exports appeared to be composites of other top exports (e.g. “Beverages” might include “Beverages, distilled alcohol” and “Beverages, non alcoholic”), so I will have to go back to the definitions to make some further decisions about how to parse this data.

What design research question is guiding your project?

Where does our food come from, and where does it go?

What organization methods do you imagine leveraging in the data (LATCH)?

Location — since much of the data I will include is geographically driven, I will prioritize a location-based organization. Even if I abstract country shapes into circles to indicate relative amounts, for example, I will use their relative locations on a typical map to place and ground this abstracted data.
Categories — I think it might be helpful not just to see top imports and exports, but to categorize them to see what broader patterns might emerge (grains, beverages, vegetables, legumes, etc.), especially knowing that some of these categories/data points are already sums of individual crops.
Hierarchy — I am still considering what hierarchy will emerge as the most important one as I sift through this data — which countries are the top exporters? Top importers? Which are the crops with the top combined import/export value around the world? They all seem like interesting questions to consider.

What coordinate systems do you see emerging as logical and appropriate?

A geographical coordinate system will work well to explore where sugar is produced in the largest quantities. I am thinking a Cartesian coordinate system would work better for comparing imports and exports, in order to see side by side whether the countries that produce the least sugar import the most, or any other patterns that emerge. I’m also wondering if I’ll need to draw comparisons with world population to demonstrate that consumption is not increasing just because there are more people to consume sugar.

What may serve as a logical sequence for people to move through the content (narrative/indexical/combo)?

I’d like to set it up as a map where people have the option to turn various components on and off, so a fairly self-driven exploration. I think this means it will be more indexical in nature, but I will make sure the data included supports an overarching narrative.

Mid-Review

I found the feedback from mid-review to be very constructive and it gave me a lot to think about. I’ve inserted a few of my slides below alongside what I took away from the reviewers’ critiques. Overall, it was clear I am still in a stage of using visualization for my own understanding, and will eventually need to shift to visualization for the viewer’s understanding.

Because of how many variables I wanted to explore, the questions still need to be clarified. I think the reviewers first meant this from a mostly grammatical standpoint, but given other points of confusion, they need clarification in terms of the content, too.

While I’d predominantly considered how many variables could be layered at once, Stacie suggested we reconsider time as a more important factor in our visualizations, since viewers will be able to interact with them. This could help me to consider what people really need to see and compare, how I can guide more of a narrative, and what doesn’t need to be displayed at the same time.

Our class received repeated suggestions to consider the cognitive associations with the visual elements we’ve chosen to use, in order to reduce the amount of work viewers need to do (including referring to a key) to understand what variable each visual element represents. So for my dataset, an outline or a dot would be difficult to identify as “most”, whereas the brightest fill and largest size have clearer, more established associations with something being the most of its category. The visual layer was another indicator that I am currently looking at too many variables, and any kind of central question or exploration is getting lost in that breadth.

Many of us, myself included, received feedback that a map did not seem necessary to what we wanted to display, especially when there is one data point that is overwhelmingly larger than the rest. Instead, we could use a suggested map — where amounts by country are situated in relative positions that reflect the location of the countries — to avoid overlapping data and irrelevant details.

A couple of other questions that stuck with me from the reviewers’ critique were, what is the first thing you want people to see? What will speak the loudest? When do you need geographic coordinates versus when could polar or linear coordinates suffice?

I started incorporating this feedback by thinking about what a simpler way to ask my research question might be, landing on: what are the relationships between quantity, country of origin, and value of agricultural imports to the US? But even then, I got stuck on what changes to make next, especially knowing that I’d eventually need to write about my data for seminar, too.

I met with Stacie, who reiterated that the point of the visualization was to show how your system allows you to dive into your research question, and raise questions that could be probed further elsewhere. I realized that in analyzing the data, I kept coming across questions with complex answers (why do we import 3 tonnes of tomatoes from the Netherlands when we’re already importing 660k+ tonnes from Mexico?) that the numbers in themselves couldn’t answer. And that’s ok, it’s part of the reason why data visualization is important, but I had to focus on the sequential questions that the data could answer, too. These seem to come in the form of “where,” “what trends,” “what kinds,” and “how much,” for example, rather than “how” or “why”.

So I took another look at the questions I might ask of my data sequentially:

What are the categories of food that the US imports?
What quantities are they imported in?
What kind of foods are in each category?
How much of each food does the US import?
What countries does the food come from?
How much of the food comes from each country?

I did a couple more revisions of my narrative and indexical sequence in Miro:

The original, too-broad, partially complete journey

The stage of my narrative before my presentation

Breaking things down more sequentially after presenting

Prototyping for further adjustment and peer review

I shared sketches with Stacie as to how I was planning to make the sequence unfold over time, and she suggested moving back into Figma.

As I researched topics tangential to food imports, I found out that the US has been importing more food (both overall and proportionally) over the years — an interesting trend. I’d tried to shy away from incorporating data over time just given the sheer amount of it in the dataset, but it felt like too interesting of a question not to incorporate.

I exported the FAOStat dataset for all years available, looking just at the US’s reported imports of one crop, tomatoes. Not every country had data for every year, and there was just one column of data for all years. I used a few layers of formulas to transpose it all and began comparing the layers of data in Figma.
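The transposing step described above (one long column of values turned into one row per country with a column per year) could also be done in a few lines of code. This is a rough Python sketch rather than the formulas I used: the record layout (country, year, tonnes), the function name long_to_wide, and the sample values are assumptions for illustration. Years with no reported data are left as None instead of being guessed.

```python
# Hypothetical long-format records: (country, year, tonnes). Values are made up.
long_rows = [
    ("Mexico", 2016, 100.0),
    ("Mexico", 2017, 120.0),
    ("Canada", 2017, 40.0),
]


def long_to_wide(records):
    """Pivot (country, year, value) records into one row per country."""
    # Collect every year that appears anywhere, in ascending order.
    years = sorted({year for _, year, _ in records})
    table = {}
    for country, year, value in records:
        # Start each country with a blank (None) for every year,
        # so missing country/year pairs stay visible as gaps.
        row = table.setdefault(country, {y: None for y in years})
        row[year] = value
    return years, table


years, wide = long_to_wide(long_rows)
```

Here wide["Canada"][2016] stays None because Canada has no 2016 record in the sample, which mirrors the gaps in the real export.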

A few things emerged from this round of prototyping, which I shared with my classmates and with Stacie during class. First, the changes in coordinate systems were difficult to follow — if I felt it was important to have a map at the end (I did), I would want to standardize the coordinates for the rest of the progression through the data.

Second, the color palette was off — I’d found in trying to select a shade true to the content of each category, that none looked good together. But rather than this meaning I should choose a two-tone palette for categories more and less closely aligned to plants, Stacie suggested I might have too many categories. After working to simplify them more, I created two variations on a color palette for eight categories rather than thirteen, which I continued to refine.

The palette on the left, with slight revisions, became the palette for my final piece

As for the mapped portion of the visualization, one thing that I will take away from this data visualization project is that circles are hard to compare. The logarithmic scale was less well received than the literal scale because it did not show the magnitude of discrepancy between countries as drastically as the real values (even if the real values obscured the relative country locations more). Refining the transitions between coordinate systems would surely help with this, but I also needed to consider how to use labeling to alert viewers to the fact that they were now looking at a map, as well as what other forms I might use besides circles to indicate amounts. How could something somewhat abstracted still relate to the topic of international agricultural trade?
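To make the trade-off between the two scales concrete, here is a small Python sketch of how circle radii might be computed under each one. The function names, the 100-unit maximum radius, and the choice of log10(value + 1) are assumptions for illustration, not the exact settings from my prototype. The sample values are the tomato figures mentioned earlier, with 660k+ tonnes rounded to 660,000.

```python
import math


def radius_linear(value, max_value, max_radius=100.0):
    """Literal scale: circle AREA is proportional to the raw value."""
    return max_radius * math.sqrt(value / max_value)


def radius_log(value, max_value, max_radius=100.0):
    """Logarithmic scale: circle AREA is proportional to log10(value + 1)."""
    return max_radius * math.sqrt(
        math.log10(value + 1) / math.log10(max_value + 1)
    )


# Tomato imports in tonnes: Netherlands vs Mexico (660k+ rounded down).
small, large = 3, 660_000

linear_small = radius_linear(small, large)
log_small = radius_log(small, large)
```

On the literal scale the smaller circle shrinks to well under one unit of radius, so it effectively disappears. On the log scale it stays at roughly a third of the largest radius, which keeps it visible but hides how drastic the real gap is.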

I also began thinking the narrative didn’t make a lot of sense (or at least was less interesting) without chronological data at higher levels, too. I re-exported the data by year (in 5-year increments) for all the crops I was including, summing it by crop and by category. Again it took a few hours to transpose everything, but it gave me a dataset that actually felt more complete and interesting. I took another stab at adjusting the narrative structure again in Miro:

The small post-its at the top are my prior iteration

Final Presentation Takeaway

My final list of variables and ranges was as follows:

And I simplified my narrative structure from Miro in order to present it more easily alongside the coordinates I had chosen:

I put together an interactive prototype in XD for the final presentation. (This was the first time I’d used the smart animate feature on such a high number of elements, and I feel there’s more to learn with workarounds/hacks to make the transitions look even smoother.)

walkthrough.mp4

drive.google.com

The sequence I’d chosen to walk people through the data resonated well, as did the shipping container metaphor.

One thing that I’d struggled with in creating the prototype was how much to label what could be interacted with on screen — and it felt to viewers like this could have been clearer, that there were a lot of hidden screen elements in order to connect the layers.

While I had fun making the tomatoes and brainstorming how they could communicate proportional levels of scale, viewers felt the pictorial elements at the end were a bit disconnected from the level of fidelity of the prior screens. This led to one suggestion that I up the level of detail across the board, one reference to Farmville, and the idea that it might be best to use circles to abstract the scales a bit. I’d tried singular circles for the entire value imported by country, but not breaking down the scales like this (by log values), which could help fit everything on screen. I think the differences between 3 and 1.6 million will be very hard to represent proportionally regardless, but it is worth playing around with this more!

My biggest takeaway from moving through this project is that the stories data tell are not always obvious, especially when the datasets are very large! Going forward I will be able to better understand what kinds of questions data can answer (how much/many, when, what kinds, etc.) and what it can’t necessarily answer (why). I saw what a big difference introducing variables temporally could make in telling a story, and how grounding the visuals in something relevant to the content can help or hinder the viewer’s experience of interacting with it. I’d like to continue honing my understanding of the kinds of UI elements that can help viewers understand how to navigate a visualization.