The sunk-cost fallacy, one of many harmful cognitive biases that afflict all of us, refers to our tendency to devote time and resources to a lost cause because we have already spent — sunk — so much time in the pursuit. The sunk-cost fallacy leads us to stay in bad jobs, grind away at a project even after we know it won’t work, and, yes, continue to use the tedious, outdated Python plotting library Matplotlib when better alternatives exist.

After years of struggling with Matplotlib, I realized the only reason I continued to use it was the hundreds of hours I had sunk into learning the convoluted syntax. The complexity of Matplotlib had cost me hours of frustration on StackOverflow figuring out how to format dates or add a second y-axis. Fortunately, once I understood my reluctance to switch was irrational, I found there are now many easier-to-use alternatives for plotting in Python. After I explored the options, a clear winner emerged as measured by ease of use, documentation and functionality: the Plotly Python library.

4 Reasons to Switch to Plotly in Python

  1. Ability to efficiently create charts for rapid data exploration
  2. Interactivity for subsetting/investigating data
  3. Ability to show different data views to find relationships or outliers
  4. Customization of figures for presentations and reports

In this article, we’ll dive into Plotly, learning how to make interactive plots in less time than with Matplotlib, often with one line of code. This rapid iteration means we can more fully explore our data and use it to make better decisions, which is the ultimate point of data science.

All of the code for this article is available in a Jupyter notebook on GitHub. The charts are interactive, and since GitHub doesn’t render Plotly plots natively, you can explore the visuals on NBViewer.

Note: This article is intended to show the capabilities of Plotly and does not always follow the best visualization practices as laid out by Edward Tufte. An accessible, free, online book teaching these best practices is Fundamentals of Data Visualization by Claus Wilke.

[Figure: Example of Plotly figures]


 

Plotly Overview

The Plotly Python package is an open-source library built on plotly.js, which in turn is built on the powerful d3.js. We’ll be using Cufflinks, a lightweight wrapper around the core Plotly Python library that is designed to work natively with Pandas DataFrames.

In terms of abstraction, Cufflinks > Plotly > plotly.js > d3.js, which means we can work with Python code at a high level and still get the incredible interactive graphics capabilities of d3. Cufflinks can also be extended with the core Plotly library functionality for more detailed charts.

Note: The maker of the Python library is also called Plotly, a graphics company with several products and open-source tools. The Python library is free to use, and we can make unlimited charts in offline mode plus up to 25 charts in online mode to share with the world.

I did the work in this article in a Jupyter Notebook with Plotly + Cufflinks running in offline mode. Installing Plotly and Cufflinks is as simple as pip install cufflinks plotly, and the notebook shows how to import the libraries and set up offline mode. The data set for this article contains stats about my Medium articles, which you can also find on GitHub to follow along.

 

Single Variable Distributions: Histograms and Boxplots

Single variable (univariate) plots are a standard way to start a data analysis. We use the histogram to show a one-dimensional distribution (although it has some issues). First, let’s make an interactive histogram of the number of claps per article (claps being a form of appreciation for a Medium article).

Note: In the code, df refers to the Pandas DataFrame with the article stats.
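If you don’t have the original data at hand, a stand-in DataFrame is enough to try the plotting calls. The sketch below is only an assumption about the data’s shape: the column names are taken from the code in this article, while the values, the publication names and the helper make_demo_articles are made up for illustration.

```python
import numpy as np
import pandas as pd


def make_demo_articles(n=120, seed=0):
    """Build a synthetic stand-in for the article stats (all values are made up)."""
    rng = np.random.default_rng(seed)
    word_count = rng.integers(300, 6000, size=n)
    views = rng.integers(100, 50000, size=n)
    # Reads are a random fraction of views, so reads never exceed views
    reads = (views * rng.uniform(0.1, 0.6, size=n)).astype(int)
    return pd.DataFrame({
        "title": [f"Article {i}" for i in range(n)],
        # Hypothetical publication names
        "publication": rng.choice(
            ["Publication A", "Publication B", "Publication C"], size=n),
        "published_date": pd.date_range("2018-01-01", periods=n, freq="3D"),
        "word_count": word_count,
        "read_time": np.ceil(word_count / 265).astype(int),
        "views": views,
        "reads": reads,
        "read_ratio": 100 * reads / views,
        "claps": rng.integers(0, 5000, size=n),
        "fans": rng.integers(0, 800, size=n),
    })


df = make_demo_articles()
print(df.head())
```

Because the values are random, charts made from this frame won’t match the figures in the article, but the column names line up with most of the examples.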

df['claps'].iplot(
    kind='hist',
    bins=30,
    xTitle='claps',
    linecolor='black',
    yTitle='count',
    title='Claps Distribution')
[Figure: Histogram of article claps]

All Plotly graphs are interactive even though it’s hard to tell from the static image. Interactivity means we can rapidly explore the data, zoom in on points of interest and compare stats.


For those used to Matplotlib, all we have to do is add one more letter to our plotting code (iplot instead of plot), and we get a much better-looking and interactive chart! 

To compare two one-dimensional variables, we plot overlaid histograms. Here we graph the time of day at which I started writing articles and the time of day at which I published articles.

df[['time_started', 'time_published']].iplot(
    kind='hist',
    linecolor='black',
    bins=24,
    histnorm='percent',
    bargap=0.1,
    barmode='group',
    xTitle='Time of Day',
    yTitle='(%) of Articles',
    title='Time Started and Time Published')
[Figure: Histogram comparing the distribution of article start times and time of publication]

It looks like I tend to start writing articles later in the day (6-8 p.m.) and most frequently publish around 9 a.m. with a secondary peak at 10 p.m.
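The histnorm='percent' option rescales each bar to a share of all observations rather than a raw count. Here is a minimal sketch of that arithmetic in plain Pandas, assuming the start times are stored as whole hours of the day (an assumption, since the column format isn’t shown here) and using made-up values:

```python
import pandas as pd

# Hypothetical start times as hour of day (0-23); not the article's real data
time_started = pd.Series([9, 9, 10, 18, 19, 19, 19, 22])

# Share of articles started in each hour, expressed in percent
percent_by_hour = time_started.value_counts(normalize=True).sort_index() * 100
print(percent_by_hour)
```

Plotly picks its own bin edges, so the bars in the chart may group hours differently, but the percentages always sum to 100 for each variable.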

With a little bit of Pandas data manipulation, we can make a barplot:

# Resample to monthly frequency and plot
df2 = (
    df[['views', 'reads', 'published_date']]
    .set_index('published_date')
    .resample('M')
    .mean()
)


df2.iplot(kind='bar', xTitle='Date', yTitle='Average',
   title='Monthly Average Views and Reads')
[Figure: Bar plot of article views and reads by month over time]

Combining Pandas data manipulation with Plotly graphing means we can rapidly create many different graphs to explore our data from different perspectives.
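To make the resampling step concrete, here is a small sketch on made-up numbers showing what the monthly averages look like before they are plotted. The tiny stats frame is hypothetical, and the 'MS' (month start) alias is used here only because it behaves the same across Pandas versions; it labels each month by its first day instead of its last.

```python
import pandas as pd

# Hypothetical per-article stats; not the article's real data
stats = pd.DataFrame({
    "published_date": pd.to_datetime(["2018-01-05", "2018-01-20", "2018-02-10"]),
    "views": [100, 300, 50],
    "reads": [40, 60, 10],
})

# One row per month, each holding the mean of that month's articles
monthly = stats.set_index("published_date").resample("MS").mean()
print(monthly)
```

The two January articles collapse into a single row with their average, which is the value each bar in the chart represents.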

For a boxplot of the fans per article by publication, we pivot the data and then plot:

df.pivot(columns='publication', values='fans').iplot(
       kind='box',
       yTitle='fans',
       title='Fans Distribution by Publication')


Boxplots contain a lot of information on the distribution of a variable, and the interactive plot allows us to examine each of these values. We can also compare the distributions segmented by category (in this case the publication for the article).
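The pivot step is what turns one long column into one column per category, which is the shape a grouped boxplot needs. A minimal sketch with made-up fan counts and hypothetical publication names:

```python
import pandas as pd

# Hypothetical long-format data: one row per article
fans = pd.DataFrame({
    "publication": ["A", "B", "A", "B"],
    "fans": [10, 3, 25, 7],
})

# One column per publication; rows keep the original index, so each row
# has a value in exactly one column and NaN everywhere else
wide = fans.pivot(columns="publication", values="fans")
print(wide)
```

Each column is then drawn as its own box, and the NaN entries are simply ignored.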


 

Time-Series

A large portion of real-world data has a time element. Fortunately, Cufflinks was designed with time-series visualizations in mind. If we set the index of the data frame to a time-series and then plot other variables, Cufflinks will automatically plot a time series with correct date-time formatting on the x-axis.

# Set index to the publication date to get time-series plotting
df = df.set_index('published_date')

# Plot fans and word count over time
df[['fans', 'word_count', 'title']].iplot(
    y='fans',
    mode='lines+markers',
    secondary_y = 'word_count',
    secondary_y_title='Word Count',
    opacity=0.8,
    size=8,
    symbol=1,
    xTitle='Date',
    yTitle='Fans',
    text='title',
    title='Fans and Word Count over Time')

With this single line of code, we do the following:

  • Graph the fans and word count over time with points connected by lines

  • Add a secondary y-axis because our variables have different ranges

  • Add in the title of the articles as hover information

For more information, we can also add text annotations to a graph. Here we graph the monthly word count over time with annotations:

df_monthly_totals.iplot(
   mode='lines+markers+text',
   text=text,
   y='word_count',
   opacity=0.8,
   xTitle='Date',
   yTitle='Word Count',
   title='Total Word Count by Month')
[Figure: Time series of total word count by month with text annotations]
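The annotated plot relies on two objects that aren’t defined in the snippet, df_monthly_totals and text. One plausible way to build them, shown here purely as an assumption with made-up word counts, is to resample to monthly totals and format one label per month:

```python
import pandas as pd

# Hypothetical articles indexed by publication date
articles = pd.DataFrame(
    {"word_count": [1200, 800, 2500, 400]},
    index=pd.to_datetime(["2018-01-05", "2018-01-20", "2018-02-10", "2018-03-02"]),
)

# Total words published per month ('MS' labels each month by its first day)
df_monthly_totals = articles.resample("MS").sum()

# One annotation string per month, in the same order as the rows
text = [f"{int(w)} words" for w in df_monthly_totals["word_count"]]
print(text)
```

The list has to be the same length as the plotted series so that each label lands on its own point.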

For those who are so inclined, you can even make a pie chart to show the percentage of a variable in different categories:

df.groupby("publication", as_index=False)["word_count"].sum().iplot(
    kind="pie",
    labels="publication",
    values="word_count",
    title="Percentage of Words by Publication",
)
[Figure: Pie chart of words by publication]

Pie charts often get a bad rap in the data science community because it’s hard to compare pie slices. However, they still seem to be popular outside data science (especially in the C-suite), so I’m guessing we data analysts will have to keep making them. 


 

Two or More Variable Distributions

So far we’ve looked at graphs showing the distribution of one variable (histograms and boxplots) and the evolution of one variable over time (time-series line plots). Next, we’ll move to graphs with two or more variables. We’ll start with the scatterplot, a straightforward graph that allows us to see the relationship between two (or more) variables.

Let’s look at the relationship between the percentage of an article read and the estimated reading time of the article:

df.iplot(
    x='read_time',
    y='read_ratio',
    xTitle='Read Time',
    yTitle='Reading Percent',
    text='title',
    mode='markers',
    title='Reading Percent vs Reading Time')
[Figure: Percent of article read vs. reading time of article (in minutes)]

We can clearly see the decreasing percentage of the article read as the length increases. This must be evidence of the decline in attention span from the internet we’re always hearing about! 


With Cufflinks + Plotly, we can customize our scatterplots by changing the axis scale to log, adding a best fit (trend) line or showing a third variable by coloring the points. Here’s an example of the latter:

df.iplot(
    x='read_time',
    y='read_ratio',
    categories='publication',
    xTitle='Read Time',
    yTitle='Reading Percent',
    title='Reading Percent vs Read Time by Publication')
[Figure: Reading percent vs. read time of article colored by publication]

We change an axis to a log scale with a Plotly layout (see the Plotly documentation for the layout specifics) and specify bestfit=True to add in a trend line.

# Specify log x-axis using a layout
layout = dict(
    xaxis=dict(type='log', title='Word Count'),
    yaxis=dict(type='linear', title='views'),
    title='Views vs Word Count Log Axis')

df.sort_values('word_count').iplot(
    x='word_count',
    y='views',
    layout=layout,
    text='title',
    mode='markers',
    bestfit=True,
    bestfit_colors=['blue'])
[Figure: Article views vs. word count with log X-axis and trend line]

There doesn’t appear to be a strong relationship between views and word count, at least from this view.

Note: A log scale can reveal relationships that are hard to see on a linear scale, and it can also hide ones that a linear scale makes obvious.

We can show four or even five variables on the same chart by sizing points by a variable and using different shapes for different categories. However, charts with too much information can be difficult to make sense of, so don’t get carried away adding variables just because you can.

As with univariate distributions, we can combine Pandas data manipulation with Cufflinks to get more detailed views of the data. The following graph shows the cumulative views by publication.

df.pivot_table(
   values='views', index='published_date',
   columns='publication').cumsum().iplot(
       mode='markers+lines',
       size=8,
       symbol=[1, 2, 3, 4, 5],
       layout=dict(
           xaxis=dict(title='Date'),
           yaxis=dict(type='log', title='Total Views'),
           title='Total Views over Time by Publication'))
[Figure: Cumulative article views by publication with log Y-axis]
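It helps to look at the table that pivot_table followed by cumsum produces before it is plotted. A small sketch with made-up views and hypothetical publication names:

```python
import pandas as pd

# Hypothetical data: one article per day, alternating publications
views = pd.DataFrame({
    "published_date": pd.to_datetime(
        ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04"]),
    "publication": ["A", "B", "A", "B"],
    "views": [100, 50, 200, 25],
})

# One column per publication, then a running total down each column.
# Dates with no article in a publication stay NaN after cumsum.
cumulative = views.pivot_table(
    values="views", index="published_date", columns="publication").cumsum()

# Optional: carry the last known total forward to fill the gaps
filled = cumulative.ffill()
print(filled)
```

Whether to forward-fill is a design choice: leaving the NaN values keeps markers only on dates where that publication actually ran an article.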

See the notebook or the documentation for more examples of Cufflinks’s extra functionality. 


 

Advanced Plots

Now we’ll get into a few plots you probably won’t use all that often but are nevertheless visually striking. These graphs are not the mainstays of data exploration, but they can serve as an eye-catching plot to draw a viewer into a presentation. For these figures, we’ll use the plotly figure_factory module, another wrapper on the core Plotly library for more advanced visuals.

Scatter Matrix

When we want to explore relationships among many variables, a scatter matrix (also called a scatterplot matrix or splom) is a solid option:

 
import plotly.figure_factory as ff

figure = ff.create_scatterplotmatrix(
    df[['claps', 'publication', 'views', 'read_ratio', 'word_count']],
    height=1000,
    width=1000,
    text=df['title'],
    diag='histogram',
    index='publication')
[Figure: Scatterplot matrix of multiple variables]

This plot is fully interactive (see the notebook), which allows us to identify relationships between pairs of variables we can dive into even further (using the standard plots discussed above). The diagonal shows histograms for each variable, which can help us identify outliers in our data set that we’d need to address before further analysis or machine learning.

Correlation Heatmap

To visualize the correlations between numeric variables, we calculate the numeric correlations (Pearson correlation coefficient) and then make an annotated Plotly heatmap:

# Keep only numeric columns; recent Pandas versions raise on text columns
corrs = df.select_dtypes('number').corr()

figure = ff.create_annotated_heatmap(
    z=corrs.values,
    x=list(corrs.columns),
    y=list(corrs.index),
    colorscale='Earth',
    annotation_text=corrs.round(2).values,
    showscale=True, reversescale=True)
[Figure: Correlation heatmap between all numeric variables in the data set]

Correlation heatmaps, like scatterplot matrices, are helpful for identifying relationships between variables that we can analyze further using standard graphs or statistics. Relationships between variables are also crucial for machine learning because we need to use features with predictive power. There are several more complex plots available in figure_factory for data set exploration.
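The matrix behind the heatmap is easy to inspect on its own. The sketch below uses a tiny made-up table in which the relationships are exactly linear, so the Pearson coefficients come out at the extremes of the scale:

```python
import pandas as pd

# Hypothetical stats: read_time rises with word_count, read_ratio falls
stats = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "word_count": [500, 1000, 1500, 2000],
    "read_time": [2, 4, 6, 8],
    "read_ratio": [60, 45, 30, 15],
})

# Drop the text column, then compute pairwise Pearson correlations
corrs = stats.select_dtypes("number").corr()
print(corrs.round(2))
```

Real data will land somewhere between -1 and 1, and the heatmap colors encode where each pair falls in that range.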

 

Themes

Cufflinks has several themes we can use to apply different styling with no effort. For example, below we have a ratio plot in the “space” theme and a spread plot in “ggplot” (which may be familiar to those used to working in the R statistical language):

[Figure: The ratio of views to reads over time in the “space” theme]
[Figure: Spread between views and reads in the “ggplot” theme]

Cufflinks also supports making 3-D plots, although I generally advise against them (as do books on data visualization). Three-dimensional plots are difficult to comprehend and extract usable insights from. Graphs should never be more complicated than necessary, and most actionable information is going to come from easily understood charts showing only one or two variables.

With all the charts covered in this article, we are still not exploring the full capabilities of the library! I’d encourage you to check out the Plotly documentation to see the range of visualizations available and to look at some spectacular examples.

[Figure: Plotly interactive graphics of wind farms in United States]

Plotly allows us to make visualizations quickly for data exploration and helps us get better insight into our data through interactivity. Also, let’s admit it, plotting should be one of the most enjoyable parts of data science! With other libraries, plotting is a tedious task, but with Plotly, we regain the joy of making a great figure.

[Figure: A plot of my enjoyment with plotting with Python libraries over time]

 
