Data Scientists, Your Variable Names Are a Mess. Clean Up Your Code.
Quick, what does the following code do?
for i in range(n):
    for j in range(m):
        for k in range(l):
            temp_value = X[i][j][k] * 12.5
            new_array[i][j][k] = temp_value + 150
It’s impossible to tell, right? If you were trying to modify or debug this code, you’d be at a loss unless you could read the author’s mind. Even if you were the author, a few days after writing this code you might not remember what it does because of the unhelpful variable names and use of magic numbers.
Working with data science code, I often see examples like the one above (or worse): code with variable names such as ii and numerous unnamed constant values. To put it frankly, data scientists (myself included) are terrible at naming variables.
Clear Variable Names in 3 Steps
- A variable name should describe the entity the variable represents.
- When writing your code, prioritize ease of reading over speed of writing.
- Use consistent standards throughout a project to minimize the cognitive burden of small decisions.
As I’ve grown from writing research-oriented data science code for one-off analyses to production-level code (at Cortex Building Intelligence), I’ve had to improve my programming by unlearning practices from data science books, courses and the lab. There are significant differences between deployable machine learning code and how data scientists learn to program, but we’ll start here by focusing on two common and easily fixable problems:
- Unhelpful, confusing or vague variable names
- Unnamed constants (magic numbers)
Both these problems contribute to the disconnect between data science research (or Kaggle projects) and production machine learning systems. Yes, you can get away with them in a Jupyter Notebook that runs once, but when you have mission-critical machine learning pipelines running hundreds of times per day with no errors, you have to write readable and understandable code. Fortunately, there are best practices from software engineering we data scientists can adopt, including the ones we’ll cover in this article.
Note: I’m focusing on Python since it’s by far the most widely used language in industry data science. Some Python-specific naming rules (see PEP 8 for more details) include:
- Variable and function names are in lower_case_with_underscores
- Named constants are in ALL_CAPITAL_LETTERS
- Classes are in CamelCase
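As a quick illustration of these three conventions, here is a minimal sketch. The names (SQFT_PER_SQUARE_METER, HouseListing, get_area_in_sqft) are made up for this example and don’t come from any real project.

```python
# Named constant: all capital letters with underscores.
# (Approximate conversion factor, used here only for illustration.)
SQFT_PER_SQUARE_METER = 10.7639


# Class: CamelCase.
class HouseListing:
    def __init__(self, area_in_square_meters):
        # Variable/attribute: lowercase with underscores.
        self.area_in_square_meters = area_in_square_meters

    # Function/method: lowercase with underscores.
    def get_area_in_sqft(self):
        return self.area_in_square_meters * SQFT_PER_SQUARE_METER
```

Because each kind of name has its own shape, a reader can tell at a glance whether they are looking at a constant, a class or an ordinary variable.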
There are three basic ideas to keep in mind when naming variables:
The variable name must describe the information represented by the variable. A variable name should tell you concisely in words what the variable stands for.
Your code will be read more times than it is written. Prioritize how easy your code is to read over how quick it is to write.
Adopt standard conventions for naming so you can make one global decision in a codebase instead of multiple local decisions.
What does this look like in practice? Let’s go through some improvements to variable names.
X and Y
If you’ve seen these several hundred times, you know they commonly refer to features and targets in a data science context, but that may not be obvious to other developers reading your code. Instead, use names that describe what these variables represent such as house_features and house_prices.
What does the value represent? It could stand for revenue_total, for example. A name such as value tells you nothing about the purpose of the variable and just creates confusion.
Even if you are only using a variable as a temporary value store, still give it a meaningful name. Perhaps it is a value where you need to convert the units, so in that case, make it explicit:
# Don't do this
temp = get_house_price_in_usd(house_sqft, house_room_count)
final_value = temp * usd_to_aud_conversion_rate

# Do this instead
house_price_in_usd = get_house_price_in_usd(house_sqft, house_room_count)
house_price_in_aud = house_price_in_usd * usd_to_aud_conversion_rate
usd, aud, mph, kwh, sqft
If you’re using abbreviations like these, make sure you establish them ahead of time. Agree with the rest of your team on common abbreviations and write them down. Then, in code review, make sure to enforce these written standards.
tp, tn, fp, fn
Avoid machine learning-specific abbreviations. These values represent true_positives, true_negatives, false_positives and false_negatives, so make it explicit. Besides being hard to understand, the shorter variable names can be mistyped. It’s too easy to use tp when you meant tn, so write out the whole description.
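To see how the written-out names read in practice, here is a small sketch that computes precision and recall from confusion-matrix counts. The function name and the example counts are invented for illustration, and the sketch assumes at least one predicted positive and one actual positive (it does not guard against division by zero).

```python
def get_precision_and_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from confusion-matrix counts."""
    predicted_positive_count = true_positives + false_positives
    actual_positive_count = true_positives + false_negatives
    precision = true_positives / predicted_positive_count
    recall = true_positives / actual_positive_count
    return precision, recall


# Made-up counts, passed by keyword so they cannot be swapped silently.
precision, recall = get_precision_and_recall(
    true_positives=80, false_positives=20, false_negatives=10
)
```

With tp, fp and fn, a transposed pair of letters would be easy to miss in review; with the full names, a mistake such as dividing by true_negatives stands out.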
The above are examples of prioritizing ease of reading code instead of how quickly you can write it. Reading, understanding, testing, modifying and debugging poorly written code takes far longer than well-written code. Overall, trying to write code faster by using shorter variable names will actually increase your program’s development and debugging time! If you don’t believe me, go back to some code you wrote six months ago and try to modify it. If you find yourself having to decipher your own past code, that’s an indication you should be concentrating on better naming conventions.
xs and ys
These are often used for plotting, in which case the values represent x_coordinates and y_coordinates. However, I’ve seen these names used for many other tasks, so avoid the confusion by using specific names that describe the purpose of the variables.
What Makes a Bad Variable Name?
Most problems with naming variables stem from:
A desire to keep variable names short
A direct translation of formulas into code
On the first point, while languages like Fortran did limit the length of variable names (to six characters), modern programming languages have no such practical limit, so don’t feel forced to use contrived abbreviations. Don’t use overly long variable names either, but if you have to favor one side, aim for readability.
With regard to the second point, when you write an equation or use a model (and this is a point schools forget to emphasize), remember that the letters or inputs represent real-world values!
Let’s see an example that makes both mistakes. Say we have a polynomial equation for finding the price of a house from a model. You may be tempted to write the mathematical formula directly in code:
temp = m1 * x1 + m2 * (x2 ** 2)
final = temp + b
This is code that looks like it was written by a machine for a machine. While a computer will ultimately run your code, it’ll be read by humans, so write code intended for humans!
To do this, we need to think not about the formula itself (the how) but about the real-world objects being modeled (the what). Let’s write out the complete equation (this is a good test to see if you understand the model):
house_price = (
    price_per_room * rooms
    + price_per_floor_squared * (floors ** 2)
)
house_price = house_price + expected_mean_house_price
If you are having trouble naming your variables, it means you don’t know the model or your code well enough. We write code to solve real-world problems, and we need to understand the problem our model represents.
Descriptive variable names let you work at a higher level of abstraction than a formula, helping you focus on the problem domain.
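One way to go a step further with the house price equation above is to wrap it in a function so the coefficients become named parameters. This is only a sketch: the function name, the room_count and floor_count spellings, and all of the numbers are assumptions made for illustration, not values from a fitted model.

```python
def predict_house_price(
    room_count,
    floor_count,
    price_per_room,
    price_per_floor_squared,
    expected_mean_house_price,
):
    """Polynomial house-price model: linear in rooms, quadratic in floors."""
    price_from_rooms = price_per_room * room_count
    price_from_floors = price_per_floor_squared * (floor_count ** 2)
    return price_from_rooms + price_from_floors + expected_mean_house_price


# Illustrative coefficients only; they are not fitted to any real data.
house_price = predict_house_price(
    room_count=4,
    floor_count=2,
    price_per_room=25_000,
    price_per_floor_squared=10_000,
    expected_mean_house_price=150_000,
)
```

Each intermediate name now says which part of the price it contributes, which makes it easier to spot a term that has been dropped or squared by mistake.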
Other Variable Naming Considerations
One of the important points to remember when naming variables is: consistency counts. Staying consistent with variable names means you spend less time worrying about naming and more time solving the problem. This point is relevant when you add aggregations to variable names.
Variable Names — Dos and Don’ts
- Use descriptive variable names
- Use function parameters or named constants instead of magic numbers.
- Describe what an equation or model represents with variable names.
- Put aggregations at the end of variable names.
- Use item_count instead of num.
- Use descriptive loop indexes instead of i, j, k.
- Adopt conventions for naming and formatting across a project.
- Don’t use machine-learning specific abbreviations.
Aggregations in Variable Names
So you’ve got the basic idea of using descriptive names, changing xs to distances, e to efficiency and v to velocity. Now, what happens when you take the average of velocity? Should this be velocity_mean or velocity_average? Following these two rules will resolve this situation:
Decide on common abbreviations: std for standard deviation and so on. Make sure all team members agree and write these down. (An alternative is to avoid abbreviating aggregations.)
Put the abbreviation at the end of the name. This puts the most relevant information, the entity described by the variable, at the beginning.
Following these rules, your set of aggregated variables might include names such as distance_max. Rule two is a matter of personal choice, and if you disagree, that’s fine. Just make sure you consistently apply the rule you choose.
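Here is a brief sketch of what aggregation-last naming can look like, using Python’s standard statistics module. The velocity readings are invented for the example, and the _mph suffix assumes the team has already agreed on that unit abbreviation.

```python
import statistics

# Hypothetical velocity readings in miles per hour.
velocities_mph = [52.0, 61.5, 58.0, 64.5]

# The entity (velocity) comes first and the aggregation comes last,
# so related names sort and autocomplete together.
velocity_mean = statistics.mean(velocities_mph)
velocity_min = min(velocities_mph)
velocity_max = max(velocities_mph)
velocity_std = statistics.stdev(velocities_mph)  # sample standard deviation
```

Typing velocity_ in an editor then surfaces every aggregate of the same quantity, which is a small practical payoff of the convention.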
A tricky point comes up when you have a variable representing the number of an item. You might be tempted to use building_num, but does that refer to the total number of buildings, or the specific index of a particular building?
To avoid ambiguity, use building_count to refer to the total number of buildings and building_index to refer to a specific building. You can adapt this to other problems such as item_count and item_index. If you don’t like count, then item_total is also a better choice than num. This approach resolves ambiguity and maintains the consistency of placing aggregations at the end of names.
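The count-versus-index distinction can be sketched as follows. The building names and the label format are hypothetical; enumerate is a built-in that yields each position together with its item.

```python
# Hypothetical data for illustration.
building_names = ["library", "gym", "lab"]

# _count is the total number of buildings.
building_count = len(building_names)

building_labels = []
# _index is the position of one specific building.
for building_index, building_name in enumerate(building_names):
    building_labels.append(
        f"{building_index + 1} of {building_count}: {building_name}"
    )
```

Neither name could be mistaken for the other, which is exactly the ambiguity that building_num leaves open.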
For some unfortunate reason, typical loop variables have become i, j and k. This may be the cause of more errors and frustration than any other practice in data science. Combine uninformative variable names with nested loops (I’ve seen nested loops that use ii, jj and even iii) and you have the perfect recipe for unreadable, error-prone code. This may be controversial, but I never use i or any other single letter for loop variables, opting instead for describing what I’m iterating over, such as:
for building_index in range(building_count):
    ...

for row_index in range(row_count):
    for column_index in range(column_count):
        ...
This is especially useful when you have nested loops so you don’t have to remember if i stands for row or column or if that was j or k. You want to spend your mental resources figuring out how to create the best model, not trying to figure out the specific order of array indexes.
(In Python, if you aren’t using a loop variable, then use _ as a placeholder. This way, you won’t get confused about whether or not the variable is used for indexing.)
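A minimal sketch of the underscore placeholder, assuming a toy task of simulating a handful of die rolls (SIMULATION_COUNT and simulated_die_rolls are names invented for this example):

```python
import random

SIMULATION_COUNT = 5
random.seed(42)  # fixed seed so the sketch is repeatable

# The loop variable is never read, so `_` signals that it is only a counter.
simulated_die_rolls = [random.randint(1, 6) for _ in range(SIMULATION_COUNT)]
```

A reader who sees _ knows immediately that nothing inside the loop depends on the iteration number.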
Variable Names — Conventions to Avoid
- Numerals in variable names
- Commonly misspelled words in English
- Names with ambiguous characters
- Names with similar meanings
- Abbreviations in names
- Names that sound similar to one another
All of these rules stick to the principle of prioritizing read-time understandability instead of write-time convenience. Coding is primarily a method for communicating with other programmers, so give your team members some help in making sense of your computer programs.
Never Use Magic Numbers
A magic number is a constant value without a variable name. I see these used for tasks like converting units, changing time intervals or adding an offset:
final_value = unconverted_value * 1.61
final_quantity = quantity / 60
value_with_offset = value + 150
(These variable names are all bad, by the way!)
Magic numbers are a large source of errors and confusion because:
- Only one person, the author, knows what they represent.
- Changing the value requires looking up all the locations where it's used and manually typing in the new value.
Instead of using magic numbers in this situation, we can define a function for conversions that accepts the unconverted value and the conversion rate as parameters:
def convert_usd_to_aud(price_in_usd, usd_to_aud_conversion_rate):
    price_in_aud = price_in_usd * usd_to_aud_conversion_rate
    return price_in_aud
If we use the conversion rate throughout a program in many functions, we could define a named constant in a single location:
USD_TO_AUD_CONVERSION_RATE = 1.61

def convert_usd_to_aud(price_in_usd):
    price_in_aud = price_in_usd * USD_TO_AUD_CONVERSION_RATE
    return price_in_aud
(Remember, before we start the project, we should establish with our team that usd = US dollars and aud = Australian dollars. Standards matter!)
Here’s another example:
# Conversion function approach
def get_revolution_count(minutes_elapsed, revolutions_per_minute):
    revolution_count = minutes_elapsed * revolutions_per_minute
    return revolution_count

# Named constant approach
REVOLUTIONS_PER_MINUTE = 60

def get_revolution_count(minutes_elapsed):
    revolution_count = minutes_elapsed * REVOLUTIONS_PER_MINUTE
    return revolution_count
Using a NAMED_CONSTANT defined in a single place makes changing the value easier and more consistent. If the conversion rate changes, you don’t need to hunt through your entire codebase to change all the occurrences, because you’ve defined it in only one location. It also tells anyone reading your code exactly what the constant represents. A function parameter is also an acceptable solution if the name describes what the parameter represents.
As a real-world example of the perils of magic numbers, in college, I worked on a research project with building energy data that initially came in 15-minute intervals. No one gave much thought to the possibility this could change, and we wrote hundreds of functions with the magic number 15 (or 96 for the number of daily observations). This worked fine until we started getting data in five- and one-minute intervals. We spent weeks changing all our functions to accept a parameter for the interval, but even so, we were still fighting errors caused by the use of magic numbers for months.
Real-world data has a habit of changing on you. Conversion rates between currencies fluctuate every minute and hard-coding in specific values means you’ll have to spend significant time re-writing your code and fixing errors. There is no place for magic in programming, even in data science.
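The interval problem described above can be sketched with a named constant plus a function parameter. This is not the code from that research project; MINUTES_PER_DAY, get_daily_observation_count and interval_minutes are names assumed here for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def get_daily_observation_count(interval_minutes):
    """Return how many readings arrive per day at the given interval."""
    if MINUTES_PER_DAY % interval_minutes != 0:
        raise ValueError("interval_minutes must divide evenly into a day")
    return MINUTES_PER_DAY // interval_minutes


# 15-minute data gives the familiar 96 observations per day.
daily_observation_count = get_daily_observation_count(interval_minutes=15)
```

When the data later arrives at five- or one-minute intervals, only the argument changes (giving 288 or 1,440 observations per day), and no literal 15 or 96 is left scattered through the codebase.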
The Importance of Standards and Conventions
The benefit of adopting standards is that they let you make a single global decision instead of many local ones. Instead of choosing where to put the aggregation every time you name a variable, make one decision at the start of the project, and apply it consistently throughout. The objective is to spend less time on concerns only peripherally related to data science (naming, formatting, style) and more time solving important problems (like using machine learning to address climate change).
If you are used to working by yourself, it might be hard to see the benefits of adopting standards. However, even when working alone, you can practice defining your own conventions and using them consistently. You’ll still get the benefits of fewer small decisions and it’s good practice for when you inevitably have to develop on a team. Anytime you have more than one programmer on a project, standards become a must!
You might disagree with some of the choices I’ve made in this article, and that’s fine! It’s more important to adopt a consistent set of standards than the exact choice of how many spaces to use or the maximum length of a variable name. The key point is to stop spending so much time on accidental difficulties and instead concentrate on the essential difficulties. (Fred Brooks, author of the software engineering classic The Mythical Man-Month, has an excellent essay on how we’ve gone from addressing accidental problems in software engineering to concentrating on essential problems).
Now let's go back to the initial code we started with and fix it up.
for i in range(n):
    for j in range(m):
        for k in range(l):
            temp_value = X[i][j][k] * 12.5
            new_array[i][j][k] = temp_value + 150
We’ll use descriptive variable names and named constants.
PIXEL_NORMALIZATION_FACTOR = 12.5
PIXEL_OFFSET_FACTOR = 150

for row_index in range(row_count):
    for column_index in range(column_count):
        for color_channel_index in range(color_channel_count):
            normalized_pixel_value = (
                original_pixel_array[row_index][column_index][color_channel_index]
                * PIXEL_NORMALIZATION_FACTOR
            )
            transformed_pixel_array[row_index][column_index][color_channel_index] = (
                normalized_pixel_value + PIXEL_OFFSET_FACTOR
            )
Now we can see that this code is normalizing the pixel values in an array and adding a constant offset to create a new array (ignore the inefficiency of the implementation!). When we give this code to our colleagues, they will be able to understand and modify it. Moreover, when we come back to the code to test it and fix our errors, we’ll know precisely what we were doing.
Clarifying your variable names may seem like a dry activity, but if you spend time reading about software engineering, you realize what differentiates the best programmers is the repeated practice of mundane techniques such as using good variable names, keeping routines short, testing every line of code, refactoring, etc. These are the techniques you need to take your code from research or exploration to production-ready and, once there, you’ll see how exciting it is for your data science models to influence real-life decisions.
This article was originally published on Towards Data Science.