How to Define Empty Variables and Data Structures in Python

Missing values in large datasets are a common problem, and dealing with them is an essential skill for every programmer.

Written by Sadrach Pierre
Published on Sep. 08, 2022
Image: Shutterstock / Built In

Missing values are commonly encountered when processing large collections of data. A missing value can correspond to an empty variable, an empty list, an empty dictionary, a missing element in a column, an empty dataframe, or even an invalid value. Defining empty variables and data structures is an essential part of handling missing or invalid values. This is important for tasks such as variable initialization, type checking and specifying function default arguments. 

For variables such as floats, integers, booleans, and strings, invalid types can often cause code to fail or raise errors. This can make programs crash midway through a large processing job, wasting significant time and computational resources. Being able to define functions with sensible default values, so that they return consistent, expected output without errors, is an essential skill for every programmer and can save the engineer or data scientist many headaches down the road.

Some of the topics we will cover include:

Table of Contents: How to Define Empty Variables and Data Structures in Python

  1. Empty variables with None and NaN
  2. Empty lists for initialization
  3. Empty dictionaries for initialization
  4. Empty dataframes for initialization
  5. NaN default function arguments
  6. Empty list default function arguments
  7. Empty dictionary default function arguments
  8. Empty dataframe default function arguments
  9. Other uses for empty variables

 

Defining Empty Variables with None and NaN

Defining empty variables in Python is straightforward. If you wish to define a placeholder for a missing value that will not be used for calculations, you can define an empty variable using the None keyword. This is useful because it clearly indicates that the value for a variable is missing or not valid.

For example, let’s say we have demographic data with sets of values for age, income (USD), name and senior citizen status:

age1 = 35

name1 = "Fred Philips"

income1 = 55250.15

senior_citizen1 = False



age2 = 42

name2 = "Josh Rogers"

income2 = 65240.25

senior_citizen2 = False



age3 = 28

name3 = "Bill Hanson"

income3 = 79250.65

senior_citizen3 = False

There may be instances where some information may be missing or include an invalid value. For example, it is possible that we may receive data with invalid values such as a character or string for the age or a floating point or integer for the name. This can occur often with web applications that use free-text user input boxes. If the app isn’t able to detect and alert the user to an invalid input value, it will include the invalid values in its database. Consider the following example:

age4 = "#"

name4 = 100

income4 = 45250.65

senior_citizen4 = "Unknown"

For this person, we have an age value of “#,” which clearly is invalid. Furthermore, the name entered is a number, which also doesn’t make sense. Finally, for our senior_citizen variable we have “Unknown.” If we are interested in keeping this data, since the income is valid, it is best to define age, name and senior_citizen as empty variables using the None keyword.

age4 = None

name4 =  None

income4 = 45250.65

senior_citizen4 = None

This way any developer looking at the data will clearly understand the valid values for age, name and senior_citizen are missing. Further, the income values can still be used to calculate statistics since all of the values for income are present.
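As a small illustration of how such placeholders can be detected later, the sketch below collects the fields that are missing for this person. The `person4` dictionary and the `missing_fields` name are hypothetical, introduced only for this example.

```python
# Hypothetical record for the fourth person, using None for missing fields.
person4 = {"age": None, "name": None, "income": 45250.65, "senior_citizen": None}

# "is None" is the idiomatic identity test for the None singleton.
missing_fields = [field for field, value in person4.items() if value is None]

print("Missing fields:", missing_fields)
# Missing fields: ['age', 'name', 'senior_citizen']
```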

A limitation of the None keyword is that it can’t be used in calculations. For example, suppose we wanted to calculate the average age for the four instances we have defined:

avg_age = (age1 + age2 + age3 + age4)/4

If we try to run our script it will throw the following error:

    avg_age = (age1 + age2 + age3 + age4)/4
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

This is a type error stating that we are unable to use the ‘+’ operator (addition) between integers and None values.

We can remedy this by using the NaN (not a number) values from NumPy as our missing value placeholder:

import numpy as np


age4 = np.nan

name4 = np.nan

income4 = 45250.65

senior_citizen4 = np.nan

avg_age = (age1 + age2 + age3 + age4)/4

Now the code runs successfully. Because the calculation includes a NaN, the result is also NaN. This is especially useful when dealing with data structures such as dataframes, since libraries such as pandas provide methods for handling NaN values directly.
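To make the NaN behavior concrete, here is a minimal sketch, assuming the four ages from the example are collected into a list. The names `plain_mean` and `nan_aware_mean` are hypothetical. It shows that ordinary arithmetic propagates NaN, while NumPy's `np.nanmean` skips NaN entries.

```python
import numpy as np

ages = [35, 42, 28, np.nan]

# Ordinary arithmetic propagates NaN: the plain mean is NaN.
plain_mean = sum(ages) / len(ages)

# np.nanmean ignores NaN entries and averages the remaining values.
nan_aware_mean = np.nanmean(ages)

# NaN never compares equal to itself, so test for it with np.isnan.
print(np.nan == np.nan)       # False
print(np.isnan(plain_mean))   # True
print(nan_aware_mean)         # 35.0
```

Which of the two means is appropriate depends on whether a missing value should invalidate the result or simply be skipped.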

In addition to defining empty variables, it is often useful to store empty data structures in variables. This has many uses but we will discuss how default empty data structures can be used for type checking. 


 

Defining an Empty List for Initialization 

The simplest application of storing an empty list in a variable is for initializing a list that will be populated. For example, we can initialize a list for each of the attributes we defined earlier (age, name, income, senior_citizen):

ages = []

names = []

incomes = []

senior_citizen = []

These empty lists can then be populated using the append method:

ages.append(age1)

ages.append(age2)

ages.append(age3)

ages.append(age4)


print("List of ages: ", ages)

We can do the same for name, income, and senior status:

names.append(name1)

names.append(name2)

names.append(name3)

names.append(name4)


print("List of names: ", names)



incomes.append(income1)

incomes.append(income2)

incomes.append(income3)

incomes.append(income4)



print("List of incomes: ", incomes)



senior_citizen.append(senior_citizen1)

senior_citizen.append(senior_citizen2)

senior_citizen.append(senior_citizen3)

senior_citizen.append(senior_citizen4)



print("List of senior citizen status: ", senior_citizen)

Which returns

List of ages: [35, 42, 28, nan]
List of names: ['Fred Philips', 'Josh Rogers', 'Bill Hanson', nan]
List of incomes: [55250.15, 65240.25, 79250.65, 45250.65]
List of senior citizen status: [False, False, False, nan]
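When the values arrive as records rather than as separately named variables, the same empty-list initialization can be combined with a loop. The sketch below is an illustration under that assumption: the `records` list is hypothetical, and `math.nan` is used as a standard-library stand-in for `np.nan`.

```python
import math

# Hypothetical records: (age, name, income, senior_citizen); NaN marks missing values.
records = [
    (35, "Fred Philips", 55250.15, False),
    (42, "Josh Rogers", 65240.25, False),
    (28, "Bill Hanson", 79250.65, False),
    (math.nan, math.nan, 45250.65, math.nan),
]

# Start from empty lists and fill them in a single pass over the records.
ages, names, incomes, senior_citizen = [], [], [], []
for age, name, income, is_senior in records:
    ages.append(age)
    names.append(name)
    incomes.append(income)
    senior_citizen.append(is_senior)

print("List of incomes: ", incomes)
```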


 

Defining an Empty Dictionary for Initialization

We can also use an empty dictionary for initialization:

demo_dict = {}

And use the lists we populated earlier to populate the dictionary:

demo_dict['age'] = ages

demo_dict['name'] = names

demo_dict['income'] = incomes

demo_dict['senior_citizen'] = senior_citizen


print("Demographics Dictionary")

print(demo_dict)

Which returns:

Demographics Dictionary
{'age': [35, 42, 28, nan], 'name': ['Fred Philips', 'Josh Rogers', 'Bill Hanson', nan], 'income': [55250.15, 65240.25, 79250.65, 45250.65], 'senior_citizen': [False, False, False, nan]}


 

Defining an Empty Dataframe for Initialization

We can also do something similar with dataframes:

import pandas as pd 


demo_df = pd.DataFrame()


demo_df['age'] = ages

demo_df['name'] = names

demo_df['income'] = incomes

demo_df['senior_citizen'] = senior_citizen


print("Demographics Dataframe")

print(demo_df)

Which will return:

An image of plain text showing the results of the program.
Image: Sadrach Pierre / Built In

Notice that the logic for populating dictionaries and dataframes is similar. Which data structure you use depends on your needs as an engineer, analyst, or data scientist. For example, dictionaries are more useful if you’d like to produce JSON files and don’t need array lengths to be equal, while dataframes are more useful for generating CSV files.
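The JSON and CSV distinction can be sketched as follows. This is a minimal illustration with made-up data: `ragged_dict`, `json_text`, and `csv_text` are hypothetical names, and the values are a subset of the example above.

```python
import json
import pandas as pd

# A dictionary can hold lists of different lengths and still serialize to JSON.
ragged_dict = {"age": [35, 42, 28], "income": [55250.15, 65240.25]}
json_text = json.dumps(ragged_dict)
print(json_text)

# A dataframe built from a dictionary needs equal-length columns;
# to_csv returns the CSV text as a string when no file path is given.
demo_df = pd.DataFrame({"age": [35, 42], "income": [55250.15, 65240.25]})
csv_text = demo_df.to_csv(index=False)
print(csv_text)
```

Passing `ragged_dict` directly to `pd.DataFrame` would raise a `ValueError`, because its lists have different lengths.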


 

NaN Default Function Arguments

Another use for defining empty variables and data structures is for default function arguments.

For example, consider a function that calculates income after federal tax. The tax rate for the range of incomes we’ve defined so far is around 22 percent. We can define our function as follows:

def income_after_tax(income):

    after_tax = income - 0.22*income

    return after_tax

If we call our function with income this way:

after_tax1 = income_after_tax(income1)

print("Before: ", income1)

print("After: ", after_tax1)

And print the results we get the following:

Before: 55250.15
After:  43095.117

This works fine for this example, but what if we have an invalid value for income like an empty string? Let’s pass in an empty string and try to call our function:

after_tax_invalid = income_after_tax('')

TypeError: can't multiply sequence by non-int of type 'float'

We get a type error stating that we can’t multiply a sequence (the empty string) by a non-integer of type float. The function call fails and after_tax_invalid never gets defined. Ideally, we would like to guarantee that the function runs for any value of income and that the result at least gets defined with some default value. We can do this by defining a default NaN argument for after_tax and type checking the income. We only calculate after_tax if income is a float; otherwise, after_tax is NaN:

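def income_after_tax(income, after_tax = np.nan):

    # isinstance checks the value's type; "income is float" would compare against the type object itself

    if isinstance(income, float):

        after_tax = income - 0.22*income

    return after_tax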

We can then pass any invalid value for income and we will still be able to run our code successfully:

after_tax_invalid1 = income_after_tax('')

after_tax_invalid2 = income_after_tax(None)

after_tax_invalid3 = income_after_tax("income")

after_tax_invalid4 = income_after_tax(True)

after_tax_invalid5 = income_after_tax({})


print("after_tax_invalid1: ", after_tax_invalid1)

print("after_tax_invalid2: ", after_tax_invalid2)

print("after_tax_invalid3: ", after_tax_invalid3)

print("after_tax_invalid4: ", after_tax_invalid4)

print("after_tax_invalid5: ", after_tax_invalid5)

Returning:

after_tax_invalid1: nan
after_tax_invalid2: nan
after_tax_invalid3: nan
after_tax_invalid4: nan
after_tax_invalid5: nan

The reader may wonder why an invalid value is passed to a function to begin with. In practice, function calls are often made on thousands-to-millions of user inputs. If the user input is a free-text response, and not a dropdown menu, it is difficult to guarantee that the data types are correct unless it is explicitly enforced by the application. Because of this, we’d want to be able to process valid and invalid inputs without the application crashing or failing.
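One design point worth noting: a check that accepts only floats also returns NaN for a whole-number income such as 50000. The sketch below is one possible variant, not the article's function. It accepts both integers and floats while still rejecting booleans. The names `income_after_tax_numeric` and `TAX_RATE` are hypothetical, and `math.nan` is a standard-library stand-in for `np.nan`.

```python
import math

TAX_RATE = 0.22  # the rate used in the example above

def income_after_tax_numeric(income, after_tax=math.nan):
    """Return after-tax income, or NaN when income is not a real number."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    is_number = isinstance(income, (int, float)) and not isinstance(income, bool)
    if is_number:
        after_tax = income - TAX_RATE * income
    return after_tax

print(income_after_tax_numeric(50000))     # integer income is accepted
print(income_after_tax_numeric(True))      # nan
print(income_after_tax_numeric("income"))  # nan
```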


 

Empty List Default Function Arguments

Defining empty data structures as default arguments can also be useful. Let’s consider a function that takes our list of incomes and calculates the after-tax income.

def get_after_tax_list(input_list):

    out_list = [x - 0.22*x for x in input_list]

    print("After Tax Incomes: ", out_list)

If we call this with our incomes list, we get:

get_after_tax_list(incomes)

After Tax Incomes:  [43095.117, 50887.395000000004, 61815.507, 35295.507]

If we call this with a value that is not a list, for example an integer, we get:

get_after_tax_list(5)

     out_list = [x - 0.22*x for x in input_list]

TypeError: 'int' object is not iterable

Now, if we include an empty list as the default value for our output list and only calculate the after-tax incomes when the input is actually a list, our script runs successfully:

get_after_tax_list(5)

After Tax Incomes: []
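The revised function is not shown above. A minimal sketch of what it might look like follows, assuming a list type check like the ones used elsewhere in this article and a `None` sentinel standing in for the empty-list default. It is an illustration, not necessarily the author's exact code.

```python
def get_after_tax_list(input_list, out_list=None):
    """Print and return after-tax incomes, falling back to an empty list."""
    # Assumption: None is used as the sentinel so each call gets a fresh list.
    if out_list is None:
        out_list = []
    # Only run the list comprehension when the input really is a list.
    if isinstance(input_list, list):
        out_list = [x - 0.22 * x for x in input_list]
    print("After Tax Incomes: ", out_list)
    return out_list

get_after_tax_list(5)  # prints: After Tax Incomes:  []
```

Returning the list as well as printing it lets the caller keep working with the result.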


 

Empty Dictionary Default Function Arguments

Similar to defining default arguments as empty lists, it is also useful to define functions with empty dictionary default values. Let’s define a function that takes an input dictionary, like the demo_dict that we defined earlier, and it will return a new dictionary with the mean income.

def get_income_truth_values(input_dict):

    output_dict= {'avg_income': np.mean(input_dict['income'])}

    print(output_dict)

    return output_dict

Let’s call our function with demo_dict.

get_income_truth_values(demo_dict) 

{'avg_income': 61247.924999999996}

Now let’s try passing in an invalid value for input_dict. Let’s pass the integer value 10000:

get_income_truth_values(10000) 

    output_dict= {'avg_income': np.mean(input_dict['income'])}

TypeError: 'int' object is not subscriptable

We get a type error stating that the integer object, 10000, is not subscriptable. We can correct this by checking if the type of our input is a dictionary, checking if the appropriate key is in the dictionary and setting a default argument for our output dictionary that will be returned if the first two conditions are not met. This way, if the conditions are not met, we can still run our code without getting an error. For our default argument we will simply specify an empty dictionary for the output_dict.

def get_income_truth_values(input_dict, output_dict={}):

    if type(input_dict) is dict and 'income' in input_dict:

        output_dict= {'avg_income': np.mean(input_dict['income'])}

    print(output_dict)

    return output_dict 

And we can make the same function calls successfully.

get_income_truth_values(10000) 

We can also define a default dictionary with an NaN value for the avg_income. This way we will guarantee that we have a dictionary with the expected key, even when we call our function with an invalid input:

def get_income_truth_values(input_dict, output_dict={'avg_income': np.nan}):

    if type(input_dict) is dict and 'income' in input_dict:

        output_dict= {'avg_income': np.mean(input_dict['income'])}

    print(output_dict)

    return output_dict 


get_income_truth_values(demo_dict)      

get_income_truth_values(10000) 

Which will print:

{'avg_income': 61247.924999999996}
{'avg_income': nan}
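One caveat with a dictionary default: Python evaluates default arguments once, when the function is defined, so the same dictionary object is returned on every fallback call. The function above only reassigns `output_dict`, so it behaves as shown, but a caller that modifies the returned default would change it for later calls. A minimal sketch of the issue and of the common `None` sentinel alternative follows. The names `shared_default` and `fresh_default` are hypothetical, used only for illustration.

```python
def shared_default(output_dict={}):
    # The same dictionary object is reused on every call.
    return output_dict

first = shared_default()
first["avg_income"] = 0.0       # the caller mutates the returned default
print(shared_default())         # {'avg_income': 0.0}

def fresh_default(output_dict=None):
    # None as the sentinel: build a new dictionary inside the function.
    if output_dict is None:
        output_dict = {}
    return output_dict

second = fresh_default()
second["avg_income"] = 0.0
print(fresh_default())          # {}
```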


 

Empty DataFrame Default Function Arguments 

Similar to our examples with lists and dictionaries, a function with a default empty dataframe can be very useful. Let’s modify the dataframe we defined to include the state of residence for each person:

demo_df['state'] = ['NY', 'MA', 'NY', 'CA']

Let’s also impute the missing values for age and income using the mean:

demo_df['age'] = demo_df['age'].fillna(demo_df['age'].mean())

demo_df['income'] = demo_df['income'].fillna(demo_df['income'].mean())

Next let’s define a function that performs a groupby on the states and calculates the mean for the age and income fields. The result will give us the average age and income for each state:

def income_age_groupby(input_df):

    output_df = input_df.groupby(['state'])[['age', 'income']].mean().reset_index()

    print(output_df)

    return output_df

income_age_groupby(demo_df)

The result looks like:

An image of plain text showing the results of the function
Image: Sadrach Pierre / Built In
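Because the result above is shown only as an image, here is a self-contained sketch that rebuilds the example from the values used in this article. The `state_means` name is hypothetical. It assumes a recent pandas version, where columns are selected from a groupby with a list (double brackets).

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({
    "age": [35, 42, 28, np.nan],
    "income": [55250.15, 65240.25, 79250.65, 45250.65],
    "state": ["NY", "MA", "NY", "CA"],
})

# Impute the missing age with the column mean (assignment form, no inplace).
demo_df["age"] = demo_df["age"].fillna(demo_df["age"].mean())

# Select the columns with a list (double brackets) before aggregating.
state_means = demo_df.groupby("state")[["age", "income"]].mean().reset_index()
print(state_means)
```

The missing age is imputed as 35.0 (the mean of 35, 42, and 28), so the NY group, which holds ages 35 and 28, averages 31.5.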



 

As you’d likely be able to guess by this point, if we call our function with a data type that is not a dataframe, we will get an error. We get an attribute error stating that the list object has no attribute groupby. This makes sense since the groupby method belongs to dataframe objects:

income_age_groupby([1,2,3])

     output_df = input_df.groupby(['state'])[['age', 'income']].mean().reset_index()

AttributeError: 'list' object has no attribute 'groupby'

We can define a default data frame containing NaNs for each of the expected fields and check if the necessary columns are present:

def income_age_groupby(input_df, output_df = pd.DataFrame({'state': [np.nan], 'age': [np.nan], 'income':[np.nan]})):

    if isinstance(input_df, pd.DataFrame) and set(['age', 'income', 'state']).issubset(input_df.columns):

        output_df = input_df.groupby(['state'])[['age', 'income']].mean().reset_index()

    print(output_df)

    return output_df

    


income_age_groupby([1,2,3])

We see that our code ran successfully with the invalid data values.

An image of plain text showing the results of the function
Image: Sadrach Pierre / Built In

While we considered examples for data we made up, these methods can be extended to a variety of data processing tasks whether it be for software engineering, data science or machine learning. I encourage you to try applying these techniques in your own data processing code!

The code in this post is available on GitHub.


 

Other Uses for Empty Variables

As demonstrated above, defining empty variables and data structures is useful for tasks such as type checking and setting default arguments.

In terms of type checking, empty variables and data structures can be used to inform some control flow logic. For example, if presented with an empty data structure, perform “X” logic to populate that data structure.

With type checking and setting default arguments, there may be cases where the instance of an empty data structure should kick off some logic that allows a function call to succeed under unexpected conditions. For example, if you define a function that is called several times on different lists of floating point numbers and calculate the average, it will work as long as the function is provided with a list of numbers.

Conversely, if the function is provided with an empty list it will fail as it will be unable to calculate the average value of an empty list. Type checking and default arguments can then be used to try to calculate and return an average, and if it fails it returns a default value. This can help guarantee that code used for processing thousands of rows of data will always run successfully even under unexpected conditions. 
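The averaging scenario described above can be sketched as follows. This is a minimal illustration, assuming NaN is an acceptable default. The function name `safe_average` is hypothetical, and `math.nan` is a standard-library stand-in for `np.nan`.

```python
import math

def safe_average(values, default=math.nan):
    """Return the mean of a list of numbers, or default if it cannot be computed."""
    # Type check plus emptiness check: an empty list would divide by zero.
    if not isinstance(values, list) or not values:
        return default
    try:
        return sum(values) / len(values)
    except TypeError:
        # Non-numeric elements (for example strings) fall back to the default.
        return default

print(safe_average([1.0, 2.0, 3.0]))  # 2.0
print(safe_average([]))               # nan
print(safe_average("abc"))            # nan
```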
