How to Use JSON Schema to Validate JSON Documents in Python

In Python, we can use the JSON Schema library to validate a JSON document against a schema.

Written by Lynn Kwong
Published on Jul. 14, 2023
Image: Shutterstock / Built In

In Python, the JSON Schema library can be used to validate a JSON document against a schema. A JSON document can contain any number of key/value pairs. The key must be a string, but the value can be any supported type, such as string, number and boolean, etc. The value can even be complex types like an array or nested object. This makes the JSON document both very flexible and very unstructured. 

However, this flexibility can make data processing more difficult, because data teams often receive data from APIs whose responses are normally in JSON format. Having a consistent data format makes data pipelines more robust. With uniform input, you don’t need to worry about unexpected data types or spend too much time on data cleaning, so you can focus on data analysis and work more efficiently.

Python JSON Schema Definition

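JSON Schema is a declarative, language-independent vocabulary for describing and validating JSON documents. In Python, it is typically used through the jsonschema library. A schema is itself a JSON document made of key/value pairs, with each key defining part of the structure of some JSON data. JSON Schema is useful for enforcing data quality and for documenting what valid data looks like.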

In this post, we’ll introduce how to use JSON Schema to validate JSON documents. We’ll cover the essential concepts, as well as basic and advanced use cases along with simple code snippets that are easy to follow.

 

What Is JSON Schema?

A JSON Schema is a JSON document defining the schema of some JSON data. This definition may sound abstract, but it will make more sense once we see the code. For now, we need to understand two points:

  • A JSON Schema is a valid JSON document with key/value pairs. Each key has a special meaning and is used to define the schema of some JSON data.
  • A schema is similar to a table definition in a SQL database and defines the data types of the fields in a JSON document. It also defines which fields are required and which are optional.

Let’s get started with a simple JSON Schema:

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
    },
    "required": ["name"],
}

This JSON Schema specifies that the target JSON is an object with two properties, name and age, and that the name property is required. Properties are also commonly referred to as keys or fields, and the terms are used interchangeably in this post. Let’s dive a bit deeper into each validation keyword:

  • The type keyword specifies that the target JSON is an object. It can also be an array, which is normally an array of objects for API responses. We will discuss how to define the schema of an array field later in the article. In most cases, however, the top-level type is object.
  • The properties keyword specifies the schema for each field of the JSON object. Each field of the target JSON is specified as a key/value pair, with the key being the actual field name and the value being the type of the field in the target JSON. The type keyword for each field has the same meaning as the top-level one. The type here can also be object. In this case, the corresponding field would be a nested object, as will be demonstrated later.
  • The required keyword is an array containing the properties that are required to be present. If any property specified here is missing, a ValidationError will be raised.

Besides the essential validation keywords, namely type, properties and required specified above, there are other schema keywords that can be seen in online documentation and also in the JSON Schemas some tools automatically generate.


 

Python JSON Schema Keywords to Know

There are two schema keywords, namely $schema and $id. $schema declares which “draft” (version of the JSON Schema specification) the schema is written against. If $schema is not specified, the jsonschema library falls back to the latest draft it supports, which is normally what you want. As a beginner, it’s easy to get lost if you dive too deep into the drafts. We normally don’t need to touch the $schema field, but we’ll go over this concept at the end of this post.

On the other hand, $id defines a uniform resource identifier (URI) for the schema, which makes the current schema accessible externally by other schemas. If $id is not specified, then the current schema can only be used locally, which is typically desired for small projects. However, for bigger projects, your institution may have an in-house system for how to store the schemas and how to reference them. In this case, you can set the $id keyword accordingly.

There are two annotation keywords, namely title and description, which specify the title and description for the JSON Schema, respectively. They can be used for documentation and can make your schema easier to read and understand. They will also be displayed nicely by some graphical tools. For simplicity, they will not be specified in this post, but you should normally add them to your project for best practice.
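As a minimal sketch of how these four keywords look in practice, the schema below adds them to the example from earlier. The $id URI and the title and description texts are made up for illustration and are not part of the original example.

```python
from jsonschema import validate

schema = {
    # Declares the draft this schema is written against.
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    # Hypothetical URI, used here only for illustration.
    "$id": "https://example.com/schemas/person.json",
    # Annotation keywords: documentation only.
    "title": "Person",
    "description": "A person with a required name and an optional age.",
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Full name of the person."},
        "age": {"type": "number", "description": "Age in years."},
    },
    "required": ["name"],
}

validate(instance={"name": "John", "age": 30}, schema=schema)
# No error. The title and description do not affect validation.
```

Because title and description are annotations, removing them would not change which documents pass.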


 

How to Validate a JSON Document Against a JSON Schema in Python

In Python, we can use the jsonschema library to validate a JSON instance (also referred to as a JSON document as long as it’s unambiguous) against a schema. It can be installed with pip:

$ pip install jsonschema

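Let’s validate some JSON instances against the JSON Schema defined above. Strictly speaking, JSON is a text format, but the jsonschema library validates the deserialized data (Python dictionaries, lists, strings, numbers and so on) rather than the raw string. The examples therefore pass Python dictionaries directly, which is also more convenient.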

from jsonschema import validate

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
    },
    "required": ["name"],
}

validate(instance={"name": "John", "age": 30}, schema=schema)
# No error, the JSON is valid.

validate(instance={"name": "John", "age": "30"}, schema=schema)
# ValidationError: '30' is not of type 'number'

validate(instance={"name": "John"}, schema=schema)
# No error, the JSON is valid.

validate(instance={"age": 30}, schema=schema)
# ValidationError: 'name' is a required property

validate(instance={"name": "John", "age": 30, "job": "Engineer"}, schema=schema)
# No error, the JSON is valid. By default, additional fields are allowed.

This shows that the schema can be used to validate JSON instances as expected. An incorrect data type or a missing required field triggers a ValidationError. Note, however, that additional fields are allowed by default, which may or may not be what you want. If you want a strict schema that only allows the fields defined by the properties keyword, you can set additionalProperties to False:

from jsonschema import validate

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
    },
    "required": ["name"],
    "additionalProperties": False
}

validate(instance={"name": "John", "age": 30, "job": "Engineer"}, schema=schema)
# ValidationError: Additional properties are not allowed ('job' was unexpected)
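In a real pipeline, the data usually arrives as a JSON string and a failed validation should not crash the program. The following is a minimal sketch of that flow, assuming the strict schema above; the helper name check_json_text is hypothetical. It parses the text with json.loads and returns the message of the raised ValidationError.

```python
import json

from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
    },
    "required": ["name"],
    "additionalProperties": False,
}


def check_json_text(raw_text):
    """Return None if the JSON text is valid, otherwise the error message."""
    document = json.loads(raw_text)  # JSON string -> Python dict
    try:
        validate(instance=document, schema=schema)
    except ValidationError as err:
        return err.message
    return None


print(check_json_text('{"name": "John", "age": 30}'))
# None
print(check_json_text('{"name": "John", "age": "30"}'))
# '30' is not of type 'number'
```

Note that json.loads raises its own json.JSONDecodeError for malformed text, which this sketch does not handle.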

 

Define JSON Schema for an Array Field in Python

Even though it’s not so common to have an array as the top-level field, it’s very common to have it as a property. Let’s add an array property to our schema defined above. We need to set the type to be array and specify the type for each item with the items keyword:

from jsonschema import validate

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "scores": {
            "type": "array",
            "items": {"type": "number"},
        },
    },
    "required": ["name"],
}

validate(
    instance={"name": "John", "age": 30, "scores": [70, 90]}, schema=schema
)
# No error, the JSON is valid.

validate(
    instance={"name": "John", "age": 30, "scores": ["B", "A"]}, schema=schema
)
# ValidationError: 'B' is not of type 'number'.

validate(instance={"name": "John", "age": 30, "scores": []}, schema=schema)
# No error, the JSON is valid.

The type of the array elements is checked correctly. However, empty arrays are allowed by default. To change this behavior, we can set minItems to 1, or to whatever minimum makes sense for your case.

from jsonschema import validate

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "scores": {
            "type": "array",
            "items": {"type": "number"},
            "minItems": 1
        },
    },
    "required": ["name"],
}

validate(instance={"name": "John", "age": 30, "scores": []}, schema=schema)
# ValidationError: [] is too short

 

How to Define the JSON Schema for a Nested Object Field in Python

The type keyword of a property has the same meaning and syntax as the top-level one. Therefore, if the type of a property is object, then this property is a nested object. Let’s add an address property to our JSON data which will be a nested object:

from jsonschema import validate

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "scores": {
            "type": "array",
            "items": {"type": "number"},
        },
        "address": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "postcode": {"type": "string"},
            },
            "required": ["street"],
        },
    },
    "required": ["name"],
}

validate(
    instance={
        "name": "John",
        "age": 30,
        "scores": [70, 90],
        "address": {"street": "Wall Street 1", "postcode": "NY 10005"},
    },
    schema=schema,
)
# No error, the JSON is valid.

validate(
    instance={
        "name": "John",
        "age": 30,
        "scores": [70, 90],
        "address": {"postcode": "NY 10005"},
    },
    schema=schema,
)
# ValidationError: 'street' is a required property

The nested object field has exactly the same schema definition syntax as the top-level one. Therefore, it’s fairly straightforward to define the schemas for nested objects.
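With nested objects, it helps to know where in the document a failure occurred. The sketch below is one possible way to report that, using the absolute_path attribute of the raised ValidationError. The helper name describe_failure is hypothetical, and the schema is a trimmed version of the one above.

```python
from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "address": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "postcode": {"type": "string"},
            },
            "required": ["street"],
        },
    },
    "required": ["name"],
}


def describe_failure(document):
    """Return 'valid' or '<path>: <message>' for the reported error."""
    try:
        validate(instance=document, schema=schema)
    except ValidationError as err:
        # absolute_path holds the keys leading to the offending value.
        location = "/".join(str(part) for part in err.absolute_path)
        return f"{location or '<root>'}: {err.message}"
    return "valid"


print(describe_failure({"name": "John", "address": {"street": 5}}))
# address/street: 5 is not of type 'string'
print(describe_failure({"name": "John", "address": {"postcode": "NY 10005"}}))
# address: 'street' is a required property
```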

 

Use $defs to Avoid Code Duplication in JSON Schema

What if the address field needs to be used at multiple places in the same schema? If we copy the field definition wherever it’s needed, we end up with repeated code, which programmers hate because it violates the don’t repeat yourself (DRY) principle. In JSON Schema, we can use the $defs keyword to define small subschemas that can be referenced at other places to avoid code duplication. Let’s refactor our schema above with $defs:

from jsonschema import validate

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "scores": {
            "type": "array",
            "items": {"type": "number"},
        },
        "address": {"$ref": "#/$defs/address"},
    },
    "required": ["name"],
    "$defs": {
        "address": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "postcode": {"type": "string"},
            },
            "required": ["street"],
        },
    },
}

validate(
    instance={
        "name": "John",
        "age": 30,
        "scores": [70, 90],
        "address": {"street": "Wall Street 1", "postcode": "NY 10005"},
    },
    schema=schema,
)
# No error, the JSON is valid.


validate(
    instance={
        "name": "John",
        "age": 30,
        "scores": [70, 90],
        "address": {"postcode": "NY 10005"},
    },
    schema=schema,
)
# ValidationError: 'street' is a required property

The new schema using $defs to define a subschema works in the same way as before. However, it has the advantage that code duplication can be avoided if the address field needs to be used at different places of the same schema.
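To make the reuse concrete, here is a small sketch in which the same address subschema is referenced by two properties. The property names home_address and work_address and the helper first_error are hypothetical and not part of the original example.

```python
from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        # Both properties reference the same subschema.
        "home_address": {"$ref": "#/$defs/address"},
        "work_address": {"$ref": "#/$defs/address"},
    },
    "required": ["name"],
    "$defs": {
        "address": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "postcode": {"type": "string"},
            },
            "required": ["street"],
        },
    },
}


def first_error(document):
    """Return the validation error message, or None if the document is valid."""
    try:
        validate(instance=document, schema=schema)
    except ValidationError as err:
        return err.message
    return None


print(first_error({
    "name": "John",
    "home_address": {"street": "Wall Street 1"},
    "work_address": {"postcode": "NY 10005"},
}))
# 'street' is a required property
```

A change to the address definition, such as making postcode required, now only has to be made in one place.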


 

How to Set JSON Schema for a Tuple Field in Python

What if we want the scores field to be a tuple with a fixed number of elements? Unfortunately, there is no tuple type in JSON Schema, so we need to define a tuple as a constrained array. The general logic is that an array has regular items (items) and can optionally have some positionally defined items that come before them (prefixItems). For a tuple, there are only prefixItems and no further items, which gives the tuple its fixed number of elements. Importantly, the type of each tuple element must be defined explicitly.

If you want to define the schema for a tuple field, you would need to have some understanding of JSON Schema drafts, which is a bit more advanced. A draft is a standard or specification for the JSON Schema and defines how the schema should be parsed by a validator. There are several drafts available.

Normally, we don’t need to worry about the $schema field and the draft to be used. However, when we need to define a tuple field, it’s something that we should pay attention to.

If the jsonschema library installed is the latest version (v4.9.0 at the time of writing), then the latest draft (2020-12) will be used. If this is the version that you want, you don’t need to specify the draft by the $schema keyword. However, it’s best practice to always specify the version of the draft in your JSON Schema for clarity. We omitted it at the beginning of this post for simplicity, but it’s recommended to have it in practice.

On the other hand, if you want to use a different draft version rather than the latest one, you would need to specify the $schema keyword with the draft version explicitly. Otherwise, it won’t work properly.

Let’s define the schema for the scores field with drafts 2020-12 and 2019-09, respectively, and demonstrate how to use the $schema keyword and how to define a tuple field accordingly:

# Tuple defined with draft 2019-09:
schema_2019_09 = {
    "$schema": "https://json-schema.org/draft/2019-09/schema",  # This must be specified.
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "scores": {
            "type": "array",
            "items": [{"type": "number"}, {"type": "number"}],
            "minItems": 2,
            "additionalItems": False,
        },
    },
    "required": ["name"],
}

validate(
    instance={"name": "John", "age": 30, "scores": [70, 80, 90]},
    schema=schema_2019_09,
)
# ValidationError: Additional items are not allowed (90 was unexpected)

# Tuple defined with draft 2020-12:
schema_2020_12 = {
    "$schema": "https://json-schema.org/draft/2020-12/schema", # This is the default and thus optional.
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "scores": {
            "type": "array",
            "prefixItems": [{"type": "number"}, {"type": "number"}],  # Use prefixItems rather than items.
            "minItems": 2,
            "items": False, # Use items rather than additionalItems.
        },
    },
    "required": ["name"],
}

validate(
    instance={"name": "John", "age": 30, "scores": [70, 80, 90]},
    schema=schema_2020_12,
)
# ValidationError: Expected at most 2 items, but found 3

As we can see, the schema definition for the tuple field with draft 2020-12 is more intuitive thanks to the prefixItems and items keywords, so it is the recommended approach. For a more detailed explanation of the changes from 2019-09 to 2020-12 regarding the tuple field definition, please review the release notes.

Besides, it should be noted that even if we want the scores field to be a tuple, it must be specified as an array (list in Python) rather than a tuple for the validator. Otherwise, it won’t work.

validate(
    instance={"name": "John", "age": 30, "scores": (70, 80)},
    schema=schema_2020_12,
)

# ValidationError: (70, 80) is not of type 'array'

 

Using a Validator to Validate Multiple JSON Documents Efficiently in Python

If you have a valid JSON Schema and want to use it to validate many JSON documents, then it’s recommended to use the Validator.validate method, which is more efficient than the jsonschema.validate API because jsonschema.validate also checks the schema itself on every call. A validator is a special class implementing a specific draft. For example, there are Draft202012Validator, Draft201909Validator and Draft7Validator, etc. If no draft version is specified in the class name, Validator itself means the protocol (similar to an interface) to which all validator classes should adhere.

Besides the Validator.validate method, which works similarly to the jsonschema.validate API, you can use Validator.check_schema to check if a schema is valid against a specific draft. You can also use Validator.is_valid to quietly check whether a JSON document is valid, with no ValidationError raised if it is invalid. Let’s demonstrate the usage of these methods with some simple examples:

from jsonschema import Draft202012Validator

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
    },
    "required": ["name"],
}

Draft202012Validator.check_schema(schema)
# No output means the schema is valid, otherwise `SchemaError` will be raised.

# Create an instance of the validator using a valid schema.
draft_202012_validator = Draft202012Validator(schema)

instance1 = {"name": "John", "age": 30}
instance2 = {"name": "John", "age": "30"}
instance3 = {"age": 30}

# Use the same instance of the validator to check JSON documents
# against the same schema more efficiently.
draft_202012_validator.is_valid(instance1)
# True
draft_202012_validator.is_valid(instance2)
# False
draft_202012_validator.is_valid(instance3)
# False

draft_202012_validator.validate(instance1)
# No output, the JSON is valid.
draft_202012_validator.validate(instance2)
# ValidationError: '30' is not of type 'number'
draft_202012_validator.validate(instance3)
# ValidationError: 'name' is a required property

In this post, we have introduced what a JSON Schema is and how to use it to validate different data types in a JSON document. We have covered the fundamentals for basic data types like strings and numbers, as well as complex ones like arrays and nested objects. We’ve also learned how to avoid code duplication with the $defs keyword which is used to define subschemas and can be handy for complex schemas. Last but not least, the basics of drafts are introduced.

We now know how to define the schema of a tuple field with different drafts and how to validate multiple JSON documents against the same schema more efficiently with a validator using a specific draft.
