How to Use GraphFrames for Entity Relationships on Databricks

In this article, we’ll review a generic joins framework for entity relationship using GraphFrames on Databricks. The framework addresses the challenge of maintaining isolated data processing jobs by providing a scalable approach to entity relationships or generic joins. We demonstrate how GraphFrames can efficiently handle complex relationships between entities, supporting up to two levels of joins through a streaming architecture. The implementation significantly reduces development time for new data integration requirements while improving overall system performance.

More on Entity RelationshipsWhat Is an Entity Relationship Diagram (ERD)?

The Value of Contextualized Data

Enterprise knowledge graphs typically store a vast array of disconnected entities. When applications require contextualized data, such as calendar events with their attendees and attachments or content pages with creators and subpages, developers often resort to creating one-off solutions involving complex joins, aggregations and data sinks. These custom solutions become difficult to maintain as product requirements evolve and new entity relationship types emerge.

We have used GraphFrames, a package for Apache Spark that provides DataFrame-based graphs. GraphFrames extends the functionality of GraphX while taking advantage of Spark DataFrames. The extended capabilities include motif finding, DataFrame-based serialization and highly expressive graph queries, making it ideal for our entity relationship management needs.

Problems With Contextualized Data in Databricks

On our Databricks platform, we have persistent knowledge data, but retrieving contextualized information (e.g., calendar events with attendees) requires building individual jobs that perform joins, aggregation and output to external systems. These jobs presented several challenges:

High maintenance overhead due to isolated implementations.
Inefficiency as product requirements evolved to include more relationship types.
Redundant code across similar processing tasks.
Lack of standardization in how entity relationships are processed.
Performance issues with complex multi-level relationships.

Two-Level Generic Joins Framework

Our solution implements a generic framework using GraphFrames to efficiently handle entity relationships. The current implementation supports up to two levels of joins (though the architecture could extend further), enabling complex relationship patterns such as:

GoogleCalendarEvent with attendees and attachments
Google Document with comments and authors

Core Components

The framework consists of five primary steps implemented as a Databricks job:

5 Steps to a Generic Joins Framework

Streaming query.
Batch processing.
Motif pattern finding.
Grouped view generation.
Delta merge.

A Databricks chart — Image: Screenshot by the author.

1. Streaming Query

Processes streaming updates for nodes that have changed since the last batch.
Monitors edge changes (fromNodeId, toNodeId) since the last batch.
Joins the two streams and processes batches together to ensure consistency.

2. Batch Processing

Reads consistent nodes and edges with appropriate filters to reduce data set size.
Applies unique user ID filtering to limit the context to relevant data.
Handles data deduplication based on event timestamps and data sources.
Manages deletes as soft deleted records to publish changes related to deleted nodes.

3. Motif Pattern Finding

Constructs a graph from the processed nodes and edges using GraphFrames.
Executes motif queries to find one and two-level join patterns.
Identifies all affected paths for nodes that changed in the current batch.
Determines bidirectional impact (upstream and downstream) for comprehensive updates. For example, node one change could impact upstream nodes two through five and downstream nodes seven through 11 for two level joins.

A Databricks diagram — Image: Screenshot by the author.

4. Grouped View Generation

Filters out deleted edges and nodes from the motif findings.
Maintains a record of all impacted nodes to ensure complete update propagation.
Generates a consolidated view for all identified graph IDs.
Processes isolated nodes (those with no connections) separately.
Publishes both connected and isolated nodes to ensure comprehensive updates.

5. Delta Merge

Collects the grouped view into a unified contextualized entity format.
Merges this data to a Delta table with upsert semantics.
Ensures exactly one row per node ID in the final output.
Applies change detection logic to only persist actual changes.
Optimizes downstream data consumption by eliminating redundant updates.

Data in the NewsWhat TikTok’s Uncertain Future Means for Data Privacy and Platform Regulation

Implementation Details

1. Graph Construction

The framework builds a graph representation using nodes (entities) and edges (relationships) where:

Nodes represent distinct entities (e.g., calendar events, contacts, documents).
Edges represent relationships between entities (e.g., “has attendee,” “created by”).

2. Motif Queries

GraphFrames provides powerful motif finding capabilities that we leverage for relationship discovery:

Level One Join

“(a)-[e0]->(b)”

This finds direct relationships (e.g., GoogleCalendarEvent a” that has attachment b”).

Level Two Join

“(a)-[e0]->(b); (b)-[e1]->(c)”

This discovers two-step relationships (e.g., GoogleCalendarEvent A that has attachment B created by GoogleContact C).

3. Change Propagation

When an entity or relationship changes, the framework:

Identifies the affected node(s).
Traces all paths up to two levels deep that include the changed elements.
Updates the joined entity view to reflect the current state.

4. Grouped View Processing

The grouped view processing is critical for ensuring comprehensive updates:

Filtering is applied after motif finding rather than during graph creation to ensure all impacted neighbors are identified.
This approach captures the complete set of affected entities, even when relationships are being removed.
The view consolidates all relationships for each node, providing a holistic representation.
Both connected entities and isolated nodes are processed, ensuring no updates are missed.

5. Delta Table Management

The Delta merge step optimizes persistence and downstream consumption:

Data is merged into a Delta table using upsert semantics to maintain a single source of truth.
Change detection logic ensures only actual changes trigger downstream updates.
The unified contextualized entity format standardizes how relationship data is stored.
This approach significantly reduces the processing burden on downstream consumers.

Performance Considerations

Several optimizations ensure efficient processing:

1. Streaming Updates

Only processing changed nodes and edges rather than full data set scans.

2. Selective Loading

Filtering nodes and edges based on context to reduce the working data set.

3. Bidirectional Impact Analysis

Identifying only affected paths rather than rebuilding entire graphs.

4. Batch Processing

Grouping updates for efficient processing.

5. Change-Based Merging

Only persisting actual changes to minimize IO operations.

Results and Benefits

The implementation of this framework has yielded several significant benefits:

1. Reduced Development Time

New entity relationship requirements can be onboarded without custom join logic.

2. Standardized Processing

It introduces a consistent approach to entity relationships across the platform.

3. Improved Maintenance

The centralized framework reduces the number of specialized jobs.

4. Enhanced Performance

Optimized graph operations outperform traditional relational joins for complex relationships.

5. Flexibility

Support for different entity types and relationship patterns without code changes.

6. Efficient Downstream Consumption

Minimized updates reduce processing load for consumers.

Data VisualizationPython Data Visualization With Seaborn and Matplotlib

Future Work

While the current implementation supports up to two levels of joins, future enhancements could include extending to deeper relationships (three or more levels), adding support for more complex graph patterns and filtering and creating visualization tools for entity relationship exploration.

The generic joins framework using GraphFrames provides an efficient and maintainable solution for entity relationship management in our knowledge graph. By using graph-based processing instead of traditional relational joins, we’ve created a scalable architecture that can adapt to evolving product requirements with minimal development overhead.

The streaming architecture ensures that changes propagate quickly through related entities, maintaining data consistency while optimizing processing resources. The grouped view generation and Delta merge steps further enhance the solution by providing optimized data persistence and consumption patterns. This approach not only solves the immediate challenge of maintaining isolated data processing jobs but also establishes a foundation for more sophisticated knowledge graph applications in the future.