Data Engineers, Here’s How LLMs Can Make Your Lives Easier

Large language models make data engineering easier, from the simple tasks of the early stages of data projects to creating better frameworks for entire data teams.

Working with hundreds of data-driven businesses worldwide, I’ve been excited to see how quickly and creatively they have built LLMs into their workflows.

Let’s discuss a few common examples of using LLMs for data processing and enrichment to demystify the use of LLMs and highlight relatively simple yet incredibly time-saving methods for data-driven businesses.

3 Main Limitations of LLMs for Data Enrichment

The extent of the context
The size of your input
The resources you have

LLMs Speed Up the Engineering Process

LLM technology has made a huge impact on data engineering. Because data engineering covers a wide range of work with data, LLMs can help at several different levels.

One of the most foundational aspects of the job is research. Implementing new data engineering solutions often requires reading various papers and documented use cases.

But now, you can ask an LLM to suggest a solution to your problem, and it will offer different architectures that you can try. Then, you can request help implementing the one you like with step-by-step instructions. This allows you to get to the actual engineering faster.

LLMs Can Organize Unstructured Data

Now, let’s discuss data processing. Data engineering often involves large amounts of unstructured data, which needs to be tidied up and stored correctly to be ready for querying.

LLMs can help you with that. For example, parsing product names and prices from HTML documents extracted from e-commerce websites requires a custom parser, the basis of which can now be written by an LLM.
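To make this concrete, here is a minimal sketch of the kind of parser basis an LLM might draft. It is an illustration, not a production scraper: the markup and the class names "product", "name" and "price" are assumptions I made up for the example, and a real site would need its own selectors. It uses only Python's standard library.

```python
from html.parser import HTMLParser


def parse_price(text):
    """Turns a price string like "$1,299.00" into a float.

    Assumes a dot is the decimal separator (an assumption for this sketch).
    """
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    return float(cleaned)


class ProductParser(HTMLParser):
    """Collects (name, price) pairs from product cards.

    Assumes hypothetical markup: every product sits in an element with
    class "product" that contains elements with class "name" and "price".
    """

    def __init__(self):
        super().__init__()
        self.products = []    # finished (name, price) tuples
        self._current = None  # fields of the product being read
        self._field = None    # "name" or "price" while inside that element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "product" in classes:
            self._current = {}
        elif self._current is not None:
            for field in ("name", "price"):
                if field in classes:
                    self._field = field

    def handle_data(self, data):
        if self._current is not None and self._field and data.strip():
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None
        # Store the product as soon as both fields have been seen.
        if self._current is not None and {"name", "price"} <= self._current.keys():
            self.products.append(
                (self._current["name"], parse_price(self._current["price"]))
            )
            self._current = None


def extract_products(html):
    parser = ProductParser()
    parser.feed(html)
    return parser.products


SAMPLE_HTML = """
<div class="product"><h2 class="name">Desk Lamp</h2><span class="price">$24.99</span></div>
<div class="product"><h2 class="name">Office Chair</h2><span class="price">$1,299.00</span></div>
"""

print(extract_products(SAMPLE_HTML))
# [('Desk Lamp', 24.99), ('Office Chair', 1299.0)]
```

In practice, you would still review and adjust a draft like this by hand, since real product pages are rarely this tidy.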

Also, some less complex use cases allow information to be extracted from unstructured data without writing a parser at all. GPT Researcher, for example, is a tool designed for online research that can extract specific information from websites on demand.

Of course, the scope of your project can limit the use of such tools. Still, the assistance that LLM-based technology can provide for smaller-scale projects is undeniably valuable.

Basically, LLMs have become helpful in different parts of the data engineering pipeline. The results they provide are not always 100 percent accurate, but they are still transforming the way and the speed at which we can get things done when working with data.

LLMs Streamline B2B Data Enrichment

LLMs are also excellent tools for data cleaning and enrichment. Let’s take unstructured addresses or static location data as an example.

Suppose you have a data set of 1,000 company profiles that include free-text fields filled in by users. One of them is “location.” Some companies might have entered a state (e.g., Texas) as their address, while others used a city (e.g., Dallas). Such data must be structured for analysis.

You can upload the data set to the LLM and formulate a prompt to unify this data. For example: “Find ‘location’ values with city names and change them to the name of the state where the city is located.”
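If you would rather do this in code than in a chat window, the same idea could be sketched roughly as follows. Note the assumptions: `ask_llm` is a hypothetical function you would supply yourself (it takes a prompt string and returns the model's text reply), the prompt wording is mine, and the sketch assumes the model answers with a clean JSON list, which you should verify rather than take for granted.

```python
import json


def build_location_prompt(locations):
    """Builds one prompt that asks for a state name per location value."""
    numbered = "\n".join(f"{i}. {loc}" for i, loc in enumerate(locations, 1))
    return (
        "For each US location below, return the name of the state it is in. "
        "If the value is already a state, return it unchanged. "
        "Answer only with a JSON list of strings in the same order.\n"
        + numbered
    )


def normalize_locations(profiles, ask_llm):
    """Returns copies of the profiles with 'location' replaced by a state.

    ask_llm is a hypothetical callable: prompt string in, reply string out.
    """
    # Send each distinct value once to keep the prompt (and the bill) small.
    unique = sorted({p["location"] for p in profiles})
    states = json.loads(ask_llm(build_location_prompt(unique)))
    if len(states) != len(unique):
        raise ValueError("The LLM returned a different number of values")
    mapping = dict(zip(unique, states))
    return [{**p, "location": mapping[p["location"]]} for p in profiles]


def fake_llm(prompt):
    """Stand-in for a real model call, used only to show the flow."""
    return '["Texas", "Texas"]'


profiles = [
    {"name": "A", "location": "Dallas"},
    {"name": "B", "location": "Texas"},
    {"name": "C", "location": "Dallas"},
]
print(normalize_locations(profiles, fake_llm))
```

Since the results are not always 100 percent accurate, it is worth spot-checking the returned mapping before writing it back to your data set.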

Here’s another example. Getting accurate information about what companies specialize in can be complicated, because most public company descriptions are meant for marketing efforts, with buzzwords like “driving innovation” or “transforming the field of x.” But you need to know exactly what they specialize in — especially in the B2B sector.

An LLM can process company descriptions and label them based on specific criteria or extract and summarize relevant facts.

How does it work? Let’s look at automating a categorization with the help of an LLM.

You have that same data set of 1,000 company profiles and a list of potential clients. Say you’re building a tool for companies that use or are likely to use AI. You’d like to approach companies that fit your ideal customer profile with your services.

Company descriptions are extracted from company listings on publicly available social networks, meaning you’re working with descriptions generated by companies. You could instruct an LLM to analyze which companies use AI and present the results in a table, infographic or textual summary.
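A rough sketch of how such a categorization could be automated is below. Everything specific in it is an assumption for illustration: the three labels, the prompt wording, and `ask_llm`, a hypothetical function you would provide that sends a prompt to your model of choice and returns its text reply. Replies that cannot be parsed or that use an unknown label are flagged for manual review instead of being trusted.

```python
import json

# Hypothetical label set for this example.
LABELS = ("uses_ai", "likely_to_use_ai", "no_signal")


def build_classification_prompt(description):
    return (
        "Decide whether the company described below uses AI. "
        f"Pick exactly one label from {list(LABELS)} and quote the phrase "
        "that supports it. Answer only with JSON in the form "
        '{"label": "...", "evidence": "..."}.\n\n'
        f"Description: {description}"
    )


def classify_company(description, ask_llm):
    """Returns {'label': ..., 'evidence': ...} for one description."""
    raw = ask_llm(build_classification_prompt(description))
    try:
        result = json.loads(raw)
        label = result["label"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # The reply was not the JSON object we asked for.
        return {"label": "needs_review", "evidence": ""}
    if label not in LABELS:
        return {"label": "needs_review", "evidence": ""}
    return {"label": label, "evidence": str(result.get("evidence", ""))}


def classify_profiles(profiles, ask_llm):
    """Builds table-like rows: one dict per company profile."""
    return [
        {"name": p["name"], **classify_company(p["description"], ask_llm)}
        for p in profiles
    ]
```

The resulting rows can be loaded into a spreadsheet or data frame, which gives you the table view of which companies appear to fit your ideal customer profile.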

LLMs Can Retrieve Hidden Data

Typically, the most reliable option for data enrichment is to use an LLM fine-tuned for your specific needs, especially if you’re working with big data. This is a costly option that’s not easily accessible to companies with limited resources. I’d encourage you, however, to at least run tests with easily accessible LLM solutions.

When talking about using LLMs for data enrichment, the key benefit is extracting information from data in a way that typically requires a human or human-like intellect. Such tasks require understanding context and the ability to make conclusions.

Some may say that extracting information like “free trial” from the source data is not enrichment, but in my experience, it is a higher-level task than data cleaning or simply finding a keyword. LLMs understand context to the extent that they extract information from data without using the exact phrase mentioned in the source. This results in precious, hard-to-get data.

Limitations of Using LLMs for Enrichment

As your business needs grow, LLMs can become expensive. But you can always use open-source options. They are generally not as capable as the paid ones, but they still open many transformational opportunities for businesses.

Many open-source options are limited by the size of the context the LLM can handle, though. The context window determines how much text a language model can take into account when preparing a response. To put it into perspective, the context required for complex use cases can be a whole book.

The larger the context window you require, the more advanced the model you need, and larger models consume more resources. For example, analyzing long product or job descriptions means more extensive input and will likely require larger models.

You can always reduce your input, but in most cases, the less information you feed to the LLM, the poorer the results will be. That’s a difficult cycle to break, but solutions like Google’s Gemini 1.5 already show that LLMs don’t have to be limited by context. Gemini 1.5 can process up to 1 million tokens, or roughly 700,000 words of context, in one go.
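To get a feel for these numbers, here is a back-of-the-envelope sketch that reuses the ratio quoted above (1 million tokens for about 700,000 words) to estimate token counts and to group texts into batches that fit a given budget. Treat it as a rough planning aid only: real tokenizers vary by model and language, and the function names and budget values are my own assumptions.

```python
# Rough ratio taken from the figures above; actual tokenizers differ.
TOKENS_PER_SAMPLE = 1_000_000
WORDS_PER_SAMPLE = 700_000


def estimate_tokens(text):
    """Estimates the token count from the word count, rounded up."""
    words = len(text.split())
    # Integer ceiling division avoids floating-point surprises.
    return -(-words * TOKENS_PER_SAMPLE // WORDS_PER_SAMPLE)


def batch_by_budget(texts, token_budget):
    """Greedily groups texts so that each batch stays within the budget."""
    batches, current, used = [], [], 0
    for text in texts:
        cost = estimate_tokens(text)
        if cost > token_budget:
            raise ValueError("A single text exceeds the token budget")
        if current and used + cost > token_budget:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        batches.append(current)
    return batches


description = " ".join(["word"] * 7)
print(estimate_tokens(description))                       # 10
print(len(batch_by_budget([description] * 3, 25)))        # 2
```

A smaller budget means more batches and more calls, while a larger one needs a model with a bigger context window, which is exactly the trade-off between input size and cost.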

So, while working with LLMs, you’ll always aim to use them as effectively as possible, striving to balance the price of the service (or of running your own LLM) against input size. Otherwise, you either get sufficient quality from a setup that is too difficult or expensive to run, or an affordable setup whose results fall short.

The Future of LLMs

It’s hard to tell what the future of LLMs and AI technology will look like. Still, one positive trend I have already noticed is that people can increasingly focus on the vision and let artificial intelligence help find a way to realize it, making AI an extension of expertise rather than a replacement for it.

I’d expect more focus on practical tools for developers, such as programming assistants and component-based solutions, which will interconnect. Businesses will likely keep using LLMs to save resources or create new business ideas to help other companies or individuals save theirs.