In 1958, Hans Peter Luhn published “A Business Intelligence System” in the IBM Journal of Research and Development. The paper detailed a system that utilized “present-day data processing machines” to automatically abstract, encode, store and disseminate information found in physical documents. While Luhn was ahead of his time with regard to the rise and importance of business intelligence, it would take a few more decades before the tech world turned its attention toward unstructured data.
Fast forward to today: The International Data Corporation estimates that 80 percent of the world’s data will be unstructured by 2025. For people who work in the field, it’s an unsurprising statistic.
“Most of the data in the world comes to a computer system unstructured,” said Senthil Padmanabhan, a VP and technical fellow at eBay who leads user experience engineering across the company’s platforms. “Unstructured data is a problem that has been present in the software engineering world since the late 1950s.”
While unstructured data isn’t a new phenomenon, what is new is the rate at which it’s being created. When eBay was founded in 1995, most people accessed the internet using a desktop or laptop computer, and sharing photos, videos and documents was a time-consuming process. Today, everything from phones to watches and refrigerators is connected to the internet and constantly generating data.
Still, that doesn’t mean all social media, Word document and smart device data is valuable. One of the biggest challenges unstructured data presents to a company, in addition to finding a place to store it all, is identifying what valuable insights can be gleaned from it in the first place.
WHERE IS DATA STORED?
- Structured data is commonly stored in a relational database, which is similar to a collection of related spreadsheets, or a data warehouse at scale. It can also be stored in a non-relational database. Users communicate with relational databases using SQL (structured query language).
- Unstructured data is often stored in non-relational databases, also known as NoSQL databases, and alongside structured data in a data lake at scale. Unlike relational databases, there’s no one language used for NoSQL database queries. In recent years, extracting information from large sets of unstructured data has become possible with the help of machine learning, which is well suited to analyzing large, unstructured data sets.
- Semi-structured data can either be stored in a non-relational database as a complete unit or its metadata can be stored separately in a relational database.
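To make the sidebar concrete, here is a minimal sketch using Python's built-in `sqlite3` and `json` modules. The `listings` table, its columns and the sample rows are assumptions made up for illustration and do not reflect eBay's actual schema. The structured fields live in ordinary relational columns and are queried with SQL, while a semi-structured blob of seller-supplied details is stored whole as JSON next to them.

```python
import json
import sqlite3

# In-memory relational database; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings ("
    "id INTEGER PRIMARY KEY, title TEXT, price REAL, raw_json TEXT)"
)

# Structured columns (id, title, price) plus a semi-structured JSON blob.
rows = [
    (1, "Princess Diana Bear", 25.0, json.dumps({"condition": "MIB"})),
    (2, "Chicago Bears jersey", 60.0, json.dumps({"size": "L"})),
]
conn.executemany("INSERT INTO listings VALUES (?, ?, ?, ?)", rows)


def titles_under(conn, max_price):
    """Query the structured columns with plain SQL."""
    cur = conn.execute(
        "SELECT title FROM listings WHERE price < ? ORDER BY id",
        (max_price,),
    )
    return [row[0] for row in cur.fetchall()]


def load_details(conn, listing_id):
    """Fetch the semi-structured blob and decode it in application code."""
    row = conn.execute(
        "SELECT raw_json FROM listings WHERE id = ?", (listing_id,)
    ).fetchone()
    return json.loads(row[0])


print(titles_under(conn, 30))
print(load_details(conn, 2))
```

The price filter works because price is a typed column. The contents of `raw_json` are opaque to that query and only become usable once the application parses them, which is the practical gap between structured and less structured data.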
EBay knows all about the challenges and opportunities unstructured data presents. The company estimates that “many” of its 1.3 billion listings are unstructured. That swell of unstructured data dates back to a decision from the early 2000s. In an effort to create a seller-friendly environment, eBay allowed users to list items with limited descriptions and details. Once the site evolved from the go-to place to sell rare Pez dispensers into a full-fledged e-commerce platform, eBay realized it had a problem on its hands.
“When there were only a few million listings, it was okay to show the same listing in many different ways across the board,” said Padmanabhan. “But as more and more products were listed, users were not okay with seeing the same listing in 20 different ways or even a thousand different ways. That’s when we thought, ‘Okay, we need to make sense of these unstructured listings.’”
That was over a decade ago. Today, eBay still faces a massive challenge, but advances in machine learning technology have helped the company organize its listings and leverage unstructured data in new ways.
Making Machines Do the Heavy Lifting
EBay went public in 1998, and its growth through the early 2000s was nearly unparalleled. But that growth also loosened the company’s grip on its product catalog. At a traditional retailer like Walmart, the product catalog is organized, edited and updated in-house. At eBay, sellers build out the company’s product catalog in real time. As eBay grew, Padmanabhan said the company implemented rule-based systems to ensure that product catalogs were accurate and that incomplete listings were cataloged appropriately.
When eBay was younger, the conditions of these rules didn’t need to be complex. Imagine a condition that listings with “bear” in the title or description go into the Beanie Baby section of the product catalog. Such a condition would ensure that a listing for a “Princess Diana Bear” with the description “new in box” would find its way into the right section of eBay’s product catalog.
This setup works perfectly fine when the majority of users are selling Beanie Babies. But what happens when someone lists a Chicago Bears jersey or custom grizzly bear art for sale? In either case, the rules would need to be manually adjusted to ensure items are accurately cataloged.
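The brittleness of keyword rules is easy to reproduce. The sketch below is an illustration only and not eBay's system: the `categorize` function, both rule lists and every category name other than the Beanie Baby example are assumptions invented for the demonstration.

```python
def categorize(title, description, rules):
    """Return the category of the first rule whose keyword appears in the text."""
    text = f"{title} {description}".lower()
    for keyword, category in rules:
        if keyword in text:
            return category
    return "Uncategorized"


# The single rule described above: anything mentioning "bear" is a Beanie Baby.
naive_rules = [("bear", "Beanie Babies")]

# A hand-patched rule list (hypothetical). More specific keywords must come
# first, and every new kind of product needs another manual entry.
patched_rules = [
    ("chicago bears", "Sports Memorabilia"),
    ("grizzly", "Art"),
    ("bear", "Beanie Babies"),
]

print(categorize("Princess Diana Bear", "new in box", naive_rules))
print(categorize("Chicago Bears jersey", "", naive_rules))    # miscataloged
print(categorize("Chicago Bears jersey", "", patched_rules))  # fixed by hand
```

With the naive rule, the jersey lands in the Beanie Baby section. The patched list fixes that one case, but only because someone anticipated it and ordered the rules carefully, which is the maintenance burden the article describes.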
“Slowly, we saw that a rule-based system was not scalable due to the rapid rate at which data was being generated,” said Padmanabhan. “Everyone in the world has been talking about applying machine learning to unstructured data for the past five or six years, but we started quite some time before that.”
Natural language processing, or NLP, is a subset of machine learning used by eBay to bring structure to its listings and create more accurate product catalogs. NLP is the same technology that enables computers to understand the context of human speech or writing. It is what makes it possible for Siri to tell jokes and for Grammarly to determine that the tone of an email is a bit too aggressive.
Over the years, eBay sellers have created their own unique lingo filled with acronyms to describe a product’s condition. NLP algorithms built and trained by humans help ensure that terms like “MIB,” or mint in box, aren’t lost when pages are translated. NLP is also used to speed up the listing process. All sellers are given the option to base a listing off an existing product for sale on eBay. If a seller selects the “sell one like this” option, the listing is populated with information from the existing listing, with the seller needing only to set the price and upload photos.
“We want the machine to do most of the heavy lifting so that when sellers give us as little information as possible, we’re able to give them an idea of how their listing will be structured,” said Padmanabhan. “We do the groundwork for them, and they just have to validate everything.”
Aligning Business and Computer Vision
While NLP makes the lives of sellers easier, part of eBay’s work with computer vision is aimed at leveraging one of its largest sources of unstructured data, photographs, in a new way. Computer vision, or CV, is a subset of artificial intelligence that focuses on enabling computers to see and understand images as humans do. Put another way, CV seeks to give computers the ability to see a picture of a puppy, not just pixels.
Computer vision enables cars with self-driving technology to determine where lane lines are. It’s also what makes it possible to hold a phone up to a sign in a foreign language and receive a real-time translation from Google. In both instances, CV leans on unstructured data to create and decipher meaning.
In eBay’s case, computer vision is used to create a unique user experience powered by unstructured data. Since late 2017, shoppers have been able to search for items on the eBay app using photos they found online or took themselves. The technology that powers this sophisticated image search is a mixture of computer vision and deep learning, more specifically convolutional neural networks.
In addition to creating a new way to search for products, advances in CV could also help eBay better organize its product catalog.
“Let’s say there is a new big breakthrough in image recognition or vision,” said Padmanabhan. “We can start using that to cluster our listings because every listing has an image associated with it.”
Getting the Next Generation Involved
According to Padmanabhan, unstructured data will continue to present a challenge to companies in the future. While progress is being made, he said there’s simply “no home run technology” on the horizon. Change, he said, happens incrementally as more research papers are published and new technologies develop. When it comes to technology, Padmanabhan said eBay wants to build models that can be processed more efficiently to take advantage of advances in hardware capabilities.
The company is also working to build excitement around unstructured data in the e-commerce industry among college students. EBay recently launched a machine learning challenge open to students at NYU, Stanford, the State University of New York at Buffalo and the University of Texas at Dallas. Teams were given a dataset containing 1 million unlabeled listings and tasked with developing models to identify listings for the same products.
Effectively, the students in the challenge were to step into the world of unstructured data and build a product catalog.
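For a sense of what "identifying listings for the same product" involves, consider the simple baseline sketched below. It is not the method used in the challenge or by eBay: the Jaccard word-overlap measure, the greedy grouping strategy, the 0.5 threshold and the sample titles are all assumptions chosen to keep the example short.

```python
import re


def tokens(title):
    """Lowercase a title and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Share of words two titles have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_listings(titles, threshold=0.5):
    """Greedily place each title in the first group whose first member is similar enough."""
    groups = []  # each group is a list of indices into `titles`
    for index, title in enumerate(titles):
        words = tokens(title)
        for group in groups:
            representative = tokens(titles[group[0]])
            if jaccard(words, representative) >= threshold:
                group.append(index)
                break
        else:
            # No existing group matched, so this title starts a new one.
            groups.append([index])
    return groups


sample_titles = [
    "Apple iPhone 11 64GB Black Unlocked",
    "iPhone 11 64GB black unlocked Apple",
    "Princess Diana Bear Beanie Baby",
    "Beanie Baby Princess Diana bear MIB",
]
print(group_listings(sample_titles))
```

Word overlap handles reordered titles like these, but it fails as soon as sellers describe the same item in different words or leave details out, which is why the task calls for learned models instead of hand-tuned similarity scores.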
While Padmanabhan isn’t counting on any of the teams in the competition to develop groundbreaking technology, he does hope that the challenge piques enough interest to make them think more about applications of machine learning in e-commerce.
“I go to a lot of recruitment events, and whenever I talk to students studying machine learning, they only think about vision and text NLP,” said Padmanabhan. “But think about trade and e-commerce. There is a big opportunity to solve complex machine learning problems, which make an immediate and real impact on customers.”