With Quality and Control, Web Data Enters the Enterprise Mainstream

The web is the greatest repository of knowledge humankind has ever created, and it’s about to unlock some previously unheard-of innovations for enterprises who are willing to see quality data amid the chaos.


In this article, you’ll learn about:

  • The trends that are driving Web Data Integration (WDI) adoption in the enterprise
  • The stages of the WDI life cycle
  • Best practices and use cases for WDI
  • Real examples of how Import.io’s WDI solution is being used in the travel, finance, and retail industries

Content Summary

Where Quality, Control, and Scalability Converge
Import.io steps up to the challenge
Discovering Wins Within the Web Data
Gaining a competitive advantage in the online travel industry
WDI Delivers Competitive Edge
Making smarter algorithms for financial success
The fast-paced world of retail
Conclusion

The web has fundamentally changed the way we interact with one another, do business, and learn about the world around us. We educate ourselves via Wikipedia or buy our coffee tables from vast digital “shopping malls” hosted by Amazon or Walmart. We use the web to collaborate with colleagues and track our daily to-do list, and nearly every business uses the web to find prospective customers.

Some ambitious enterprises see the web another way: The largest dataset humanity has ever compiled.

The challenge in getting value from this enormous dataset isn’t visibility. The web is perfectly transparent—anyone can view a page’s underlying HTML to see its full structure and content. Rather, the challenge is accessing that information in a way that lets you collect web data with quality and control at every step. Even though the data is all there, it’s still locked behind a complexity that most enterprises haven’t yet figured out how to untangle.

The traditional form of collecting web data is a process called web scraping, which consists of downloading information from the HTML structure behind many pages at once. Web scraping projects tend to be resource-intensive because they require a unique combination of programming and data science skills—a recipe that can cost enterprises a great deal of money. On top of that, someone must regularly maintain web scraping code as target sites enhance the web technologies used for the site, which very often changes the presentation and structure of the data.
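To make the fragility concrete, here is a minimal sketch of a traditional scraper, using only Python’s standard library. The retailer, markup, and prices are hypothetical: it assumes product prices sit in `<span class="price">` tags, and a site redesign that renames that class silently breaks the extraction—exactly the maintenance burden described above.

```python
from html.parser import HTMLParser

class PriceScraper(HTMLParser):
    """Pulls the text of every <span class="price"> element from a page."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Brittle by design: any change to the tag or class name breaks this.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

# A stand-in for HTML downloaded from a target site.
sample_html = '<div><span class="price">$129.99</span><span class="price">$89.50</span></div>'
scraper = PriceScraper()
scraper.feed(sample_html)
print(scraper.prices)  # ['$129.99', '$89.50']
```

In practice a scraper like this also needs request throttling, retries, and session handling—each another piece of code someone must maintain.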

While responsible web scraping is legal, those who do it must not violate another website’s Terms of Service (ToS) or otherwise behave unethically. Improperly coded scrapers can also send too many requests to a target site at once, either slowing it down or crashing it in an illegal (albeit accidental) denial-of-service (DoS) attack. Most enterprises simply aren’t willing to take on legal exposure when the process is expensive and the value isn’t immediately clear.

Enter Web Data Integration (WDI)—a new approach to acquiring the web’s vast amount of invaluable information. A WDI solution lets an enterprise programmatically aggregate web data from any number of websites in a single, homogeneous, and repeatable workflow. They can collect high-quality data with low resource requirements and little to no business risk. Data scientists can put the same confidence in web data as they do in their highly controlled internal datasets, and then integrate their findings with other business applications to generate insights and explore new opportunities.

Enterprises across verticals are finding value in a WDI approach. Payment providers are using WDI to enhance a consumer’s buying experience by providing rich images and suggesting similar items to purchase as well. “As more payment processors integrate web data into their mobile and web solutions, more consumers will benefit from the ‘frictionless finance’ achieved through customized digital user experiences while retailers build brand loyalty, increase customer touch-points and create the personalized experience consumers desire.”

Manufacturers are also using WDI to monitor product pricing to enforce their established minimum advertised pricing (MAP) agreements with retailers, and these retailers are in turn using it to collect customer sentiment on the many brands they’re selling and monitor the prices of competitive products.

Online travel companies are using it to pick up on demand for upcoming travel seasons or monitor location-specific promotions and reviews. And property management companies that offer vacation rentals are using web data to discover which vacation markets have the best metrics (occupancy, average daily rates, etc.), which booking sites have more properties in specific target markets, and which competitors are running promotions in those markets.

The WDI industry is growing precisely because of its flexibility and the differentiated value it brings. According to a 2019 Opimas report, the WDI industry is set to grow from $2.5 billion in 2017 to more than $7 billion by 2020, for a compound annual growth rate (CAGR) of nearly 30 percent. By comparison, the customer relationship management (CRM) industry, while predicted to grow to $35 billion by 2023, has a much slower CAGR of only 6 percent. Despite the spending growth, former Gartner analyst Doug Laney states in his book, Infonomics, “The incremental value of exogenous data is largely untapped by most organizations.”

The web is already the most powerful dataset ever created, and it’s always growing. There will be new opportunities for enterprises who understand the value of web data—if they know how to leverage it.

The Stages of a WDI Life Cycle:

  • Identify: Find the page, and where on the page, web data is located.
  • Extract: Collect either displayed or hidden content, even if it’s behind a login or requires user interactions.
  • Prepare: Cleanse and normalize web data using functions and formulas.
  • Integrate: Bring prepared web data into other business applications for faster insights.
  • Consume: Use graphs, charts, and other visualizations to see how web data changes over time.
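The Prepare and Integrate stages above can be sketched as a small pipeline over one extracted record. This is an illustrative sketch only—the field names, cleansing rules, and in-memory “warehouse” are assumptions, not Import.io’s actual API.

```python
# A raw record as it might come out of the Extract stage: untrimmed text,
# a formatted price string, and a rating embedded in prose.
raw = {"title": "  Oak Coffee Table ", "price": "$1,299.00", "rating": "4.5 stars"}

def prepare(record):
    """Cleanse and normalize: trim text, parse price and rating to numbers."""
    return {
        "title": record["title"].strip(),
        "price": float(record["price"].replace("$", "").replace(",", "")),
        "rating": float(record["rating"].split()[0]),
    }

def integrate(record, sink):
    """Push the prepared record into a downstream store (here, just a list)."""
    sink.append(record)

warehouse = []
integrate(prepare(raw), warehouse)
print(warehouse)  # [{'title': 'Oak Coffee Table', 'price': 1299.0, 'rating': 4.5}]
```

In a real deployment the sink would be a database, a CRM, or an analytics tool, and the Consume stage would chart these normalized values over time.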

Where Quality, Control, and Scalability Converge

As ambitious enterprises wade into WDI, they’ll start looking for an enterprise-level solution that allows for the visual discovery, selection, and manipulation of data that’s just not possible with traditional web scraping technologies.

They’re tired of coding and want speedy and repeatable automation of web data capture and aggregation. They want a visual environment that allows them to design workflows that can be automatically deployed repeatedly, even if it requires that they authenticate a user session and navigate parts of a website that aren’t meant to be readable by unauthenticated guests. And they want the ability to emulate a web browser’s interactivity based on location—how else would they extract the real price of an item when the retailer makes them add it to their cart first?

To maintain the quality and control that’s fundamental to WDI, enterprises are also demanding better tools for profiling, cleansing, and normalizing data. Complex functions in easy-to-use formats help them clean data with more efficiency, and historical data allows them to watch trends, support audit trails, and perform back-testing. And automated, up-to-date reports on the health of a web data pipeline help data scientists identify low-quality extractions before that data reaches their downstream integrations.

Speaking of integrations—this high-quality data must be portable to an enterprise’s other systems, whether that’s a CRM tool, a proprietary financial model, or a custom application that allows data analysts to forecast manufacturing needs. The minimally viable feature is the ability to download data in JSON format or Excel spreadsheets, but better WDI solutions offer an application programming interface (API) and/or webhooks for coders to automatically push normalized data into other applications.
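As a hedged sketch of that minimally viable export step, the snippet below serializes normalized records as JSON—the interchange format a downstream consumer would parse. The event name and payload shape are assumptions for illustration, not a documented webhook format.

```python
import json

# Normalized records, ready to leave the WDI pipeline.
records = [{"sku": "oak-coffee-table", "price": 1299.0}]

# A hypothetical webhook payload wrapping the rows with an event label.
payload = json.dumps({"event": "extraction.complete", "rows": records})

# A downstream consumer (CRM, financial model, custom app) parses it back.
decoded = json.loads(payload)
print(decoded["rows"][0]["price"])  # 1299.0
```

With an API or webhook, this hand-off happens automatically on every extraction instead of via manual file downloads.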

Without this level of deep integration, any WDI project is destined for the pitfalls of traditional web scraping.

The Varied Use Cases for WDI:

  • Market research
  • Competitive intelligence
  • Dynamic pricing
  • Price monitoring
  • Identifying whitespace
  • Spotting product trends and mix
  • Alternative data & equity research
  • Machine learning training
  • Back-testing investment models
  • MAP compliance
  • Customer sentiment
  • Third-party supply chain monitoring
  • Image/text capture en masse
  • Content marketing research
  • Fraud detection
  • SEO monitoring
  • Academic research
  • Data journalism

Import.io steps up to the challenge

Offering all these features, options, and opportunities is no easy task, but Import.io is trying to make a WDI solution that puts the ability to gather, integrate, and analyze web data onto a unified platform. By promising to treat web data as a first-class citizen in the enterprise data environment, the company is not just trying to solve the problem of complex, expensive web scraping projects. It’s helping enterprises—large and small—understand how valuable the web can be.

And, at the same time, Import.io is helping enterprises escape the pitfalls most commonly associated with ad-hoc web scraping projects. Instead of poor-quality data, Import.io treats web data like enterprise data through the entire pipeline, which means it’s high-quality, clean, and reliable. Import.io’s customers even get a data quality guarantee.

For organizations looking to employ web data at scale to support critical business functions, Import.io offers two enterprise-level options—a SaaS-based Web Data Integration platform and a Premium web data service. The Web Data Integration platform is an integrated development environment for developers to design data pipelines, build data sources, apply data quality rules, and monitor all projects. With the Premium web data service, Import.io’s people do all the work, so there’s nothing left to do but deliver insights from target websites.

Import.io aims to reduce the risk associated with web scraping, too. The company is dedicated to never causing harm to the target site or its traffic. No one wants to set off alarms or encourage the target site to more heavily restrict its web data. And if there is a legal stop request, Import.io will be the target—not the enterprise—because it’s doing all the heavy lifting of interacting with target sites and downloading data.

Because of this emphasis on quality and control, WDI isn’t just an experiment for forward-thinking enterprises. It’s already netting big returns in many places.

Discovering Wins Within the Web Data

The use cases for WDI efforts are as varied as the content of the web itself, but early adopters tend to be in industries that use websites with large amounts of heavily categorized information, like retail or online travel. But as enterprises become increasingly creative in their approaches, a variety of alternative paths emerge.

Gaining a competitive advantage in the online travel industry

How does a travel booking website get visitors to book on their platform instead of others? By using every competitive advantage they can dream of.

One enterprise in the online travel industry uses Import.io to track sitemaps on target websites and extract details on new properties, whether that’s the price, the number of bedrooms, or aggregate customer reviews. The company can even get booking-level data, which is usually hidden behind user interactions. Logic can then be built around entities moving between available and unavailable states, or around price changes, to better understand how competitors are positioning their properties. Finally, the company can create comprehensive, on-demand tables of all properties and when they’re available for booking (see sidebar).

With that insight, the travel site can respond with more intelligent, data-driven promotions, which just might be enough to convince a visitor to book with them over the competition.

WDI Delivers Competitive Edge

An online travel company uses Import.io to gain intelligence from competitors’ sites, which it then uses to deliver better deals to its customers.


Making smarter algorithms for financial success

One New York-based hedge fund found itself in a predicament: Its internal web scraping scripts needed constant maintenance, and the data they extracted wasn’t reliable enough to feed directly into the company’s algorithmic trading applications. The fund needed to ditch traditional web scraping and find a solution that could keep up with change and an enormous volume of data.

The company opted for Import.io’s managed solution, which meant it didn’t have to spend any of its employees’ valuable time on collecting data from 40,000 news and 500 retail sites. With this system, the fund could use Import.io’s SaaS product for ad-hoc queries while deriving insights from the bulk of incoming web data directly in Import.io’s Insights application.

With a constant flow of timely web data, the firm could power its ML-based investment models with confidence. Improved data accuracy created more intelligent decisions, and allowed the hedge fund to improve upon benchmark indexes while reducing risk.

The fast-paced world of retail

As retailers compete to sell the same products, manufacturers sometimes need to enforce their minimum advertised price (MAP)—if prices go too low, the manufacturer’s value degrades. Before partnering with Import.io, one furniture maker could only reliably monitor 200 products—a mere 3 percent of its offerings—despite using multiple tools.

Using Import.io, the company tracks pricing for thousands of products—100 percent of its product line—across 20 distinct web properties. The company uses alarm thresholds to trigger internal processes when a product’s price falls below a certain level, and regularly leverages visualization capabilities to identify anomalies that are only obvious over time.
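An alarm threshold like the one described above can be sketched in a few lines. The SKUs, prices, and threshold logic below are hypothetical, not Import.io’s implementation—the point is simply flagging observed prices that fall below the minimum advertised price.

```python
# Minimum advertised price per SKU, per the manufacturer's agreements.
MAP = {"oak-coffee-table": 1199.00}

# Prices observed across retailers' sites by the extraction pipeline.
observed = [
    {"sku": "oak-coffee-table", "retailer": "RetailerA", "price": 1249.00},
    {"sku": "oak-coffee-table", "retailer": "RetailerB", "price": 1099.00},
]

def map_violations(records, map_prices):
    """Flag every observed price below the SKU's minimum advertised price."""
    return [r for r in records if r["price"] < map_prices.get(r["sku"], 0.0)]

alerts = map_violations(observed, MAP)
print([a["retailer"] for a in alerts])  # ['RetailerB']
```

Each flagged record would then trigger the internal process—typically a conversation with the retailer—rather than requiring an analyst to eyeball thousands of prices.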

The furniture maker’s MAP tracking coverage has moved from 3 percent to 100 percent.

With these new tracking and integration capabilities, the company can better monitor how retailers change prices and start discussions when those fall beneath the MAP. It also regularly tracks pricing on competitive products to further drive pricing strategy and plan more targeted promotions.

Conclusion

No web scraper could reliably capture product images and links from 10,000 merchants, totaling more than 50 million transactions every month, and then feed all that data into other business tools using RESTful APIs and webhooks. But a WDI solution can—and that’s just one example of why enterprises are flocking to WDI.

Through quality and control, enterprises are maximizing revenue, improving relationships with their customers, gathering insights into their competitors, and reacting faster to changes in their industry.

Anyone looking into web data should think about some of the trends and best practices in the WDI industry:

  • Web scraping can have its place in the enterprise, but WDI provides a single-platform approach to working with web data, a lifecycle that retains quality and control.
  • Don’t limit WDI efforts to the first “level” of data, such as what appears in a search for “coffee tables.” Experiment with user interactions that may reveal more information about a product or the website’s underlying structure.
  • WDI tools should be both easy to use and sophisticated: easy for extracting data, and sophisticated in their cleansing, harmonizing, and transforming capabilities.
  • Look for harmonization tools that will help eliminate duplicates, resolve conflicts, and then format the data in the most concise and flexible form possible for the use case in question.
  • Insist on automation! After all, WDI projects are meant to reduce the amount of time and effort spent on extracting, integrating, and analyzing web data. Automated extraction, for example, not only saves time but also reduces the risk of simple mistakes propagating through the enterprise.

With some of these habits in mind, embarking on a WDI project should feel like a smarter, less complex, and time-saving effort compared to web scraping efforts of the past. With more integration, less legal risk, and tooling that appeals to business users without decades of coding experience, web data should be a mainstay of enterprise data.


Source: Import.io

Published by Thomas Apel, a dynamic and self-motivated information technology architect with a thorough knowledge of all facets of system and network infrastructure design, implementation, and administration. I enjoy the technical writing process, including answering readers’ comments.