Data harvesting


Konnect Soft provides solutions for collecting and harvesting data to “feed” your information requirements.

Advanced technology powers the automated harvesting of data, often combined with traditional research processes, to deliver high-quality new and appended content.

Benefits of data harvesting

Data harvesting - also known as “data scraping” or “data mining” - can offer drastically reduced data acquisition and updating costs.

Using harvesting to gather new data, or to append to or confirm existing data, reduces the time required compared with traditional data acquisition, appending and updating work.

The web holds a wealth of information and, as the amount has grown, so has the potential benefit of harvesting it effectively. In corporate and biographic research specifically, the growing number of companies with web sites, people with micro-sites (LinkedIn, Facebook, blogs), government online services, and news sources able to post content for “free” (i.e., on an ad-supported basis) has made the amount of information potentially “available” far greater than it was even a few years ago.

Factors affecting data harvesting

The success of a specific harvesting effort, however, is dependent upon several critical factors.

What sources are you harvesting?

  • The internet in general
  • Specific corporate URLs
  • Aggregated news archives
  • Structured online directories
  • Primary source documents in Word, PDF or other formats

What data are you passing to the source to harvest?

  • Company names
  • People’s names
  • Industry or geographic parameters
  • Format parameters

What data are you getting back from the harvesting effort?

  • Are you getting the records you want?
  • Are you getting the fields of data you want?
  • Is the data you get consistent and reliable, or is it contradictory and duplicative?
  • Does the format of the returned data require conversion?
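
The questions above can be turned into a quick automated check on a harvested batch. The following is a minimal sketch only: the field names (`company_name`, `phone`, `url`) and the `audit_records` helper are hypothetical, and it assumes every record is a dictionary of string values.

```python
from collections import defaultdict

# Hypothetical field names; substitute the fields your own database expects.
REQUIRED_FIELDS = ("company_name", "phone", "url")


def audit_records(records, key_field="company_name"):
    """Summarise missing fields, duplicates and contradictions in a batch."""
    missing = {field: 0 for field in REQUIRED_FIELDS}
    groups = defaultdict(list)

    for record in records:
        # Are you getting the fields of data you want?
        for field in REQUIRED_FIELDS:
            if not (record.get(field) or "").strip():
                missing[field] += 1
        # Group on a normalised key so "Acme Ltd" and "acme  ltd" collide.
        key = " ".join((record.get(key_field) or "").lower().split())
        groups[key].append(record)

    # Is the data duplicative?
    duplicates = {k: len(v) for k, v in groups.items() if k and len(v) > 1}

    # Is the data contradictory? Same key, different non-empty values.
    conflicts = {}
    for key, group in groups.items():
        if not key or len(group) < 2:
            continue
        for field in REQUIRED_FIELDS:
            if field == key_field:
                continue
            values = {
                r[field].strip() for r in group if (r.get(field) or "").strip()
            }
            if len(values) > 1:
                conflicts.setdefault(key, {})[field] = sorted(values)

    return {
        "total": len(records),
        "missing": missing,
        "duplicates": duplicates,
        "conflicts": conflicts,
    }
```

A report with many missing fields or conflicts suggests the source needs more traditional research effort before its records can be used.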


Harvesters soon discover that consistently formatted and fielded data can be easy to get, but the new records may fall outside the desired “universe,” the content to be added, appended or updated may be old, and there may be copyright issues in using the data. Conversely, data that is in the right universe, “fresh,” and free of usage issues may well have major fielding and formatting challenges or come from sources that are difficult to “mine.”

Gathering new records

Gathering new records is often the best use of data harvesting, whether from the internet in general or from public records in electronic format. In this case the data is gathered via an automated mechanism and then the traditional research process begins. Since the initial harvesting cost is far less than purchasing a list with re-use rights or purchasing a list and mailing a questionnaire, there are immediate cost savings. Of course, a highly inaccurate source would inflate the expense of the traditional research effort (which should involve an initial qualification review, internet research, telephone research, and quality assurance), but otherwise it is a sound way to gather new records.

It is also important to note the large difference between gathering records in a completely new area (a new industry segment or geographical area), where you have no pre-existing records, and expanding penetration in an existing area (e.g. increasing coverage from 70% to 95% of all companies in a given segment). The former is relatively easy; the latter gets progressively more difficult and costly as the penetration level approaches 100%.
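
To make the coverage arithmetic concrete, here is a small sketch. It assumes the size of the segment (the "universe") is known, which in practice is often only an estimate; the function names and the example figure of 2,000 companies are illustrative and not taken from any real project.

```python
def coverage_percent(records_held, universe_size):
    """Share of the target universe already in the database, in percent."""
    if universe_size <= 0:
        raise ValueError("universe_size must be positive")
    return 100 * records_held / universe_size


def records_needed(records_held, universe_size, target_percent):
    """New records required to reach target_percent coverage (whole percent)."""
    # Integer ceiling division avoids floating-point rounding surprises.
    target_records = -(-universe_size * target_percent // 100)
    return max(target_records - records_held, 0)


def records_missing(records_held, universe_size):
    """Companies in the universe that are not yet in the database."""
    return max(universe_size - records_held, 0)
```

For an assumed segment of 2,000 companies with 1,400 already held, `coverage_percent(1400, 2000)` gives 70.0 and `records_needed(1400, 2000, 95)` gives 500. Once those are found, only 100 companies remain missing, so each further record has to be located in a much smaller, harder-to-find pool.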

In the end, manual research (internet and telephone), preferably by staff with domain experience, is the only way to get close to 100% coverage.

Data appending

Information providers typically have highly accurate databases and are constantly trying to append additional data to those records. Typical appends are telephone numbers, emails, URLs, branch offices, business descriptions, and contact names. This strong base helps the harvesting effort because in order to get accurate information you need to start by passing a strong piece of information to the sources to be harvested. This can be a full legal company name, a telephone number, a personal name, or a URL.

Particularly effective strategies for finding information potentially worth appending are:

  • Harvesting contact names and other information from a known URL by searching for text adjacent to certain text strings on the spidered pages (“contact us”; “email”; “about us”; “@”);
  • Searching the internet for certain values in the <title> tag of a page (legal company name, full personal name);
  • Searching the internet by telephone numbers and/or “fuzzy” name (not the exact legal name but all similar variants); and,
  • Searching structured online directories by street address or multiple parameters (exact post/zip code, first x characters of company name).
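
Several of these strategies can be sketched with the Python standard library alone. The example below is an illustration under stated assumptions, not a production harvester: it works on HTML that has already been downloaded, the helper names (`PageText`, `harvest_page`, `name_similarity`) and the 80-character window are hypothetical, and the email pattern is deliberately simple.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Simplified email pattern; real-world addresses can be more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Text strings named in the list above.
MARKERS = ("contact us", "email", "about us")


class PageText(HTMLParser):
    """Collect the <title> text and the remaining text of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        else:
            self.chunks.append(data)


def harvest_page(html, window=80):
    """Return the page title and emails within `window` chars of a marker."""
    parser = PageText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())

    marker_spans = [
        m.span()
        for marker in MARKERS
        for m in re.finditer(re.escape(marker), text, re.IGNORECASE)
    ]
    emails = set()
    for found in EMAIL_RE.finditer(text):
        for start, end in marker_spans:
            # Keep the address only if it lies close to a marker string.
            if found.start() <= end + window and found.end() >= start - window:
                emails.add(found.group())
                break
    return {"title": parser.title.strip(), "emails": sorted(emails)}


def name_similarity(a, b):
    """Score two names from 0 to 1, ignoring case and punctuation."""
    def norm(s):
        return " ".join(re.sub(r"[^a-z0-9 ]", " ", s.lower()).split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()
```

The title returned by `harvest_page` can be compared with a legal company name using `name_similarity`, which is one simple way to approximate “fuzzy” name matching; any acceptance threshold would need tuning against records that have been verified by hand.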

Learn more

To learn more, contact us now for a no-fee consultation with our specialists to discuss how your business can profit from data harvesting.