Data harvesting
Overview
Konnect Soft provides solutions for collecting and harvesting data to “feed” your information requirements.
Advanced technology powers the automated harvesting of data, often combined with traditional research processes, to deliver high-quality new and appended content.
Benefits of data harvesting
Data harvesting - also known as “data scraping” or “data mining” - can
drastically reduce data acquisition and updating costs.
Using harvesting to gather new data, or to append and/or confirm
existing data, cuts the time that traditional data
acquisition, appending and updating work would otherwise require.
A wealth of information is available on the web and, as the amount has
increased, so has the potential benefit of harvesting it effectively.
Specifically, in terms of corporate and biographic research,
the increase in the number of companies with web sites, people with
micro-sites (LinkedIn, Facebook, blogs), government online services, and
news sources able to post content for “free” (i.e., on an ad-supported
basis) has made the amount of information potentially “available”
far greater than it was even a few years ago.
Factors affecting data harvesting
The success of a specific harvesting effort, however, depends on
several critical factors.
What sources are you harvesting?
- The internet in general
- Specific corporate URLs
- Aggregated news archives
- Structured online directories
- Primary source documents in Word, PDF or other formats
What data are you passing to the source to harvest?
- Company names
- People’s names
- Industry or geographic parameters
- Format parameters
What data are you getting back from the harvesting effort?
- Are you getting the records you want?
- Are you getting the fields of data you want?
- Is the data you get consistent and reliable, or is it contradictory and duplicative?
- Does the format of the returned data require conversion?
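The questions above about the returned data can be turned into simple automated checks. The following is a minimal sketch, assuming harvested records arrive as Python dictionaries. The field names (`company`, `phone`) and the `audit_records` helper are hypothetical, chosen for illustration only.

```python
from collections import defaultdict

REQUIRED_FIELDS = ("company", "phone")  # assumed field names, for illustration only


def audit_records(records, key_field="company"):
    """Report records with missing fields, exact duplicates, and contradictions."""
    missing = []      # indexes of records lacking a required field
    duplicates = []   # indexes of records that repeat an earlier record
    seen = set()      # normalized fingerprints of records already encountered
    values_by_key = defaultdict(lambda: defaultdict(set))

    for index, record in enumerate(records):
        if any(not record.get(field) for field in REQUIRED_FIELDS):
            missing.append(index)

        # Two records are duplicates if every field matches after normalization.
        fingerprint = tuple(
            sorted((k, str(v).strip().lower()) for k, v in record.items())
        )
        if fingerprint in seen:
            duplicates.append(index)
        seen.add(fingerprint)

        # Group the non-empty values of every other field under the key field.
        key = str(record.get(key_field) or "").strip().lower()
        for field, value in record.items():
            if field != key_field and value:
                values_by_key[key][field].add(str(value).strip())

    # A contradiction: the same key carries more than one value for a field.
    contradictions = {}
    for key, fields in values_by_key.items():
        conflicting = {f: sorted(v) for f, v in fields.items() if len(v) > 1}
        if conflicting:
            contradictions[key] = conflicting

    return {
        "missing": missing,
        "duplicates": duplicates,
        "contradictions": contradictions,
    }
```

A report like this only flags candidates for review. Deciding which of two contradictory values is correct still falls to the research process.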
Issues
Harvesters soon discover that even when consistently formatted and
fielded data is easy to get, the new records may fall outside the
desired “universe,” the content to be added, appended or updated may be
old, and there may be copyright issues in using the data. Conversely,
data that is in the right universe, “fresh,” and free of usage issues
may well pose major fielding and formatting challenges or come from
sources that are difficult to “mine.”
Gathering new records
Gathering new records is often the best use of data harvesting, either from the
internet in general or from public records in electronic format. In this case
the data is gathered via an automated mechanism and then the traditional
research process begins. Since the initial harvesting cost is far less
than purchasing a list with re-use rights or purchasing a list and
mailing a questionnaire, there are immediate cost savings. Of course,
a highly inaccurate source would inflate the expense of the
traditional research effort (which should involve an initial
qualification review, internet research, telephone research, and quality
assurance), but otherwise it is a sound way to gather new records.
Also, it is important to note the large difference in cost between
gathering records in a completely new area (new industry segment, new
geographical area) where you have no pre-existing records and expanding
the penetration in an existing area (e.g. increasing coverage from 70%
to 95% for all companies in a given segment). The former is relatively
easy; the latter gets progressively more difficult as the penetration
level approaches 100%.
In the end, manual research (internet and telephonic), preferably by
staff with domain experience, is the only way to get close to 100%
coverage.
Data appending
Information providers typically have highly accurate databases and are
constantly trying to append additional data to those records. Typical
appends are telephone numbers, email addresses, URLs, branch offices,
business descriptions, and contact names. This strong base helps the
harvesting effort because, to get accurate information back, you need to
start by passing a strong piece of information to the sources to be
harvested. This can be a full legal company name, a telephone number, a
personal name, or a URL.
Particularly effective strategies for finding information potentially
worth appending are:
- Harvesting contact names and other information from a known URL by searching for text adjacent to certain text strings on the spidered pages (“contact us”; “email”; “about us”; “@”);
- Searching the internet for certain values in the <title> tag of a page (legal company name, full personal name);
- Searching the internet by telephone numbers and/or “fuzzy” name (not the exact legal name but all similar variants); and,
- Searching structured online directories by street address or multiple parameters (exact post/zip code, first x digits of company name).
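Several of these strategies can be illustrated with a short Python sketch that uses only the standard library. It assumes the page has already been downloaded as an HTML string. The names `PageText`, `harvest_page` and `name_similarity`, the `window` parameter, and the e-mail pattern are all hypothetical choices made for illustration, not a description of any specific harvesting tool.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Simplified e-mail pattern (an assumption; real addresses can be more varied).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
MARKERS = ("contact us", "email", "about us")  # text strings named in the list above


class PageText(HTMLParser):
    """Collect the <title> and the text chunks of one HTML page.

    Note: text inside <script> and <style> is not filtered out in this sketch.
    """

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        else:
            self.chunks.append(text)


def harvest_page(html, window=2):
    """Return the page title, e-mail addresses, and text found after a marker."""
    parser = PageText()
    parser.feed(html)
    emails = sorted(set(EMAIL_RE.findall(" ".join(parser.chunks))))
    nearby = []
    for i, chunk in enumerate(parser.chunks):
        if any(marker in chunk.lower() for marker in MARKERS):
            # Keep the chunks that follow the marker; they may hold contact details.
            nearby.extend(parser.chunks[i + 1 : i + 1 + window])
    return {"title": parser.title, "emails": emails, "nearby": nearby}


def name_similarity(a, b):
    """Ratio in [0, 1]; higher means the two names are closer variants."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```

For example, `harvest_page(page_html)["title"]` can be compared with a known legal company name using `name_similarity`, and only pages scoring above a chosen threshold kept as candidates. Any such threshold is a tuning decision, and the matches it produces still need the qualification review described earlier.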
Learn more
To learn more, contact us now for a no-fee consultation to discuss
with our specialists how your business can profit from data harvesting.