Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage. The import into the intermediate extracting system is thus usually followed by data transformation and possibly the addition of metadata prior to export to another stage in the data workflow.

Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports, and spool files. Extracting data from these sources has grown into a considerable technical challenge. Whereas data extraction historically had to deal with changes in physical hardware formats, most current data extraction deals with unstructured data sources and with differing software formats. Data extraction from the web in particular is referred to as web scraping.

Business Value

Data extraction typically takes one of three approaches: (1) using text pattern matching, such as regular expressions, to identify small- or large-scale structure; (2) using a table-based approach to identify common sections within a limited domain, e.g. identifying skills, previous work experience, and qualifications in emailed resumes by means of a standard set of commonly used headings; or (3) using text analytics to attempt to understand the text and link it to other information.
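
As a rough illustration of approaches (1) and (2), the following Python sketch pulls email addresses out of a text with a regular expression and then splits the text into sections using a small table of common headings. The sample resume, the heading table, and the function names are all assumptions made for this example; a real system would need a far larger heading table and a more careful email pattern.

```python
import re

# Hypothetical sample input: a plain-text resume as it might arrive by email.
SAMPLE = """\
Jane Doe
jane.doe@example.com

Skills
Python, SQL

Work Experience
Data analyst, Acme Corp, 2019-2023

Qualifications
BSc Statistics
"""

# Approach (1): a regular expression identifies small-scale structure.
# This pattern is deliberately simple and does not cover every valid address.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Approach (2): an assumed lookup table that maps commonly used headings
# to a canonical section name for the resume domain.
HEADINGS = {
    "skills": "skills",
    "work experience": "experience",
    "experience": "experience",
    "qualifications": "qualifications",
    "education": "qualifications",
}


def extract_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)


def split_sections(text):
    """Group non-empty lines under the most recent recognised heading."""
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        key = HEADINGS.get(stripped.lower().rstrip(":"))
        if key is not None:
            # A known heading starts (or resumes) a section.
            current = key
            sections.setdefault(current, [])
        elif current is not None and stripped:
            sections[current].append(stripped)
    return sections


if __name__ == "__main__":
    print(extract_emails(SAMPLE))
    print(split_sections(SAMPLE))
```

Lines that appear before the first recognised heading (here, the name and the email address) are ignored by split_sections, which is why the pattern-matching step is run separately over the whole text.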

Company Name          Product Name                              Type
Averbis               Information Discovery                     Commercial
Everteam              Enterprise Data Integration               Commercial
Hyland                Document Filters                          Commercial
iManage               iManage RAVN Extract                      Commercial
Information Builders  Data Management Platform                  Commercial
NetOwl                NetOwl Extractor                          Commercial
SAP                   SAP HANA                                  Commercial
SAP                   SAP Data Services                         Commercial
SAP                   SAP Data Services Text Data Processing    Commercial
SAP                   SAP Business Intelligence (BI) Solutions  Commercial