Datafication.WebConnector

A web scraping connector for .NET that provides seamless integration between web pages and the Datafication.Core DataBlock API.

Description

Datafication.WebConnector is a specialized connector library that bridges web content and the Datafication.Core ecosystem. It provides robust HTML parsing with multiple extraction strategies: HTML tables, CSS selectors, links, images, and page metadata (Open Graph, Twitter Cards, JSON-LD). The connector supports both HTTP-only and headless browser rendering via Puppeteer, enabling scraping of JavaScript-heavy single-page applications. Additional sinks enable exporting DataBlocks as PDF documents or screenshot images.

Key Features

HTML Table Extraction: Parse HTML tables with automatic header detection, column merging, and metadata tracking
CSS Selector Scraping: Flexibly extract any structured data using CSS selectors with sub-selectors for nested content
Link Extraction: Extract and filter links with support for internal/external filtering, regex patterns, and deduplication
Image Extraction: Extract images with lazy-load detection, size filtering, and srcset/background image support
Page Metadata: Extract Open Graph, Twitter Cards, JSON-LD structured data, and standard meta tags for SEO analysis
Browser Rendering: Optional Puppeteer-based rendering for JavaScript-heavy pages and SPAs
HTTP Configuration: Customizable User-Agent, headers, cookies, timeouts, and redirect handling
Screenshot Export: Render DataBlocks as PNG/JPEG images with customizable styling
PDF Export: Generate PDF reports from DataBlocks with title pages and formatting
Factory Pattern: Create connectors using the WebConnectorFactory for simplified instantiation
Error Handling: Global error handler configuration for graceful exception management
Validation: Built-in configuration validation ensures correct setup before processing
Cross-Platform: Works on Windows, Linux, and macOS

Description
- Key Features
Installation
Usage Examples
Configuration Reference
API Reference
Common Patterns
Performance Tips
License

Installation

Note: Datafication.WebConnector is currently in pre-release. The packages are now available on nuget.org.

dotnet add package Datafication.WebConnector

Running the Samples:

cd samples/WebTableExtraction
dotnet run

Dependencies:

AngleSharp - for HTML parsing
PuppeteerSharp - for browser rendering (optional, required for UseBrowser mode)

Usage Examples

Extracting HTML Tables

Extract HTML tables from web pages with automatic header detection:

using Datafication.Connectors.WebConnector.Connectors;

// Create configuration for table extraction
var configuration = new HtmlTableConnectorConfiguration
{
    Source = new Uri("https://example.com/data-tables.html"),
    TableSelector = "table.data-table",  // Target specific tables
    FirstRowIsHeader = true,
    IncludeTableMetadata = true,  // Adds TableIndex, TableId, RowIndex columns
    TrimCellValues = true
};

// Create connector and extract data
var connector = new HtmlTableConnector(configuration);
var data = await connector.GetDataAsync();

Console.WriteLine($"Extracted {data.RowCount} rows from table");
Console.WriteLine(await data.TextTableAsync());

Merging Multiple Tables:

var configuration = new HtmlTableConnectorConfiguration
{
    Source = new Uri("https://example.com/multi-table-page.html"),
    TableSelector = "table",  // Match all tables
    MergeTables = true,  // Combine all tables into one DataBlock
    IncludeTableMetadata = true  // Track which table each row came from
};

var connector = new HtmlTableConnector(configuration);
var allTables = await connector.GetDataAsync();

// Analyze data by source table
var tableStats = allTables
    .GroupBy("TableIndex")
    .Info();
Console.WriteLine(await tableStats.TextTableAsync());

CSS Selector Scraping

Use CSS selectors to extract any structured content:

using Datafication.Connectors.WebConnector.Connectors;

// Scrape product listings
var configuration = new CssSelectorConnectorConfiguration
{
    Source = new Uri("https://example.com/products"),
    Selector = ".product-card",  // Each matched element becomes a row

    // Extract text content from nested elements
    SubSelectors = new Dictionary<string, string>
    {
        { "ProductName", "h2.title" },
        { "Price", ".price-value" },
        { "Description", ".product-description" }
    },

    // Extract attributes from nested elements
    AttributeSubSelectors = new Dictionary<string, string>
    {
        { "ImageUrl", "img|src" },          // selector|attribute format
        { "ProductLink", "a.details|href" },
        { "DataId", ".|data-product-id" }   // "." means current element
    },

    IncludeElementIndex = true,
    IncludeInnerText = false,  // We're using SubSelectors instead
    TrimValues = true,
    MaxElements = 100  // Limit results
};

var connector = new CssSelectorConnector(configuration);
var products = await connector.GetDataAsync();

Console.WriteLine($"Found {products.RowCount} products");
Console.WriteLine(await products.TextTableAsync());

Extracting Article Listings:

var configuration = new CssSelectorConnectorConfiguration
{
    Source = new Uri("https://news.example.com/"),
    Selector = "article.news-item",
    SubSelectors = new Dictionary<string, string>
    {
        { "Headline", "h2" },
        { "Summary", ".summary" },
        { "Author", ".author-name" },
        { "Date", "time" }
    },
    AttributeSubSelectors = new Dictionary<string, string>
    {
        { "ArticleUrl", "a|href" },
        { "PublishDate", "time|datetime" }
    },
    IncludeOuterHtml = false,
    IncludeInnerHtml = false
};

var connector = new CssSelectorConnector(configuration);
var articles = await connector.GetDataAsync();

Extracting Links

Extract and filter links from web pages:

using Datafication.Connectors.WebConnector.Connectors;

// Extract all internal links
var configuration = new LinkExtractorConnectorConfiguration
{
    Source = new Uri("https://example.com/"),
    InternalLinksOnly = true,  // Only links to same domain
    RemoveDuplicates = true,
    IncludeLinkText = true,
    IncludeTitle = true,
    IncludeRel = true,
    ExcludePatterns = new List<string>
    {
        @"^javascript:",
        @"^mailto:",
        @"^tel:",
        @"^#$",
        @"/login",
        @"/logout"
    }
};

var connector = new LinkExtractorConnector(configuration);
var links = await connector.GetDataAsync();

Console.WriteLine($"Found {links.RowCount} internal links");

Filter by URL Pattern:

// Extract only PDF document links
var configuration = new LinkExtractorConnectorConfiguration
{
    Source = new Uri("https://example.com/resources"),
    UrlPattern = @"\.pdf$",  // Regex pattern
    IncludeLinkText = true
};

var connector = new LinkExtractorConnector(configuration);
var pdfLinks = await connector.GetDataAsync();

Extract External Links Only:

var configuration = new LinkExtractorConnectorConfiguration
{
    Source = new Uri("https://example.com/"),
    ExternalLinksOnly = true,
    LinkSelector = "article a[href]",  // Scope to article content only
    IncludeRel = true  // Check for nofollow, sponsored, etc.
};

var connector = new LinkExtractorConnector(configuration);
var externalLinks = await connector.GetDataAsync();

// Analyze external link patterns
var domainCounts = externalLinks
    .GroupBy("Href")
    .Info();

Extracting Images

Extract images and their metadata:

using Datafication.Connectors.WebConnector.Connectors;

var configuration = new ImageExtractorConnectorConfiguration
{
    Source = new Uri("https://example.com/gallery"),
    ImageSelector = "img",
    IncludeDataSrc = true,  // Detect lazy-loaded images
    IncludeSrcset = true,   // Include responsive image info
    IncludeParentInfo = true,
    RemoveDuplicates = true,
    ResolveUrls = true,  // Convert relative URLs to absolute
    MinWidth = 100,      // Filter out small images/icons
    MinHeight = 100,
    ExcludedExtensions = new List<string> { ".svg", ".ico" }
};

var connector = new ImageExtractorConnector(configuration);
var images = await connector.GetDataAsync();

Console.WriteLine($"Found {images.RowCount} images");
Console.WriteLine(await images.Head(10).TextTableAsync());

Include Background Images:

var configuration = new ImageExtractorConnectorConfiguration
{
    Source = new Uri("https://example.com/"),
    IncludeBackgroundImages = true,
    BackgroundImageSelector = ".hero, .banner, .featured",  // Scope search
    AllowedExtensions = new List<string> { ".jpg", ".jpeg", ".png", ".webp" }
};

var connector = new ImageExtractorConnector(configuration);
var allImages = await connector.GetDataAsync();

Extracting Page Metadata

Extract SEO metadata, Open Graph, Twitter Cards, and JSON-LD:

using Datafication.Connectors.WebConnector.Connectors;

var configuration = new PageMetadataConnectorConfiguration
{
    Source = new Uri("https://example.com/article/123"),
    ExtractStandardMeta = true,   // title, description, keywords, etc.
    ExtractOpenGraph = true,      // og:title, og:image, etc.
    ExtractTwitterCard = true,    // twitter:card, twitter:title, etc.
    ExtractJsonLd = true,         // Schema.org structured data
    ExtractLinkTags = true,       // canonical, favicon, feeds, etc.
    SingleRowResult = true        // One row with all metadata as columns
};

var connector = new PageMetadataConnector(configuration);
var metadata = await connector.GetDataAsync();

Console.WriteLine("Page Metadata:");
Console.WriteLine(await metadata.TextTableAsync());

// Access specific metadata values
var ogTitle = metadata.GetColumn("OG_title").GetValue<string>(0);
var ogImage = metadata.GetColumn("OG_image").GetValue<string>(0);
var jsonLd = metadata.GetColumn("JsonLd").GetValue<string>(0);

Name/Value Pair Format:

var configuration = new PageMetadataConnectorConfiguration
{
    Source = new Uri("https://example.com/"),
    SingleRowResult = false,  // Returns Name/Value pairs instead
    ExtractAllMetaTags = true  // Include all meta tags
};

var connector = new PageMetadataConnector(configuration);
var metadata = await connector.GetDataAsync();

// Result has Name and Value columns
Console.WriteLine(await metadata.TextTableAsync());

Browser Rendering for SPAs

Use Puppeteer for JavaScript-rendered content:

var configuration = new HtmlTableConnectorConfiguration
{
    Source = new Uri("https://spa-example.com/data"),
    UseBrowser = true,  // Enable Puppeteer rendering

    BrowserOptions = new BrowserOptions
    {
        Headless = true,
        ViewportWidth = 1920,
        ViewportHeight = 1080,
        PageLoadTimeoutMs = 60000,
        WaitStrategy = BrowserWaitStrategy.NetworkIdle,  // Wait for AJAX
        PostLoadDelayMs = 2000,  // Additional wait for animations

        // Execute JavaScript after page load
        PostLoadScript = @"
            // Click 'Load More' button if present
            const btn = document.querySelector('.load-more');
            if (btn) btn.click();
        "
    }
};

var connector = new HtmlTableConnector(configuration);
var data = await connector.GetDataAsync();

Browser with Custom Headers and Cookies:

var configuration = new CssSelectorConnectorConfiguration
{
    Source = new Uri("https://authenticated-site.com/dashboard"),
    Selector = ".data-item",
    UseBrowser = true,

    HttpOptions = new WebRequestOptions
    {
        UserAgent = "Custom Bot 1.0",
        Headers = new Dictionary<string, string>
        {
            { "Authorization", "Bearer token123" }
        },
        Cookies = new Dictionary<string, string>
        {
            { "session_id", "abc123" },
            { "user_pref", "dark_mode" }
        }
    },

    BrowserOptions = new BrowserOptions
    {
        WaitStrategy = BrowserWaitStrategy.NetworkIdle
    }
};

var connector = new CssSelectorConnector(configuration);
var data = await connector.GetDataAsync();

Using the Factory Pattern

Use WebConnectorFactory for simplified connector creation:

using Datafication.Connectors.WebConnector.Factories;

// Quick creation with defaults
var tableConnector = WebConnectorFactory.CreateHtmlTableConnector(
    new Uri("https://example.com/table.html"));
var tables = await tableConnector.GetDataAsync();

var linkExtractor = WebConnectorFactory.CreateLinkExtractor(
    new Uri("https://example.com/"));
var links = await linkExtractor.GetDataAsync();

var cssConnector = WebConnectorFactory.CreateCssSelectorConnector(
    new Uri("https://example.com/products"),
    ".product-item");
var products = await cssConnector.GetDataAsync();

var metadataConnector = WebConnectorFactory.CreatePageMetadataConnector(
    new Uri("https://example.com/article"));
var metadata = await metadataConnector.GetDataAsync();

var imageExtractor = WebConnectorFactory.CreateImageExtractor(
    new Uri("https://example.com/gallery"));
var images = await imageExtractor.GetDataAsync();

Factory with Configuration:

using Datafication.Core.Connectors;

// Use factory interface for dynamic connector creation
IDataConnectorFactory factory = new WebConnectorFactory();

var config = new HtmlTableConnectorConfiguration
{
    Source = new Uri("https://example.com/data"),
    MergeTables = true
};

IDataConnector connector = factory.CreateDataConnector(config);
var data = await connector.GetDataAsync();

Error Handling

Configure global error handling:

var configuration = new HtmlTableConnectorConfiguration
{
    Source = new Uri("https://example.com/tables"),
    ErrorHandler = (exception) =>
    {
        Console.WriteLine($"Web Connector Error: {exception.Message}");
        // Log to file, send alert, etc.
    }
};

var connector = new HtmlTableConnector(configuration);

try
{
    var data = await connector.GetDataAsync();
}
catch (HttpRequestException ex)
{
    Console.WriteLine($"HTTP error: {ex.Message}");
}
catch (TimeoutException ex)
{
    Console.WriteLine($"Request timed out: {ex.Message}");
}

Screenshot Export

Render DataBlocks as images:

using Datafication.Sinks.Connectors.WebConnector;

// Create or load a DataBlock
var data = await connector.GetDataAsync();

// Quick screenshot (async)
var imageBytes = await data.ScreenshotAsync(rowLimit: 50, title: "Data Preview");
await File.WriteAllBytesAsync("output.png", imageBytes);

// Save directly to file
await data.ScreenshotToFileAsync(
    "report.png",
    rowLimit: 100,
    title: "Sales Report",
    format: ScreenshotFormat.Png);

// JPEG with quality setting
await data.ScreenshotToFileAsync(
    "report.jpg",
    format: ScreenshotFormat.Jpeg,
    quality: 85);

// High-resolution (2x scale)
var highResBytes = await data.ScreenshotHighResAsync(
    rowLimit: 50,
    title: "High-Res Preview");

Advanced Screenshot Options:

var sink = new ScreenshotSink
{
    RowLimit = 100,
    Title = "Quarterly Report",
    Subtitle = "Q4 2024 Sales Data",
    Format = ScreenshotFormat.Png,
    ViewportWidth = 1400,
    FullPage = true,
    DeviceScaleFactor = 2,  // High DPI
    BackgroundColor = "#f5f5f5",
    HeaderBackgroundColor = "#2563eb",
    HeaderTextColor = "#ffffff",
    UseAlternateRowColor = true,
    CustomCss = @"
        .title { color: #1e40af; }
        td { font-size: 12px; }
    "
};

var imageBytes = await sink.Transform(dataBlock);

PDF Export

Generate PDF reports from DataBlocks:

using Datafication.Sinks.Connectors.WebConnector;

var data = await connector.GetDataAsync();

var pdfSink = new PdfSink
{
    Title = "Data Export Report",
    Author = "Datafication System",
    Description = "Automated data extraction from web sources",
    RowLimit = 1000,
    LandscapeOrientation = true
};

using var pdfStream = await pdfSink.Transform(data);
using var fileStream = File.Create("report.pdf");
await pdfStream.CopyToAsync(fileStream);

Console.WriteLine("PDF report generated");

Configuration Reference

WebConnectorConfigurationBase

Base configuration shared by all web connectors.

Properties:

Source (Uri, required): The URL to scrape
- Must be an absolute HTTP or HTTPS URL
- File URIs are not supported
Id (string, auto-generated): Unique identifier for the configuration
UseBrowser (bool, default: false): Whether to use Puppeteer for rendering
- false: HTTP-only fetching (faster, no JavaScript)
- true: Full browser rendering (supports SPAs)
HttpOptions (WebRequestOptions): HTTP request configuration
BrowserOptions (BrowserOptions): Browser rendering configuration
ErrorHandler (Action?, optional): Global exception handler

WebRequestOptions

HTTP request configuration used when UseBrowser is false.

Properties:

UserAgent (string): User-Agent header
- Default: Modern Chrome user agent string
TimeoutSeconds (int, default: 30): Request timeout
FollowRedirects (bool, default: true): Follow HTTP redirects
MaxRedirects (int, default: 10): Maximum redirects to follow
Headers (Dictionary<string, string>): Custom HTTP headers
Cookies (Dictionary<string, string>): Cookies to send
Accept (string): Accept header value
AcceptLanguage (string?): Accept-Language header value

BrowserOptions

Puppeteer browser configuration used when UseBrowser is true.

Properties:

Headless (bool, default: true): Run browser without visible window
ViewportWidth (int, default: 1920): Browser viewport width
ViewportHeight (int, default: 1080): Browser viewport height
PageLoadTimeoutMs (int, default: 30000): Page load timeout
WaitStrategy (BrowserWaitStrategy): When to consider page "loaded"
- Load: Wait for load event (fastest)
- DOMContentLoaded: Wait for DOM ready
- NetworkIdle: Wait for network idle (most reliable)
PostLoadDelayMs (int, default: 0): Additional delay after page load
PostLoadScript (string?): JavaScript to execute after load
ExecutablePath (string?): Custom browser executable path
LaunchArgs (string[]?): Additional browser launch arguments

HtmlTableConnectorConfiguration

Configuration for HTML table extraction.

Properties:

TableSelector (string, default: "table"): CSS selector for tables
TableIndex (int?, default: null): Extract specific table by index (0-based)
FirstRowIsHeader (bool, default: true): Treat first row as headers
UseTheadForHeaders (bool, default: true): Use <thead> for headers when available
IncludeTableMetadata (bool, default: true): Add TableIndex, TableId, TableClass, RowIndex columns
MergeTables (bool, default: false): Combine all matched tables
SkipEmptyRows (bool, default: true): Skip rows with all empty cells
TrimCellValues (bool, default: true): Trim whitespace from cell values

CssSelectorConnectorConfiguration

Configuration for CSS selector-based scraping.

Properties:

Selector (string, default: "*"): Primary CSS selector for elements
SubSelectors (Dictionary<string, string>): Map column names to CSS selectors for text extraction
AttributeSubSelectors (Dictionary<string, string>): Map column names to "selector|attribute" for attribute extraction
Attributes (List): Attribute names to extract from matched elements
IncludeInnerText (bool, default: true): Include text content column
IncludeInnerHtml (bool, default: false): Include inner HTML column
IncludeOuterHtml (bool, default: false): Include outer HTML column
IncludeTagName (bool, default: true): Include tag name column
IncludeElementIndex (bool, default: true): Include element index column
MaxElements (int?, default: null): Limit number of elements
TrimValues (bool, default: true): Trim whitespace from values

LinkExtractorConnectorConfiguration

Configuration for link extraction.

Properties:

LinkSelector (string, default: "a[href]"): CSS selector for links
InternalLinksOnly (bool, default: false): Only include same-domain links
ExternalLinksOnly (bool, default: false): Only include different-domain links
UrlPattern (string?, default: null): Regex pattern to filter URLs
RemoveDuplicates (bool, default: true): Remove duplicate URLs
ExcludePatterns (List): Regex patterns to exclude (default excludes javascript:, mailto:, tel:, #)
IncludeLinkText (bool, default: true): Include anchor text
IncludeTitle (bool, default: true): Include title attribute
IncludeRel (bool, default: true): Include rel attribute
IncludeTarget (bool, default: true): Include target attribute
IncludeAnchorId (bool, default: false): Include anchor id attribute
IncludeAnchorClass (bool, default: false): Include anchor class attribute

ImageExtractorConnectorConfiguration

Configuration for image extraction.

Properties:

ImageSelector (string, default: "img"): CSS selector for images
IncludeBackgroundImages (bool, default: false): Extract CSS background images
BackgroundImageSelector (string, default: "*"): Scope for background image search
MinWidth (int?, default: null): Minimum image width filter
MinHeight (int?, default: null): Minimum image height filter
AllowedExtensions (List): Only include these extensions
ExcludedExtensions (List): Exclude these extensions
IncludeDataSrc (bool, default: true): Detect lazy-loaded images
IncludeSrcset (bool, default: true): Include srcset and sizes
IncludeParentInfo (bool, default: true): Include parent element info
RemoveDuplicates (bool, default: true): Remove duplicate URLs
ResolveUrls (bool, default: true): Convert relative URLs to absolute

PageMetadataConnectorConfiguration

Configuration for page metadata extraction.

Properties:

ExtractStandardMeta (bool, default: true): Extract standard meta tags
ExtractOpenGraph (bool, default: true): Extract Open Graph tags
ExtractTwitterCard (bool, default: true): Extract Twitter Card tags
ExtractJsonLd (bool, default: true): Extract JSON-LD structured data
ExtractLinkTags (bool, default: true): Extract link tags (canonical, favicon, etc.)
CustomMetaTags (Dictionary<string, string>): Custom meta tags to extract
ExtractAllMetaTags (bool, default: false): Extract all meta tags
SingleRowResult (bool, default: true): Return single row (vs name/value pairs)

API Reference

For complete API documentation, see the Datafication.Connectors.WebConnector API Reference.

Connector Classes

HtmlTableConnector

Constructor: HtmlTableConnector(HtmlTableConnectorConfiguration configuration)
Methods:
- Task<DataBlock> GetDataAsync() - Extract tables as DataBlock
- Task<IStorageDataBlock> GetStorageDataAsync(IStorageDataBlock target, int batchSize = 10000)
- string GetConnectorId() - Get unique identifier

CssSelectorConnector

Constructor: CssSelectorConnector(CssSelectorConnectorConfiguration configuration)
Methods: Same as HtmlTableConnector

LinkExtractorConnector

Constructor: LinkExtractorConnector(LinkExtractorConnectorConfiguration configuration)
Methods: Same as HtmlTableConnector
Output Columns: Href, IsExternal, Text, Title, Rel, Target, AnchorId, AnchorClass

ImageExtractorConnector

Constructor: ImageExtractorConnector(ImageExtractorConnectorConfiguration configuration)
Methods: Same as HtmlTableConnector
Output Columns: ElementIndex, Src, Alt, Title, Width, Height, Loading, IsBackground, Srcset, Sizes, DataSrc, ParentTag, ParentClass, FileExtension

PageMetadataConnector

Constructor: PageMetadataConnector(PageMetadataConnectorConfiguration configuration)
Methods: Same as HtmlTableConnector

Sink Classes

ScreenshotSink (namespace: Datafication.Sinks.Connectors.WebConnector)

Implements: IDataSink<byte[]>
Properties:
- int RowLimit (default: 100)
- ScreenshotFormat Format (default: Png)
- int ViewportWidth (default: 1200)
- bool FullPage (default: true)
- string BackgroundColor (default: "#ffffff")
- int? Quality (JPEG only)
- double DeviceScaleFactor (default: 1)
- string? Title, string? Subtitle
- string? CustomCss
- bool UseAlternateRowColor (default: true)
- string HeaderBackgroundColor, string HeaderTextColor
Methods:
- Task<byte[]> Transform(DataBlock dataBlock)

PdfSink (namespace: Datafication.Sinks.Connectors.WebConnector)

Implements: IDataSink<Stream>
Properties:
- int RowLimit (default: 1000)
- bool LandscapeOrientation (default: true)
- string? Title, string? Author, string? Description
Methods:
- Task<Stream> Transform(DataBlock dataBlock)

Extension Methods

ScreenshotSinkExtension (namespace: Datafication.Sinks.Connectors.WebConnector)

// Async methods
Task<byte[]> ScreenshotAsync(this DataBlock dataBlock, int rowLimit = 100, string? title = null)
Task ScreenshotToFileAsync(this DataBlock dataBlock, string filePath, int rowLimit = 100, string? title = null, ScreenshotFormat format = Png, int? quality = null)
Task<byte[]> ScreenshotHighResAsync(this DataBlock dataBlock, int rowLimit = 100, string? title = null)

// Synchronous methods
byte[] Screenshot(this DataBlock dataBlock, int rowLimit = 100, string? title = null)
void ScreenshotToFile(this DataBlock dataBlock, string filePath, ...)

Factory Methods

WebConnectorFactory (namespace: Datafication.Connectors.WebConnector.Factories)

// Instance method (IDataConnectorFactory interface)
IDataConnector CreateDataConnector(IDataConnectorConfiguration configuration)

// Static factory methods
static HtmlTableConnector CreateHtmlTableConnector(Uri source)
static LinkExtractorConnector CreateLinkExtractor(Uri source)
static CssSelectorConnector CreateCssSelectorConnector(Uri source, string selector)
static PageMetadataConnector CreatePageMetadataConnector(Uri source)
static ImageExtractorConnector CreateImageExtractor(Uri source)

Common Patterns

SEO Analysis Pipeline

// Extract metadata from multiple pages
var urls = new[]
{
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/products"
};

var allMetadata = new DataBlock();
allMetadata.AddColumn(new DataColumn("Url", typeof(string)));
allMetadata.AddColumn(new DataColumn("Title", typeof(string)));
allMetadata.AddColumn(new DataColumn("Description", typeof(string)));
allMetadata.AddColumn(new DataColumn("OG_title", typeof(string)));
allMetadata.AddColumn(new DataColumn("OG_image", typeof(string)));
allMetadata.AddColumn(new DataColumn("Canonical", typeof(string)));
allMetadata.AddColumn(new DataColumn("HasJsonLd", typeof(bool)));

foreach (var url in urls)
{
    var config = new PageMetadataConnectorConfiguration
    {
        Source = new Uri(url),
        SingleRowResult = true
    };

    var connector = new PageMetadataConnector(config);
    var metadata = await connector.GetDataAsync();

    allMetadata.AddRow(new object[]
    {
        url,
        metadata.GetColumn("Title").GetValue<string>(0) ?? "",
        metadata.GetColumn("Description").GetValue<string>(0) ?? "",
        metadata.GetColumn("OG_title").GetValue<string>(0) ?? "",
        metadata.GetColumn("OG_image").GetValue<string>(0) ?? "",
        metadata.GetColumn("Canonical").GetValue<string>(0) ?? "",
        !string.IsNullOrEmpty(metadata.GetColumn("JsonLd").GetValue<string>(0))
    });
}

Console.WriteLine("SEO Analysis:");
Console.WriteLine(await allMetadata.TextTableAsync());

Product Catalog Scraping

// Scrape e-commerce product listings
var config = new CssSelectorConnectorConfiguration
{
    Source = new Uri("https://store.example.com/products?page=1"),
    Selector = ".product-card",
    SubSelectors = new Dictionary<string, string>
    {
        { "Name", ".product-name" },
        { "Price", ".price" },
        { "OriginalPrice", ".original-price" },
        { "Rating", ".rating-value" },
        { "ReviewCount", ".review-count" }
    },
    AttributeSubSelectors = new Dictionary<string, string>
    {
        { "ProductUrl", "a.product-link|href" },
        { "ImageUrl", "img|src" },
        { "ProductId", ".|data-product-id" }
    },
    UseBrowser = true,  // For JavaScript-rendered content
    BrowserOptions = new BrowserOptions
    {
        WaitStrategy = BrowserWaitStrategy.NetworkIdle
    }
};

var connector = new CssSelectorConnector(config);
var products = await connector.GetDataAsync();

// Clean and transform data
var cleanedProducts = products
    .Compute("PriceValue", "REPLACE(Price, '$', '')")
    .Compute("HasDiscount", "OriginalPrice != ''")
    .Select("ProductId", "Name", "PriceValue", "HasDiscount", "Rating", "ProductUrl", "ImageUrl");

Console.WriteLine($"Found {cleanedProducts.RowCount} products");
Console.WriteLine(await cleanedProducts.Head(10).TextTableAsync());

Link Audit and Broken Link Detection

// Extract all links for auditing
var config = new LinkExtractorConnectorConfiguration
{
    Source = new Uri("https://example.com/"),
    RemoveDuplicates = true,
    IncludeLinkText = true,
    IncludeRel = true
};

var connector = new LinkExtractorConnector(config);
var links = await connector.GetDataAsync();

// Categorize links
var internalLinks = links.Where("IsExternal", false);
var externalLinks = links.Where("IsExternal", true);
var nofollowLinks = links.Where("Rel", "nofollow", ComparisonOperator.Contains);

Console.WriteLine($"Internal links: {internalLinks.RowCount}");
Console.WriteLine($"External links: {externalLinks.RowCount}");
Console.WriteLine($"Nofollow links: {nofollowLinks.RowCount}");

// Export for further analysis
var linksCsv = await links.CsvStringSinkAsync();
await File.WriteAllTextAsync("link_audit.csv", linksCsv);

Image Asset Inventory

// Create inventory of all images on a site
var config = new ImageExtractorConnectorConfiguration
{
    Source = new Uri("https://example.com/"),
    IncludeDataSrc = true,
    IncludeSrcset = true,
    IncludeParentInfo = true,
    RemoveDuplicates = true,
    ResolveUrls = true
};

var connector = new ImageExtractorConnector(config);
var images = await connector.GetDataAsync();

// Analyze image usage
var byExtension = images
    .GroupBy("FileExtension")
    .Info();

var missingAlt = images
    .Where("Alt", "", ComparisonOperator.Equals)
    .Select("Src", "ParentTag", "ParentClass");

Console.WriteLine("Images by extension:");
Console.WriteLine(await byExtension.TextTableAsync());

Console.WriteLine($"\nImages missing alt text: {missingAlt.RowCount}");
Console.WriteLine(await missingAlt.Head(10).TextTableAsync());

Performance Tips

Use HTTP-Only When Possible: Browser rendering (UseBrowser = true) is significantly slower than HTTP-only fetching. Only enable browser mode when the page requires JavaScript rendering.
Choose the Right Wait Strategy: When using browser mode, BrowserWaitStrategy.NetworkIdle is most reliable but slowest. Use DOMContentLoaded or Load for faster extraction when content doesn't require AJAX.

Limit Result Sets: Use MaxElements in CSS selector configuration or filter results early to reduce memory usage:

var config = new CssSelectorConnectorConfiguration
{
    MaxElements = 100  // Limit to first 100 elements
};

Scope Your Selectors: Use specific CSS selectors to reduce parsing overhead:

TableSelector = "table#data-table"  // Faster than "table"
LinkSelector = "article a[href]"    // Faster than "a[href]"

Remove Unnecessary Columns: Disable metadata columns if not needed:

IncludeTableMetadata = false
IncludeElementIndex = false
IncludeParentInfo = false

Enable Deduplication: Use RemoveDuplicates = true for links and images to reduce data volume.

Set Appropriate Timeouts: Adjust timeouts based on expected page load times:

HttpOptions = new WebRequestOptions
{
    TimeoutSeconds = 15  // Fail fast on slow pages
}

Cache Browser Instance: Puppeteer downloads Chromium on first use. The connector caches this, but initial runs will be slower. Consider pre-warming in production environments.

Dispose DataBlocks: For large scraping operations, dispose intermediate DataBlocks:

using (var rawData = await connector.GetDataAsync())
{
    var processed = rawData.Where(...).Select(...);
    // rawData disposed after processing
}

Use Factory for Simple Cases: When using default configurations, the factory methods provide cleaner code with less overhead:
```
var connector = WebConnectorFactory.CreateHtmlTableConnector(uri);
```

License

This library is licensed under the Datafication SDK License Agreement. See the LICENSE file for details.

Summary:

Free Use: Organizations with fewer than 5 developers AND annual revenue under $500,000 USD may use the SDK without a commercial license
Commercial License Required: Organizations with 5+ developers OR annual revenue exceeding $500,000 USD must obtain a commercial license
Open Source Exemption: Open source projects meeting specific criteria may be exempt from developer count limits

For commercial licensing inquiries, contact support@datafication.co.

Third-Party Libraries:

AngleSharp - MIT License
PuppeteerSharp - MIT License

Datafication.WebConnector - Seamlessly connect web content to the Datafication ecosystem.

For more examples and documentation, visit our samples directory.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Datafication.WebConnector

Description

Key Features

Table of Contents

Installation

Usage Examples

Extracting HTML Tables

CSS Selector Scraping

Extracting Links

Extracting Images

Extracting Page Metadata

Browser Rendering for SPAs

Using the Factory Pattern

Error Handling

Screenshot Export

PDF Export

Configuration Reference

WebConnectorConfigurationBase

WebRequestOptions

BrowserOptions

HtmlTableConnectorConfiguration

CssSelectorConnectorConfiguration

LinkExtractorConnectorConfiguration

ImageExtractorConnectorConfiguration

PageMetadataConnectorConfiguration

API Reference

Connector Classes

Sink Classes

Extension Methods

Factory Methods

Common Patterns

SEO Analysis Pipeline

Product Catalog Scraping

Link Audit and Broken Link Detection

Image Asset Inventory

Performance Tips

License

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Datafication.WebConnector

Description

Key Features

Table of Contents

Installation

Usage Examples

Extracting HTML Tables

CSS Selector Scraping

Extracting Links

Extracting Images

Extracting Page Metadata

Browser Rendering for SPAs

Using the Factory Pattern

Error Handling

Screenshot Export

PDF Export

Configuration Reference

WebConnectorConfigurationBase

WebRequestOptions

BrowserOptions

HtmlTableConnectorConfiguration

CssSelectorConnectorConfiguration

LinkExtractorConnectorConfiguration

ImageExtractorConnectorConfiguration

PageMetadataConnectorConfiguration

API Reference

Connector Classes

Sink Classes

Extension Methods

Factory Methods

Common Patterns

SEO Analysis Pipeline

Product Catalog Scraping

Link Audit and Broken Link Detection

Image Asset Inventory

Performance Tips

License