A high-performance Excel file connector for .NET that provides seamless integration between Excel workbooks (XLSX, XLS) and the Datafication.Core DataBlock API.
Datafication.ExcelConnector is a specialized connector library that bridges Excel files and the Datafication.Core ecosystem. Built on the robust ExcelDataReader and ClosedXML libraries, it provides comprehensive Excel file support with flexible sheet selection, column filtering, row limiting, header customization, and both in-memory and streaming batch operations. The connector handles complex Excel workbooks with multiple sheets while maintaining high performance and ease of use.
- Excel Format Support: Read and write Excel files in both XLSX and XLS formats
- Multiple Source Types: Load Excel from local files, relative paths, or remote URLs (HTTP/HTTPS)
- Flexible Sheet Selection: Select sheets by name or index, defaults to first sheet
- Advanced Configuration: Header row offset, row skipping, column filtering, and row limiting
- Pandas-Style API: Configuration options inspired by pandas.read_excel() for familiarity
- Streaming Support: Efficient batch loading for large Excel files with
GetStorageDataAsync - Shorthand API: Simple one-line methods for common Excel loading scenarios
- Excel Export: Convert DataBlocks to professional XLSX files with
ExcelSink - Auto-Sized Columns: Exported Excel files have intelligently sized columns
- Type Preservation: Maintains proper data types for numbers, dates, booleans, and strings
- Error Handling: Global error handler configuration for graceful exception management
- Validation: Built-in configuration validation ensures correct setup before processing
- Cross-Platform: Works on Windows, Linux, and macOS
- Description
- Installation
- Usage Examples
- Loading Excel Files (Shorthand)
- Loading Excel with Configuration
- Selecting Specific Sheets
- Loading from Remote URLs
- Column Filtering with UseColumns
- Row Limiting with NRows
- Skipping Rows and Custom Headers
- Excel Files without Headers
- Streaming Large Excel Files to Storage
- Writing DataBlocks to Excel
- Error Handling
- Working with Excel Data
- Configuration Reference
- API Reference
- Common Patterns
- Performance Tips
- License
Note: Datafication.ExcelConnector is currently in pre-release. The packages are now available on nuget.org.
dotnet add package Datafication.ExcelConnectorRunning the Samples:
cd samples/ExcelBasicLoad
dotnet runDependencies:
- ExcelDataReader (MIT License) - for reading Excel files
- ClosedXML (MIT License) - for writing Excel files
The simplest way to load an Excel file is using the shorthand extension methods:
using Datafication.Core.Data;
using Datafication.Extensions.Connectors.ExcelConnector;
// Load Excel from local file (async) - uses first sheet
var employees = await DataBlock.Connector.LoadExcelAsync("data/employees.xlsx");
Console.WriteLine($"Loaded {employees.RowCount} employees from Excel");
// Synchronous version
var departments = DataBlock.Connector.LoadExcel("data/departments.xlsx");For more control over Excel parsing, use the full configuration:
using Datafication.Connectors.ExcelConnector;
// Create configuration with custom settings
var configuration = new ExcelConnectorConfiguration
{
Source = new Uri("file:///data/employees.xlsx"),
HasHeader = true,
SheetName = "Employees"
};
// Create connector and load data
var connector = new ExcelDataConnector(configuration);
var data = await connector.GetDataAsync();
Console.WriteLine($"Loaded {data.RowCount} rows with {data.Schema.Count} columns");Load data from specific sheets by name or index:
// Load by sheet name
var configuration = new ExcelConnectorConfiguration
{
Source = new Uri("file:///data/sales_report.xlsx"),
SheetName = "Q1 Sales", // Select sheet by name
HasHeader = true
};
var connector = new ExcelDataConnector(configuration);
var q1Data = await connector.GetDataAsync();
// Load by sheet index (0-based)
var config2 = new ExcelConnectorConfiguration
{
Source = new Uri("file:///data/sales_report.xlsx"),
SheetIndex = 1, // Select second sheet
HasHeader = true
};
var connector2 = new ExcelDataConnector(config2);
var q2Data = await connector2.GetDataAsync();
// Note: If both SheetName and SheetIndex are set, SheetName takes precedence
// If neither is set, the first sheet is loadedLoad Excel files directly from HTTP/HTTPS URLs:
// Load from remote URL (async)
var remoteData = await DataBlock.Connector.LoadExcelAsync(
"https://example.com/data/sales.xlsx"
);
// Or with full configuration
var configuration = new ExcelConnectorConfiguration
{
Source = new Uri("https://example.com/reports/monthly.xlsx"),
HasHeader = true,
SheetName = "Summary"
};
var connector = new ExcelDataConnector(configuration);
var webData = await connector.GetDataAsync();
Console.WriteLine($"Downloaded and loaded {webData.RowCount} rows");Load only specific columns from an Excel file:
// Load only CustomerID and CompanyName columns
var configuration = new ExcelConnectorConfiguration
{
Source = new Uri("file:///data/customers.xlsx"),
HasHeader = true,
UseColumns = "CustomerID, CompanyName" // Comma-separated column names
};
var connector = new ExcelDataConnector(configuration);
var data = await connector.GetDataAsync();
Console.WriteLine($"Loaded {data.Schema.Count} columns"); // Should be 2
Assert.IsTrue(data.HasColumn("CustomerID"));
Assert.IsTrue(data.HasColumn("CompanyName"));
// UseColumns is case-insensitive and trims whitespace
var config2 = new ExcelConnectorConfiguration
{
Source = new Uri("file:///data/products.xlsx"),
HasHeader = true,
UseColumns = "productid,productname,unitprice" // Can use lowercase
};Limit the number of rows loaded from an Excel file:
// Load only first 100 rows (after header)
var configuration = new ExcelConnectorConfiguration
{
Source = new Uri("file:///data/large_dataset.xlsx"),
HasHeader = true,
NRows = 100 // Limit to 100 data rows
};
var connector = new ExcelDataConnector(configuration);
var sample = await connector.GetDataAsync();
Assert.AreEqual(100, sample.RowCount);
// Useful for quick data preview or testing
var previewConfig = new ExcelConnectorConfiguration
{
Source = new Uri("file:///data/sales.xlsx"),
HasHeader = true,
NRows = 5
};
var preview = await DataBlock.Connector.LoadExcelAsync(previewConfig);
Console.WriteLine("First 5 rows preview:");
Console.WriteLine(await preview.TextTableAsync());Handle Excel files with blank rows or non-standard header placement:
// Skip first 2 rows, then treat row 3 as header
var configuration = new ExcelConnectorConfiguration
{
Source = new Uri("file:///data/report.xlsx"),
HasHeader = true,
SkipRows = 2, // Skip first 2 rows
HeaderRow = 0 // Header is the first row after skipping
};
var connector = new ExcelDataConnector(configuration);
var data = await connector.GetDataAsync();
// Advanced: Skip rows and offset header
var config2 = new ExcelConnectorConfiguration
{
Source = new Uri("file:///data/complex_report.xlsx"),
HasHeader = true,
SkipRows = 3, // Skip first 3 rows
HeaderRow = 1 // Header is 1 row after that (row 5 in the file)
};
// SkipRows + HeaderRow gives fine-grained control over where to find headersLoad Excel files that don't have a header row:
var configuration = new ExcelConnectorConfiguration
{
Source = new Uri("file:///data/raw_data.xlsx"),
HasHeader = false // No header row
};
var connector = new ExcelDataConnector(configuration);
var data = await connector.GetDataAsync();
// Columns will be named using Excel conventions: Column1, Column2, Column3, etc.
foreach (var columnName in data.Schema.GetColumnNames())
{
Console.WriteLine($"Column: {columnName}");
}For large Excel files, stream data directly to VelocityDataBlock in batches:
using Datafication.Storage.Velocity;
// Create VelocityDataBlock for efficient large-scale storage
var velocityBlock = new VelocityDataBlock("data/large_dataset.dfc");
// Configure Excel source
var configuration = new ExcelConnectorConfiguration
{
Source = new Uri("file:///data/1_million_rows.xlsx"),
HasHeader = true,
SheetName = "Data"
};
// Stream Excel data in batches of 10,000 rows
var connector = new ExcelDataConnector(configuration);
await connector.GetStorageDataAsync(velocityBlock, batchSize: 10000);
Console.WriteLine($"Streamed {velocityBlock.RowCount} rows to storage");
await velocityBlock.FlushAsync();Convert DataBlocks to professional Excel files:
using Datafication.Sinks.Connectors.ExcelConnector;
// Create or load a DataBlock
var data = new DataBlock();
data.AddColumn(new DataColumn("Name", typeof(string)));
data.AddColumn(new DataColumn("Age", typeof(int)));
data.AddColumn(new DataColumn("Salary", typeof(decimal)));
data.AddColumn(new DataColumn("HireDate", typeof(DateTime)));
data.AddRow(new object[] { "Alice", 30, 75000m, new DateTime(2020, 1, 15) });
data.AddRow(new object[] { "Bob", 25, 65000m, new DateTime(2021, 3, 10) });
data.AddRow(new object[] { "Carol", 35, 85000m, new DateTime(2019, 7, 20) });
// Convert to Excel file (as byte array) - async
var excelBytes = await data.ExcelSinkAsync();
// Write to file
await File.WriteAllBytesAsync("output/employees.xlsx", excelBytes);
// Synchronous version
var excelOutput = data.ExcelSink();
File.WriteAllBytes("output/employees_sync.xlsx", excelOutput);
// The exported Excel file includes:
// - Header row with column names
// - Auto-sized columns based on content width
// - Proper data types (numbers, dates, booleans, strings)
// - Professional formattingConfigure global error handling for Excel operations:
var configuration = new ExcelConnectorConfiguration
{
Source = new Uri("file:///data/employees.xlsx"),
HasHeader = true,
ErrorHandler = (exception) =>
{
Console.WriteLine($"Excel Error: {exception.Message}");
// Log to file, send alert, etc.
}
};
var connector = new ExcelDataConnector(configuration);
try
{
var data = await connector.GetDataAsync();
}
catch (Exception ex)
{
Console.WriteLine($"Failed to load Excel: {ex.Message}");
}Once loaded, use the full DataBlock API for data manipulation:
// Load Excel file
var sales = await DataBlock.Connector.LoadExcelAsync("data/sales_data.xlsx");
// Filter, transform, and analyze
var result = sales
.Where("Region", "West")
.Where("Revenue", 10000m, ComparisonOperator.GreaterThan)
.Compute("Profit", "Revenue - Cost")
.Compute("Margin", "Profit / Revenue")
.Select("ProductName", "Revenue", "Profit", "Margin")
.Sort(SortDirection.Descending, "Profit")
.Head(10);
Console.WriteLine("Top 10 profitable products in West region:");
Console.WriteLine(await result.TextTableAsync());
// Export results back to Excel
var resultExcel = await result.ExcelSinkAsync();
await File.WriteAllBytesAsync("output/top_products.xlsx", resultExcel);Configuration class for Excel data sources with pandas-inspired options.
Properties:
-
Source(Uri, required): Location of the Excel data source- File path:
new Uri("file:///C:/data/file.xlsx") - HTTP/HTTPS URL:
new Uri("https://example.com/data.xlsx") - Relative path:
new Uri("data/file.xlsx", UriKind.Relative)
- File path:
-
SheetName(string?, optional): Name of the sheet to load- Default:
null(loads first sheet) - Example:
"Employees","Q1 Sales","Summary" - Takes precedence over
SheetIndexif both are set
- Default:
-
SheetIndex(int?, optional): Zero-based index of the sheet to load- Default:
null(loads first sheet) - Example:
0(first sheet),1(second sheet) - Ignored if
SheetNameis also specified
- Default:
-
HasHeader(bool, default: true): Whether the first row contains column headerstrue: First row (after skipping) is treated as column namesfalse: Columns are namedColumn1,Column2, etc.
-
HeaderRow(int, default: 0): Offset for the header row after skipping0: First row afterSkipRowsis the header1: Skip one additional row before reading header- Works in combination with
SkipRows
-
SkipRows(int, default: 0): Number of rows to skip before reading any data0: Start reading from first row2: Skip first 2 rows- Applied before
HeaderRowoffset
-
UseColumns(string?, optional): Comma-separated list of column names to include- Default:
null(load all columns) - Example:
"CustomerID, CompanyName, ContactName" - Case-insensitive matching
- Whitespace is trimmed
- Default:
-
NRows(int?, optional): Maximum number of data rows to read- Default:
null(read all rows) - Example:
100(read only first 100 data rows) - Counted after header row
- Default:
-
Id(string, auto-generated): Unique identifier for the configuration- Automatically generated as GUID if not specified
-
ErrorHandler(Action?, optional): Global exception handler- Provides centralized error handling for Excel operations
Example:
var config = new ExcelConnectorConfiguration
{
Source = new Uri("file:///data/sales.xlsx"),
SheetName = "Q1 Sales",
HasHeader = true,
SkipRows = 2,
UseColumns = "ProductID, ProductName, Revenue",
NRows = 1000,
Id = "q1-sales-connector",
ErrorHandler = ex => Console.WriteLine($"Error: {ex.Message}")
};For complete API documentation, see the Datafication.Connectors.ExcelConnector API Reference.
ExcelDataConnector
- Constructor
ExcelDataConnector(ExcelConnectorConfiguration configuration)- Creates connector with validation
- Methods
Task<DataBlock> GetDataAsync()- Loads Excel file into memory as DataBlockTask<IStorageDataBlock> GetStorageDataAsync(IStorageDataBlock target, int batchSize = 10000)- Streams Excel data in batchesstring GetConnectorId()- Returns unique connector identifier
- Properties
ExcelConnectorConfiguration Configuration- Current configuration
ExcelConnectorConfiguration
- Properties
Uri Source- Excel source locationstring? SheetName- Sheet name to loadint? SheetIndex- Sheet index to load (0-based)bool HasHeader- Header row flag (default: true)int HeaderRow- Header row offset (default: 0)int SkipRows- Rows to skip (default: 0)string? UseColumns- Column filter (comma-separated)int? NRows- Row limitstring Id- Unique identifierAction<Exception>? ErrorHandler- Error handler
ExcelSink
- Implements
IDataSink<byte[]> - Methods
Task<byte[]> Transform(DataBlock dataBlock)- Converts DataBlock to XLSX file (byte array)
- Features
- Auto-sized columns based on content width
- Proper type handling (numbers, dates, booleans, strings)
- Professional formatting
ExcelConnectorValidator
- Validates
ExcelConnectorConfigurationinstances - Methods
ValidationResult Validate(IDataConnectorConfiguration configuration)- Validates configuration
ExcelConnectorExtensions (namespace: Datafication.Extensions.Connectors.ExcelConnector)
// Async shorthand methods
Task<DataBlock> LoadExcelAsync(this ConnectorExtensions ext, string source)
Task<DataBlock> LoadExcelAsync(this ConnectorExtensions ext, ExcelConnectorConfiguration config)
// Synchronous shorthand methods
DataBlock LoadExcel(this ConnectorExtensions ext, string source)
DataBlock LoadExcel(this ConnectorExtensions ext, ExcelConnectorConfiguration config)ExcelSinkExtension (namespace: Datafication.Sinks.Connectors.ExcelConnector)
// Convert DataBlock to Excel
Task<byte[]> ExcelSinkAsync(this DataBlock dataBlock)
byte[] ExcelSink(this DataBlock dataBlock)// Extract: Load Excel
var rawData = await DataBlock.Connector.LoadExcelAsync("input/sales_data.xlsx");
// Transform: Clean and process
var transformed = rawData
.DropNulls(DropNullMode.Any)
.Where("Status", "Cancelled", ComparisonOperator.NotEquals)
.Compute("NetRevenue", "Revenue - Discount")
.Compute("ProfitMargin", "NetRevenue / Revenue")
.Select("OrderId", "ProductName", "NetRevenue", "ProfitMargin");
// Load: Export to Excel
var outputExcel = await transformed.ExcelSinkAsync();
await File.WriteAllBytesAsync("output/processed_sales.xlsx", outputExcel);
Console.WriteLine($"Processed {transformed.RowCount} orders");using Datafication.Storage.Velocity;
// Load Excel configuration
var excelConfig = new ExcelConnectorConfiguration
{
Source = new Uri("file:///data/large_sales.xlsx"),
HasHeader = true,
SheetName = "Sales Data"
};
// Create VelocityDataBlock with primary key
var velocityBlock = VelocityDataBlock.CreateEnterprise(
"data/sales.dfc",
primaryKeyColumn: "OrderId"
);
// Stream Excel data directly to VelocityDataBlock
var connector = new ExcelDataConnector(excelConfig);
await connector.GetStorageDataAsync(velocityBlock, batchSize: 50000);
await velocityBlock.FlushAsync();
Console.WriteLine($"Loaded {velocityBlock.RowCount} rows into VelocityDataBlock");
// Now query efficiently
var recentSales = velocityBlock
.Where("OrderDate", DateTime.Now.AddDays(-30), ComparisonOperator.GreaterThan)
.GroupByAggregate("Region", "Revenue", AggregationType.Sum, "total_revenue")
.Execute();// Load data from multiple sheets in the same workbook
var filePath = "data/annual_report.xlsx";
// Load Q1 data
var q1Config = new ExcelConnectorConfiguration
{
Source = new Uri(filePath),
SheetName = "Q1",
HasHeader = true
};
var q1Data = await DataBlock.Connector.LoadExcelAsync(q1Config);
// Load Q2 data
var q2Config = new ExcelConnectorConfiguration
{
Source = new Uri(filePath),
SheetName = "Q2",
HasHeader = true
};
var q2Data = await DataBlock.Connector.LoadExcelAsync(q2Config);
// Combine and analyze
var combined = q1Data.Clone();
combined.AppendRowsBatch(q2Data);
var summary = combined
.GroupBy("Region")
.Aggregate(
new[] { "Revenue" },
new Dictionary<string, AggregationType>
{
{ "Revenue", AggregationType.Sum }
}
);
Console.WriteLine("Half-Year Summary:");
Console.WriteLine(await summary.TextTableAsync());
// Export combined analysis
var reportExcel = await summary.ExcelSinkAsync();
await File.WriteAllBytesAsync("output/h1_summary.xlsx", reportExcel);// Load source data
var employees = await DataBlock.Connector.LoadExcelAsync("data/employees.xlsx");
// Generate department salary report
var salaryReport = employees
.GroupBy("Department")
.Aggregate(
new[] { "Salary" },
new Dictionary<string, AggregationType>
{
{ "Salary", AggregationType.Mean }
}
)
.Sort(SortDirection.Descending, "mean_Salary");
// Generate high performers report
var highPerformers = employees
.Where("PerformanceRating", 4, ComparisonOperator.GreaterThanOrEqual)
.Sort(SortDirection.Descending, "Salary")
.Select("Name", "Department", "Salary", "PerformanceRating");
// Export both reports to Excel
var salaryReportExcel = await salaryReport.ExcelSinkAsync();
await File.WriteAllBytesAsync("reports/salary_by_department.xlsx", salaryReportExcel);
var performersExcel = await highPerformers.ExcelSinkAsync();
await File.WriteAllBytesAsync("reports/high_performers.xlsx", performersExcel);
Console.WriteLine("Reports generated successfully");-
Use Streaming for Large Files: For Excel files with hundreds of thousands of rows, use
GetStorageDataAsyncto stream data directly to VelocityDataBlockawait connector.GetStorageDataAsync(velocityBlock, batchSize: 100000);
-
Filter Columns Early: Use
UseColumnsto load only needed columns, reducing memory usage and improving speedUseColumns = "OrderID, CustomerID, Revenue" // Load only 3 columns
-
Limit Rows for Testing: Use
NRowswhen developing and testing to work with smaller datasetsNRows = 100 // Quick preview of first 100 rows
-
Sheet Selection: When working with multi-sheet workbooks, use
SheetIndexorSheetNameto avoid loading unnecessary sheetsSheetIndex = 0 // Faster than searching by name if you know the position
-
Adjust Batch Size: Tune the batch size based on available memory and row complexity
- Small rows (few columns): Use larger batch sizes (50,000 - 100,000)
- Wide rows (many columns): Use smaller batch sizes (10,000 - 25,000)
-
Remote File Caching: When loading from URLs repeatedly, download once and cache locally:
if (!File.Exists("cache/data.xlsx")) { var webData = await DataBlock.Connector.LoadExcelAsync("https://example.com/data.xlsx"); var bytes = await webData.ExcelSinkAsync(); await File.WriteAllBytesAsync("cache/data.xlsx", bytes); } var data = await DataBlock.Connector.LoadExcelAsync("cache/data.xlsx");
-
Skip Empty Rows: The connector handles empty rows gracefully. Use
SkipRowsif you know there are blank rows at the beginning -
Type Detection: Excel files maintain type information (unlike CSV), so data types are preserved automatically for better performance
-
Dispose DataBlocks: For large Excel processing pipelines, dispose intermediate DataBlocks to free memory:
using (var rawData = await DataBlock.Connector.LoadExcelAsync("large_file.xlsx")) { var processed = rawData.Where(...).Select(...); // rawData automatically disposed here }
-
Excel Export Optimization: When exporting large DataBlocks to Excel, be aware that Excel has a row limit (1,048,576 rows in XLSX). Consider splitting large datasets or using CSV for extremely large exports.
This library is licensed under the Datafication SDK License Agreement. See the LICENSE file for details.
Summary:
- Free Use: Organizations with fewer than 5 developers AND annual revenue under $500,000 USD may use the SDK without a commercial license
- Commercial License Required: Organizations with 5+ developers OR annual revenue exceeding $500,000 USD must obtain a commercial license
- Open Source Exemption: Open source projects meeting specific criteria may be exempt from developer count limits
For commercial licensing inquiries, contact support@datafication.co.
Third-Party Libraries:
- ExcelDataReader - MIT License
- ClosedXML - MIT License
Datafication.ExcelConnector - Seamlessly connect Excel workbooks to the Datafication ecosystem.
For more examples and documentation, visit our samples directory.