CommonCrawl.Net

🇬🇧 English

Overview

CommonCrawl.Net is a comprehensive .NET solution for interacting with the Common Crawl dataset. It provides tools to navigate the dataset index, download files with resume support, and parse WARC (Web ARChive) files efficiently.

This repository contains the following components:

CommonCrawl: The core library handling dataset metadata, download management, and WARC file parsing.
CommonCrawl.Parquet: A specialized library for reading Common Crawl's parquet index files.
CommonCraw.ConsoleApp: A console application demonstrating the usage of the libraries.

Features

Dataset Discovery: Easily fetch the latest available crawl versions.
Resilient Downloads: Built-in support for HTTP Range requests to resume interrupted downloads.
WARC Parsing: High-performance, streaming GZIP decompression and parsing of WARC records.
Parquet Support: Tools to read and process Common Crawl index files stored in Parquet format.
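To illustrate what resuming a download over HTTP Range requests involves, here is a minimal sketch using only the standard HttpClient. It is not the library's own implementation: the ResumableDownload class, its method names, and the file path handling are all hypothetical and chosen for illustration.

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

// Hypothetical helper, not part of CommonCrawl.Net.
public static class ResumableDownload
{
    // Builds a GET request that asks for the bytes from `offset` to the end.
    public static HttpRequestMessage BuildRequest(string url, long offset)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        if (offset > 0)
        {
            // Serialized as "Range: bytes=<offset>-".
            request.Headers.Range = new RangeHeaderValue(offset, null);
        }
        return request;
    }

    // Continues a partial download at `path`, or starts a new one.
    public static async Task DownloadAsync(HttpClient client, string url, string path)
    {
        long offset = File.Exists(path) ? new FileInfo(path).Length : 0;

        using var request = BuildRequest(url, offset);
        using var response = await client.SendAsync(
            request, HttpCompletionOption.ResponseHeadersRead);

        // 416 usually means the local file already covers the whole resource.
        if (response.StatusCode == HttpStatusCode.RequestedRangeNotSatisfiable)
        {
            return;
        }
        response.EnsureSuccessStatusCode();

        // 206: the server honored the range, so append.
        // 200: the server ignored it and sent everything, so start over.
        var mode = response.StatusCode == HttpStatusCode.PartialContent
            ? FileMode.Append
            : FileMode.Create;

        await using var file = new FileStream(path, mode, FileAccess.Write);
        await using var body = await response.Content.ReadAsStreamAsync();
        await body.CopyToAsync(file);
    }
}
```

Calling DownloadAsync again after an interruption picks up at the current file length. The sketch assumes the remote file has not changed between attempts; a more careful version might validate that with an ETag and an If-Range header.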

Getting Started

Prerequisites

.NET 10.0 SDK or later.

Installation

You can install the packages via NuGet:

dotnet add package CommonCrawl.Net
dotnet add package CommonCrawl.Parquet

Or you can build the project from source:

git clone https://github.com/m67186636/CommonCrawl.Net.git
cd CommonCrawl.Net
dotnet build

Usage Examples

1. Core Library (CommonCrawl)

Refer to the Core Library README for detailed documentation.

using System;
using CommonCrawl.Handlers;
using CommonCrawl.Readers;

// Fetch metadata for the latest available crawl
var latestInfo = await DataSetHandler.Instance.GetLatestAsync();

// Stream records from a gzipped WARC file at the given URL
await foreach (var record in GzWarcReader.Instance.ReadAsAsyncEnumerable("https://example.com/sample.warc.gz"))
{
    Console.WriteLine($"Record: {record.Type}");
}
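
For readers curious about what streaming WARC parsing involves, the sketch below reads records from an already decompressed stream: a version line, "Name: value" headers up to a blank line, then a body of Content-Length bytes. It is not how GzWarcReader is implemented; WarcSketch and WarcRecordSketch are hypothetical names, and the byte-at-a-time line reader favors clarity over speed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical types for illustration, not part of CommonCrawl.Net.
public sealed record WarcRecordSketch(
    string Version, Dictionary<string, string> Headers, byte[] Body);

public static class WarcSketch
{
    // Reads one line terminated by LF (an optional CR is trimmed).
    // Returns null once the stream is exhausted.
    private static string? ReadLine(Stream stream)
    {
        var bytes = new List<byte>();
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            if (b == '\n') return Decode(bytes);
            bytes.Add((byte)b);
        }
        return bytes.Count == 0 ? null : Decode(bytes);
    }

    private static string Decode(List<byte> bytes) =>
        Encoding.UTF8.GetString(bytes.ToArray()).TrimEnd('\r');

    // Lazily yields records, so only one record body is held in memory at a time.
    public static IEnumerable<WarcRecordSketch> Read(Stream decompressed)
    {
        string? line;
        while ((line = ReadLine(decompressed)) != null)
        {
            if (line.Length == 0) continue; // blank lines separate records

            var version = line; // for example "WARC/1.0"
            var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            while (true)
            {
                var headerLine = ReadLine(decompressed);
                if (string.IsNullOrEmpty(headerLine)) break; // end of header block
                int colon = headerLine.IndexOf(':');
                if (colon > 0)
                {
                    headers[headerLine[..colon].Trim()] = headerLine[(colon + 1)..].Trim();
                }
            }

            // Content-Length gives the body size in bytes.
            int length = headers.TryGetValue("Content-Length", out var value)
                ? int.Parse(value)
                : 0;
            var body = new byte[length];
            decompressed.ReadExactly(body); // available in .NET 7 and later
            yield return new WarcRecordSketch(version, headers, body);
        }
    }
}
```

To use it on a .warc.gz file, wrap the file stream in a System.IO.Compression.GZipStream opened with CompressionMode.Decompress and pass that to WarcSketch.Read. This assumes the runtime's GZipStream reads through the concatenated gzip members such files are typically made of, which is worth verifying on your target runtime.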

2. Parquet Reader (CommonCrawl.Parquet)

using System;
using CommonCrawl.Readers;

// Read records from a local Parquet file
await foreach (var record in ParquetReader.Instance.ReadAsAsyncEnumerable<IndexTableRecord>("cc-index.parquet"))
{
    Console.WriteLine($"URL: {record.Url}");
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
