Use file content heuristics to decide file reader. by Dimi1010 · Pull Request #1962 · seladb/PcapPlusPlus

Dimi1010 · 2025-09-12T09:40:58Z

This PR adds heuristics based on the file content, which are more robust than deciding based on the file extension.

The new decision model scans the start of the file for its magic number signature. It then compares it to the signatures of supported file types [1] and constructs a reader instance based on the result.
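A minimal sketch of this kind of signature check, not the PR's actual code: it reads the first bytes of a stream and compares them against the well-known on-disk signatures of pcap, pcapng, snoop and zstd. The names `SniffedFormat`, `sniffFormat` and `sniffFormatDemo` are hypothetical, and the real detector may recognise more variants (for example the "modified" pcap magic).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical enum for illustration; not the PR's actual type.
enum class SniffedFormat { Unknown, Pcap, PcapNG, Snoop, Zstd };

// Reads up to 8 bytes from the current stream position and compares them
// against well-known file signatures.
inline SniffedFormat sniffFormat(std::istream& in)
{
    std::array<unsigned char, 8> head{};
    in.read(reinterpret_cast<char*>(head.data()), static_cast<std::streamsize>(head.size()));
    const auto got = static_cast<std::size_t>(in.gcount());

    auto startsWith = [&](const unsigned char* sig, std::size_t len) {
        return got >= len && std::memcmp(head.data(), sig, len) == 0;
    };

    static const unsigned char snoop[8] = { 's', 'n', 'o', 'o', 'p', 0, 0, 0 };
    static const unsigned char pcapng[4] = { 0x0A, 0x0D, 0x0D, 0x0A };
    static const unsigned char zstd[4] = { 0x28, 0xB5, 0x2F, 0xFD };
    // pcap magic 0xA1B2C3D4 (microseconds) and 0xA1B23C4D (nanoseconds),
    // each as it appears on disk in either byte order.
    static const unsigned char pcap[4][4] = {
        { 0xA1, 0xB2, 0xC3, 0xD4 }, { 0xD4, 0xC3, 0xB2, 0xA1 },
        { 0xA1, 0xB2, 0x3C, 0x4D }, { 0x4D, 0x3C, 0xB2, 0xA1 },
    };

    if (startsWith(snoop, sizeof(snoop)))
        return SniffedFormat::Snoop;
    if (startsWith(pcapng, sizeof(pcapng)))
        return SniffedFormat::PcapNG;
    if (startsWith(zstd, sizeof(zstd)))
        return SniffedFormat::Zstd;
    for (const auto& sig : pcap)
    {
        if (startsWith(sig, 4))
            return SniffedFormat::Pcap;
    }
    return SniffedFormat::Unknown;
}

// Small self-check: a pcapng section header block starts with 0A 0D 0D 0A.
inline void sniffFormatDemo()
{
    std::istringstream ng(std::string("\x0A\x0D\x0D\x0A", 4));
    assert(sniffFormat(ng) == SniffedFormat::PcapNG);
}
```

Because only the leading bytes are inspected, a file named `capture.txt` that actually holds pcap data would still be classified as pcap, which is the robustness gain over extension matching.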

Two new functions, createReader and tryCreateReader, have been added due to changes in the public API of the factory.
They differ in their error handling: createReader throws, while tryCreateReader returns nullptr on error.

Method behaviour in error scenarios:

| Scenario | `getReader` | `createReader` | `tryCreateReader` |
| --- | --- | --- | --- |
| File not found | N/A | Throws exception | Returns `nullptr` |
| Unsupported format | Returns `PcapFileReaderDevice` | Throws exception | Returns `nullptr` |
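The two error-handling styles can be illustrated with a small stand-alone sketch. Everything in it is hypothetical (`DemoReader`, `tryCreateDemoReader`, `createDemoReader`, the `std::unique_ptr` return type and the exception type); it only shows the common pattern of building a throwing factory on top of a non-throwing one and does not claim to match the PR's signatures.

```cpp
#include <cassert>
#include <fstream>
#include <memory>
#include <stdexcept>
#include <string>

// Stand-in for a reader device; hypothetical.
struct DemoReader
{
    std::string path;
};

// Non-throwing variant: reports failure by returning nullptr.
inline std::unique_ptr<DemoReader> tryCreateDemoReader(const std::string& path)
{
    std::ifstream file(path, std::ios::binary);
    if (!file.is_open())
        return nullptr;  // file not found or not readable
    return std::unique_ptr<DemoReader>(new DemoReader{ path });
}

// Throwing variant: same logic, but failure becomes an exception.
inline std::unique_ptr<DemoReader> createDemoReader(const std::string& path)
{
    auto reader = tryCreateDemoReader(path);
    if (reader == nullptr)
        throw std::runtime_error("Could not create a reader for: " + path);
    return reader;
}

// Small self-check using a path that is assumed not to exist.
inline void demoMissingFile()
{
    const std::string missing = "/nonexistent-dir/no-such-capture.pcap";
    assert(tryCreateDemoReader(missing) == nullptr);

    bool threw = false;
    try
    {
        createDemoReader(missing);
    }
    catch (const std::runtime_error&)
    {
        threw = true;
    }
    assert(threw);
}
```

The non-throwing form suits callers that probe many files and treat failure as routine; the throwing form suits callers for which a missing or unsupported file is exceptional.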

codecov · 2025-09-12T09:59:59Z

Codecov Report

❌ Patch coverage is 88.53868% with 40 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.95%. Comparing base (0b83e76) to head (c7eff8a).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| Pcap++/src/PcapFileDevice.cpp | 86.00% | 24 Missing and 4 partials ⚠️ |
| Tests/Pcap++Test/Tests/FileTests.cpp | 91.42% | 5 Missing and 7 partials ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##              dev    #1962      +/-   ##
==========================================
+ Coverage   83.93%   83.95%   +0.01%     
==========================================
  Files         316      316              
  Lines       57024    57330     +306     
  Branches    11797    11872      +75     
==========================================
+ Hits        47863    48131     +268     
- Misses       7941     7984      +43     
+ Partials     1220     1215       -5

| Flag | Coverage Δ |
| --- | --- |
| alpine320 | `76.61% <77.00%> (-0.02%)` ⬇️ |
| fedora42 | `76.16% <78.28%> (-0.02%)` ⬇️ |
| macos-14 | `82.01% <80.48%> (-0.02%)` ⬇️ |
| macos-15 | `82.01% <81.40%> (-0.02%)` ⬇️ |
| mingw32 | `70.73% <75.42%> (+0.01%)` ⬆️ |
| mingw64 | `70.68% <75.42%> (+0.05%)` ⬆️ |
| npcap | `?` |
| rhel94 | `75.99% <76.76%> (-0.02%)` ⬇️ |
| ubuntu2004 | `59.73% <62.60%> (-0.02%)` ⬇️ |
| ubuntu2004-zstd | `59.83% <61.67%> (-0.04%)` ⬇️ |
| ubuntu2204 | `76.00% <76.76%> (-0.05%)` ⬇️ |
| ubuntu2204-icpx | `59.06% <58.62%> (-0.05%)` ⬇️ |
| ubuntu2404 | `76.30% <76.64%> (+<0.01%)` ⬆️ |
| ubuntu2404-arm64 | `76.29% <77.00%> (-0.02%)` ⬇️ |
| unittest | `83.95% <88.53%> (+0.01%)` ⬆️ |
| windows-2022 | `85.55% <85.44%> (+0.08%)` ⬆️ |
| windows-2025 | `85.60% <85.44%> (+0.10%)` ⬆️ |
| winpcap | `85.60% <85.44%> (-0.10%)` ⬇️ |
| xdp | `52.40% <3.91%> (-0.31%)` ⬇️ |

seladb · 2025-09-15T08:08:40Z

-	PTF_ASSERT_NOT_NULL(dynamic_cast<pcpp::PcapNgFileReaderDevice*>(genericReader));
-	PTF_ASSERT_TRUE(genericReader->open());
+	// ------- IFileReaderDevice::createReader() Factory
+	// TODO: Move to a separate unit test.


We should add the following to get more coverage:

- Open a snoop file
- Open a file that is not any of the options
- Open pcap files with different magic numbers
- Assuming we add a version check for snoop and pcap file: create temp files with bogus data that has the magic number but wrong versions

3d713ab adds the following tests:

- Pcap, PcapNG, Zst file with correct content + extension
- Pcap, PcapNG file with correct content + wrong extension
- Bogus content file with correct extension (pcap, pcapng, zst)
- Bogus content file with wrong extension (txt)

Haven't found a snoop file to add. Do we have any?

> Open pcap files with different magic numbers

Do you mean Pcap content that has just its magic number changed? Because IMO it is reasonable to consider that invalid format and fail as regular bogus data.

> Assuming we add a version check for snoop and pcap file: create temp files with bogus data that has the magic number but wrong versions

Pending on #1962 (comment).

Move it out if it needs to be reused somewhere.

Libpcap supports reading this format since 0.9.1. The heuristics detection will identify such magic number as pcap and leave final support decision to the pcap backend infrastructure.

seladb · 2025-09-21T08:10:16Z

@Dimi1010 some CI tests fail...

seladb · 2025-12-31T08:15:25Z

@Dimi1010 I'm working on parsing pcap files without libpcap: #2034
Maybe we can rework this PR after my PR is merged?

Dimi1010 · 2026-02-20T21:37:59Z

@seladb can we merge this? It has been sitting for a while.

seladb · 2026-02-22T08:09:20Z

+		};
+
+		/// @brief Heuristic file format detector that scans the magic number of the file format header.
+		class CaptureFileFormatDetector


Since we're not parsing all formats (maybe except Zstd) in PcapPlusPlus, we can reuse the logic we already have. Maybe it can run the open() method (or extract a portion of it) for each reader type until it finds the right type?

WDYM, we are not parsing all formats? Did you mean "now"?

Also, the necessary logic to detect the file format is already extracted in this class. Tbh, the open() call should probably delegate the format detection to this class if more comprehensive magic number format validation is needed.

IMO, how the file is processed after format detection is a separate concern. Device selection is handled in the createReader device factory, which allows looser coupling between the actual device classes and format detection (e.g. switching whether PcapNG creates a PcapDevice or a PcapNGDevice is as simple as swapping a case statement).

I think integrating the functionality into open() would be suboptimal for the following reasons:

- It potentially adds more responsibilities to a function that should just "open the device".
- Looping through all the devices would mean iterating over more complicated operations: constructing each device, and possibly repeated file open / close for each open() call, since each call is designed to function independently.
- An open() call can fail for multiple other reasons that are not related to the file format specifically.
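The "swapping a case statement" point can be sketched as a factory that maps an already-detected format to a device type. All names here (`DetectedFormat`, `DemoDevice` and its subclasses, `makeDevice`) are hypothetical stand-ins, and routing Zstd through the PcapNG device is an assumption based on the discussion in this thread, not a statement about the PR's code.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical detection result.
enum class DetectedFormat { Unknown, Pcap, PcapNG, Snoop, Zstd };

// Hypothetical device hierarchy standing in for the real reader devices.
struct DemoDevice
{
    virtual ~DemoDevice() = default;
    virtual std::string name() const = 0;
};
struct DemoPcapDevice : DemoDevice
{
    std::string name() const override { return "pcap"; }
};
struct DemoPcapNgDevice : DemoDevice
{
    std::string name() const override { return "pcapng"; }
};
struct DemoSnoopDevice : DemoDevice
{
    std::string name() const override { return "snoop"; }
};

// The mapping from format to device lives in one switch, so changing which
// device handles a format is a one-line edit that touches no device class.
inline std::unique_ptr<DemoDevice> makeDevice(DetectedFormat format)
{
    switch (format)
    {
    case DetectedFormat::Pcap:
        return std::unique_ptr<DemoDevice>(new DemoPcapDevice());
    case DetectedFormat::PcapNG:
    case DetectedFormat::Zstd:  // assumption: compressed pcapng uses the pcapng device
        return std::unique_ptr<DemoDevice>(new DemoPcapNgDevice());
    case DetectedFormat::Snoop:
        return std::unique_ptr<DemoDevice>(new DemoSnoopDevice());
    case DetectedFormat::Unknown:
        break;
    }
    return nullptr;  // unsupported content
}

// Small self-check of the dispatch table.
inline void makeDeviceDemo()
{
    assert(makeDevice(DetectedFormat::Snoop)->name() == "snoop");
    assert(makeDevice(DetectedFormat::Unknown) == nullptr);
}
```

Detection runs once and no device has to be constructed or opened speculatively, which is the contrast with looping over open() calls.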

> WDYM, we are not parsing all formats? Did you mean "now"?

Yes, I meant "now", sorry for the typo 🤦

> IMO, how the file is processed after format detection is a separate concern. Device selection is handled in the createReader device factory, which allows looser coupling between the actual device classes and format detection (e.g. switching whether PcapNG creates a PcapDevice or a PcapNGDevice is as simple as swapping a case statement).
>
> I think integrating the functionality into open() would be suboptimal for the following reasons:

Having duplicate logic to determine if the file is of a certain format in both the device and CaptureFileFormatDetector is not great because if we fix a bug in one of them, we might miss the other. I think this logic should be in one place: either CaptureFileFormatDetector calls open() (might be the easiest option), or we can extract the detection logic and use it in both places

Hmm, it should be possible. It will require expanding the CaptureFileFormatDetector a bit. Currently it only returns the format, but pcap for instance uses the magic number to also detect native or swapped byte order.

Depending on how specific we want to get it might involve a double read of the magic number, once by the format detector and once during the actual file header structure read. Impact should be minimal tho, as fstream is buffered by default.
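The double read mentioned here relies on rewinding the stream after peeking at the magic number. A minimal sketch of that idea, using a hypothetical RAII guard (`StreamRewindGuard`) and helper (`peekMagic`), neither of which is taken from the PR; it assumes a seekable stream that is in a good state when the guard is created and that has no stream exceptions enabled.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <istream>
#include <sstream>

// Hypothetical guard: remembers the stream position and state flags on
// construction and restores both on destruction.
class StreamRewindGuard
{
public:
    explicit StreamRewindGuard(std::istream& stream)
        : m_Stream(stream), m_State(stream.rdstate()), m_Pos(stream.tellg())
    {}

    ~StreamRewindGuard()
    {
        m_Stream.clear();         // drop eof/fail bits so that seekg can work
        m_Stream.seekg(m_Pos);    // go back to where we started
        m_Stream.clear(m_State);  // restore the original state flags
    }

    StreamRewindGuard(const StreamRewindGuard&) = delete;
    StreamRewindGuard& operator=(const StreamRewindGuard&) = delete;

private:
    std::istream& m_Stream;
    std::ios::iostate m_State;
    std::istream::pos_type m_Pos;
};

// Reads 4 bytes as a host-order value without consuming them.
// Returns false if fewer than 4 bytes are available.
inline bool peekMagic(std::istream& stream, std::uint32_t& magic)
{
    StreamRewindGuard guard(stream);
    char raw[sizeof(magic)];
    if (!stream.read(raw, sizeof(raw)))
        return false;
    std::memcpy(&magic, raw, sizeof(magic));
    return true;
}

// Small self-check: peeking twice yields the same value.
inline void peekMagicDemo()
{
    std::istringstream data("ABCDEFGH");
    std::uint32_t first = 0;
    std::uint32_t second = 0;
    assert(peekMagic(data, first) && peekMagic(data, second));
    assert(first == second);
}
```

After `peekMagic` returns, a later header read starts from the same offset, so the format detector and the header parser can each read the magic number independently.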

@seladb Tried a WIP implementation. It is possible to have open() call the format detector, tho I am not perfectly happy with the current iteration I have.

Can we do that merge of functionality in another PR, since those changes would also modify the PcapReader/Writer and SnoopReader and it goes out of scope of this PR?

PS: The WIP API would be something like this:

```cpp
#include <istream>

/// @brief An enumeration representing different capture file formats.
enum class CaptureFileFormat
{
	Unknown,
	Pcap,        // regular pcap with microsecond precision
	PcapMod,     // Alexey Kuznetzov's "modified" pcap format
	PcapNano,    // regular pcap with nanosecond precision
	PcapNG,      // uncompressed pcapng
	Snoop,       // solaris snoop
	ZstArchive,  // zstd compressed archive
};

/// @brief Specifies the byte order (endianness) of a capture file relative to the host system.
enum class CaptureFileByteOrder
{
	Unknown,  // Unknown format. Magic number is palindrome.
	Native,   // Byte order is native to the host system.
	Swapped   // Byte order is swapped relative to the host system.
};

/// @brief Heuristic file format detector that scans the magic number of the file format header.
class CaptureFileFormatDetector
{
public:
	/// @brief Checks a content stream for the magic number and determines the type.
	///
	/// The function optionally detects the byte order of the file if it can be determined by the magic number.
	/// The byte order is not updated if no supported format is detected.
	///
	/// @param[in] content A stream that contains the file content.
	/// @param[out] byteOrder Optional location to store the detected byte order.
	/// @return A CaptureFileFormat value with the detected content type.
	CaptureFileFormat detectFormat(std::istream& content, CaptureFileByteOrder* byteOrder = nullptr) const;

	/// @brief Checks a content stream for the magic number and determines if it is a Pcap file.
	///
	/// The function optionally detects the byte order of the file if it can be determined by the magic number.
	/// The byte order is not updated if no supported format is detected.
	///
	/// @param[in] content A stream that contains the file content.
	/// @param[out] byteOrder Optional location to store the detected byte order.
	/// @return A CaptureFileFormat value with the detected Pcap format or Unknown if the file is not pcap.
	CaptureFileFormat detectPcapFile(std::istream& content, CaptureFileByteOrder* byteOrder = nullptr) const;

	/// @brief Checks a content stream for the magic number and determines if it is a PcapNG file.
	/// @param[in] content A stream that contains the file content.
	/// @return True if the content stream is PcapNG file, false otherwise.
	bool isPcapNgFile(std::istream& content) const;

	/// @brief Checks a content stream for the magic number and determines if it is a Snoop file.
	/// @param[in] content A stream that contains the file content.
	/// @param[out] byteOrder Optional location to store the detected byte order.
	/// @return True if the content stream is Snoop file, false otherwise.
	bool isSnoopFile(std::istream& content, CaptureFileByteOrder* byteOrder = nullptr) const;

	/// @brief Checks a content stream for the magic number and determines if it is a Zstd archive.
	///
	/// The function optionally detects the byte order of the file if it can be determined by the magic number.
	/// The byte order is not updated if no supported format is detected.
	///
	/// @param[in] content A stream that contains the file content.
	/// @param[out] byteOrder Optional location to store the detected byte order.
	/// @return True if the content stream is a Zstd archive, false otherwise.
	bool isZstdArchive(std::istream& content, CaptureFileByteOrder* byteOrder = nullptr) const;
};
```

I think that each file device should own its own format check, especially since we no longer use libpcap. We do it anyway in the different file devices. PcapNG with/without Zstd is a bit of an exception because we use LightPcapNg to parse these files - but I still think it'd be better if PcapNgFileReaderDevice owns the format detection of PcapNG and Zstd.

> I think that each file device should own its own format check, especially since we no longer use libpcap. We do it anyway in the different file devices. PcapNG with/without Zstd is a bit of an exception because we use LightPcapNg to parse these files - but I still think it'd be better if PcapNgFileReaderDevice owns the format detection of PcapNG and Zstd.

@seladb minor note to that. I did look over how it might happen. It would lead to some code duplication in PcapReader and PcapWriter (append mode), since they both need to validate a pcap file magic signature + byte order.

Also do we really need to tie them to the public API? The isPcapFile in the example is part of the public API while the current implementation is entirely private. The main API entry point is createReader anyway, so the user shouldn't really need all the isXXXFile, necessarily?

> @seladb minor note to that. I did look over how it might happen. It would lead to some code duplication in PcapReader and PcapWriter (append mode), since they both need to validate a pcap file magic signature + byte order.

Where would be the code duplication in this case?

> Also do we really need to tie them to the public API? The isPcapFile in the example is part of the public API while the current implementation is entirely private. The main API entry point is createReader anyway, so the user shouldn't really need all the isXXXFile, necessarily?

I agree it's not great, but is there a workaround which will keep the parsing logic in each class and make it available for the reader creator? 🤔

> > @seladb minor note to that. I did look over how it might happen. It would lead to some code duplication in PcapReader and PcapWriter (append mode), since they both need to validate a pcap file magic signature + byte order.
>
> Where would be the code duplication in this case?

See #2082. That PR solves the duplication.

> > Also do we really need to tie them to the public API? The isPcapFile in the example is part of the public API while the current implementation is entirely private. The main API entry point is createReader anyway, so the user shouldn't really need all the isXXXFile, necessarily?
>
> I agree it's not great, but is there a workaround which will keep the parsing logic in each class and make it available for the reader creator? 🤔

Eh. Not really. :/ Tho, if it isn't going to be public API, it doesn't necessarily need to be a member function exposed in the headers, no? It can alternatively be a quasi-adjacent free function helper in the cpp.

> See #2082. That PR solves the duplication.

I saw this PR, not sure we really need to solve it, please see my comment there

> Eh. Not really. :/ Tho, if it isn't going to be public API, it doesn't necessarily need to be a member function exposed in the headers, no? It can alternatively be a quasi-adjacent free function helper in the cpp.

I don't think it's so bad if it's part of the public API, even though ideally it shouldn't be. We have such cases in other places as well

Dimi1010 · 2026-04-22T07:21:55Z

Ok, somewhat reworked the internal details on this.

CaptureFileFormatDetector was dropped along with all the is[Pcap|PcapNG|Snoop|Zst]File methods.

They were replaced with more basic internal helpers is***Magic which take uint32_t (uint64_t for snoop) for the suspected magic value and return format and byte order (if it can be determined).

That allows them to be reused in both ***Device::open checks and detectFileFormat without overlap.
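A rough sketch of what such a magic-value helper could look like for pcap, assuming the 32-bit value was read from the file into a host-order integer. The names (`PcapKind`, `MagicByteOrder`, `PcapMagicInfo`, `byteSwap32`, `classifyPcapMagic`) are hypothetical and the PR's actual is***Magic helpers may differ in shape and in the set of magic values they accept.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result types for illustration.
enum class PcapKind { NotPcap, Micro, Nano };
enum class MagicByteOrder { Unknown, Native, Swapped };

struct PcapMagicInfo
{
    PcapKind kind;
    MagicByteOrder order;
};

// Reverses the byte order of a 32-bit value.
constexpr std::uint32_t byteSwap32(std::uint32_t v)
{
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8) | ((v & 0xFF000000u) >> 24);
}

// Classifies a suspected pcap magic value. A match against the byte-swapped
// constant means the file was written on a host with the opposite endianness.
inline PcapMagicInfo classifyPcapMagic(std::uint32_t magic)
{
    constexpr std::uint32_t micro = 0xA1B2C3D4u;  // microsecond timestamps
    constexpr std::uint32_t nano = 0xA1B23C4Du;   // nanosecond timestamps

    if (magic == micro)
        return { PcapKind::Micro, MagicByteOrder::Native };
    if (magic == byteSwap32(micro))
        return { PcapKind::Micro, MagicByteOrder::Swapped };
    if (magic == nano)
        return { PcapKind::Nano, MagicByteOrder::Native };
    if (magic == byteSwap32(nano))
        return { PcapKind::Nano, MagicByteOrder::Swapped };
    return { PcapKind::NotPcap, MagicByteOrder::Unknown };
}

// Small self-check of the swapped microsecond magic.
inline void classifyPcapMagicDemo()
{
    const PcapMagicInfo info = classifyPcapMagic(0xD4C3B2A1u);
    assert(info.kind == PcapKind::Micro);
    assert(info.order == MagicByteOrder::Swapped);
}
```

Because the helper works on a plain integer rather than a stream, both a format detector and a device's open() path could call it after reading the header, which avoids keeping two copies of the comparison logic.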

Dimi1010 added 4 commits September 12, 2025 12:03

Added heuristics file content detector that determines the content ba…

02de760

…sed on the magic number.

Merge remote-tracking branch 'upstream/dev' into feature/heuristic-fi…

d2b6339

…le-selection

Moved stream checkpoint outside format detector as it is not directly…

685dd9f

… tied to it.

Added a new factory function createReader that uses the new heurist…

40dee69

…ics detection method.

Dimi1010 added the enhancement label Sep 12, 2025

Add <algorithm> include.

f1e3e18

Dimi1010 added 2 commits September 12, 2025 13:17

Added unit tests.

8da1790

Deprecated old factory function.

3ad51e2

Dimi1010 added the API deprecation label (Pull requests that deprecate parts of the public interface.) Sep 12, 2025

Dimi1010 added 3 commits September 12, 2025 14:08

Add byte-swapped zstd magic number.

15c2000

Lint

17af8d4

Move enum closer to first usage.

46418ec

Dimi1010 marked this pull request as ready for review September 12, 2025 11:36

Dimi1010 requested a review from seladb as a code owner September 12, 2025 11:36

Dimi1010 requested review from clementperon, egecetin and tigercosmos September 12, 2025 11:36

tigercosmos approved these changes Sep 12, 2025

seladb reviewed Sep 15, 2025

Dimi1010 added 4 commits September 15, 2025 15:45

Added unit tests for file reader device factory.

3d713ab

Revert indentation.

a2391ec

Fixed StreamCheckpoint to restore original stream state.

ea328d7

Merge branch 'dev' into feature/heuristic-file-selection

db86c3e

Dimi1010 commented Sep 19, 2025

Comment thread Pcap++/src/PcapFileDevice.cpp Outdated

Dimi1010 added 3 commits September 20, 2025 12:59

Merge branch 'dev' into feature/heuristic-file-selection

4aed9bd

Moved isStreamSeekable helper to inside CaptureFileFormatDetector.

a83ae2b

Move it out if it needs to be reused somewhere.

Added pcap magic number for Alexey Kuznetzov's modified pcap format.

916e872

Libpcap supports reading this format since 0.9.1. The heuristics detection will identify such magic number as pcap and leave final support decision to the pcap backend infrastructure.

Merge remote-tracking branch 'upstream/dev' into feature/heuristic-fi…

022529f

…le-selection

Dimi1010 added 9 commits November 23, 2025 11:42

Fix exception message assert.

a15f529

Merge remote-tracking branch 'upstream/dev' into feature/heuristic-fi…

385528c

…le-selection

Merge remote-tracking branch 'upstream/dev' into feature/heuristic-fi…

9f5b5f1

…le-selection

Refactored format tests to utilize the createReader factory.

b0674bd

Fix nanoprecision test issues.

e614176

Remove openDevice flag. Update create procedure to avoid exceptions o…

bb9917b

…n tryCreateDevice.

Docs update + Lint

9a2a390

Docs fix.

df0a5a8

Lint.

92c80d9

Dimi1010 requested a review from seladb December 30, 2025 10:38

Dimi1010 added 3 commits January 17, 2026 17:09

Merge remote-tracking branch 'upstream/dev' into feature/heuristic-fi…

692df58

…le-selection # Conflicts: # Pcap++/src/PcapFileDevice.cpp # Tests/Pcap++Test/Tests/FileTests.cpp

Remove nano support checks as it should always be supported by the in…

fdf8c89

…ternal parser.

Merge remote-tracking branch 'upstream/dev' into feature/heuristic-fi…

5a62e9b

…le-selection

seladb reviewed Feb 22, 2026

Dimi1010 added 13 commits February 24, 2026 12:54

Merge remote-tracking branch 'upstream/dev' into feature/heuristic-fi…

1f1fb30

…le-selection

Extend format detector to capture byte order from magic number.

2b24c24

Add initial merge logic.

5024f9e

Add methods to check magic numbers directly.

d1640cb

Merge remote-tracking branch 'upstream/dev' into feature/heuristic-fi…

a18e3dd

…le-selection

Comment out unused variables.

68fb476

Merge branch 'dev' into feature/heuristic-file-selection

d9f6f0f

# Conflicts: # Pcap++/src/PcapFileDevice.cpp

Add helper to convert format to pcap precision.

a535316

Merge remote-tracking branch 'upstream/dev' into feature/heuristic-fi…

4d70c43

…le-selection # Conflicts: # Pcap++/src/PcapFileDevice.cpp

Simplify code.

ec8fa93

Fixed errors.

0600890

Disable pcap "modified" file format detection tests.

b16b932

Lint

c7eff8a
