Use file content heuristics to decide file reader. #1962
Dimi1010 wants to merge 99 commits into seladb:dev
…sed on the magic number.
…ics detection method.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## dev #1962 +/- ##
==========================================
+ Coverage 83.93% 83.95% +0.01%
==========================================
Files 316 316
Lines 57024 57330 +306
Branches 11797 11872 +75
==========================================
+ Hits 47863 48131 +268
- Misses 7941 7984 +43
+ Partials 1220 1215 -5
    PTF_ASSERT_NOT_NULL(dynamic_cast<pcpp::PcapNgFileReaderDevice*>(genericReader));
    PTF_ASSERT_TRUE(genericReader->open());
    // ------- IFileReaderDevice::createReader() Factory
    // TODO: Move to a separate unit test.
We should add the following to get more coverage:
- Open a snoop file
- Open a file that is not any of the options
- Open pcap files with different magic numbers
- Assuming we add a version check for snoop and pcap files: create temp files with bogus data that has the magic number but wrong versions
3d713ab adds the following tests:
- Pcap, PcapNG, Zst file with correct content + extension
- Pcap, PcapNG file with correct content + wrong extension
- Bogus content file with correct extension (pcap, pcapng, zst)
- Bogus content file with wrong extension (txt)
Haven't found a snoop file to add. Do we have any?
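The content/extension combinations listed above can be illustrated without touching the file system. The following is only a sketch: `hasPcapExtension`, `hasPcapContent`, `Sample`, and `makeSamples` are hypothetical helpers invented for this example, not code from the PR, and only the classic little-endian pcap magic is checked.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: extension-based guess, the approach the PR moves away from.
bool hasPcapExtension(const std::string& fileName)
{
    const std::string ext = ".pcap";
    return fileName.size() >= ext.size() &&
           fileName.compare(fileName.size() - ext.size(), ext.size(), ext) == 0;
}

// Hypothetical helper: content-based guess. Returns true if the stream starts
// with the classic pcap magic as stored by a little-endian writer (D4 C3 B2 A1).
bool hasPcapContent(std::istream& content)
{
    char head[4] = {};
    content.read(head, sizeof(head));
    return content.gcount() == 4 &&
           std::string(head, 4) == std::string("\xD4\xC3\xB2\xA1", 4);
}

// One in-memory stand-in for a test file: a name plus its leading bytes.
struct Sample
{
    std::string fileName;
    std::string bytes;
};

// The four combinations: correct/bogus content crossed with correct/wrong extension.
std::vector<Sample> makeSamples()
{
    const std::string pcapHead("\xD4\xC3\xB2\xA1\x02\x00\x04\x00", 8);
    return {
        { "good.pcap",   pcapHead        },
        { "renamed.txt", pcapHead        },
        { "bogus.pcap",  "not a capture" },
        { "bogus.txt",   "not a capture" },
    };
}
```

For `renamed.txt` and `bogus.pcap` the two guesses disagree, which is exactly the situation the content-based tests are meant to cover.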
> Open pcap files with different magic numbers

Do you mean Pcap content that has just its magic number changed? Because IMO it is reasonable to consider that an invalid format and fail as with regular bogus data.

> Assuming we add a version check for snoop and pcap files: create temp files with bogus data that has the magic number but wrong versions

Pending on #1962 (comment).
Move it out if it needs to be reused somewhere.
Libpcap supports reading this format since 0.9.1. The heuristics detection will identify such a magic number as pcap and leave the final support decision to the pcap backend infrastructure.
@Dimi1010 some CI tests fail...
…n tryCreateDevice.
…le-selection # Conflicts: # Pcap++/src/PcapFileDevice.cpp # Tests/Pcap++Test/Tests/FileTests.cpp
@seladb can we merge this? It has been sitting for a while.
    };

    /// @brief Heuristic file format detector that scans the magic number of the file format header.
    class CaptureFileFormatDetector
Since we're not parsing all formats (maybe except Zstd) in PcapPlusPlus, we can reuse the logic we already have. Maybe it can run the `open()` method (or an extracted portion of it) for each reader type until it finds the right type?
WDYM, we are not parsing all formats? Did you mean "now"?
Also, the necessary logic to detect the file format is already extracted in this class. Tbh, the `open()` call should probably delegate the format detection to this class if more comprehensive magic number format validation is needed.
IMO, how the file is processed after format detection is a separate concern. Device selection is handled in the `createReader` device factory, which allows looser coupling between the actual device classes and format detection (e.g. switching whether PcapNG creates a PcapDevice or a PcapNGDevice is as simple as swapping a `case` statement).
I think integrating the functionality into `open()` would be suboptimal for the following reasons:
- It potentially adds more responsibilities to the function than just "open the device".
- Looping through all the devices would involve iterating over more complicated operations: constructing the device and possibly repeated file open / close for each `open()` call, as it is designed to function independently.
- An `open()` call can fail for multiple other reasons that are not related to the file format specifically.
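The coupling argument above can be sketched as a factory that only sees a detected format value and maps it to a reader in a `switch`. Every name here (`Format`, `Reader`, `PcapReader`, `PcapNgReader`, `SnoopReader`, `makeReader`) is a hypothetical stand-in for illustration and does not mirror the actual PcapPlusPlus classes.

```cpp
#include <memory>
#include <string>

// Hypothetical result of a separate detection step.
enum class Format { Unknown, Pcap, PcapNG, Snoop };

// Hypothetical reader hierarchy; the readers know nothing about detection.
struct Reader
{
    virtual ~Reader() = default;
    virtual std::string name() const = 0;
};

struct PcapReader : Reader
{
    std::string name() const override { return "PcapReader"; }
};

struct PcapNgReader : Reader
{
    std::string name() const override { return "PcapNgReader"; }
};

struct SnoopReader : Reader
{
    std::string name() const override { return "SnoopReader"; }
};

// The only place that ties a format to a device class. Changing which reader
// handles PcapNG means editing a single case label's return statement.
std::unique_ptr<Reader> makeReader(Format format)
{
    switch (format)
    {
    case Format::Pcap:
        return std::make_unique<PcapReader>();
    case Format::PcapNG:
        return std::make_unique<PcapNgReader>();
    case Format::Snoop:
        return std::make_unique<SnoopReader>();
    case Format::Unknown:
        break;
    }
    return nullptr;  // no reader for unrecognised content
}
```

With this shape no device has to be constructed or opened just to find out that it is the wrong one.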
> WDYM, we are not parsing all formats? Did you mean "now"?

Yes, I meant "now", sorry for the typo 🤦

> IMO, how the file is processed after format detection is a separate concern. Device selection is handled in the `createReader` device factory, which allows looser coupling between the actual device classes and format detection (e.g. switching whether PcapNG creates a PcapDevice or a PcapNGDevice is as simple as swapping a `case` statement).
>
> I think integrating the functionality into `open()` would be suboptimal for the following reasons:

Having duplicate logic to determine if the file is of a certain format in both the device and `CaptureFileFormatDetector` is not great, because if we fix a bug in one of them, we might miss the other. I think this logic should be in one place: either `CaptureFileFormatDetector` calls `open()` (might be the easiest option), or we can extract the detection logic and use it in both places.
Hmm, it should be possible. It will require expanding the `CaptureFileFormatDetector` a bit. Currently it only returns the format, but pcap, for instance, also uses the magic number to detect native or swapped byte order.
Depending on how specific we want to get, it might involve a double read of the magic number: once by the format detector and once during the actual file header structure read. Impact should be minimal tho, as fstream is buffered by default.
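A rough sketch of the double read described above: the detector peeks at the magic number, derives the byte order, and rewinds so the reader can parse the full header from the start. `ByteOrder`, `byteSwap32`, and `detectPcapByteOrder` are hypothetical names, and only the microsecond pcap magic `0xA1B2C3D4` is handled here to keep the example short.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical byte order result, relative to the host.
enum class ByteOrder { Unknown, Native, Swapped };

constexpr std::uint32_t kPcapMagicMicro = 0xA1B2C3D4u;

constexpr std::uint32_t byteSwap32(std::uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) | ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Reads the first four bytes as a host-order integer, then restores the
// stream position so the header can be read again by the actual reader.
ByteOrder detectPcapByteOrder(std::istream& content)
{
    std::uint32_t magic = 0;
    const std::istream::pos_type start = content.tellg();
    content.read(reinterpret_cast<char*>(&magic), sizeof(magic));
    const bool complete = static_cast<std::size_t>(content.gcount()) == sizeof(magic);
    content.clear();       // a short read leaves eofbit/failbit set
    content.seekg(start);  // second read of the magic happens from here

    if (!complete)
        return ByteOrder::Unknown;
    if (magic == kPcapMagicMicro)
        return ByteOrder::Native;
    if (magic == byteSwap32(kPcapMagicMicro))
        return ByteOrder::Swapped;
    return ByteOrder::Unknown;
}
```

The second read hits the same four bytes that were just fetched, so with a buffered stream it should normally be served from memory.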
@seladb Tried a WIP implementation. It is possible to have open() call the format detector, tho I am not perfectly happy with the current iteration I have.
Can we do that merge of functionality in another PR, since those changes would also modify the PcapReader/Writer and SnoopReader and it goes out of scope of this PR?
PS: The WIP API would be something like this:

    #include <istream>

    /// @brief An enumeration representing different capture file formats.
    enum class CaptureFileFormat
    {
        Unknown,
        Pcap,        // regular pcap with microsecond precision
        PcapMod,     // Alexey Kuznetzov's "modified" pcap format
        PcapNano,    // regular pcap with nanosecond precision
        PcapNG,      // uncompressed pcapng
        Snoop,       // solaris snoop
        ZstArchive,  // zstd compressed archive
    };

    /// @brief Specifies the byte order (endianness) of a capture file relative to the host system.
    enum class CaptureFileByteOrder
    {
        Unknown,  // Unknown format. Magic number is a palindrome.
        Native,   // Byte order is native to the host system.
        Swapped   // Byte order is swapped relative to the host system.
    };

    /// @brief Heuristic file format detector that scans the magic number of the file format header.
    class CaptureFileFormatDetector
    {
    public:
        /// @brief Checks a content stream for the magic number and determines the type.
        ///
        /// The function optionally detects the byte order of the file if it can be determined by the magic number.
        /// The byte order is not updated if no supported format is detected.
        ///
        /// @param[in] content A stream that contains the file content.
        /// @param[out] byteOrder Optional location to store the detected byte order.
        /// @return A CaptureFileFormat value with the detected content type.
        CaptureFileFormat detectFormat(std::istream& content, CaptureFileByteOrder* byteOrder = nullptr) const;

        /// @brief Checks a content stream for the magic number and determines if it is a Pcap file.
        ///
        /// The function optionally detects the byte order of the file if it can be determined by the magic number.
        /// The byte order is not updated if no supported format is detected.
        ///
        /// @param[in] content A stream that contains the file content.
        /// @param[out] byteOrder Optional location to store the detected byte order.
        /// @return A CaptureFileFormat value with the detected Pcap format or Unknown if the file is not pcap.
        CaptureFileFormat detectPcapFile(std::istream& content, CaptureFileByteOrder* byteOrder = nullptr) const;

        /// @brief Checks a content stream for the magic number and determines if it is a PcapNG file.
        /// @param[in] content A stream that contains the file content.
        /// @return True if the content stream is a PcapNG file, false otherwise.
        bool isPcapNgFile(std::istream& content) const;

        /// @brief Checks a content stream for the magic number and determines if it is a Snoop file.
        /// @param[in] content A stream that contains the file content.
        /// @param[out] byteOrder Optional location to store the detected byte order.
        /// @return True if the content stream is a Snoop file, false otherwise.
        bool isSnoopFile(std::istream& content, CaptureFileByteOrder* byteOrder = nullptr) const;

        /// @brief Checks a content stream for the magic number and determines if it is a Zstd archive.
        ///
        /// The function optionally detects the byte order of the file if it can be determined by the magic number.
        /// The byte order is not updated if no supported format is detected.
        ///
        /// @param[in] content A stream that contains the file content.
        /// @param[out] byteOrder Optional location to store the detected byte order.
        /// @return True if the content stream is a Zstd archive, false otherwise.
        bool isZstdArchive(std::istream& content, CaptureFileByteOrder* byteOrder = nullptr) const;
    };
I think that each file device should own its own format check, especially since we no longer use libpcap. We do it anyway in the different file devices. PcapNG with/without Zstd is a bit of an exception because we use `LightPcapNg` to parse these files, but I still think it'd be better if `PcapNgFileReaderDevice` owns the format detection of PcapNG and Zstd.
> I think that each file device should own its own format check, especially since we no longer use libpcap. We do it anyway in the different file devices. PcapNG with/without Zstd is a bit of an exception because we use `LightPcapNg` to parse these files, but I still think it'd be better if `PcapNgFileReaderDevice` owns the format detection of PcapNG and Zstd.

@seladb minor note to that. I did look over how it might happen. It would lead to some code duplication in `PcapReader` and `PcapWriter` (append mode), since they both need to validate a pcap file magic signature + byte order.
Also do we really need to tie them to the public API? The `isPcapFile` in the example is part of the public API while the current implementation is entirely private. The main API entry point is `createReader` anyway, so the user shouldn't really need all the `isXXXFile`, necessarily?
> @seladb minor note to that. I did look over how it might happen. It would lead to some code duplication in `PcapReader` and `PcapWriter` (append mode), since they both need to validate a pcap file magic signature + byte order.

Where would be the code duplication in this case?

> Also do we really need to tie them to the public API? The `isPcapFile` in the example is part of the public API while the current implementation is entirely private. The main API entry point is `createReader` anyway, so the user shouldn't really need all the `isXXXFile`, necessarily?

I agree it's not great, but is there a workaround which will keep the parsing logic in each class and make it available for the reader creator? 🤔
> > @seladb minor note to that. I did look over how it might happen. It would lead to some code duplication in `PcapReader` and `PcapWriter` (append mode), since they both need to validate a pcap file magic signature + byte order.
>
> Where would be the code duplication in this case?

See #2082. That PR solves the duplication.

> > Also do we really need to tie them to the public API? The `isPcapFile` in the example is part of the public API while the current implementation is entirely private. The main API entry point is `createReader` anyway, so the user shouldn't really need all the `isXXXFile`, necessarily?
>
> I agree it's not great, but is there a workaround which will keep the parsing logic in each class and make it available for the reader creator? 🤔

Eh. Not really. :/ Tho, if it isn't going to be public API, it doesn't necessarily need to be a member function exposed in the headers, no? It can alternatively be a quasi-adjacent free function helper in the cpp.
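For illustration, a free helper with internal linkage could look roughly like the sketch below. Header and source are shown in one block for brevity; `looksLikeSnoop` and `startsWith` are hypothetical names and not functions from the PR.

```cpp
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>

// --- What would live in the header: only the entry point callers need. ---
bool looksLikeSnoop(std::istream& content);

// --- What would live in the .cpp file. ---
namespace
{
    // Internal linkage: not visible to other translation units, so it is not
    // part of the public API, yet every function in this .cpp can reuse it.
    bool startsWith(std::istream& content, const char* signature, std::size_t length)
    {
        std::string head(length, '\0');
        content.read(&head[0], static_cast<std::streamsize>(length));
        const bool matched = static_cast<std::size_t>(content.gcount()) == length &&
                             std::memcmp(head.data(), signature, length) == 0;
        content.clear();   // reset eofbit/failbit after a short read
        content.seekg(0);  // leave the stream at the beginning for the reader
        return matched;
    }
}  // namespace

bool looksLikeSnoop(std::istream& content)
{
    // Snoop files begin with the 8-byte identification pattern "snoop\0\0\0".
    return startsWith(content, "snoop\0\0\0", 8);
}
```

The trade-off is that a helper hidden this way cannot be shared across several .cpp files without moving it to an internal header.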
> See #2082. That PR solves the duplication.

I saw this PR, not sure we really need to solve it; please see my comment there.

> Eh. Not really. :/ Tho, if it isn't going to be public API, it doesn't necessarily need to be a member function exposed in the headers, no? It can alternatively be a quasi-adjacent free function helper in the cpp.

I don't think it's so bad if it's part of the public API, even though ideally it shouldn't be. We have such cases in other places as well.
# Conflicts: # Pcap++/src/PcapFileDevice.cpp
…le-selection # Conflicts: # Pcap++/src/PcapFileDevice.cpp
Ok, somewhat reworked the internal details on this.
They were replaced with more basic internal helpers. That allows them to be reused in both
The PR adds heuristics based on the file content, which are more robust than deciding based on the file extension.
The new decision model scans the start of the file for its magic number signature. It then compares it to the signatures of supported file types [1] and constructs a reader instance based on the result.
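As a minimal sketch of such a scan (not the PR's implementation): read the leading bytes, compare them against known signatures, and rewind. The enum and function name are hypothetical. The signature values are the published magic numbers of each format: pcap `0xA1B2C3D4` (microsecond), `0xA1B23C4D` (nanosecond) and `0xA1B2CD34` (modified) in either byte order, the pcapng Section Header Block type `0x0A0D0D0A`, the snoop pattern `snoop\0\0\0`, and the Zstandard frame magic `0xFD2FB528` stored little-endian.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical result type for this sketch.
enum class DetectedFormat { Unknown, Pcap, PcapNG, Snoop, Zstd };

// Reads up to 8 leading bytes, matches them against known signatures and
// restores the stream position so a reader can start from the same offset.
DetectedFormat detectByMagic(std::istream& content)
{
    std::array<unsigned char, 8> head{};
    const std::istream::pos_type start = content.tellg();
    content.read(reinterpret_cast<char*>(head.data()), static_cast<std::streamsize>(head.size()));
    const std::size_t got = static_cast<std::size_t>(content.gcount());
    content.clear();  // a short read leaves eofbit/failbit set
    content.seekg(start);

    static const unsigned char snoopSignature[8] = { 's', 'n', 'o', 'o', 'p', 0, 0, 0 };
    if (got >= 8 && std::memcmp(head.data(), snoopSignature, 8) == 0)
        return DetectedFormat::Snoop;
    if (got < 4)
        return DetectedFormat::Unknown;

    // Interpret the first four bytes in file order (big-endian), which makes
    // the comparison independent of the host's own byte order.
    const std::uint32_t magic = (static_cast<std::uint32_t>(head[0]) << 24) |
                                (static_cast<std::uint32_t>(head[1]) << 16) |
                                (static_cast<std::uint32_t>(head[2]) << 8) |
                                static_cast<std::uint32_t>(head[3]);
    switch (magic)
    {
    case 0x0A0D0D0Au:  // pcapng Section Header Block type (byte-order palindrome)
        return DetectedFormat::PcapNG;
    case 0xA1B2C3D4u:  // pcap, microsecond timestamps
    case 0xD4C3B2A1u:
    case 0xA1B23C4Du:  // pcap, nanosecond timestamps
    case 0x4D3CB2A1u:
    case 0xA1B2CD34u:  // "modified" pcap
    case 0x34CDB2A1u:
        return DetectedFormat::Pcap;
    case 0x28B52FFDu:  // Zstandard frame: 0xFD2FB528 written little-endian
        return DetectedFormat::Zstd;
    default:
        return DetectedFormat::Unknown;
    }
}
```

A factory can then pick the reader class from the returned value, independent of the file name.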
Two new functions, `createReader` and `tryCreateReader`, have been added due to changes in the public API of the factory.
The functions differ in their error handling scheme: `createReader` throws, while `tryCreateReader` returns `nullptr` on error.
Method behaviour changes during erroneous scenarios:
getReader createReader tryCreateReader nullptr PcapFileDeviceReader nullptr
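The two error handling schemes can be contrasted with a self-contained mock. This is not PcapPlusPlus code: the functions below only borrow the names from the description, and the `Reader` type, the single string parameter, and the "empty path fails" rule are assumptions made for the example.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a reader device.
struct Reader
{
    std::string path;
};

// Mock of the non-throwing variant: signals failure with nullptr.
std::unique_ptr<Reader> tryCreateReader(const std::string& path)
{
    if (path.empty())  // assumption: stands in for "unsupported or unreadable file"
        return nullptr;
    return std::unique_ptr<Reader>(new Reader{ path });
}

// Mock of the throwing variant: same logic, but failure becomes an exception.
std::unique_ptr<Reader> createReader(const std::string& path)
{
    std::unique_ptr<Reader> reader = tryCreateReader(path);
    if (!reader)
        throw std::runtime_error("could not create a reader for '" + path + "'");
    return reader;
}
```

Callers that treat an unreadable file as a normal outcome would check the pointer from `tryCreateReader`, while callers that consider it exceptional can use `createReader` and rely on a `try`/`catch` further up.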