Environment
- bagit-java version: v5.1.1
- Java version: OpenJDK Runtime Environment Zulu17.30+15-CA (build 17.0.1+12-LTS)
- OS: macOS 12.6.1 Monterey
Details
Given
- I have a bag containing a file whose name includes non-ASCII characters (for example, contrôle.txt)
- And the file is correctly hashed and listed in the manifest
When
Then
PayloadVerifier should not raise FileNotInManifestException
Discussion
The underlying issue appears to be that sun.nio.fs.UnixPath compares paths based on their internal byte representation rather than their string representation, and that somehow, the internal byte representation of the Path produced by PayloadVerifier.getAllFilesListedInManifests() differs from that in the manifest.
I created a file named contrôle.txt in which, as near as I can tell, the ô is the single Unicode code point U+00F4. That is certainly what the manifest contains: when I open the manifest in a hex editor, I can see the character encoded as the two-byte UTF-8 sequence 0xc3 0xb4.
However, when getAllFilesListedInManifests() creates a Path object, it somehow ends up representing the character as a plain o (0x6f) followed by a combining circumflex accent (U+0302). The internal byte representation therefore contains the UTF-8 bytes 0x6f 0xcc 0x82, the comparison fails, and the file appears not to be in the manifest.
I'm not sure exactly where the conversion from ô to o plus combining ^ happens. It is probably some kind of Unicode normalization (precomposed NFC versus decomposed NFD) that one path goes through and the other doesn't.
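To make the two byte sequences concrete, here is a minimal sketch using only the JDK's java.text.Normalizer. It is not taken from bagit-java; the class name NormalizationDemo and the helper utf8Hex are made up for illustration. It shows that the precomposed and decomposed forms of ô are different strings with different UTF-8 bytes, and that normalizing both to the same form makes them equal.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class NormalizationDemo {
    /** Illustrative helper: formats the UTF-8 bytes of s as space-separated hex. */
    public static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Precomposed form (NFC): the single code point U+00F4.
        String composed = "\u00f4";
        // Decomposed form (NFD): 'o' followed by U+0302 COMBINING CIRCUMFLEX ACCENT.
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(utf8Hex(composed));    // c3 b4
        System.out.println(utf8Hex(decomposed));  // 6f cc 82

        // The raw strings differ, but they agree once both are in NFC.
        System.out.println(composed.equals(decomposed)); // false
        String recomposed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.equals(recomposed)); // true
    }
}
```

The source uses \u escapes rather than literal accented characters so that the result does not depend on the encoding of the .java file itself.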
However, oddly, when I call toString(), the string representations of the paths are equal. So one workaround would be to convert the Path objects to Strings before making the comparison.
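A rough sketch of that workaround follows. It is a hypothetical helper, not bagit-java's actual code: the names ManifestLookup, nfc, and isListed are invented here, and the inputs are assumed to be the results of calling toString() on the Path objects. The explicit NFC normalization goes one step beyond the toString() comparison described above; it is added as a precaution for platforms where toString() might not already yield matching strings.

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ManifestLookup {
    /** Returns the NFC form of a path string so equivalent spellings compare equal. */
    public static String nfc(String pathString) {
        return Normalizer.normalize(pathString, Normalizer.Form.NFC);
    }

    /**
     * Hypothetical helper: true if fileOnDisk matches any manifest entry
     * once both sides are normalized to NFC. Inputs are assumed to be
     * Path.toString() values.
     */
    public static boolean isListed(Iterable<String> manifestEntries, String fileOnDisk) {
        Set<String> normalized = new HashSet<>();
        for (String entry : manifestEntries) {
            normalized.add(nfc(entry));
        }
        return normalized.contains(nfc(fileOnDisk));
    }

    public static void main(String[] args) {
        // Manifest entry with precomposed U+00F4; on-disk name with o + U+0302.
        List<String> manifest = List.of("data/contr\u00f4le.txt");
        String onDisk = "data/contro\u0302le.txt";

        System.out.println(manifest.contains(onDisk));  // false: raw strings differ
        System.out.println(isListed(manifest, onDisk)); // true: equal after NFC
    }
}
```

Normalizing both sides, rather than only one, means the comparison does not depend on which form the filesystem or the manifest happens to use.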