When the library is used to count tokens in a dataset whose shards are not in the root directory, the following error occurs:
Traceback (most recent call last):
File "/home/gregassagraf/workspaces/token-counter/.venv/bin/token-counter", line 8, in <module>
sys.exit(main())
~~~~^^
File "/home/gregassagraf/workspaces/token-counter/src/token_counter/cli.py", line 676, in main
split_completed = _observe_rows(
values,
...<9 lines>...
stop_requested=lambda: stop_requested,
)
File "/home/gregassagraf/workspaces/token-counter/src/token_counter/cli.py", line 315, in _observe_rows
for row_number, raw_value in enumerate(values, start=1):
~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/home/gregassagraf/workspaces/token-counter/src/token_counter/cli.py", line 216, in _iter_dataset_split_values
stream = load_dataset(dataset_id, config, **kwargs)
File "/home/gregassagraf/workspaces/token-counter/.venv/lib/python3.14/site-packages/datasets/load.py", line 1698, in load_dataset
builder_instance = load_dataset_builder(
path=path,
...<10 lines>...
**config_kwargs,
)
File "/home/gregassagraf/workspaces/token-counter/.venv/lib/python3.14/site-packages/datasets/load.py", line 1325, in load_dataset_builder
dataset_module = dataset_module_factory(
path,
...<5 lines>...
cache_dir=cache_dir,
)
File "/home/gregassagraf/workspaces/token-counter/.venv/lib/python3.14/site-packages/datasets/load.py", line 1211, in dataset_module_factory
raise e1 from None
File "/home/gregassagraf/workspaces/token-counter/.venv/lib/python3.14/site-packages/datasets/load.py", line 1192, in dataset_module_factory
).get_module()
~~~~~~~~~~^^
File "/home/gregassagraf/workspaces/token-counter/.venv/lib/python3.14/site-packages/datasets/load.py", line 655, in get_module
module_name, default_builder_kwargs = infer_module_for_data_files(
~~~~~~~~~~~~~~~~~~~~~~~~~~~^
data_files=data_files,
^^^^^^^^^^^^^^^^^^^^^^
path=self.name,
^^^^^^^^^^^^^^^
download_config=self.download_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/gregassagraf/workspaces/token-counter/.venv/lib/python3.14/site-packages/datasets/load.py", line 316, in infer_module_for_data_files
raise DataFilesNotFoundError("No (supported) data files found" + (f" in {path}" if path else ""))
datasets.exceptions.DataFilesNotFoundError: No (supported) data files found in gregassagraf/wikipedia_rephrase_pt
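
A possible workaround, sketched under assumptions: the `load_dataset` call in `_iter_dataset_split_values` could receive explicit `data_files` glob patterns so that shards in subdirectories are matched directly. The helper names (`data_file_patterns`, `open_stream`), the suffix list, and the example file paths below are hypothetical and not part of token-counter, and this has not been verified to fix the error for `gregassagraf/wikipedia_rephrase_pt`.

```python
from pathlib import PurePosixPath

# Assumption: the shard formats we care about. Adjust to the actual repository.
CANDIDATE_SUFFIXES = (".parquet", ".jsonl", ".json", ".csv")


def data_file_patterns(repo_files):
    """Return one recursive glob per candidate suffix present in repo_files."""
    suffixes = {PurePosixPath(name).suffix for name in repo_files}
    return [f"**/*{s}" for s in sorted(suffixes) if s in CANDIDATE_SUFFIXES]


def open_stream(dataset_id, repo_files, split="train"):
    """Hypothetical wrapper: stream a split using explicit data_files patterns."""
    # Imported lazily so the helper above can be used without `datasets`.
    from datasets import load_dataset

    patterns = data_file_patterns(repo_files)
    if not patterns:
        raise ValueError(f"no candidate data files listed for {dataset_id}")
    return load_dataset(dataset_id, data_files=patterns, split=split, streaming=True)


# Hypothetical file listing, for illustration only.
example_files = ["README.md", "shards/part-0000.parquet", "shards/part-0001.parquet"]
patterns = data_file_patterns(example_files)
```

The file listing would have to come from somewhere, for example from the Hub API; that step is left out here because the repository layout is not shown in this report.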