Error when handling datasets whose shards are not in the root directory. #2

Description

@galdutro13

When the library is used to count tokens in a dataset whose shards are not in the root directory, the following error occurs:

Traceback (most recent call last):
  File "/home/gregassagraf/workspaces/token-counter/.venv/bin/token-counter", line 8, in <module>
    sys.exit(main())
             ~~~~^^
  File "/home/gregassagraf/workspaces/token-counter/src/token_counter/cli.py", line 676, in main
    split_completed = _observe_rows(
        values,
    ...<9 lines>...
        stop_requested=lambda: stop_requested,
    )
  File "/home/gregassagraf/workspaces/token-counter/src/token_counter/cli.py", line 315, in _observe_rows
    for row_number, raw_value in enumerate(values, start=1):
                                 ~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/gregassagraf/workspaces/token-counter/src/token_counter/cli.py", line 216, in _iter_dataset_split_values
    stream = load_dataset(dataset_id, config, **kwargs)
  File "/home/gregassagraf/workspaces/token-counter/.venv/lib/python3.14/site-packages/datasets/load.py", line 1698, in load_dataset
    builder_instance = load_dataset_builder(
        path=path,
    ...<10 lines>...
        **config_kwargs,
    )
  File "/home/gregassagraf/workspaces/token-counter/.venv/lib/python3.14/site-packages/datasets/load.py", line 1325, in load_dataset_builder
    dataset_module = dataset_module_factory(
        path,
    ...<5 lines>...
        cache_dir=cache_dir,
    )
  File "/home/gregassagraf/workspaces/token-counter/.venv/lib/python3.14/site-packages/datasets/load.py", line 1211, in dataset_module_factory
    raise e1 from None
  File "/home/gregassagraf/workspaces/token-counter/.venv/lib/python3.14/site-packages/datasets/load.py", line 1192, in dataset_module_factory
    ).get_module()
      ~~~~~~~~~~^^
  File "/home/gregassagraf/workspaces/token-counter/.venv/lib/python3.14/site-packages/datasets/load.py", line 655, in get_module
    module_name, default_builder_kwargs = infer_module_for_data_files(
                                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        data_files=data_files,
        ^^^^^^^^^^^^^^^^^^^^^^
        path=self.name,
        ^^^^^^^^^^^^^^^
        download_config=self.download_config,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/gregassagraf/workspaces/token-counter/.venv/lib/python3.14/site-packages/datasets/load.py", line 316, in infer_module_for_data_files
    raise DataFilesNotFoundError("No (supported) data files found" + (f" in {path}" if path else ""))
datasets.exceptions.DataFilesNotFoundError: No (supported) data files found in gregassagraf/wikipedia_rephrase_pt
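
A possible workaround, sketched under assumptions: if the shards sit in nested directories, an explicit `data_files` glob can be passed to `load_dataset` instead of relying on automatic discovery. The helper below is hypothetical (`guess_data_files` and `BUILDER_BY_EXTENSION` are not part of token-counter or `datasets`); it only picks the most common supported shard extension from a list of repository file names and builds a recursive pattern for it. The file names in the usage comment are made up, since the actual layout of the repository is not shown in the traceback.

```python
from collections import Counter
from pathlib import PurePosixPath

# Illustrative subset of extensions and the packaged `datasets` builder
# that reads each one. This mapping is an assumption made for the sketch.
BUILDER_BY_EXTENSION = {
    ".parquet": "parquet",
    ".jsonl": "json",
    ".json": "json",
    ".csv": "csv",
}


def guess_data_files(repo_files):
    """Return (builder_name, glob_pattern) for the most common shard type.

    Returns None when no file has a known extension.
    """
    counts = Counter()
    for name in repo_files:
        ext = PurePosixPath(name).suffix.lower()
        if ext in BUILDER_BY_EXTENSION:
            counts[ext] += 1
    if not counts:
        return None
    ext, _ = counts.most_common(1)[0]
    # "**/*<ext>" is meant to match shards at any directory depth.
    return BUILDER_BY_EXTENSION[ext], f"**/*{ext}"


# Hypothetical usage (needs network access and the `datasets` and
# `huggingface_hub` packages, so it is left commented out):
#
#   from huggingface_hub import list_repo_files
#   from datasets import load_dataset
#
#   files = list_repo_files(dataset_id, repo_type="dataset")
#   guess = guess_data_files(files)
#   if guess is not None:
#       _, pattern = guess
#       stream = load_dataset(dataset_id, data_files=pattern, streaming=True)
```

If the helper returns `None`, the repository probably contains no file in a format this sketch knows about, and the `DataFilesNotFoundError` would need a different fix.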
