File Filtering Rules

Overview

Vectorizer Sync implements comprehensive file filtering to exclude unnecessary files from synchronization and workspace export. This document describes the filtering rules and how they are applied.

Filtering Strategy

Files are filtered at multiple stages:

  1. Initial Scan: Files are filtered during project directory scanning
  2. Workspace Export: Files are filtered when generating workspace.yml
  3. Cloud Sync: Files are filtered before uploading to HiveHub Cloud

Exclusion Rules

1. Module Directories

The following directories are ALWAYS excluded:

  • node_modules/ - Node.js dependencies
  • vendor/ - PHP dependencies
  • packages/ - Package directories
  • .venv/ - Python virtual environment
  • venv/ - Python virtual environment
  • env/ - Python virtual environment
  • __pycache__/ - Python cache
  • target/ - Rust build output
  • .cargo/ - Rust cache (except config)

Pattern: **/node_modules/**, **/vendor/**, etc.
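The directory rule above could be checked by comparing each path segment against a set of excluded names. The following is a minimal sketch, not the project's actual implementation: the names `EXCLUDED_DIRS` and `hasExcludedSegment` are hypothetical, paths are assumed to be relative, and the `.cargo/` config exception is not handled.

```typescript
// Directory names taken from the list above; the constant name is illustrative.
const EXCLUDED_DIRS: ReadonlySet<string> = new Set([
  'node_modules', 'vendor', 'packages', '.venv', 'venv', 'env',
  '__pycache__', 'target', '.cargo',
]);

// Returns true when any directory segment of the path is an excluded name.
// Backslashes are normalized so Windows-style paths behave the same way.
function hasExcludedSegment(
  filePath: string,
  excluded: ReadonlySet<string> = EXCLUDED_DIRS,
): boolean {
  const segments = filePath.replace(/\\/g, '/').split('/');
  // The last segment is the file name, so only the directories are checked.
  return segments.slice(0, -1).some(segment => excluded.has(segment));
}
```

Matching whole segments rather than substrings keeps a file such as `src/vendor.ts` from being excluded by the `vendor/` rule.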

2. Build Artifacts

The following build directories are ALWAYS excluded:

  • dist/ - Distribution/build output
  • build/ - Build output
  • out/ - Output directory
  • .next/ - Next.js build
  • .nuxt/ - Nuxt.js build
  • bin/ - Binary output (if build artifact)
  • obj/ - Object files (.NET)

Pattern: **/dist/**, **/build/**, etc.

3. File Size Limit

Files larger than the configured maximum size are excluded:

  • Default: 100KB (102,400 bytes)
  • Configurable: User can change in settings
  • Maximum Configurable: 10MB

Implementation:

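import * as fs from 'fs';

// Returns true if the file at filePath is larger than maxSize (in bytes).
// statSync throws if the file does not exist or cannot be read.
function isFileTooLarge(filePath: string, maxSize: number): boolean {
  const stats = fs.statSync(filePath);
  return stats.size > maxSize;
}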
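The configurable limit could be resolved from the user setting along these lines. This is a sketch under assumptions: the function and constant names are hypothetical, 10MB is read as 10 × 1024 × 1024 bytes, and falling back to the default on a missing or invalid value is one possible policy rather than documented behavior.

```typescript
const DEFAULT_MAX_FILE_SIZE = 100 * 1024;         // 100KB = 102,400 bytes
const MAX_CONFIGURABLE_SIZE = 10 * 1024 * 1024;   // assumed meaning of "10MB"

// Resolves the effective size limit in bytes from an optional user setting.
// Missing or invalid values fall back to the default (assumed policy);
// values above the upper bound are clamped to it.
function resolveMaxFileSize(userValue?: number): number {
  if (userValue === undefined || !Number.isFinite(userValue) || userValue < 1) {
    return DEFAULT_MAX_FILE_SIZE;
  }
  return Math.min(Math.floor(userValue), MAX_CONFIGURABLE_SIZE);
}
```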

4. Hidden Files and Directories

Files and directories starting with . are excluded, EXCEPT:

Included Hidden Files:

  • .gitignore
  • .env.example
  • .eslintrc.*
  • .prettierrc.*
  • .editorconfig
  • .npmrc
  • .yarnrc

Excluded Hidden Files:

  • .git/ (entire directory)
  • .DS_Store
  • .vscode/ (IDE settings)
  • .idea/ (IDE settings)
  • .*.swp (editor swap files)
  • .*.swo (editor swap files)
  • .cache/

Pattern: **/.* (with exceptions)
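The hidden-file rule and its exceptions might be expressed as an allowlist check. In this sketch the function and constant names are hypothetical, and treating every hidden directory as excluded is an assumption, since the list above names only specific ones.

```typescript
// Hidden file names that stay included, taken from the list above.
const ALLOWED_HIDDEN_EXACT: ReadonlySet<string> = new Set([
  '.gitignore', '.env.example', '.editorconfig', '.npmrc', '.yarnrc',
]);
// Entries written as ".eslintrc.*" are treated as "prefix plus dot" matches.
const ALLOWED_HIDDEN_PREFIXES = ['.eslintrc.', '.prettierrc.'];

function isAllowedHiddenName(name: string): boolean {
  return ALLOWED_HIDDEN_EXACT.has(name)
    || ALLOWED_HIDDEN_PREFIXES.some(prefix => name.startsWith(prefix));
}

// Returns true when the hidden-file rule should drop the path.
// A hidden directory anywhere in the path excludes the file (assumption);
// a hidden file name excludes it unless the name is allowlisted.
function isExcludedHidden(filePath: string): boolean {
  const segments = filePath
    .replace(/\\/g, '/')
    .split('/')
    .filter(s => s.length > 0 && s !== '.' && s !== '..');
  if (segments.length === 0) return false;
  const fileName = segments[segments.length - 1] ?? '';
  const directories = segments.slice(0, -1);
  if (directories.some(dir => dir.startsWith('.'))) return true;
  return fileName.startsWith('.') && !isAllowedHiddenName(fileName);
}
```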

5. Binary Files

Common binary file extensions are excluded:

Excluded Extensions:

  • .exe - Windows executables
  • .dll - Windows libraries
  • .so - Linux shared libraries
  • .dylib - macOS dynamic libraries
  • .bin - Binary files
  • .o - Object files
  • .a - Archive files
  • .pyc - Python bytecode
  • .pyo - Python optimized bytecode
  • .class - Java bytecode
  • .jar - Java archives
  • .war - Web archives
  • .ear - Enterprise archives

Note: This list is configurable and can be extended.
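Since the extension list is configurable, one option is to build the check from the defaults plus any user additions. The sketch below is illustrative only; `createExtensionFilter` and `DEFAULT_BINARY_EXTENSIONS` are hypothetical names, and case-insensitive matching is an assumption.

```typescript
// Default extensions from the list above.
const DEFAULT_BINARY_EXTENSIONS = [
  '.exe', '.dll', '.so', '.dylib', '.bin', '.o', '.a',
  '.pyc', '.pyo', '.class', '.jar', '.war', '.ear',
];

// Builds an extension check from the defaults plus optional extra extensions.
// Extras may be given with or without the leading dot.
function createExtensionFilter(extra: string[] = []): (filePath: string) => boolean {
  const extensions = new Set(
    [...DEFAULT_BINARY_EXTENSIONS, ...extra]
      .map(ext => (ext.startsWith('.') ? ext : '.' + ext).toLowerCase()),
  );
  return (filePath: string): boolean => {
    const fileName = filePath.replace(/\\/g, '/').split('/').pop() ?? '';
    const dot = fileName.lastIndexOf('.');
    // dot <= 0 means no extension, or a dotfile such as ".gitignore".
    if (dot <= 0) return false;
    return extensions.has(fileName.slice(dot).toLowerCase());
  };
}
```

For example, `createExtensionFilter(['wasm'])` would add `.wasm` files to the excluded set without changing the defaults.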

6. Temporary Files

Temporary and cache files are excluded:

  • *.tmp
  • *.temp
  • *.cache
  • *.log (unless kept in a logs directory for documentation purposes)
  • *.swp
  • *.swo
  • *~ (backup files)

Pattern: **/*.tmp, **/*.temp, etc.

7. Database Files

Database files are excluded:

  • *.db
  • *.sqlite
  • *.sqlite3
  • *.db-shm
  • *.db-wal

Exception: If database files are part of the project (e.g., test fixtures), they can be explicitly included.

8. Media Files (Optional)

Large media files can be excluded (configurable):

  • *.mp4, *.avi, *.mov - Video files
  • *.mp3, *.wav, *.flac - Audio files
  • *.jpg, *.jpeg, *.png, *.gif - Images (if large)

Note: Small images and media files in documentation may be included.

User-Configurable Exclusions

Users can configure additional exclusion patterns:

Custom Patterns

Users can add custom glob patterns:

interface UserExclusions {
  patterns: string[];  // e.g., ["**/test/**", "**/*.spec.ts"]
}

Examples

  • "**/test/**" - Exclude all test directories
  • "**/*.spec.ts" - Exclude test files
  • "**/coverage/**" - Exclude coverage reports
  • "**/.github/**" - Exclude GitHub workflows (if desired)
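User patterns would typically be combined with the built-in ones before matching. The following sketch shows one way to do that; `mergeExclusionPatterns` is a hypothetical helper, and trimming and de-duplicating entries is an assumed cleanup step rather than documented behavior.

```typescript
interface UserExclusions {
  patterns: string[];
}

// Combines built-in and user patterns: trims whitespace, drops empty entries,
// and removes duplicates while keeping first-seen order.
function mergeExclusionPatterns(defaults: string[], user: UserExclusions): string[] {
  const seen = new Set<string>();
  const merged: string[] = [];
  for (const raw of [...defaults, ...user.patterns]) {
    const pattern = raw.trim();
    if (pattern.length === 0 || seen.has(pattern)) continue;
    seen.add(pattern);
    merged.push(pattern);
  }
  return merged;
}
```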

Inclusion Rules

Always Included

The following files are ALWAYS included (even if they match exclusion patterns):

  • README.md - Project readme
  • LICENSE - License file
  • package.json - Node.js package file
  • Cargo.toml - Rust package file
  • pyproject.toml - Python project file
  • go.mod - Go module file
  • workspace.yml - Vectorizer workspace file (if exists)
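Because these files take precedence over exclusions, the check has to run before any exclusion rule. A minimal sketch of that ordering follows; the names are hypothetical, and matching by file name at any directory depth is an assumption (a real implementation may want to restrict this to the project root so that, for example, a `package.json` inside `node_modules/` stays excluded).

```typescript
// File names from the list above.
const ALWAYS_INCLUDED: ReadonlySet<string> = new Set([
  'README.md', 'LICENSE', 'package.json', 'Cargo.toml',
  'pyproject.toml', 'go.mod', 'workspace.yml',
]);

// Returns true when the file name is on the always-included list.
function isAlwaysIncluded(filePath: string): boolean {
  const fileName = filePath.replace(/\\/g, '/').split('/').pop() ?? '';
  return ALWAYS_INCLUDED.has(fileName);
}

// Wraps an exclusion check so that always-included files bypass it.
function withAlwaysIncluded(
  isExcluded: (filePath: string) => boolean,
): (filePath: string) => boolean {
  return filePath => !isAlwaysIncluded(filePath) && isExcluded(filePath);
}
```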

Source Code Files

All source code files are included unless another rule excludes them (for example, the size limit or an excluded directory):

  • *.ts, *.tsx - TypeScript
  • *.js, *.jsx - JavaScript
  • *.rs - Rust
  • *.py - Python
  • *.go - Go
  • *.java - Java
  • *.cpp, *.c, *.h - C/C++
  • *.rb - Ruby
  • *.php - PHP
  • And other common source file extensions

Documentation Files

Documentation files are included:

  • *.md - Markdown
  • *.txt - Text files
  • *.rst - reStructuredText
  • *.adoc - AsciiDoc

Configuration Files

Configuration files are included:

  • *.json - JSON configs
  • *.yaml, *.yml - YAML configs
  • *.toml - TOML configs
  • *.ini - INI configs
  • *.conf - Config files
  • *.config - Config files

Filtering Implementation

File Filter Class

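import * as fs from 'fs';
import * as path from 'path';
import { minimatch } from 'minimatch';

class FileFilter {
  constructor(
    private maxFileSize: number,             // bytes
    private excludedPatterns: string[],      // glob patterns
    private excludedExtensions: string[],    // lowercase, with leading dot
    private excludedDirectories: string[],   // directory names
  ) {}

  isExcluded(filePath: string, stats: fs.Stats): {
    excluded: boolean;
    reason?: string;
  } {
    // Check size
    if (stats.size > this.maxFileSize) {
      return { excluded: true, reason: 'File too large' };
    }

    // Check patterns
    if (this.matchesPattern(filePath, this.excludedPatterns)) {
      return { excluded: true, reason: 'Matches exclusion pattern' };
    }

    // Check extensions
    if (this.hasExcludedExtension(filePath)) {
      return { excluded: true, reason: 'Excluded file extension' };
    }

    // Check directories
    if (this.isInExcludedDirectory(filePath)) {
      return { excluded: true, reason: 'In excluded directory' };
    }

    return { excluded: false };
  }

  // The helper bodies below are a minimal sketch of each check.
  private matchesPattern(filePath: string, patterns: string[]): boolean {
    return patterns.some(pattern => minimatch(filePath, pattern, { dot: true }));
  }

  private hasExcludedExtension(filePath: string): boolean {
    return this.excludedExtensions.includes(path.extname(filePath).toLowerCase());
  }

  private isInExcludedDirectory(filePath: string): boolean {
    const directories = filePath.split(/[\\/]/).slice(0, -1);
    return directories.some(dir => this.excludedDirectories.includes(dir));
  }
}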

Pattern Matching

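Exclusion patterns are matched as globs using minimatch:

import { minimatch } from 'minimatch';

// Returns true if the path matches at least one glob pattern.
// `dot: true` lets wildcards match names starting with ".", which
// minimatch skips by default (e.g. files under node_modules/.bin/).
function matchesPattern(path: string, patterns: string[]): boolean {
  return patterns.some(pattern => minimatch(path, pattern, { dot: true }));
}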

Workspace.yml Integration

Excluded files are reflected in workspace.yml exclude_patterns:

collections:
  - name: source-code
    include_patterns:
      - "src/**/*.ts"
    exclude_patterns:
      - "node_modules/**"
      - "dist/**"
      - "**/*.log"
      - "**/*.tmp"
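Generating this section from the active filter rules could look roughly like the sketch below. It builds the YAML text by hand to stay dependency-free; `CollectionSpec` and `renderCollections` are hypothetical names, and collection names are assumed to be plain identifiers that need no quoting.

```typescript
interface CollectionSpec {
  name: string;
  includePatterns: string[];
  excludePatterns: string[];
}

// Double-quotes a glob so characters such as "*" are read as plain text by YAML.
function quoteYaml(value: string): string {
  return '"' + value.replace(/\\/g, '\\\\').replace(/"/g, '\\"') + '"';
}

// Appends one pattern list under the given key, using [] for an empty list.
function pushList(lines: string[], key: string, values: string[]): void {
  if (values.length === 0) {
    lines.push(`    ${key}: []`);
    return;
  }
  lines.push(`    ${key}:`);
  for (const value of values) {
    lines.push(`      - ${quoteYaml(value)}`);
  }
}

// Renders collections in the layout shown above.
function renderCollections(collections: CollectionSpec[]): string {
  const lines: string[] = ['collections:'];
  for (const collection of collections) {
    lines.push(`  - name: ${collection.name}`);
    pushList(lines, 'include_patterns', collection.includePatterns);
    pushList(lines, 'exclude_patterns', collection.excludePatterns);
  }
  return lines.join('\n') + '\n';
}
```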

Performance Considerations

Efficient Filtering

  1. Early Exit: Check most common exclusions first
  2. Caching: Cache exclusion decisions for unchanged files
  3. Batch Processing: Filter files in batches
  4. Parallel Processing: Filter multiple files in parallel
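Batch and parallel processing from the list above can be combined: files within a batch are checked concurrently, and batches run one after another to bound the number of in-flight operations. This is a sketch only; `filterInBatches` is a hypothetical name, the default batch size of 50 is arbitrary, and the exclusion check is passed in as a callback so the example does not depend on any particular filter implementation.

```typescript
// Runs an async exclusion check over files in fixed-size batches.
// Input order is preserved in both result lists.
async function filterInBatches(
  files: string[],
  isExcluded: (filePath: string) => Promise<boolean>,
  batchSize = 50,
): Promise<{ included: string[]; excluded: string[] }> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const included: string[] = [];
  const excluded: string[] = [];
  for (let start = 0; start < files.length; start += batchSize) {
    const batch = files.slice(start, start + batchSize);
    // All checks in the batch start together; await waits for the slowest one.
    const results = await Promise.all(batch.map(file => isExcluded(file)));
    batch.forEach((file, i) => (results[i] ? excluded : included).push(file));
  }
  return { included, excluded };
}
```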

Optimization

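import * as fs from 'fs';

// Cache exclusion results, keyed by file path.
// `fileFilter` is the shared FileFilter instance.
const exclusionCache = new Map<string, boolean>();

function isExcludedCached(filePath: string): boolean {
  const cached = exclusionCache.get(filePath);
  if (cached !== undefined) {
    return cached;
  }

  // Stat the file only on a cache miss.
  // Entries must be removed when a file changes, or they go stale.
  const stats = fs.statSync(filePath);
  const result = fileFilter.isExcluded(filePath, stats);
  exclusionCache.set(filePath, result.excluded);
  return result.excluded;
}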
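To reuse decisions only for unchanged files, the cache entry can carry a fingerprint of the file and be recomputed when it differs. The sketch below uses size and modification time as that fingerprint, which is an assumption rather than the documented approach; `ExclusionCache` and `FileFingerprint` are hypothetical names. A Node.js `fs.Stats` object has both `size` and `mtimeMs`, so it can be passed as the fingerprint directly.

```typescript
interface FileFingerprint {
  size: number;     // bytes
  mtimeMs: number;  // last modification time in milliseconds
}

interface CacheEntry extends FileFingerprint {
  excluded: boolean;
}

class ExclusionCache {
  private entries = new Map<string, CacheEntry>();

  // Returns the cached decision when the fingerprint is unchanged;
  // otherwise calls compute() and stores the new decision.
  resolve(filePath: string, stats: FileFingerprint, compute: () => boolean): boolean {
    const entry = this.entries.get(filePath);
    if (entry && entry.size === stats.size && entry.mtimeMs === stats.mtimeMs) {
      return entry.excluded;
    }
    const excluded = compute();
    this.entries.set(filePath, { size: stats.size, mtimeMs: stats.mtimeMs, excluded });
    return excluded;
  }
}
```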

Testing

Test Cases

  1. Module Directories: Verify node_modules/ is excluded
  2. Build Directories: Verify dist/ is excluded
  3. File Size: Verify files larger than the configured limit (100KB by default) are excluded
  4. Hidden Files: Verify .git/ is excluded but .gitignore is included
  5. Binary Files: Verify .exe files are excluded
  6. Custom Patterns: Verify user patterns work correctly

Example Tests

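import * as fs from 'fs';

describe('FileFilter', () => {
  // Assumes `filter` is a FileFilter whose excluded directories include
  // "node_modules" and whose size limit is the 100KB default.
  // isExcluded reads only `size`, so a partial Stats object is enough.
  const stats = { size: 1024 } as fs.Stats; // 1KB, under the limit

  it('should exclude node_modules', () => {
    const result = filter.isExcluded('project/node_modules/file.js', stats);
    expect(result.excluded).toBe(true);
    expect(result.reason).toBe('In excluded directory');
  });

  it('should exclude large files', () => {
    const largeStats = { size: 200000 } as fs.Stats; // 200,000 bytes, over the limit
    const result = filter.isExcluded('project/large-file.txt', largeStats);
    expect(result.excluded).toBe(true);
    expect(result.reason).toBe('File too large');
  });

  it('should include source files', () => {
    const result = filter.isExcluded('project/src/index.ts', stats);
    expect(result.excluded).toBe(false);
  });
});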

User Interface

Settings UI

Users can configure:

  1. Maximum File Size: Slider or input field
  2. Custom Exclusion Patterns: Text area with pattern list
  3. Excluded Extensions: Checkbox list or text input
  4. Excluded Directories: Checkbox list

Preview

A preview shows what will be excluded:

  • List of excluded files (with reasons)
  • Count of excluded vs included files
  • Total size of excluded files
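The three preview items above can be derived in one pass over the filter decisions. This is an illustrative sketch; `FilterDecision`, `PreviewSummary`, and `buildPreview` are hypothetical names, and the fallback reason text is an assumption.

```typescript
interface FilterDecision {
  path: string;
  size: number;        // bytes
  excluded: boolean;
  reason?: string;
}

interface PreviewSummary {
  includedCount: number;
  excludedCount: number;
  excludedBytes: number;
  excludedFiles: { path: string; reason: string }[];
}

// Aggregates per-file decisions into the data a preview would display.
function buildPreview(decisions: FilterDecision[]): PreviewSummary {
  const summary: PreviewSummary = {
    includedCount: 0,
    excludedCount: 0,
    excludedBytes: 0,
    excludedFiles: [],
  };
  for (const decision of decisions) {
    if (!decision.excluded) {
      summary.includedCount += 1;
      continue;
    }
    summary.excludedCount += 1;
    summary.excludedBytes += decision.size;
    summary.excludedFiles.push({
      path: decision.path,
      reason: decision.reason ?? 'Unknown', // placeholder when no reason was recorded
    });
  }
  return summary;
}
```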

Future Enhancements

  1. Smart Exclusions: Learn from user behavior
  2. Project Templates: Pre-configured exclusions for project types
  3. Exclusion Analytics: Show statistics on excluded files
  4. Selective Inclusion: Allow including specific files even if excluded