Code Vectorizer API Documentation

Overview

The Code Vectorizer API is a RESTful service that vectorizes codebases and runs semantic search over the resulting embeddings. It supports multiple users and isolates their data in separate database schemas that are created dynamically.

Base URL

http://localhost:8000

Authentication

The API currently identifies users only by the username supplied in the request; it does not verify credentials. Each user's data is isolated in separate database schemas.

API Endpoints

1. Health Check

GET /api/health

Check if the API is running.

Response:

{
  "status": "healthy",
  "timestamp": "2024-01-15T10:30:00.000Z"
}

2. Vectorize Repository

POST /api/vectorize

Start vectorizing a Git repository.

Request Body:

{
  "repo_url": "https://github.com/username/repo-name",
  "username": "john_doe",
  "repo_name": "my-repo",
  "github_token": "ghp_xxxxxxxxxxxx",
  "github_username": "github_username",
  "chunk_size": 1000,
  "chunk_overlap": 200,
  "max_file_size": 1048576
}

Parameters:

repo_url (required): GitHub repository URL
username (required): User identifier for data isolation
repo_name (optional): Custom repository name (defaults to the name extracted from the URL)
github_token (optional): GitHub token for private repositories
github_username (optional): GitHub username
chunk_size (optional): Maximum tokens per chunk (default: 1000)
chunk_overlap (optional): Token overlap between chunks (default: 200)
max_file_size (optional): Maximum file size in bytes (default: 1048576)
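
To illustrate how chunk_size and chunk_overlap interact, here is a minimal sliding-window sketch. It assumes the input is already a list of tokens and that each chunk starts chunk_size - chunk_overlap tokens after the previous one. The service's actual tokenizer and splitting rules are not documented here, and chunk_tokens is a hypothetical helper.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int = 1000,
                 chunk_overlap: int = 200) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens. This is an assumed
    chunking scheme for illustration, not the service's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


tokens = [f"t{i}" for i in range(2500)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With the defaults, 2500 tokens yield three chunks of 1000, 1000, and 900 tokens, and the last 200 tokens of each chunk reappear at the start of the next one.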

Response:

{
  "job_id": "a1b2c3d4e5f6",
  "status": "pending",
  "message": "Vectorization job started",
  "created_at": "2024-01-15T10:30:00.000Z"
}

3. Get Job Status

GET /api/job/{job_id}

Get the status and progress of a vectorization job.

Response:

{
  "job_id": "a1b2c3d4e5f6",
  "status": "processing",
  "progress": {
    "step": "generating_embeddings",
    "files_discovered": 150,
    "files_processed": 150,
    "chunks_created": 1200,
    "chunks_with_embeddings": 800,
    "chunks_saved": 0,
    "current_file": "main.py"
  },
  "created_at": "2024-01-15T10:30:00.000Z",
  "updated_at": "2024-01-15T10:35:00.000Z",
  "error": null
}

Status Values:

pending: Job is queued
processing: Job is running
completed: Job finished successfully
failed: Job failed; see the error field of the response
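
Because jobs move through these states asynchronously, a client typically polls until it sees completed or failed. The following is a sketch of such a loop; wait_for_job, its parameters, and the polling interval are illustrative choices, not part of the API. The function takes any callable that returns the parsed job payload, so it does not depend on a particular HTTP library.

```python
import time
from typing import Callable, Dict


def wait_for_job(fetch_status: Callable[[], Dict],
                 poll_interval: float = 5.0,
                 timeout: float = 3600.0,
                 sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Poll until the job reaches a terminal status and return its payload.

    fetch_status is any callable returning the parsed JSON body of
    GET /api/job/{job_id}.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        if job["status"] == "completed":
            return job
        if job["status"] == "failed":
            raise RuntimeError(f"Job {job['job_id']} failed: {job.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job {job['job_id']} still {job['status']} after {timeout}s"
            )
        sleep(poll_interval)  # pending or processing: wait and try again
```

With requests, fetch_status could be `lambda: requests.get(f"http://localhost:8000/api/job/{job_id}").json()`.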

4. Search Code

POST /api/search

Search for code using semantic similarity.

Request Body:

{
  "query": "function to parse JSON",
  "username": "john_doe",
  "repo_name": "my-repo",
  "limit": 10,
  "similarity_threshold": 0.7
}

Parameters:

query (required): Search query text
username (required): User identifier
repo_name (optional): Search in specific repository only
limit (optional): Maximum results to return (default: 10)
similarity_threshold (optional): Minimum similarity score (default: 0.7)
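
The interplay of limit and similarity_threshold can be sketched as follows. This example assumes cosine similarity and assumes the threshold is applied before the limit; the documentation above does not state which metric or ordering the service uses, and rank_chunks is a hypothetical helper.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec: Sequence[float],
                chunk_vecs: Dict[str, Sequence[float]],
                limit: int = 10,
                similarity_threshold: float = 0.7) -> List[Tuple[str, float]]:
    """Keep chunks scoring at least the threshold, best first, at most limit."""
    scored = [(cid, cosine_similarity(query_vec, vec))
              for cid, vec in chunk_vecs.items()]
    kept = [(cid, score) for cid, score in scored
            if score >= similarity_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]


# Toy 2-dimensional embeddings; real embeddings have many more dimensions.
chunks = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
print(rank_chunks([1.0, 0.0], chunks))
```

Here chunk "c" is orthogonal to the query (similarity 0) and is dropped by the threshold, while "a" and "b" are returned in descending order of similarity.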

Response:

{
  "results": [
    {
      "content": "def parse_json(data):\n    return json.loads(data)",
      "start_line": 15,
      "end_line": 16,
      "token_count": 45,
      "file_path": "utils.py",
      "file_name": "utils.py",
      "repo_name": "my-repo",
      "similarity": 0.892
    }
  ],
  "total": 1,
  "query": "function to parse JSON"
}
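
As a sketch of how a client might consume this response, the snippet below turns each hit into a "repo/path:start-end" location string. The helper names format_hit and summarize are hypothetical, and the explicit sort is defensive because the response ordering is not documented above.

```python
from typing import Dict, List


def format_hit(hit: Dict) -> str:
    """Render one search result as 'repo/path:start-end (similarity)'."""
    location = (f"{hit['repo_name']}/{hit['file_path']}"
                f":{hit['start_line']}-{hit['end_line']}")
    return f"{location} ({hit['similarity']:.3f})"


def summarize(response: Dict) -> List[str]:
    """Return one line per hit, highest similarity first."""
    hits = sorted(response["results"],
                  key=lambda h: h["similarity"], reverse=True)
    return [format_hit(h) for h in hits]


# The sample response shown above, reduced to the fields used here.
sample = {
    "results": [
        {
            "start_line": 15,
            "end_line": 16,
            "file_path": "utils.py",
            "repo_name": "my-repo",
            "similarity": 0.892,
        }
    ],
    "total": 1,
    "query": "function to parse JSON",
}

for line in summarize(sample):
    print(line)  # my-repo/utils.py:15-16 (0.892)
```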

5. Get User Repositories

GET /api/user/{username}/repos

Get all repositories for a user.

Response:

{
  "username": "john_doe",
  "repositories": [
    {
      "repo_name": "my-repo",
      "repo_url": "https://github.com/username/repo-name",
      "status": "completed",
      "created_at": "2024-01-15T10:30:00.000Z",
      "updated_at": "2024-01-15T10:45:00.000Z",
      "file_count": 150,
      "chunk_count": 1200,
      "schema_name": "user_john_doe_repo_my_repo"
    }
  ]
}

6. Delete Repository

DELETE /api/user/{username}/repo/{repo_name}

Delete a repository and all its vectorized data.

Response:

{
  "message": "Repository my-repo deleted successfully"
}
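
Because username and repo_name are path segments in this endpoint, a client may want to percent-encode them before building the URL. The sketch below uses the standard library for that; repo_endpoint is a hypothetical helper, and whether the server accepts names that need encoding is not documented here.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:8000"


def repo_endpoint(username: str, repo_name: str,
                  base_url: str = BASE_URL) -> str:
    """Build the URL of /api/user/{username}/repo/{repo_name}.

    safe="" makes quote() encode "/" as well, so a name cannot
    accidentally introduce extra path segments.
    """
    user = quote(username, safe="")
    repo = quote(repo_name, safe="")
    return f"{base_url}/api/user/{user}/repo/{repo}"


print(repo_endpoint("john_doe", "my-repo"))
```

With requests, the deletion would then be `requests.delete(repo_endpoint("john_doe", "my-repo"))`.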

Database Schema

Each repository is stored in its own PostgreSQL schema, named after the user and the repository with the pattern:

user_{username}_repo_{repo_name}
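
The sample response for Get User Repositories shows repo_name "my-repo" stored under schema_name "user_john_doe_repo_my_repo", which suggests that characters not valid in an unquoted identifier are replaced with underscores. The sketch below reproduces that example; the exact sanitization rule is an assumption, and schema_name is a hypothetical helper.

```python
import re


def schema_name(username: str, repo_name: str) -> str:
    """Derive a schema name following user_{username}_repo_{repo_name}."""

    def clean(part: str) -> str:
        # Assumption: every character outside [A-Za-z0-9_] becomes "_".
        return re.sub(r"[^A-Za-z0-9_]", "_", part)

    return f"user_{clean(username)}_repo_{clean(repo_name)}"


print(schema_name("john_doe", "my-repo"))  # user_john_doe_repo_my_repo
```

Note that PostgreSQL truncates identifiers longer than 63 bytes by default, so very long user or repository names could collide under any such scheme.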

Tables in each schema:

repositories: Repository metadata
code_files: File information and content hashes
code_chunks: Code chunks with vector embeddings

Usage Examples

Python Client

import requests

BASE_URL = "http://localhost:8000"

# Start vectorization
response = requests.post(f"{BASE_URL}/api/vectorize", json={
    "repo_url": "https://github.com/username/repo",
    "username": "john_doe"
}, timeout=30)
response.raise_for_status()  # fail early on 4xx/5xx instead of on a missing key

job_id = response.json()["job_id"]

# Check status (repeat until "status" is "completed" or "failed")
status = requests.get(f"{BASE_URL}/api/job/{job_id}", timeout=30).json()

# Search code (run this once the job has completed)
search_results = requests.post(f"{BASE_URL}/api/search", json={
    "query": "authentication function",
    "username": "john_doe"
}, timeout=30).json()

cURL Examples

# Vectorize repository
curl -X POST "http://localhost:8000/api/vectorize" \
  -H "Content-Type: application/json" \
  -d '{
    "repo_url": "https://github.com/username/repo",
    "username": "john_doe"
  }'

# Check job status
curl "http://localhost:8000/api/job/a1b2c3d4e5f6"

# Search code
curl -X POST "http://localhost:8000/api/search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "database connection",
    "username": "john_doe"
  }'

Error Handling

The API returns standard HTTP status codes:

200: Success
400: Bad Request (invalid parameters)
404: Not Found (job or repository not found)
500: Internal Server Error

Error responses include a detail message:

{
  "detail": "Repository not found"
}
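
A client can turn these error responses into exceptions that carry the detail message. The following is a minimal sketch; ApiError and check_response are hypothetical names, and the function works on a status code and raw body so that it is independent of the HTTP library in use.

```python
import json
from typing import Optional


class ApiError(Exception):
    """Raised for any non-200 response; keeps the status and detail."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(f"HTTP {status_code}: {detail}")
        self.status_code = status_code
        self.detail = detail


def check_response(status_code: int, body: str) -> Optional[dict]:
    """Return the parsed JSON body for 200, raise ApiError otherwise."""
    if status_code == 200:
        return json.loads(body) if body else None
    try:
        detail = json.loads(body).get("detail", body)
    except (ValueError, AttributeError):
        detail = body  # body is not JSON, or not a JSON object
    raise ApiError(status_code, str(detail))


try:
    check_response(404, '{"detail": "Repository not found"}')
except ApiError as exc:
    print(exc)  # HTTP 404: Repository not found
```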

Rate Limiting

The API does not currently enforce any rate limits. Add rate limiting before using it in production.

Security Considerations

Authentication: Implement proper authentication (JWT, API keys, etc.)
Authorization: Add user authorization checks
Input Validation: Validate all input parameters
Rate Limiting: Implement rate limiting for production
HTTPS: Use HTTPS in production
Secrets Management: Store API keys and tokens securely
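
On the client side, the Secrets Management point can be followed by reading the GitHub token from the environment instead of hard-coding it in request bodies or scripts. The sketch below assumes an environment variable named GITHUB_TOKEN; that name and both helper functions are illustrative, not something the API requires.

```python
import os
from typing import Dict, Mapping


def build_vectorize_payload(repo_url: str, username: str,
                            env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the body for POST /api/vectorize without hard-coding the token."""
    payload = {"repo_url": repo_url, "username": username}
    token = env.get("GITHUB_TOKEN")  # assumed variable name
    if token:
        payload["github_token"] = token  # only sent when a token is configured
    return payload


def redact(payload: Mapping[str, str]) -> Dict[str, str]:
    """Copy of the payload that is safe to log: the token value is masked."""
    return {key: ("***" if key == "github_token" else value)
            for key, value in payload.items()}


payload = build_vectorize_payload(
    "https://github.com/username/repo", "john_doe",
    env={"GITHUB_TOKEN": "ghp_xxxxxxxxxxxx"},
)
print(redact(payload))
```

Logging the redacted copy rather than the payload itself keeps the token out of log files.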

Performance

Vectorization jobs run asynchronously in the background
Database queries use indexes for optimal performance
Vector similarity search uses pgvector's IVFFlat index
Large repositories are processed in chunks to manage memory
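
For readers unfamiliar with pgvector, the sketch below generates the kind of statement that creates an IVFFlat index on the code_chunks table. The column name embedding and the cosine operator class are assumptions, since the actual column and distance metric are not documented above. The sizing rule follows the guidance in pgvector's README: rows / 1000 lists up to one million rows, and the square root of the row count beyond that.

```python
import math


def suggested_lists(row_count: int) -> int:
    """Number of IVFFlat lists suggested by pgvector's README."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return int(math.sqrt(row_count))


def ivfflat_index_sql(schema: str, row_count: int) -> str:
    """Build a CREATE INDEX statement for one repository schema.

    The schema name is interpolated directly, so it must come from a
    trusted source; "embedding" is an assumed column name.
    """
    lists = suggested_lists(row_count)
    return (
        f'CREATE INDEX ON "{schema}".code_chunks '
        f"USING ivfflat (embedding vector_cosine_ops) "
        f"WITH (lists = {lists});"
    )


print(ivfflat_index_sql("user_john_doe_repo_my_repo", 250_000))
```

IVFFlat is an approximate index: at query time, pgvector's ivfflat.probes setting trades recall for speed by controlling how many lists are scanned.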

Monitoring

Monitor the following metrics:

Job completion rates
Processing times
Error rates
Database performance
API response times

Deployment

Development

make server-dev

Production

# Using uvicorn
uvicorn server:app --host 0.0.0.0 --port 8000 --workers 4

# Using gunicorn
gunicorn server:app -w 4 -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000

Docker

# Build image
docker build -t code-vectorizer .

# Run container
docker run -p 8000:8000 code-vectorizer

API Documentation UI

Access interactive API documentation at:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc
