Commit f9d8e9a

Merge pull request #2 from wmde/add_linter
Add Ruff Lint
2 parents b0ef509 + 45c3f7b commit f9d8e9a

13 files changed

Lines changed: 578 additions & 308 deletions

.github/workflows/lint.yml

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+name: "Ruff Lint"
+
+on:
+  pull_request:
+    branches: ["main"]
+
+permissions:
+  contents: read
+
+jobs:
+  ruff:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+
+      - name: Set up Python
+        run: uv python install
+
+      - name: Run Ruff linter
+        run: uv run ruff check .
+
+      - name: Run Ruff formatter check
+        run: uv run ruff format --check .
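
The two Ruff steps in this workflow can be reproduced locally before opening a pull request. The following is a minimal sketch, assuming `uv` is installed and on `PATH`; the helper `run_checks` and the constant `RUFF_COMMANDS` are hypothetical names and are not part of this commit.

```python
import subprocess
import sys

# The same two commands the workflow runs (assumes `uv` is available locally).
RUFF_COMMANDS = [
    ["uv", "run", "ruff", "check", "."],
    ["uv", "run", "ruff", "format", "--check", "."],
]


def run_checks(commands: list[list[str]]) -> int:
    """Run every command and return how many exited with a non-zero status."""
    failures = 0
    for command in commands:
        # check=False: keep going so all failures are reported in a single run.
        completed = subprocess.run(command, check=False)
        if completed.returncode != 0:
            print(f"FAILED: {' '.join(command)}", file=sys.stderr)
            failures += 1
    return failures
```

A wrapper script could call `sys.exit(1 if run_checks(RUFF_COMMANDS) else 0)`. Unlike the CI job, where a failing step skips the steps after it, this sketch runs both checks so lint and formatting problems surface together.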

README.md

Lines changed: 31 additions & 65 deletions
@@ -1,81 +1,47 @@
 # Wikidata Textifier
 
-**Wikidata Textifier** is an API that transforms Wikidata items into compact format for use in LLMs and GenAI applications. It resolves missing labels of properties and claim values by querying the Wikidata Action API, making it efficient and suitable for AI pipelines.
+**Wikidata Textifier** is an API that transforms Wikidata entities into compact outputs for LLM and GenAI use cases.
+It resolves missing labels for properties and claim values using the Wikidata Action API and caches labels to reduce repeated lookups.
 
-🔗 Live API: [https://wd-textify.toolforge.org/](https://wd-textify.toolforge.org/)
+Live API: [wd-textify.wmcloud.org](https://wd-textify.wmcloud.org/) \
+API Docs: [wd-textify.wmcloud.org/docs](https://wd-textify.wmcloud.org/docs)
 
----
+## Features
 
-## Functionalities
+- Textify Wikidata entities as `json`, `text`, or `triplet`.
+- Resolve labels for linked entities and properties.
+- Cache labels in MariaDB for faster repeated requests.
+- Support multilingual output with fallback language support.
+- Avoid SPARQL and use Wikidata Action API / EntityData endpoints.
 
-- **Textifies** any Wikidata item into a readable or JSON format suitable for LLMs.
-- **Resolves all labels**, including those missing when querying the Wikidata API.
-- **Caches labels** for 90 days to boost performance and reduce API load.
-- **Avoids SPARQL** and uses the Wikidata Action API for better efficiency and compatibility.
-- **Hosted on Toolforge**: [https://wd-textify.toolforge.org/](https://wd-textify.toolforge.org/)
+## Output Formats
 
----
+- `json`: Structured representation with claims (and optionally qualifiers/references).
+- `text`: Readable summary including label, description, aliases, and attributes.
+- `triplet`: Triplet-style lines with labels and IDs for graph-style traversal.
 
-## Formats
-
-- **Text**: A textual representation or summary of the Wikidata item, including its label, description, aliases, and claims. Useful for helping LLMs understand what the item represents.
-- **Triplet**: Outputs each triplet as a structured line, including labels and IDs, but omits descriptions and aliases. Ideal for agentic LLMs to traverse and explore Wikidata.
-- **JSON**: A structured and compact representation of the full item, suitable for custom formats.
-
----
-
-## API Usage
+## API
 
 ### `GET /`
 
-#### Query Parameters
-
-| Name | Type | Required | Description |
-|----------------|---------|----------|-----------------------------------------------------------------------------|
-| `id` | string | Yes | Wikidata item ID (e.g., `Q42`) |
-| `lang` | string | No | Language code for labels (default: `en`) |
-| `format` | string | No | The format of the response, either 'json', 'text', or 'triplet' (default: `json`) |
-| `external_ids` | bool | No | Whether to include external IDs in the output (default: `true`) |
-| `all_ranks` | bool | No | If false, returns ranked preferred statements, falling back to normal when unavailable (default: `false`) |
-| `references` | bool | No | Whether to include references (default: `false`) |
-| `fallback_lang` | string | No | Fallback language code if the preferred language is not available (default: `en`) |
-
----
-
-## Deploy to Toolforge
-
-1. Shell into the Toolforge system:
-
-```bash
-ssh [UNIX shell username]@login.toolforge.org
-```
-
-2. Switch to tool user account:
-
-```bash
-become wd-textify
-```
-
-3. Build from Git:
-
-```bash
-toolforge build start https://github.com/philippesaade-wmde/WikidataTextifier.git
-```
+#### Query parameters
 
-4. Start the web service:
+| Name | Type | Required | Description |
+|---|---|---|---|
+| `id` | string | Yes | Comma-separated Wikidata IDs (for example: `Q42` or `Q42,Q2`). |
+| `pid` | string | No | Comma-separated property IDs to filter claims (for example: `P31,P279`). |
+| `lang` | string | No | Preferred language code (default: `en`). |
+| `fallback_lang` | string | No | Fallback language code (default: `en`). |
+| `format` | string | No | Output format: `json`, `text`, or `triplet` (default: `json`). |
+| `external_ids` | bool | No | Include `external-id` datatype claims (default: `true`). |
+| `all_ranks` | bool | No | Include all statement ranks instead of preferred/normal filtering (default: `false`). |
+| `qualifiers` | bool | No | Include qualifiers in claim values (default: `true`). |
+| `references` | bool | No | Include references in claim values (default: `false`). |
 
-```bash
-webservice buildservice start --mount all
-```
-
-5. Debugging the web service:
-
-Read the logs:
-```bash
-webservice logs
-```
+#### Example requests
 
-Open the service shell:
 ```bash
-webservice shell
+curl "https://wd-textify.wmcloud.org/?id=Q42"
+curl "https://wd-textify.wmcloud.org/?id=Q42&format=text&lang=en"
+curl "https://wd-textify.wmcloud.org/?id=Q42,Q2&pid=P31,P279&format=triplet"
 ```
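
The query parameters documented in the new README can also be assembled programmatically. The sketch below only builds the request URL and performs no network call; `build_textify_url` is a hypothetical helper, not part of the project, and sending booleans as lowercase `true`/`false` is an assumption about how the server parses them.

```python
from urllib.parse import urlencode

# Base URL taken from the README's "Live API" link.
BASE_URL = "https://wd-textify.wmcloud.org/"


def build_textify_url(ids, pids=None, fmt="json", lang="en", **flags):
    """Build a GET URL for the textifier endpoint from the documented parameters."""
    params = {"id": ",".join(ids), "format": fmt, "lang": lang}
    if pids:
        params["pid"] = ",".join(pids)
    for name, value in flags.items():
        # Booleans such as references=True become "true" (assumed server format).
        params[name] = str(value).lower() if isinstance(value, bool) else value
    # safe="," keeps the comma separators readable, as in the README examples.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"
```

For instance, `build_textify_url(["Q42", "Q2"], pids=["P31", "P279"], fmt="triplet")` produces a URL with the same parameters as the third `curl` example, with `lang=en` spelled out explicitly.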

main.py

Lines changed: 61 additions & 42 deletions
@@ -1,14 +1,16 @@
-from fastapi import FastAPI, HTTPException, Query, Request
-from fastapi.middleware.cors import CORSMiddleware
-from fastapi import BackgroundTasks
+"""FastAPI application that exposes Wikidata textification endpoints."""
+
+import os
+import time
 import traceback
+
 import requests
-import time
-import os
+from fastapi import BackgroundTasks, FastAPI, HTTPException, Query, Request
+from fastapi.middleware.cors import CORSMiddleware
 
-from src.Normalizer import TTLNormalizer, JSONNormalizer
-from src.WikidataLabel import WikidataLabel, LazyLabelFactory
 from src import utils
+from src.Normalizer import JSONNormalizer, TTLNormalizer
+from src.WikidataLabel import LazyLabelFactory, WikidataLabel
 
 # Start Fastapi app
 app = FastAPI(
@@ -32,68 +34,81 @@
 LABEL_CLEANUP_INTERVAL_SECONDS = int(os.environ.get("LABEL_CLEANUP_INTERVAL_SECONDS", 3600))
 _last_label_cleanup = 0.0
 
+
 @app.on_event("startup")
 async def startup():
+    """Initialize database resources required by the API."""
     WikidataLabel.initialize_database()
 
+
 @app.get(
     "/",
     responses={
         200: {
             "description": "Returns a list of relevant Wikidata property PIDs with similarity scores",
             "content": {
                 "application/json": {
-                    "example": [{
-                        "Q42": "Douglas Adams (human), English writer, humorist, and dramatist...",
-                    }]
+                    "example": [
+                        {
+                            "Q42": "Douglas Adams (human), English writer, humorist, and dramatist...",
+                        }
+                    ]
                 }
             },
         },
         422: {
             "description": "Missing or invalid query parameter",
-            "content": {
-                "application/json": {
-                    "example": {"detail": "Invalid format specified"}
-                }
-            },
+            "content": {"application/json": {"example": {"detail": "Invalid format specified"}}},
         },
     },
 )
 async def get_textified_wd(
-    request: Request, background_tasks: BackgroundTasks,
+    request: Request,
+    background_tasks: BackgroundTasks,
     id: str = Query(..., examples="Q42,Q2"),
     pid: str = Query(None, examples="P31,P279"),
-    lang: str = 'en',
-    format: str = 'json',
+    lang: str = "en",
+    format: str = "json",
     external_ids: bool = True,
     references: bool = False,
     all_ranks: bool = False,
     qualifiers: bool = True,
-    fallback_lang: str = 'en'
+    fallback_lang: str = "en",
 ):
-    """
-    Retrieve a Wikidata item with all labels or textual representations for an LLM.
-
-    Args:
-        id (str): The Wikidata item ID (e.g., "Q42").
-        pid (str): Comma-separated list of property IDs to filter claims (e.g., "P31,P279").
-        format (str): The format of the response, either 'json', 'text', or 'triplet'.
-        lang (str): The language code for labels (default is 'en').
-        external_ids (bool): If True, includes external IDs in the response.
-        all_ranks (bool): If True, includes statements of all ranks (preferred, normal, deprecated).
-        references (bool): If True, includes references in the response. (only available in JSON format)
-        qualifiers (bool): If True, includes qualifiers in the response.
-        fallback_lang (str): The fallback language code if the preferred language is not available.
-
-    Returns:
-        list: A list of dictionaries containing QIDs and the similarity scores.
+    """Retrieve Wikidata entities as structured JSON, natural text, or triplet lines.
+
+    This endpoint fetches one or more entities, resolves missing labels, and normalizes
+    claims into a compact representation suitable for downstream LLM use.
+
+    **Args:**
+
+    - **id** (str): Comma-separated Wikidata IDs to fetch (for example: `"Q42"` or `"Q42,Q2"`).
+    - **pid** (str, optional): Comma-separated property IDs used to filter returned claims (for example: `"P31,P279"`).
+    - **lang** (str): Preferred language code for labels and formatted values.
+    - **format** (str): Output format. One of `"json"`, `"text"`, or `"triplet"`.
+    - **external_ids** (bool): If `true`, include claims with datatype `external-id`.
+    - **references** (bool): If `true`, include references in claim values (JSON output only).
+    - **all_ranks** (bool): If `true`, include preferred, normal, and deprecated statement ranks.
+    - **qualifiers** (bool): If `true`, include qualifiers for claim values.
+    - **fallback_lang** (str): Fallback language used when `lang` is unavailable.
+    - **request** (Request): FastAPI request context object.
+    - **background_tasks** (BackgroundTasks): Background task manager used for cache cleanup.
+
+    **Returns:**
+
+    A dictionary keyed by requested entity ID (for example, `"Q42"`).
+    Each value depends on `format`:
+
+    - **json**: Structured entity payload with label, description, aliases, and claims.
+    - **text**: Human-readable summary text.
+    - **triplet**: Triplet-style text lines with labels and IDs.
     """
     try:
         filter_pids = []
         if pid:
-            filter_pids = [p.strip() for p in pid.split(',')]
+            filter_pids = [p.strip() for p in pid.split(",")]
 
-        qids = [q.strip() for q in id.split(',')]
+        qids = [q.strip() for q in id.split(",")]
         label_factory = LazyLabelFactory(lang=lang, fallback_lang=fallback_lang)
 
         entities = {}
@@ -144,7 +159,9 @@ async def get_textified_wd(
                         fallback_lang=fallback_lang,
                         label_factory=label_factory,
                         debug=False,
-                    ) if entity_data.get(qid) else None
+                    )
+                    if entity_data.get(qid)
+                    else None
                     for qid in qids
                 }
 
@@ -154,8 +171,10 @@ async def get_textified_wd(
                 all_ranks=all_ranks,
                 references=references,
                 filter_pids=filter_pids,
-                qualifiers=qualifiers
-            ) if entity else None
+                qualifiers=qualifiers,
+            )
+            if entity
+            else None
             for qid, entity in entity_data.items()
         }
 
@@ -165,9 +184,9 @@ async def get_textified_wd(
                 return_data[qid] = None
                 continue
 
-            if format == 'text':
+            if format == "text":
                 results = entity.to_text(lang)
-            elif format == 'triplet':
+            elif format == "triplet":
                 results = entity.to_triplet()
             else:
                 results = entity.to_json()
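
The ID splitting and format dispatch visible in these hunks can be exercised in isolation. This is a minimal sketch, not the endpoint itself: `FakeEntity` is a hypothetical stand-in for the normalized entity objects, its return values are invented for illustration, and the `None` check approximates a condition whose exact form lies outside the lines shown.

```python
from dataclasses import dataclass


def split_ids(raw: str) -> list[str]:
    """Split a comma-separated ID string and trim whitespace, as the endpoint does."""
    return [part.strip() for part in raw.split(",")]


@dataclass
class FakeEntity:
    """Hypothetical stand-in exposing the three methods called in the diff."""

    qid: str
    label: str

    def to_text(self, lang: str) -> str:
        return f"{self.label} ({self.qid}) [{lang}]"

    def to_triplet(self) -> str:
        return f"{self.label} ({self.qid})"

    def to_json(self) -> dict:
        return {"id": self.qid, "label": self.label}


def render(entities: dict, fmt: str, lang: str) -> dict:
    """Mirror the dispatch in the diff: one result per requested ID."""
    out = {}
    for qid, entity in entities.items():
        if entity is None:  # assumed check; entities that were not found map to None
            out[qid] = None
            continue
        if fmt == "text":
            out[qid] = entity.to_text(lang)
        elif fmt == "triplet":
            out[qid] = entity.to_triplet()
        else:
            out[qid] = entity.to_json()
    return out
```

In the lines shown, any `format` value other than `"text"` or `"triplet"` reaches the `to_json()` branch; whether the value is validated elsewhere in the file is not visible in this diff.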

pyproject.toml

Lines changed: 27 additions & 0 deletions
@@ -13,3 +13,30 @@ dependencies = [
     "sqlalchemy>=2.0.41",
     "uvicorn>=0.35.0",
 ]
+
+[dependency-groups]
+dev = [
+    "ruff>=0.9.0"
+]
+
+[tool.ruff]
+target-version = "py313"
+line-length = 120
+
+exclude = ["data/mysql"]
+
+[tool.ruff.lint]
+select = [
+    "E", # pycodestyle errors
+    "F", # Pyflakes (catches undefined names, unused imports, etc.)
+    "I", # isort (import sorting)
+    "D", # pydocstyle (function/class documentation)
+]
+
+[tool.ruff.lint.pydocstyle]
+convention = "google"
+
+[tool.ruff.lint.isort]
+known-first-party = [
+    "wikidatasearch"
+]
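
With the `D` rules selected and `convention = "google"`, public modules and functions are expected to carry Google-style docstrings. The function below is a hypothetical illustration of that style (a one-line summary followed by `Args`, `Returns`, and `Raises` sections); it is not taken from the repository, and whether a given file passes still depends on the Ruff version in use.

```python
"""Example module illustrating the docstring convention selected in pyproject.toml."""


def clamp(value: int, low: int, high: int) -> int:
    """Restrict a value to the inclusive range between two bounds.

    Args:
        value: The number to restrict.
        low: Smallest allowed result.
        high: Largest allowed result.

    Returns:
        The value itself if it lies within the range, otherwise the nearest bound.

    Raises:
        ValueError: If ``low`` is greater than ``high``.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))
```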
