draft blogpost for Wikidata Reconciliation Service #568

Open: magdmartin wants to merge 1 commit into master from 202604-blog

@magdmartin

Member

I’m opening this PR to share a draft blog post following the recent issues around the Wikidata reconciliation service and the related discussions on rate limiting and infrastructure. It was previously discussed with @Ainali during last week's Advisory Committee call.

The goal of this post is to:

  • document the incident at a high level
  • clarify the structure of the reconciliation ecosystem
  • state OpenRefine’s position regarding scope and ownership
  • support ongoing discussions around governance and sustainability of the Wikidata reconciliation service

I would like this post to serve as an official reference point for future conversations, as this topic regularly comes up across the forum, GitHub, and external discussions.

I am requesting formal approval from the Core Dev Group (@tfmorris @Abbe98) and the Advisory Committee (@Ainali @ej2432 @jfaurelacroix), as this post represents the project’s official position. The goal is to reflect a shared consensus within the OpenRefine community.

Please feel free to suggest edits or alternative directions. Other community members and committers are also very welcome to review, comment, and participate in the discussion.

@netlify

netlify Bot commented Apr 22, 2026


Deploy Preview for openrefine-website ready!

Name Link
🔨 Latest commit e9e0a39
🔍 Latest deploy log https://app.netlify.com/projects/openrefine-website/deploys/69e9286825de6600083bdeb8
😎 Deploy Preview https://deploy-preview-568--openrefine-website.netlify.app

@Abbe98

Abbe98 commented Apr 23, 2026

Member

On the Wikimedia side, note that:

  • WMF intends to tighten the screws on API usage even more in the next 14 days; this blog post might therefore be premature.
  • Per Daniel's last message, OpenRefine is sending compliant headers; what they are seeing is traffic from old versions of OpenRefine and the reconciliation service.
  • We have only received communication from a single member of a single small Wikimedia engineering/product team; there has been no official communication about these changes through the normal channels.
  • In just two weeks said individual has changed their position three times.
  • Several tools and even Wikimedia's own services have been having outages because of these changes, contributing to these swift position changes.

On the reconciliation side:

  • There is more than one reconciliation service; we just happen to be bundling one. Not all of them depend on the SPARQL service, etc.

On the OpenRefine side:

  • The diagram is incorrect. Simplified, there are four types of traffic to Wikimedia from a default OpenRefine installation:

    • Requests from the Wikibase extension backend (sometimes using the Wikidata-Toolkit library); these are mostly authenticated requests
    • Requests from a reconciliation service for which OpenRefine is a client
    • Requests from the Wikibase extension issued by the frontend
    • Requests initiated by the user (fetch-by-url, etc.)
  • On the subject of what we are actually doing, are we actually doing it? Improving our user-agents does not seem actionable. Working on improved error handling sounds great but is a rather massive undertaking. With the next planned release being 4.0, any notable improvements are also kinda far out.

Given all of this, my suggestion would be to either sit still in the boat or write a generic blog post on OpenRefine/reconciliation. I'm afraid a blog post like this could misinform and do more harm than good.

@Ainali

Ainali commented Apr 23, 2026

Member

On a process note only: while it would be great if there were a wide consensus in the OpenRefine community, I don't think there is anything in the blog post that is within the scope of the Advisory Committee mandate to formally approve. It seems to me that what is described lies within the Core Dev group mandate.

@magdmartin

Member Author

@Abbe98 Thanks for the detailed feedback, this is helpful.

I agree with several of your points, in particular:

  • The situation on the Wikimedia side is still evolving
  • We only have partial visibility
  • The diagram is a simplification

To clarify the intent of the post: this is not meant to be a precise technical description of all traffic patterns or a definitive account of recent changes on the Wikimedia side.

The goal is more limited and practical:

  • document the incident at a high level for OpenRefine users
  • provide a clear reference to explain how the reconciliation ecosystem is structured
  • clarify scope and ownership across the different services involved

We often need to explain this three-layer structure in the forum, on GitHub, and in conversations with partners. This incident highlighted that ownership is not clearly understood across the community. The post and diagram are meant to make the scope and responsibilities of each component visible, and to make OpenRefine’s role explicit.

If there is one point this post aims to make explicit, it is that OpenRefine neither operates nor maintains the Wikidata reconciliation service. The recent incident brought this to the surface, but the underlying question has been around for some time. The post is also intended to help move the conversation toward clearer ownership going forward.

On your specific points:

Timing: Agreed that the situation is still evolving. The post intentionally avoids going into technical details or making assumptions about how things will settle. The focus is on structural aspects (scope, ownership, dependencies), which are more stable.

Diagram: Agreed that it is simplified. The goal is not to represent all request paths, but to show that multiple independently operated components are involved. I can make that explicit in the text and include the direct Wikibase extension -> Wikidata path.

Multiple reconciliation services: Good point. The post already mentions that OpenRefine supports multiple reconciliation services, but we can strengthen that to avoid the impression that Wikidata is the only one.

“What we are doing” section: I agree with your point here. That section is not central to the main argument and may dilute the message. I’m leaning toward removing it to keep the post focused on scope and governance.

Happy to adjust the draft further along those lines.

@tfmorris

Member

From a process point of view, I would find it easier to first agree on an objective or outline for the blog post. A full written text anchors and confines the discussion.

There's no question that the Wikidata reconciliation service is a mess, but it's not our mess and I don't think we're in a position to speak for it. In an ideal world, it would have a responsive maintainer and a clear problem reporting mechanism that is easy for users to find. We may be on the path to that, but I think it's too early to tell. We're definitely not there yet, since the current service points to an issue tracker which is archived and which points to two other repos, with a fourth repo being proposed as the final resting place.

The planning, communication, and professionalism of the Wiki* engineering teams leave a lot to be desired, but they're also under stress and we don't really have any influence on their behavior, so we (and our users) just need to deal with the consequences.

I agree with Albin that it's premature to post anything until it's clearer what the outcome is going to be. Ideally, when posted, it should include a pointer to the problem reporting mechanism for the production Wikidata reconciliation service (and the service will have been updated to point to that same place).

Eliminating a lot of the excess detail and focusing on the key message(s) ("not our problem"?) would help readers focus. Currently that's below the fold (i.e., after the break) and buried deep.

The timeline actually begins in August 2025 from a Wikidata point of view, but it might also be worthwhile to mention the rise of the AI scrapers as context for these dramatic changes, because it affects other reconciliation services and Fetch URL.

The other general topic worth including in a discussion of reconciliation services is that, because we don't control them, not only can we not fix outages, we also don't control what is done with users' data, so users should be comfortable sending their data to whatever service(s) they choose to use.

Lastly, and this doesn't really relate to the blog post, one of the things I find most troubling about this whole situation is the lack of transparency. Both last fall's and this most recent round of Wikidata changes were made without any advance notice. The recon service was "fixed" by quietly reconfiguring the service URL to redirect to a different host behind the scenes, but none of the underlying bugs have been fixed, so it is susceptible to future Wikidata API tightening that doesn't whitelist the WMCS hosts.

Bottom line - don't post until the situation is clearer and then revise to focus on the main message.

4 participants