Uploading to Wikimedia Commons with AI
2026-03-25
Uploading to Wikimedia Commons is fun and rewarding. I love taking photos of stuff, and sharing my photos with the world is nice. Otherwise, it would be kind of sad that my 100 megapixel photos just languish on my computer, never to be seen by anyone other than my family and me. I’ve uploaded tons of photos which you can see at user:dllu on Commons.
However, uploading to Commons has some pain points. Finding the right categories is a big annoyance, because the rather fine-grained categories on Commons are both a blessing and a curse. It’s really easy to make mistakes (by putting a file in a parent category instead of a more specific one) or to forget certain types of categories. Sometimes I even forget what I took a picture of. Also, there are a lot of easily automatable steps like adding the geolocation, tags based on camera parameters, and so on.
I wrote a Hacker News comment complaining about all the friction:
As someone who has uploaded lots of photos to Wikimedia Commons, I think one main challenge is coming up with an encyclopedic description and the right categories. Searching for categories is remarkably difficult, manual, and unintuitive. For example, I recently uploaded a photo of a Caltrain at San Jose Diridon station. So I put “Category:San Jose Diridon station” as a category. But this was in fact incorrect, since there’s a more specific category called “Category:Trains at San José Diridon station”. On Wikimedia Commons, the UI to add a category only tells you if a category exists or not, but doesn’t let you search nicely.

There are also location/time categories like “Category:December 2023 in California”. Probably less than 1% of photos taken during December 2023 in California get added to such categories, even when GPS and EXIF data contain information about the location and date of capture.

As for writing a description, it should be comprehensive, descriptive, yet succinct, in an encyclopedic tone, and with links to relevant categories on Wikimedia Commons or to relevant pages on Wikipedia as necessary. Again, making the relevant links is mildly tedious. As a result, when uploading any single photo, I would need to open dozens of tabs to verify the right pages and categories.

FIGURE 1 Caltrain photo on Wikimedia Commons.
EXIF data
- Camera: FUJIFILM GFX100S
- Lens: GF55mmF1.7 R WR
- Aperture: f/5.6
- Shutter: 1/42 s
- ISO: 640
- Software: Digital Camera GFX100S Ver2.10
Fortunately, we have the technology to solve this!
Generally, the tedious task of searching for categories can be handled by a vision-language model that inspects the photo, along with its metadata, and searches Commons for relevant categories. Obviously, nobody wants to pollute the nice Commons project with inaccurate AI slop, so it is important that the script shows a UI for me, a human, to review and edit the suggested caption and categories.
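The category-search half of that is plain MediaWiki API plumbing. Here is a minimal sketch using the public Commons search endpoint, restricted to namespace 14 (`Category:` pages); the function names are mine, and `list=search` is just one reasonable choice of search module:

```python
import json
import urllib.parse
import urllib.request

COMMONS_API = "https://commons.wikimedia.org/w/api.php"

def build_category_query(keyword: str, limit: int = 20) -> dict:
    # Namespace 14 restricts full-text search results to Category: pages.
    return {
        "action": "query",
        "list": "search",
        "srsearch": keyword,
        "srnamespace": 14,
        "srlimit": limit,
        "format": "json",
    }

def extract_category_titles(response: dict) -> list[str]:
    # Pull bare category names out of an API search response.
    hits = response.get("query", {}).get("search", [])
    return [hit["title"].removeprefix("Category:") for hit in hits]

def search_categories(keyword: str) -> list[str]:
    url = COMMONS_API + "?" + urllib.parse.urlencode(build_category_query(keyword))
    with urllib.request.urlopen(url, timeout=30) as resp:
        return extract_category_titles(json.load(resp))
```

Walking each hit's parent categories (e.g. via `prop=categories`) then gives you the DAG to hand to the model.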
I wrote a big prompt for Codex to add a Wikimedia Commons upload script to my pupphoto project.
It is a collection of scripts to help me organize my photos, including scripts to upload photos to this very blog.
I'd like to write a Python script to upload a photo to Wikimedia Commons and populate its description page with the appropriate categories, description, and so on, by querying the appropriate openai api endpoints as well as wikimedia commons endpoints.
* For category search, we should downsize the image to a canonical size, and use openai vision apis (i.e. https://developers.openai.com/api/docs/guides/images-vision#analyze-images ) to analyze it (we can also pass in relevant information such as GPS lat long from EXIF data, if available, and date/time) and propose some keywords to search. Then, it should search for those keywords and find a list of categories as a structured DAG (some categories are children of others). Finally, we should use the vision api again, given the image and relevant data, and also pass in the structured DAG in a comprehensible format (e.g. json), and ask the model to determine the most appropriate categories. For the vision API, we can use gpt-5.4 with "high" detail level. Please put some thought into how best to format the prompt in order to get the model to propose the best categories. We should generally go for a "leaf node" in the DAG for the most specific category, instead of a vague parent category. For example, a photo of a new Caltrain in San Jose station taken in 2023 may have the categories "Stadler KISS of Caltrain", "Trains at San José Diridon station", and "Caltrain in 2023". It should NOT go into both a child and parent category at the same time, e.g. it should NOT go into the category "San José Diridon Station platforms", which is a parent category of "Trains at San José Diridon station". It should also not get the category "2023 in California" which is an ancestor of "Caltrain in 2023".
* We should also automatically search for categories based on photographic equipment, technique, and so on, based on the EXIF data, to get categories such as "Taken with Fujifilm GFX100S", "Lens focal length 55 mm", "Exposure time 1/250 sec", and so on. We may do this based on some hardcoded rules. To generate the hardcoded rules, you may first perform some searches to ensure that you have the right list of such categories. For example we can first generate a list of all "Lens focal length %dmm" categories by searching the Commons API.
* For the caption, we should also use the openai vision API and get it to write a concise caption. Please carefully construct the prompt to the model so that it produces a concise, factual, encyclopaedic, neutral, and objective caption.
* We should also generate an appropriate filename for the whole image, again using the vision API. You may consolidate vision API calls with the caption one and the categories with good prompting (e.g. ask it to output title on the first line and caption on the second line). This should save us a lot of monetary costs compared to repeatedly uploading the image.
Please append a common suffix (configurable in config.toml).
Next, you should open a simple UI. It is probably easiest to serve a mini webapp and open a browser. The UI shall display the image and have a textarea for the caption as well as an editable list of categories. It shall also have a button to save and proceed to uploading.
Only when the user presses the button shall it actually create the Wikitext content for the description page and perform the actual upload. The Wikitext content should have a template like
```
=={{int:filedesc}}==
{{Information
|description =
{{en|1 = %s}}
|date = %s
|source = {{own}}
|author = %s
|other fields = {{Information field | name = Raw file SHA1 sum | value = %s}}
}}
{{Location|%s|%s}}
=={{int:license-header}}==
%s
%s
```
* For the Raw file SHA1 sum, please consult how the filename is formed from the sha1sum of the raw file in import.py. If the filename of the photo matches it, you can add that "information field" template with the raw file sha1sum. Otherwise, skip it.
* For the location template, only add it if the file contains valid EXIF. Please respect the "remove_gps_if_banned" logic and avoid adding the location if it belongs to a banned area.
* For the license, it should be configurable in the config.toml, but probably default to '{{self|cc-by-sa-4.0}}'
* For the author name, it should also be configurable in the config.toml.
* For the Commons API user and password, as well as the OpenAI API key, they should be configurable in config.toml.
Upon clicking the save button, it should show a loading spinner until the upload is complete, and then redirect to the newly uploaded file's Wikimedia Commons page.
Then, the script shall quit.
We should make it possible to run multiple of these upload jobs (say, 10) by running multiple instances of this script in parallel (probably easiest to use a randomized port).
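The EXIF-rule bullet in the prompt above is the most mechanical part. A sketch of what such hardcoded rules might look like — the category name patterns ("Taken with …", "Lens focal length … mm", "Exposure time … sec") come from the prompt, but this rule table is a small illustrative subset, and I'm assuming exposure arrives as a `Fraction`:

```python
from fractions import Fraction

def equipment_categories(exif: dict) -> list[str]:
    """Map a few EXIF fields to Commons equipment/technique categories.

    Illustrative subset only; a real rule table would be generated by
    first enumerating the existing categories via the Commons API.
    """
    cats = []
    if exif.get("Model") == "GFX100S":
        cats.append("Taken with Fujifilm GFX100S")
    focal = exif.get("FocalLength")
    if focal is not None:
        cats.append(f"Lens focal length {round(focal)} mm")
    exposure = exif.get("ExposureTime")  # e.g. Fraction(1, 250)
    if exposure is not None and exposure < 1:
        cats.append(f"Exposure time 1/{round(1 / exposure)} sec")
    return cats
```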
Happily, Codex using gpt-5.4 nearly one-shotted it, minus some small bugs (such as parsing the GPS lat/long from EXIF data).
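The GPS bug is a classic: EXIF stores each coordinate as three (degrees, minutes, seconds) rationals plus a hemisphere reference tag, and it's easy to drop the sign or the seconds term. The fix amounts to something like this (my own helper, not the exact code in the script):

```python
def exif_gps_to_decimal(dms, ref: str) -> float:
    """Convert EXIF GPS (degrees, minutes, seconds) rationals plus an
    'N'/'S'/'E'/'W' reference into a signed decimal-degree value."""
    degrees, minutes, seconds = (float(x) for x in dms)
    decimal = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative.
    return -decimal if ref in ("S", "W") else decimal
```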

FIGURE 2 Screenshot of the upload tool.
In most cases, it works remarkably well.
However, I soon found that sometimes the AI screws up when there are no GPS coordinates. This is really annoying because sometimes my camera just inexplicably doesn’t connect to the Bluetooth app on my phone (which provides the GPS coordinates).

FIGURE 3 The world-famous Holyrood Palace quadrangle gets a rather generic set of categories and caption in the absence of GPS data.
So I vibe coded the ability to give the AI a hint.

FIGURE 4 GPT 5 needs to take a hint.

FIGURE 5 Much better now. Commons Link.
By the way, the upload_commons.py script is combo wombo with my other project, sriv, which I mentioned in an earlier blog post, sriv: simple rust image viewer.
In my bindings.toml for sriv, I just put:
```toml
"ctrl+w" = "cd /home/dllu/proj/pupphoto && uv run python upload_commons.py {file}"
```
So now, whenever I am scrolling through my images in sriv, I can just upload whichever ones I think look nice.