"Accessing tags made in Shotwell with Python 3" on https://aligot-death.space, available at https://aligot-death.space/txt/shotwell-tags-python-en

txt/guides dev IT 26 Jul 2020 (updated)

Accessing tags made in Shotwell with Python 3

A quick and dirty hack to read the tags Shotwell writes into images, in Python.


As I was reorganising my gallery, I wanted to optimize the classification process. Before that, I was using directories, subdirectories and Python black magic to generate a file containing the list of files with categories and tags. I decided to use a proper solution, that is, dedicated software. The plan was to use Shotwell to tag the images and then simply parse the tags in the metadata with Python to generate the pages. I activated the "Write tags, titles and other metadata to photo files" setting in Shotwell, and started tagging.

(The final snippet is at the bottom of the page.)

Once that was done, I tried the first snippet of code I found to access exif data in Python...

>>> import PIL.Image
>>> img = PIL.Image.open('1_000010e.JPG')
>>> exif_data = img._getexif()
>>>
>>> exif_data
{296: 2, 34665: 220, 271: 'FUJI PHOTO FILM CO., LTD.', 272: 'SP-3000', 305: 'Shotwell 0.30.1', 274: 1, 306: '2019:05:17 15:29:52', 531: 1, 282: (72, 1), 283: (72, 1), 36864: b'0210', 37121: b'\x01\x02\x03\x00', 40960: b'0100', 36867: '    :  :     :  :  ', 36868: '2019:05:16 17:10:49', 40961: 1, 40962: 1703, 40963: 1168, 40965: 494, 41728: b'\x03', 41729: b'\x01', 37500: b'FUJIFILM\x0c\x00\x00\x00\x05\x00\x00\x00\x07\x00\x04\x00\x00\x000130\x00\x80\x02\x00\x06\x00\x00\x00N\x00\x00\x00\x02\x80\x04\x00\x01\x00\x00\x00\xff\xff\xff\xff \x80\x03\x00\x01\x00\x00\x00\x01\x00\x00\x00!\x80\x03\x00\x01\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00135_C\x00'}

But no tag to be seen. Same thing with the good ol' "file" command:

nemecle@yggdrasil:~/Pictures/$ file 1_000010e.JPG
1_000010e.JPG: JPEG image data, JFIF standard 1.01, resolution (DPI), density 72x72, segment length 16, Exif Standard: [TIFF image data, little-endian, direntries=10, manufacturer=FUJI PHOTO FILM CO., LTD., model=SP-3000, orientation=upper-left, xresolution=168, yresolution=176, resolutionunit=2, software=Shotwell 0.30.1, da

Or even the "exif" Python library (a third-party package, not part of the standard library):

>>> from exif import Image
>>> with open('1_000010e.JPG', 'rb') as image_file:
...     my_image = Image(image_file)
...
>>> my_image.has_exif
True
>>> dir(my_image)
['_exif_ifd_pointer', '_interoperability_ifd_Pointer', '_segments', 'color_space', 'components_configuration', 'compression', 'datetime', 'datetime_digitized', 'datetime_original', 'delete', 'delete_all', 'exif_version', 'file_source', 'flashpix_version', 'get', 'get_file', 'get_thumbnail', 'has_exif', 'jpeg_interchange_format', 'jpeg_interchange_format_length', 'make', 'maker_note', 'model', 'orientation', 'pixel_x_dimension', 'pixel_y_dimension', 'resolution_unit', 'scene_type', 'software', 'x_resolution', 'y_and_c_positioning', 'y_resolution']

After some research, it seems that commonly available exif commands and libraries are unable to read user-made data. Oh well.

Frustrated but still brave, I dove head first into the raw bytes, a bad habit I picked up while debugging shitty non-standard services. I fired up a simple xxd in vim (":%!xxd" in normal mode) to ease the search. We can see some metadata at the beginning of the file, and the "JFIF" magic string indicating that the file is a jpeg:

00000000: ffd8 ffe0 0010 4a46 4946 0001 0101 0048  ......JFIF.....H
00000010: 0048 0000 ffe1 1790 4578 6966 0000 4949  .H......Exif..II
00000020: 2a00 0800 0000 0a00 0f01 0200 1a00 0000  *...............
00000030: 8600 0000 1001 0200 0800 0000 a000 0000  ................
00000040: 1201 0300 0100 0000 0100 0000 1a01 0500  ................
00000050: 0100 0000 a800 0000 1b01 0500 0100 0000  ................
00000060: b000 0000 2801 0300 0100 0000 0200 0000  ....(...........
00000070: 3101 0200 1000 0000 b800 0000 3201 0200  1...........2...
00000080: 1400 0000 c800 0000 1302 0300 0100 0000  ................
00000090: 0100 0000 6987 0400 0100 0000 dc00 0000  ....i...........
000000a0: 0c02 0000 4655 4a49 2050 484f 544f 2046  ....FUJI PHOTO F
000000b0: 494c 4d20 434f 2e2c 204c 5444 2e00 5350  ILM CO., LTD..SP
000000c0: 2d33 3030 3000 4800 0000 0100 0000 4800  -3000.H.......H.
000000d0: 0000 0100 0000 5368 6f74 7765 6c6c 2030  ......Shotwell 0
000000e0: 2e33 302e 3100 3230 3139 3a30 353a 3137  .30.1.2019:05:17
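
The dump follows the usual JPEG layout: a start-of-image marker (ffd8), then segments that each begin with a two-byte marker (ffe0 for APP0, ffe1 for APP1) and a two-byte big-endian length that counts itself. As a rough sketch, the segments can be listed by hand with nothing but the standard library. The function name list_app_segments is made up for this example, and the code assumes a well-formed file with no padding bytes between segments:

```python
import struct


def list_app_segments(jpeg_bytes):
    """Return (name, payload) pairs for the APPn segments found before the image data."""
    if jpeg_bytes[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        raise ValueError("not a JPEG stream")

    segments = []
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            break  # lost sync with the markers: stop rather than guess
        marker = jpeg_bytes[pos + 1]
        if marker in (0xDA, 0xD9):  # start of scan / end of image
            break
        # the length field is big-endian and includes its own two bytes
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        payload = jpeg_bytes[pos + 4:pos + 2 + length]
        if 0xE0 <= marker <= 0xEF:  # APP0 .. APP15
            segments.append(("APP%d" % (marker - 0xE0), payload))
        pos += 2 + length
    return segments
```

On the file above, the first entry would be the APP0 segment starting with "JFIF" and the second the APP1 segment starting with "Exif".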

Knowing the keywords, I just searched for "home":

[...]

00001880: 223e 203c 7264 663a 4465 7363 7269 7074  "> <rdf:Descript
00001890: 696f 6e20 7264 663a 6162 6f75 743d 2222  ion rdf:about=""
000018a0: 2078 6d6c 6e73 3a64 633d 2268 7474 703a   xmlns:dc="http:
000018b0: 2f2f 7075 726c 2e6f 7267 2f64 632f 656c  //purl.org/dc/el
000018c0: 656d 656e 7473 2f31 2e31 2f22 2078 6d6c  ements/1.1/" xml
000018d0: 6e73 3a78 6d70 3d22 6874 7470 3a2f 2f6e  ns:xmp="http://n
000018e0: 732e 6164 6f62 652e 636f 6d2f 7861 702f  s.adobe.com/xap/
000018f0: 312e 302f 2220 786d 703a 4c61 6265 6c3d  1.0/" xmp:Label=
00001900: 2270 686f 746f 6772 6170 6879 223e 203c  "photography"> <
00001910: 6463 3a73 7562 6a65 6374 3e20 3c72 6466  dc:subject> <rdf
00001920: 3a42 6167 3e20 3c72 6466 3a6c 693e 616e  :Bag> <rdf:li>an
00001930: 616c 6f67 3c2f 7264 663a 6c69 3e20 3c72  alog</rdf:li> <r
00001940: 6466 3a6c 693e 686f 6d65 3c2f 7264 663a  df:li>home</rdf:

[...]

Bingo... I guess? I had no idea of what this was. After some cleaning I ended up with:

http://ns.adobe.com/xap/1.0/.<?xpacket begin="..." id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 4.4.0-Exiv2">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description
      rdf:about=""
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:xmp="http://ns.adobe.com/xap/1.0/"
      xmp:Label="photography">
      <dc:subject>
        <rdf:Bag>
          <rdf:li>analog</rdf:li>
          <rdf:li>home</rdf:li>
          <rdf:li>photography</rdf:li>
        </rdf:Bag>
      </dc:subject>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>
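
The manual search-and-clean step can also be scripted. The sketch below simply looks for the opening and closing x:xmpmeta tags in the raw bytes of a file and returns what sits between them. The helper name find_xmp_packet is made up, and it assumes the packet uses the "x:" prefix and sits in one contiguous piece, as in the dump above:

```python
def find_xmp_packet(raw_bytes):
    """Return the <x:xmpmeta> element found in raw file bytes as text, or None."""
    start = raw_bytes.find(b"<x:xmpmeta")
    if start == -1:
        return None
    end_tag = b"</x:xmpmeta>"
    end = raw_bytes.find(end_tag, start)
    if end == -1:
        return None
    # undecodable bytes are replaced instead of raising
    return raw_bytes[start:end + len(end_tag)].decode("utf-8", errors="replace")


# Possible usage on a file on disk:
# with open("1_000010e.JPG", "rb") as image_file:
#     print(find_xmp_packet(image_file.read()))
```

This is a blunt byte search, with no real parsing of the container format, so it is only good for poking around.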

So, we are working with "RDF" elements, which apparently stands for "Resource Description Framework". I searched "rdf:li python", and finally found someone with an issue close enough to be useful, but apparently too different to have shown up earlier:

from PIL import Image

with Image.open("1_000010e.JPG") as im:
    # applist holds the JPEG's APPn segments as (name, raw bytes) pairs
    for segment, content in im.applist:
        # the XMP payload is prefixed by a null-terminated namespace marker
        marker, _, body = content.partition(b'\x00')
        if segment == 'APP1' and marker == b'http://ns.adobe.com/xap/1.0/':
            data = body.decode('utf-8')
            print(data)

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?> <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 4.4.0-Exiv2"> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xmp="http://ns.adobe.com/xap/1.0/" xmp:Label="photography"> <dc:subject> <rdf:Bag> <rdf:li>analog</rdf:li> <rdf:li>home</rdf:li> <rdf:li>photography</rdf:li> </rdf:Bag> </dc:subject> </rdf:Description> </rdf:RDF> </x:xmpmeta>
...
(a lot of whitespaces)
...
<?xpacket end="w"?>

Finally!

Then things got dirty because I don't actually care about most of the data, only the specific last-level "rdf:li" elements, so a quick and dirty regex did the job:

>>> import re
>>> re.findall(r"(?<=<rdf:li>).*?(?=</rdf:li>)", data)
['analog', 'home', 'photography']
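
For comparison, here is a less dirty alternative to the regex: a sketch that parses the packet with the standard library's xml.etree.ElementTree and only reads the rdf:li items under dc:subject. The function name tags_from_xmp is made up for this example, the namespace URIs are the ones declared in the packet above, and the code assumes the packet is wrapped in an x:xmpmeta element:

```python
import xml.etree.ElementTree as ET

# Namespace URIs as declared in the XMP packet
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def tags_from_xmp(xmp_text):
    """Return the dc:subject keywords of an XMP packet, or [] if none are found."""
    # keep only the <x:xmpmeta> element, dropping the xpacket wrapper and padding
    start = xmp_text.find("<x:xmpmeta")
    end_tag = "</x:xmpmeta>"
    end = xmp_text.find(end_tag)
    if start == -1 or end == -1:
        return []
    root = ET.fromstring(xmp_text[start:end + len(end_tag)])
    items = root.findall(".//dc:subject/rdf:Bag/rdf:li", NAMESPACES)
    return [item.text for item in items if item.text]
```

Unlike the regex, this ignores rdf:li items that belong to other properties, at the cost of failing loudly on malformed XML.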

But when I ran the script, I quickly realised that it would not work on .png images, because the resulting PIL object didn't have the "applist" attribute that contains the data. I loaded a jpeg and a png:

im = Image.open("Archives/IMG_1670.JPG")
im1 = Image.open("Miscellaneous digital drawings/windmill Mawi.png")  # png not working

And enumerated the available attributes to compare:

object_methods = [method_name for method_name in dir(im)
                  if callable(getattr(im, method_name))]

object_methods1 = [method_name for method_name in dir(im1)
                   if callable(getattr(im1, method_name))]

And unsurprisingly:

import inspect

# non-callable members of the jpeg object
for k, v in inspect.getmembers(im, lambda a: not inspect.isroutine(a)):
    print(str(k))

# "applist" is listed

# same thing for the png object
for k, v in inspect.getmembers(im1, lambda a: not inspect.isroutine(a)):
    print(str(k))

# "applist" is not listed
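
Rather than comparing the two listings by eye, a set difference gives the answer directly. This is only a sketch: the helper names are made up, and it works on any two Python objects, not just PIL images:

```python
import inspect


def data_attribute_names(obj):
    """Names of the non-callable, non-dunder attributes of obj."""
    members = inspect.getmembers(obj, lambda a: not inspect.isroutine(a))
    return {name for name, _ in members if not name.startswith("__")}


def attributes_only_in(first, second):
    """Sorted names of the data attributes that first has and second lacks."""
    return sorted(data_attribute_names(first) - data_attribute_names(second))


# Possible usage with the two images loaded earlier:
# print(attributes_only_in(im, im1))
```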

I cycled through some of the existing attributes of the png object to see their content, and finally (in its "info" dictionary):

[...unrelated stuff...]

          \n                           \n<?xpacket end="w"?>', 'dpi': (72, 72), 'Comment': 'Created by Nemecle'}

After merging the two solutions, the final snippet (for jpeg and png, and with no fallback if tags are missing) looks like this:

import re
import sys

from PIL import Image


def read_tags(filepath):
    """
    read the shotwell tags from the metadata
    (require the "Write tags, titles and other metadata to photo files" option)

    """

    data = ""
    tags = []

    try:
        with Image.open(filepath) as im:
            # compare strings with ==, not "is" (identity is not guaranteed)
            if im.format == "PNG":
                data = str(im.info["XML:com.adobe.xmp"])

            elif im.format == "JPEG":
                for segment, content in im.applist:
                    # the XMP payload follows a null-terminated marker
                    marker, _, body = content.partition(b'\x00')
                    if segment == 'APP1' and marker == b'http://ns.adobe.com/xap/1.0/':
                        data = body.decode('utf-8')
    except Exception as e:
        print("Error while reading tags on %s: %s " % (filepath, str(e)))
        sys.exit(1)


    try:
        # isolate the <dc:subject> block that holds the keywords
        pattern = re.compile(r"(?<=<dc:subject>).*?(?=</dc:subject>)", re.DOTALL)

        tag_data = pattern.search(data)

    except Exception as e:
        print("Error while extracting tag data on %s: %s" % (filepath, str(e)))
        sys.exit(1)


    try:
        # each keyword is the content of one <rdf:li> element
        pattern = re.compile(r"(?<=<rdf:li>).*?(?=</rdf:li>)", re.DOTALL)

        # tag_data is None when the file has no tags: this raises and exits
        tags = pattern.findall(tag_data.group(0))

    except Exception as e:
        print("Error while parsing tags on %s: %s" % (filepath, str(e)))
        sys.exit(1)

    return tags

And voilà.
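
Since the snippet has no fallback when tags are missing, here is one possible way to add one: a sketch of the extraction step alone, working on the already-decoded XMP string and returning an empty list instead of exiting. The name extract_tags is made up, and the patterns are equivalent to the ones used above:

```python
import re

# First isolate the dc:subject block, then pull out each rdf:li item
SUBJECT_RE = re.compile(r"<dc:subject>(.*?)</dc:subject>", re.DOTALL)
ITEM_RE = re.compile(r"<rdf:li>(.*?)</rdf:li>", re.DOTALL)


def extract_tags(data):
    """Return the tags found in an XMP string, or [] when there are none."""
    subject = SUBJECT_RE.search(data)
    if subject is None:
        return []  # untagged image: nothing to parse
    return [tag.strip() for tag in ITEM_RE.findall(subject.group(1))]
```

An untagged image then simply yields an empty list, which is easier to handle when walking a whole gallery.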