
Accessing tags made in Shotwell with Python 3

A dirty trick for reading tags created by Shotwell in Python.


"Accessing tags made in Shotwell with Python 3" on https://aligot-death.space, available at https://aligot-death.space/txt/shotwell-tags-python-fr

date: 2020-06-26
lang: fr, en
tags: dev

read time 7 min

While reorganizing my gallery, I wanted to streamline how images get sorted. Instead of relying on directories and sub-directories with tags and Python black magic, as I had been doing until then, I decided to build my own solution, i.e. software made for the job. The plan was to use Shotwell to tag the images, and simply parse the tags with Python to generate the pages. So I enabled the "Write tags, titles and other metadata to photo files" option in Shotwell and started tagging.

Final code at the bottom of the page.

Except that when I tried the first snippet I could find for accessing EXIF data in Python...

>>> import PIL.Image
>>> img = PIL.Image.open('1_000010e.JPG')
>>> exif_data = img._getexif()
>>>
>>> exif_data
{296: 2, 34665: 220, 271: 'FUJI PHOTO FILM CO., LTD.', 272: 'SP-3000', 305: 'Shotwell 0.30.1', 274: 1, 306: '2019:05:17 15:29:52', 531: 1, 282: (72, 1), 283: (72, 1), 36864: b'0210', 37121: b'\x01\x02\x03\x00', 40960: b'0100', 36867: '    :  :     :  :  ', 36868: '2019:05:16 17:10:49', 40961: 1, 40962: 1703, 40963: 1168, 40965: 494, 41728: b'\x03', 41729: b'\x01', 37500: b'FUJIFILM\x0c\x00\x00\x00\x05\x00\x00\x00\x07\x00\x04\x00\x00\x000130\x00\x80\x02\x00\x06\x00\x00\x00N\x00\x00\x00\x02\x80\x04\x00\x01\x00\x00\x00\xff\xff\xff\xff \x80\x03\x00\x01\x00\x00\x00\x01\x00\x00\x00!\x80\x03\x00\x01\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00135_C\x00'}

Not a tag in sight. Same thing with the good old file command:

nemecle@yggdrasil:~/Pictures/$ file 1_000010e.JPG
1_000010e.JPG: JPEG image data, JFIF standard 1.01, resolution (DPI), density 72x72, segment length 16, Exif Standard: [TIFF image data, little-endian, direntries=10, manufacturer=FUJI PHOTO FILM CO., LTD., model=SP-3000, orientation=upper-left, xresolution=168, yresolution=176, resolutionunit=2, software=Shotwell 0.30.1, da

Or even with the commonly used "exif" library (a third-party package, not part of Python's standard library):

>>> from exif import Image
>>> with open('1_000010e.JPG', 'rb') as image_file:
...     my_image = Image(image_file)
...
>>> my_image.has_exif
True
>>> dir(my_image)
['_exif_ifd_pointer', '_interoperability_ifd_Pointer', '_segments', 'color_space', 'components_configuration', 'compression', 'datetime', 'datetime_digitized', 'datetime_original', 'delete', 'delete_all', 'exif_version', 'file_source', 'flashpix_version', 'get', 'get_file', 'get_thumbnail', 'has_exif', 'jpeg_interchange_format', 'jpeg_interchange_format_length', 'make', 'maker_note', 'model', 'orientation', 'pixel_x_dimension', 'pixel_y_dimension', 'resolution_unit', 'scene_type', 'software', 'x_resolution', 'y_and_c_positioning', 'y_resolution']

After some research, it seems the usual libraries are unable to access user-created data. Fine.

Frustrated but still determined, I dove head first into the bytes, a habit I picked up while debugging crappy non-standard services and which has helped me more than once. So I ran xxd inside vim (:%!xxd in normal mode) to make the job easier. You can see some metadata at the beginning of the file, as well as the JFIF magic string indicating that the file is a JPEG:

00000000: ffd8 ffe0 0010 4a46 4946 0001 0101 0048  ......JFIF.....H
00000010: 0048 0000 ffe1 1790 4578 6966 0000 4949  .H......Exif..II
00000020: 2a00 0800 0000 0a00 0f01 0200 1a00 0000  *...............
00000030: 8600 0000 1001 0200 0800 0000 a000 0000  ................
00000040: 1201 0300 0100 0000 0100 0000 1a01 0500  ................
00000050: 0100 0000 a800 0000 1b01 0500 0100 0000  ................
00000060: b000 0000 2801 0300 0100 0000 0200 0000  ....(...........
00000070: 3101 0200 1000 0000 b800 0000 3201 0200  1...........2...
00000080: 1400 0000 c800 0000 1302 0300 0100 0000  ................
00000090: 0100 0000 6987 0400 0100 0000 dc00 0000  ....i...........
000000a0: 0c02 0000 4655 4a49 2050 484f 544f 2046  ....FUJI PHOTO F
000000b0: 494c 4d20 434f 2e2c 204c 5444 2e00 5350  ILM CO., LTD..SP
000000c0: 2d33 3030 3000 4800 0000 0100 0000 4800  -3000.H.......H.
000000d0: 0000 0100 0000 5368 6f74 7765 6c6c 2030  ......Shotwell 0
000000e0: 2e33 302e 3100 3230 3139 3a30 353a 3137  .30.1.2019:05:17
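
The dump reads more easily once you know the layout: ffd8 opens the image, then every header segment is ff, a marker byte (e0 for APP0, e1 for APP1), and a 2-byte big-endian length that counts itself (0010 for the JFIF segment, 1790 for the Exif one). Here is a minimal sketch that walks those segments in plain Python. The name iter_jpeg_segments is mine, not something from a library, and the sketch assumes a simple file with no fill bytes or length-less markers before the image data.

```python
import struct


def iter_jpeg_segments(data):
    """Yield (marker_byte, payload) for each header segment (hypothetical helper)."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing start-of-image marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xDA:
            # start of scan: compressed image data follows, stop walking
            return
        # the 2-byte length includes itself but not the ff/marker bytes
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        yield marker, data[pos + 4:pos + 2 + length]
        pos += 2 + length
```

Feeding it open("1_000010e.JPG", "rb").read() and printing hex(marker) with payload[:12] for each segment should show the same JFIF and Exif prefixes as the dump.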

Since I knew which tags I had set, I searched for "home":

000017a0: 28a0 0fff d900 ffe1 0a20 6874 7470 3a2f  (........ http:/
000017b0: 2f6e 732e 6164 6f62 652e 636f 6d2f 7861  /ns.adobe.com/xa
000017c0: 702f 312e 302f 003c 3f78 7061 636b 6574  p/1.0/.<?xpacket
000017d0: 2062 6567 696e 3d22 efbb bf22 2069 643d   begin="..." id=
000017e0: 2257 354d 304d 7043 6568 6948 7a72 6553  "W5M0MpCehiHzreS
000017f0: 7a4e 5463 7a6b 6339 6422 3f3e 203c 783a  zNTczkc9d"?> <x:
00001800: 786d 706d 6574 6120 786d 6c6e 733a 783d  xmpmeta xmlns:x=
00001810: 2261 646f 6265 3a6e 733a 6d65 7461 2f22  "adobe:ns:meta/"
00001820: 2078 3a78 6d70 746b 3d22 584d 5020 436f   x:xmptk="XMP Co
00001830: 7265 2034 2e34 2e30 2d45 7869 7632 223e  re 4.4.0-Exiv2">
00001840: 203c 7264 663a 5244 4620 786d 6c6e 733a   <rdf:RDF xmlns:
00001850: 7264 663d 2268 7474 703a 2f2f 7777 772e  rdf="http://www.
00001860: 7733 2e6f 7267 2f31 3939 392f 3032 2f32  w3.org/1999/02/2
00001870: 322d 7264 662d 7379 6e74 6178 2d6e 7323  2-rdf-syntax-ns#
00001880: 223e 203c 7264 663a 4465 7363 7269 7074  "> <rdf:Descript
00001890: 696f 6e20 7264 663a 6162 6f75 743d 2222  ion rdf:about=""
000018a0: 2078 6d6c 6e73 3a64 633d 2268 7474 703a   xmlns:dc="http:
000018b0: 2f2f 7075 726c 2e6f 7267 2f64 632f 656c  //purl.org/dc/el
000018c0: 656d 656e 7473 2f31 2e31 2f22 2078 6d6c  ements/1.1/" xml
000018d0: 6e73 3a78 6d70 3d22 6874 7470 3a2f 2f6e  ns:xmp="http://n
000018e0: 732e 6164 6f62 652e 636f 6d2f 7861 702f  s.adobe.com/xap/
000018f0: 312e 302f 2220 786d 703a 4c61 6265 6c3d  1.0/" xmp:Label=
00001900: 2270 686f 746f 6772 6170 6879 223e 203c  "photography"> <
00001910: 6463 3a73 7562 6a65 6374 3e20 3c72 6466  dc:subject> <rdf
00001920: 3a42 6167 3e20 3c72 6466 3a6c 693e 616e  :Bag> <rdf:li>an
00001930: 616c 6f67 3c2f 7264 663a 6c69 3e20 3c72  alog</rdf:li> <r
00001940: 6466 3a6c 693e 686f 6d65 3c2f 7264 663a  df:li>home</rdf:
00001950: 6c69 3e20 3c72 6466 3a6c 693e 7068 6f74  li> <rdf:li>phot
00001960: 6f67 7261 7068 793c 2f72 6466 3a6c 693e  ography</rdf:li>
00001970: 203c 2f72 6466 3a42 6167 3e20 3c2f 6463   </rdf:Bag> </dc
00001980: 3a73 7562 6a65 6374 3e20 3c2f 7264 663a  :subject> </rdf:
00001990: 4465 7363 7269 7074 696f 6e3e 203c 2f72  Description> </r
000019a0: 6466 3a52 4446 3e20 3c2f 783a 786d 706d  df:RDF> </x:xmpm
000019b0: 6574 613e 2020 2020 2020 2020 2020 2020  eta>

Bingo...? I had no idea what this was. After cleaning it up a bit, I ended up with:

http://ns.adobe.com/xap/1.0/.<?xpacket begin="..." id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 4.4.0-Exiv2">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description
      rdf:about=""
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:xmp="http://ns.adobe.com/xap/1.0/"
      xmp:Label="photography">
      <dc:subject>
        <rdf:Bag>
          <rdf:li>analog</rdf:li>
          <rdf:li>home</rdf:li>
          <rdf:li>photography</rdf:li>
        </rdf:Bag>
      </dc:subject>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>

So we are dealing with "RDF" elements, which apparently stands for "Resource Description Framework". I searched for "rdf:li python" and eventually found a solution that was fairly close to what I needed, but different enough to have slipped under the radar of my earlier searches:

from PIL import Image

with Image.open("1_000010e.JPG") as im:
    # applist holds the JPEG APPn segments as (name, raw bytes) pairs
    for segment, content in im.applist:
        # the XMP segment starts with a namespace marker followed by a NUL byte
        marker, _, body = content.partition(b'\x00')
        if segment == 'APP1' and marker == b'http://ns.adobe.com/xap/1.0/':
            data = body.decode('utf-8')
            print(data)

<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?> <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 4.4.0-Exiv2"> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xmp="http://ns.adobe.com/xap/1.0/" xmp:Label="photography"> <dc:subject> <rdf:Bag> <rdf:li>analog</rdf:li> <rdf:li>home</rdf:li> <rdf:li>photography</rdf:li> </rdf:Bag> </dc:subject> </rdf:Description> </rdf:RDF> </x:xmpmeta>
...
(a lot of whitespaces)
...
<?xpacket end="w"?>

Finally!

Things then got a bit dirty: I don't care about most of the data, I only want the innermost "rdf:li" elements. One dirty regex later:

re.findall(r"(?<=<rdf:li>).*?(?=</rdf:li>)", data)
['analog', 'home', 'photography']
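
For what it's worth, the same tags can also be pulled out with a real XML parser instead of a regex. This is only a sketch, assuming the packet is well-formed XML like the one printed above. The helper name tags_from_xmp is made up for the example, and the namespace URIs are copied from the packet itself.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as they appear in the XMP packet shown earlier
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def tags_from_xmp(xmp_text):
    """Return the text of each rdf:li under dc:subject (hypothetical helper)."""
    root = ET.fromstring(xmp_text.strip())
    items = root.findall(".//dc:subject/rdf:Bag/rdf:li", NAMESPACES)
    return [item.text for item in items if item.text]


# Shortened copy of the packet from the article, xpacket wrapper included
sample = (
    '<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?> '
    '<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 4.4.0-Exiv2"> '
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> '
    '<rdf:Description rdf:about="" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:xmp="http://ns.adobe.com/xap/1.0/" xmp:Label="photography"> '
    '<dc:subject> <rdf:Bag> <rdf:li>analog</rdf:li> <rdf:li>home</rdf:li> '
    '<rdf:li>photography</rdf:li> </rdf:Bag> </dc:subject> '
    '</rdf:Description> </rdf:RDF> </x:xmpmeta> <?xpacket end="w"?>'
)

print(tags_from_xmp(sample))  # ['analog', 'home', 'photography']
```

The trade-off is that the parser raises on a malformed packet where the regex would shrug it off, which may or may not be what you want for a gallery script.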

But then I realized this wouldn't work on .png files, because the resulting PIL object didn't have the applist with the data in it. So I loaded a JPEG and a PNG:

im = Image.open("Archives/IMG_1670.JPG")
im1 = Image.open("Miscellaneous digital drawings/windmill Mawi.png")  # PNG, not working

And listed what was available on each of them to compare:

object_methods = [method_name for method_name in dir(im)
                  if callable(getattr(im, method_name))]

object_methods1 = [method_name for method_name in dir(im1)
                  if callable(getattr(im1, method_name))]

And surprisingly:

import inspect

# non-callable attributes of the JPEG image
for k, v in inspect.getmembers(im, lambda a: not inspect.isroutine(a)):
    print(str(k))

# "applist" is listed

# same thing for the PNG image
for k, v in inspect.getmembers(im1, lambda a: not inspect.isroutine(a)):
    print(str(k))

# "applist" is not listed
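
Rather than comparing the two printed lists by eye, a set difference gives the answer directly. A rough sketch, with helper names of my own choosing. FakeJpeg and FakePng are stand-ins so the snippet runs without any image files, and they are not Pillow classes.

```python
import inspect


def data_attributes(obj):
    """Names of the public non-callable attributes of obj (hypothetical helper)."""
    members = inspect.getmembers(obj, lambda a: not inspect.isroutine(a))
    return {name for name, _ in members if not name.startswith("_")}


def only_in_first(first, second):
    """Attribute names present on first but missing on second."""
    return sorted(data_attributes(first) - data_attributes(second))


class FakeJpeg:
    """Stand-in for a loaded JPEG image."""
    def __init__(self):
        self.info = {}
        self.applist = []


class FakePng:
    """Stand-in for a loaded PNG image."""
    def __init__(self):
        self.info = {}
        self.text = {}


print(only_in_first(FakeJpeg(), FakePng()))  # ['applist']
print(only_in_first(FakePng(), FakeJpeg()))  # ['text']
```

With the two real images, only_in_first(im, im1) should do the same job on the actual Pillow objects.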

I went through the existing attributes to see what they contained, and finally:

[...unrelated stuff...]

'...<?xpacket end="w"?>', 'dpi': (72, 72), 'Comment': 'Created by Nemecle'}
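
As far as I can tell, that entry comes from a PNG text chunk: XMP in a PNG lives in an iTXt chunk whose keyword is "XML:com.adobe.xmp", and the PNG final code further down reads it from the image's info dict under that same key. Here is a sketch that reads the chunk without Pillow, mostly to show there is no magic involved. The function names are hypothetical and CRCs are not checked.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def iter_png_chunks(data):
    """Yield (chunk_type, chunk_body) for each chunk (hypothetical helper)."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG: bad signature")
    pos = 8
    while pos + 8 <= len(data):
        # each chunk: 4-byte length, 4-byte type, body, 4-byte CRC
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        yield ctype, data[pos + 8:pos + 8 + length]
        pos += 12 + length


def png_xmp(data):
    """Return the XMP packet stored in an iTXt chunk, or None if absent."""
    for ctype, body in iter_png_chunks(data):
        if ctype != b"iTXt":
            continue
        keyword, rest = body.split(b"\x00", 1)
        if keyword != b"XML:com.adobe.xmp":
            continue
        # compression flag, then compression method (skipped)
        compressed, rest = rest[0], rest[2:]
        _language, rest = rest.split(b"\x00", 1)
        _translated_keyword, text = rest.split(b"\x00", 1)
        if compressed:
            text = zlib.decompress(text)
        return text.decode("utf-8")
    return None
```

Calling png_xmp(open(path, "rb").read()) on a tagged PNG should return the same packet text that Pillow exposes.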

Merging the two solutions, here is the final code (for JPEG and PNG at least, and ignoring the possible absence of tags):

import re
import sys

from PIL import Image


def read_tags(filepath):
    """
    read the shotwell tags from the metadata
    (require the "Write tags, titles and other metadata to photo files" option)

    """

    data = ""
    tags = []

    try:
        with Image.open(filepath) as im:
            if im.format == "PNG":
                # for PNG, the XMP packet is a text entry in im.info
                data = str(im.info["XML:com.adobe.xmp"])

            elif im.format == "JPEG":
                # for JPEG, look for the APP1 segment carrying the XMP marker
                for segment, content in im.applist:
                    marker, _, body = content.partition(b'\x00')
                    if segment == 'APP1' and marker == b'http://ns.adobe.com/xap/1.0/':
                        data = body.decode('utf-8')
    except Exception as e:
        print("Error while reading tags on %s: %s " % (filepath, str(e)))
        sys.exit(1)


    try:
        # keep only the dc:subject block, which holds the tags
        pattern = re.compile(r"(?<=<dc:subject>).*?(?=</dc:subject>)", re.DOTALL)

        tag_data = pattern.search(data)

    except Exception as e:
        print("Error while extracting tag data on %s: %s" % (filepath, str(e)))
        sys.exit(1)


    try:
        # each tag is the content of one rdf:li element
        pattern = re.compile(r"(?<=<rdf:li>).*?(?=</rdf:li>)", re.DOTALL)

        tags = pattern.findall(tag_data.group(0))

    except Exception as e:
        print("Error while parsing tags on %s: %s" % (filepath, str(e)))
        sys.exit(1)

    return tags
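
As noted, this ignores images without tags: when there is no dc:subject block, the search returns None and the function bails out with an error. If an empty list is more convenient, the two regex steps can be isolated in a small pure function. This is a sketch under that assumption, and tags_or_empty is a name I made up.

```python
import re

# Same two patterns as in the final code
SUBJECT_RE = re.compile(r"(?<=<dc:subject>).*?(?=</dc:subject>)", re.DOTALL)
ITEM_RE = re.compile(r"(?<=<rdf:li>).*?(?=</rdf:li>)", re.DOTALL)


def tags_or_empty(xmp_text):
    """Return the tags found in an XMP packet, or [] when there are none."""
    subject = SUBJECT_RE.search(xmp_text)
    if subject is None:
        # untagged image: no dc:subject block at all
        return []
    return ITEM_RE.findall(subject.group(0))


tagged = ("<dc:subject> <rdf:Bag> <rdf:li>analog</rdf:li> "
          "<rdf:li>home</rdf:li> </rdf:Bag> </dc:subject>")

print(tags_or_empty(tagged))  # ['analog', 'home']
print(tags_or_empty(""))      # []
```

Since it only deals with text, it can be tried out without any image file at hand.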

And there you go.
