Teaching AI to Identify Clothes

This blog is a sequel to my first blog on building faAi, an AI system intended to help my wife choose an outfit to wear. In the previous blog, I shared how I built the Clothes Diary, the first module of faAi, which automatically records a snapshot of what my wife wears every day, tagged with the weather conditions.

In this blog, I will share how I built the second module, an automatic Clothes Cataloguing system which identifies the clothes captured in the Clothes Diary. Why do we need to identify clothes, you may ask? Because being able to identify a piece of clothing is a crucial step towards being able to make recommendations.

So the goal here is quite clear: to build a cataloguing system which automatically assigns an ID to each unique outfit seen by the Clothes Diary, with additional tags such as skirt, pants, shorts, vest, etc. As she definitely owns more than one of each kind, the final IDs will look more like skirt-1, skirt-2, pants-1, shorts-1, shorts-2, etc.

Desired labelled outfits. Notice that long sleeve top-1, trousers-1 and long sleeve top-2 were each identified twice in separate photos.

In order to uniquely identify clothes, we need to know what makes one outfit different from another. First, the structural shape: e.g. pants are longer than shorts, a skirt is short but has a wider bottom, etc. Second, the texture (including colour): e.g. wool, cotton, red, yellow, etc. So I need a system which can locate all the outfits in an image and identify each one by its structural shape and texture.

One way to achieve this is to build a multi-class object detector. However, training an object detector to locate and identify each unique skirt, pair of pants, shorts, etc. requires a massive labelling effort for the training set. You can see that this approach does not scale well, as I don't want to manually label every outfit she owns and every outfit she buys in the future!

Instead, I decided to go with a two-stage process. First, use object detection to locate and identify the structural shape of the clothes (skirt, shorts, pants, vest, dress, etc). Next, use a Gram matrix to generate a texture fingerprint of each detected outfit. The type of clothing returned by the object detection, together with its fingerprint, is then passed to a search module which finds a matching outfit (skirt-1, skirt-2, pants-1, etc) in our clothes catalogue if one exists; otherwise a new entry is added. I will explain my reasoning behind the choice of these techniques in a later section.

There are two kinds of object detection relevant to our case: bounding box based, like Faster R-CNN, or per-pixel based, like Mask R-CNN. Faster R-CNN is more widely used and it is faster; however, the fact that it can only return bounding boxes of detected outfits is not good enough.

I need the exact pixel region of every detected outfit to be able to identify its texture (I will get into the details of why in the next section). For this very reason, per-pixel object detection is the obvious choice.

Exact pixel regions are marked with a colour-coded label which represents an outfit type.

I used DeepFashion2, an open-source dataset, as my training set; it already comes with polygon outlines for 13 different types of outfit: shorts, skirt, long sleeve top, vest, short sleeve top, etc. Luckily, Amazon Sagemaker already has a built-in image segmentation algorithm, so I didn't need to spend too much time building one myself. Due to the limited training set, I used transfer learning with a ResNet50 backbone, which took 5 hours of training time on a P3.8XL (Nvidia V100). The output is an indexed PNG where each pixel contains an index to one of the thirteen labels. Amazon Sagemaker lets you choose between three different algorithms: FCN, PSPNet or DeepLab. DeepLab outperformed the rest for our use case.
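For anyone curious what such a training job looks like, here is a minimal sketch using the SageMaker Python SDK's built-in semantic segmentation algorithm. The S3 paths, IAM role and the epoch count are placeholders, not my actual configuration:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder role

# Built-in semantic segmentation container for the current region.
container = image_uris.retrieve("semantic-segmentation", session.boto_region_name)

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.p3.8xlarge",
    output_path="s3://my-bucket/segmentation-output",  # placeholder bucket
    sagemaker_session=session,
)

# DeepLab with a pretrained ResNet50 backbone (transfer learning),
# one class per DeepFashion2 outfit type.
estimator.set_hyperparameters(
    algorithm="deeplab",
    backbone="resnet-50",
    use_pretrained_model=True,
    num_classes=13,
    epochs=30,  # illustrative value
)

estimator.fit({
    "train": "s3://my-bucket/train",                        # placeholder paths
    "train_annotation": "s3://my-bucket/train_annotation",
})
```

This is configuration only; running it requires AWS credentials and the dataset staged in S3 with the annotation channel layout the built-in algorithm expects.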

The validation mIoU looks decent at 47%. It is much lower than the Pascal VOC 2012 benchmark of 79.7% published in the DeepLab paper, which is somewhat expected, as segmenting outfit types is much harder than segmenting rigid objects such as cars, people, bicycles, etc. First, clothes are highly deformable objects, which introduces very high variance (the same clothes may look different due to wrinkles and different body postures). The CNN filter, the basic building block of every object detection model, does not perform well under these conditions.

Second, outfits of different types may look alike. E.g. a long sleeve top looks similar to a short sleeve top when the sleeves are rolled up, and sometimes you can hardly distinguish between trousers and a skirt from a side view (see the 6th image below).

Testing on faAi Clothes Diary images shows similar results on images with good lighting and little occlusion (carrying a bag, holding a laptop, a baby, etc).

The first three columns show good detections. The last three columns show bad detections.

The goal here is to generate a texture fingerprint for every outfit detected by the object detection module. A fingerprint is just a vector with X dimensions. Two outfits with similar textures will yield very similar fingerprint vectors. In mathematical terms, the more similar the textures are, the smaller the Cosine Distance between their fingerprint vectors.

Finger print vectors of various textures. α1, α2, α3 represent the cosine (angle) distance between the query and each respective texture. Texture 1 is the most similar to the query texture as α1 is the smallest of the three.
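As a minimal illustration (the fingerprint vectors below are made up, and much shorter than the real ones), cosine distance is just one minus the cosine similarity of the two vectors:

```python
import numpy as np

def cosine_distance(v1: np.ndarray, v2: np.ndarray) -> float:
    """Cosine distance = 1 - cos(angle between v1 and v2).

    0 means the fingerprints point in the same direction (same texture);
    values near 1 mean they are unrelated.
    """
    return 1.0 - float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Toy fingerprints: 'query' is closer to 'texture1' than to 'texture2'.
query = np.array([1.0, 0.9, 0.1])
texture1 = np.array([0.9, 1.0, 0.2])
texture2 = np.array([0.1, 0.2, 1.0])

assert cosine_distance(query, texture1) < cosine_distance(query, texture2)
```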

The Gram matrix is a well-established technique for building texture fingerprints. What makes it special is that it lets us control which spatial domain/aspect of an image it should pay attention to when generating a fingerprint. E.g. are you more interested in uniquely identifying the outline pattern (higher spatial domain) of a floorboard, or just the texture (lower spatial domain) within each floorboard piece?

Similarity at a different spatial domain

Thanks to this ability, the Gram matrix is widely used in image style transfer. One popular application is to replace the style of a painting, e.g. from Van Gogh to Pablo Picasso, or to turn an ordinary photo into a Van Gogh-style painting. The Gram matrix is used in the loss function to measure how similar the texture style of the generated painting is to the target style, and to penalise the model during training for any differences.

Neural Style Transfer to change the style on an input image

As you can probably guess, I steer the Gram matrix to focus on the cloth texture to get a texture fingerprint. I crop a rectangular image area within each outfit region as the input to the Gram matrix. The selection is done by finding an 80x80 pixel rectangle which lies fully inside the outfit region (by inspecting the pixel values in the PNG file). When the selection yields multiple candidate areas (which is most of the time), priority is given to the area closest to the centre of the outfit region. The reasoning is that the system will then always pick the same part of an outfit, which gives better consistency for fingerprint generation. More often than not, the centre part of an outfit also yields a better quality Gram matrix due to fewer wrinkles, folds and occlusions. This assumption works pretty well given that our camera view is static and the majority of captures are a frontal view of her standing straight.

The dashed white rectangle shows the bounding box of an outfit pixel region, whilst the red rectangle shows the 80x80 pixel area at the centre of the bounding box, selected as the texture input to our Gram matrix.
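Here is a minimal, naive sketch of that centre-biased patch selection, assuming the segmentation PNG has been loaded as a 2-D label array (the function name and brute-force search are mine, not the production code):

```python
import numpy as np

def select_texture_patch(mask: np.ndarray, label: int, size: int = 80):
    """Return (row, col) of the top-left corner of a size x size patch
    lying fully inside the region `mask == label`, preferring the patch
    closest to the region's centre. Returns None if nothing fits."""
    region = (mask == label)
    ys, xs = np.nonzero(region)
    if len(ys) == 0:
        return None
    cy, cx = ys.mean(), xs.mean()  # centre of the outfit region

    best, best_dist = None, np.inf
    # Brute-force scan; fine for a sketch, too slow for production.
    for r in range(mask.shape[0] - size + 1):
        for c in range(mask.shape[1] - size + 1):
            if region[r:r + size, c:c + size].all():  # fully inside the outfit
                d = (r + size / 2 - cy) ** 2 + (c + size / 2 - cx) ** 2
                if d < best_dist:
                    best, best_dist = (r, c), d
    return best

# Toy indexed mask: label 3 fills a 100x100 square of a 120x120 image.
mask = np.zeros((120, 120), dtype=np.uint8)
mask[10:110, 10:110] = 3
print(select_texture_patch(mask, label=3))  # top-left of a centred 80x80 patch
```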

Granted, we could probably improve this by cropping areas around the same body parts, e.g. shoulder, chest or waist, using Human Pose Estimation as a hint to where those parts are. However, due to time constraints, I will leave this for a future iteration.

Amazon Sagemaker does not have a built-in algorithm for the Gram matrix, so I had to build one using a custom container. I built the Gram matrix module using Keras with VGG19 as the base network, tapping into the first two CNN filter blocks (block_1, block_2) to generate the texture fingerprint. These were chosen through a bit of trial and error, plus the solid hypothesis that the outputs of the earlier CNN filters contain information about the low-level aspects of an image (like textures).

Texture Finger Printing Diagram

Generating the fingerprint itself is as simple as computing the Gram (feature correlation) matrix from the output of each filter block and flattening it into a vector. As we have two filter blocks, we end up with two vectors, which we simply concatenate into our final fingerprint vector.
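A minimal numpy sketch of that computation. The feature maps here are random stand-ins; in the real module they would be VGG19 block1/block2 activations, which have 64 and 128 channels respectively:

```python
import numpy as np

def gram_fingerprint(features: np.ndarray) -> np.ndarray:
    """Compute a flattened, normalised Gram matrix from a feature map of
    shape (H, W, C), e.g. the activations of a VGG19 conv block."""
    h, w, c = features.shape
    f = features.reshape(h * w, c)  # each row: channel activations at one position
    gram = f.T @ f / (h * w)        # (C, C) correlations between channel pairs
    return gram.flatten()           # the fingerprint piece for this block

# Random stand-ins for the two VGG19 block outputs (64 and 128 channels).
rng = np.random.default_rng(0)
block1 = rng.standard_normal((56, 56, 64)).astype(np.float32)
block2 = rng.standard_normal((28, 28, 128)).astype(np.float32)

# Concatenate the per-block fingerprints into the final fingerprint vector.
fingerprint = np.concatenate([gram_fingerprint(block1), gram_fingerprint(block2)])
print(fingerprint.shape)  # (20480,) -- i.e. 64*64 + 128*128
```

Note that 64² + 128² = 20,480, which is exactly where the fingerprint dimensionality mentioned later comes from.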

The reason the Gram matrix works is that the correlations between the presence of low-level features uniquely identify a texture's characteristics. Anyone curious about the technical details can read the research paper here.

To test the system, I cropped 100 texture images from the DeepFashion dataset and the faAi Clothes Diary and generated a texture fingerprint for each. Next, 50 images were selected as query images and a Cosine Distance score was calculated against the rest of the images. For each query texture, the 5 textures with the smallest Cosine Distance were then shown as matching results (sorted with the most similar first).

The results were quite good, with a matching success rate of around 85% on clear texture images with good brightness and contrast. As you can see from the image below, given a query image, the first two results show correct matches.

Similar textures returned by the Gram Matrix for each query image (left column). The query images for the first three rows are from the DeepFashion2 dataset, while the last three are from the faAi Clothes Diary.

That completes the texture fingerprinting process. However, there is one issue: the resulting fingerprint vector has 20,480 floating point dimensions, which is too impractical to store in our database. Remember that later I need to tag each outfit found in a photo with this long vector? Querying this long vector against every existing outfit in the database to search for a match would be super inefficient.

So I added a post-processing module to reduce the dimensionality of the fingerprint vector from 20,480 down to only 128 numbers, with nearly no drop in accuracy, using a simple PCA trained on 750 fingerprint vectors generated from the DeepFashion dataset. These 128 numbers can easily be stored as a string.
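A sketch of that reduction with scikit-learn's PCA. The fingerprints here are random stand-ins and deliberately shorter than the real 20,480-dimensional ones, just to keep the example light:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-ins for the 750 training fingerprints (2,048 dims here instead
# of the real 20,480, purely to keep the example small).
train_fingerprints = rng.standard_normal((750, 2048))

# Fit PCA once on the training fingerprints, then reuse it for every
# new fingerprint produced by the Gram matrix module.
pca = PCA(n_components=128).fit(train_fingerprints)

new_fingerprint = rng.standard_normal((1, 2048))
reduced = pca.transform(new_fingerprint)
print(reduced.shape)  # (1, 128)

# The 128 numbers can then be serialised as a string for the catalogue.
as_string = ",".join(f"{x:.6f}" for x in reduced[0])
```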

Given the outfit type from our object detection and the fingerprint vector from our Gram matrix, our job now is to build a search module which looks for a matching outfit in our Clothes Catalogue database. If no match is found, we add the outfit as a new entry. Note that the fingerprint vector and outfit type are saved as metadata of the image.

The matching is simply done by querying the database (Clothes Catalogue) for the same type of outfit and calculating a similarity score against each query result using the fingerprint vectors. As stated earlier, the similarity score is simply the Cosine Distance between the two vectors in their unit form. I used the SKLearn KNN module for this calculation.

CosineDistance = 1 - (V1 · V2) / (|V1| |V2|)

The existing outfit with the smallest Cosine Distance to the query outfit is considered a match, but only if the score is below a threshold. The diagram below shows the overall matching pipeline.

The overall outfit matching pipeline.
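A minimal sketch of this threshold-gated nearest-neighbour match using scikit-learn's KNN module (the catalogue fingerprints and ids below are made up, and much shorter than the real 128-dimensional vectors):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

THRESHOLD = 0.1  # maximum cosine distance to accept a match

def match_outfit(query_fp: np.ndarray, catalogue_fps: np.ndarray, catalogue_ids: list):
    """Return the catalogue id of the closest fingerprint, or None if
    even the best candidate is above the cosine-distance threshold."""
    knn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(catalogue_fps)
    dist, idx = knn.kneighbors(query_fp.reshape(1, -1))
    if dist[0, 0] < THRESHOLD:
        return catalogue_ids[idx[0, 0]]
    return None  # no match -> the caller creates a new catalogue entry

# Toy 4-dim fingerprints for two catalogued skirts.
catalogue = np.array([[1.0, 0.0, 0.2, 0.1],
                      [0.0, 1.0, 0.1, 0.3]])
ids = ["skirt-1", "skirt-2"]

print(match_outfit(np.array([0.98, 0.02, 0.2, 0.1]), catalogue, ids))  # skirt-1
print(match_outfit(np.array([0.5, 0.5, 0.5, 0.5]), catalogue, ids))    # None
```

In production the `catalogue_fps` array would hold only the fingerprints returned by the outfit-type query, so the KNN search stays small.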

Now that I have all the components I need, it's time to put on my software architect hat and glue everything together. Given the low utilisation of this system, with only around 20 images to process a day, a batch architecture is more cost effective than a REST API running 24/7.

First, I created a Lambda function (scheduled to run daily) which executes my object detection as an Amazon Sagemaker Batch Transform, processing all the new images captured by the faAi Clothes Diary. The resulting PNG files are uploaded into another S3 bucket. Processing 20 images takes less than 1 billable minute, costing me less than 3 cents a day. Awesome!

Next, I created another Lambda function to zip each PNG (paired with its respective JPG image) and store the archives in another S3 bucket as input to the Gram matrix Batch Transform. This Lambda function is set up to be triggered on completion of the previous Amazon Sagemaker Batch Transform job.

I created yet another Lambda function to trigger the Amazon Sagemaker Batch Transform of my Gram matrix, which processes all the zip files (PNG and JPG) and outputs a fingerprint vector as a JSON file, one for each outfit found in the PNG file. This Gram matrix Lambda costs me less than 2 cents a day and is triggered on completion of the previous Lambda.

Finally, I created one more Lambda function to perform the Matching Outfit Search, triggered by the completion of the Gram matrix Lambda. The process is pretty straightforward. For each outfit (we will call this the query outfit for simplicity), the outfit type is used to query the Clothes Catalogue. For each outfit in the search result, calculate the Cosine Distance between its fingerprint vector and the query outfit's. If the resulting value is above the threshold (which I chose to be 0.1 through trial and error), skip it as it is not a match. From all the matched outfits, pick the one with the smallest Cosine Distance as the final match and return its catalogue Id. If there is no match, simply save the outfit as a new catalogue entry and return its new catalogue Id.

With the catalogue Id, we can now tag this outfit entry in our Clothes Diary with its identifier.

Outfit Matching architecture diagram

For testing, I ran the system against three months' worth of captured images in the faAi Clothes Diary, tagging each identified outfit with an id: the same id if the clothing had been seen previously, and a new id otherwise. I then measured the identification accuracy, defined as the success rate of the system in matching an outfit to an existing one. Each outfit appears 5 times on average.

Accuracy on image sets with good lighting, a frontal view and little occlusion is decent at 50%; however, the overall identification accuracy across all image sets is pretty low at 15%.

The accuracy drop, even on the good image sets, is primarily attributed to the following:

  • Mis-identification of the outfit type during the object detection phase, mainly due to confusion between several difficult classes, e.g. long sleeve top vs short sleeve top, skirt vs shorts, as explained in the previous section.
  • Bad texture area sampling. Though not very often, the Gram matrix module sometimes picked areas outside the clothes, e.g. a hand, phone or bag. This happens much more often in non-frontal photos, especially with more challenging body postures. It is because most of the DeepFashion training annotations include hands, phones, bags, etc. as part of the outfit, which causes the trained model to behave the same way.
Bad sampling of texture

Mis-identification in the object detection phase is a hard problem to solve without building a different, and quite possibly inventing a new, object detection architecture.

One idea is to incorporate body posture information into the model to normalise an outfit's deformations, such as sleeves rolled up around the elbow. The same technique could also improve the texture sampling, letting us choose better patches by avoiding limbs and poses which are not ideal.

Another interesting idea is to drop the texture sampling process altogether and instead generate the fingerprint from the entire outfit region, minimising the chance of picking a bad sample, and train an end-to-end outfit matching model. Furthermore, we could fuse body posture information with the outfit region to give more local context around the different parts of the outfit, producing a fingerprint which is not biased towards the location where it was sampled.

Last, but surely the easiest (and what I have actually done), is asking Yumi to give a perfect pose with a clear view of her outfit. I know it is a little annoying for her; however, this will definitely help combat both outfit mis-identification and bad texture sampling, as the Gram matrix fingerprinting itself already yields 85% accuracy given good texture samples. I should see an accuracy improvement in the next few weeks, and summer time will surely help a lot in the lighting department!

It's been a pretty challenging phase and I don't consider this solved yet, as I will need a much higher identification accuracy, above 80%, to build a recommendation system. However, I am pretty happy, as I've learned a lot over the past few months!
