The problem: accented characters versus user input
I’ve really been enjoying (wait for it, unpaid endorsement) The Criterion Channel since it launched earlier this year. The app and the website are nascent and therefore have had their share of problems, but to their credit they have been hard at work making things smoother day by day.
But one big annoyance is the searchability of films. Recently I found a Reddit post revealing another search problem: poor matching of titles that contain accented characters. Searching for ‘samourai’, for example, does not bring up ‘Le Samouraï’ in the results.
The Criterion Collection is home to many foreign films, so it’s natural that many of the titles have accents, which makes this a bit unfortunate for their users.
First steps at a solution
I used this as an opportunity to learn something new! I realized that I didn’t immediately know how to solve this either. A simple regexp was no help:
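As a rough sketch of the failing approach (the variable names are illustrative, not taken from the Criterion Channel's actual code), a case-insensitive regexp built from the user's input does not match the accented title:

```javascript
// Hypothetical names for illustration only.
const title = 'Le Samoura\u00ef'; // 'Le Samouraï', with a composed 'ï'
const query = 'samourai';

// A case-insensitive regexp built from the user's query.
const pattern = new RegExp(query, 'i');

// 'i' (U+0069) and 'ï' (U+00EF) are different characters,
// so the pattern never matches the title.
console.log(pattern.test(title)); // false
```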
I had to go digging a bit to figure out how to make progress…
This great answer by Lewis Diamond on Stack Overflow led the way to String.prototype.normalize(), but before we jump into that, we need to take care of some prerequisites with a super brief Unicode overview.
Like many other Unicode characters, what appears to be just one simple character 'ï' can actually be represented in two different ways which are displayed identically as 'ï':
'\u00ef' (‘Latin Small Letter I with Diaeresis’)
'\u0069\u0308' (‘Latin Small Letter I’ plus ‘Combining Diaeresis’)
Unicode generally calls the first the composed/precomposed form, while the second is decomposed into two symbols (‘i’ and the diacritic).
Side note: with Unicode, things are not always as they appear…
What is somewhat strange is that although the two forms display exactly the same way and are considered equivalent by Unicode, in JavaScript they are not equal:
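A quick check, sketched here with variable names of my own choosing, shows the difference:

```javascript
const composed = '\u00ef';         // 'ï' as a single code point
const decomposed = '\u0069\u0308'; // 'i' followed by a combining diaeresis

// Strict equality compares code units, not what is rendered on screen.
console.log(composed === decomposed);            // false
console.log(composed.length, decomposed.length); // 1 2
```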
Removing diacritics
What is most interesting for us here is that we can actually take the decomposed form '\u0069\u0308' and remove the diacritic (the two dots) simply by removing the symbol '\u0308' (‘Combining Diaeresis’):
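For instance, as a small sketch with illustrative variable names:

```javascript
const decomposedI = '\u0069\u0308'; // displays as 'ï'

// Dropping the combining diaeresis leaves the bare base letter.
const plainI = decomposedI.replace('\u0308', '');

console.log(plainI);         // i
console.log(plainI === 'i'); // true
```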
That’s looking a lot closer to what the original user input was, and will make things easier to match. Sort of like converting two strings using toLowerCase() to check their case-insensitive equivalence.
It turns out that the combining diacritics we care about here sit in a predictable Unicode range (the Combining Diacritical Marks block, U+0300 to U+036F), so we can easily remove them with a regexp character range:
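A minimal sketch of that regexp, assuming the input is already in decomposed form (the sample string and names are mine):

```javascript
// U+0300 to U+036F is the Combining Diacritical Marks block.
const combiningMarks = /[\u0300-\u036f]/g;

// Decomposed input: 'i' + combining diaeresis, 'e' + combining acute accent.
const decomposedText = 'nai\u0308ve cafe\u0301';

console.log(decomposedText.replace(combiningMarks, '')); // naive cafe
```

Marks outside this block are not touched by this range, which is fine for the accents discussed here.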
One big catch though - this will only work for decomposed symbols. What happens if we get a composed symbol? We’d be stuck, since we can’t break it apart into separate character and diacritic symbols…
…or can we?
String.prototype.normalize() to the rescue!
This handy function can convert Unicode characters between their composed and decomposed forms:
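As a short sketch: passing 'NFD' asks for the canonical decomposed form, and 'NFC' for the canonical composed form (variable names are illustrative):

```javascript
const composedForm = '\u00ef';         // single code point 'ï'
const decomposedForm = '\u0069\u0308'; // 'i' + combining diaeresis

// NFD splits the composed character into base letter + combining mark.
console.log(composedForm.normalize('NFD') === decomposedForm); // true

// NFC recombines them into the single composed code point.
console.log(decomposedForm.normalize('NFC') === composedForm); // true

console.log(composedForm.normalize('NFD').length); // 2
```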
This normalization allows us to easily convert to decomposed form, so our regexp above will correctly remove diacritics. We can now put together the pieces (pun intended) and make a basic helper function to remove accents (diacritics):
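A minimal sketch of such a helper follows. The name removeAccents is my own choice for illustration, and the range only covers the Combining Diacritical Marks block:

```javascript
// Hypothetical helper name; strips combining marks in U+0300 to U+036F.
function removeAccents(text) {
  return text
    .normalize('NFD')                 // split letters from their diacritics
    .replace(/[\u0300-\u036f]/g, ''); // drop the diacritics
}

// Works whether the input is composed or already decomposed.
console.log(removeAccents('Le Samoura\u00ef'));  // Le Samourai
console.log(removeAccents('Le Samourai\u0308')); // Le Samourai
```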
Complete example code
Now we can put it all together into a working example that would solve the original problem with user input in the Criterion Channel’s website:
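Here is one way this could look, as a sketch under stated assumptions: the title list, the function names, and the substring-matching strategy are all hypothetical and are not how the Criterion Channel's search is actually implemented.

```javascript
// Hypothetical catalogue, for illustration only.
const titles = ['Le Samoura\u00ef', 'La Jet\u00e9e', 'Seven Samurai'];

// Decompose, then strip combining marks in U+0300 to U+036F.
function removeAccents(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Apply the same normalization to both the query and each title.
function normalizeForSearch(text) {
  return removeAccents(text).toLowerCase();
}

// Return every title whose normalized form contains the normalized query.
function searchTitles(query, catalogue) {
  const needle = normalizeForSearch(query);
  return catalogue.filter((title) => normalizeForSearch(title).includes(needle));
}

console.log(searchTitles('samourai', titles)); // [ 'Le Samouraï' ]
console.log(searchTitles('Jetee', titles));    // [ 'La Jetée' ]
```

Because the query goes through the same normalization, a user who does type the accent (‘samouraï’) gets the same result.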