Guest Post: How AI can offset the costs of hotel supply data duplication

CodeGen software engineer Dasun Pubudumal unravels the complexities of hotel mapping

Dasun Pubudumal, a software engineer at travel technology supplier CodeGen, explains the essential principles behind de-duplicating hotel supply in travel’s mixed and varied supply chain

For tour operators that adopt a multi-supplier strategy for sourcing rooms, data management is still a major challenge.

This approach can cause duplication as the same accommodation is returned from direct sources, bed-banks and hotel switches in a single search.

Robust reconciliation and discrepancy detection are required to prevent misallocation, which can have financial implications.

The hotel supply chain is varied and complex. Tour operators may have their own paper contracts loaded into a reservation system, but they may also source rates for the same hotel through third-party API integrations with that reservation system, for instance with Expedia or Hotelbeds.

When data is fed into a system from multiple sources, it’s essential to figure out where there are duplicates.

In most cases, there is no magic ID that determines whether two properties are the same.

Although there are initiatives such as TTIcodes aiming to solve this issue, they are yet to be adopted by the wider market.

There is a range of typical attributes in your data that can be used to differentiate between properties, and they can be categorised as follows:

  • Nomenclature-related attributes: Each hotel has a name, but note it may not be unique;
  • Location-related attributes: The position of the hotel, reflected by its address and coordinates (latitude and longitude). These may also not be unique, given that each agency system represents addresses differently, and the latitude/longitude pair may not be 100% accurate in every system;
  • Amenity-related attributes: Each hotel has its own amenities, but these are far from unique;
  • Other attributes: These include contact information and star rating (4-star, 5-star, etc).

It seems that there are enough groups of attributes for a system to distinguish two hotels. It is not too difficult, in most cases, to discern whether two hotels are different.

What’s difficult is figuring out whether two hotel entities are the same. If the names of the two hotels are drastically different then it’s easy.

It’s much harder when the difference between the hotel names is minute. To handle such cases, we need to quantify the difference.

How to Quantify the Difference

Let’s say that we need to calculate the difference between two hotels, “Hotel California, United States” and “California Peek Hotel, United Kingdom”.

We can create an algorithm to quantify the differences between the hotel names, or strings.

By looking at the two strings, we can identify one characteristic our algorithm should have: a notion of the common word count between the two strings.

When the number of common words shared by the two strings is higher, it is reasonable (though not certain) to estimate that they refer to the same property, and vice versa. Another useful characteristic is the length of the two strings.


  1. “Hotel California, United States” (4 words)
  2. “California Peek Hotel, United Kingdom” (5 words)

The two names share 3 common words: “Hotel”, “California” and “United”.

Using these notions, people have developed many algorithms to quantify this difference.

Such algorithms include cosine similarity, Levenshtein distance and the Jaro-Winkler algorithm.

Each has its own strengths and caveats, and therefore should be used with care following a thorough analysis.
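As an illustrative sketch (not any particular library’s implementation), the common-word idea and Levenshtein distance from above can be computed in a few lines of Python:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def common_word_similarity(a: str, b: str) -> float:
    """Share of words the two names have in common (Jaccard index)."""
    wa = set(a.lower().replace(",", "").split())
    wb = set(b.lower().replace(",", "").split())
    return len(wa & wb) / len(wa | wb)

print(levenshtein("kitten", "sitting"))  # -> 3
print(common_word_similarity("Hotel California, United States",
                             "California Peek Hotel, United Kingdom"))  # -> 0.5
```

For the two example hotels, 3 of the 6 distinct words are shared, giving a similarity of 0.5: high enough to warrant a closer look, but far from a confident match.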

Noise and Pre-processing

Data is always imperfect. There may be ‘noise’ that is difficult to identify. Noise can stem from human error during data entry, be picked up from a third-party integration, or the data may simply be incomplete or degraded over time.

One way to mitigate this issue is to use only a narrow subset of the data that is essential to guide the algorithm, and drop the rest.

Stop-word removal is one such method: we remove commonly used words that do not help distinguish between strings.

Domain-related words such as “Hotel”, “Villa” and “Resort”, hotel group tags (“Emirates”, “Hilton”, “Marriott”, etc), and language stop words such as “a”, “the” and “an” can all be removed in a pre-processing stage, before the algorithm runs.
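A minimal pre-processing sketch might look like this; the stop-word list is illustrative only, and a production list would be far larger and derived from the operator’s own data:

```python
# Illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {
    "a", "an", "the", "and", "of",     # language stop words
    "hotel", "villa", "resort",        # domain-related words
    "emirates", "hilton", "marriott",  # hotel group tags
}

def normalise(name: str) -> str:
    """Lower-case a hotel name, strip commas and drop stop words."""
    words = name.lower().replace(",", " ").split()
    return " ".join(w for w in words if w not in STOP_WORDS)

print(normalise("The Hilton Hotel, Paris"))  # -> "paris"
```

After normalisation, the matching algorithm compares only the words that actually carry distinguishing information.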

Same Name, Different Locations

As it’s highly likely that two suppliers will use the same name for distinct properties, identifying hotels by name alone can cause problems.

One way of resolving this is to use location-related attributes in addition. Tour operators might use the city, state or country name, or the latitude/longitude pair.

However, different agencies have their own location hierarchies through which they assign locations to properties.

Using common string-matching techniques and shared location data, a well-designed algorithm can build a map that assigns each city coming from the different agency systems to one generic location, which is then maintained in the tour operator’s system.

These generic locations are then used as pivot points to identify the hotel name pairs for the matching process.

For two hotels to be matched, they should be in the same generic location. The pairs are then fed into the algorithm to match other attributes such as hotel names.
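The pairing step above can be sketched as a simple grouping (or “blocking”) pass. The supplier records and generic location IDs below are made up for illustration:

```python
from itertools import combinations

# Hypothetical records: (supplier, hotel name, generic location ID).
hotels = [
    ("direct",  "Hotel California",      "us-ca-losangeles"),
    ("bedbank", "California Hotel",      "us-ca-losangeles"),
    ("switch",  "California Peek Hotel", "uk-london"),
]

def candidate_pairs(records):
    """Group records by generic location, then pair only within each group."""
    by_location = {}
    for rec in records:
        by_location.setdefault(rec[2], []).append(rec)
    for group in by_location.values():
        yield from combinations(group, 2)

# Only the two Los Angeles records form a pair; the London hotel is
# never compared against them.
print(len(list(candidate_pairs(hotels))))  # -> 1
```

Grouping by generic location first also keeps the pair count manageable: without it, every hotel would have to be compared against every other hotel.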

Machine Learning

As we’ve seen above, each hotel entity is paired, and a pair acts as a single input for our matching algorithm which scores each input.

Each score can be normalised to the [0, 1] range, where 1 represents a perfect match. The algorithm itself may be either a fuzzy rule-based algorithm or a machine learning model, depending on your data.

What’s different is that for a fuzzy algorithm, you set the threshold (cut-off) levels in advance.

That is, if we consider only the hotel names, we might impose a rule that a perfect match is any pair whose name similarity exceeds 0.95. These thresholds are determined empirically.

In machine learning, however, we let the algorithm figure these parameters out itself by feeding it (training it on) a large volume of data.

This data has to cover almost all of the ground the input data may cover: that is, the training data needs to contain as many of the intricacies that could appear in the input data as possible.

For example, the machine learning algorithm (model) needs to be trained (or “taught”) on noisy data, incomplete data, and the like for an accurate output.

Common algorithms such as decision trees (and ensembles of trees, such as random forests), support vector machines and neural networks can be experimented with. Choosing the right model is always empirical.
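Whatever model is chosen, each candidate pair first has to be turned into a feature vector. The sketch below shows one plausible set of features (name similarity, geographic distance, star-rating difference); the field names and weights of any real system would differ:

```python
import math

def name_similarity(a: str, b: str) -> float:
    """Share of words two names have in common (Jaccard index)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def geo_distance_km(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance between two points (haversine)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def features(h1: dict, h2: dict) -> list:
    """Feature vector for one candidate pair: the model's single input."""
    return [
        name_similarity(h1["name"], h2["name"]),
        geo_distance_km(h1["lat"], h1["lon"], h2["lat"], h2["lon"]),
        abs(h1["stars"] - h2["stars"]),
    ]

h = {"name": "Hotel California", "lat": 34.05, "lon": -118.24, "stars": 4}
print(features(h, h))  # -> [1.0, 0.0, 0]
```

Vectors like these, labelled as match or non-match by a human, become the training data for the model.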

False Positives

No machine learning algorithm is 100% accurate. Even as humans, our decisions aren’t correct in their entirety.

What would happen if the algorithm determines that a pair of hotel entities which are very close to being similar, but are not physically the same property, are the same entity?

The impact is mainly financial. A false positive, going back to basics, would wrongly identify those two hotels (from two sources) as the same.

That is, the tour operator would represent the two hotels in its systems as a single entity provided by one supplier.

Assume X and X’ are two hotels that our algorithm falsely returned as a match. If S (a supplier record created in the tour operator system) was initially created using X’s data, this would mean that reserving a room in X’ is treated the same as reserving a room in X in the tour operator system.

To mitigate such false positives, it is imperative to place a fuzzy system in front of the core algorithm to filter out incorrect results coming from the core. Otherwise, the financial repercussions can be devastating.

In addition, manual intervention might be necessary for uncertain outputs. Based on the thresholds of the fuzzy system(s), the outputs can be categorised into levels such as correct matches, possible matches and non-matches.

According to these criteria, one can design a technique for manual intervention.
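Such a triage step might look like the following sketch, where the threshold values are purely illustrative and would be tuned empirically:

```python
def triage(score: float, accept: float = 0.95, review: float = 0.70) -> str:
    """Route a normalised match score to an outcome band."""
    if score >= accept:
        return "match"            # link the records automatically
    if score >= review:
        return "possible match"   # queue for manual review
    return "non-match"            # keep as separate properties

print(triage(0.98), triage(0.80), triage(0.20))
# -> match possible match non-match
```

Only the middle band reaches a human, which keeps the manual workload proportional to the genuinely ambiguous cases.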

In conclusion

Record linkage problems tend to have the following steps:

  1. Data collection
  2. Pre-processing
  3. Pair generation
  4. Feature extraction
  5. Matching
  6. False-positive mitigation

It is a very real concern that such a process is hefty in terms of both memory and computation, and careful engineering is required to keep these costs under control.

However, it must be conveyed that there always exists a trade-off between accuracy and performance (especially speed).

The opportunity cost of a highly accurate algorithm would be the speed of its execution.

That is the case here: the financial implications of false positives would be devastating, so it may be necessary to sacrifice speed for an accurate result.

Overall, the accuracy of the process largely depends on the number and the quality of attributes you choose for hotel entities to be discerned by the algorithm, and the type of the algorithm itself.

About the author: Dasun Pubudumal is a software engineer at CodeGen. He is passionate about all things tech, particularly data structures and algorithms, artificial intelligence and machine learning.

CodeGen is a leading technology provider of end-to-end travel software solutions. TravelBox, its flagship product, is used by some of the biggest travel companies in the world. CodeGen also provides AI-driven solutions (the Lia chatbot, Review Spotter, Revenue Manager and the Inspire personalisation engine) as a combined solution or as standalone products to help clients boost conversion rates throughout the customer journey.