Sign Up For The First Ever Low-Key Data Conference15 speakers. 4 hours. Virtual event. No frills. Join us on 3/22/23 from 1-5pm EST - Click here to sign up!.
Long-tail data can help organizations explore new ways to grow, whether in terms of product sales and revenue or traffic and visibility.
This post aims to explore long-tail data in detail with real-life examples and applications while explaining how businesses benefit from it.
Long-tail data is visually represented by a hyperbolic curve, like the graph of a reciprocal function.
The foundation of the long-tail concept plots different data values to their respective frequencies. The area marked in green is the "head" of the graph which includes a small number of high-frequency items. The area in yellow is the long 'tail' of the graph which includes huge numbers of less frequent items.
Such long-tail curves have been observed for ages in mathematical functions and frequency distributions, like the power law and the 80-20 rule, also known as Pareto distribution.
Even the somewhat mysterious Zipf's law demonstrates long-tail patterns in the frequency of word occurrences in various languages.
The application of long-tail in the context of business and industries was highlighted by writer-editor Chris Anderson.
In 2004, Chris Anderson published an article titled "The Long-tail" in Wired, of which he was also the editor-in-chief.
Then in 2006, he published the book "The Long-tail: Why the Future of Business is Selling Less of More," which went deeper into the concepts touched upon by his previous article.
In his magazine article and later in the book, Chris Anderson builds upon the concept of long-tail through the following key points and highlights:
Before the internet and digital technology, distributors had limited shelf space for products. Even things like music and movies were limited to a fixed number of screens in theatres or particular slots and timings on TVs and radios.
So, only selected products were marketed extensively, which would have a mass appeal. As such, even consumers were limited in terms of what they could choose to consume.
The rise of digital technology has enabled businesses to create, promote, distribute, and sell as many products as desired. There are no limits on "shelf space" in online stores. And digital products can be infinite or limitless.
This allows businesses to market long-tail products without limitations, and the relevant customers can also find such products easily in the online domain. Because of these reasons, the number of low-volume, low-demand products is continuously rising, which means the long-tail of distribution curves keeps growing longer.
The most important point is that all the products that fall within this long tail can collectively have an equal or even greater impact than the high-demand products. For example, the total revenue earned from the sales of billions of copies of 10 bestseller books could be less than the revenue of a million lesser-known books that only sold a few hundred copies each.
So long-tail data is the collection of all data about items that serve a specific niche and have a low demand but exist in greater varieties and larger quantities. Instead of prioritizing products, services, information, or content consumed and purchased by millions, businesses can benefit more from leveraging these low-volume items that reside within the long-tail.
Long-tail effects have completely changed the way businesses operate in the last decade or so in the following ways:
For many years, the success formula in retail models was to sell the most popular items with mass appeal and high consumer demand. But with long-tail data, retailers can now successfully sell niched products to small groups of targeted customers.
For example, Lefty's: The Left Hand Store -- a popular retail store in the US that creates and sells products only for left-handed people -- has expanded its reach nationwide through its online store.
Since products within the long tail also contribute significantly to the total revenue of a business, data pertaining to such low-volume items must also be considered by businesses when tracking metrics and KPIs to measure performance and growth.
One of the notable long-tail effects of long-tail data is that businesses can no longer ignore the importance of items with a smaller but targeted appeal, lest they miss out on potential revenue opportunities. So long-tail data is essential to sales and marketing strategies.
Acknowledging the potential of niched items in the market inspires companies to establish successful businesses around new and innovative products. Sometimes, there isn't any existing demand for certain. For example, many companies profit from selling 'levitating plants' on Amazon.
Businesses are no longer compelled to compete for the top spot and go after the bestselling products. They can target products in the long tail with less competition and still make good revenue.
Pareto Principle --- Italian engineer Vilfredo Pareto introduced the Pareto distribution, popularly known as the 80/20 Rule.
It states that 20% of causes result in 80% of outcomes, while the other 80% of causes only lead to 20% of outcomes.
In the business context, this would imply that 80% of revenue should come from the top 20% of products.
Before the internet, all the buying, selling, and consumption was offline. Retailers and distributors could only sell finite sets of products, which meant consumers also had limited buying options.
Only bestselling, popular products were put in the front of the market. That is why a small collection of products accounted for most of sales and revenue, and Pareto's distribution would hold true.
Long-tail products --- Anyone can sell and buy anything in today's era.
So businesses, producers, and content creators keep adding more products and content on the internet, even if they cater to a select few people.
And those very few people can also easily find such relevant products. This constant generation and consumption of new products have led to a significant increase in the length and thickness of the 'tail' in distribution curves.
That is why those 80% of items in the long-tail can very well account for more than 50% of sales and revenue. The Pareto distribution might not apply to modern-day businesses, and the long-tail model has effectively taken over.
"Big data" is a large umbrella term when it comes to datasets. It involves a massive collection of unstructured and structured data relevant to every aspect of a business.
Big data must be organized, categorized, and analyzed to provide value, insight, or benefit to the organization.
Long-tail data is not different than big data but rather a subset of it.
Consider the example of a popular blog that consistently draws in thousands of visitors and readers every month. Say they have every metric, number, and figure that they collect from web analytics tools, like Google Analytics, as their big dataset.
Now if they extract the marketing data such as page views on all their blog posts, they might realize that the total views on their 50 most popular posts are still lower than the collective page views of all other remaining posts, which would make up their long-tail data.
So it is obvious that big data has greater and broader applications, and long-tail data is one aspect of it. Data scientists can use long-tail data to train machine learning models in identifying niche markets. Such models can be used by businesses to determine potential opportunities for respective niched products.
There are many ways to use long-tail data in different industries and business areas.
Search engine optimization (SEO) is one of the first applications of long-tail data in marketing. In SEO, the term "long-tail keywords" is heard often. It refers to long keywords that address extremely specific, niched queries, for example, "how to grow tomatoes during winters in Texas."
Long-tail keywords have low search volumes but also low competition. They're easy to rank for a while, bringing decent website traffic.
Long-tail data can also identify the preferences of a certain target audience, which in turn helps businesses create personalized messages for those audiences through email or social media campaigns.
Long-tail data can help businesses determine the demand for niche products and find effective ways to make them visible to their target audience. It can also be used to improve and optimize product recommendations.
For example, customers who have purchased a flyfishing guidebook could also be recommended flyfishing poles. And, of course, the idea of targeting long-tail keywords also applies to e-commerce businesses.
Since long-tail data consists of items that are uncommon, niched, and low demand, including such data improves the overall data granularity of predictive algorithms, making them more accurate over time.
Training datasets containing probability distributions with long-tail curves can particularly be useful in developing algorithms that can predict the success of new niched products before they hit the market.
Classifiers have already become an essential component of many present-day technologies. Image recognition is an integral aspect of self-driving cars, and our smartphones are capable of facial recognition too. Long-tail models could be used for a training dataset, such classifiers to identify and label obscure items that are not commonly seen and encountered regularly.
The discovery of the long-tail concept itself was proof of a new trend for its time -- the shift of production, supply, and distribution towards niched demands. And by continuously analyzing long-tail data, it's possible to detect a rise or downfall in the overall sales or popularity of certain niche products, which would be a sign of new and emerging trends in that niche.
Many well-known businesses have implemented long-tail data methodology concepts to achieve significant success, some of which are explained below.
Amazon's book sales back in the early 2000s were also highlighted as one of the primary examples of the long-tail in Anderson's book. He mentioned that Amazon's revenue came from the total sales of all the less popular books, which was more than the combined sales of the bestseller books.
Even today, after evolving into an e-commerce store that sells basically everything, Amazon still earns a huge portion of its revenue from low-volume sales of millions of niched products. By analyzing the long-tail distribution of their product sales, they have improved their product search and recommendation algorithms to efficiently connect retailers and sellers of niched products with people looking for such products.
Long-tail keywords have significantly improved search engine algorithms, helping them aggregate and process complex data points and narrow search queries to produce accurate results. Today, these algorithms can even understand the intent behind such long-tail terms, whether the user wants to purchase a specific product or seek information.
Also, short-tail keywords are generic, so it's difficult to determine what the user is looking for. In contrast, the specific nature of long-tail keywords lets search engines produce exactly the info that users are looking for, which also improves the relevance of the algorithm.
Netflix is a prime example of making the most out of long-tail data distribution. Their library varies based on geolocation, and they promote plenty of regional movies and shows in respective countries. Their recommendation system is very efficient and personalized, so users are suggested more shows from the genres they enjoy.
The most watched movie or series in South Korea may not be recommended as aggressively to users in the US. Netflix's algorithm is designed to identify every individual's unique preferences and serve more relevant content accordingly.
Wikipedia has countless pages on niched topics that very few people search. But when added together, those page views amount to billions, which has played an important role in establishing the domain authority of Wikipedia on Google's search engine.
That is why it often ranks on the first page of Google for most informational queries and enjoys a solid reputation as the most popular online encyclopedia.
Driving is unpredictable, especially in the real world of pedestrians and dozens of other drivers on the road. There is no finite list of every possible scenario. Dozens of real-time infrared sensors put all visible objects into long-tail classes to evaluate all possible trajectories and risks. As a result, autonomous vehicles can choose the safest course of action in every dangerous situation.
That's where long-tail data plays an important role. The self-driving cars' AI system is trained with many special long-tail datasets, including cases of rare, unusual, sudden, and unexpected incidents, which helps them iterate how to react safely in such cases.
Let's consider all the benefits and limitations of working with large volumes of long-tail data.
When long-tail data is added to the training dataset of machine learning models, it helps AI systems learn about uncommon or rare scenarios. For example, in AI systems used for medical diagnosis, long-tail data can include mild symptoms that occur less frequently but are still important indicators of certain underlying conditions and diseases.
And, of course, it also benefits businesses as analysis of long-tail data enables companies to create strategies around their lesser-known, niche products, boosting revenue.
So the inclusion of long-tail data in the big data set of companies or training datasets of AI systems greatly improves the significance and comprehensiveness of the overall dataset.
Long-tail data is important to identify and accurately target small groups of customers who are looking for niche products. It further provides insights into their demographic info, the keywords, and queries they use frequently, how often they purchase such products, etc.
All that data helps businesses sort their products into relevant categories and make necessary optimizations, making them easier to find, even if just a couple of thousand people are looking for them.
Say you're looking at the sales of crime novels in a given year. Novels that barely sold 1,000 copies may seem negligent compared to the ones that sold millions of copies. Normal distributions would have a majority of data centered around those bestsellers.
But when you have a long list of millions of novels that sell very few copies, their collective impact is huge, and ignoring them leaves you with an incomplete perception of the overall market.
So a long-tail distribution is more precise because it encompasses all items and events, even those that individually seem insignificant due to low frequencies. In reality, the 'tail' may have a greater impact than the 'head.'
One of the main downsides of long-tail data is that it is highly unstructured because it comprises an extremely large group of diverse items and events. So, before one can analyze and derive insightful information from long-tail data, it requires extensive efforts in structuring and categorizing.
Individual elements of the dataset need to be categorized into relevant components that share similar properties. These components can be very minute, detailed, and vastly different from each other. This componentization of long-tail data can be tedious and time-consuming when the items at the tail of the distribution are extremely randomized.
As you move further toward the tail of the distribution, you encounter many elements and events with low frequency.
These outliers in power law distribution can be problematic because they can't always be outright ignored. They may have contrasting properties from those elements at the head or early regions of the tail. If such outliers are analyzed with the more prominent items in the data distribution, it may give a skewed outcome with inaccurate results.
Long-tail distributions have greater variance - Because there is a massive difference in the frequency of items at the head and those at the far end of the tail. Also, the graph itself may significantly change when certain parameters are changed.
For example, the long-tail curve of blockbuster movies in the last five years and the last 20 years may lead to different findings.
Such data points can create certain issues and inaccuracies in machine learning models when their training datasets contain large amounts of long-tail data.
Overfitting is one such issue, where the model produces accurate desired results when processing the training dataset but fails to give correct outcomes with completely new datasets. Another problem is that if the head of the distribution is too large and dominant, the model may develop a bias towards items in the head.
To avoid such problems and errors when working with large-scale datasets, random sampling can be implemented. It's a method of preparing the training dataset by randomly selecting items from the overall long-tail distribution dataset. This randomness in data sampling removes possible biases and dominant patterns that can mislead the AI model.
Data science teams working on large-scale datasets must bear these considerations:
Data is always scattered, whether within an organization or throughout various public and private databases. This can create problems for data science teams to access and gather the required long-tail datasets.
To avoid this issue, organizations must establish efficient data-sharing practices, set up proper communication channels, and ensure that their data science teams have the required authorization to access all essential data sources.
If an organization's existing traditional stack isn't suitable for long-tail data, the data science team must urgently design a modern data stack. And when doing so, they must consider the properties and characteristics of long-tail distribution, such as:
Constant addition of new items and elements leads to continuous elongation of the tail end.
Long-tail data is dynamic, and the frequency of items always fluctuates.
The data is extremely diverse and unstructured.
So for efficient storage and analysis of long-tail individual items, the modern data stack must be:
Easily scalable when new elements need to be added to the distribution.
Capable of monitoring and processing frequency fluctuations in real-time.
Connected with various data integration tools to alter, restructure, and categorize the data.
These are just a few general considerations, and teams may have to work on many other factors based on the organization's individual requirements.
Long-tail data enables data science teams to gain insights that cannot be obtained from other data types. The collective impact of a large volume of low-frequency, low-amplitude items, and events improves the relevance of business decisions. So there are many crucial pieces of information that long-tail data can reveal.
How do millions of low-demand niche products affect the overall market?
Do small groups of niche customers tend to be more loyal than large groups of general buyers?
How would specific autonomous AI systems respond to novel situations and rare events?
Many more questions like these that provide greater insights into business decisions and machine learning can be answered only by analyzing long-tail data through ETL and data visualization tools.