Inside Flipkart’s monster-cruncher: how it gleans insights from a petabyte of data daily
Gender is complicated for the data scientists at Flipkart, India’s leading ecommerce site. It’s not enough, for example, to know that the shopper is female. What if she’s shopping for her husband today? So, apart from the label she chose while signing up for the account, there is also a behavioral gender.
Behavioral gender is gender based on your behavior: whether you tend to shop like a female or a male when using your account. This behavior can also vary from session to session, so there’s a third, in-session gender.
“We have to compute the label gender when we send you survey results so that we can address you properly and all. But when we have to show you what you want, we will piggyback on the behavioral gender and [adjust that] as soon as we know a little bit about this session [and] what your mood is today,” says Sandeep Kohli, Flipkart’s senior director of engineering.
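The idea Kohli describes, starting from a long-term behavioral signal and adjusting it as the current session reveals more, can be illustrated with a simple weighted average. This is only a sketch under stated assumptions, not Flipkart's method: the class name, the 0.0 to 1.0 scale, and the `prior_weight` parameter are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SessionAffinity:
    """Running estimate (0.0 to 1.0) of how a session leans toward one catalog segment.

    All names and weights here are illustrative assumptions.
    """

    behavioral_prior: float            # long-term lean learned from past sessions
    prior_weight: float = 5.0          # how many session events the prior is "worth"
    events: List[float] = field(default_factory=list)

    def observe(self, signal: float) -> None:
        """Record one in-session signal, e.g. 1.0 or 0.0 for the segment of a viewed product."""
        self.events.append(signal)

    def estimate(self) -> float:
        """Blend the long-term prior with what this session has shown so far."""
        total = self.behavioral_prior * self.prior_weight + sum(self.events)
        return total / (self.prior_weight + len(self.events))


if __name__ == "__main__":
    session = SessionAffinity(behavioral_prior=0.8)
    print(session.estimate())          # no session evidence yet: equals the prior
    for _ in range(5):
        session.observe(0.0)           # today's browsing points the other way
    print(session.estimate())          # the estimate moves toward the session evidence
```

With no events the estimate is just the prior; each new event pulls it toward what the shopper is doing right now, which is the "piggyback, then adjust" behavior described in the quote.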
All that work is just for one user. Flipkart does it for millions of users every day.
And gender is just one parameter for customer insights. All kinds of behavioral, demographic, and usage data go into making product recommendations or trying to ensure a customer doesn’t drop out without completing an intended purchase.
During the recent Big Billion Day sales, for example, pageviews in a day rose five-fold to nearly a billion, says Kohli. “You have to accommodate that [spike] without degrading the experience because people don’t have too much patience on such days. So every second of delay costs us a few hundred customers who could drop off at that point.”
Survival of the smartest
To manage that kind of data and do advanced analytics and personalization, Flipkart has built its own data center – the only Indian internet company to have done so. And that takes a lot of hardware. “We have 5 petabytes in RAM, 120 petabytes of disk storage, and [a] tremendous amount of cross-sectional bandwidth, because anything can become a bottleneck at that kind of scale,” says Kohli. “It is extremely important to know about your customers and serve them better, because at this point in the highly competitive ecommerce industry, it’s a survival game.”
Every action on the site involves analytics. Take for example a simple search query: how far did the user have to scroll down? If the search results are relevant, the user should be able to find the product they’re looking for without needing to scroll too far down. Now imagine getting the search right for 100 million users.
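The scroll-depth question can be turned into a metric in a few lines. The sketch below is one plausible way to do it, not a description of Flipkart's pipeline: the function name, the `top_k` cutoff, and the input format (the 1-based position of the result each user clicked) are assumptions.

```python
from statistics import mean
from typing import Dict, Sequence


def scroll_depth_summary(clicked_positions: Sequence[int], top_k: int = 5) -> Dict[str, float]:
    """Summarize how far down the results users had to go before clicking.

    clicked_positions: 1-based rank of the clicked result, one entry per search.
    top_k: hypothetical cutoff for "found it without scrolling far".
    """
    if not clicked_positions:
        raise ValueError("need at least one clicked position")
    within_top_k = sum(1 for position in clicked_positions if position <= top_k)
    return {
        "mean_position": mean(clicked_positions),
        "share_in_top_k": within_top_k / len(clicked_positions),
    }


if __name__ == "__main__":
    # Four searches: three clicks near the top, one far down the page.
    print(scroll_depth_summary([1, 2, 3, 10], top_k=5))
```

A falling mean position, or a rising share of clicks in the top few results, would suggest the search results are becoming more relevant.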
Efficiency also plays a vital part in customer experience. Where a product order comes from can help with inventory planning, delivery time, and so on. Then there are insights that are important for running the business, such as figuring out if product ratings are genuine or whether somebody is gaming the system for sales offers.
Flipkart gets 10 terabytes of user data each day from browsing, searching, buying or not buying, as well as behavior and location. This jumps to 50 terabytes on Big Billion Day sales days. There’s also order data, shipping data, and other forms of data captured by different systems. All this is mixed together and correlated for meaningful insights. “We actually process more than a petabyte of data every day in order to make sense of what is happening at our scale,” says Kohli.
A petabyte, by the way, is one thousand terabytes. And a terabyte is a million megabytes. If you record HD video 24/7 for 3.4 years, you would reach a petabyte.
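The figures above can be checked with some quick arithmetic. The sketch below uses only the article's numbers and decimal units; the function names are made up for illustration, and a 365-day year is assumed.

```python
TB_PER_PB = 1_000          # one petabyte is a thousand terabytes
MB_PER_TB = 1_000_000      # one terabyte is a million megabytes
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def implied_video_rate_mb_per_s(years: float = 3.4, petabytes: float = 1.0) -> float:
    """Data rate (MB/s) at which continuous recording fills `petabytes` in `years`."""
    total_mb = petabytes * TB_PER_PB * MB_PER_TB
    return total_mb / (years * SECONDS_PER_YEAR)


def processing_multiple(processed_pb: float = 1.0, ingested_tb: float = 10.0) -> float:
    """How many times larger the daily processed volume is than the daily user data."""
    return processed_pb * TB_PER_PB / ingested_tb


if __name__ == "__main__":
    print(f"Implied video rate: {implied_video_rate_mb_per_s():.1f} MB/s")
    print(f"Processed vs. ingested on a normal day: {processing_multiple():.0f}x")
    print(f"Processed vs. ingested on a sale day: {processing_multiple(ingested_tb=50.0):.0f}x")
```

In other words, the video comparison implies a recording rate of a little over 9 MB per second, and a petabyte of daily processing is about 100 times the 10 terabytes of fresh user data, which reflects how much mixing and correlating with order, shipping, and other data goes on.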
Following the breadcrumbs
Flipkart has over 60 machine learning models running on any given day to generate insights for its sales and business teams. These insights are served on over 6,000 real-time terminals that help business leaders make decisions.
How a sale is going, which deals are working or not working, at which point users are dropping off, what the real-time funnel is – the next time you go shopping online, think of all the footprints you’re leaving for data scientists to figure you out.
According to Kohli, the popular perception of a data scientist’s job is that it’s jazzy and romantic. “I think it’s not,” he says. “Most times, the data scientists in an organization are an angry lot, because either they’re not able to find the right data or the data that has been thrown [at] them is not of great quality, and they’re unable to achieve what they want.” Kohli is an Indian Institute of Science post-grad who worked earlier at IBM.
Typically, 80 percent of a data scientist’s job goes to cleaning up the data and other mundane stuff rather than modelling or analytics, adds Kohli. So his focus is on making the data scientist’s job more efficient. The aim is to ensure that the quality of data going into decision-making is up to the mark.
“When we decided to build our own data platform to serve our AI and ML and analytics needs, the first thing we decided was that it will be a tight data system; it will not be some kind of a big data system where you can dump in any data. That is like having a big hard disk with no file system and throwing everything in the root directory,” explains Kohli.
Doing it at scale
Another aspect is the sheer scale. Kohli likes to separate the data science part of what his team does from the engineering requirements. “Scaling is a pure engineering job and we do not want data scientists to be spending their time trying to figure out load balancers, fault tolerance, and other things. So we built a machine learning platform to actually let data scientists automatically deploy the model. The platform takes care of scaling these models.”
This underlying layer is what enables Flipkart to ingest a petabyte of data and digest it. It’s designed to ensure that the right data is captured and the analytics on it works at scale, even when 13 million users land on the Flipkart site daily. Apart from its own private cloud, Flipkart has also teamed up with Microsoft for its AI-powered Azure public cloud.
“Combining Microsoft’s cloud platform and AI capabilities with Flipkart’s existing services and data assets will enable Flipkart to deliver new customer experiences,” Microsoft CEO Satya Nadella said earlier this year when the deal was announced.
If everything is well in the system, what the user gets are relevant search results, product recommendations, and even ads. Checkouts are easier, inventory is managed better, delivery is more efficient, and marketing is more targeted. It all comes from how the data is handled.
As online shoppers in India become more experienced, their expectations also grow. They have less patience with not finding what they need, being stuck, or not getting delivery when and where they want. And they’re spoilt for choice between Flipkart, Amazon, and Alibaba-backed Paytm.
Flipkart raised US$4 billion this year from SoftBank, Tencent, and Microsoft. Amazon is on track to surpass the US$5 billion Jeff Bezos pledged to its Indian unit last year. Discount wars are back after a lull in 2016. But if customer experience is the ultimate decider, it may boil down to those petabytes in the war of data going forward into 2018.
This post Inside Flipkart’s monster-cruncher: how it gleans insights from a petabyte of data daily appeared first on Tech in Asia.