Ingesting Terabytes of Data into HBase at Scale
Introduction
At Meesho, we offer a wide variety of in-app advertisement programs to our sellers to promote their products within the app and accelerate their business. These advertisement programs are a significant revenue contributor for Meesho and fuel the company’s journey.
At the core of all advertisement programs is the ad-server, a microservice that powers the Ads displayed on the Meesho app. To maintain an exceptional user experience and optimize Meesho’s revenue, it is important to display personalized Ads to our users. To achieve this, we analyze vast amounts of clickstream data from millions of users and generate personalized Ad product recommendations for each user based on their browsing history.
Problem Statement
Meesho’s ad-server has to power millions of Ad requests every minute from users browsing our app. In parallel, it also needs to consume terabytes of recommendation data arriving every few hours to personalize the Ads for a better user experience. We needed to find a cost-effective and scalable solution that would let us do this seamlessly.
Some of the key challenges we faced include:
Seamless Ad serving: The first and foremost challenge was to ensure that Ad serving is not disrupted while the ad-server consumes a new set of recommendations.
Scalability: As a fast-growing company, the traffic on the ad-server and the size of the recommendation data will keep increasing with our growing user base and the new personalization features added over time. We needed to find a solution that could scale horizontally to a 3x increase in Ad serving traffic and a 5x increase in data size.
Cost Efficiency: At Meesho, we aim to be the most affordable shopping destination for our users. To achieve this, it is really important to have a cost-effective tech stack. We needed a solution that would consume minimal resources to keep costs under control.
Solution
In our exploration for a solution, we came across the Bulkload technique in HBase, an efficient way to load large amounts of data. HBase is a highly scalable, distributed, non-relational database. At Meesho, we have been using HBase for multiple use cases involving high-throughput random reads over large data volumes. Using HBase with the Bulkload technique seemed to be a very good fit for our problem.
The following sections explain the Bulkload technique in more detail.
Conventional Write Mechanism in HBase

Figure 1: HBase write path
In the conventional write path of HBase, data is first written to a Write-Ahead Log (WAL) for durability. Simultaneously, it is stored in an in-memory Memstore, which is flushed periodically to create immutable HFiles on disk. In the background, compactions run regularly, merging multiple HFiles into a single HFile to optimize read latency.
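For reference, here is a minimal sketch of what a conventional write looks like from the client side using the HBase Java client API. The table name, column family, and values are hypothetical; every such Put travels through the WAL and Memstore before eventually landing in HFiles.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ConventionalWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("ad_recommendations"))) {
            // Each Put is appended to the region server's WAL and buffered in the
            // Memstore; it reaches HFiles on disk only when the Memstore is flushed.
            Put put = new Put(Bytes.toBytes("user-123"));
            put.addColumn(Bytes.toBytes("reco"), Bytes.toBytes("products"),
                    Bytes.toBytes("p1,p2,p3"));
            table.put(put);
        }
    }
}
```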
Bulkload Approach in HBase
In the Bulkload approach, the conventional write path is bypassed entirely: HFiles are prepared outside the HBase cluster, typically by a MapReduce or Spark job, and are then handed over directly to the region servers, which adopt them as regular storefiles. Since no data passes through the WAL or the Memstore, there are no flushes and far less compaction work, which makes Bulkload an efficient way to ingest very large datasets.
Figure 2: Bulkload approach in HBase
Complete Pipeline

Figure 3: Ad-server data flow
The above diagram describes the end-to-end data flow of the ad-server. The pipeline starts with an Ad recommendations job pushing the recommendations to S3. A Lambda function is invoked on the arrival of a data dump and triggers the Spark job. The Spark job uses the MapReduce utility provided by HBase to generate the HFiles and saves them onto the HDFS of the Spark core nodes. These HFiles are then copied onto the HDFS of the HBase cluster using the bulkload utility provided by HBase.
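The sketch below illustrates roughly how such a pipeline can be wired up with Spark's Java API, HBase's HFileOutputFormat2, and the bulkload utility (LoadIncrementalHFiles). It is not our production job: the S3 path, table name (ad_recommendations), column family (reco), and schema are assumptions, class locations vary slightly across HBase versions, and a real job would also align Spark partitions with region boundaries and register the HBase classes with Kryo.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class RecommendationBulkLoad {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("reco-bulkload").getOrCreate();
        Configuration conf = HBaseConfiguration.create();

        // 1. Read the recommendation dump from S3 (hypothetical path and schema:
        //    user_id, products) and turn each row into a (rowkey, KeyValue) pair.
        Dataset<Row> dump = spark.read().parquet("s3://reco-bucket/latest-dump/");
        JavaPairRDD<ImmutableBytesWritable, KeyValue> hfileRdd = dump.javaRDD()
                .mapToPair((Row row) -> {
                    byte[] rowKey = Bytes.toBytes(row.getString(0));
                    KeyValue kv = new KeyValue(rowKey, Bytes.toBytes("reco"),
                            Bytes.toBytes("products"), Bytes.toBytes(row.getString(1)));
                    return new Tuple2<ImmutableBytesWritable, KeyValue>(
                            new ImmutableBytesWritable(rowKey), kv);
                })
                .sortByKey(); // HFiles must be written in rowkey order

        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            TableName tableName = TableName.valueOf("ad_recommendations");
            Table table = conn.getTable(tableName);
            RegionLocator locator = conn.getRegionLocator(tableName);

            // 2. Let HFileOutputFormat2 pick up the table's compression, bloom filter
            //    and encoding settings, then write the HFiles to HDFS.
            Job job = Job.getInstance(conf);
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
            hfileRdd.saveAsNewAPIHadoopFile("/tmp/reco_hfiles",
                    ImmutableBytesWritable.class, KeyValue.class,
                    HFileOutputFormat2.class, job.getConfiguration());

            // 3. Hand the generated HFiles over to the region servers
            //    (the "bulkload" copy phase).
            new LoadIncrementalHFiles(conf).doBulkLoad(
                    new Path("/tmp/reco_hfiles"), conn.getAdmin(), table, locator);
        }
        spark.stop();
    }
}
```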
Performance Analysis
For the remainder of this blog, we compare the performance of ingesting a ~300GB data dump into HBase using the conventional write mechanism versus the Bulkload approach.
Infra Setup

Test Scenarios
We created a new HBase cluster in AWS and imported a test table containing 300GB of data. We simulated read traffic by loading HBase with bulk get commands at a consistent throughput of 1000 RPS, with each bulk get fetching 3 random rows from the test table. We then ingested 336GB of data into HBase via both of the approaches discussed above and compared their efficiency and their impact on HBase performance.
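For context, a "bulk get" in the HBase Java client API is simply a single Table.get call carrying a list of Get requests. The snippet below is a minimal sketch of one such request fetching 3 random rows; the table name and rowkey format are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadTrafficSimulator {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("ad_recommendations"))) {
            // One "bulk get" = one batched request fetching 3 random rows.
            List<Get> gets = new ArrayList<>();
            for (int i = 0; i < 3; i++) {
                long userId = ThreadLocalRandom.current().nextLong(10_000_000L);
                gets.add(new Get(Bytes.toBytes("user-" + userId)));
            }
            Result[] results = table.get(gets);
            System.out.println("Fetched " + results.length + " rows");
        }
    }
}
```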
The graphs below show the read latency and HBase CPU when no writes are happening and HBase is serving only read traffic.
Figure 4: Latency of HBase Bulk Get commands at 1000 RPS without any writes
Figure 5: CPU consumption of HBase region servers with only read traffic
Approach 1: Bulkload using Spark
While keeping the read traffic consistent at 1000 RPS, we initiated a Spark job that ingests a 336GB dataset into HBase via Bulkload. The table was pre-created with 100 regions so that no region splitting would occur during the process; a sketch of such pre-splitting follows. The graphs after it show the read latency and HBase CPU during the ingestion.
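Here is a hedged sketch of pre-creating a table with 100 regions using the HBase 2.x Admin API. The table name, column family, and split boundaries are assumptions; in practice the boundaries should match the rowkey distribution of the dump.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableDescriptor descriptor = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("ad_recommendations"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("reco"))
                    .build();
            // Create the table with 100 regions spread evenly across the assumed
            // rowkey range, so no region splits are triggered during ingestion.
            admin.createTable(descriptor,
                    Bytes.toBytes("user-0000000"), Bytes.toBytes("user-9999999"), 100);
        }
    }
}
```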
Figure 6: Latency of HBase Bulk Get commands during data ingestion via the Bulkload approach
Figure 7: CPU consumption of HBase region servers during the Bulkload approach
The Spark job’s overall execution took 30 minutes: 17 minutes for HFile generation and 13 minutes for copying the HFiles from Spark to HBase. The read latency of the HBase Bulk Get commands spiked from 5ms to 15ms during the HFile copying phase, and HBase CPU increased to around 30% over the same period.
The time taken for HFile generation can be further reduced by increasing the number of Spark nodes. HFile copying can be sped up by increasing the number of threads used for copying in the bulkload utility, as sketched below. Note that increasing these threads puts more load on HBase resources, which can be mitigated by scaling up the HBase region servers.
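As a hedged example of that thread tweak: hbase.loadincremental.threads.max is the property LoadIncrementalHFiles reads for its copy thread pool in the HBase versions we have worked with, but it should be verified against the version in use; the path and table name below are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles;

public class TunedBulkLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // More copy threads move more HFiles in parallel, at the cost of extra
        // load on the region servers (the default is the number of CPU cores).
        conf.setInt("hbase.loadincremental.threads.max", 32);
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            TableName tableName = TableName.valueOf("ad_recommendations");
            new LoadIncrementalHFiles(conf).doBulkLoad(
                    new Path("/tmp/reco_hfiles"), conn.getAdmin(),
                    conn.getTable(tableName), conn.getRegionLocator(tableName));
        }
    }
}
```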
Figure 8: Snapshot from the HBase dashboard showing the details of the ingested table. Observe that each region contains a single storefile
Approach 2: Ingestion with a script using conventional HBase put commands
In this approach, we ingested the data using HBase Bulk Put commands, with each command inserting 50 records at a time. To speed things up, we ran the ingestion in parallel from 3 different instances. The overall ingestion took around 3.5 hours. During the ingestion, read latency spiked from 5ms to 250ms and HBase CPU reached 54%.
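A minimal sketch of this kind of batched ingestion, assuming the same hypothetical table and column family as earlier: records are grouped into batches of 50 Puts and written with a single Table.put call per batch.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchPutIngestion {
    private static final int BATCH_SIZE = 50;

    // Each record is assumed to be a {userId, products} pair.
    public static void ingest(Iterable<String[]> records) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("ad_recommendations"))) {
            List<Put> batch = new ArrayList<>(BATCH_SIZE);
            for (String[] record : records) {
                Put put = new Put(Bytes.toBytes(record[0]));
                put.addColumn(Bytes.toBytes("reco"), Bytes.toBytes("products"),
                        Bytes.toBytes(record[1]));
                batch.add(put);
                if (batch.size() == BATCH_SIZE) {
                    table.put(batch); // one bulk put of 50 records
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                table.put(batch);
            }
        }
    }
}
```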
Figure 9: Latency of HBase Get commands during data ingestion via HBase put commands
Figure 10: CPU consumption of HBase region servers during ingestion via HBase put commands
From the snapshot below, we can observe that around 50 storefiles were created in each region. Due to the high number of storefiles, reads on this table would incur high latency, so a major compaction needs to be triggered on the table to optimize latency. Because of their high resource utilization, major compactions degrade ongoing queries on HBase and are usually run during off-peak hours. This is a concern for our ad-server use case, as we receive multiple recommendation datasets within a day and need to consume them even during peak traffic hours.
Figure 11: Snapshot from the HBase dashboard showing the details of the ingested table
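For completeness, triggering such a major compaction via the HBase Admin API looks roughly like the sketch below (the table name is again a hypothetical); the major_compact shell command achieves the same thing.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class TriggerMajorCompaction {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Asynchronously kicks off a major compaction for every region of the
            // table, merging the accumulated storefiles back into one per region.
            admin.majorCompact(TableName.valueOf("ad_recommendations"));
        }
    }
}
```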
Performance Comparison

Cost Comparison

Conclusion
From the above analysis, we can clearly see that Bulkload is a much faster, lighter and cheaper approach for ingesting large datasets into HBase than conventional put commands. Data ingestion time with Bulkload was reduced by 85% compared to the time taken with conventional put commands, the cost of ingestion was about 50% lower, and the approach had minimal latency impact on other ongoing queries in HBase.
- Bulkload is a faster, cheaper, and more efficient approach for ingesting large datasets into HBase.
- Bulkload significantly reduced ingestion time and cost compared to conventional put commands.
- The approach had minimal latency impact on ongoing queries in HBase.