Oct 22, 2016

Solr Sharding

In one of our projects, the LookSMI media monitoring platform, we have to handle a huge volume of data that keeps growing. At the same time, we must run quick searches with smart rules. To make this work, the whole index should be kept in RAM.

LookSMI Case: Using Shards for the News Portal

Obviously, when millions of records are added to the index regularly, the RAM of a single machine will never be enough. Eventually, you have to divide the index into several parts so that the search can run on several machines simultaneously.

LookSMI deals with news, which means the portal works heavily with new records and hardly touches older ones. LookSMI uses Solr as its full-text search engine. To ensure fast search within recent records, we decided to shard LookSMI's index.

Sharding is a type of database partitioning that splits a very large database into smaller, faster parts called data shards. The word "shard" itself means a small part of a whole. Strictly speaking, sharding means horizontal partitioning, but in practice the term is often used for any partitioning that makes a very large database more manageable and easier to search.

Accordingly, we partitioned LookSMI's search index into shards, where each shard corresponds to one month. Only a limited number of shards, those for the last few months, are active at the same time. Thus, all the data a user typically needs is held in RAM.

When needed, older shards can be activated to serve a particular request and then deactivated again. Moreover, when a search is filtered by a predefined date range, we can engage only a subset of the active shards, so the search runs only on the servers that hold data for that period.
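The month-based selection described above can be sketched in a few lines of Python. This is only an illustration: the core naming scheme (`news_YYYY_MM`) and both function names are assumptions made for the example, not LookSMI's actual code.

```python
from datetime import date
from typing import List, Set, Tuple


def month_cores(start: date, end: date, prefix: str = "news") -> List[str]:
    """Return one hypothetical core name per month from start to end, inclusive."""
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{year:04d}_{month:02d}")
        # Step to the next calendar month, rolling over the year in December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


def cores_to_search(start: date, end: date, active: Set[str]) -> Tuple[List[str], List[str]]:
    """Split the cores needed for a date filter into active ones and ones to activate."""
    needed = month_cores(start, end)
    ready = [name for name in needed if name in active]
    to_activate = [name for name in needed if name not in active]
    return ready, to_activate
```

A search filtered to a recent period touches only cores in the first list; a request that reaches further back also reports which older cores would have to be activated first.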

Step-by-step Guide to Arranging Solr Sharding

There are several ways to arrange sharding in Solr. The easiest is to divide the index into a few cores and, when running a search query, tell Solr to search across several cores simultaneously.

First, create cores with an identical structure. This is easy to do with a shared core template (a config set).

To begin, copy the configuration files into the configsets folder:

cp -r solr/data//conf solr/data/configsets/conf

Then create a named folder for the config set; this is the name you specify when creating the core:

mkdir solr/data/configsets/
cp -r solr/data/configsets/conf solr/data/configsets//conf

We have automated the process of creating the cores so that new cores are proactively set up every month:

http:///solr/admin/cores?action=CREATE&name=&instanceDir=path/to/instance&configSet=path/to/instance/
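As a rough sketch of how such a monthly job might build the request, the Python below assembles a CoreAdmin CREATE URL for the following month. The base URL, the `news_YYYY_MM` naming scheme, the config set name `news_template`, and the function names are all assumptions for illustration. `action`, `name`, `instanceDir`, and `configSet` are CoreAdmin parameters, with `configSet` taking the name of a folder under `configsets`.

```python
from datetime import date
from urllib.parse import urlencode


def next_month(today: date) -> tuple:
    """Return (year, month) of the calendar month after today."""
    return (today.year + 1, 1) if today.month == 12 else (today.year, today.month + 1)


def create_core_url(base_url: str, today: date,
                    config_set: str = "news_template", prefix: str = "news") -> str:
    """Build the CoreAdmin URL that creates next month's core ahead of time."""
    year, month = next_month(today)
    name = f"{prefix}_{year:04d}_{month:02d}"
    params = {
        "action": "CREATE",
        "name": name,
        "instanceDir": name,      # assumption: one instance folder per core
        "configSet": config_set,  # hypothetical config set name
    }
    return f"{base_url}/admin/cores?{urlencode(params)}"
```

The function only builds the URL. A scheduler such as cron could call it once a month and send the request with any HTTP client.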

Once the cores exist, update the search code so that a query is distributed across several cores at once through the shards parameter:

http:///solr/select?q=*:*&shards=http:///solr/,http:///solr/
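A minimal Python sketch of building such a distributed query is shown below. The host, the core names, and the function name are hypothetical. Shard addresses are written as `host:port/solr/core`, the form used in Solr's distributed search documentation, and sending the request to one core's `select` handler is an assumption of this example.

```python
from typing import List
from urllib.parse import urlencode


def distributed_select_url(base_url: str, entry_core: str,
                           shard_addresses: List[str], query: str = "*:*") -> str:
    """Build a select URL that asks one core to fan the query out to all listed shards."""
    params = {
        "q": query,
        "shards": ",".join(shard_addresses),  # comma-separated list of shard addresses
    }
    # Keep ':', '/', ',' and '*' unescaped so the URL stays readable.
    return f"{base_url}/{entry_core}/select?{urlencode(params, safe=':/,*')}"
```

Passing only the shards that cover the requested date range keeps the query away from servers that hold no relevant data.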

Then update the indexing code so that each record is written to its corresponding core:

http:///solr//update -H
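One possible way to route records, assuming one core per month, is sketched below in Python. The `news_YYYY_MM` naming scheme, the `published` field, and the function names are assumptions for illustration. The sketch only chooses the target core and groups records per update URL; it does not send anything.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List


def core_for(published: date, prefix: str = "news") -> str:
    """Return the hypothetical name of the monthly core that holds this date."""
    return f"{prefix}_{published.year:04d}_{published.month:02d}"


def update_url(base_url: str, published: date) -> str:
    """Build the update handler URL of the core for the record's month."""
    return f"{base_url}/{core_for(published)}/update"


def group_by_update_url(base_url: str, docs: List[dict]) -> Dict[str, List[dict]]:
    """Group records by target update URL so each core gets one batch."""
    batches = defaultdict(list)
    for doc in docs:
        batches[update_url(base_url, doc["published"])].append(doc)
    return dict(batches)
```

Grouping before sending means each core receives a single batch per run instead of one request per record.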

Cores can be located on different servers. Furthermore, they can be arranged not only by time periods but also by categories.

Don’t be afraid to create multiple cores: the resource overhead under these conditions is quite small. In return, sharding lets the system scale smoothly and helps keep response times from slowing down as the index grows.
