Internet search engines: Yandex, Google, Rambler, Yahoo. Composition, functions, and principles of operation. A simple description of how the Yandex search engine works

Hello, dear friends! In this article we continue our look at the Yandex search engine. As you may remember, in previous articles we covered the history of this company, which ranks first among its competitors in Russia and beyond.

That is all well and good, but both beginners and experienced site builders care most about one question: how to bring their projects to the top of the search results.

So let's look at how the Yandex search engine works in order to understand which mistakes to avoid and what to expect from a Russian search engine in general.

The last article touched on this topic, and it turned out to be quite interesting and useful, so I decided to supplement and deepen it.

So, I probably got a little carried away with the question of why a search engine indexes documents; that part is obvious. What remains is to figure out the "how".

Website ranking algorithms

First, let's get acquainted with some algorithms that are fundamental to any search engine:

— Direct search algorithm.

What is it? Imagine you remember reading a wonderful story in one of your books, so you start looking through them one by one: you take a book, leaf through it, don't find the story, take the next one... The principle is clear, but this method is extremely slow. That, too, is understandable.
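To make the idea concrete, here is a minimal sketch of direct search in Python; the tiny document collection and the query are invented purely for illustration:

```python
# A minimal sketch of direct (linear) search: scan every document for the query.
documents = {
    "page1": "how to choose a car and not regret it",
    "page2": "the history of the yandex search engine",
    "page3": "how to choose a car in arkhangelsk",
}

def direct_search(query, docs):
    """Return the ids of documents that contain the query as a substring."""
    return [doc_id for doc_id, text in docs.items() if query in text]

print(direct_search("choose a car", documents))  # ['page1', 'page3']
```

The cost grows with the total amount of text, which is exactly why this approach is far too slow for the whole Internet.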

— Reverse (inverted) index algorithm.

For this algorithm, a text file is created for each page of your blog. This file lists ALL the words used on the page in alphabetical order, and it also records each word's position in the text (its coordinates).

This is a fairly fast method, although the search is performed with some loss of precision.

The main thing to understand here is that this algorithm searches neither the live Internet nor your blog itself, but a separate text file created earlier, when the robot visited your site. These files (reverse indexes) are stored on Yandex's servers.
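As a hedged illustration (the tokenization is deliberately naive, and the documents are the toy pages from the previous sketch), a reverse index with word positions could be built like this:

```python
from collections import defaultdict

# A simplified sketch of building a reverse (inverted) index with word positions.
def build_reverse_index(docs):
    """Map word -> {doc_id: [positions]} using naive whitespace tokenization."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

index = build_reverse_index(documents)  # 'documents' comes from the sketch above
print({d: p for d, p in index["car"].items()})  # {'page1': [4], 'page3': [4]}
```

Looking up a word is now a dictionary access rather than a scan of every page, which is what makes this approach fast.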

So, those were the basic search algorithms, i.e. how Yandex simply finds the required documents. So far, there should be no problem with that.

But Yandex knows not just one or even a hundred documents: according to the latest data from my sources, Yandex knows about 11 billion documents (10,727,736,489 pages).

And from all that volume, it has to select the documents that match the query and, more importantly, somehow rank them, i.e. arrange them by importance, or rather by usefulness to the reader.

Mathematical search models

To solve this issue, mathematical models come to the rescue. Now we’ll talk about the simplest models.

Boolean mathematical model: if a word from the query appears in a document, the document is considered found. A simple match and nothing complicated.

But there are problems here. If you, as a user, enter some popular word, or better yet the preposition "в" ("in"), which is the most common word in Russian and occurs in practically EVERY document, you will be given more results than you can even comprehend. That is why the next mathematical model appeared.
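A sketch of Boolean matching on top of the reverse index from the previous sketch; the AND semantics is my assumption here, and the example is only meant to show the idea of "found or not found":

```python
# A sketch of the Boolean model: a document matches if it contains every query word.
def boolean_search(query, index):
    """Return doc ids that contain all of the query words."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], {}))
    for word in words[1:]:
        result &= set(index.get(word, {}))
    return result

print(boolean_search("yandex search", index))  # {'page2'}
```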

Vector mathematical model: this model determines the "weight" of a document. A simple match is not enough; the word should occur several times, and the more often a query word occurs, the higher the document's relevance (correspondence to the query).

It is the vector model that ALL search engines use.
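A minimal sketch of that idea, scoring documents simply by how many times the query words occur; real engines use far more elaborate weighting (e.g. TF-IDF), so the plain counting here is just for illustration, reusing the toy index and documents from above:

```python
# A sketch of the vector-model idea: score = how often the query words occur.
def vector_score(query, doc_id, index):
    """Sum the occurrence counts of each query word in the given document."""
    return sum(len(index.get(word, {}).get(doc_id, []))
               for word in query.lower().split())

def vector_search(query, index, docs):
    """Rank all documents by descending score."""
    return sorted(docs, key=lambda doc_id: vector_score(query, doc_id, index), reverse=True)

print(vector_search("yandex search engine", index, documents))  # 'page2' ranks first
```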

Probabilistic model: a more complex one. The principle is this: the search engine keeps a reference page of its own. For example, you are looking for information about the history of Yandex; Yandex stores some kind of reference document, let's say my previous article about Yandex.

It then compares all other documents with that article. The logic is this: the more similar your blog page is to my article, the MORE LIKELY it is that your page will also be useful to the reader and also tells the history of Yandex.
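A rough sketch of that intuition: compare each candidate page with a reference document by the overlap of their word-count vectors. Cosine similarity is my own illustrative choice here, not Yandex's actual formula, and the toy documents are reused from the earlier sketches:

```python
import math
from collections import Counter

# A sketch of the "compare with a reference document" idea using cosine similarity.
def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

reference = "the history of the yandex search engine"  # the "reference" page
for doc_id, text in documents.items():
    print(doc_id, round(cosine_similarity(reference, text), 2))  # page2 scores highest
```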

To reduce the number of documents that have to be shown to the user, the concept of relevance, i.e. correspondence to the query, was introduced.

How relevant is your blog page to its topic? This is an important question when it comes to search quality.

Assessors - who are they and what are they responsible for?

This relevance is also needed to assess the quality of the algorithms.

For this purpose there is a kind of special task force: the assessors. These are people who review search results by hand.

They have instructions on how to check sites, how to evaluate, etc. And they manually determine whether your pages are suitable for search queries or not.

The quality of the search algorithms is judged by the assessors' opinions. If all the assessors say that the search results do not correspond to the queries, it means the ranking algorithm is flawed and the fault lies entirely with Yandex.

If the assessors say that only one site does not match the query, that site flies far down in the search results. More precisely, not the whole site but just the one article, but that is beside the point.

Of course, the assessors cannot review and evaluate ALL articles by hand and eye. That much is clear.

This is where the other parameters by which pages are ranked come to the rescue.

There are a lot of them, for example:

  • page weight (vIC, PageRank and other such metrics);
  • domain authority;
  • relevance of the text to the request;
  • relevance of external link texts to the query;
  • as well as many other ranking factors.

Assessors make comments, and the people who are responsible for setting up the mathematical ranking model, in turn, edit the formula, as a result of which the search engine works more efficiently.
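Very roughly, one can picture that formula as a weighted combination of such factors. The sketch below is purely illustrative: the factor names and weights are invented, not Yandex's real ones; "editing the formula" then amounts to adjusting these weights:

```python
# An invented example of a weighted ranking formula over a few factors (values 0..1).
weights = {"text_relevance": 0.5, "link_relevance": 0.3,
           "page_weight": 0.15, "domain_authority": 0.05}

def rank_score(factors, weights):
    """Combine factor values into a single score using the current weights."""
    return sum(weights[name] * factors.get(name, 0.0) for name in weights)

page_factors = {"text_relevance": 0.8, "link_relevance": 0.4,
                "page_weight": 0.6, "domain_authority": 0.9}
print(round(rank_score(page_factors, weights), 3))  # 0.655
```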

The main criteria for evaluating the performance of the formula:

1. Accuracy of search results: the percentage of returned documents that match the query (i.e. are relevant). The fewer non-matching pages in the results, the better.

2. Completeness of search results: the ratio of relevant web pages returned for a given query to the total number of relevant documents in the collection (the entire set of pages known to the search engine).

For example, if the entire collection contains more relevant pages than appear in the search results, the results are incomplete: some of the relevant web pages were filtered out.

3. Relevance (freshness) of search results: how well the actual web page corresponds to what is written in its snippet. A document may have changed considerably, or may no longer exist at all, yet still appear in the search results.

This freshness of the results depends directly on how often the search robot re-crawls the documents in its collection.
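As a hedged illustration with made-up numbers, the accuracy from point 1 and the completeness from point 2 can be computed like this:

```python
# Illustrative calculation of accuracy (precision) and completeness (recall).
returned = {"page1", "page2", "page3", "page4"}   # pages shown in the results
relevant = {"page1", "page3", "page5", "page6"}   # all relevant pages in the collection

accuracy = len(returned & relevant) / len(returned)       # 2 / 4 = 0.5
completeness = len(returned & relevant) / len(relevant)   # 2 / 4 = 0.5
print(accuracy, completeness)
```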

Building the collection (indexing site pages) is carried out by a special program: the search robot.

The search robot receives a list of addresses to index, downloads them, and then passes the contents of the downloaded web pages to an algorithm that converts them into reverse indexes.

Well, "in a nutshell", those are the operating principles of a search engine.

Let's summarize:

  1. A search robot comes to your blog.
  2. The search robot stores the reverse index of the page for subsequent searches.
  3. Using the mathematical model, the document is processed and shown in the search results according to the formulas, with the assessors' opinions taken into account.

This is very, very simplified. Just to get a basic understanding of how the Yandex search engine works.

I have written a lot of text here, and perhaps not all of it is clear. So I suggest you come back to this article a little later and also watch this video.

This is an excellent guide, which I also learned from at one time.

I hope this information helps you better understand why your sites occupy the positions they do in the search results, and to do everything possible to improve them.

With that I will say goodbye. If you have any questions, I am always happy to answer them in the comments. Or perhaps you would like to add something to the article?

In any case, share your opinion!

Yandex is currently the most popular search engine in Russia. According to LiveInternet statistics, Yandex's share of the all-Russian audience is 53.4%, and if we count only Moscow and the Moscow region it is even higher, 67.9% (Moscow accounts for more than 50% of all queries in Russia).

The website www.yandex.ru was created in 1997; at the time a single server was enough for it, and it stood under the desk of one of the first Yandex developers, Dmitry Teiblyum. Very soon after launch a second server was purchased, and when it became necessary to install yet another one, it turned out that there was room under the desk for either three Yandex servers, or […]

Search engine developers strive to provide users with the best answers to their queries. Sometimes such an answer may be a number (for example, the weather in a city), a picture (for example, an address on a map), a translation of a word or a quatrain. When you have a suitable array of information at hand, the answer can be given immediately. Therefore, Yandex supplements Internet search results with answers from its […]

Approximately every tenth request to Yandex is “navigational”, that is, it consists of the name of an organization or website and the user wants to go to the website of this organization. In this case, the Yandex search bar is used instead of the browser address bar and the user, as a rule, is not interested in the remaining nine search results. Without distracting the user from the main goal, we added after the main […]

The main task of a search engine is to answer the user's question. When a user enters a query, the search engine does not visit every site on the Internet; it searches its database of known pages, the search index. There it finds all the pages containing the words from the query. The user sees links to these pages on the search results pages.

As we see, Yandex does not stand still, and I am sure that the search technologies of this system will continue to develop in order to improve the quality of search, which can hardly be called ideal yet.

On November 10, 2009, Yandex announced a new version of the search algorithm - Snezhinsk. Fundamental changes have occurred in the algorithm for calculating relevance - Yandex representatives wrote the following: “We managed to create a more accurate and much more complex mathematical model, which led to a significant increase in search quality. Thanks to the redesign of the search ranking architecture, it was possible to implement the accounting of several thousand [...]

Testing of the new version of the Yandex algorithm began on July 9, 2008. According to Yandex, “the main changes in the program are related to a new approach to machine learning and, as a result, differences in the way ranking factors are taken into account in the formula.”

On April 14, 2008, the new search algorithm “Magadan” began to be tested at the address buki.yandex.ru. In addition to doubling the number of ranking factors, the following innovations were also added:

Before venturing into the algorithmic jungle, let's recall how a search engine works in general. The logical structure of a search system can be represented as three modules (see diagram). A robot (crawler) is a special program that crawls Internet sites and downloads their content. The robot has a special schedule according to which it makes its rounds. Website pages loaded by a robot, a special [...]

66. What has more influence: a link from a free platform (blogspot, LJ, etc.) or from a standalone site/blog? Free platforms transfer less weight than standalone sites. However, the impact can be greater; it depends on many factors: the current anchor list, the state of the sites being compared, etc. It is impossible to give an unambiguous answer to this question. 67. The greatest weight is transferred between […]


What is the purpose of taking into account external links to a site? As you can see from the previous section, almost all factors influencing ranking are under the control of the page author. Thus, it becomes impossible for a search engine to distinguish a truly high-quality document from a page created specifically for a given search phrase or even a page generated by a robot that does not contain useful information at all. […]

Search engines have long since become an integral part of the Russian Internet. Today they are huge and complex mechanisms that serve not only as a tool for finding information but also as tempting areas for business.

Most search engine users have never thought (or have thought, but found no answer) about how search engines operate, how they process user queries, what these systems consist of, and how they function...

This master class is designed to answer the question of how search engines work. However, you will not find here a list of factors that influence document ranking, nor should you count on a detailed explanation of the Yandex algorithm, which, according to Ilya Segalovich, Yandex's director of technology and development, could be extracted only "under torture" from Ilya Segalovich himself...

2. Concept and functions of a search engine

A search engine is a software and hardware complex designed to search the Internet and respond to a user request, specified as a text phrase (a search query), by producing a list of links to information sources ranked by relevance to the request. The largest international search engines are Google, Yahoo and MSN; on the Russian Internet they are Yandex, Rambler and Aport.

Let's take a closer look at the concept of a search query, using the Yandex search engine as an example. A search query should be formulated by the user to match what he wants to find, as briefly and simply as possible. Say we want to find information in Yandex on how to choose a car. To do this, we open the Yandex main page and enter the search query "how to choose a car". Then our task comes down to opening the links to information sources that are returned for our query. However, it is quite possible that we will not find the information we need. If that happens, either the query needs to be rephrased, or the search engine's database really contains nothing relevant to it (this can happen with very "narrow" queries, such as "how to choose a car in Arkhangelsk").

The primary goal of any search engine is to deliver to people exactly the information they are looking for. It is impossible to teach users to formulate "correct" queries, i.e. queries that conform to the operating principles of search engines. Therefore, developers create algorithms and operating principles that allow users to find the information they are looking for.

This means the search engine must “think” the same way the user thinks when searching for information. When a user makes a request to a search engine, he wants to find what he needs as quickly and easily as possible. Receiving the result, he evaluates the performance of the system, guided by several basic parameters. Did he find what he was looking for? If he didn’t find it, how many times did he have to rephrase the query to find what he was looking for? How much relevant information could he find? How quickly did the search engine process the query? How convenient were the search results presented? Was the result you were looking for the first or the hundredth? How much unnecessary garbage was found along with useful information? Will the necessary information be found when accessing a search engine, say, in a week, or in a month?

In order to satisfy all these questions with answers, search engine developers are constantly improving search algorithms and principles, adding new functions and capabilities, and trying in every possible way to speed up the operation of the system.

3. Main characteristics of a search engine

Let us describe the main characteristics of search engines:

  • Completeness

    Completeness is one of the main characteristics of a search system, which is the ratio of the number of documents found by request to the total number of documents on the Internet that satisfy the given request. For example, if there are 100 pages on the Internet containing the phrase “how to choose a car,” and only 60 of them were found for the corresponding query, then the completeness of the search will be 0.6. Obviously, the more complete the search, the less likely it is that the user will not find the document he needs, provided that it exists on the Internet at all.

  • Accuracy

    Accuracy is another main characteristic of a search engine, determined by the degree to which the found documents match the user's query. For example, if the results for the query "how to choose a car" contain 100 documents, 50 of which contain the phrase "how to choose a car" while the rest merely contain these words somewhere ("how to choose the right radio and install it in a car"), then the search accuracy is 50/100 = 0.5. The more accurate the search, the faster the user finds the documents he needs, the less "garbage" appears among them, and the less often the documents found fail to match the query.

  • Relevance

    Relevance (freshness) is an equally important component of search; it is characterized by the time that passes between the moment a document is published on the Internet and the moment it enters the search engine's index. For example, the day after an interesting piece of news appears, a large number of users turn to search engines with related queries. Objectively, less than a day has passed since the publication of news on the topic, but the main documents have already been indexed and are available for search, thanks to the so-called "fast database" of the large search engines, which is updated several times a day.

  • Search speed

    Search speed is closely related to its load resistance. For example, according to Rambler Internet Holding LLC, today, during business hours, the Rambler search engine receives about 60 requests per second. Such workload requires reducing the processing time of an individual request. Here the interests of the user and the search engine coincide: the visitor wants to get results as quickly as possible, and the search engine must process the request as quickly as possible, so as not to slow down the calculation of subsequent queries.

  • Visibility

4. Brief history of the development of search engines

In the initial period of Internet development, the number of its users was small, and the amount of available information was relatively small. For the most part, only research staff had access to the Internet. At this time, the task of searching for information on the Internet was not as urgent as it is now.

One of the first ways to organize access to network information resources was the creation of open directories of sites, links to resources in which were grouped according to topic. The first such project was the Yahoo.com website, which opened in the spring of 1994. After the number of sites in the catalog increased significantly, the ability to search for the necessary information in the catalog was added. In the full sense, it was not yet a search engine, since the search area was limited only to the resources present in the catalog, and not to all Internet resources.

Link directories were widely used in the past but have now almost completely lost their popularity, since even the largest modern catalogs hold information about only a negligible fraction of the Internet. The largest directory on the Web, DMOZ (also called the Open Directory Project), contains information about 5 million resources, while Google's search database consists of more than 8 billion documents.

In 1995, search engines Lycos and AltaVista appeared. The latter has been a leader in the field of information search on the Internet for many years.

In 1997, Sergey Brin and Larry Page created the Google search engine as part of a research project at Stanford University. Google is currently the most popular search engine in the world!

In September 1997, the Yandex search engine, which is the most popular on the Russian-language Internet, was officially announced.

Currently there are three main international search engines, Google, Yahoo and MSN, which have their own databases and search algorithms. Most other search engines (of which there are a great number) use the results of these three in one form or another. For example, AOL search (search.aol.com) uses the Google database, while AltaVista, Lycos and AllTheWeb use the Yahoo database.

5. Composition and principles of operation of the search system

In Russia, the main search engine is Yandex, followed by Rambler.ru, Google.ru, Aport.ru, Mail.ru. Moreover, at the moment, Mail.ru uses the Yandex search engine and database.

Almost all major search engines have their own structure, different from others. However, it is possible to identify the main components common to all search engines. Differences in structure can only be in the form of implementation of the mechanisms of interaction of these components.

Indexing module

The indexing module consists of three auxiliary programs (robots):

Spider is a program designed to download web pages. The spider downloads a page and extracts all the links from it; the HTML code of every page is downloaded in full. Robots use the HTTP protocol to fetch pages: the spider sends a request such as "GET /path/document" along with other HTTP headers to the server and in response receives a text stream containing service information and the document itself. For each downloaded page, the following are recorded:

  • page URL;
  • date the page was downloaded;
  • HTTP headers of the server response;
  • page body (HTML code).

Crawler (a "traveling" spider) is a program that automatically follows all the links found on a page. Its job is to determine where the spider should go next, based either on those links or on a predetermined list of addresses. By following the links it finds, the crawler discovers new documents that are still unknown to the search engine.

Indexer (robot indexer) is a program that analyzes web pages downloaded by spiders. The indexer parses the page into its component parts and analyzes them using its own lexical and morphological algorithms. Various page elements are analyzed, such as text, headings, links, structural and style features, special service HTML tags, etc.

Thus, the indexing module allows you to crawl a given set of resources using links, download encountered pages, extract links to new pages from received documents, and perform a complete analysis of these documents.
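As a very rough sketch of how the spider and crawler described above might cooperate, here is a toy breadth-first crawler built on Python's standard library; a real robot's scheduling, politeness rules and robots.txt handling are left out, and the start URL is only an example:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Collects href values from <a> tags while parsing HTML.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, limit=5):
    """Toy breadth-first crawl: fetch pages and queue newly discovered links."""
    queue, seen, pages = [start_url], set(), {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = urlopen(url).read().decode("utf-8", errors="ignore")
        pages[url] = html                      # page body handed on to the indexer
        collector = LinkCollector()
        collector.feed(html)
        queue.extend(urljoin(url, link) for link in collector.links)
    return pages

# pages = crawl("https://example.com")  # example URL, purely illustrative
```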

Database

A database, or search engine index, is a data storage system, an information array in which specially converted parameters of all documents downloaded and processed by the indexing module are stored.

Search server

The search server is the most important element of the entire system, since the quality and speed of the search directly depend on the algorithms that underlie its functioning.

The search server works as follows:

  • The query received from the user undergoes morphological analysis, and the information environment of every document held in the database is generated (it will later be shown as a snippet, i.e. the text corresponding to the query on the search results page).
  • The resulting data are passed as input to a special ranking module, which processes them for all documents; as a result, each document receives its own rating characterizing how relevant it is to the user's query, based on the various components of the document stored in the search engine index.
  • Depending on the user’s choice, this rating can be adjusted by additional conditions (for example, the so-called “advanced search”).
  • Next, a snippet is generated, that is, for each document found, the title, a short abstract that best matches the query, and a link to the document itself are extracted from the document table, and the words found are highlighted.
  • The resulting search results are transmitted to the user in the form of a SERP (Search Engine Result Page) – a search results page.
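Tying these steps together, a toy query pipeline might look roughly like this; it is a self-contained illustration with invented pages, not Yandex's actual implementation:

```python
# A toy end-to-end query pipeline: find candidates, rank them, build snippets.
documents = {
    "page1": "how to choose a car and not regret the choice",
    "page2": "the history of the yandex search engine in russia",
    "page3": "yandex search tips for beginners",
}

def handle_query(query, docs, top_n=10):
    words = query.lower().split()
    # Candidate selection: the document must contain every query word.
    candidates = [d for d, text in docs.items()
                  if all(w in text.lower().split() for w in words)]
    # Ranking: more occurrences of the query words -> higher score.
    ranked = sorted(candidates, reverse=True,
                    key=lambda d: sum(docs[d].lower().split().count(w) for w in words))
    # Snippet generation: here simply the first 60 characters of the page.
    return [{"doc": d, "snippet": docs[d][:60]} for d in ranked[:top_n]]

print(handle_query("yandex search", documents))  # the SERP, as a list of dicts
```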

As you can see, all these components are closely related to each other and work in interaction, forming a clear, rather complex mechanism for the operation of the search system, which requires huge amounts of resources.

6. Conclusion

Now let's summarize all of the above.

  • The primary goal of any search engine is to deliver to people exactly the information they are looking for.
  • Main characteristics of search engines:
    1. Completeness
    2. Accuracy
    3. Relevance
    4. Search speed
    5. Visibility
  • The first full-fledged search engine was the WebCrawler project, published in 1994.
  • The search system includes the following components:
    1. Indexing module
    2. Database
    3. Search server

We hope that our master class will allow you to become more familiar with the concept of a search engine and better understand the main functions, characteristics and operating principles of search engines.


We are not as unique as we think: millions of people before us have puzzled, and millions after us will puzzle, the search engine with almost identical questions. On the other hand, we are too unpredictable: the wording of our query is influenced by a huge number of factors we are not even aware of. And for that reason alone, each of our queries, no matter how banal, requires an individual approach.

In fact, the entire work of the Yandex search engine comes down to two simple things: understanding what a person really wants to know, and finding suitable answers among billions of documents on the Internet within a few seconds.

Take fingerprints

The search engine's operating system is somewhat similar to the Matrix, and the search robot (a complex program created by it that makes decisions on its own) is similar to Agent Smith.

In order not to search the entire Internet every time someone needs to know something, the search engine does part of the work in advance: it checks what is on the Web and where, using thousands of search robots. These come in two kinds, main and fast. The main robot crawls and processes the Internet as a whole, while the fast one handles documents that appeared a minute or even a couple of seconds ago. The robots' task is to select information that is suitable and useful for users and to process it, weeding out everything outdated and unnecessary. In some ways this is reminiscent of sorting garbage: paper in one container, glass in another, plastic in a third, food waste in a fourth...

The information collected by the robots forms the so-called cast (snapshot) of the Internet. It is stored on thousands of Yandex servers and is constantly updated. The snapshot is like a list that tells you where to find which information; in it, each keyword is associated not with one but with millions of "pages". To make all updates of the snapshot available to users, they are moved from the repository to the "base search". Data from the main robot is transferred every few days, and from the fast robot in real time.

Getting to the bottom of it

While searching for the answer to a question in its prepared database, the machine faces two main difficulties. The first is language. Before looking for an answer, the machine has to understand which language it is dealing with. For example, for a Russian speaker the query "Prince Igor's druzhina" will find documents about the prince's armed retinue, while for a Ukrainian the same query may also return documents mentioning Princess Olga, his wife, since in Ukrainian "druzhina" means "wife". And in the rich Russian language the same word form can mean different things: "стали", for instance, is both a form of the noun "сталь" (steel) and of the verb "стать" (to become). The second difficulty is human psychology. When entering a query, we expect a quick and accurate answer and naturally do not worry about whether our wording suits the mathematical analysis by which the machine's brain works. By typing the single word "Napoleon" into the search bar, what does a person want: a cake recipe, a biography of the French emperor, to buy cognac, or the address of a psychiatric hospital?


In such situations several technologies come into play. The system can show hints under the search bar that refine the query: choose what you need, say "Napoleon recipe" or "Napoleon Bonaparte". If the user does not respond to this prompt and does not add any words to "Napoleon", the "Spectrum" technology steps in: without waiting for help, the machine immediately searches for information in several categories at once (the cake, the emperor, even the horse...). In addition, personalization mechanisms help to understand the user: the machine knows what this user searched for a day, two days, or months ago, so if you often ask Yandex about cooking, it will first show you results where Napoleon is a cake.

Combinations: interest clubs

The task of a search engine is not simply to select documents that contain the words and phrases of the search query. The machine must understand which documents satisfy our contradictory requirements, and why. Do we want information about the Napoleon cake, or have we perhaps been going to a fitness club with that pretentious name for a couple of years, or are we even preoccupied with the complexes of short people? In any case, solving the problem requires a non-trivial approach.


The creators of the Yandex search program found this approach by delegating the right of choice to the machine. On the one hand, a soulless, but very fast and smart machine does not know and does not want to know anything about us as individuals, and on the other hand, it tries to find out as much as possible about everyone.

In addition to the geographic location of the user and linguistic analysis of his queries, the search engine uses several thousand criteria that are not at all obvious to humans.

The trick is that the machine develops and updates these criteria independently.

It simply uses data on the preferences and behavior of millions of users and relates this "arithmetic average" to the history of our own queries. The principles by which the Matrix compares the thousands of categories of user interests it has worked out often do not fit traditional human ideas of what "interests" can even be. There are tens of thousands of them, and they form varied, sometimes funny combinations. For example, one such combination might be that the search results match the interests of a person who breeds newts: not merely someone interested in newts, but someone already breeding them, and only in their first year of doing so.

Ratings. Helping hands


The Matrix, of course, decides for itself (with the help of higher mathematics) what to show users and in what order, based on tens of thousands of criteria. But the Matrix also uses living people: about 1,000 Yandex employees, the so-called assessors, evaluate the search results for particular queries (of course, not every query is evaluated, and this is not done in real time) to determine whether they meet the expectations of an ordinary user: someone not as rational as a machine, not as precise in formulation, contradictory and emotional.