Organization of information search on the Internet

MINISTRY OF EDUCATION AND SCIENCE OF RUSSIA

State educational institution of higher professional education

"RUSSIAN STATE HUMANITARIAN UNIVERSITY"

Branch of the Russian State Humanitarian University in St. Petersburg.

Saint Petersburg 2011

Introduction

1. The Internet as a modern source of information

2. Specificity of information in the student's educational activities

3. Features of using the Internet in the search for information for the student's educational activities

Conclusion

List of sources and literature

Introduction

Today a student can hardly manage without a personal computer. Acquaintance with computers begins at school, where pupils master the basics of computer technology and get to know educational websites. As a rule, by the time they enter a university many applicants are already comfortable with a computer, and most have one at home.

To make studying easier, students often turn to the Internet and download ready-made abstracts and essays. For a while they may get away with such an attitude to their studies. However, university study presupposes a more serious approach and requires mastering a range of specialized disciplines. In this sense the Internet is no longer a reliable source of information, and in some respects it is outright harmful.

The modern Internet has many social and cultural facets; it is a universal information medium. In this regard, the issue of the Internet as a source of information in the educational activities of a student is relevant.

The objectives of this work are to:

    Describe the Internet as a modern source of information.

    Reveal the specifics of information in the student's educational activities.

    Consider the peculiarities of using the Internet in the search for information for the student's educational activities.

1. The Internet as a modern source of information

According to wikipedia.org, the Internet (pronounced [internet]; English: Internet) is a worldwide system of interconnected computer networks built on the IP protocol and the routing of data packets. The Internet forms a global information space and serves as the physical basis for the World Wide Web and many other data transmission systems (protocols). It is often referred to as the "Worldwide network" and the "Global network". In everyday speech people sometimes simply say "Internet" 1.

Nowadays, when the word "Internet" is used in everyday life, most often it means the World Wide Web and the information available in it, and not the physical network itself.

Today, the Internet is becoming one of the main sources of information due to the gigantic amount of data located on the network and the ability to easily access it. At the same time, the search in the network is gaining more and more practical value, since with the rapid increase in the amount of available data, the procedure for finding the necessary information becomes more and more complicated 2.

The network contains a huge number of information resources. According to some estimates, the number of documents has exceeded 65 million and continues to grow rapidly 3. Such a volume of information requires a properly organized search process and special technological tools such as search engines. A simple keyword search usually returns from tens of thousands to several million links. Working through that many documents is practically impossible, and the results inevitably include information that is irrelevant to the query.

In addition to the problem of search, there is the problem of the reliability of information on the Internet. The ease of access and publication of data makes it possible to easily disseminate erroneous and often deliberately false information 4.

These two problems, search and reliability, determine the specifics of the Internet as a source of information.

2. Specificity of information in the student's educational activities

According to wikipedia.org, the term "information" comes from the Latin word informatio, which means "information, clarification, presentation" 5.

Currently, science is trying to find general properties and patterns inherent in the concept of "information", but so far this concept remains largely intuitive and receives various semantic content in various branches of human activity.

In everyday life, information is any data or facts that are of interest to someone, for example a message about some event or about someone's activity. To inform, in this sense, means to communicate something previously unknown.

Information is knowledge about objects and phenomena of the environment, their parameters, properties and condition, which reduces the degree of uncertainty and incompleteness of knowledge about them 6.

One and the same informational message (newspaper article, announcement, letter, telegram, reference, story, drawing, radio broadcast, etc.) may contain a different amount of information for different people depending on their accumulated knowledge, on the level of understanding of this message and interest in it 7.

Based on the foregoing, we can conclude that information in a student's educational activity should have a number of specific features.

1. Information should correspond to the student's degree of preparedness and level of knowledge. Too high a level of difficulty reduces comprehension and lowers the student's motivation. Too low a level reduces the information content and negatively affects the effectiveness of the learning process.

2. The information used by the student must be up-to-date, i.e. correspond to the modern level of scientific knowledge and the development of society.

3. The information used by the student must be accurate.

4. Information should be accessible, that is, catalogued and searchable.

3. Features of using the Internet in the search for information for the student's educational activities

The modern student, armed with a personal computer, knows very well what can be found on the Internet and where. He skilfully finds online everything he needs to produce the next compulsory piece of work: an abstract, an essay, a course project, a diploma thesis, and so on. After a little revision, which often consists only of adding his name and group number, he prints it out and hands "his work" in to the teacher 8.

At the same time, his laziness grows many times over, and this approach reduces the likelihood of success in a future career. It should be noted that the practice of copying, which is essentially plagiarism, is much more common in Russia than in the West, which reduces graduates' chances of getting a prestigious job in competition with graduates of Western universities.

To succeed in this competition, one should learn to process colossal amounts of information, to review samples of written work and note their strengths and weaknesses, and to "dissect" someone else's text in order to isolate its most essential part. Based on the resulting skeleton, the student should learn to create the required work. Essentially the same work was done in libraries with books before the Internet boom 9. The teacher's role is also important here: the teacher should guide the student competently, not prohibiting the use of the Internet but pointing out possible pitfalls and explaining how to use it. For example, to narrow the search, the teacher can recommend particular information resources, thereby matching the teaching material to the student's level of preparation; in addition, the teacher can help filter out false and incorrect information.

In the modern information society, the role of the teacher is increasing. Teachers of the "old school", for example, may give the same lectures for years without taking any interest in the latest achievements in their field. Meanwhile, a student with any mobile device connected to the Internet can catch any teacher off guard. The teacher is no longer perceived as the only source of knowledge. At any moment a student with Internet access can correct the teacher, criticize him, or pose a question he cannot answer. The teacher must be ready for this; it is the challenge that modern society poses to the modern education system. The teacher should not get angry, avoid answering, or make up an answer on the fly. Whereas the teacher-student relationship used to be built on the senior-junior principle, it should now be closer to the principle of the Internet: peer-to-peer.

There is one more danger inherent in the mobility of the Internet: there is no longer any need to memorize anything. Why bother, if you can always ask Yandex? To avoid this trap, the student must complete all the teacher's assignments and must not be lazy about taking notes, memorizing and learning. It is the store of knowledge in a person's memory that forms his general erudition and his ability to solve applied problems in a given subject area. The extreme form of this mobility effect is that a student who encounters an unfamiliar term says to himself: "I can look up the meaning of this term on the Internet at any time. I have no time now, I will look later." This is how gaps in education arise. Before the Internet era, a student would have reasoned differently: "I can look up the meaning of this term in a dictionary (textbook, encyclopedia, ...). I have no time now, but I will have to look it up and remember it, because I cannot carry a dictionary around all the time."

From the point of view of self-education, the student, and indeed society as a whole, faces the problem of an information crisis 10. The information crisis consists in the contradictory unity of "information hunger" and "information explosion", that is, in a lack of information amid its overproduction 11. The amount of information in a given area of human activity exceeds the capacity of the human brain 12. Therefore, the need to systematize information and filter out information noise increases. The student should use trusted sources recommended by the teacher or listed in the textbook bibliography.

The growth rate of information is measurable. The librarian R. Burton and the physicist R. Kebler from the USA introduced the concept of the "half-life" of scientific articles, by analogy with the half-life of radioactive substances. The half-life of publications is the time during which half of all currently used literature in a given field or on a given subject was published 13. For example, if the half-life of publications in physics is 4.6 years, then 50% of all currently used (cited) publications in this field are no more than 4.6 years old. Although this definition gives a numerical estimate of the aging of information, such an estimate must be treated with caution; in the final analysis, each specialist determines the depth and age of material needed in each specific case 14. The supervisor will help the student determine how relevant a piece of information is.
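As a rough illustration of the half-life idea, the sketch below computes the age within which half of a set of cited publications fall, that is, the median citation age. The list of publication years and the function name are invented for the example and are not taken from the cited sources.

```python
from statistics import median

def citation_half_life(cited_years, current_year):
    """Return the median age (in years) of the cited publications.

    Half of the cited publications are no older than this value,
    which corresponds to the half-life definition given in the text.
    """
    ages = [current_year - year for year in cited_years]
    return median(ages)

# Hypothetical publication years of the works cited in a 2011 paper.
cited = [2010, 2009, 2009, 2007, 2006, 2003, 1998, 1990]
print(citation_half_life(cited, current_year=2011))  # 4.5
```

With these made-up data, half of the cited works are at most 4.5 years old, so a reader would treat older material in this field with some caution.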

Another feature of information on the Internet is its scattering across a set of sources, described by Bradford's law 15. Put simply: one third of the scientific articles on a specific topic are published in a small number of sources directly devoted to this topic. The next third is published in a larger number of sources related to the topic. The last third is published in sources that have nothing to do with the topic. According to Bradford, the numbers of sources in these three zones are in the ratio 1 : n : n². Given this pattern, complete coverage of a specific topic is impossible if the researcher is limited to the core sources on the issue and does not turn to special information, reference and bibliographic services. In most cases the first third will be enough for the student; however, for deeper work, such as term papers in specialized disciplines or a thesis, the student needs to turn to electronic catalogs of this kind.
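The scattering described by Bradford's law can be illustrated with a small sketch. It ranks sources by the number of articles they published on a topic and cuts the ranking into three zones that each hold about a third of the articles. Bradford's law is usually stated as a ratio of 1 : n : n² between the numbers of sources in the zones. The data and the function name below are invented for the example.

```python
def bradford_zones(articles_per_source):
    """Split sources into three zones holding roughly equal shares of articles.

    Sources are ranked by productivity; a zone is closed once the running
    total of articles reaches the next third of the overall total.
    Returns the number of sources in each of the three zones.
    """
    counts = sorted(articles_per_source, reverse=True)
    total = sum(counts)
    zone_sizes = [0, 0, 0]
    zone = 0
    running = 0
    for count in counts:
        zone_sizes[zone] += 1
        running += count
        # Integer comparison avoids rounding issues with thirds.
        if zone < 2 and running * 3 >= total * (zone + 1):
            zone += 1
    return zone_sizes

# Hypothetical data: 3 core sources, 9 related sources, 27 peripheral sources.
data = [27] * 3 + [9] * 9 + [3] * 27
print(bradford_zones(data))  # [3, 9, 27]
```

In this made-up data set each zone contributes 81 articles, but the number of sources needed grows as 3, 9 and 27, a ratio of 1 : 3 : 9. This is why the last third of the material is so much harder to collect than the first.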

Although Internet users' access to information resources is not limited by state borders, linguistic boundaries remain. The predominant language of the Internet is English. Chinese is the second most popular language and Spanish the third. Russian ranks 9th 16. A student who speaks foreign languages, primarily English, therefore gains access to much more information. As for the distribution of information on the Internet, different areas of human activity are not represented evenly in terms of volume. There is more technical information related to programming, information technology and computer design, and less information related to the humanities. This can be explained by the fact that technical specialists are connected with information technology and the Internet by the nature of their work, and therefore publish more material.

Conclusion

Summing up the aspects of the Internet considered here as a source of information in the student's educational activities, we can highlight the main features and recommendations.

    The student must be able to use the Internet and constantly improve his skills in using it.

    When relying on information from the Internet, the student should check its reliability and relevance.

    To search for information on a given topic, it is advisable to use specialized electronic bibliographic catalogs.

    To use the Internet more efficiently, the student should improve his English as the most common language on the Internet.

    In responding to the challenges of the information society, the student must be able to process large amounts of data, extracting key information from them and filtering out redundant and unnecessary data.

    The Internet not only provides great opportunities for obtaining information but is also fraught with danger in the form of a cheat sheet, which often does the student a disservice in the learning process.

It should be noted that on points 2 and 3 the student must work in direct contact with his supervisor.

List of sources and literature

Literature

    Blumenau, D. I. Information and information service / D. I. Blumenau. - L.: Nauka, 1989. - 192 p.

    Galeeva, I. S. Internet as a tool for bibliographic search / I. S. Galeeva; scientific ed. M. I. Vershinin. - SPb.: Professiya, 2007. - 248 p.

    Efimov, A. N. Information explosion: real and imaginary problems / A. N. Efimov. - M.: Nauka, 1985. - 160 p.

    Information search on the Internet: textbook / V. I. Averchenkov, V. V. Miroshnikov, S. M. Roshchin and others; Bryansk State Technical University. - Bryansk, 2001. - 28 p.

    Kuzin, F. A. A handbook for graduate students and applicants for academic degrees / F. A. Kuzin. - M.: Os-89, 1999. - 208 p.

    Kuznetsov, I. N. Internet in educational and scientific work: a practical guide / I. N. Kuznetsov. - 2nd ed. - M.: Publishing and Trade Corporation "Dashkov and Co", 2005. - 192 p.

    Kuznetsov, I. N. Textbook on information and analytical work / I. N. Kuznetsov. - M.: Yauza, 2001. - 320 p.

    Mikhailov, O. A. New in Internet search according to the sources of 2000 / O. A. Mikhailov; Russian State Archive of Scientific and Technical Documentation. - M.: Max Press, 2001. - 171 p.

    Parshukova, G. B. Methods of searching for professional information: study guide / G. B. Parshukova. - SPb.: Professiya, 2009. - 224 p.

    Solomenchuk, V. G. Internet: short course / V. G. Solomenchuk. - SPb.: Piter, 2001. - 322 p.

Internet resources

    Url: Internet

    URL: http://ru.wikipedia.org/wiki/Information

1 URL: http://ru.wikipedia.org/wiki/Internet

Objective: to study the principles of organizing search on the Internet and to acquire practical skills in composing search queries.

2.1 Simple techniques for finding Web pages

Simple search techniques do not rely on the powerful search tools of the Internet; they are based on knowledge of how symbolic domain names are formed and on intuition.

Search for commercial Web sites. To get the address you are looking for, append the domain .com to the name of a firm, enterprise or organization, or to a simple English noun (keyword), and put www. in front. Web pages with the top-level domain .com in the address most often contain information in English.

Example 1. Take the name of the company SONY, append the domain .com and put www. in front to get the address of the SONY Web page: www.sony.com. Similarly, you can get:

www.cnn.com - CNN World News;

www.mtv.com - MTV music news;

www.- COSMOPOLITAN magazine.

If you enter a keyword in the address bar of Internet Explorer and press Ctrl+Enter, the browser will try to navigate to the exact URL, automatically adding the protocol name and the Web prefix (http://www.) and the top-level domain .com. For example, if you type me in the address bar and press Ctrl+Enter, Internet Explorer will try to open the Web site at http://www.me.com. If the site does not open, no site is available at that address.

Search by region. The method above also works for Russia and other regions. In this case the regional top-level domain (the two-letter country code) is appended to the keyword to give the address of the Web page. For example, to search for Russian servers, you can try appending the domain .ru to a keyword.

Example 2. The server www.audi.com is known to exist. You can try to find its Russian branch by replacing the domain .com with the domain .ru: www.audi.ru.

Search for large educational institutions. The domain .edu (mainly for the American and European regions) is appended to the name or abbreviation of the institution, which usually gives the correct address.

Example 3. Take OXFORD University, append the domain .edu and put www. in front to get the address of the OXFORD University Web page: www.oxford.edu. Often the address of an educational institution's website does not contain the domain .edu. The registered second-level domain (or domain alias) can be the abbreviated English name of the institution. To search for a Russian educational institution, you can take its English abbreviation, for example MSU (Moscow State University), and append the domain .ru: www.msu.ru is the address of Moscow State University.

Often the URL of a Web page contains the domain name of the Internet service provider on whose computer the page is hosted, for example www.kgtu.runnet.ru, the address of the Krasnoyarsk State Technical University, where ***** is the domain name of the Internet service provider.

Many countries have a registered second-level domain for educational institutions. For the UK, for example, this is the ac (academic) domain. Any Web page can have several alias addresses, all of which lead the user to the same Web page. For OXFORD University, for example, these are the addresses www.ox.ac.uk and www.oxford.edu.

Other Web page searches. You can combine keywords and top-level domains to find the sites of government (.gov), military (.mil) and other organizations (.org). For example, the address of the White House of the US government is www.whitehouse.gov.
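The address-guessing techniques of this section amount to simple string assembly, which can be sketched as follows. The function names and the default list of domains are assumptions made for the example. The code only builds candidate addresses and does not check whether a site actually exists at any of them.

```python
def candidate_addresses(keyword, domains=("com", "ru", "edu", "org", "gov")):
    """Build candidate Web addresses of the form www.<keyword>.<domain>."""
    name = keyword.strip().lower().replace(" ", "")
    return [f"www.{name}.{domain}" for domain in domains]

def ctrl_enter_url(keyword):
    """Mimic the Ctrl+Enter expansion described above: http://www.<keyword>.com."""
    return f"http://www.{keyword.strip().lower()}.com"

print(candidate_addresses("Sony", domains=("com",)))       # ['www.sony.com']
print(candidate_addresses("audi", domains=("com", "ru")))  # ['www.audi.com', 'www.audi.ru']
print(ctrl_enter_url("me"))                                # http://www.me.com
```

Each candidate still has to be opened in a browser to see whether it leads to the organization in question.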

2.2 Internet search engines

The Internet has powerful means of searching for any information: documents, images, programs, Web pages, etc. Searches are carried out in so-called search engines, which are also called search systems or information retrieval systems. There are many search engines on the Internet. The best-known ones are shown in Table 2.1. A list of links to various search engines is available on the Web page www.monk.newmail.ru.

Table 2.1 - Most Popular Search Engines

Search engine name        Address
Yandex (Russian)          http://www.*****
Rambler (Russian)         http://www.*****
Aport (Russian)           http://www.*****
Yahoo! (English)
AltaVista (English)
Google (Russian)          http://www.*****

A search engine is implemented as a Web page with a regular address. It contains a so-called search box and a Search button, and may also contain a thematic resource catalog, links to popular pages, etc.

To open a search engine, enter its address in the address bar of the Internet browser. After the search engine has loaded, enter a query in the search box: a line of text (in any language) containing the key phrase of the documents you are looking for on the Internet, and click the Search button. For a more effective search, the query should contain words or a phrase that will appear on the Web page or in the document being sought (they need to be "guessed"). After a while, the screen will display a list of addresses of Web pages containing links to the documents you are looking for, usually accompanied by comments. By clicking a link, you can go to any of the found documents.

To go to the next page of the list of found documents, click the corresponding number (1, 2, 3, ...) in the main window with the search results. Usually the first ten documents found match the query most closely.

The basis of any search engine is a special program, a network robot or spider; the names worm and crawler are also sometimes used. The search engine sends such "spiders" out across the Internet; they visit as many of the Web pages available on the Internet as possible and record their addresses (URLs) and contents in the search engine's database. After the user enters a query and clicks the Search button, the search engine scans the database and displays the search results.
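The way a spider fills the database and the way a query is then answered from it can be sketched on a toy scale. The three "pages" below, their addresses and the function names are invented for the example. A real spider downloads pages over the network, and real search engines are far more elaborate.

```python
from collections import deque

# A tiny hypothetical "Web": address -> (page text, links found on the page).
PAGES = {
    "a.example": ("search engines index pages", ["b.example", "c.example"]),
    "b.example": ("a spider follows links", ["c.example"]),
    "c.example": ("index of links and pages", ["a.example"]),
}

def crawl(start, pages):
    """Visit every page reachable from `start` and build a word -> addresses index."""
    index = {}
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        text, links = pages[url]
        # Record the page address under every word it contains.
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
        # Follow links to pages that have not been visited yet.
        for link in links:
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return index

def search(index, word):
    """Return the sorted addresses of pages that contain `word`."""
    return sorted(index.get(word.lower(), set()))

index = crawl("a.example", PAGES)
print(search(index, "index"))   # ['a.example', 'c.example']
print(search(index, "spider"))  # ['b.example']
```

The point of the design is that the query is answered from the prepared index rather than by visiting pages at search time, which is why results appear quickly but may lag behind the current state of the pages.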

In addition, almost all search engines allow you to register your own page. To do this, on the page of a major search engine such as YAHOO!, open the registration mode and enter the URL and a description of your page. The search engine will then pass your registration information on to other major search sites, which in turn pass it on to others, and so on. There are also global registration servers.

Search directories are available, for example, on the search engines Rambler, Yahoo!, AltaVista, etc. To search a catalog, select topics with the mouse, deepening and narrowing the search range until the list of displayed links is reduced to a few pages that can be browsed manually, or to a group large enough to run a normal search in (for example, in the Yandex search engine: Study > Higher education > Moscow State University).

2.3 Rules for executing queries in search engines

When executing queries, certain rules apply; they may differ somewhat between search engines, but the basic actions are similar. The rules can always be found on the Web page of a specific search engine in the Help section (this section may also be called How to search, Search tips, Query rules, etc.). The rules usually describe a query language for advanced search.

The simplest rule for all search engines is to specify any phrase and click Search.

In the next section, we will consider some of the rules for executing queries using the Yandex system as an example. Many of these rules apply to other search engines as well. Examples of queries are taken from the help pages of the Yandex search engine.

2.4 Examples of simple queries in the Yandex search engine

Typically, a query is just one or more keywords, for example: company microprocessors Intel. For such a query, documents are found that contain all the words of the query. Some words in the query (conjunctions, prepositions, etc.) are ignored, since they carry no meaning of their own. For example, the query apples on the snow finds all documents that contain both words, "apple" and "snow" (however, the order in which they are displayed in the list may differ). Where the words are located within the document and in what grammatical form does not matter. The preposition "on" is ignored. Therefore, the above query can also be written as snow on the apple. The search result will be the same.

An important and very useful property of search engines: whatever grammatical form of a word you use in a query, it is searched for in documents in all its forms. For example, the query man walked will find, among others, documents containing the text "people are walking". Recognition of all word forms works for ordinary Russian words. It is not performed for exotic words, neologisms, etc.
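The query rules just described (ignored function words, all remaining words required, word forms treated alike) can be sketched as follows. This is an illustration only and not Yandex's actual algorithm. The stop-word list, the toy table of word forms and the sample documents are assumptions made for the example; a real engine uses full morphological analysis.

```python
# Hypothetical stop words and a toy table mapping word forms to a base form.
STOP_WORDS = {"on", "in", "the", "a", "and"}
BASE_FORMS = {"apples": "apple", "snows": "snow", "walked": "walk", "walking": "walk"}

def normalize(text):
    """Lowercase the text, drop stop words and reduce each word to its base form."""
    words = text.lower().split()
    return {BASE_FORMS.get(w, w) for w in words if w not in STOP_WORDS}

def matches(query, document):
    """A document matches when it contains every meaningful word of the query."""
    return normalize(query) <= normalize(document)

docs = [
    "Apples lay on the snow",
    "The apple tree in spring",
    "Fresh snow and one red apple",
]
found = [d for d in docs if matches("apples on the snow", d)]
print(found)  # ['Apples lay on the snow', 'Fresh snow and one red apple']
```

Because the query is reduced to a set of base forms, the word order and the preposition make no difference: the query snow on the apple selects the same two documents.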

Yandex operators, their purpose and examples of use can be found in the help section of the system.

You can use the advanced search capabilities on the Advanced Search page to visually create complex queries.

1. Study the theoretical material.

2. Compose the website address of a world-famous company (Intel, IBM, Sony, etc.) and open it in Internet Explorer. Save the found Web pages in a separate folder.

3. Using the same technique, go to the website of St. Petersburg State University, and in the same way open the website of the Department of Applied Mathematics of the same university. Save the found Web pages in a separate folder.

4. In each search engine (Table 2.1), perform several queries on topics that interest you, and open the found documents.

5. Try searching through thematic directories.

6. Using the advanced search on Yandex, compare the popularity of the following sites by the number of pages linking to them: the President and the Government of the Russian Federation; Moscow State University and St. Petersburg State University; the Hermitage and the Louvre. Save the found Web pages in a separate folder. Create a text file and record in it the number of links to each of them.

7. Find information about when and where the specified person was born. List his works. Find pictures of him at different periods of his life. Save all information in a separate folder.

8. Search the Internet for information on the chosen topic of your course work. Based on the search results, create a table in the Word text editor following the sample (Table 2.3) and fill it in.

Table 2.3 - Sample search results report

No.     Characteristics of search results
        URL of the found resource        Short description of the resource

9. Invite the teacher to review your progress.

10. Delete the files saved during work from the working folder.

2.6 Test questions

1. Describe the simple techniques for finding information on the Internet.

2. What are the operating principles of Internet search engines?

3. Formulate the basic rules for composing search queries.

4. Which of the search engines you have considered have the ability to use the query language?

5. Which of the search engines you have considered have a thematic catalog of resources?

6. Which of the search engines you have considered have search capabilities in various categories of information resources?

Whoever owns information owns the world. For a manager's assistant, the desire to possess certain knowledge is dictated not by vanity or plans to conquer the world, but most often by professional necessity. The ability to obtain useful data is undoubtedly one of the key skills in the work of a personal assistant, since accurate or necessary information does not always lie on the surface.

What is confidentiality of information?

According to clause 7 of Art. 2 of Federal Law No. 149-FZ of 27.07.2006 "On Information, Information Technologies and the Protection of Information" (as amended on November 24, 2014), confidentiality of information is a mandatory requirement for a person who has gained access to certain information not to transfer such information to third parties without the consent of its owner. Another notion of confidentiality is the inaccessibility of information to a certain circle of users. One way or another, this is information that is passed on only according to rules established by its owner, be it a legal entity (an organization) or an individual (an ordinary citizen who has become of interest to us for some reason). So, on the one hand, confidentiality protects the owner of the information, and on the other hand, it creates obstacles for those who are interested in accessing it.

Information puzzle

One of the special properties of information is that it "lives", i.e. it is passed on in one way or another, using different methods and tools. For this reason, even confidential information, which often cannot be obtained through an official request, ends up in the public domain because of its owner's carelessness or negligent attitude to his data. Today, given the active use of modern technical devices and the Internet, pieces of the mosaic that make up the owner's overall information profile are most often scattered chaotically across the Internet. To hide data, one has to do it deliberately, and one needs certain skills and abilities. Besides, would it ever occur to anyone that someone might conduct an information investigation into his person or organization?

In short, to obtain the necessary information, a manager's assistant only needs to have Internet access, make the necessary queries, collect the data and make full use of his analytical skills.

Spy motives

Lack of information is the main motive for replenishing one's stock of information. It is well known that actions taken when information is lacking can lead to unpleasant consequences. Goal-setting plays an important role in an "information investigation": on the one hand, in defining the expected result, and on the other, in choosing the sources in which to look for the necessary data. In the course of his work, a manager's assistant may receive various instructions from his superiors to search for information of any kind. The list of such instructions is individual and probably has no limits. However, it is possible to identify the main situations in which it will be useful for the manager's assistant to collect additional information.

  • Interview. Changing jobs and looking for a new one require responsibility and careful analysis of the data received about the employer. It happens that according to the results of one or several stages of the interview, there is not enough data to make an informed decision "for" or "against". Or due to the fact that the employer did not provide the necessary materials due to lack of time or simply not attaching importance to them, or because of the desire to deliberately hide them. In any case, company representatives are unlikely to be ready to answer "delicate" private questions honestly during an interview, for example, questions about salary delays, staff turnover or related common problems In the organisation.

If the initial data is available, the manager's assistant is recommended to find as much useful information about the company as possible before the interview: on the one hand, to hedge and ask the necessary questions, on the other, to be able to show professionalism and show off awareness and preparedness for the meeting.

  • Professional tasks. The activities of modern organizations inevitably involve cooperation with one another. Each firm has partners, customers, contractors, etc. For example, before concluding an agreement with a company for the supply of products or the provision of services, a lawyer requests a package of documents for verification; the package may be minimal or as complete as possible, depending on your organization's requirements for counterparties. Companies are not always checked by a lawyer; in some cases this is done by the manager's assistant on behalf of superiors. Therefore, finding information about a new company or its management may be part of the duties of a personal assistant.
  • Personal and professional contacts. The assistant manager communicates with a large number of people every day (colleagues, contractors, new acquaintances at work or in private life). Sometimes it is necessary to collect additional information about a person, for example, when hiring a new employee: where they worked before, what their hobbies are, whether there are any blemishes in their professional biography, etc. As for personal contacts, additional knowledge will not be superfluous either, since in most cases people tend to conceal personal information about themselves (at best out of ordinary human suspicion, at worst because there really is something to hide).

Large organizations often have a so-called security service. It professionally searches for information about individuals or organizations when this is required to ensure the business, economic or industrial security of the company. As a rule, the specialists of this service have their own resources for making inquiries and collecting data. If your organization has a security service, it is recommended that you contact its specialists to obtain the necessary information from reliable sources.

Initial data

When conducting an "information investigation", no detail is too small or superfluous. When information is scarce, any informational "hook" is a clue that helps to find useful data step by step. The "hooks" for finding the necessary materials on the Internet are correctly formulated queries, as well as any initial data that the assistant manager currently possesses. Even what seems at first glance the most modest piece of information will be enough to start the search.

A query by the name of the organization:

  • will give information about the name of the company's website;
  • will allow you to get contact information;
  • will provide search results based on data from news and advertising resources;
  • will give information about the field of activity, registration data, location, etc.

A query by the full name of the head of an organization or of a private person:

  • will help to get information about the name of the company and the field of activity;
  • will allow you to get acquainted with the information of advertising, business, news resources;
  • will provide search results for resumes, biographies, reference materials;
  • will provide information about the "presence" in business and entertainment social networks, etc.

A query by company phone number or mobile phone number:

  • will allow you to get information about the company if it is an office phone number;
  • will give information about the belonging of a mobile phone number to a certain region of Russia;
  • will provide search data for advertising sites, advertisements, posted vacancies and offers of an organization or individual, etc.

Note. The initial data listed above can be considered the basis for further information collection. The results of these queries should be used as data for subsequent queries. For example, if initially only the phone number of the organization was known, the results of that query can give the name of the organization, and the name in turn can lead to information about its managers and founders.
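
The chaining described in the note can be sketched as a small loop in which every newly found value becomes the next query. This is only an illustration: the `FAKE_INDEX` table and the `lookup` function are hypothetical stand-ins for a real search engine, and the phone number, names and site are invented.

```python
from collections import deque

# Hypothetical stand-in for a search engine: maps a query to the new
# facts it reveals. All values here are invented for illustration.
FAKE_INDEX = {
    "+7 000 000-00-00": ["Example Magazine LLC"],
    "Example Magazine LLC": [
        "Ivanov I. I. (general director)",
        "example-magazine.test",
    ],
    "Ivanov I. I. (general director)": ["Example Holding LLC"],
}


def lookup(query):
    """Return the facts found for a query (empty list if nothing is found)."""
    return FAKE_INDEX.get(query, [])


def investigate(initial_data, max_queries=10):
    """Use each found fact as the next query, breadth-first, without repeats."""
    seen = {initial_data}
    queue = deque([initial_data])
    found = []
    queries_made = 0
    while queue and queries_made < max_queries:
        query = queue.popleft()
        queries_made += 1
        for fact in lookup(query):
            if fact not in seen:
                seen.add(fact)
                found.append(fact)
                queue.append(fact)
    return found


if __name__ == "__main__":
    for fact in investigate("+7 000 000-00-00"):
        print(fact)
```

The `max_queries` limit is a design choice of this sketch: without it, a chain of mutually linked results could keep the search going far longer than the task requires.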

Let's consider an example of information retrieval and its use.

In the search engine, enter the initially known phone number. We get the following result (Fig. 1):

Next, we enter the name of the organization as a new query and get several sites with reference information about organizations. In this case, we look at the results on the site rusprofile.ru (Fig. 2).

When can this information be useful?

  • The manager's assistant was tasked with contacting the editorial board of the magazine to place advertisements;
  • the assistant manager is instructed to prepare an official letter addressed to the general director, but the director's name was not initially known;
  • the manager received a call, caller ID showed the phone number and the name of the contact person, and the assistant was asked to find out which company had called.

Search engines, as a rule, return many results with links to various resources and sites for queries by personal name, phone number and company name. The manager's assistant is advised to read the reference materials carefully and separate useful data from "spam", paying special attention to the source of the information: the official website of the organization will be more reliable than, for example, an advertising directory.

Thus, having a minimum of initial data and skills in working with Internet search engines, the assistant manager can find the information required at the present time or obtain additional data for further inquiries and the continuation of the "information investigation".

"Elementary, Watson!"

On the Internet, every user leaves "traces" in one way or another, and information they have ever posted leaves "tails". So, by entering queries in the search bar of Google, Yandex or other search engines, the assistant can find information:

  • about advertisements posted by the user on job-search and recruitment sites, private advertisements for sale or purchase, and advertisements for services offered or required;
  • about published news of the organization or its officials, and about participation in business events, exhibitions and other company activities;
  • about CVs and biographies, if a private person is concerned;
  • about presence in social networks and groups;
  • about reviews of the company's products or of the company as an employer, and much more.

Even when job or recruitment postings, advertisements and news are no longer relevant, their authors do not always delete them, whether out of forgetfulness or because they see no need.
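
Queries of this kind can also be prepared as ready-made search links. The sketch below builds such links with the standard library only; the base addresses and parameter names (`q` for Google, `text` for Yandex) are an assumption about the engines' public search pages and may change, and `build_query_url` is a name introduced here for illustration.

```python
from urllib.parse import urlencode

# Assumed query-URL patterns of the two search engines named in the text.
ENGINES = {
    "google": ("https://www.google.com/search", "q"),
    "yandex": ("https://yandex.ru/search/", "text"),
}


def build_query_url(engine, value, exact=True):
    """Build a search URL; exact=True wraps the value in quotes
    so that the engine looks for the phrase as a whole."""
    base, param = ENGINES[engine]
    phrase = f'"{value}"' if exact else value
    return f"{base}?{urlencode({param: phrase})}"


if __name__ == "__main__":
    # Invented initial data: a company name and a phone number.
    for item in ("Example Magazine LLC", "+7 000 000-00-00"):
        for engine in ENGINES:
            print(build_query_url(engine, item))
```

Quoting the value is useful for phone numbers and full names, where partial matches mostly produce noise.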

What and where are we looking for? How do we use it?

Depending on the information task the personal assistant has to solve (collecting the most complete information, or only checking certain data about an organization or an individual), different effective methods of Internet search can be chosen. To use them, you again need initial data: the name of the company, contact details, or the full name of the manager or private person will be enough (see table).

Type of information and examples of its use

Type of information

Examples of use

Full name of the company and its activities

The full name, organizational and legal form and officially registered types of activities performed are basic information about the company. It will be useful both when looking for a new job to collect data about the employer, and in the work of a personal assistant to check the activities of a partner or counterparty.

Unscrupulous employees of organizations often present their companies as "large" or even "international", while in fact these turn out to be ordinary individual entrepreneurs whose registered activities do not include the work your company planned to engage them for.

Date of registration with government bodies

The date on which an enterprise was registered with the state authorities matters if your organization requires a counterparty or partner to have many years of experience in a particular field.

For example, if a company told you that it has been on the market for more than 10 years, but in fact was registered several months ago, this may raise doubts about the solidity and reliability of the company.

Information about the duration of the operation of the enterprise will be useful both in the professional work of an assistant manager, and in the event that it is necessary to collect the most complete data about the new employer.

Information about the founders and leaders of the organization

Information about the management and founders of the company may include the number of founders, their full names and participation shares, and the number of managers and their full names. This data is useful because it can serve as a basis for subsequent searches. For example, knowing the founders and managers of the company, you can run further queries by their full names and check whether these individuals participate in other organizations. This shows how "large" the founders are: if their participation shares are large, they may also be investors in several companies. Based on the search results, you can go further and study the activities of the newly found companies to build the clearest possible picture of the participants' business activity.

In addition, the data obtained on founders and directors can be analyzed from different points of view. For example, if the same surname recurs in the list of persons, the company is probably a family business. If foreign names appear, the company may have connections with foreign partners or parent organizations.

Company addresses and telephone numbers

The need for contact information is difficult to overestimate. Its main role is to provide a way to contact the organization or its individual employees. However, sometimes it makes sense to run additional queries by the organization's address and phone numbers.

It happens that several legal entities are located at the same address. These are often both subsidiaries and third parties. In addition, registration at a shared address is often used by unscrupulous organizations that do not have the funds to rent a full-fledged office. In that case a legal address is bought under certain conditions, but the company is not actually located there.

A query by the company's phone number may bring up the number in various search results. Read the results carefully to see whether they indicate, for example, that the phone number appears on blacklists of employers.

The address of the official website of the enterprise

The address of the company's official website is a very informative resource:

  • if the organization does not have a website, then the company probably does not have the funds to create it, or it was created relatively recently;
  • the amount of information provided by the company matters: the presence or absence of data on managers, employees, news, information about partners or customers, etc.;
  • the creation date of the site and its interface show how long the company has been online and how professionally the site was built, both technically and in terms of design;
  • presence or absence of contact details (see above);
  • the website's domain address contains additional information that can be used for further verification.

The size of the authorized capital of the company

The size of the authorized capital of the organization is of great importance. There is a statutory minimum amount of authorized capital, and many companies limit themselves to it when registering. However, if a company supplies your company with products or services worth tens or hundreds of times more than its authorized capital, remember that in the event of bad-faith work or delivery your organization will be able to recover only what the company actually has in that capital.

Debts to tax authorities

The data on the absence of debts is only a plus and speaks of the company as a conscientious taxpayer. However, the presence of debts to the tax authorities, as well as the size of these debts, must be taken into account.

When applying for a job, it will be useful to know if the company has debts to the Pension Fund.

When a company cooperates as a counterparty or partner, debts can become an indicator of its unfair business approach or unprofitable position.

Participation in litigation

Information about participation in legal proceedings is important, but it is necessary to look at what the proceedings were about. Non-payment of a fine for incorrect parking of a company car is one thing; labor disputes or other serious cases are another. It is also important to consider whether the company acted as plaintiff or defendant.

Disqualification of an official

It happens that officials of organizations, whether they are managers or other officials, are disqualified in court in accordance with the law. This information will be useful for analyzing new partner companies or when applying for a job. This is especially important for the manager's assistant, whose work is directly related to the professional activities of the bosses.

Email address of the contact person

The email address of an organization or its employee allows one to judge its "solidity". As a rule, modern organizations use corporate email addresses hosted on their own domains (the domain is the part of the address after the @ sign), both for information security and as a sign of corporate culture. If the company's address is on a public server, for example mail.ru or yandex.ru, it makes sense to check it additionally by entering the address as a query in the search bar. The search results show in which articles the address was used, in which advertisements it was given, and whether it appears on users' blacklists.

Cell phone number

A mobile phone number can show whether it belongs to a particular company, that is, whether it appears in search results in advertisements placed on behalf of a legal entity. It may also appear on private classified sites, in which case it is useful to read those advertisements. Pay attention to whether the number is on any user blacklists and whether there are any comments about it.

In addition, a mobile phone number can reveal which telephone operator and region it belongs to.

Personal data

Personal data is often useful for characterizing a future manager, new colleagues, or people the assistant manager deals with at the current place of work. Marital status, photos, hobbies, communication style, circle of friends, interests: all this is often available on social networks. It is not recommended to waste time looking for such information out of idle curiosity, but as an additional characterization of a person social networks can give the personal assistant a variety of data, both positive and negative.

Employee reviews, employer lists

Feedback from former employees of an organization, as well as from those who took part in its interviews, is important if the assistant manager plans to work there. Bear in mind that there are always "offended" and "dissatisfied" people, so conclusions should not be drawn from the reviews and comments of individuals alone. However, they should be taken into account and compared with other data about the company.

In addition, there are official resources that periodically publish lists of both the best and unscrupulous employers.
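
Several rows of the table describe simple comparisons that can be written down explicitly. The following is a minimal sketch under stated assumptions, not a complete verification procedure: the function names, the sample values and the tenfold threshold (taken from the phrase "tens or hundreds of times") are introduced here for illustration, and the list of public mail domains contains only the two named in the table.

```python
from datetime import date

# Public mail domains named in the table; a real list would be longer.
PUBLIC_MAIL_DOMAINS = {"mail.ru", "yandex.ru"}


def years_registered(registered_on, today):
    """Whole years between the registration date and today."""
    years = today.year - registered_on.year
    if (today.month, today.day) < (registered_on.month, registered_on.day):
        years -= 1
    return years


def red_flags(claimed_years, registered_on, authorized_capital,
              contract_amount, contact_email, today):
    """Return warning signs found in the collected data (illustrative only)."""
    flags = []
    if years_registered(registered_on, today) < claimed_years:
        flags.append("claimed experience exceeds time since registration")
    # Assumed threshold: the contract is at least ten times the capital.
    if contract_amount >= 10 * authorized_capital:
        flags.append("contract amount is at least ten times the authorized capital")
    domain = contact_email.rsplit("@", 1)[-1].lower()
    if domain in PUBLIC_MAIL_DOMAINS:
        flags.append("contact email is on a public mail server")
    return flags


if __name__ == "__main__":
    # Invented example data.
    for flag in red_flags(
        claimed_years=10,
        registered_on=date(2014, 11, 1),
        authorized_capital=10_000,
        contract_amount=2_000_000,
        contact_email="info@mail.ru",
        today=date(2015, 6, 1),
    ):
        print(flag)
```

A flag here is only a reason to look more closely; as the table notes, each item has to be weighed together with the other data about the company.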

There are many resources on the Internet that offer users information about organizations. Some present the information in the public domain, while others are commercial. Paid web pages often offer to sell you data that can be found on another page for free. Do not rush to pay for the first information you come across: before resorting to a paid request, carefully study the available resources, including trusted sources such as the official sites of government services.

For reference. As a rule, the data on companies published on reference resources are obtained from open sources (the Unified State Register of Legal Entities and Rosstat) and are not subject to the Federal Law of July 27, 2006 No. 152-FZ "On Personal Data" (as amended on July 21, 2014), in accordance with Art. 6 of the Federal Law of August 8, 2001 No. 129-FZ "On State Registration of Legal Entities and Individual Entrepreneurs" (as amended on March 30, 2015, with changes of May 18, 2015).

  • www.egrul.nalog.ru. The official website of the Federal Tax Service gives the assistant manager free access to information that is published in accordance with the law and is not confidential. This resource provides data on several of the items listed in the table at once.

The required initial data for the search: the name of the legal entity or individual entrepreneur, OGRN or TIN (Fig. 3).

In this case, we search by the name of the legal entity. Entering the region is not required, but if you know it, enter it to narrow the search results. After entering the data, press the Find button to get the result (Fig. 4).

A search on the website of the Federal Tax Service returns results in PDF format. The documents are available for download and contain full information on the registration of the company: data on the founders and managers, the date of registration, registered types of activities, address data, etc. (Fig. 5).

The resource of the tax service is also convenient in that it provides the ability to search through other databases, which can be used for free by clicking on the desired link and entering the initial data required for the request. The manager's assistant can easily find information about disqualified persons, legal entities that have tax arrears, and other useful data.

Note that some searches require additional information: for example, a search for debts requires the taxpayer's TIN (Fig. 6). If the TIN was not known initially, it can be found in the registration information obtained by searching for the company by name.
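
Before a TIN is entered into a search form, it can be checked for typing errors. The sketch below applies the commonly published check-digit rule for 10-digit (legal entity) and 12-digit (individual) TINs; it is an assumption that the number being checked follows this rule, and passing the check only means the number is well-formed, not that such a taxpayer exists. The function name `is_plausible_tin` is introduced here for illustration.

```python
# Weights of the commonly published TIN check-digit rule.
WEIGHTS_10 = [2, 4, 10, 3, 5, 9, 4, 6, 8]
WEIGHTS_11 = [7, 2, 4, 10, 3, 5, 9, 4, 6, 8]
WEIGHTS_12 = [3, 7, 2, 4, 10, 3, 5, 9, 4, 6, 8]


def _check_digit(digits, weights):
    """Weighted sum of the leading digits, modulo 11, then modulo 10."""
    return sum(d * w for d, w in zip(digits, weights)) % 11 % 10


def is_plausible_tin(tin):
    """Return True if the string is a 10- or 12-digit TIN with valid check digits."""
    if not (tin.isascii() and tin.isdigit()):
        return False
    digits = [int(ch) for ch in tin]
    if len(digits) == 10:
        return _check_digit(digits, WEIGHTS_10) == digits[9]
    if len(digits) == 12:
        return (_check_digit(digits, WEIGHTS_11) == digits[10]
                and _check_digit(digits, WEIGHTS_12) == digits[11])
    return False


if __name__ == "__main__":
    tin = input("TIN: ").strip()
    print("well-formed" if is_plausible_tin(tin) else "check the number")
```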

Signs of fly-by-night companies

Excerpt
from the Publicly Available Criteria for Self-Assessment of Risks for Taxpayers, used by the tax authorities in selecting targets for field tax audits,
approved by Order of the Federal Tax Service of Russia of May 30, 2007 No. MM-3-06/333@
"On approval of the Concept of the planning system for field tax audits"

(as revised on 10.05.2012)

[...] When assessing tax risks that may be associated with the nature of relationships with some counterparties, the taxpayer is advised to investigate the following indicators:

Lack of personal contacts between the management (authorized officials) of the supplier company and the management (authorized officials) of the buying company when discussing the terms of delivery, as well as when signing contracts;

Lack of documentary evidence of the powers of the head of the counterparty company, copies of his identity document;

Lack of documentary evidence of the powers of the counterparty's representative, copies of his identity document;

Lack of information about the actual location of the counterparty, as well as about the location of warehouse and / or production and / or retail space;

Lack of information on how information about the counterparty was obtained (there is no advertising in the media, there are no recommendations from partners or other persons, the counterparty has no website, etc.). The negative weight of this sign is aggravated when information is available (for example, in the media, outdoor advertising, Internet sites, etc.) about other market participants (including manufacturers) of identical (similar) goods (works, services), including those offering their goods (works, services) at lower prices;

Lack of information on the state registration of the counterparty in the Unified State Register of Legal Entities (public access, the official website of the Federal Tax Service of Russia www.nalog.ru).

The presence of such signs indicates a high risk that the tax authorities will classify the counterparty as problematic (or "fly-by-night") and treat transactions with it as doubtful.

These risks are additionally increased by the simultaneous presence of the following circumstances:

A counterparty with the above characteristics acts as an intermediary;

The presence in contracts of conditions that differ from the existing rules (customs) of business practice (for example, long payment deferrals, delivery of large consignments of goods without prepayment or a guarantee of payment, penalties disproportionate to the consequences of a breach of contract by the parties, settlements through third parties, settlements in promissory notes, etc.);

Lack of obvious evidence (for example, copies of documents confirming that the counterparty has production capacities, the necessary licenses, qualified personnel, property, etc.) that the counterparty can really fulfill the terms of the contract, as well as reasonable doubts about whether the counterparty can actually fulfill the terms of the contract, taking into account the time required for the delivery or production of goods, performance of work or provision of services;

Acquisition through intermediaries of goods, the production and procurement of which is traditionally carried out by individuals who are not entrepreneurs (agricultural products, secondary raw materials (including scrap metal), industrial products, etc.);

Lack of real action by the payer (or his counterparty) to collect the debt. An increase in the payer's (or his counterparty's) debt against the background of the continuation of the delivery of large consignments of goods or significant volumes of work (services) to the debtor;

Issue, purchase or sale by counterparties of promissory notes whose liquidity is not obvious or has not been investigated, as well as the issue or receipt of loans without collateral. The negative weight of this sign is aggravated if debt obligations of any kind carry no interest, or if their maturity exceeds three years;

A significant share of costs from transactions with "problem" counterparties in the taxpayer's total costs, with no economic justification for the transaction and no positive economic effect from it, etc.

How to check the counterparty company for "reality"?

  1. Use the electronic services on the website of the Federal Tax Service of Russia (http://www.nalog.ru/):
  • "Information about the persons in respect of whom the fact of the impossibility of participation (leadership) in the organization is established (confirmed) in court" (https://service.nalog.ru/svl.do). Using the OGRN or TIN of the organization, you can find out whether the person listed in the Unified State Register of Legal Entities as the head or founder of the organization has declared that he has nothing to do with it;
  • "Information published in the journal "State Registration Bulletin" on the decisions taken by the registering authorities on the upcoming exclusion of inactive legal entities from the Unified State Register of Legal Entities" (http://www.vestnik-gosreg.ru/publ/fz83/). The tax authorities can take such a decision if the company has not submitted tax reports during the year and has not carried out transactions on any of its bank accounts. Exclusion of a company from the Unified State Register of Legal Entities is equivalent to its liquidation, which means that it cannot conclude or execute contracts.

Our advice: print or save on your computer web pages (screenshots) with company information. This will help in the future to prove that you carried out the verification.

  2. Request certified copies of the following documents:
  • the charter of the organization;
  • certificates of state registration of the organization;
  • certificates of registration of the organization with the tax authority at the place of its location;
  • decisions on the election (appointment) of the head of the organization;
  • passport of the head of the organization (pages 2 and 3);

By the way: the validity of a passport can be checked by its series and number using the service "Checking the list of invalid Russian passports" on the website of the FMS of Russia (http://services.fms.gov.ru/info-service.htm?sid=2000).

  • licenses, if a transaction with an organization is concluded within the framework of a licensed activity. In addition, information about the licenses issued to the company can be checked on the websites of the licensing authorities;
  • accounting statements for the year preceding the year of the transaction. The accounting data of the organization for any period can also be obtained free of charge from Rosstat (provided that the company submits accounting records to the statistics authorities). To do this, you need to send a request in an approved form to any territorial agency of Rosstat.

The results of the check can be issued in the form of a certificate and presented to the manager.

Video instructions for checking a counterparty are available on the website http://egrul.nalog.ru/.

  • www.fssprus.ru. The official website of the Federal Bailiff Service of Russia lets users consult the database of enforcement proceedings and search it using a simple form (http://fssprus.ru/iss/ip/) (Fig. 7).

The database contains information on legal entities and individuals. To run a search, enter the data of the individual or legal entity or, in a separate tab of the search form, the number of the enforcement proceedings, if known (Fig. 8).

Note! Unlike on the website of the Federal Tax Service, on the FSSP website entering the territorial body is mandatory.

If a company or an individual has debts and enforcement proceedings have been initiated against them, the search results will show the following data: the full name and address of the company, the number and date of initiation of the enforcement proceedings, details of the enforcement document, and the amount of outstanding debt. In the example, some of the table data has been deleted, but the columns are retained to show clearly how the search results are displayed.

  • www.rusprofile.ru. The RusProfile project is a company directory that can be used to quickly find an organization, its contact information and its registration data.

In the "Companies" section, you must enter the name of the company and get the search results (Fig. 9).

The Internet provides unlimited access to information resources, both in the field of legal science and practical lawmaking. Search engines greatly facilitate the task of finding the necessary information, any data, articles, monographs and programs. Internet resources are becoming an effective means of acquiring new knowledge, as well as providing access to electronic versions of not only legal magazines and newspapers, but also to a variety of legal literature available both in free and paid form.

Almost any lawyer can try himself as a "remote consultant" on legal problems. To do this, it is not necessary to create a personal web page, it is enough to become a member of one of the existing Internet projects. For example, the so-called "Virtual Legal Advice" (www.uristy.ru) is very popular on the domestic Internet. Any specialist with a legal education can take part in the work of this consultation, it is enough just to register in the system.

But it should be noted that the availability and ease of posting information, as well as the almost complete independence of servers from each other on the Internet, turned the global achievement into chaos. That is why from year to year the problem of finding the necessary information on the Internet is becoming more and more urgent. This is especially important in conditions of limited time and in the case when a decision should be based on a specific document.

The easiest way to find something is to enter keywords directly into the address bar of your browser. The search is performed by Microsoft's MSN Search system.

Fig. 18 MSN Search system

Another way to search is to use the Search button on the browser toolbar. When this button is used, the window is divided into two parts. On the left there is a line for entering keywords and a list of the pages found; on the right you can view the selected pages. You can choose another search engine with the Customize button in the Search panel.

Internet search engines:

Search engines can be classified into the following groups:

    search directories

    search engines or search indexes

Search directories.

Resource directories - global, local, specialized - are web-based databases with resource addresses. These databases can have a different amount of accumulated information. They are usually hierarchical.

Search catalogs are organized in the same way as thematic catalogs of large libraries. Referring to the address of the search directory, we find on its main page a list of subject categories, such as "Jurisprudence", "Education", "Sports", etc.

Each entry in the category list is a hyperlink. Clicking on it opens the next page of the search directory, where the selected topic is presented in more detail. As you continue to dive into the topic, you can come down to a list of specific Web pages and choose the resource that is most suitable for solving your problem. You can also use the Search button in the search directory to refine your search for the pages you need.
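
The step-by-step descent through categories can be pictured as walking a nested structure. The sketch below is only an illustration of the idea: the `DIRECTORY` contents, the page addresses and both function names are invented, and a real search directory is of course far larger.

```python
# Hypothetical miniature directory: categories nest until a list of
# page addresses is reached. All names and addresses are invented.
DIRECTORY = {
    "Jurisprudence": {
        "Civil law": ["civil-code.example", "contracts.example"],
        "Tax law": ["tax-guide.example"],
    },
    "Education": {
        "Universities": ["university.example"],
    },
}


def drill_down(directory, path):
    """Follow category names one level at a time, like clicking hyperlinks."""
    node = directory
    for category in path:
        node = node[category]
    return node


def search_directory(node, keyword, trail=()):
    """Depth-first search for page addresses containing the keyword,
    returning each hit together with the category trail leading to it."""
    if isinstance(node, list):
        return [(trail, page) for page in node if keyword in page]
    hits = []
    for name, child in node.items():
        hits.extend(search_directory(child, keyword, trail + (name,)))
    return hits


if __name__ == "__main__":
    print(drill_down(DIRECTORY, ["Jurisprudence", "Civil law"]))
    print(search_directory(DIRECTORY, "tax"))
```

`drill_down` corresponds to clicking through the category hyperlinks, while `search_directory` plays the role of the directory's own Search button, which looks through every branch at once.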

Search directories are created largely by hand by highly trained editors who browse the WWW, select what they think is of public interest, and catalog the addresses.

Yahoo (www.yahoo.com) - recognized as the most popular catalog in the world. Search in Russian is possible.

Russian catalogs:

"List.Ru" (www.list.ru),

"Constellation Internet" (www.stars.ru),

"Russia on the Net" (www.ru), and others.

Introduction

1. Information retrieval system

1.1. Documentary information retrieval systems

1.2. Factographic information retrieval systems

2. Search engine of the global network "Internet"

2.1. How search engines work

2.2. Search technology

3. Search engines of the global network "Internet"

3.1. How to search the Internet

3.2. Search directories

3.3. Search indexes

4. Comparative characteristics of two search engines, Rambler.ru and Yandex.ru

4.1. Rambler.ru

4.2. Yandex.ru

Conclusion

Literature

Appendix

Introduction


The Internet has made life much easier for modern society and has globalized it, increasing the capabilities of some people and reducing those of others. Today it is much more convenient and cheaper to send mail via the Internet (for example, a letter from Tobolsk will reach London in 5 seconds).

According to my observations, the Internet has become a source of business, a source of world culture, a source of education, a mass media.

Today, any user on the Internet can get access to all world exchanges and museums in a couple of seconds. Any user can get education via the Internet, get acquainted with the world's leading electronic newspapers.

Information has become the virtual gold of our days, and whoever can obtain it faster will achieve success sooner. It does not matter whether you are a businessman looking for a new sales market or a student looking for material for a term paper: both need information, and the Internet can give it to them if they have enough knowledge to take it.

I could go on listing the benefits of the Internet for the citizens of the Earth, but I am afraid I would not finish soon.

I want to note the main thing about the Internet, its "cornerstone": information and its main properties:

1) Wide availability

2) Speed

Inexperienced users share a myth that the Internet has everything. My own experience has shown that this is not the case. Materials for the Web are prepared by real people, so you can find there only what they considered worth publishing (in the sense of being useful or beneficial to themselves). However, a river is fed by streams, and thanks to this work there are already about two billion Web pages on the Internet today. As a result, cataloging the resources available on the Web has become a serious problem. Although thousands of organizations are engaged in it, the problem is not getting closer to a solution but is becoming more acute. The percentage of cataloged (or indexed) resources has been steadily declining, and in the past two years the decline has become catastrophic: in 2000 the share of indexed resources approached 40%, and within a single year it dropped to 25%. The takeaway is simple: the Web fills up faster than it gets organized. Unfortunately, Internet experts see no reason to expect a change for the better in the near future. For this reason, finding information on the World Wide Web can be considered the most difficult task on the Internet.

In view of the above, high-quality information retrieval on the Internet is one of the most pressing topics of our time, and it is a problem I have run into more than once.

The topic of my term paper attracted me by its originality and novelty, and I want to try to explore it. My task is to describe how to organize high-quality information retrieval on the Internet.

1. Information retrieval system


Before turning to the specific search mechanisms of the global Internet, it is necessary to examine the theoretical basis of such questions as "What is information?", "What are information processes?" and "What is an information retrieval system, and what types exist?"

There is no unambiguous answer to the question of what information is; one can only cite some of the properties that characterize the term:

"Information is data that is the object of storage; it is the content of a message, signal or memory, as well as the data contained in a message, signal or memory."

The processes of transmitting, storing and processing information have always played an important role in the life of society. People exchange oral messages, notes and letters. They pass requests, orders, progress reports and property inventories to one another; publish advertisements and scientific articles; keep old letters and documents; reflect for a long time on news they have received or rush to carry out their superiors' instructions at once. All of these are information processes.

Information is always associated with a material carrier, and its transmission with an expenditure of energy. However, the same information can be stored in different material forms (on paper, as a photographic negative, on magnetic tape, ...) and transmitted at different energy costs (by mail, by phone, by courier, etc.). Moreover, the consequences of transmitted information, including material ones, do not depend at all on the physical cost of transmitting it. For example, a light press of a button lowers a heavy theater curtain or blows up a large building, a red signal stops a train, and unexpected bad news can cause a heart attack. Information processes therefore cannot be reduced to physical ones, and information, along with matter and energy, is one of the fundamental entities of the world around us.

In the 20th century, with the development of technology, new devices appeared: means of communication, automation devices and, from the 1940s, computers. It turned out that the efficiency of their work could not be described in physical terms and that the essential characteristics of such devices had to be described in a completely different way. As a result, a precise concept of information and a mathematical theory of information emerged for the first time. It became clear that means of communication, whatever physical processes they use, are means of transmitting information.

The unification of the concepts of "information" and "control" led N. Wiener in the 1940s to create cybernetics, which, in particular, was the first to point out what information processes in technology, society and living organisms have in common.

The use of the concept of information has had a significant impact on the development of modern biology, especially such sections as neurophysiology and genetics. And finally, in connection with the development of computer technology, which stimulated the informatization of the whole society, a complex of sciences arose about various aspects of working with information - informatics.

"An information retrieval system is a system that stores an information array from which the necessary information is issued at the request of users."

The search for information at the user's request is carried out either automatically or manually (as in libraries, when a reader submits a request to a reference desk employee and the employee uses the catalog system). In the automatic case, computers are used, equipped with special software that processes requests and searches for and issues the necessary documents. Thus, information retrieval systems (IRS) implement a question-and-answer relationship, which brings the tasks facing the creators of such systems close to those solved by the creators of man-machine systems.

Information retrieval systems are divided into two types:

1. Documentary IRS.

2. Factographic IRS.

1.1 Documentary IRS


In such an IRS, all stored documents are indexed in some special way. Each document (article, report, protocol, etc.) is assigned an individual code that forms the search image of the document. The search is carried out not over the documents themselves but over their search images, which contain information (an address) about where the document is located. This is how books are found at a reader's request in large libraries (in small libraries the librarian usually looks for the book directly). First the card is found in the catalog, and then the book itself is found by the code indicated on the card.

Documentary IRS differ in how the search image of a document is constructed. In the simplest case, it is just the document's individual designation (for example, the title, author and year of publication of a book). In more complex cases, there is no one-to-one correspondence between the search image and the document. One search image may correspond to several different documents and, conversely, one document may correspond to several search images.

This kind of ambiguity is typical, for example, of search images in descriptor systems. "A descriptor is a word or phrase that is closely related to the content of a document. A collection of descriptors defines a group of documents with similar content." Lately, journals that publish scientific articles have required authors to supply each article with a list of keywords, which play the role of descriptors. If, for example, you describe the article you are now reading using keywords, one possible list is: information retrieval, information retrieval system, descriptor, thesaurus, document search image.

Using this set of keywords (a set of descriptors), the article can be found among all the articles of a book if its contents are entered, article by article, into any descriptor-type IRS.

The general block diagram of a descriptor-type IRS is shown in Fig-1. The diagram has two inputs. The first is used to replenish the information array of documents stored in the system, and the second is used to receive user requests.
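The two-input scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name DescriptorIRS, the document codes and the "shelf" locations are hypothetical and do not come from the original work.

```python
from collections import defaultdict


class DescriptorIRS:
    """Toy descriptor-type retrieval system (illustrative only)."""

    def __init__(self):
        # descriptor -> set of document codes (the search images)
        self.index = defaultdict(set)
        # document code -> where the document itself is stored
        self.locations = {}

    def add_document(self, code, location, descriptors):
        """First input: replenish the stored information array."""
        self.locations[code] = location
        for descriptor in descriptors:
            self.index[descriptor.lower()].add(code)

    def query(self, descriptors):
        """Second input: locations of documents that carry all descriptors."""
        sets = [self.index.get(d.lower(), set()) for d in descriptors]
        if not sets:
            return []
        codes = set.intersection(*sets)
        return sorted(self.locations[c] for c in codes)


irs = DescriptorIRS()
irs.add_document("A1", "shelf 3", ["information retrieval", "descriptor", "thesaurus"])
irs.add_document("A2", "shelf 7", ["information retrieval", "database"])
print(irs.query(["information retrieval", "descriptor"]))  # ['shelf 3']
```

Note that the query never touches the documents themselves: it works only with the descriptors and codes, and the location is looked up at the very end, just as a library card leads to the book.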

1.2 Factographic IRS

Unlike a documentary IRS, an IRS of this type stores not documents but facts related to some subject area. The stored facts can be extracted from various documents. In the fact base, they are connected with each other by a system of various relations. Such a network in the IRS is called a domain thesaurus. Queries coming into a factographic IRS use the thesaurus to find answers. The search is carried out by pattern matching, a method widely used in the knowledge bases of artificial intelligence systems.

For example, one may need to work through the history of the eighteenth century and collect all the information about Catherine II.

In their organization and functioning, factographic IRS are gradually approaching advanced databases and knowledge bases.
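A minimal sketch of the idea of a fact base, assuming facts are stored as (subject, relation, object) triples. The triple layout, the relation names and the sample facts are illustrative assumptions, not part of the original work.

```python
# Facts stored as (subject, relation, object) triples.
# Sample data for illustration only.
facts = [
    ("Catherine II", "ruled", "Russian Empire"),
    ("Catherine II", "reign_began", "1762"),
    ("Peter III", "spouse", "Catherine II"),
    ("Peter III", "ruled", "Russian Empire"),
]


def about(entity, triples):
    """Return every fact in which the entity is the subject or the object."""
    return [t for t in triples if entity in (t[0], t[2])]


for fact in about("Catherine II", facts):
    print(fact)
```

The relations are what turn a flat list of facts into a network: the query for Catherine II also returns the fact stored under Peter III, because the two are linked.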

2. Search engine of the global network "Internet".


I do not want to get into the jungle of a search engine's inner workings (at the electronic level), since this does not serve the goals of my work; in my opinion that is a job for top-level programmers, a level I am still working toward.

Instead, I want to sort out, shelf by shelf, how I understand the technology of information retrieval and the retrieval mechanism itself.

2.1 Technology of information retrieval on the Internet


The search technology itself becomes clearer in Figure-2.

1) To begin with, the user decides what he wants to find and where it might be.

2) Then he connects to the Internet and opens an ordinary Internet Explorer (browser) window (Fig-3). If the user knows the name of the site that holds the information of interest, he simply types its address and opens it.

Example. The user wants to see today's film listings and goes to the site film.ru (Fig-3).

This is the most primitive way to search for information on the Internet, and the search may end there.


There are also in-site (local) search systems.

Example. The same film.ru site lets you view information about a film that left the box office long ago. For example, to find the film "Brother-2", it is enough to type Brother-2 in the search window (Fig-3).

3) If the user does not know the name of the site where the information of interest can be found, he turns to a search engine. There are a significant number of such systems. Having connected to the chosen server, he receives a query form on the screen, in which he must enter the information for the search. Usually the form allows the search area to be limited (for example, by topic). He can enter the desired term, define the scope of the search and try to get an answer.

The search is performed automatically, based on the number of query words found in documents on the server. The first group of links found, those with the highest number of occurrences of the search words, is sent to his computer. Often a brief description of the document is displayed along with the link. If the necessary documents are not among those shown, the next group can be displayed; the total number of documents found is usually in the thousands. To go to the server where the information is located, simply click on the link in the search results.

Typically, a search for a pair of keywords returns tens of thousands of links to documents containing those terms. Such a volume of results rarely lets you find a "pearl" among unrelated materials. What can be advised?

First, the user needs to narrow the search area. Try to determine on servers of which profile, in which country, etc. the materials of interest are most likely to be found. Think about which other keywords can characterize the objects of the search, and use several keywords.

If the object of the search is specified by several terms, the search engine looks for each word in the document independently. That is, the results may include a document that contains only one of the words, but several times. Therefore, when defining search terms, it is both possible and necessary to use logical operations.

For example, entering word_1 & word_2 makes the engine search only for pages where both the first and the second term occur.
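The difference between the default "each word independently" behavior and the word_1 & word_2 form can be illustrated with a small Python sketch. The page names and texts below are made up for the example, and real engines differ in their exact query syntax.

```python
# Hypothetical pages: name -> text.
documents = {
    "page1": "brother brother brother film review",
    "page2": "brother 2 film distribution today",
    "page3": "film distribution schedule",
}


def search_any(terms, docs):
    """Each word is looked up independently: a page matches if ANY term occurs."""
    return sorted(name for name, text in docs.items()
                  if any(t in text.split() for t in terms))


def search_all(terms, docs):
    """word_1 & word_2: a page matches only if ALL terms occur."""
    return sorted(name for name, text in docs.items()
                  if all(t in text.split() for t in terms))


print(search_any(["brother", "distribution"], documents))
print(search_all(["brother", "distribution"], documents))
```

With independent lookup all three pages are returned, including page1, which contains only one of the words several times; with the logical AND only page2 remains.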

Second, it is worth running the search on all the search engines you know. Each of them uses its own search technology, somewhat different from the others, so identical queries can lead to different results. Most search engines are free, so nothing stops you from running as many searches as you need.

Third, searching for documents through possible links to them very often brings results.

The user should try to determine which well-known documents may contain references to his topic, and then reach the desired source through the hypertext links in those documents. This path is often effective. Try to find organizations (WWW servers) whose profile is close to the subject of your search. Sometimes the links in the documents on these servers lead to the necessary materials.

Fourth, try to find a conference on a similar topic, or simply go to some chat, for example at www.anekdotov.net.ru. A question "thrown" into a newsgroup often brings enough background information.

And finally, do not forget to ask your friends. They can suggest unexpected solutions.

In any case, the user should be prepared for the search to take a fairly long time and a lot of effort.

Example. The user opens the Yandex.ru search engine and types the word Brother-2 in the search window; a search is then made for everything that can be connected with this word in any way. Yandex will suggest many sites, including film.ru and the site of the film itself (Fig-4).

2.2 How search engines work

A search engine usually finds the information you want in three stages:

Stage I: A robot (agent, spider or crawler) travels around the Web and collects information.

Stage II: All the information collected by the robots enters the database in the form of links, that is, it is indexed.

Stage III: The search interface is launched, through which users interact with the database; i.e. the database returns hyperlinks, and the user then looks through them for the ones he needs.

These stages are shown in the flowchart (Fig-2).

The first two are preparatory and invisible to the user.

Let us consider the stages of information search in a search engine in more detail.

Stage I. The search engine collects information from the World Wide Web. To do this, it uses special programs similar to browsers. They can copy a given Web page to the search index server, view it, find all the hyperlinks it contains, navigate to the URLs specified in them, copy the resources found there, find the hyperlinks in those resources in turn, and so on. These special programs (agents, spiders, crawlers and robots) search for pages on the Web, extract the hypertext links on those pages, and automatically index the information they find to build a database. Each search engine has its own set of rules governing how documents are collected. Some robots follow every link on every page they find and then, in turn, explore every link on every new page, and so on. Some ignore links that lead to graphics, sound and animation files; others are instructed to visit the most popular pages first.
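The collection step described above (copy a page, extract its links, follow them, repeat) can be sketched as a breadth-first traversal. To keep the sketch runnable offline, the "Web" here is a hypothetical in-memory dictionary; a real robot would download pages and parse their HTML. The addresses and the max_depth parameter are assumptions made for the example.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> URLs linked from that page.
WEB = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/"],   # links back to the start page
    "b.example/": ["b.example/deep"],
    "b.example/deep": [],
}


def crawl(start, web, max_depth):
    """Visit each page once, following links up to max_depth clicks away."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)           # "copy" the page to the index server
        if depth == max_depth:
            continue                # do not follow links any deeper
        for link in web.get(url, []):
            if link not in seen:    # skip pages that were already collected
                seen.add(link)
                queue.append((link, depth + 1))
    return order


print(crawl("a.example/", WEB, max_depth=1))
```

The set of already seen pages is what keeps the robot from circling forever between pages that link to each other, and the depth limit corresponds to a rule on how far from the start page links are followed.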

Agents are the most "intelligent" of the search tools. They can do more than just search: they can leave a message about their visit to a site. They can already search for sites on a specific topic and return lists of sites sorted by traffic. Agents can process the content of documents and find and index other types of resources, not just pages. They can also be programmed to retrieve information from existing databases. Whatever information agents index, they pass it back to the search engine's database.

General searching of the Web is carried out by programs known as spiders. Spiders report the content of a found document, index it and extract summary information. They also look at headers and some links, and send the indexed information to the search engine's database.

Crawlers look through the headers and return only the first link.

Robots can be programmed to follow links to various nesting depths, to index documents and even to check the links in a document. By their nature they can get stuck in loops, so following links costs them significant Web resources. There are, however, methods that keep robots away from sites whose owners do not want them indexed.
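The text does not name a specific method for keeping robots out. One widely used convention is the robots.txt file, and Python's standard library can interpret its rules. The sketch below assumes a hypothetical site and robot name and parses the rules from a string, so no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content published by a site owner.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A well-behaved robot asks before fetching each address.
print(parser.can_fetch("ExampleBot", "http://site.example/index.html"))
print(parser.can_fetch("ExampleBot", "http://site.example/private/data.html"))
```

Under these rules the first address is allowed and the second is not. The convention is voluntary: it works only for robots that choose to check the file.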

Robots retrieve and index different kinds of information. Some, for example, index every single word in each document they encounter, while others index only the 100 most important words in each, along with the size of the document and the number of words in it, the title, headings and subheadings, and so on.

The type of index that is built determines what kind of search the search engine can perform and how the information it returns will be interpreted.

People who want to make information available to the general public, or who want more traffic to their site, can submit a short description of the site directly to the index by filling out a special form for the section they consider appropriate; the robot then visits the site and adds it to the database, and the engine can offer it to users.

When someone wants to find information available on the Internet, they visit a search engine page and fill out a form describing the information they need. Keywords, dates and other criteria can be used here. The criteria in the search form must match the criteria used by the robots to index the information they find while navigating the Web.

The indexed information is sent to the search engine database in the same way as described above.

Stage II. After the Web resources have been copied to the search engine's server, the second stage of work begins: indexing. In the course of indexing, special databases are created that make it possible to establish where and when a particular word was encountered on the Internet. An indexed database is a kind of dictionary. It is needed so that the search engine can respond to user queries very quickly.
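The "dictionary" built during indexing is commonly organized as an inverted index: for each word, a record of the pages and positions where it occurs. The sketch below is a minimal illustration with made-up page addresses and texts; production indexes also store dates, tags and much more.

```python
from collections import defaultdict

# Hypothetical pages already copied to the index server: URL -> text.
pages = {
    "film.example/brother2": "brother 2 film about two brothers",
    "news.example/today": "film distribution for today",
}


def build_index(pages):
    """Map each word to the pages and word positions where it occurs."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(url, []).append(position)
    return index


index = build_index(pages)
print(index["film"])
```

Answering a query then means looking up a few words in the dictionary instead of rereading every stored page, which is why the response can be fast.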

Using the information provided in the completed form, the database looks up the subject of the query and returns the corresponding documents. It uses a ranking algorithm to determine the order in which the list of documents is displayed. Ideally, the documents most relevant to the user's query are placed first in the list.

"The operation of sorting the results obtained is called ranking."

Different search engines use different ranking algorithms, but the basic principles for determining relevance are as follows:

The number of query words in the textual content of the document (i.e. in the html code).

Tags in which these words are located.

The location of the search words in the document.

The proportion of words for which relevance is determined in the total number of words in the document.

These principles are applied by all search engines. The ones presented below are used by only some of them, though quite well-known ones (such as AltaVista and HotBot).

Time - how long the page has been in the search engine's database. At first this seems a fairly meaningless principle. But think how many sites on the Internet live for a month at most! If a site has existed for a long time, its owner is likely to be experienced in the topic, and a site that has been telling the world about table manners for a couple of years will suit the user better than one on the same topic that appeared a week ago.

Citation index - how many links lead to a given page from other pages registered in the search engine's database.

The database produces a list of HTML documents ranked in this way and returns it to the user who made the request. Different search engines also display the resulting list differently: some show only links; others show links together with the first few sentences of the document, or the title of the document along with the link.
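How such principles might be combined into a single relevance score can be sketched as follows. This is a toy illustration only: the function name, the particular weights and the sample documents are arbitrary assumptions, and real engines do not publish their ranking formulas.

```python
def relevance(query_words, title, body, inbound_links):
    """Toy score built from the principles above; the weights are arbitrary."""
    words = body.lower().split()
    title_words = title.lower().split()
    query = [q.lower() for q in query_words]

    occurrences = sum(words.count(q) for q in query)       # number of query words
    share = occurrences / len(words) if words else 0.0     # proportion of the text
    in_title = sum(1 for q in query if q in title_words)   # tag containing the word
    early = sum(1 for q in query if q in words[:10])       # location in the document

    return occurrences + 5 * share + 3 * in_title + early + 0.5 * inbound_links


# (title, body, number of links from other indexed pages) - sample data.
docs = [
    ("Brother-2 film", "brother 2 is a film sequel to brother", 4),
    ("Cinema news", "today the cinema shows a film", 1),
]
ranked = sorted(docs, key=lambda d: relevance(["brother", "film"], *d), reverse=True)
print([title for title, _, _ in ranked])
```

The page that mentions the query words more often, has one of them in its title and is cited by more pages comes first. The time a page has spent in the database is left out of the sketch for brevity.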

Stage III. The user's request is processed and the search results are returned to him as a list of hyperlinks. Then the user's own work begins: going through the links provided by the database. When he clicks on a link to a document that interests him, the document is requested from the server on which it is stored; if the information on that site does not satisfy him, he clicks on another link. This stage can take a long time and prove the most difficult for the user.


3. Search engines

There are a great many search engines on the Internet. They are of different types, each with its own advantages and disadvantages. The user always faces the same questions: how to search the Internet, and which engine is better. I will try to answer these questions.

3.1 How to search the Internet

When searching on the Internet, two components are important: completeness (nothing is lost) and accuracy (nothing superfluous is found). Usually both are summed up in one word, relevance, that is, how well the answer corresponds to the question.
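Completeness and accuracy as defined above correspond to the measures usually called recall and precision, and they can be computed when it is known which documents truly answer the query. The sketch below uses made-up document identifiers; in practice the set of truly relevant documents is rarely known and has to be estimated.

```python
def recall_and_precision(found, relevant):
    """found: what the engine returned; relevant: what truly answers the query."""
    found, relevant = set(found), set(relevant)
    hits = found & relevant
    recall = len(hits) / len(relevant) if relevant else 1.0   # completeness
    precision = len(hits) / len(found) if found else 1.0      # accuracy
    return recall, precision


print(recall_and_precision(found=["d1", "d2", "d3", "d4"],
                           relevant=["d1", "d2", "d5"]))
```

In this example two of the three relevant documents were found (completeness 2/3), and two of the four returned documents were relevant (accuracy 1/2).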

1. Coverage and depth. Coverage means the volume of the search engine's database, which is measured by three indicators: the total volume of indexed information, the number of unique servers and the number of unique documents. Depth refers to whether there is a limit on the number of pages or on the nesting depth of directories on one server.

How to check: Some engines publish robot statistics on their websites. But you can also check for yourself: submit several one-word queries (to exclude the influence of the query language, including differing interpretations of a space) and look at the result statistics the engine reports; usually the total number of documents found is shown at the top of the list. Besides coming from different subject areas, the words should also differ in "weight" (frequency): rare, "medium" and "heavy". Then compare the numbers of documents found. Heavy words, in particular, test whether the engine is full-text (whether it indexes every word in a document).

The depth of the robot's crawl is harder to check. For this, take some sites, for example ones with a branching archive structure, and check whether documents that can only be reached in, say, 6 clicks on links are indexed.

2. Crawl speed and relevance of links.

The network crawl speed shows how quickly a newly added resource is indexed and how quickly the information in the database is updated. An important indicator of the quality of a search engine (of its robot) is not only the "capture" of new territory but also the tracking of territory already covered. Servers disappear and appear, and the pages on them are updated. The links that the search engine lists must, first, exist and, second, have content that corresponds to the request.

How to check: Objective information can be obtained by analyzing server logs. The search engine's robot usually identifies itself by the name of its machine (or in a similar way), so you can see how often it visits the server, how many pages it views, etc. Unfortunately, usually only the log of your own site is available for study, so an experimental method remains.

To determine the crawl speed, create a page of text somewhere, submit it to the search engines and see how soon it starts to be found. Alternatively, change an existing page. To assess whether links are up to date, check the documents on at least the first page of results for several queries. The message "Not found" indicates that the document no longer exists.
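The server-log check mentioned above can be automated with a short script. The log layout below (date, path, user-agent name) is a simplified, hypothetical format, and ExampleSearchBot is a made-up robot name; real server logs have more fields and need a matching parser.

```python
from collections import Counter

# Hypothetical access-log lines in a simplified "date path agent" layout.
log_lines = [
    "2002-03-01 /index.html ExampleSearchBot/1.0",
    "2002-03-01 /about.html Mozilla/4.0",
    "2002-03-02 /index.html ExampleSearchBot/1.0",
    "2002-03-02 /archive/1.html ExampleSearchBot/1.0",
]


def robot_visits(lines, robot_name):
    """Count the robot's requests per day and collect the pages it viewed."""
    per_day = Counter()
    pages = set()
    for line in lines:
        date, path, agent = line.split(maxsplit=2)
        if robot_name in agent:
            per_day[date] += 1
            pages.add(path)
    return per_day, pages


per_day, pages = robot_visits(log_lines, "ExampleSearchBot")
print(dict(per_day), sorted(pages))
```

The per-day counts show how often the robot returns, and the set of pages shows how much of the site it actually views.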

3. Search quality (a subjective indicator).

Each search engine has its own algorithm for sorting search results. The closer to the top of the list the needed document appears, the better the relevance ranking works.

How to check: Only by experiment. It is recommended to compare queries of different lengths. You can also use the query language; those who do not want to read its description can use the extended query page ("advanced search" in Aport and Yandex, "detailed query" in Rambler; both are Russian renderings of "advanced search").

Besides relevance, several user-facing characteristics are important.

1. Search speed. If a search engine responds slowly, working with it is inefficient. It should be added that the speed the user sees depends not only on the search engine itself but also on the Internet channels.

How to check: By experiment. Run queries of different lengths, with words of different "weight", at different times of day (server load is very uneven over the day, peaking at around three to four o'clock in the afternoon).

2. Search capabilities (handling of the document language, query language). Another point of comparison is what exactly the search engine enters into its index, and how. A full-text search engine indexes all the words of the text visible to the user. Morphology support makes it possible to find the desired words in all their declensions and conjugations. In addition, HTML contains tags that a search engine can also process (headings, links, image captions, etc.). Almost all engines have a query language with the standard logical operators (AND, OR, NOT). Some engines can search for phrases or for words within a given distance of each other, which is often important for getting a reasonable result. A further capability is searching within areas of the document: titles, links, keywords (META KEYWORDS), etc. Another extension of the query language is the natural-language query, which requires no knowledge of operators.

How to check: Usually this information is published on the search engine's server (in the Help section). However, it is recommended to check with real queries, since wishful thinking is sometimes passed off as reality.
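Searching for words within a given distance of each other, mentioned above, relies on word positions. The following sketch shows the idea on a single text; the function name and the sample sentence are illustrative, and real engines take the positions from their index rather than rescanning the text.

```python
def within_distance(text, first, second, max_gap):
    """True if `first` and `second` occur at most max_gap words apart."""
    words = text.lower().split()
    pos_first = [i for i, w in enumerate(words) if w == first]
    pos_second = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= max_gap for i in pos_first for j in pos_second)


text = "information retrieval on the internet is a search problem"
print(within_distance(text, "information", "retrieval", 1))
print(within_distance(text, "information", "search", 3))
```

The first check succeeds because the two words are adjacent; the second fails because they are seven words apart. A phrase search is the special case where the words must follow each other directly and in order.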

3. Additional conveniences. These are the extra features that a search engine offers its users. They include all kinds of search options (specialized pages, searching for similar documents, limiting the search area), a list of the servers found, searching by date and by server, a convenient interface and the ability to personalize it.

How to check: The information may be partly published on the search engine's server, but it is best to try these features yourself.

Search systems are divided into search directories and search indexes; many search indexes also include directories. Let us consider each.

3.2 Search directories

Any book begins with a table of contents and ends with an alphabetical index. Although they are located in different places in the book and look completely different, they have one task: to help find exactly the section of the book that is needed at the moment. A table of contents is an example of cataloging.

A person chooses a topic that interests him and uses the table of contents to find the page where the topic is covered. An alphabetical index is an example of indexing. A person finds the required term in the index and gets the number of the page on which it appears.

Directories are different from search indexes. A directory is a collection of sites grouped under thematic headings. These headings, in turn, can be broken down into subheadings, which can have even smaller subdirectories, etc.

From the user's point of view, directories are search engines like any other. But directories are filled not by "robots", as indexes are, but by real, living people. This is very good for users, since it yields more relevant results than search indexes do. A search index partly includes a catalog as well, presented as tables of contents (hyperlinks) on the most popular topics.

When cataloging a resource, an experienced editor reviews it carefully, determines which area of knowledge it belongs to, establishes its category within that area, and enters it into the catalog. The largest directory on the Internet is Yahoo (www.yahoo.com). It employs over 150 qualified editors. It is a large organization, but its efforts are only enough to maintain the catalog at the level of approximately 1 million resources. Further expansion is constrained by the need


in the Russian part of the Internet in table-1. [attachment]

3.3 Search pointers

Search indexes are automated systems. They can function without human intervention, and therefore their knowledge of the real resources of the Web is much greater (by several orders of magnitude) than that of directories. The number of indexed Web pages can run to hundreds of millions.

A search index operates in the three stages described in section 2.2.

Specific recommendations for choosing a search index become outdated very quickly. The situation on the Internet is changing literally before our eyes. Not even half a year passes without something changing in the search engines. The system that was the best yesterday may not be the best today and may be very bad tomorrow. At the same time, popularity is a tricky thing. It is hard to earn, but once earned it lasts a long time. As a result, we very often find that the most popular system is far from the best one. We will help the reader learn to check different search engines independently and choose the ones that give the best results for the work at hand. When making this check, the size of the search index is not critical. After all, we do not need millions of links, only two or three, but preferably the best ones. Therefore, what matters is not only how many Web pages the search engine has indexed, but also when it last did so, how often it rechecks that its links are still valid and how well it presents the search results.

Comparative review of search engines.

There is no need to talk in detail about how to use search directories. Since you just need to go to the site, select a category that interests you, select a section in it, and so on, until a list of specific links opens.

It is much more interesting to consider the techniques for using search pointers, especially since these techniques are different for different pointers. But before starting to study a specific system, it is necessary to consider general concepts that are equally relevant to all search indexes, as an example I will consider such popular, and in my opinion the most convenient, search engines like Yandex and Rambler.

I will start with the main types of search; there are essentially four of them.

All search indexes implement several search algorithms: simple search, advanced search, contextual search, and special search.

Simple search. In a simple search, one or more words that characterize the content of the document are entered into the query field. If only one word is entered, the response usually contains so many links that it is not clear what to do with them. If more than one word is entered, the result depends on how the system interprets the words, and this, in turn, depends on the specific system used. Simple search techniques usually differ from one search engine to another, and it is advisable to read the instructions before using them. A simple search in Rambler is shown in Fig-8. When you enter the phrase: Everything is mixed up in the Oblonskys' house, the search indexes give the following results:

Rambler 9 (documents)

Yandex 2400 (documents)

Advanced search. An advanced search always implies a query made up of a group of words. In most cases, an advanced search allows keywords to be linked with the logical operators AND, OR, NOT and others. The main advantage of advanced search is that the rules for writing keywords and logical operators are either the same or very similar in different systems. Therefore, having mastered the advanced search techniques once, you can use them anywhere. You just need to switch the system to the desired mode first (Fig-9).

When you enter the phrase: Everything is mixed up in the Oblonskys' house, in the advanced search, the search indexes give the following results:

Rambler 9 (documents)

Yandex 2400 (documents)

Fig-8 Simple Search in Rambler


Fig-9 Switching the system to the advanced search mode.

Contextual search. This is a very useful form of search that, unfortunately, is not implemented in all search indexes. The systems that support it deserve special appreciation. A contextual search requires an exact match of a phrase or a group of words, for example “Everything is mixed up in the Oblonskys' house”. In most search engines that support this method, the key phrase must be enclosed in quotation marks: "Everything is mixed up in the Oblonskys' house" (Fig-10).

When you enter the phrase: "Everything is mixed up in the Oblonskys' house", the search indexes return the following results:

Rambler 0 (documents)

Yandex 8 (documents)

Fig-10. Contextual search in RAMBLER.RU


Special search. Special search commands are used to look for additional information. For example, such commands allow you to determine how many hyperlinks on the Web point to a given resource, or to find keywords included in the headers of Web pages, and so on. As a rule, special search commands differ from one search engine to another.

You also need to consider the general rules for writing search commands.


General rules for writing search commands:

Space separated words

Let's say a user wants to find a Web page that says something about the Microsoft Windows operating system. It is logical to enter the words Microsoft Windows in the search field and wait for the result. But the result can be discouraging. Some search engines understand such a query as Microsoft AND Windows, and they will return what the user is looking for. Others may interpret it as Microsoft OR Windows and search for all Web pages that contain either the first word, the second, or both. The user, of course, is only interested in those pages on which both words occur together, but they will be literally buried among other pages that he does not need.

When you get started with an unfamiliar system, you need to start by checking how it handles groups of keywords. One word is entered first: Microsoft. You can see how many results the system will give.

Rambler 28184 (documents)

Yandex 1048379 (documents)

Then the second word is entered: Windows, and the number of results is checked again. Finally, both words are entered: Microsoft Windows.

When you enter the phrase: Microsoft Windows, the search indexes return the following results:

Rambler 6641 (documents)

Yandex 259276 (documents)

If the number of Web pages found is greater than in both the first and the second case, the system considers the keywords to be related by OR (the sets are united). If the result is smaller than in each of the first two tests, the system uses the AND relation (the sets are intersected). In either case, you will have to read the help information to find out how to get the opposite result. For example, all major Russian search engines put the AND operator between words by default, although the Yandex system has its own peculiarities (see table-2). There, the two words are required to be present simultaneously not in the document as a whole but in one sentence. If it is enough that they are present in the document, you must put a + sign before each word. At the same time, the inverse problem arises: how to search for documents containing at least one of the given keywords, that is, how to set the OR relation?

Rambler: Microsoft OR Windows; (50986 documents)

Yandex: Microsoft | Windows; (2034641 documents)
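The check described above (compare the hit counts for each word alone with the count for both words together) can be sketched in a few lines of Python. This is only an illustration: the function name is invented here, and the count for Windows alone is not given in the text, so a placeholder value is used and labeled as such.

```python
def infer_default_operator(hits_first, hits_second, hits_both):
    """Guess how an engine joins space-separated words from three hit counts."""
    if hits_both > max(hits_first, hits_second):
        return "OR"   # union: more results than either single-word query
    if hits_both < min(hits_first, hits_second):
        return "AND"  # intersection: fewer results than either single-word query
    return "UNDETERMINED"


# Rambler counts quoted in the text: Microsoft -> 28184, Microsoft Windows -> 6641.
# The count for Windows alone is NOT given in the text; 30000 is a placeholder.
HYPOTHETICAL_WINDOWS_HITS = 30000

print(infer_default_operator(28184, HYPOTHETICAL_WINDOWS_HITS, 6641))  # prints AND
```

With the explicit OR query the text reports 50986 documents for Rambler, which is larger than the single-word count for Microsoft, as the union interpretation predicts.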

Role of capital letters

In most search engines, case handling is asymmetric: the query “bread” also matches “BREAD”, but the query “BREAD” does not match “bread”. The general rule is that if the user entered lowercase characters, both lowercase and uppercase characters are searched for, but if the user entered uppercase letters, only an exact match with the uppercase letters is searched for. A classic example is Little Red Riding Hood. If you enter the words this way, with capital letters, only documents containing exactly the combination Little Red Riding Hood will be found. However, if the keywords are written as little red riding hood, more documents will be found. All documents in which any of the following combinations occur will pass through the selection sieve: little red riding hood, Little red riding hood, little red Riding Hood and Little Red Riding Hood. Therefore, one should not overuse capital letters in a query; use them only when you are absolutely certain of the result.
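The asymmetric case rule can be expressed as a small comparison function. This is a minimal sketch of the general rule as stated in the text, not the implementation of any particular search engine; the function name is an assumption made for illustration.

```python
def word_matches(query_word, document_word):
    """Apply the general case rule described in the text.

    A query word written entirely in lowercase matches any capitalization;
    a query word containing uppercase letters must match exactly.
    """
    if query_word.islower():
        return query_word == document_word.lower()
    return query_word == document_word


print(word_matches("bread", "BREAD"))  # lowercase query: case is ignored
print(word_matches("BREAD", "bread"))  # uppercase query: exact match required
```

The first call returns True and the second returns False, which is the asymmetry the paragraph describes.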

However, some search engines are different. So, for example, in the Rambler system, when indexing, all uppercase letters are forcibly "lowered" to lowercase. This means that it is useless to use uppercase letters in a query on this system.

When you enter the phrase: Little Red Riding Hood, search indexes return the following results:

Rambler 2921 (documents)

Yandex 16458 (documents)

The role of reserved words

Reserved words are words that are not taken into account when a query is processed. During the indexing of Web pages, the program throws them out of the text, which significantly reduces the size of the indexes and shortens the search time. Reserved words usually include non-informative words: prepositions, conjunctions, pronouns, articles and other small words. So, for example, if you search for the phrase "Everything is mixed up in the Oblonskys' house" in the Yandex system, documents containing What is mixed up in the Oblonskys' house? and Where did it get mixed up? In the Oblonskys' house? will also be found. In some systems, words that occur extremely often, and are therefore not informative, may also be reserved. If, for example, the system is focused on searching for books, the word book is not informative for it. The word auto is uninformative for a search engine that deals with automobiles, and the words computer and Internet are uninformative for systems focused on finding information about computing technology.

It is especially important to take the role of reserved words into account when conducting a contextual search. A contextual search requires an exact match between what the user requested and what appears in Web documents. If a search engine has “stripped” Web documents of reserved words at the indexing stage, it cannot cope with contextual search, except perhaps by “glancing” into stored copies of Web pages, if it has any, but that takes a lot of time. Therefore, honest contextual search is rare in search engines. In Russia, for example, both Yandex and Rambler only pretend to offer contextual search; for this, the desired phrase must be enclosed in quotation marks. However, after a few simple tests it is easy to see that this is not really a contextual search, but a search that is exact only up to reserved words. An example is the query "Everything is mixed up in the Oblonskys' house", which returns the result What is mixed up in the Oblonskys' house. In table-2, I give a comparative description of the main search engines. [Appendix]
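Why stripping reserved words breaks exact phrase matching can be shown with a short Python sketch. The stop-word list below is hypothetical and chosen only to reproduce the example from the text; real search engines keep their own lists.

```python
import re

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"everything", "what", "is", "in", "the", "up"}


def index_terms(text):
    """Lowercase the text, split it into words and drop the stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def phrase_match_after_stripping(phrase, document):
    """Look for the phrase as a contiguous run among the surviving terms."""
    needle = index_terms(phrase)
    haystack = index_terms(document)
    n = len(needle)
    return any(haystack[i:i + n] == needle
               for i in range(len(haystack) - n + 1))


query = "Everything is mixed up in the Oblonskys' house"
print(phrase_match_after_stripping(query, "What is mixed up in the Oblonskys' house?"))
```

Both texts reduce to the same three terms once the stop words are removed, so the "exact" phrase search reports a match even though the documents differ in their first word.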


4. Comparative characteristics of two search engines: Rambler.ru and Yandex.ru


4.1 RAMBLER

Rambler.ru was historically (before Yandex) the most popular search engine in Russia. It started working earlier than the others and for a long time was the leader in terms of index size and quality of search services. Alas, today these achievements are in the past. Although the size of the Rambler search index is approximately 12 million Web pages, it has not really been updated for a long time and gives outdated results. Today Rambler is a popular portal, the best classification and rating system in Russia, and an advertising platform. (Fig-10)

Search techniques in the Rambler system:

Search query language

A search query can consist of one or more words and may contain punctuation marks. You can write simple queries without going into the intricacies of the query language. So, if you enter several words into the search line without punctuation marks and logical operators, documents containing all these words will be found (and at a limited distance from each other).

However, knowledge and correct application of the search engine query language will help make the search on Rambler fast and efficient.

Case

In general, the case in which search words and operators are written does not matter, that is, house and HOUSE, Not and nOt are treated the same. Only occasionally, in order to improve the quality of the search, is the case of the query words taken into account.

For example, if a query consists of two, three or four words, each of which is written with a capital letter, then a search by proper name is assumed, and the limitation of the distance between the words of the query is automatically changed from the default value to the value (n-1) * 2 , where n is the number of words in the query. This allows you to find a group of query words, within which there is no more than one "extra" word or punctuation mark, for example "Baden-Baden", "A. Pushkin", "Fyodor Mikhailovich Dostoevsky".

Operators

A multi-word query can contain operators. Operators are not searched for in the document; they serve only as instructions to the search engine. All search engine operators are binary, that is, they have left and right parts, each of which is also a query (by default, consisting of one word). To change the scope of operators (grouping multiple query words into an operator argument), parentheses and quotes are used. Two queries connected by the AND operator (logical AND) form a complex query that is satisfied only by documents that simultaneously satisfy both of these queries. In other words, the query "dog AND cat" will only find documents that contain both the word "dog" and the word "cat".

A complex query consisting of two queries connected by the OR operator (logical OR) is satisfied by all documents that satisfy at least one of these two queries. If you search for "dog OR cat", you will find documents that contain at least one of the words "dog" or "cat" (or both of these words together). The NOT operator (logical AND NOT) forms a query, which is answered by documents that satisfy the left side of the query and do not satisfy the right. So, the search result for the query "dog NOT cat" will be all documents that contain the word "dog" and not the word "cat". If no operator is explicitly specified, the default AND operator is used: only documents containing all of the query words are found. Thus, the query "information technology credit" will be interpreted as "information AND technology AND credit". On the Advanced Search page, the default operator can be replaced with OR (Search for query words: at least one).
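The behavior of the three binary operators can be illustrated with a small Python sketch in which every query is a function that tests the set of words in a document. The helper names and the three sample documents are invented for this illustration; inflected word forms, which Rambler also matches, are not modeled.

```python
def term(word):
    """Query satisfied when the document contains the word."""
    return lambda words: word in words


def op_and(left, right):
    return lambda words: left(words) and right(words)


def op_or(left, right):
    return lambda words: left(words) or right(words)


def op_not(left, right):
    # Binary "AND NOT": satisfy the left side and fail the right side.
    return lambda words: left(words) and not right(words)


# Made-up sample documents.
documents = {
    "doc1": "the dog chased the cat",
    "doc2": "the dog slept",
    "doc3": "the cat slept",
}


def search(query):
    return sorted(name for name, text in documents.items()
                  if query(set(text.split())))


print(search(op_and(term("dog"), term("cat"))))  # dog AND cat
print(search(op_or(term("dog"), term("cat"))))   # dog OR cat
print(search(op_not(term("dog"), term("cat"))))  # dog NOT cat
```

In this sample, "dog AND cat" selects only doc1, "dog OR cat" selects all three documents, and "dog NOT cat" selects only doc2.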

Each of the operators has an abbreviation:

Operator abbreviation

A query of several words interspersed by operators will be interpreted according to their priority. Operators AND and NOT traditionally have a higher priority, so a query of several words is first grouped by the operators AND and NOT, and only then by the operators OR. You can change the grouping order using parentheses.

Quotes

You can use double quotes to search for quotations. Query words enclosed in double quotes are searched for in documents in exactly the order and in the forms in which they appear in the query. Thus, double quotes can also be used simply to search for a word in a given form (by default, words are found in all forms). For example, the query "airplane" refueled "landing" is satisfied by a document containing the text "... the airplane has landed and refueled ...", but not by a document containing "... the airplane has landed to refuel ...".

Parentheses

When building queries, it sometimes becomes necessary to combine query words into groups that serve as the arguments of an operator. Such groups are enclosed in parentheses. The parenthesized part of a query is itself a query and is subject to all the rules of the query language. Using parentheses allows you to build nested queries and pass them to operators as arguments, as well as to override the default operator precedence. The query without parentheses "car airplane | airfield" is equivalent to "car AND airplane OR airfield" and, according to operator priorities, means "find documents containing either the words "car" and "airplane", or the word "airfield"". The query with parentheses "car (airplane | airfield)" is equivalent to "car AND (airplane OR airfield)", which means "find documents containing the word "car" and one of the words "airplane" or "airfield"".
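The effect of operator priority and parentheses can be checked with a small recursive-descent evaluator. This is a sketch under stated assumptions, not Rambler's actual parser: it assumes well-formed queries, treats a space as AND, accepts only the operator spellings used in this section (AND, &, OR, |, NOT), and ignores word forms, quotes and distance limits.

```python
import re


def tokenize(query):
    """Split a query into parentheses, the symbols | and &, and words."""
    return re.findall(r"[()|&]|[^\s()|&]+", query)


def matches(query, text):
    """Evaluate a query against a text; AND and NOT bind tighter than OR."""
    tokens = tokenize(query)
    words = set(text.lower().split())
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():
        result = parse_and()
        while peek() in ("|", "OR"):
            advance()
            right = parse_and()
            result = result or right
        return result

    def parse_and():
        result = parse_atom()
        # A bare space between two words is an implicit AND.
        while peek() not in (None, ")", "|", "OR"):
            negate = peek() == "NOT"
            if peek() in ("&", "AND", "NOT"):
                advance()
            right = parse_atom()
            result = result and ((not right) if negate else right)
        return result

    def parse_atom():
        token = peek()
        advance()
        if token == "(":
            result = parse_or()
            advance()  # skip the closing parenthesis
            return result
        return token.lower() in words

    return parse_or()


text = "the airfield was closed"
print(matches("car airplane | airfield", text))    # (car AND airplane) OR airfield
print(matches("car (airplane | airfield)", text))  # car AND (airplane OR airfield)
```

For a text that mentions only the airfield, the first query is satisfied and the second is not, which is exactly the difference the parentheses make in the example above.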

Metacharacters

Rambler does not yet support searching for strings using the metacharacters "*" and "?", which are usually used in the meaning of "any substring" and "arbitrary single character", respectively. However, these characters are reserved for similar future use.

Using the query language

Each query addressed to the Rambler search engine is processed in accordance with the rules of the query language. Certain words and symbols are treated as query language operators and processed in a special way. In effect, the query language describes a formula that is used in the search: each document is "matched" against it, and the search result consists only of those documents that satisfy it. For example, the query "airplane" is satisfied by all documents in which the word "airplane" occurs in any form at least once. A query consisting of several words is satisfied by documents containing each of these words in any form (under certain conditions). Whether a document matches a more complex query is determined by the logic of the operators and constructs of the query language.

Morphology

For each word of the query, the search takes into account the inflection rules of the corresponding language. Rambler understands and distinguishes between Russian and English words; by default, the search covers all forms of a word. For example, a search for the word "person" will also find documents containing its other inflected forms, and even "people". To search for only one specific form of a word, you need to enclose it in double quotes or use the exact phrase search in the advanced search.

Stop words

Some words and symbols are excluded from the query by default because of their low information content. These are the so-called stop words: the most frequent words of the Russian and English languages, for example prepositions, particles and articles. The presence of these words can slow down searches and negatively affect the completeness of the results. You can indicate that these words are needed in the query by enclosing the query in double quotes or by using the exact phrase search in the advanced search.

Distance limitation

If a query is composed of one or more words without operators or other query language constructs, documents containing all the query words will be found. At the same time, every query is subject to a so-called context limitation: a positive number, by default equal to a distance of 40 words. A document containing all the query words is returned only if the distance in words between the occurrences of the query words is less than this number. For example, the query "red army" will find documents in which the words "red" and "army" occur at least once less than 40 words apart. The value of the context limitation can be changed with the construction "(number, query)", where number is any positive number and query is any query that is valid from the point of view of the search engine and consists of more than one word (obviously, limiting the distance between words makes no sense for a one-word query). Thus, the query "(2, red army)" will find only those documents in which the words "red" and "army" occur at least once with no other word between them (since only when they are immediately adjacent is the difference between their ordinal numbers less than 2, that is, equal to 1).
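The context limitation can be sketched as a check on word positions. The function below is an illustration, not Rambler's algorithm: for queries of more than two words it assumes that the whole group must fit into a span shorter than the limit, which the text does not spell out, and it ignores word forms.

```python
import itertools
import re

DEFAULT_LIMIT = 40  # default context limitation quoted in the text


def within_context(query_words, text, limit=DEFAULT_LIMIT):
    """True if every query word occurs and some set of occurrences
    spans fewer than `limit` word positions."""
    tokens = re.findall(r"\w+", text.lower())
    positions = []
    for word in query_words:
        found = [i for i, token in enumerate(tokens) if token == word.lower()]
        if not found:
            return False  # a query word is missing from the document
        positions.append(found)
    # Try every combination of one occurrence per query word.
    return any(max(combo) - min(combo) < limit
               for combo in itertools.product(*positions))


print(within_context(["red", "army"], "the red army marched", limit=2))
print(within_context(["red", "army"], "the red and glorious army", limit=2))
print(within_context(["red", "army"], "the red and glorious army"))
```

The first and third calls return True; the second returns False, because with the limit set to 2 the words must be immediately adjacent, as in the "(2, red army)" example.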

Words not found

If the query consists of several words, and some of them could not be found on the Internet at all, then search results are returned for a partial query, from which words that are absent on the Internet are excluded. In this case, the corresponding diagnostics are displayed on the search results page.


Sorting results

By default, found documents are sorted by relevance (matching the query). However, you can request that the most recent (or, alternatively, the oldest documents) be placed at the top of the list instead. To do this, select the appropriate setting in the "Sort by ..." menu on the detailed request page. You can also restrict the search to documents created in a certain period of time: for this, you must specify "From date ... to date ..." on the detailed request page.

Distance between words

You can require Rambler to return only those documents where the words from the query are at a minimum distance from each other. The "Limit the distance between words" mode can be enabled in a detailed query. All of the above rules can be used together with each other in the required sequence.

Delivery of results

By default, search results are returned in portions of 15 documents. The "Issue by ..." menu on the detailed request page allows you to increase this number to 30 or 50. The "Output form ..." menu allows you to receive document descriptions with increased or decreased detail.


4.2 YANDEX

Yandex.ru is a search engine capable of finding the most suitable web pages in the Russian part of the Internet upon request. Yandex searches hundreds of thousands of Web pages every day, looking for changes or new links. The collection of links is constantly growing. Yandex does not require knowledge of special search commands. Yandex will find everyone who referred to the page, files with the desired image, the latest news or products in electronic stores. At the heart of the Yandex system is the largest index - about 27 million Web pages, but it's not just size. It is not just a pointer to resources, but a pointer to the most up-to-date resources. In terms of relevance, Yandex is the undisputed leader today (Fig. 4)

Search techniques in the Yandex system

Before proceeding with the description of the query language of the Yandex system, I note that it is noticeably more powerful and more complex than the query languages of other domestic search engines. However, the average user need not be intimidated. Even if he really does not like to read, let alone study, instructions, he can work with the system intuitively.

In principle, the Yandex system uses heuristic algorithms in its work, which are not entirely strict from a mathematical point of view. As a result, the user may get different results, for example, if he searches for documents with the words Bush Gore elections and Bush Gore elections. But thanks to these algorithms, an intuitive approach to creating queries (without reading instructions) gives a very good result, moreover, in a very short time.

Search by one word

When a user enters a word in the search field and clicks the Find button, the word is searched for in all its possible word forms, which is especially important for the Russian language. For example, if the word snow is entered, documents containing its inflected forms will be found, but not documents containing derived words such as snowy. If the search for word forms is not required, it can be disabled with an exclamation mark, for example !snow.

Search by word group

If the words are separated by a space, documents are searched for in which all the entered words occur in one sentence. So, for the query Bush Gore elections the system returns documents with phrases like ... On the eve of the elections, hackers broke into the websites of Bush and Gore. The results of such a search may include loose matches: the search engine shows its intelligence. To strictly require the words to appear in a sentence, you must put a + sign in front of them, for example: +Bush +Gore +elections. The + sign must be written together with the word to which it refers (without a space). The space plays the role of the AND operator, which can also be entered explicitly (the & symbol), for example: +Bush & +Gore & +elections. There must be spaces to the right and left of a logical operator.

If the words are required to be present simultaneously not in one sentence but anywhere in the document, the && operator is used, for example: +Bush && +Gore && +elections.

Now I will consider the techniques for excluding words from the search. For this, the following signs are used: - (strict exclusion from the sentence), ~ (non-strict exclusion from the sentence) and ~~ (exclusion from the entire document). So, for example, the query +Bush +Gore ~~ elections will select documents in which the words Bush and Gore occur in one sentence, but the document as a whole does not contain the word election or its derivatives (choice, at elections, after elections, etc.).

In cases where you need to combine keywords with the OR operator, use the | sign (vertical bar). For example, the query Bush | Gore && +elections will select documents that mention either George W. Bush or Albert Gore, but that necessarily contain the word elections.
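The difference between the sentence scope (& and ~) and the document scope (&& and ~~) can be illustrated with a short Python sketch. The sentence splitting here is deliberately naive, word forms are not modeled, and the function names are invented for this illustration; it does not reproduce Yandex's heuristics.

```python
import re


def sentences(text):
    """Split text into sentences and return a set of words for each one."""
    return [set(re.findall(r"\w+", part.lower()))
            for part in re.split(r"[.!?]+", text) if part.strip()]


def all_in_one_sentence(words, text):
    """Sentence scope, in the spirit of: Bush & Gore."""
    wanted = {w.lower() for w in words}
    return any(wanted <= sentence for sentence in sentences(text))


def all_in_document(words, text):
    """Document scope, in the spirit of: Bush && Gore."""
    wanted = {w.lower() for w in words}
    return wanted <= set().union(*sentences(text))


def absent_from_document(word, text):
    """Document-wide exclusion, in the spirit of: ~~ word."""
    return word.lower() not in set().union(*sentences(text))


text = "Bush spoke first. Gore replied later."
print(all_in_one_sentence(["Bush", "Gore"], text))  # different sentences
print(all_in_document(["Bush", "Gore"], text))      # same document
print(absent_from_document("elections", text))
```

The sample text satisfies the document-level condition but not the sentence-level one, which is why the two operators can return noticeably different result sets.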

Search by distance

A long time ago, the NEAR operator appeared in search engines, which allows you to find documents in which two words are located close to each other. It is true that each system understands "close" differently. In the Yandex search engine, you can specifically specify at what distance these words should be from each other.

In a document, each word has its own position number. The position numbers of two adjacent words differ by one (the position number of the word on the right is greater). The distance operator is written as /+n, where n is the number corresponding to the distance. For example, the /+1 operator matches two consecutive words, that is, Microsoft /+1 Windows is the same as the phrase Microsoft Windows.

The distance operator can also be negative. This means that the second word specified in the query must appear before the first in the document. For example, the query Microsoft /-5 Windows may return a document containing the phrase: ... operating systems that will replace Windows, said a Microsoft executive.

When searching by distance, you can specify not the exact distance between words but a range, for example /(-5 +5). In this case, documents will be selected in which the keywords specified in the query fall within the specified range. In fact, if the sign of the parameter is not specified, this is also a range search. So, the operator /5 should really be regarded as the range /(-5 +5). The query Bush /5 Gore will find sentences such as: Women sympathized with Bush, and men sympathized with Gore, or: Gore was no sweeter than Bush.
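The signed distance and the range form can be sketched with word position numbers. This is an illustrative model of the rule as described in the text (position of the second word minus position of the first), not Yandex's implementation; the function name and its parameters are assumptions.

```python
import re


def distance_match(first, second, text, low, high):
    """True if some occurrence of `second` lies between `low` and `high`
    positions after `first` (negative values mean before it)."""
    tokens = re.findall(r"\w+", text.lower())
    first_pos = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_pos = [i for i, t in enumerate(tokens) if t == second.lower()]
    return any(low <= s - f <= high for f in first_pos for s in second_pos)


text = "Women sympathized with Bush, and men sympathized with Gore"

print(distance_match("Bush", "Gore", text, -5, 5))   # like Bush /5 Gore
print(distance_match("Bush", "Gore", text, 1, 1))    # like Bush /+1 Gore
print(distance_match("Gore", "Bush", text, -5, -5))  # like Gore /-5 Bush
```

In the sample sentence Gore stands five positions after Bush, so the range query and the negative query with the words swapped both succeed, while the adjacency query /+1 fails.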

The Yandex system has rather complex query language rules (compared to Rambler), but it also has extensive capabilities. For example, distances can be measured not only in words but also in sentences. This unit is used when the double operator && or ~~ appears in the query. So, the query Bush /+1 && Gore will return documents in which the words Bush and Gore occur either in the same sentence or in adjacent ones.

Using brackets

A search query is essentially a Boolean expression that acts as a filter when the documents in a search engine's database are examined. In a logical expression, just as in arithmetic, you can use parentheses. They serve to control the order of operations. For example: Bush & Gore & (election | vote). Such a query will return links to Web pages containing sentences that include the words Bush, Gore, election, or Bush, Gore, vote.

Ranking management

The purpose of ranking is to make sure that the Web pages that best match the query are shown as early as possible in the list of results. What algorithms the search engine uses for ranking is its own business. Users are either happy with its work or turn to another search engine. In the Yandex system, it is possible to influence the ranking mechanism yourself by means of weighting factors. Such a factor can be assigned to any keyword, or to a whole expression if it is enclosed in parentheses. Weights are entered after a colon, for example Bush:5 Gore elections. With such a query, documents in which the word Bush occurs more often take precedence and appear at higher positions in the resulting list.
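How a weighting factor can change the order of results may be illustrated with a toy scoring function. The text does not say how Yandex actually computes relevance, so the formula below (weight multiplied by the number of occurrences, summed over the keywords) is purely an assumption, and the two sample documents are made up.

```python
import re
from collections import Counter


def score(text, weights):
    """Toy relevance score: sum of weight * number of occurrences."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return sum(weight * counts[word.lower()] for word, weight in weights.items())


def rank(documents, weights):
    """Return document names ordered from the highest score to the lowest."""
    return sorted(documents, key=lambda name: score(documents[name], weights),
                  reverse=True)


# Made-up sample documents.
documents = {
    "a": "Bush Bush Gore elections",
    "b": "Bush Gore Gore Gore elections",
}

print(rank(documents, {"Bush": 1, "Gore": 1, "elections": 1}))  # like: Bush Gore elections
print(rank(documents, {"Bush": 5, "Gore": 1, "elections": 1}))  # like: Bush:5 Gore elections
```

Without weights document b comes first; with the weight 5 on Bush, document a, which mentions Bush more often, moves to the top.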

Another ranking control technique involves the qualifier word. This is a word that does not have to be contained in the selected documents, but if it is there, the document gets ranking priority. The qualifying word is entered after the characters <_. For example, when searching by the keywords Gore Bush<_junior elections, priority goes to Web pages that deal not just with George Bush but with George Bush Jr.

Special search

Techniques for searching for information contained in the special header fields of Web pages (each Web page has service fields in its header), or for special elements included in Web pages, such as hyperlinks, stand apart. In the Yandex system, special search commands for header fields begin with the $ character, and commands for finding individual elements of Web pages begin with the # character. All special searches are noticeably slower than conventional searches.

Command: $title (expression)
Description: The search for keywords specified in the expression is performed only in the titles of Web pages
Example: $title (Cosmos)
Explanation: Only Web pages with the word Cosmos in their titles are searched for (Fig. 7.10)

Command: $anchor (expression)
Description: The search for keywords specified in the expression is performed only in the anchors of internal links of Web pages
Example: $anchor (introduction)

Command: #keywords=(expression)
Example: #keywords=(news)

Command: #abstract=(expression)
Description: Searching the annotation of a Web page
Example: #abstract=(Bush | Gore)

Command: #image="filename"
Description: Search for illustration files by their name
Example: #image="Bush.*"
Explanation: If you do not know in advance what extension the file name can have, use the wildcard “*”, which replaces any number of arbitrary characters

Command: #hint=(expression)
Description: Search for words in the alt text of illustrations
Example: #hint=(Bush | Gore)

Command: #url="URL address"
Description: Find a site or Web page
Example: #url="www.anysite.ru"
Explanation: Typically used to localize searches. For example, to limit the search range to one site, or, conversely, to exclude it from the search scope

Command: #link="URL address"
Explanation: Commonly used to identify Web pages that have hyperlinks to your own page


Conclusion

I was able to fully cover the questions posed and gain an understanding of this topic (how to carry out an effective search on the Internet). I was convinced from my own experience that, even in our age of high technologies, effective information retrieval remains an unsolved problem and one of the main ones. I can explain this as follows.

Firstly, there is the imperfection of the search engines themselves, which casts doubt on any search at all. Search engines lack order, structure and systematization, and the robots of most search engines bring back a huge number of useless hyperlinks.

Secondly, there is the inexperience of users. Searching for and finding what is needed in the heap of texts on the Internet takes skill not only from the search engine but also from the user asking the question.

Thirdly, there is the greed of programmers, and of the advertising agencies that hire them, who want their sites to be requested as often as possible. These "greedy" programmers deceive the robots, so that the search returns a site that supposedly contains the information the user needs, while in fact it holds advertising brochures or an automatic hyperlink that opens an advertising site or, even worse, a paid site. Although the specialists who maintain search engines are fighting this phenomenon, it is still growing in scale every day.

Today, the Internet is used as a reference by 23% of users, a research tool by 15%, entertainment by 14%, and only as a news source by 12%.

It is not an optimistic opinion that 10% of users always, and 73% often manage to find the information they need.

To the question of which search engine is the best and which one I prefer to use, I will answer this way: you should use the engine that is most convenient for you, and for me it is more convenient to use Yandex.

The Internet has made searching easier, but it has also required specific knowledge about searching. Today search is not always effective; we are only at the dawn of its development. Therefore, do not forget about the old and no less effective way of searching for information: books and libraries. This source of information has proved its worth since the times of the Library of Alexandria, while the Internet will only become more effective in the near future and will become almost irreplaceable.

List of used literature


1. Andrey Alikberov, "A few words about how search engine robots work".

The language of the search engine Yandex is used

Search by phrase

Prefixes

Iterative search (in results)

Once logged in, click More ...

replacement of a part of a word

* (not always correct)


Table 2

Pivot Table of Top Search Engines


Yandex

Aport!

AltaVista

Search area, database size

Russian part of the Internet. Search through the pages of sites from the catalog section, by region. Special search for news, goods, pictures.

Russian part of the Internet.

Russian part of the Internet. Specialized search for news, products, pictures, MP3

Dedicated search for news, products, entertainment, audio (MP3) and video.

Specialized University Search USA, Apple, Linux, BSD

Base volume at the beginning of 2001

Over 31 million documents

Over 12 million documents

Over 14 million documents

Over 250 million documents

1.25 billion pages

Indexing type

full-text indexing

full-text indexing

full-text indexing

full-text indexing and indexing by links

Availability of additional services

The system integrates a search engine and a catalog, as well as a number of additional projects (Bookmarks.Ru, Narod.Ru, the system of intelligent selection of goods, CY, etc.).

The system integrates a search engine, a catalog and additional services (online purchases, etc.)

The system integrates a search engine, a catalog and a number of additional services (hosting, domain name registration, translation, etc.)

The system combines a search engine and a directory containing 15 sections and 1.5 million Web pages.

Search language syntax

logical AND

space or & (within a sentence), && (within a document)

AND, &, default space between words

And, AND, &, +, default space between words

AND, & (complex search only)

default for all search words

logical OR

OR (default for simple search), | (only for complex search)

binary operator AND NOT

~ (within a sentence), ~~ (within a document)

not used

replaced by the prefix operator "-" (AND is the default space)

AND NOT, ! (complex search only)

replaced by the prefix operator "-"

prefixes of required (+) and forbidden (-) words

not used

+, - (simple search only)

word grouping

not used

distance between keywords in search

/ (n m) - in words, && / (n m) - in sentences (- back, + forward)

in advanced search: returns only documents with a minimal distance between words

sl2 (...), c2 (...), w2 (...), (- back, + forward)

NEAR (within 10 words, complex search only)

not used

phrase search

part-word replacement characters

*, ? (replace any character)

* (only at the end of a word)

document language restriction

choice: any, Cyrillic, Latin

choice: any, Russian, English

choice: Russian, English

choice of 25 languages

choice of 25 languages

morphology

all declensions and conjugations by default; ! (search for the exact word form)

# (all forms of words), @ (words of the same root)

! (indicating normal form)

date search

limit search by fields

Search in titles, addresses, names of documents (only with advanced search). Search for similar documents.

Advanced search form capabilities, quality of help

setting up the advanced search form

setting a dictionary filter, setting by date, by site, link, image, special object

by document, date, AND, OR modes, spacing between words, truncation of a word

by document, title, image, date, 5 sections (sites, MP3, pictures, products, news)

by boolean questionnaire, date, website, link, image, text, etc.

customizing the output of results

setting the number of results per page, output form

setting the output form

setting the number of results on the page, all elements of the output form

setting the number of results on the page, all elements of the output form

ranking search results

sort by relevance or date

by site popularity

by terms specified in SORT

by citation (links to the page from other pages)

iterative search (in search results)

Yes. Done by checking the box

Yes. Done with the search scope switcher

Done by checking the box

Done with

quality of help section

there is a detailed description of the query language, a syntax table and a section on searching in categories

short HELP section

a detailed reference on the query language, there are many Russian synonyms for the main operators

the largest on-line tutorial on the query language discussed in this table

very limited HELP section

family filter