20 facts about parsing that will take data collection to the next level

Fincase is a rapidly growing startup that uses Artificial Intelligence and Machine Learning to transform the real estate appraisal market. Its innovative service, Scoring Analyses of Value, helps to analyse big data and to profit from investing in property.

By Dmitry Tsyplakov, CEO/Product manager of Fincase

What is parsing? The verb “to parse” means to perform syntactic analysis. In a broader sense, “parsing” means collecting and organizing information according to specific parameters. A program set up for your task collects data from open sources on the Internet, groups it and produces a ready-made report.

What is parsing for?
No person can physically process all the information that fills the Internet today. This is where parsing comes to the rescue.
It can:
– Conduct market price analysis. Parsing collects data on competitors and shows the average cost of a product. Manually processing even one online store with several thousand items is extremely difficult, let alone two or three competitors.
– Keep track of the latest developments in your field and identify changes in the market. The program can be configured, say, for weekly monitoring, so that you receive a report with price dynamics.
– Clean up the website. The effect is especially noticeable in online stores with a large assortment. The program can detect duplicates, broken links and missing images, and can even check that the status of goods on the site matches their real status in the warehouse.
– Fill the online store's website with product descriptions. Unless the company makes something exclusive, such as invisible hats in two or three models, parsing is a lifesaver: filling the catalog comes down to almost one click.
Parsing is often used to obtain product information from foreign sites. After a little adaptation to the Russian language, descriptions of hundreds of categories and items are ready. But you should be careful not to incur penalties from search engines for non-unique content.
– Create a database of potential customers. By analyzing hashtags and geotags on social networks and thematic forums, you can collect a database of potential customers in a few hours instead of months or years. And because the program can be tuned to very precise parameters, this database will consist of people who may really be interested in the product.
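
The first use above, market price analysis, comes down to grouping the collected prices by product. Here is a minimal sketch in Python, assuming the parser has already returned records as (store, product, price) tuples. The store names, products and prices are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records, as a parser might return them: (store, product, price).
records = [
    ("store-a", "kettle", 30.0),
    ("store-b", "kettle", 34.0),
    ("store-c", "kettle", 32.0),
    ("store-a", "toaster", 45.0),
    ("store-b", "toaster", 55.0),
]


def average_prices(rows):
    """Group prices by product and return the average price per product."""
    by_product = defaultdict(list)
    for _store, product, price in rows:
        by_product[product].append(price)
    return {product: mean(prices) for product, prices in by_product.items()}


averages = average_prices(records)
```

Running the same function on each week's records and comparing the results would give the simple price dynamics mentioned above.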

The advantages of parsing
The advantages of parsing over manual work are undeniable:
– Data is processed quickly, day and night.
– The specified search parameters are followed exactly.
– There is no human factor in the form of inattention and fatigue.
– Operations and monitoring run regularly.
– Daily, weekly or annual reports contain the necessary information in a format convenient for each reader.
– Requests can be spread evenly over time, so the load on the site stays uniform and does not resemble a DDoS attack.

Parsing limitations
Of course, nothing is perfect, and parsing has a number of limitations.
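  1. Many sites do not allow parsers to collect information (user-agent restrictions). Presenting a user agent like Googlebot's and sending correct requests can get around this, although sites can verify genuine search engine crawlers and block impostors.
  2. Some information is difficult to obtain, for example from closed accounts on social networks. Such data sits behind a login, so telling the program to ignore robots.txt does not open it: that setting only controls whether the parser follows the site's crawling rules for public pages.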
  3. Have you seen those pictures where you need to click on the squares with cars or type a word? Parsers run into them too, and this can be a problem. Teaching a program to recognize specific images and get past a captcha is possible, but very difficult and expensive.
  4. Sending many requests of the same type to a site can lead to the IP address being blocked. Using a VPN is one way around this.
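
A more conservative way to deal with the restrictions above is to read the site's robots.txt and pace the requests. The sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the user agent name `my-parser` and the URLs are hypothetical, and a real parser would download robots.txt from the site instead of using a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real parser would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Crawl-delay: 2
"""


def build_rules(robots_text):
    """Parse robots.txt text into a rule set that can be queried."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules


def allowed_urls(rules, user_agent, urls):
    """Keep only the URLs that robots.txt lets this user agent fetch."""
    return [url for url in urls if rules.can_fetch(user_agent, url)]


def polite_delay(rules, user_agent, default=1.0):
    """Seconds to wait between requests: Crawl-delay if given, else a default."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


rules = build_rules(ROBOTS_TXT)
urls = [
    "https://example.com/catalog/kettles",
    "https://example.com/account/settings",
]
to_fetch = allowed_urls(rules, "my-parser", urls)
pause = polite_delay(rules, "my-parser")
```

Calling `time.sleep(pause)` between requests keeps the load on the site even, which also lowers the chance of the IP address being blocked.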

What information can be parsed
Using parsing, you can collect any information that is publicly available. Most often, users are interested in:
– Prices for similar products
– Names and descriptions of the goods themselves
– Breakdown of goods into categories and their description
– Information about promotions and news from competitors
You can even parse pictures, but as a rule they are protected by copyright and using them would be illegal. The same applies to users' personal data from their personal accounts.

The parsing algorithm
Depending on the task, the way the program works varies, but in general the process looks something like this:
  1. The parser searches for data according to the specified parameters in all open sources.
  2. Initial systematization is carried out: the excess is cut off.
  3. The data is stored in appropriate databases, usually SQL-based, from which it can be extracted both by programs that work with it and by a person for manual analytics or reports.
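
The three steps above can be sketched with Python's standard library alone. This is only an illustration under stated assumptions: the HTML fragment, its `data-name` and `data-price` attributes and the `products` table are hypothetical, and a real parser would download pages instead of reading a string.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page fragment; a real parser would download it from an open source.
PAGE = """
<ul>
  <li class="product" data-name="Kettle" data-price="30.0"></li>
  <li class="product" data-name="Toaster" data-price="45.0"></li>
  <li class="banner" data-name="Sale!"></li>
  <li class="product" data-name="Broken item" data-price=""></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Step 1: collect the attributes of every <li> element as a raw record."""

    def __init__(self):
        super().__init__()
        self.raw = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.raw.append(dict(attrs))


def clean(raw_records):
    """Step 2: keep only product records that have a usable price."""
    cleaned = []
    for rec in raw_records:
        if rec.get("class") != "product":
            continue  # not a product, e.g. an advertising banner
        try:
            price = float(rec.get("data-price") or "")
        except ValueError:
            continue  # missing or malformed price
        cleaned.append((rec["data-name"], price))
    return cleaned


def store(rows, conn):
    """Step 3: save the cleaned rows in an SQL table."""
    conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)")
    conn.executemany("INSERT INTO products VALUES (?, ?)", rows)
    conn.commit()


parser = ProductParser()
parser.feed(PAGE)
conn = sqlite3.connect(":memory:")
store(clean(parser.raw), conn)
saved = conn.execute("SELECT name, price FROM products ORDER BY name").fetchall()
```

An in-memory SQLite database keeps the sketch self-contained; any SQL database could take its place, and reports or other programs would then read from the `products` table.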

Application methods
There are two main scenarios for using parsing:
– a detailed analysis of your own site to further make changes and improvements;
– an in-depth analysis of competitors to decide how to develop and expand your own assortment.
As a rule, one scenario leads to the other. For example, when you analyze competitors' prices for a particular product, you start from your own assortment. In the course of this analysis you discover products that you do not carry and decide whether your customers need them.
This is exactly what happened to us when we were developing our main product, SVA (Scoring Value Analysis). SVA is a system designed to analyze the value of real estate and obtain the necessary economic indicators online.
We were approached by a customer, a major Russian bank, with the complex task of assessing the value of real estate. The bank had made many attempts to find a solution but had not found a suitable algorithm for the following problems.

1) Identification of office classes.
The class shows the level of comfort for employees and is a significant factor in price formation: the higher the class, the more the owner can get for renting out or selling the property.
When looking for an office that will be comfortable for their employees, many tenants are guided by the class. The task is a compound one and includes many subtasks.
2) Determining the level of infrastructure.
For offices above class C, infrastructure is important: the availability of gyms, shops, cafes, restaurants and parking. It increases the cost of the office. The task is difficult in terms of data distribution: the level of infrastructure concerns not a single building but a large surrounding area.
3) Geolocation tasks.
These tasks cover the accessibility of the metro or other transport, location relative to the city center, prestige of the area and general accessibility of the office. All of this directly affects the cost and class of the office.
4) The task of finding analogues.
The cost of an office is often assessed by comparing it with similar properties. Finding analogues is important for determining many parameters, and it also gives both appraisers and tenants options to choose from.
5) The task of assessing the cost of renting or buying real estate.
It is difficult to assess the cost of rent without an expert opinion, and people who are not directly involved in the real estate market prefer to turn to specialists for an accurate assessment.
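
To make tasks 4 and 5 more concrete, here is a deliberately simple sketch of the comparison approach: pick the listings most similar to the target office and take the median of their rents. It is not the SVA algorithm. The listings, the choice of features (class, area, distance to the metro) and the equal weighting of the two differences are all assumptions made for illustration.

```python
import math
from statistics import median

# Hypothetical listings: (id, office class, area in m2, km to nearest metro, rent per m2).
LISTINGS = [
    ("A1", "B", 120.0, 0.4, 210.0),
    ("A2", "B", 150.0, 0.6, 200.0),
    ("A3", "B", 900.0, 3.0, 120.0),
    ("A4", "A", 130.0, 0.5, 350.0),
    ("A5", "B", 110.0, 0.3, 220.0),
]


def find_analogues(target, listings, k=3):
    """Return the k same-class listings closest to the target office.

    target is a tuple (office_class, area, km_to_metro).
    """
    t_class, t_area, t_metro = target

    def distance(item):
        _id, _cls, area, metro, _rent = item
        # Relative area difference and metro distance difference, weighted equally.
        return math.hypot((area - t_area) / t_area, metro - t_metro)

    same_class = [item for item in listings if item[1] == t_class]
    return sorted(same_class, key=distance)[:k]


def estimate_rent(target, listings, k=3):
    """Estimate rent per m2 as the median rent of the k closest analogues."""
    analogues = find_analogues(target, listings, k)
    return median(item[4] for item in analogues)


target_office = ("B", 125.0, 0.5)
analogue_ids = [item[0] for item in find_analogues(target_office, LISTINGS)]
rent = estimate_rent(target_office, LISTINGS)
```

The median is used instead of the mean so that a single unusual listing among the analogues does not skew the estimate.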

Our comprehensive product solves these problems in stages:
  1. collects data from many sources,
  2. conducts a deep mathematical analysis,
  3. takes into account all the factors that form the price,
  4. aggregates data into one database.
Based on complex machine learning algorithms and financial logic, we make estimates and determine the necessary parameters for any type of real estate.
Process automation frees up time, speeds up processes, removes errors caused by the human factor, and provides accurate data.
Fincase's portfolio includes a win in the Vienna Start-up Package 2018 for an innovative idea in the Property Technology sector and 25 major projects in the banking and construction sectors. The implemented projects have shown how effective parsing is: it solves almost any task when it is done for you by specialists.