What is Data Science and why use Python

What is Data Science?

Data science is the art and science of acquiring knowledge through data. It is all about how we take data, use it to acquire knowledge, and then use that knowledge to do the following:

  • Make decisions
  • Predict the future
  • Understand the past/present
  • Create new industries/products

The data science Venn diagram

The basic areas of Data Science are:

  • Math/statistics: This is the use of equations and formulas to perform analysis
  • Computer programming: This is the ability to use code to create outcomes on the computer
  • Domain knowledge: This refers to understanding the problem domain (medicine, finance, social science, and so on)

The following Venn diagram provides a visual representation of how the three areas of data science intersect:

[Figure: Venn diagram showing hacking skills, math and statistics knowledge, and substantive expertise, with data science at their intersection]

Hacking skills allow you to conceptualize and program complicated algorithms using computer languages. A math and statistics knowledge base allows you to theorize about and evaluate algorithms, and to tweak existing procedures to fit specific situations. Substantive expertise (domain expertise) allows you to apply concepts and results in a meaningful and effective way.

Data science is the intersection of these three key areas. In order to gain knowledge from data, we must be able to use computer programming to access the data, understand the mathematics behind the models we derive, and, above all, understand where our analyses fit within the domain we work in, which includes how the data are presented.

Why Python for Data Science?

  • Python is an extremely simple language to read and write, even if you’ve never coded before
  • It is one of the most commonly used languages, both in data science and in software development at large
  • The language’s online community is vast and friendly
  • Python has prebuilt data science modules that both the novice and the veteran data scientist can utilize

The last point is probably the biggest reason why we should focus on Python. Some of these modules are as follows (a short sketch showing a few of them in action appears after the list):

  • pandas
  • scikit-learn
  • seaborn
  • numpy/scipy
  • requests (to mine data from the web)
  • BeautifulSoup (for parsing the HTML of web pages)
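To give a feel for how a few of these modules fit together, here is a minimal sketch (not a definitive recipe) that fetches a page with requests, parses its links with BeautifulSoup, and summarizes them with pandas and numpy. The URL is just a placeholder chosen for illustration.

    # A minimal sketch of a few of these modules working together.
    # The URL is a placeholder; substitute any page you are allowed to scrape.
    import requests
    from bs4 import BeautifulSoup
    import numpy as np
    import pandas as pd

    # requests: pull raw HTML from the web
    response = requests.get("https://example.com")

    # BeautifulSoup: parse the HTML and extract the text of every link
    soup = BeautifulSoup(response.text, "html.parser")
    link_texts = [a.get_text(strip=True) for a in soup.find_all("a")]

    # pandas: organize the scraped text into a tabular structure
    df = pd.DataFrame({"link_text": link_texts})
    df["text_length"] = df["link_text"].str.len()

    # numpy/pandas: quick numerical summaries of what was scraped
    print(df.describe())
    print(np.mean(df["text_length"]))

Each module does one job well, and chaining them together with a few lines of readable code is a big part of why this stack is so popular.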

Some more terminology

  • Machine learning: This refers to giving computers the ability to learn from data without explicit “rules” being given by a programmer.
  • Probabilistic model: This refers to using probability to find a relationship between elements that includes a degree of randomness.
  • Statistical model: This refers to taking advantage of statistical theorems to formalize relationships between data elements in a (usually) simple mathematical formula.
  • Exploratory data analysis (EDA): This refers to preparing data in order to standardize results and gain quick insights.
  • Data mining: This refers to the process of finding relationships between elements of data; it is the part of data science where we try to find relationships between variables (see the sketch after this list).
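To tie a few of these terms to code, the sketch below uses scikit-learn’s bundled diabetes dataset: a quick EDA pass, followed by a simple linear regression, which serves here as both a statistical model and a machine-learned one in the sense above. The dataset and model are chosen purely for illustration.

    # A minimal sketch connecting the terminology above to code.
    # scikit-learn's bundled diabetes dataset is used purely for illustration.
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression

    # Load a small, well-known dataset as a pandas DataFrame
    data = load_diabetes(as_frame=True)
    df = data.frame

    # Exploratory data analysis (EDA): a quick look at shape and summary statistics
    print(df.shape)
    print(df.describe())

    # Statistical model: a simple linear regression formalizing the relationship
    # between the features and the disease-progression target
    model = LinearRegression()
    model.fit(data.data, data.target)

    # Machine learning: the coefficients below were learned from the data,
    # not written as explicit rules by a programmer
    print(model.coef_)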

 
