Data Science Foundations

A beginner’s guide to the fundamental principles of data science.

Artwork by author

Data has become an integral part of our lives. It is no longer limited to numbers and text; it also includes images and videos. A few decades ago, data was small and structured, measured in kilobytes or at most a few megabytes. A popular storage device of that time, the 3.5-inch floppy disk, typically held only about 1.44 MB. Data was stored in a tabular format.

Fast forward to today, and enormous volumes of data are generated every minute; global data creation is now measured in zettabytes per year. Much of this data is unstructured, meaning it does not fit the typical row-and-column layout. It includes image and video data, which makes it more challenging to analyze. However, this massive amount of data gives businesses valuable insights that were not possible before.

This is where data science comes in. Data science is used to extract knowledge and insights from data. It helps businesses increase revenue, optimize processes, and make informed decisions. During the pandemic, data science played a crucial role in controlling the spread of the disease. With the help of data science, we were able to analyze the spread of the virus and take necessary measures to contain it.

Detecting fraudulent transactions and filtering spam emails

One way data science is applied is through data mining, which involves extracting useful information from large sets of data. For example, if a customer receives a call from their bank about a $15,000 transaction for a diamond necklace purchase in Australia, despite never having been to Australia and never having made a transaction over $5,000, data mining algorithms can quickly flag the transaction as fraudulent and alert the customer to verify the transaction.
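The idea behind the bank example can be sketched as a simple rule that compares a new transaction with the customer's history. This is only an illustration: the `Transaction` class, the `is_suspicious` function, the `amount_factor` parameter, and the sample history (including the "Home" country label) are all hypothetical, and real fraud detection systems rely on far richer models.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    amount: float
    country: str


def is_suspicious(tx, history, amount_factor=2.0):
    """Flag a transaction from an unseen country or far above past amounts."""
    seen_countries = {past.country for past in history}
    largest_past_amount = max((past.amount for past in history), default=0.0)
    new_country = tx.country not in seen_countries
    unusually_large = tx.amount > amount_factor * largest_past_amount
    return new_country or unusually_large


# Hypothetical history: every purchase was made at home and stayed under $5,000.
history = [
    Transaction(120.0, "Home"),
    Transaction(4800.0, "Home"),
    Transaction(60.5, "Home"),
]

necklace = Transaction(15000.0, "Australia")
groceries = Transaction(85.0, "Home")

print(is_suspicious(necklace, history))   # flagged for review
print(is_suspicious(groceries, history))  # passes
```

A flagged transaction would not be blocked outright in this sketch; as in the example above, it would simply trigger a call asking the customer to verify it.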

Another example of data science in action is email filtering. Gmail, for instance, uses text analytics and data science algorithms, including text mining, to determine whether an incoming email is genuine, spam, or a promotional message. Using positive and negative dictionaries, the algorithm can identify specific words and phrases that are commonly used in spam emails. For example, subject lines in spam emails are often in all caps or contain phrases like “Congratulations! You’ve won a jackpot!”, “Emergency! Please donate money” or “Important update please do this”. If the number of spam words in an email exceeds a particular threshold, the email will be filtered into the spam folder.
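A toy version of this dictionary-and-threshold approach might look like the following. The phrase list, the `spam_score` and `is_spam` names, and the threshold value are assumptions made for illustration; this is not how Gmail actually works.

```python
# Hypothetical "negative dictionary" of phrases that often appear in spam.
SPAM_PHRASES = (
    "congratulations",
    "jackpot",
    "emergency",
    "donate money",
    "important update",
)


def spam_score(subject):
    """Count spam phrases in the subject, plus one point for an all-caps subject."""
    text = subject.lower()
    score = sum(1 for phrase in SPAM_PHRASES if phrase in text)
    letters = [ch for ch in subject if ch.isalpha()]
    if letters and all(ch.isupper() for ch in letters):
        score += 1
    return score


def is_spam(subject, threshold=1):
    """Treat the email as spam when the score exceeds the threshold."""
    return spam_score(subject) > threshold


print(is_spam("Congratulations! You've won a jackpot!"))  # two spam phrases
print(is_spam("Meeting notes for Tuesday"))               # no spam phrases
```

Raising the threshold makes the filter more lenient, while lowering it catches more spam at the risk of misfiling genuine emails.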

Data Science Life Cycle: Animation by author

As previously discussed, data is often collected from a variety of sources and can be voluminous and unstructured. This data is typically stored in a centralized repository known as a data warehouse. Because the data comes from different sources, the challenge lies in integrating it into a unified structure. For instance, data may arrive in various formats such as MP3, PNG, and PDF. It is therefore essential to transform and consolidate the data into a common format for analysis.

The next step is to identify the relevant target data for a specific business problem or analysis. Not all data points are equally important, and focusing on the essential information is key to deriving meaningful insights. Proper data integration and identification of target data are therefore essential steps in data acquisition.

Data acquisition: Animation by author

Data pre-processing is a crucial stage in the data science life cycle, often taking up more than 50% of the cycle. The process comprises two main steps:

  • Data manipulation: filtering and reshaping the data, for example selecting the relevant records from thousands of rows. In a programming language like Python, a single line of code can often do this.
  • Data visualization: using techniques such as bar charts and histograms to provide insights visually, enabling easy comprehension and interpretation. Visual representation is vital in data analysis because, as the saying goes, a picture is worth a thousand words.

Once the raw data has been transformed into a tidy format, a machine learning algorithm can be applied to help extract meaningful insights from it. The most commonly used types of ML algorithms include:

  • Classification: the assignment of data points to specific categories
  • Regression: predicting a continuous numeric value from patterns in the data
  • Clustering: grouping of similar data points into clusters
    Machine Learning Algorithm: Artwork by author

    When applying machine learning algorithms, pattern evaluation is a critical step for determining the accuracy and usefulness of the results. For instance, a model with an accuracy of only 35% is too rudimentary to be of practical use and needs further refinement before it can help solve the underlying problem.

    When presenting data and patterns to stakeholders or clients, use simple, visually appealing graphs. Not all stakeholders or clients are familiar with technical jargon, and clear graphs communicate the information more effectively and help ensure that everyone understands the insights being presented.

    Anomaly detection is the process of identifying unusual patterns or outliers in data, which is useful for understanding the variation in the data. It falls under the data pre-processing stage of the data science life cycle and can help detect errors such as missing or incorrect data.

    For instance, when data is listed in a tabular format, column names may be wrong or values may end up under the wrong column, such as ages listed under the name column. Anomaly detection can help catch these problems.

    As another example, if a data set has 10 data points, 8 of which lie between 4 and 6 while the other 2 are equal to 20, these 2 extreme values significantly affect the overall average, skewing it towards a higher value. In this situation, anomaly detection can be used to identify the extreme values as outliers and either correct them or remove them from the data set, so that they do not skew the results.
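The effect of the two extreme values can be checked with a few lines of Python. The sketch below uses the interquartile range (IQR) rule to flag outliers, which is one common choice rather than the only one. The concrete values in `data` are made up to match the description (8 points between 4 and 6, 2 points equal to 20).

```python
import statistics

# Made-up data: 8 points between 4 and 6, plus 2 extreme points equal to 20.
data = [4, 5, 5, 6, 4, 5, 6, 5, 20, 20]

# IQR rule: anything more than 1.5 * IQR outside the quartiles is an outlier.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
cleaned = [x for x in data if lower_fence <= x <= upper_fence]

mean_with_outliers = sum(data) / len(data)
mean_without_outliers = sum(cleaned) / len(cleaned)

print(outliers)               # [20, 20]
print(mean_with_outliers)     # 8.0
print(mean_without_outliers)  # 5.0
```

The average drops from 8.0 to 5.0 once the two outliers are removed, which is much closer to where most of the data actually lies.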

    Association rule mining is a technique for discovering interesting relationships or associations among items in large data sets. The goal is to identify frequent patterns, correlations, or co-occurrences, which can provide insights into customer behavior, market trends, and other important factors.

    A well-known example comes from the 1990s, when a retail data analysis company called Catalina Marketing analyzed purchasing patterns at grocery stores. They conducted a case study, sometimes called the beer and diaper syndrome, to find out the correlation between the sales of beer and the sales of diapers. The theory was that when a father came into the store to buy diapers for his children, there was a good chance that he would also buy a can of beer. However, later studies have suggested that the correlation was likely due to the fact that both beer and diapers were frequently purchased on weekends.

    Supermarket: Image from bayut

    Association rule mining can be used to identify relationships between items in a store and to up-sell or cross-sell items. For example, if a customer buys a notebook, a store might recommend pens, highlighters, or other related items to increase the overall sale. This technique is commonly used in retail and e-commerce and is often based on analysis of customer purchasing patterns and behavior.
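Two standard measures used in association rule mining are support (how often an itemset appears across all baskets) and confidence (how often the consequent appears in baskets that contain the antecedent). The sketch below computes both for the rule "notebook → pen". The baskets are made-up data, and the function names are chosen for illustration.

```python
# Made-up shopping baskets, each represented as a set of items.
baskets = [
    {"notebook", "pen"},
    {"notebook", "pen", "highlighter"},
    {"notebook", "highlighter"},
    {"pen"},
    {"milk", "bread"},
]


def count_containing(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket)


def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count_containing(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets with the antecedent, the fraction that also have the consequent."""
    with_antecedent = count_containing(antecedent, baskets)
    if with_antecedent == 0:
        return 0.0
    return count_containing(antecedent | consequent, baskets) / with_antecedent


rule_support = support({"notebook", "pen"}, baskets)
rule_confidence = confidence({"notebook"}, {"pen"}, baskets)

print(rule_support)     # 2 of 5 baskets contain both items
print(rule_confidence)  # 2 of the 3 notebook baskets also contain a pen
```

A store could use rules with high confidence to decide which items to recommend together, such as suggesting pens to a customer who has a notebook in the basket.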

    The goal of machine learning is to enable computers to learn and make predictions or decisions based on data, similar to how humans learn from experience. To achieve this goal, a large dataset is typically required to train the machine learning model. The dataset is split into a training set and a testing set.

    The training set is used to train the model by feeding it input data and the corresponding correct outputs. The model then learns to recognize patterns in the data and adjust its parameters to minimize errors, which is called model fitting.

    Once the model is trained, it is evaluated on the testing set to measure its performance and generalization ability. The testing set is a separate dataset that the model has never seen before, and it is used to simulate how the model would perform on new, unseen data.
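The split, fit, and evaluate steps can be sketched without any libraries. Everything here is a simplified assumption: the data set is synthetic (numbers 0 to 99 labeled 1 when they are 50 or more), the "model" is a single threshold placed halfway between the two class averages, and the function names are chosen for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_threshold(train_rows):
    """Model fitting: place the threshold halfway between the two class means."""
    zeros = [x for x, label in train_rows if label == 0]
    ones = [x for x, label in train_rows if label == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold, rows):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(1 for x, label in rows if int(x >= threshold) == label)
    return correct / len(rows)


# Synthetic data set: input x, output label 1 if x >= 50 else 0.
dataset = [(x, int(x >= 50)) for x in range(100)]

train_rows, test_rows = train_test_split(dataset)
threshold = fit_threshold(train_rows)          # learned from training data only
test_accuracy = accuracy(threshold, test_rows)  # measured on unseen data

print(len(train_rows), len(test_rows))
print(threshold, test_accuracy)
```

The key point is that `fit_threshold` never sees `test_rows`, so the accuracy measured on them approximates how the model would behave on new data.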

    Machine Learning: Animation by author
    1. Supervised learning: The model learns from labeled data that contains two kinds of variables: the input variable (also known as the predictor or independent variable) and the output variable (also known as the response or dependent variable). The input variable is used to make predictions or decisions about the output variable.
    Supervised learning: Animation by author

    Types of supervised learning:

  • Regression: aims to estimate the relationship between one or more independent variables and a continuous dependent variable, such as income or temperature. Linear regression is a specific type of regression that models this relationship as a straight line, so its primary goal is to fit a straight line to the data points.
  • Classification: the process of predicting the class of a new observation. For example, in medical diagnosis, a model might classify a patient as having cancer or not based on factors such as whether they smoke. In classification, the dependent variable is categorical and can be binary or multi-class. In binary classification, the dependent variable has only two possible outcomes, such as “yes” or “no”, or “0” or “1”, as in the cancer diagnosis example. Multi-class classification involves predicting between more than two classes (such as “low”, “medium”, and “high”).
    Classification: Artwork by author
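For the regression case described above, fitting a straight line can be done with the ordinary least squares formulas for slope and intercept. The sketch below assumes a single input variable, and the data (years of experience against income in thousands) is made up so that the points lie exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: years of experience (input) and income in thousands (output).
years_experience = [1, 2, 3, 4, 5]
income = [35, 40, 45, 50, 55]

slope, intercept = fit_line(years_experience, income)
predicted_income = slope * 6 + intercept  # prediction for 6 years of experience

print(slope, intercept)   # 5.0 30.0
print(predicted_income)   # 60.0
```

Real data rarely falls exactly on a line, in which case the same formulas give the line that minimizes the sum of squared errors.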

    2. Unsupervised learning (Clustering): The goal is to identify patterns or relationships in the data without any prior knowledge of the outcome variable. The input data has no labels, and the algorithm is left to find structure or hidden patterns on its own. One way to evaluate the quality of clustering is to measure intra-cluster similarity, which is the degree of similarity between data points within the same cluster based on common features. Inter-cluster dissimilarity, on the other hand, is the degree of difference between data points in different clusters. A good clustering has high intra-cluster similarity and high inter-cluster dissimilarity.
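One simple way to put numbers on these two ideas is to use distance as the opposite of similarity: points in the same cluster should be close together, and points in different clusters should be far apart. The sketch below assumes the clusters have already been found, uses Euclidean distance, and works on made-up 2D points. The function names are chosen for illustration.

```python
from itertools import combinations, product
from math import dist


def mean_intra_distance(cluster):
    """Average distance between all pairs of points within one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def mean_inter_distance(cluster_a, cluster_b):
    """Average distance between points taken from two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# Made-up 2D points that form two visibly separate groups.
cluster_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

intra_a = mean_intra_distance(cluster_a)
intra_b = mean_intra_distance(cluster_b)
inter_ab = mean_inter_distance(cluster_a, cluster_b)

print(intra_a, intra_b)  # small: points within a cluster are close together
print(inter_ab)          # large: the two clusters are far apart
```

Small intra-cluster distances combined with a large inter-cluster distance suggest that the grouping is meaningful.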

    Clustering: Artwork by author
    RStudio and Jupyter Notebook
    Python and R: Image by Pablo Casas in R bloggers

    Python and R are both popular programming languages used in data science. Python is known for its simplicity, versatility, and scalability. It has a large and active community, which makes it easy to find support and resources.

    R, on the other hand, was specifically designed for statistical computing, making it a popular choice for data analysis and visualization. It also has a large and active community and a vast collection of libraries and tools for data manipulation, modeling, and visualization.

    Jupyter Notebook is an interactive environment for writing and running code in Python and other languages, while Anaconda is a distribution of Python and R that includes many of the most popular data science libraries and tools.

    Both Python and R have their strengths and weaknesses, so it ultimately depends on your specific needs and preferences. Some people prefer Python for its general-purpose nature and ease of use, while others prefer R for its statistical capabilities and visualization tools.

    Data has evolved over time from small and structured data to massive amounts of unstructured data. Data science plays a crucial role in extracting valuable insights and knowledge from this vast amount of data. Real-life applications of data science include fraud detection and email filtering.

    The data science life cycle involves data acquisition, data pre-processing, machine learning, pattern evaluation, and knowledge representation. Anomaly detection is an important task in the data pre-processing stage, which helps to identify unusual patterns or outliers in data. Simple and aesthetic graphs are useful in presenting data and patterns to stakeholders or clients, as not everyone may be familiar with technical jargon.

    Additionally, machine learning algorithms play a significant role in data science, enabling the creation of predictive models and automation of certain tasks. There are various tools and programming languages that data scientists use, including Python, R, SQL, and various data science libraries and frameworks.
