Useful Pandas Functions for Data Discovery

Rashid Baset
3 min read · Feb 26, 2020


A complex or messy dataset can be overwhelming at first glance. The data discovery functions in pandas help you gain a strong foothold when you start peering into your data. Let’s take a look at the ‘Hotel Booking Demand’ dataset, which I retrieved from Kaggle, and explore what we’re working with.

DataFrames, Series, & Importing data

Pandas is a popular data science package whose expressive and powerful data structures make manipulating and analyzing data easy. A DataFrame stores data in a rectangular, two-dimensional grid of rows and columns, and its columns typically have different types.

The three main components of a DataFrame are:

1. Data

2. Index

3. Columns

A DataFrame can be built from different kinds of data, such as two-dimensional arrays, dictionaries of lists, or dictionaries of other dictionaries.

DataFrames can also be built from Series. A Series is a one-dimensional labeled array, capable of holding any data type, with an index of axis labels. For example, a single column of a DataFrame is a Series (Source: DataCamp).
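As a minimal sketch of the points above, the snippet below builds a DataFrame from a dictionary of lists and from a two-dimensional NumPy array, prints the three components, and pulls out one column as a Series. The variable names and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# All values below are made up purely for illustration.

# From a dictionary of lists: the keys become column names.
from_dict = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel"],
    "lead_time": [342, 7, 85],
})

# From a two-dimensional NumPy array, with explicit column names.
from_array = pd.DataFrame(
    np.array([[1, 2], [3, 4], [5, 6]]),
    columns=["adults", "children"],
)

# The three components of a DataFrame.
print(from_dict.values)   # data
print(from_dict.index)    # index (row labels)
print(from_dict.columns)  # columns (column labels)

# Selecting a single column gives back a Series.
lead_time = from_dict["lead_time"]
print(type(lead_time))
```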

Let’s import our ‘Hotel Booking Demand’ dataset.
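Loading a CSV into a DataFrame is typically done with pd.read_csv. The sketch below assumes the Kaggle download is saved as hotel_bookings.csv (the file name is an assumption), and uses a few made-up rows as a stand-in so that it runs without the file. The is_canceled and reservation_status_date columns are assumed to match the dataset and are included only for illustration.

```python
import io

import pandas as pd

# With the Kaggle CSV downloaded locally, loading it would look like this
# (the file name is an assumption):
# df = pd.read_csv("hotel_bookings.csv")

# Stand-in so the sketch runs without the file: a few made-up rows
# using a handful of column names assumed to match the dataset.
sample_csv = io.StringIO(
    "hotel,is_canceled,lead_time,stays_in_week_nights,reservation_status_date\n"
    "Resort Hotel,0,342,0,2015-07-01\n"
    "Resort Hotel,0,7,1,2015-07-02\n"
    "City Hotel,1,85,3,2015-05-06\n"
    "City Hotel,0,14,22,2016-02-10\n"
)
df = pd.read_csv(sample_csv)

# In a notebook, a bare `df` on the last line of a cell displays the frame.
print(df)
```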

After evaluating df in a notebook cell, you’ll see the first and last 5 rows along with the columns, which gives you a gist of the dataset you’re working with.

Scrolling horizontally, you’ll find additional columns.
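If you would rather ask for the overview explicitly, a few standard calls cover the same ground. This is a small sketch on a made-up stand-in frame, not the real data.

```python
import pandas as pd

# Made-up stand-in for the hotel bookings data.
df = pd.DataFrame({
    "hotel": ["Resort Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
    "lead_time": [342, 7, 85, 14],
    "stays_in_week_nights": [0, 1, 3, 22],
})

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # every column name
print(df.head(2))           # first 2 rows (the default is 5)
print(df.tail(2))           # last 2 rows (the default is 5)

# Ask pandas to display every column instead of truncating wide frames.
pd.set_option("display.max_columns", None)
```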

Data types & Describe

Data Types

To perform operations correctly, you need to understand the data type of each column and know which ones need to be changed.

df.info()
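Once you have spotted columns with the wrong type, converting them could look like the sketch below. It uses a made-up stand-in frame; the is_canceled and reservation_status_date column names are assumptions about the dataset, and which conversions make sense depends on your own data.

```python
import pandas as pd

# Made-up stand-in; the column names are assumed to match the dataset.
df = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel"],
    "is_canceled": [0, 1, 0],
    "reservation_status_date": ["2015-07-01", "2015-05-06", "2016-02-10"],
})

print(df.dtypes)  # types before conversion

# Dates read from a CSV arrive as strings; parse them into datetimes.
df["reservation_status_date"] = pd.to_datetime(df["reservation_status_date"])

# A 0/1 flag can be stored as a boolean.
df["is_canceled"] = df["is_canceled"].astype(bool)

# A column with few distinct values can be stored as a category.
df["hotel"] = df["hotel"].astype("category")

print(df.dtypes)  # types after conversion
```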

Describe

The describe() method gives you a quick statistical summary of your data. It can analyze numeric and object Series, as well as DataFrames whose columns have mixed data types.

df.describe() 

Keep in mind that some columns, such as strings and date-times, won’t make sense to summarize this way, and you can simply ignore them.
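On a DataFrame with mixed types, describe() summarizes only the numeric columns by default, and the include parameter widens that. A short sketch on made-up stand-in data:

```python
import pandas as pd

# Made-up stand-in for the hotel bookings data.
df = pd.DataFrame({
    "hotel": ["Resort Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
    "lead_time": [342, 7, 85, 14],
    "stays_in_week_nights": [0, 1, 3, 22],
})

# Default: numeric columns only (count, mean, std, min, quartiles, max).
numeric_summary = df.describe()
print(numeric_summary)

# include="all" adds the non-numeric columns (count, unique, top, freq).
all_summary = df.describe(include="all")
print(all_summary)

# describe() also works on a single column (a Series).
lead_time_summary = df["lead_time"].describe()
print(lead_time_summary)
```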

Sort, Select, Conditionals

Sort Values

Sorting allows you to identify the extremes for individual columns.

df.sort_values('lead_time', ascending=False)
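A few variations on sorting, sketched on a made-up stand-in frame. Note that sort_values returns a new DataFrame and leaves the original order untouched.

```python
import pandas as pd

# Made-up stand-in for the hotel bookings data.
df = pd.DataFrame({
    "hotel": ["Resort Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
    "lead_time": [342, 7, 85, 14],
})

# Longest lead times first; df itself is not modified.
by_lead_time = df.sort_values("lead_time", ascending=False)
print(by_lead_time)

# Sort by several columns, each with its own direction.
by_hotel_then_lead = df.sort_values(
    ["hotel", "lead_time"], ascending=[True, False]
)
print(by_hotel_then_lead)

# If you only need the extremes, nlargest / nsmallest skip the full sort.
top_two = df.nlargest(2, "lead_time")
print(top_two)
```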

Selecting a single column

If you want to select a column for further computations, index the data frame with the column name.

df['hotel']
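To illustrate what comes back from a selection, here is a small sketch on made-up stand-in data. Single brackets return a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Made-up stand-in for the hotel bookings data.
df = pd.DataFrame({
    "hotel": ["Resort Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
    "lead_time": [342, 7, 85, 14],
    "stays_in_week_nights": [0, 1, 3, 22],
})

# Single brackets with one name: a Series.
hotel = df["hotel"]
print(type(hotel))

# A list of names: a DataFrame with just those columns.
subset = df[["hotel", "lead_time"]]
print(type(subset))

# Typical follow-up computations on a selected column.
counts = df["hotel"].value_counts()  # rows per hotel type
mean_lead = df["lead_time"].mean()   # average lead time
print(counts)
print(mean_lead)
```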

Conditionals

If you’re interested in only the rows that meet a certain condition, you can filter the data frame with a boolean condition.

df[df.stays_in_week_nights > 20]

Here we applied the condition stays_in_week_nights > 20 to our data frame, which returns all the rows where the stay included more than 20 week nights.
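Under the hood, the condition produces a boolean Series (a mask) that pandas uses to pick rows. The sketch below, on made-up stand-in data, shows the mask itself, how to combine conditions, and how to filter rows and columns in one step with .loc.

```python
import pandas as pd

# Made-up stand-in for the hotel bookings data.
df = pd.DataFrame({
    "hotel": ["Resort Hotel", "Resort Hotel", "City Hotel",
              "City Hotel", "Resort Hotel"],
    "stays_in_week_nights": [0, 1, 3, 22, 25],
})

# The condition alone yields a True/False value per row.
mask = df["stays_in_week_nights"] > 20
print(mask)

# Indexing with the mask keeps only the rows marked True.
long_stays = df[mask]
print(long_stays)

# Combine conditions with & (and) or | (or); each needs its own parentheses.
city_long_stays = df[
    (df["hotel"] == "City Hotel") & (df["stays_in_week_nights"] > 20)
]
print(city_long_stays)

# .loc filters rows and selects columns at the same time.
nights_only = df.loc[mask, ["stays_in_week_nights"]]
print(nights_only)
```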
