Useful Pandas Functions for Data Discovery
A complex or messy dataset can be overwhelming at first glance. The data discovery functions in pandas help you gain a strong foothold when you start peering into your data. Let’s take a look at the ‘Hotel Booking Demand’ dataset I retrieved from Kaggle and explore what we’re working with.
DataFrames, Series, & Importing data
Pandas is a popular data science package that provides expressive, powerful data structures that make data manipulation and analysis easy. A DataFrame stores data in a rectangular grid of rows and columns, and its columns typically hold different types.
The three main components of a DataFrame are:
1. Data
2. Index
3. Columns
A DataFrame can be built from different kinds of input, such as two-dimensional arrays, dictionaries of lists, or dictionaries of other dictionaries.
DataFrames can also be built from Series. A Series is a one-dimensional labeled array capable of holding any data type, with index (axis) labels; a single column of a DataFrame, for example, is a Series (Source: Datacamp).
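As a minimal sketch of the ideas above, the snippet below builds small DataFrames from a dictionary of lists and from a two-dimensional array, then pulls out the three components and a single column. The values and the names from_dict and from_array are invented for illustration and are not taken from the Kaggle dataset.

```python
import numpy as np
import pandas as pd

# Build a DataFrame from a dictionary of lists (keys become column names).
# The rows are made up for illustration.
from_dict = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel"],
    "lead_time": [342, 7, 85],
})

# Build a DataFrame from a two-dimensional array, naming the columns explicitly.
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["a", "b"])

# The three components: data, index, and columns.
data = from_dict.to_numpy()
index = from_dict.index
columns = from_dict.columns

# Selecting one column returns a Series that shares the DataFrame's index.
lead_time = from_dict["lead_time"]
print(type(lead_time).__name__)
print(list(columns))
```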
Let’s import our ‘Hotel Booking Demand’ dataset.
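A sketch of the import step could look like the following. To keep it runnable without the download, it reads a few invented rows from an in-memory string; with the real file you would pass its path to pd.read_csv instead. The file name hotel_bookings.csv is an assumption, and only the three column names used later in this article are included.

```python
import io

import pandas as pd

# Stand-in for the downloaded file; the rows are invented for illustration.
sample_csv = io.StringIO(
    "hotel,lead_time,stays_in_week_nights\n"
    "Resort Hotel,342,0\n"
    "City Hotel,7,2\n"
    "City Hotel,85,22\n"
)

# With the real download, pass the file path instead,
# e.g. pd.read_csv("hotel_bookings.csv") (assumed file name).
df = pd.read_csv(sample_csv)
print(df)
```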
Displaying df shows the first and last five rows (and, for wide tables, a truncated set of columns), which gives you the gist of the dataset you’re working with.
Data types & Describe
Data Types
To perform operations correctly, you need to know each column’s data type so that you can change it where necessary. The info() method lists every column with its data type and non-null count.
df.info()
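Once you know the types, changing them might look like the sketch below. It assumes a hypothetical case where lead_time was read in as text; the rows are invented, and pd.to_numeric and astype are just one way to do the conversion.

```python
import pandas as pd

# Invented rows: lead_time is stored as text here to show a conversion.
df = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel"],
    "lead_time": ["342", "7"],
})

print(df.dtypes)  # lead_time is not numeric yet

# Convert the text column to numbers and the repeated labels to a category.
df["lead_time"] = pd.to_numeric(df["lead_time"])
df["hotel"] = df["hotel"].astype("category")

print(df.dtypes)
```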
Describe
The describe() method gives you a quick statistical summary of your data. It can analyze numeric and object Series as well as DataFrame column sets of mixed data types; for a mixed DataFrame it summarizes only the numeric columns by default.
df.describe()
Keep in mind that some columns, such as strings and date-times, won’t make sense to summarize this way, and you can simply ignore them.
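As a small illustration with invented rows, the sketch below compares the default numeric summary with the summary pandas produces for a text column, and uses include="all" to request both at once.

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel"],
    "lead_time": [342, 7, 85],
})

# Default: only numeric columns (count, mean, std, min, quartiles, max).
numeric_summary = df.describe()

# A text column gets count, unique, top, and freq instead.
hotel_summary = df["hotel"].describe()

# include="all" puts every column in one table; cells that don't apply are NaN.
full_summary = df.describe(include="all")

print(numeric_summary)
print(hotel_summary)
```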
Sort, Select, Conditionals
Sort Values
Sorting allows you to identify the extremes for individual columns.
df.sort_values('lead_time', ascending=False)
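To read off the extremes after sorting, you can take the top and bottom of the sorted result. The rows below are invented, and by_lead_time, longest, and shortest are illustrative names. Note that sort_values returns a new DataFrame and leaves df unchanged.

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel", "Resort Hotel"],
    "lead_time": [342, 7, 85, 0],
})

# Sort from the largest lead time to the smallest.
by_lead_time = df.sort_values("lead_time", ascending=False)

longest = by_lead_time.head(2)   # rows with the largest lead times
shortest = by_lead_time.tail(2)  # rows with the smallest lead times
print(longest)
print(shortest)
```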
Selecting a single column
If you want to select a single column from the data frame for further computations, index the data frame with the column name.
df['hotel']
Conditionals
If you only want the rows that meet a condition, filter the data frame with a boolean expression.
df[df.stays_in_week_nights > 20]
Here we applied the condition stays_in_week_nights > 20, which returns every row where the stay included more than 20 week nights.
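Under the hood, the condition produces a boolean Series that pandas uses as a mask. The sketch below, again with invented rows, shows the mask explicitly and how two conditions could be combined. The second condition on hotel is an added example, not part of the original walkthrough.

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel", "Resort Hotel"],
    "stays_in_week_nights": [0, 2, 22, 25],
})

# The comparison yields one True/False value per row.
mask = df["stays_in_week_nights"] > 20
long_stays = df[mask]

# Combine conditions with & (and) or | (or); wrap each one in parentheses.
long_city_stays = df[(df["stays_in_week_nights"] > 20) & (df["hotel"] == "City Hotel")]

print(long_stays)
print(long_city_stays)
```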