EXPLORATORY DATA ANALYSIS AND MODEL BUILDING OF CRAIGSLIST USED CAR DATA SET

Mina Omobonike
9 min readMay 20, 2021

--

In this article, I will explain how I did the analysis and model building for the craigslist used car dataset from Kaggle. The tools used in the analysis include but are not limited to Python, Numpy, Pandas, Matplotlib, and Seaborn. We would be focusing on predicting the price of a used car from certain features of the car. The dataset is between the years 1923–2020 and contains the data on used car adverts on the craigslist website. The data set is restricted to adverts from the United States of America.

The biggest perk of buying a secondhand car is that you let the first owner handle the highest deprecation of the car. A brand new car loses most of its value in the first 2 or 3 years of use. Anyone can get a lot more for their money if they have the right information at their disposal!

The dataset can be downloaded from Kaggle.

Goals:

1. Identify the trends in the sales of cars over the years and the car industry in general.

2. Correlation between car price and its specifications.

3. Prediction of a used car’s price given its specifications

Metadata of Our Data

ID — Unique ID is given to every ad and is the primary key to the data set.

Price — Price is given in the US dollar and has not been adjusted for inflation.

Year — The year in which the car was manufactured

Manufacturer — with 43 unique businesses engaged in the manufacture of automobiles.

Model — The exact model of the car. Like sierra classic 2500hd.

Condition — The condition of the car; excellent, good, fair, like new, salvage, new.

Cylinders — The number of cylinders in the car engine ranging from 3 to 12. Also has the ‘other’ category too.

Fuel — There were five types of fuel, ‘diesel’, ‘gas’, ‘electric’, ‘hybrid’, and ‘other’.

Odometer — This is the distance that the car has traveled after it being bought.

Status — The cars also had 6 types of statues; ‘clean’, ‘lien’, ‘rebuilt’, ‘salvage’, ‘parts only’, and ‘missing’.

Transmission — There are 3 types of transmission; ‘automatic’, manual and other.

Vin — Vehicle identification number

Drive — There are 3 types of drive transmissions; ‘4WD, ‘FWD’, and ‘RWD’. (Four-wheel drive, forward wheel drive, and rear-wheel drive.)

● FWD — cars are commonly equipped with this.

● RWD — a lot of sports cars, SUVs, and trucks come standard with this.

● 4WD — trucks frequently boast this.

Type — This feature identifies if a vehicle is an SUV or a mini-van. There 13 unique values in this feature.

State — The state is political territory and is represented in short form in the data set. Like “fl” is used for the state of Florida.

Latitude, Longitude — When both features are combined, they give the location of where the car is being sold at.

● FWD — cars are commonly equipped with this.

● RWD — a lot of sports cars, SUVs, and trucks come standard with this.

● 4WD — trucks frequently boast this.

DataSet:

The Data Set before cleaning has 426880 rows with 26 columns. One thing we should pay attention to is the fact that this dataset has a lot of missing rows. These missing rows are up to half of our dataset, dropping them would give us fewer data to work with and in data science, the more the data, the better the algorithm. So what can we do?

DATA CLEANING STEPS

  1. Removing columns that can’t be filled like the county column which was empty and columns like ‘id’, ’URL,’ region’, ‘region_url’, ‘image_url’, ‘lat’, ‘size’, and ’long’ that we wouldn't be needing for this analysis.
  2. Filling forward missing values in title_status, posting_date, and state columns.
  3. Filling up columns with the highest numbers of missing values:’drive’, ‘type’, and ‘paint_color’ from their values in the description column. The interesting about this data is that some of the missing values in some columns were present in the description column. A particular row might have a missing value for the type column but the type of the car can be described in the description column of that row.
  4. Converting strings within the description column to lower case and creating a new description column.
  5. Creating a new column for drive by searching for occurrences of the 3 drive types within the description column. So if the original drive has missing rows, replace it with the value in ‘new_drive’.
  6. A total of 102109 rows are still missing after filling with values from the description column. That implies there are additional 28458 valid rows added to the ‘drive’ column from the 130567 previously empty rows.
  7. Dropping duplicate values in our dataset.
  8. Creating a new column for type by searching for occurrences of the 13 car types within the description column. So if the original car type has missing rows, replace them with the value in ‘new_type’.
  9. A total of 41992 rows are still missing after filling with values from the description column. That implies there are additional 50866 valid rows added to the ‘type’ column from the 92858 previously empty rows. Although some of the rows might still be dropped or cleaned since they have multiple entries.
  10. Creating a new column for type by searching for occurrences of the 12 paint colors within the description column. So if the original paint colors have missing rows, replace them with the value in ‘new_paint_color’.
  11. A total of 29690 rows are still missing after filling with values from the description column. That implies there are additional 100513 valid rows added to the ‘type’ column from the 130203 previously empty rows. Although some of the rows might still be dropped or cleaned since they have multiple entries.
  12. Filling forward the remaining empty rows in drive, type, and paint_color and the empty condition rows.
  13. Filling empty rows with the string ‘Null’ to make it easy to locate them
  14. Getting rows that have null for most of their columns and dropping them.

15. To get the null values in the year column from the description column, split the description column then assign it to a new data frame, create a second-year column, and assign it to the first column in the new data frame. Strip the string in the second year column, where there are null values in the original year column replace with values in the second year column. Then drop the second year column. The same thing was done for the manufacturer column.

16. Remove all null values in the year column and clean the model column. Drop the posting date column.

17. The cylinder column is split and the number extracted.

18. Where fuel is electric replace null values in the cylinder column with 0. Filling forward the remaining missing values in the cylinder column.

19. Checking and dropping outliers in the price and odometer column. Filling the last of the missing rows with the values with high occurrence of each column.

EXPLORATORY DATA ANALYSIS

Some conclusion we can see are:

  • 3 of the top5 car manufacturers are from the USA, the other two are from Japan.
  • Gas dominates in fuel for cars in the US. Although gas cars are more popular than diesel in the U.S., diesel engines have nearly half the market share in Europe. So if we were to analyze cars in Europe we might be seeing the difference in visualization. Americans haven’t been clamoring for diesel because their fuel is so comparatively inexpensive — and diesel engines cost so much more to manufacture. Diesel engines cost more because they require added equipment such as a turbocharger to make power levels close to a gas engine.
  • 4wd and fwd shared almost the same number of drive types of car, with four-wheel drive dominating.
  • London-based automotive business intelligence firm JATO Dynamics recently published new statistics which reveal that as of April 2020, more than half of new cars sold in the U.S. are equipped with either AWD or 4WD. The findings were published via an infographic, which was posted to the JATO Dynamics Twitter feed. The infographic shows a steady rise in AWD and 4WD new vehicle sales, starting at 39.7 percent in 2016, and topping out at 50.8 percent in 2020. Front-wheel drive vehicles remain in second place, declining from 47.9 percent in 2016, to 40.1 percent in 2020, bottoming out at 38.7 percent in 2019. Rear-wheel drive vehicles remain a distant third, falling from 12.4 percent in 2016, to 9.1 percent in 2020. The proliferation of AWD and 4WD vehicles goes hand in hand with broader automotive industry trends. All-wheel drive and 4WD systems are becoming lighter, more efficient, and less expensive, while enthusiast interest in off-roader-style pickups and utility vehicles, as well as high-performance models requiring enhanced on-road traction, has increased, thus boosting sales of AWD and 4WD vehicles. All this is confirmed by our visualization of the drive column.
  • Almost all used cars are in clean title_status

TIME SERIES FOR FOR-SALE USED CARS EACH YEAR FROM TOP 5 MANUFACTURERS

They had a similar fluctuation of for-sale used cars with Ford was always dominating almost every year. Used cars from manufacturers like Ford, Chevrolet, and Toyota dominated for sale, especially in 2012 to 2018. Means cars from those manufacturers are loved among US people. This comes as no surprise since Ford’s F-Series line of pickups has been America’s Best Selling Truck for 43 years straight. For the past 38 years, Ford F-Series trucks have also been the Best Selling Vehicle in America. Electric vehicles are relatively new so we see that vehicles manufactured by Tesla are higher in average price compared to other manufacturers. Porsche and Aston Martin are both manufacturers that produce luxurious supercars with premium features so it makes sense for them to have high-priced vehicles.

2017, WHAT TYPE OF CARS PEOPLE IN CALIFORNIA WANTED TO SELL

If you wanted to buy SUVs or Pickup, many choices were coming from Ford. If you wanted Toyota, you would see mostly Sedan. If you want a honda, mostly sedans are available and for pickup go for Chevrolet.

MACHINE LEARNING:

Our goal with the machine learning phase was that we wanted to train a model which when given the specifications of a car gave us the estimated price. For this purpose, we first converted all our categorical columns to numerical with a label encoder. For the odometer and price, we normalized the values so that the range was not very large. Using all of these new columns we made a new data frame.

I trained 8 models, linear regression, decision trees, random forest, extra trees, cat booster, LGBM Regressor, XGB regressor, and XGBRF Regressor on this new dataset. All of them had over 95% accuracy. This is the Light Gradient Boost Machine regressor showing the graph of the actual prices and the predicted ones. It had an r2 score of 0.998.

Conclusion:

After a thorough analysis of the data set, I was successful in making reasonable conclusions regarding the sale trends in used cars and the automobile industry generally. Some of my hypotheses regarding used cars and what factors affect their prices were proved right. For the final part, I wanted to predict the price of a car given the state it is being sold in, its type, title, year, drive, odometer, and condition. I was somewhat successful in accurately predicting the price so that if someone wants to buy a used car they have an idea of what it would cost them beforehand and not end up paying more than the car’s worth. If you would like more information on the analysis and model building for this dataset, Here is the link to my notebook. Also, I can be contacted at minaomobonike@gmail.com.

Thank you for reading!

--

--