To build a good predictive model, a data scientist has a lot to consider. But one step that often goes overlooked or undermentioned, likely because it's not that glamorous, is cleaning the data. You want to build the best mathematical model for your data, but a step of paramount importance is getting the input data as clean as possible, reducing the noise that can steer you down a path of lower accuracy. A significant percentage of my time is spent cleaning data, especially when that data contains text.
As a data scientist at Bright, I spend my days working on the Bright Score. Many factors go into creating a Bright Score, a lot of which involve text data. One pesky thing I noticed happening—and that could falsely influence Bright Scores—was an attempt to match job seekers’ skills to a keyword or skill found in a job description, even if that keyword had nothing to do with the job’s requirements. How was this happening? Recruiters want to write robust, appealing job descriptions, so they include a lot of details about the responsibilities of the role, but also mention things like company vision and history, benefits and perks of the job, etc. But when our scoring engine tried to quantify the match between a job seeker and a job post, all words were created equal. For example, a job post for a forklift driver may have a “Benefits” section that includes 401k and dental. Bright’s scoring engine would determine that the “forklift driver” job post needed candidates who are certified forklift drivers with experience in retirement planning and dentistry.
The problem was that the word-correlating features of the Bright Score were reading the entire job description as flat text. If we could find a way to split the description into sections, then we could tell the Bright Score to ignore sections like “Benefits” in text-matching and just focus on sections like “Responsibilities” or “Qualifications.” To plan this improvement, I skimmed through a ton of job descriptions on Bright to understand their general appearance. The first thing I noticed was that job descriptions already had a lot of structure: common header words were bold or ended in colons to denote the content of the following section; lists of requirements were bulleted; and company or job descriptions were in long paragraphs. The second thing I noticed was that a lot of our job descriptions had less-than-perfect formatting.
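The structural cues above (bold text, trailing colons, short lines) are enough for a simple heuristic header detector. Here is a minimal sketch in Python; the header vocabulary and the thresholds are my own illustrative choices, not Bright's actual feature set, and the production version was written in PHP.

```python
import re

# Hypothetical header vocabulary for illustration only.
HEADER_WORDS = {"responsibilities", "qualifications", "requirements", "benefits"}

def looks_like_header(line: str) -> bool:
    """Guess whether a single job-description line is a section header,
    using the cues described above: short, fully bold, or colon-terminated."""
    text = re.sub(r"</?(?:b|strong)>", "", line).strip()  # strip bold tags
    if not text or len(text.split()) > 5:                 # headers are short
        return False
    fully_bold = bool(re.fullmatch(r"<(b|strong)>.*</\1>", line.strip()))
    ends_in_colon = text.endswith(":")
    known_word = text.rstrip(":").lower() in HEADER_WORDS
    return fully_bold or ends_in_colon or known_word
```

In practice, any one cue alone misfires (a short bold sentence, say), which is why the real fix involved hand-tagging headers and running statistics over feature combinations, as described below.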
Bright is one of the biggest online repositories of job posts on the internet. We get jobs from partner sites that may not require a specific structure for a job description, which can result in a sacrifice of quality for quantity. While these jobs are all valid and open, bad formatting can be a detractor. Copying and pasting job descriptions can result in inconsistent spacing and broken HTML. I saw entire paragraphs in bold or italics, and lists of bullets changing from dots to dashes or from single indentation to double indentation. So I thought that while developing a program that parses the job description into separate labeled sections, I would put in some methods to clean up invalid HTML, bad encoding characters, and inconsistent bullets and spacing. This would give job descriptions a cleaner look, as well as improve Bright Score calculations.
My first step was cleaning up all the bad encoding characters—actually, my first step was learning PHP, which can be painful for a data scientist, but I wanted my improvements to fit nicely into our code stack. Next, I replaced all of the various incarnations of bullets, bold text, italics, and spacing with my own consistent set. These steps involved a lot of brute force and little finesse with regular expressions. (Do not post a question to Stack Overflow asking if anyone has a fancy regex for replacing weird combinations of HTML tags with sensible ones… just start designing your regex straight away.) Then, I pooled a small batch of cleaned job descriptions where I had hand-tagged lines that should be section headers and ran some statistics to determine the features and feature combinations that could be used to classify a line as a section header.
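To give a flavor of the brute-force normalization step, here is a small Python sketch (the real implementation was in PHP, and handled far more cases). The specific bullet characters and encoding artifacts it targets are illustrative assumptions:

```python
import re

def normalize_description(text: str) -> str:
    """Brute-force cleanup sketch: unify bullet markers, trailing
    whitespace, and runs of blank lines. Illustrative only."""
    # Replace a couple of common smart-punctuation artifacts with ASCII.
    text = text.replace("\u2019", "'").replace("\u2013", "-")
    # Collapse the many incarnations of a bullet (•, ·, -, o, *) into one.
    text = re.sub(r"^\s*[-*\u2022\u00b7o]\s+", "* ", text, flags=re.MULTILINE)
    # Strip trailing spaces, then squash runs of blank lines.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Each substitution is the kind of unglamorous, case-by-case rule the paragraph above describes; the payoff is that every downstream consumer, from the header classifier to the page renderer, sees one consistent bullet style and predictable spacing.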
Now, with a nicely parsed job description, I could use different sections for different parts of the Bright Score to make it more accurate and effective. The tagging of headers and the cleanup of job descriptions had two awesome bonuses for Bright:
- User experience: A poorly formatted “dirty” job description may provide job seekers all the necessary information, but its appearance can make it hard to read or even arouse distrust in some users, which can cost an employer a good applicant and cost Bright sign-ups and engagement.
- SEO: Replacing invalid HTML and removing bad encoding characters in job descriptions helps the page rank of jobs on Bright that show up in search results, which increases our traffic.
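Once a description is split into labeled sections, the scoring-side fix for the “forklift driver with experience in dentistry” problem amounts to filtering before keyword-matching. A minimal Python sketch, assuming a hypothetical parsed representation (a mapping of section header to section text) that is not Bright's actual data model:

```python
# Hypothetical list of sections to exclude from skill/keyword matching.
IGNORED_SECTIONS = {"benefits", "perks", "about the company"}

def matchable_text(sections: dict[str, str]) -> str:
    """Concatenate only the sections worth matching a candidate against."""
    return "\n".join(
        body for header, body in sections.items()
        if header.rstrip(":").lower() not in IGNORED_SECTIONS
    )
```

With this filter in place, a forklift-driver post's “Benefits” section (401k, dental) never reaches the matching engine, so those words can no longer masquerade as job requirements.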
The end result was more effective than I had anticipated, and was lightweight enough to be done on page view, which is great for testing and tweaking. We saw an immediate spike in new users, applicants, and—more importantly—regard for the humble science team here at Bright. With the new job parsing, our team can continue to improve the quality of Bright Scores and the quality of the experience for users.