How I built a skill-based recommendation system that tells me what things I need to learn next

Aka a very convoluted way of solving a problem.

One of the biggest problems with tech, from a developer's point of view, is the sheer number of things you need to learn to keep up with the market, especially when you are looking for jobs. Job descriptions list skills like React.js, Vue, Angular, Svelte, JavaScript, HTML, CSS, Sass, Webpack, Vite... like Pokémon badges, and trying to learn a bit of each haphazardly can lead to a lot of shallow learning.

The question becomes: what do I choose to learn in my spare/learning time? You can find answers in countless Reddit and Quora posts, and even in the State of JS survey. But I wanted data, and I wanted it to be local to Brisbane, Australia (which is where I live).

So I decided to build a webapp that gives out skill recommendations based on my current set of skills.

https://stating.io

How does it work?

Basically, you enter all of your skills into the search bar, and it looks through the most relevant jobs and returns the most common skills that you are missing.

By upskilling on those skills, you can increase the 'match %' on the set of jobs most relevant to you, which keeps you more up to date (and also ranks you higher in job searches).
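Here is a rough sketch of that idea in Python. It is not the site's actual code: the function names, the job dictionaries, and the definition of 'match %' as the share of a job's listed skills you already have are all assumptions made for illustration.

```python
from collections import Counter


def match_percent(user_skills, job_skills):
    """Share of a job's listed skills that the user already has (0-100)."""
    if not job_skills:
        return 0.0
    return 100.0 * len(user_skills & job_skills) / len(job_skills)


def recommend(user_skills, jobs, top_jobs=50, top_skills=5):
    """Rank jobs by match %, then count the missing skills in the best ones."""
    user = set(user_skills)
    # A job is "relevant" here if it shares at least one skill with the user.
    relevant = [job for job in jobs if user & job["skills"]]
    relevant.sort(key=lambda job: match_percent(user, job["skills"]), reverse=True)

    missing = Counter()
    for job in relevant[:top_jobs]:
        missing.update(job["skills"] - user)
    return missing.most_common(top_skills)


# Hypothetical sample data, not real job ads.
jobs = [
    {"title": "Front-End Developer", "skills": {"React", "TypeScript", "Jest"}},
    {"title": "Full-Stack Developer", "skills": {"React", "Node.js", "TypeScript"}},
    {"title": "Data Engineer", "skills": {"Python", "SQL"}},
]

print(recommend({"React", "Node.js"}, jobs))
```

With this sample data, TypeScript is missing from both relevant jobs and Jest from one, so TypeScript comes out as the first recommendation.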

There is also a tab with the set of jobs that the analysis was built on. I built this out because it seemed useful for someone actually doing a job search and because the default search tool on Seek and Indeed didn't search by stack (which makes sense because most jobs are not developer jobs).

How does it really work? On a Technical Level?

To get the jobs, I built a scraper using Python and BeautifulSoup that parses job boards every day. Initially, I wanted to use the Indeed, LinkedIn and Seek APIs. But I quickly found that Indeed doesn't give just anyone access to its job postings unless their site has 10,000 unique visitors per day. Seemed pretty premature. Seek has an API, but it only returns applicant data, which wasn't very helpful. It really isn't in the interests of these job sites to help the job seeker side of things. So after building a scraper for Indeed, I added one for Seek, and I haven't yet written one for LinkedIn.

Getting the job description data was trivial, but getting the skills was more difficult. Most of the time, skills were listed as bullet points (<li> elements), but some job postings would use dashes to separate requirements.

When I first began this project, I extracted this data by hand. I would look at a job posting and, if I saw a new term such as "React.js", I would add it to the list of skills that I kept in the database. The scraper would then look for that term in jobs.
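To illustrate the two layouts, here is a minimal sketch that pulls requirement lines out of a job description. It uses Python's standard-library html.parser rather than BeautifulSoup so that it runs without extra installs. The class name, the function name, and the sample HTML are made up for the example, and it assumes dash-style requirements sit on their own lines in the text.

```python
import re
from html.parser import HTMLParser


class RequirementParser(HTMLParser):
    """Collects the text of each <li>, plus any text outside list items."""

    def __init__(self):
        super().__init__()
        self.items = []       # one string per <li>
        self.loose_text = []  # text outside <li>, scanned later for dashes
        self._in_li = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self._buffer = []
        elif tag in ("br", "p") and not self._in_li:
            # Treat these tags as line breaks in the loose text.
            self.loose_text.append("\n")

    def handle_endtag(self, tag):
        if tag == "li" and self._in_li:
            self._in_li = False
            text = " ".join("".join(self._buffer).split())
            if text:
                self.items.append(text)

    def handle_data(self, data):
        if self._in_li:
            self._buffer.append(data)
        else:
            self.loose_text.append(data)


def extract_requirements(html):
    """Return <li> items first, then lines that start with a dash or bullet."""
    parser = RequirementParser()
    parser.feed(html)
    parser.close()
    dashed = []
    for line in "".join(parser.loose_text).splitlines():
        found = re.match(r"\s*[-–•]\s+(.*\S)", line)
        if found:
            dashed.append(found.group(1))
    return parser.items + dashed


# Hypothetical job description mixing both styles.
html = """<p>About you:</p>
<ul><li>3+ years of React</li><li>Experience with <b>CI/CD</b></li></ul>
<p>Nice to have:
- Kubernetes
- AWS or Azure</p>"""

print(extract_requirements(html))
```

This only finds the requirement lines. Spotting the actual skill names inside a line like "3+ years of React" is a separate step.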

After a few weeks of doing this, I found some major problems.

The first one was synonyms. When the tokens "Algorithm" and "Algorithms" pop up, writing a dumb check for plurals isn't too difficult. But a lot of the time, "React.js" would be referenced as "React". And then there were shorthand and acronyms as well, e.g. "K8S" for "Kubernetes", and "CI/CD" for "Continuous Integration Continuous Deployment". No matter how fuzzy the search algorithm is, there is no way it would link these synonym tokens together.

So I added a link model for this data so that one main skill name is connected to many others. I also changed the recommendation algorithm so that it would take into account the synonyms of the search terms you type in and also synonyms of the skills listed on the job ad.
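A link model like this can be sketched as a plain dictionary from one main skill name to its synonyms. The table contents, the choice of which name counts as the main one, and the helper names below are illustrative assumptions, not the real schema.

```python
# Hypothetical synonym table: main skill name -> other names seen in job ads.
SYNONYMS = {
    "React.js": ["React", "ReactJS"],
    "Kubernetes": ["K8S"],
    "CI/CD": ["Continuous Integration Continuous Deployment"],
    "Algorithms": ["Algorithm"],
}


def build_lookup(synonyms):
    """Map every known name (case-insensitively) to its main skill name."""
    lookup = {}
    for main, aliases in synonyms.items():
        for name in [main, *aliases]:
            lookup[name.casefold()] = main
    return lookup


LOOKUP = build_lookup(SYNONYMS)


def canonical(term):
    """Return the main name for a term, or the cleaned term if it is unknown."""
    cleaned = " ".join(term.split())
    return LOOKUP.get(cleaned.casefold(), cleaned)


def canonical_set(terms):
    """Normalise both search terms and job-ad skills with the same function."""
    return {canonical(term) for term in terms}


print(canonical("k8s"))
print(canonical_set(["React", "react.js", "Rust"]))
```

Running the search terms and the skills on each job ad through the same function means "React" in a search matches "React.js" in an ad, without any fuzzy matching.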

Then there was the problem of adding skills to your search. Say you know "React" and "Node.js", and you click "Add Git" (because everyone uses Git). This pulls in a much wider set of job ads, which can pollute your search results. If you do mostly mobile development and you add Git, you'd probably get a lot of front-end web developer skills because there are waaay more front-end jobs. So I also implemented a basic weighting system on the search terms. This kept the set of jobs fairly uniform across searches.

One way someone can get around this is to categorise job ads by title, e.g. "Front-End Developer", "Back-end developer", "Full-stack developer". However, I found that there are actually heaps of job titles, such as "React Developer", "C# developer", ".NET developer", "Blazor Developer 100k great office location" and "Web Developer". So categorisation of jobs becomes difficult. Companies like Seek can probably do categorisation, but they probably get a bunch of data scientists to do it, and it's unlikely that it's based on real-time data. (Maybe it is; they have a lot of people working for them.)
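The post doesn't say how the weights were computed, so the following is only one possible approach: an IDF-style weight, where a term that appears in nearly every job ad (like Git) counts for less than a rarer one. The function names, the smoothing, and the sample data are all assumptions.

```python
import math


def term_weights(search_terms, jobs):
    """Weight each search term by how rare it is across all job ads."""
    total = len(jobs)
    weights = {}
    for term in search_terms:
        ads_with_term = sum(1 for skills in jobs if term in skills)
        # Smoothed inverse document frequency: never zero, never infinite.
        weights[term] = math.log((total + 1) / (ads_with_term + 1)) + 1.0
    return weights


def score(job_skills, weights):
    """Sum the weights of the search terms that a job ad mentions."""
    return sum(weight for term, weight in weights.items() if term in job_skills)


# Hypothetical ads: one mobile job and three web jobs, all listing Git.
jobs = [
    {"Swift", "Git"},
    {"React", "Git"},
    {"Vue", "Git"},
    {"Angular", "Git"},
]

weights = term_weights({"Swift", "Git"}, jobs)
ranked = sorted(jobs, key=lambda skills: score(skills, weights), reverse=True)
print(weights)
print(ranked[0])
```

In this example the ads that only match on Git still get a score, but they rank below the mobile ad, so adding Git does not drown out the rest of the search.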

Improvements

I've had a lot of fun playing around with this problem for a few months: setting this up as a React/Node.js/MongoDB app on Heroku, introducing Next.js for server-side rendering, setting up a DigitalOcean Ubuntu droplet and deploying on that, and getting CircleCI to do continuous deployment because manual deployments were a pain. As a project, it is pretty much completely automated, with the jobs being scraped automatically. But as a use case, there are a number of improvements I can make.

  • One issue I've had was manually writing joins on NoSQL data. While I could use MongoDB's aggregation to make the calls more efficient, I found that this data should have been modelled as relational data. So in my spare time, I'm converting it to a SQL database. This should also make the search algorithm faster and development less painful.
  • Another major issue was scalability. Not so much with the hosting of the app, because one droplet can handle up to a few hundred requests per second, but with the way job data is extracted. Scraping does not seem sustainable because of limits on the number of requests the sites can handle. Unless I can set it up so that the job board sites do not think I'm running a DDoS attack on them, I probably won't be able to scale it up beyond a few cities. Then again, travel sites also use scraping for airline and hotel data, so there probably is a way to get that data out.
  • Recommending skills like "JavaScript", "HTML", and "CSS" is pretty unintelligent if you already know "React.js", so a more intelligent recommendation system could factor in the prerequisites of those skills. Going deeper into skill subtopics by recommending things like the nullish coalescing operator in JavaScript, scoping, and the this keyword would be great, along with a checklist for whether you really know something. For example, I think I know CSS because I know how to centre a div, but there are some people who can really do magic! Also, I'm not a data scientist or a machine learning expert, but I am sure a more powerful skill recommendation system could be written with NLP or neural networks. I just prefer playing in the mud with manual tools to implementing something with a tool I don't know much about.

  • Equally problematic is Resume-Driven Development, where people choose tools not because they need them but because they are buzzwords. Giving context and a use case with each recommendation may help with this, so that developers with no business using Kubernetes don't prematurely implement horizontal scaling when it's not necessary.

  • Each company implements the same stack of skills differently. If some level of standardisation could be reached through sharing of resources, that would be really helpful to developers who have only experienced one or two companies' way of doing things, especially during onboarding. I am sure such resources exist, but most articles on Medium and most video tutorials do not go very deep. So recommending resources across multiple skills would be very helpful.
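Going back to the first improvement in the list (moving from MongoDB to SQL), here is a minimal sketch of what a relational model could look like, using Python's built-in sqlite3 module and an in-memory database. The table names, column names, and sample rows are assumptions for illustration, not the schema the site uses or will use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs   (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE skills (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE job_skills (
    job_id   INTEGER NOT NULL REFERENCES jobs(id),
    skill_id INTEGER NOT NULL REFERENCES skills(id),
    PRIMARY KEY (job_id, skill_id)
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO jobs VALUES (?, ?)", [
    (1, "Front-End Developer"),
    (2, "Full-Stack Developer"),
])
conn.executemany("INSERT INTO skills VALUES (?, ?)", [
    (1, "React.js"), (2, "TypeScript"), (3, "Jest"), (4, "Node.js"),
])
conn.executemany("INSERT INTO job_skills VALUES (?, ?)", [
    (1, 1), (1, 2), (1, 3),
    (2, 1), (2, 2), (2, 4),
])


def missing_skills(conn, known):
    """Count, per skill the user lacks, how many job ads list it."""
    placeholders = ",".join("?" * len(known))
    query = f"""
        SELECT s.name, COUNT(*) AS job_count
        FROM job_skills AS js
        JOIN skills AS s ON s.id = js.skill_id
        WHERE s.name NOT IN ({placeholders})
        GROUP BY s.name
        ORDER BY job_count DESC, s.name
    """
    return conn.execute(query, list(known)).fetchall()


print(missing_skills(conn, ["React.js", "Node.js"]))
```

The join that had to be written by hand against NoSQL documents becomes a single query here, and the database does the counting.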

Last Thoughts

So you are probably thinking that I'm really interested in job boards because of the emphasis on using jobs for skill recommendation. But about 50 of the 461 companies in my database are recruitment agencies, so I think this space is extremely difficult.

I think the far more interesting and useful question is how to upskill developers in a cost- and effort-efficient way. The real question that I'm facing now when looking at the results of my search is:

Where do you go to learn automated testing (in 20% of the set of jobs) in a way that is comprehensive and in line with many of the companies in Brisbane? That's the hard question.

If you found this post interesting, send me an email at hello@stating.io! I'm always looking for feedback and collaborators.

Cover Image from r/programminghumor