Creating a Dataset for LinkedIn User Analysis

30 May 2024


(1) Ángel Merino, Department of Telematic Engineering, Universidad Carlos III de Madrid;

(2) José González-Cabañas, UC3M-Santander Big Data Institute;

(3) Ángel Cuevas, Department of Telematic Engineering, Universidad Carlos III de Madrid & UC3M-Santander Big Data Institute;

(4) Rubén Cuevas, Department of Telematic Engineering, Universidad Carlos III de Madrid & UC3M-Santander Big Data Institute.

Abstract and Introduction

LinkedIn Advertising Platform Background



User’s Uniqueness on LinkedIn

Nanotargeting proof of concept


Related work

Ethics and legal considerations

Conclusions, Acknowledgments, and References


3 Dataset

This work assesses whether combining location and professional skills makes a user unique on LinkedIn. To that end, we created a dataset containing thousands of audience sizes for different combinations of locations and skills taken from real LinkedIn profiles.

We implemented two pieces of software to obtain that dataset. The first retrieved the location and professional skills of hundreds of LinkedIn users. The second used the information from those profiles to retrieve, from the LinkedIn Campaign Manager, the audience size of thousands of audiences combining location and professional skills.

3.1 Users’ Skills and Location

Our goal was to create a user base of random profiles. To that end, we used the Campaign Manager’s job type classification. This field, known as job functions, offers a general list of 26 professional fields.

We launched a search query on DuckDuckGo for each of the 26 job functions, using a filter to obtain only LinkedIn profiles, and retrieved the first ten results for each category. This led to 260 seed users, from whom we collected the skills and locations reported in their profiles. Starting from those users, we conducted a Breadth-First Search (BFS) [12] and gathered information from the users that LinkedIn suggested when we accessed the seed profiles. Overall, we collected skills and location information from 1699 LinkedIn users.

We did not collect unique IDs from LinkedIn user profiles. Instead, we used a pseudonymization process that assigned a random identifier to each user in our database, so that the collected data is not directly linked to the LinkedIn profiles from which we retrieved it.
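The crawl described above can be illustrated with a minimal sketch. It is not the authors' software: the callables `get_suggested` and `get_profile`, the `max_users` cap, and the toy data are all hypothetical stand-ins, and the 16-character random pseudonym is just one possible choice of random identifier.

```python
import secrets
from collections import deque


def crawl_bfs(seeds, get_suggested, get_profile, max_users):
    """Breadth-first crawl that stores each profile under a random pseudonym.

    get_suggested and get_profile are hypothetical callables supplied by the
    caller; they stand in for whatever fetches suggestions and profile data.
    """
    queue = deque(dict.fromkeys(seeds))  # seed handles, duplicates removed
    seen = set(queue)                    # kept in memory only, never stored
    records = {}
    while queue and len(records) < max_users:
        handle = queue.popleft()
        # Random identifier that is not derived from the profile handle.
        records[secrets.token_hex(8)] = get_profile(handle)
        for suggestion in get_suggested(handle):
            if suggestion not in seen:
                seen.add(suggestion)
                queue.append(suggestion)
    return records


# Toy data standing in for real profiles (entirely made up).
suggested = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}
profiles = {
    "a": {"skills": ["Python"], "location": "Spain"},
    "b": {"skills": ["SQL"], "location": "France"},
    "c": {"skills": ["Go"], "location": "Italy"},
    "d": {"skills": ["Java"], "location": "Portugal"},
}
records = crawl_bfs(
    ["a"], lambda h: suggested.get(h, []), profiles.__getitem__, max_users=10
)
```

The stored records contain only skills and location, keyed by the pseudonym; the original handle survives only in the in-memory `seen` set used to avoid revisiting a profile.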

Our dataset included 4941 unique skills that appeared 39095 times across the 1699 user profiles. This means that each unique skill was reported by about 8 users in our dataset on average (39095 / 4941 ≈ 7.9). All the users in our data sample reported at least one skill, and 1690 (99.47%) also provided a location (country, region, or city). Appendix A shows the distribution of the users across countries, derived from the location reported by each user.
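As a quick check of the figures above, the following sketch recomputes the two ratios and shows how such counts could be derived from profile records. The record layout (a `skills` list per profile) and the `skill_stats` helper are assumptions made for illustration only.

```python
from collections import Counter


def skill_stats(profiles):
    """Return (unique skills, total skill occurrences, mean users per skill)."""
    # Count each skill at most once per profile.
    counts = Counter(skill for p in profiles for skill in set(p["skills"]))
    total = sum(counts.values())
    return len(counts), total, total / len(counts)


# Ratios implied by the numbers reported in the text.
mean_users_per_skill = 39095 / 4941        # about 7.9
share_with_location = 100 * 1690 / 1699    # about 99.47 (percent)

# Tiny made-up sample to show the helper in use.
sample = [
    {"skills": ["Python", "SQL"]},
    {"skills": ["Python"]},
    {"skills": ["SQL", "Go", "Python"]},
]
unique, total, mean = skill_stats(sample)
```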

Figure 1 shows the CDF of the number of skills reported per user. The distribution’s median value is 20 skills; that is, half of the individuals in our data sample listed at least 20 skills in their profile. We note that the maximum number of skills a user can report on LinkedIn is 50.
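For readers unfamiliar with the plot type, an empirical CDF and its median can be computed as in the sketch below. The per-user skill counts are made up for illustration and are not taken from the paper's dataset; `empirical_cdf` is a hypothetical helper name.

```python
import statistics
from collections import Counter


def empirical_cdf(values):
    """Return sorted (value, fraction of samples <= value) pairs."""
    counts = Counter(values)
    n = len(values)
    cdf, cumulative = [], 0
    for value in sorted(counts):
        cumulative += counts[value]
        cdf.append((value, cumulative / n))
    return cdf


# Made-up number of skills per user, capped at LinkedIn's limit of 50.
skills_per_user = [5, 20, 20, 35, 50]
cdf = empirical_cdf(skills_per_user)
median_skills = statistics.median(skills_per_user)
```

Reading the median off the CDF amounts to finding the smallest value whose cumulative fraction reaches 0.5.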

3.2 Audience Size

We implemented ad hoc software to systematically obtain the size of LinkedIn audiences based on the skills and location of the 1699 individuals in our dataset. This data feeds the methodology outlined in the next section, which computes how many skills are necessary to make a user unique on LinkedIn.
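One possible way to enumerate audience definitions that combine a location with a set of skills is sketched below. This is an assumption for illustration: the section does not specify how skills are selected or combined, the dictionary layout is hypothetical, and the sketch does not query the Campaign Manager at all.

```python
from itertools import combinations


def audience_definitions(location, skills, max_skills):
    """Yield a location plus every combination of 1..max_skills skills."""
    unique_skills = sorted(set(skills))
    for n in range(1, max_skills + 1):
        for combo in combinations(unique_skills, n):
            yield {"location": location, "skills": combo}


# Made-up profile data.
definitions = list(audience_definitions("Spain", ["Python", "SQL", "Go"], 2))
```

Each yielded definition would then be passed to whatever component retrieves the audience size, which is deliberately left out here.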

Figure 2 shows the CDF of the worldwide audience size associated with the 4941 professional skills in our final dataset. On average, a skill in our dataset is featured in 780k LinkedIn profiles.

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.