{"id":23627407,"url":"https://github.com/sameermujahid/job-analysis","last_synced_at":"2025-06-14T22:09:22.895Z","repository":{"id":258354938,"uuid":"873777818","full_name":"sameermujahid/job-analysis","owner":"sameermujahid","description":null,"archived":false,"fork":false,"pushed_at":"2024-10-17T04:09:17.000Z","size":7453,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-14T22:09:21.642Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sameermujahid.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-16T17:45:46.000Z","updated_at":"2024-10-17T04:09:20.000Z","dependencies_parsed_at":"2024-10-18T16:01:30.773Z","dependency_job_id":"c65e8690-3646-4f9f-aba4-89f5dfcd1a8b","html_url":"https://github.com/sameermujahid/job-analysis","commit_stats":null,"previous_names":["sameermujahid/job-analysis"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sameermujahid/job-analysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sameermujahid%2Fjob-analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sameermujahid%2Fjob-analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sameermujahid%2Fjob-analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sameermujahid%2Fjob-analysis/manifests","owner_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/owners/sameermujahid","download_url":"https://codeload.github.com/sameermujahid/job-analysis/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sameermujahid%2Fjob-analysis/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259890446,"owners_count":22927373,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-27T23:59:11.954Z","updated_at":"2025-06-14T22:09:22.870Z","avatar_url":"https://github.com/sameermujahid.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# JOBS ANALYSIS\n\n![Jobs Analysis](https://github.com/user-attachments/assets/ea96b3d3-ca29-4d51-92ff-85a35e922f35)\n\nThis repository presents a comprehensive analysis of job data extracted from the Naukari website. The project aims to gather over 700 job details using the Python library Selenium, systematically organizing the information for further analysis. It includes data cleaning and exploratory data analysis (EDA) using Pandas and NumPy, while MySQL and MS Excel were utilized for insights, and Power BI was employed for visualization.\n\n## Workflow\n\n1. **Data Scraping**: Use the [Naukari Web Scraper](https://github.com/sameermujahid/Naukari-web-scraper) to extract job data from the Naukari website.\n2. **Data Validation**: Cleaned and validated the scraped data to ensure accuracy and consistency.\n3. 
**Exploratory Data Analysis (EDA)**: Conducted univariate, bivariate, and multivariate analyses to uncover patterns in the dataset.\n4. **Data Visualization**: Developed visual representations of the data using Power BI.\n\n## Tools \u0026 Technology Used\n![Tools and Technologies](https://user-images.githubusercontent.com/55955478/235897588-756e9ed4-33b3-45f6-83c8-4f68988fe8ba.png)\n\n### Data Collection\nUtilized web scraping with the Python library Selenium to extract and retrieve the following attributes:\n\n| Column             | Meaning                                                                                          |\n|--------------------|--------------------------------------------------------------------------------------------------|\n| Job ID             | Unique identifier for each job listing.                                                         |\n| Job Title          | Title of the job position.                                                                       |\n| Company            | Name of the company offering the job.                                                            |\n| Reviews            | Average reviews or ratings for the company (if available).                                      |\n| Location           | Geographical location of the job.                                                               |\n| Experience         | Required experience level for the job (e.g., years of experience).                             |\n| Salary             | Salary range or figure associated with the job position.                                        |\n| Posted On          | Date when the job was posted.                                                                    |\n| Openings           | Number of job openings available for the position (if available).                               |\n| Applications       | Number of applications received for the job listing.                                            
|\n| Job Description    | Detailed description of job responsibilities and requirements.                                   |\n| Role               | Specific role associated with the job (if specified).                                           |\n| Industry Type      | Type of industry the company operates in (e.g., IT, Healthcare).                               |\n| Department         | Department within the company relevant to the job position (e.g., Marketing, Engineering).     |\n| Employment Type    | Type of employment (e.g., Full-time, Part-time, Contract).                                     |\n| Role Category      | Classification of the job role (e.g., Manager, Intern).                                        |\n| Education          | Educational qualifications required for the job.                                               |\n| Key Skills         | Essential skills required to perform the job effectively.                                       |\n\n## Data Preprocessing\n\nThe script processes job listing data to clean and structure it for analysis. The main tasks include cleaning company text, normalizing reviews, extracting experience and salary information, and standardizing educational qualifications. 
Below is a detailed explanation of each preprocessing step.\n\n### Import Required Libraries\n```python\nimport pandas as pd\nimport numpy as np\nimport re\nfrom datetime import datetime, timedelta\n```\n\n### Clean Company Text\n```python\ndef clean_company_text(company_text):\n    cleaned_text = re.sub(r\"\\n\\nemployees' choice\", '', company_text)\n    return cleaned_text.strip()\n\ndf['Company'] = df['Company'].apply(clean_company_text)\n```\n\n### Round Reviews and Handle Missing Values\n```python\ndf['Reviews'] = df['Reviews'].round(1)\nmode_value = df['Reviews'].mode()[0]\n# Uncomment to fill missing reviews with the most common rating\n# df['Reviews'] = df['Reviews'].fillna(mode_value)\n```\n\n### Extract Experience Data\n```python\n# Keep only rows whose experience is stated in years before parsing the bounds\ndf = df[df['Experience'].str.contains(r'\\d{1,2} - \\d{1,2} years| \\d{1,2} years', na=False)].copy()\ndf['Min_Experience'] = df['Experience'].str.split(' - ').str[0].str.replace(' years', '', regex=False).astype(float).fillna(0).astype(int)\ndf['Max_Experience'] = df['Experience'].str.split(' - ').str[1].str.replace(' years', '', regex=False).fillna('0').astype(int)\n```\n\n### Convert and Clean Salary Data\n```python\ndef convert_salary(salary):\n    # Leave missing (non-string) salaries untouched\n    if not isinstance(salary, str):\n        return salary\n    salary_replacements = {\n        '50,000': '0.5',\n        '60,000': '0.6',\n        '70,000': '0.7'\n    }\n    for old_salary, new_salary in salary_replacements.items():\n        if re.search(rf'\\b{old_salary}\\b', salary):\n            return salary.replace(old_salary, new_salary)\n    return salary\n\ndf['Salary'] = df['Salary'].apply(convert_salary)\n\ndef clean_salary(salary):\n    if isinstance(salary, str):\n        if 'not disclosed' in salary.lower():\n            return np.nan, np.nan, np.nan\n        if 'p.a.' 
in salary:\n            salary = salary.replace('p.a.', '').strip().replace(',', '')\n        if 'lacs' in salary:\n            salary = salary.replace('lacs', '').strip()\n        if '-' in salary:\n            min_salary, max_salary = salary.split('-')\n            min_salary = float(min_salary) * 100000\n            max_salary = float(max_salary) * 100000\n            average_salary = (min_salary + max_salary) / 2\n            return min_salary, max_salary, average_salary\n        try:\n            salary_value = float(salary) * 100000\n            return salary_value, salary_value, salary_value\n        except ValueError:\n            return np.nan, np.nan, np.nan\n    return np.nan, np.nan, np.nan\n\ndf[['Min_Salary', 'Max_Salary', 'Average_Salary']] = df['Salary'].apply(clean_salary).apply(pd.Series)\n```\n\n### Extract Posted Date Information\n```python\ndef extract_days(posted_on):\n    posted_on = posted_on.replace(' days ago', '').replace(' day ago', '').strip()\n    if '30+' in posted_on:\n        return 31\n    if posted_on.isdigit():\n        return int(posted_on)\n    return np.nan\n\ndef calculate_date_from_days(days):\n    if pd.isna(days):\n        return np.nan\n    reference_date = datetime.strptime('10-09-2024', '%d-%m-%Y')\n    return (reference_date - timedelta(days=days)).strftime('%d-%m-%Y')\n\ndf['Days Posted On'] = df['Posted On'].apply(extract_days).fillna(0).astype(int)\ndf['Date Posted'] = df['Days Posted On'].apply(calculate_date_from_days)\n```\n\n### Clean Applications Data\n```python\ndef clean_applications(value):\n    if isinstance(value, str) and 'less than' in value:\n        return int(re.findall(r'\\d+', value)[0]) - 1\n    try:\n        return int(value)\n    except ValueError:\n        return np.nan\n\ndf['Applications'] = df['Applications'].apply(clean_applications)\ndf['Applications'] = df['Applications'].astype('Int64')\n```\n\n### Education Data\n```python\ndef parse_education(row):\n    ug, pg, doctorate = None, None, 
None\n    if isinstance(row, str):\n        for entry in row.split('\\n'):\n            entry = entry.lower().strip()\n            if entry.startswith('ug:'):\n                ug = entry.replace('ug:', '').strip()\n            elif entry.startswith('pg:'):\n                pg = entry.replace('pg:', '').strip()\n            elif entry.startswith('doctorate:'):\n                doctorate = entry.replace('doctorate:', '').strip()\n    return pd.Series([ug, pg, doctorate])\n\ndf[['UG', 'PG', 'Doctorate']] = df['Education'].apply(parse_education)\n\ndef standardize_ug(qualification):\n    if pd.isna(qualification):\n        return 'Not Specified'\n    qualification = qualification.lower()\n    if 'b.tech' in qualification or 'b.e.' in qualification:\n        return 'B.Tech/B.E.'\n    # Add other qualifications as needed\n    return 'Not Specified'\n\ndef standardize_pg(qualification):\n    if pd.isna(qualification):\n        return 'Not Specified'\n    qualification = qualification.lower()\n    if 'm.tech' in qualification:\n        return 'M.Tech'\n    # Add other qualifications as needed\n    return 'Not Specified'\n\ndef standardize_doctorate(qualification):\n    if pd.isna(qualification):\n        return 'Not Specified'\n    qualification = qualification.lower()\n    if 'ph.d' in qualification or 'doctorate' in qualification:\n        return 'Ph.D/Doctorate'\n    return 'Not Specified'\n\ndf['UG'] = df['UG'].apply(standardize_ug)\ndf['PG'] = df['PG'].apply(standardize_pg)\ndf['Doctorate'] = df['Doctorate'].apply(standardize_doctorate)\n```\n\n### Clean Location Data\n```python\ndf['Location'] = df['Location'].str.split(', ')\ndf = df.explode('Location')\ndf['Location'] = df['Location'].str.strip().str.lower()\n\ndef clean_location(location):\n    # Drop parenthesised qualifiers from the location name\n    location = re.sub(r'\\(.*?\\)', '', location).strip()\n    return location\n\ndf['Location'] = df['Location'].apply(clean_location)\n\n# Aggregate unique locations into a list per Job ID\naggregated_locations = 
df.groupby('Job ID')['Location'].apply(lambda x: list(sorted(set(x)))).reset_index()\naggregated_locations['Location'] = aggregated_locations['Location'].apply(tuple)\ndf = df.drop(columns=['Location']).merge(aggregated_locations, on='Job ID', how='left')\n```\n\n### Clean Key Skills Data\n```python\ndf['Key Skills'] = df['Key Skills'].str.split(', ')\ndf = df.explode('Key Skills')\n\ndef clean_skill(skill):\n    return skill.strip().lower()\n\ndf['Key Skills'] = df['Key Skills'].apply(clean_skill)\n\naggregated_skills = df.groupby('Job ID')['Key Skills'].apply(lambda x: list(sorted(set(x)))).reset_index()\ndf = df.drop(columns=['Key Skills']).merge(aggregated_skills, on='Job ID', how='left')\n```\n\n# Analysis\n\n## Top Categories Analysis\n\n### Top Categories in Roles\n![Top Categories in Roles](https://github.com/user-attachments/assets/6d1c7e9b-5754-479e-aea0-497fcd01f3c6)\n\n### Top Categories in Companies\n![Top Categories in Companies](https://github.com/user-attachments/assets/33fa5533-9b80-4c6d-91da-6fbb81cd9ca8)\n\n### Top Categories in Locations\n![Top Categories in Locations](https://github.com/user-attachments/assets/6c2921b9-cca6-46e0-b788-01a06643b3af)\n\n### Top Categories in Skills\n![Top Categories in Skills](https://github.com/user-attachments/assets/277ae6c6-61dc-40c1-9542-2eb4b2814539)\n\n### Top Categories in Role Categories\n![Top Categories in Role Categories](https://github.com/user-attachments/assets/587564d3-62f1-4652-9eb4-136c8715c853)\n\n\n## Salary and Employment Analysis\n![Max Salary by Industry Type](https://github.com/user-attachments/assets/65c02754-5fff-4533-b4f4-aa8c5d37bd03)\n\n![Average Openings for Employment Type](https://github.com/user-attachments/assets/47a516fe-6182-41f5-9d15-a7d003890111)\n\n## Skills Requirement Analysis\n![Skills Needed for Top 10 Roles](https://github.com/user-attachments/assets/463bbc3e-2af5-460a-8d42-6636f36b329e)\n\n## Power BI\n\nAn interactive Power BI dashboard has been developed to 
consolidate data from multiple sources. It showcases visually engaging charts, graphs, and tables, enabling users to explore key metrics and extract valuable insights. The dashboard enhances data-driven decision-making and facilitates effective communication with stakeholders.\n\n![image](https://github.com/user-attachments/assets/df6d0e2b-b4e6-4d33-8cc7-9432c53cce8b)\n\n\n## Job Market Analysis Conclusion\n\n### Key Insights\n#### Univariate Analysis:\n- **Job Roles**: Full Stack Developer and DevOps Engineer are the most in-demand roles (35.51%) among 50 unique roles.\n- **Companies**: Accenture leads in postings among 461 companies, with major tech companies as the top employers.\n- **Locations**: Bengaluru and Hyderabad dominate (40.03%), reflecting the concentration of the IT industry there.\n- **Salaries**: A large share (79.53%) of salaries are listed as \"Not Disclosed,\" which complicates salary analysis, though high-paying outliers exist.\n- **Employment Type**: Full-time permanent roles (92.90%) are predominant, with few flexible or freelance roles.\n- **Skills \u0026 Education**: Python is the top skill, indicating high technical demand; \"UG: Any Graduate\" is common, but specific degrees are favored for technical roles.\n\n#### Bivariate Analysis:\n- **Salary and Industry Type**: IT Services and Consulting offer the highest salaries; part-time roles have competitive pay, but permanent roles generally pay more.\n- **Experience and Salary**: There is a weak correlation between experience and salary, suggesting skills or industry type may be more influential.\n- **Role and Key Skills**: Specialized roles require specific skills; for instance, Python and Machine Learning for data roles, and cloud technologies for DevOps.\n- **Location and Role**: Larger cities offer more tech roles, while smaller cities present limited opportunities, indicating regional job disparities.\n- **Educational Background**: Engineering degrees (B.Tech/B.E.) 
are preferred for technical roles, while generalist roles accept \"Any Graduate.\"\n- **Experience and Role**: DevOps Engineer roles often require higher experience (up to 14 years), while Full Stack Developer roles frequently hire less experienced candidates.\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsameermujahid%2Fjob-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsameermujahid%2Fjob-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsameermujahid%2Fjob-analysis/lists"}