https://github.com/najmaelboutaheri/patents_analysis

This repository contains code and resources for analyzing patents using Apache Spark, Python, and AWS services. The objective of this project is to extract insights and trends from patent data to inform business decisions and intellectual property strategies.

# Patent Data Analysis and Visualization

## Overview

This repository contains code, resources, and workflows for analyzing patent data using Python, Apache Spark, AWS, and Microsoft Azure services. The objective of this project is to extract actionable insights and trends from patent datasets to aid intellectual property strategies and business decisions.

---

## Architecture

The workflow follows a **4-phase architecture**:

![image](https://github.com/user-attachments/assets/3025532d-d3aa-4c58-8892-b12441963da5)

1. **Sourcing**: Data is scraped and ingested from major patent repositories:
   - Google Patents
   - WIPO
   - USPTO
   - FPO
   - Espacenet

2. **Storage**: Patent data is stored in cloud object storage:
   - Amazon S3
   - Microsoft Azure Blob Storage

3. **ETL (Extract, Transform, Load)**:
   - **Tools Used**: Apache Spark (Azure Databricks) and Delta Lake
   - Data pipelines are built using Azure Data Factory to clean and transform data.
   - The **Medallion Architecture** organizes the data into three layers:
     - Bronze: Raw ingestion
     - Silver: Filtered and clean data
     - Gold: Aggregated and analytics-ready data

4. **Visualization**: Insights are visualized using:
   - Power BI
   - Matplotlib & Seaborn (Python libraries)
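
The Bronze, Silver, and Gold layers can be illustrated with a small plain-Python sketch. It deliberately uses lists of dictionaries instead of Spark DataFrames and Delta tables so that it runs without a cluster; the record fields (`patent_id`, `title`, `year`), the sample rows, and the function names are all hypothetical and are not taken from `ETL_PROCESS.ipynb`.

```python
from collections import Counter

# Hypothetical raw records, standing in for scraped patent rows.
RAW_RECORDS = [
    {"patent_id": "P1", "title": " Viral vector design ", "year": "2019"},
    {"patent_id": "P2", "title": "Capsid engineering", "year": "2020"},
    {"patent_id": "P2", "title": "Capsid engineering", "year": "2020"},   # duplicate
    {"patent_id": "P3", "title": "", "year": "2020"},                     # missing title
    {"patent_id": "P4", "title": "Phage display library", "year": "n/a"}, # bad year
]


def to_bronze(records):
    """Bronze: keep the raw rows as ingested (shallow copies only)."""
    return [dict(r) for r in records]


def to_silver(bronze):
    """Silver: drop duplicates and invalid rows, trim text, cast the year."""
    seen = set()
    silver = []
    for row in bronze:
        title = row["title"].strip()
        year = row["year"]
        if not title or not year.isdigit() or row["patent_id"] in seen:
            continue
        seen.add(row["patent_id"])
        silver.append(
            {"patent_id": row["patent_id"], "title": title, "year": int(year)}
        )
    return silver


def to_gold(silver):
    """Gold: aggregate to an analytics-ready table (patents per year)."""
    counts = Counter(row["year"] for row in silver)
    return dict(sorted(counts.items()))


bronze = to_bronze(RAW_RECORDS)
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

In a Spark and Delta Lake setting, each function would presumably read one table and write the next, but the separation of concerns is the same: Bronze preserves what was ingested, Silver enforces quality rules, and Gold holds the aggregates used for reporting.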

---

## Key Features

- **Web Scraping**: Patent data is extracted using BeautifulSoup and Python scripts.
- **Preprocessing**:
  - Data cleaning
  - Parsing XML, JSON, CSV, and PDF formats
- **Feature Engineering**:
  - Keyword extraction
  - Citation network analysis
- **ETL Pipelines**: Scalable data processing with Apache Spark.
- **Visualizations**: Interactive charts for patent trends, keyword frequency, and metrics.
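
As a rough illustration of the two feature-engineering steps listed above, the following sketch extracts keywords by simple term frequency and measures citation counts as in-degree in a citation graph. These are stand-in techniques chosen for brevity, not necessarily the ones used in the notebooks; the stop-word list, the sample abstract, and the citation pairs are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def extract_keywords(text, top_n=3):
    """Return the top_n most frequent non-stop-word tokens in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


def citation_counts(citations):
    """Count how often each patent is cited (in-degree in the citation graph).

    citations is a list of (citing_id, cited_id) pairs.
    """
    return Counter(cited for _, cited in citations)


# Hypothetical abstract text and citation edges (citing patent, cited patent).
abstract = "A viral vector for delivery of a vector payload to the vector target"
edges = [("P2", "P1"), ("P3", "P1"), ("P3", "P2")]

keywords = extract_keywords(abstract, top_n=1)
cited = citation_counts(edges)
print(keywords, cited.most_common(1))
```

Term frequency is the simplest possible keyword signal; weighting schemes such as TF-IDF would down-weight words that are common across all patents.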

Figures (Power BI Desktop):

- Count of patents by year
- Count of inventors by country
- The development of countries' interest in patenting

---

## Project Structure
```text
├── Analysis of Patents on Virus Engineering.pdf # PDF report on virus engineering patents
├── ETL_PROCESS.ipynb # Notebook for the ETL process
├── Interface_DEMO.rar # Demo interface (compressed file)
├── Patents_Scraping.ipynb # Notebook for web scraping patent data
├── Project_Architecture.png # Architecture diagram for the project
├── Project_Presentation.pdf # Project presentation file
├── Projet_visualizations.pdf # Visualizations and insights in PDF
└── README.md # Project documentation
```
---

## Installation

### Prerequisites
- Python 3.x
- Apache Spark
- AWS credentials for S3
- Microsoft Azure access

### Steps
1. **Clone the repository**:
```bash
git clone https://github.com/najmaelboutaheri/patents_analysis.git
cd patents_analysis
```
## Data Sources
The project leverages patent data from:

- Google Patents
- WIPO
- USPTO
- FPO
- Espacenet
## Usage
- **Data Scraping:** Run `Patents_Scraping.ipynb` to collect and store patent data.
- **ETL Process:** Run `ETL_PROCESS.ipynb` to clean, transform, and prepare the data.
- **Visualization:** Load the processed data into Power BI or Python notebooks to generate insights.
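
For the Python-notebook route, a minimal sketch of the last step might look like the following. It assumes the ETL output has been exported as CSV; the column names (`patent_id`, `year`, `inventor_country`) and the sample rows are hypothetical, and the data is read from an in-memory string so the example is self-contained.

```python
import csv
import io
from collections import Counter

# Hypothetical processed output; the column names are assumptions.
PROCESSED_CSV = """patent_id,year,inventor_country
P1,2019,US
P2,2020,FR
P3,2020,US
"""


def count_by(rows, column):
    """Count rows per distinct value of the given column."""
    return Counter(row[column] for row in rows)


# csv.DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(PROCESSED_CSV)))
by_country = count_by(rows, "inventor_country")
by_year = count_by(rows, "year")
print(by_country, by_year)
```

The resulting counts can then be handed to a plotting library, for example as the categories and heights of a Matplotlib bar chart.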

## Contributions
Contributions are welcome! Follow these steps:

1. Fork this repository.
2. Create a new branch: `git checkout -b feature/new-feature`
3. Commit your changes: `git commit -m "Add new feature"`
4. Push to the branch: `git push origin feature/new-feature`
5. Submit a Pull Request.

## Contact
For questions, feedback, or collaborations, contact:

Najma El boutaheri
Email: [email protected]

## Acknowledgments
Special thanks to all contributors and the open-source libraries used in this project.