In this course, we apply Natural Language Processing (NLP) to cyber threat analysis and OSINT in order to assess and analyze data gathered from open sources and social networks. The course is project-based: after learning a concept, we immediately put it into practice by analyzing a dataset. We will work on election-related topics and use the gathered datasets to detect hate speech in tweets, measure popularity, and plot the most common words and popularity by month. We also learn how to rank documents and find the best ones for a query, and how to cluster documents by similarity.
This course is for everyone who wants to become familiar with Natural Language Processing and use it for OSINT and cyber threat analysis.
- Forensics professionals
- OSINT enthusiasts
- Security analysts
- Penetration testers
Machine learning is no longer an unfamiliar field; it has made its way into every area of technology. NLP is one of the major branches of machine learning and aims to process and extract information from text data. With the rapid growth of social networks, cyber threat analysis using these tools is more critical than ever.
This is a short but valuable course that aims to help students strengthen their OSINT and threat hunting skills and take a first step into machine learning, especially NLP. It offers a new and unique approach to OSINT and web crawling.
What tools will you use?
What skills will you gain?
- Text analysis and training models for hate speech and sentiment analysis.
- Parsing raw data and using NLP pipelines.
- Crawling information from open sources.
Course general information:
DURATION: 7 hours
CPE POINTS: On completion you get a certificate granting you 7 CPE points.
COMPLETE, SELF-PACED, PRERECORDED
- Accessible even after you finish the course
- No preset deadlines
- Materials are video, labs, and text
In this course, we work on the Kali Linux distribution, which can be installed on a virtual machine or run live. Kali Linux is not required, however; the concepts can be implemented on other systems.
- Before beginning this course, make sure you have a good knowledge of Python and the requests library.
- No prior experience with NLP is needed!
YOUR INSTRUCTOR: Saeed Dehqan
Saeed is currently a security researcher and project leader at OWASP and an instructor for Hakin9.org e-learning.
He has extensive experience in security areas such as network security, secure coding, threat hunting, applied deep learning for threat analysis, DevOps, and more. He has 5 years of research experience and has worked with several companies in the software engineering and cybersecurity fields. He is also a mentor in Google Summer of Code 2021. He is passionate about Natural Language Processing and uses it for cybersecurity purposes.
Before the course
Introduction: Wordcloud and histogram for term-frequency.
An introduction to NLP
In this section, we first learn basic NLP concepts and then implement NLP preprocessing pipelines to produce a clean dataset and corpus. The topics may be unfamiliar to some students, but they are simple. The course assumes that participants have no prior experience with machine learning or NLP.
- File formats
- Regular expressions
- Histogram for term-frequency
Writing a pipeline for preprocessing.
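As a rough idea of what such a preprocessing pipeline and term-frequency histogram can look like, here is a minimal sketch using only the Python standard library. The stop-word list, the function names (`preprocess`, `term_frequencies`, `text_histogram`), and the sample tweets are assumptions made for illustration and are not taken from the course material.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real pipelines use larger lists).
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "for", "on"}


def preprocess(text):
    """Lowercase, strip URLs and @mentions, tokenize, and drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove @mentions
    tokens = re.findall(r"[a-z']+", text)      # keep alphabetic tokens only
    return [t for t in tokens if t not in STOP_WORDS]


def term_frequencies(docs):
    """Count how often each cleaned token appears across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(preprocess(doc))
    return counts


def text_histogram(counts, top_n=5):
    """Render the top_n most frequent terms as a plain-text bar chart."""
    lines = []
    for term, n in counts.most_common(top_n):
        lines.append(f"{term:<10} {'#' * n} ({n})")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical sample data, not a real dataset.
    sample = [
        "The election debate is on tonight! https://example.com/live",
        "@news_bot Election turnout is high in the election polls",
        "Polls and the debate dominate the news",
    ]
    print(text_histogram(term_frequencies(sample)))
```

The same counts could be passed to a plotting or word cloud library instead of the plain-text histogram used here.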
In this module, we will learn how to scrape Twitter, Reddit, Google, and PubMed and how to use lxml and XPath.
- Extracting snippets from Google
- Extracting article abstracts from PubMed
- Extracting tweets from Twitter without API
- Extracting trending hashtags from Twitter
- Extracting posts from Reddit based on topicality
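To give a feel for XPath-based extraction, here is a small sketch that pulls titles and snippets out of a search-result-like page. It is only an approximation of the module: it uses the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset and needs well-formed markup, as a stand-in for lxml. The markup, class names, and the `extract_snippets` helper are hypothetical and do not reflect the real structure of any site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (not the markup of any real site).
PAGE = """
<html><body>
  <div class="result"><h3>First title</h3><span class="snippet">First snippet text.</span></div>
  <div class="result"><h3>Second title</h3><span class="snippet">Second snippet text.</span></div>
  <div class="ad"><span class="snippet">Sponsored text.</span></div>
</body></html>
"""


def extract_snippets(markup):
    """Return (title, snippet) pairs for every div whose class is 'result'."""
    root = ET.fromstring(markup)
    results = []
    # './/div[@class="result"]' selects matching divs anywhere below the root.
    for block in root.findall(".//div[@class='result']"):
        title = block.findtext("h3", default="").strip()
        snippet = block.findtext("span[@class='snippet']", default="").strip()
        results.append((title, snippet))
    return results


if __name__ == "__main__":
    for title, snippet in extract_snippets(PAGE):
        print(title, "->", snippet)
```

Real pages are rarely well-formed XML, which is one reason a forgiving HTML parser such as the one in lxml is usually preferred for scraping.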
Extract hashtags, phone numbers, and emails from content with regular expressions.
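One possible starting point for this kind of extraction is sketched below. The three patterns are deliberately simple assumptions: they will miss some valid emails and phone formats and may accept some invalid ones, so treat them as a first draft to refine against your own data. The `extract_entities` helper and the sample text are hypothetical.

```python
import re

# Simplified patterns (assumptions for illustration, not complete validators).
HASHTAG_RE = re.compile(r"#\w+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Optional '+', then digits mixed with spaces, dots, dashes, or parentheses,
# at least 9 digits-or-separators long, starting and ending with a digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_entities(text):
    """Collect hashtags, email addresses, and phone-like strings from text."""
    return {
        "hashtags": HASHTAG_RE.findall(text),
        "emails": EMAIL_RE.findall(text),
        "phones": [p.strip() for p in PHONE_RE.findall(text)],
    }


if __name__ == "__main__":
    sample = (
        "Vote early! #Election2024 #vote "
        "Contact press@example.org or call +1 (555) 010-9999."
    )
    for kind, values in extract_entities(sample).items():
        print(kind, values)
```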
Gaining knowledge from data
In this module, we process the gathered data in order to extract information from it. We will classify, label, merge, and build semantic clusters from the gathered tweets. Then we use sentiment analysis and a hate speech model to gauge the popularity of subjects in the tweets.
- Classification and clustering, and using them in action
- Similarity formulas: cosine, euclidean
- Document clustering
- Using GloVe for clustering
- Merge algorithms to sort results accordingly
- Sentiment analysis algorithms, and using them in action
- Train a hate speech model
- Evaluating election tweets with the hate speech model
- Save the trained model
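The two similarity formulas named in the list above, cosine and Euclidean, can be sketched on simple bag-of-words vectors as follows. This is a minimal illustration under the assumption that documents are represented by raw word counts; the course itself may use other representations such as GloVe embeddings, and the function names here are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a document as a sparse vector of lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two sparse vectors (0.0 = identical)."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))


if __name__ == "__main__":
    d1 = bag_of_words("vote vote now")
    d2 = bag_of_words("vote now")
    print(cosine_similarity(d1, d2), euclidean_distance(d1, d2))
```

Cosine similarity ignores document length and compares direction only, while Euclidean distance grows with differences in raw counts, which is why the two can rank the same pair of documents differently.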
Implement a simple document retriever.
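A simple document retriever can be sketched as TF-IDF weighting plus cosine similarity between the query and each document. The version below is one possible approach using only the standard library, not necessarily the method taught in the course; the `SimpleRetriever` class, its smoothed IDF formula, and the sample documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()


class SimpleRetriever:
    """Rank documents against a query with TF-IDF weights and cosine similarity."""

    def __init__(self, docs):
        self.docs = docs
        tokenized = [tokenize(d) for d in docs]
        n = len(docs)
        # Document frequency: in how many documents each term appears.
        df = Counter(t for toks in tokenized for t in set(toks))
        # Smoothed IDF so terms present in every document keep a positive weight.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
        self.vectors = [self._vectorize(toks) for toks in tokenized]

    def _vectorize(self, tokens):
        """Build a sparse TF-IDF vector; unknown terms are ignored."""
        tf = Counter(tokens)
        return {t: tf[t] * self.idf[t] for t in tf if t in self.idf}

    @staticmethod
    def _cosine(a, b):
        dot = sum(w * b[t] for t, w in a.items() if t in b)
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def search(self, query, top_k=3):
        """Return up to top_k (document, score) pairs with a non-zero score."""
        q = self._vectorize(tokenize(query))
        scored = [(self._cosine(q, v), i) for i, v in enumerate(self.vectors)]
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda p: (-p[0], p[1]))  # best score first, stable on ties
        return [(self.docs[i], round(s, 3)) for s, i in scored[:top_k]]


if __name__ == "__main__":
    # Hypothetical mini-corpus.
    retriever = SimpleRetriever([
        "election results and voter turnout",
        "phishing campaign targets voters",
        "malware analysis report",
    ])
    print(retriever.search("election turnout"))
```

Feeding the retriever tokens from a proper preprocessing step, rather than the whitespace split used here, would let related forms such as "voter" and "voters" match.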
If you have any questions, please contact us at [email protected].