Introduction to Text Mining and Analytics

Text data is ubiquitous and growing rapidly. Sources include the Internet, blogs, news, email, literature, Twitter, and more. This volume poses a challenge: no one can read and digest all of it, so we need tools that help people do so. Text data can also be mined for knowledge that supports better decision making. For example, product managers use data mining techniques on customer feedback and sales reports to improve market growth.

 

The main techniques for harnessing big text data are text retrieval and text mining.

The terms Text Mining and Text Analytics mean roughly the same thing. Mining puts more emphasis on the process, while Analytics puts more emphasis on the result. In both cases, we turn text data into high-quality information or actionable knowledge, so as to minimize human effort and support optimal decision making.

Text retrieval is an essential component of any text mining system and acts as its preprocessor: it narrows a large collection down to the relevant documents before mining begins.
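To make the retrieval-then-mining idea concrete, here is a minimal Python sketch. It is only an illustration under our own assumptions: the feedback texts, the stopword list, and the function names (tokenize, retrieve, term_frequencies) are invented for this example, and real systems use far richer retrieval and mining models.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; these feedback texts are invented for illustration.
DOCUMENTS = [
    "The battery life of this phone is great",
    "Battery drains fast and the screen is dim",
    "Great screen, great camera",
]

# A tiny, hand-picked stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "of", "this", "is", "and", "a"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def retrieve(documents, query):
    """Text retrieval step: keep only documents that contain the query term."""
    return [doc for doc in documents if query.lower() in tokenize(doc)]


def term_frequencies(documents):
    """Text mining step: count non-stopword terms across the documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOPWORDS)
    return counts


if __name__ == "__main__":
    relevant = retrieve(DOCUMENTS, "battery")
    print(term_frequencies(relevant).most_common(3))
```

Retrieval runs first and shrinks the collection; the counting step then only has to process the documents that matter for the question at hand.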

Data can be broadly classified into two types:

  1. Text Data
  2. Non-Text Data (Numerical, Categorical, Relational, Video)

Non-text data is at times very important for extracting knowledge. A data mining software module therefore contains a number of different kinds of mining algorithms, because different types of data need different types of algorithms.

Related Topics:

  1. Pattern Discovery in Data Mining
  2. Text Retrieval and Search Engines
  3. Cluster Analysis
  4. Data Visualization

Introduction to Version Control System & Github

This document shares knowledge about the GitHub platform: its advantages, features, and importance in building and sharing projects or code files online. It also covers why we need online repositories, branches, commits, and pull requests.

Agenda

  • Version Control
  • Tools for Version Control
  • GitHub and Git
  • Git Features
  • Git Operations & Commands

Version control can be thought of as a management system that tracks the changes we make to a project from start to finish. Changes might include adding new files, modifying old files, changing the source code, and so on.

Each time we record changes to the project, the version control system creates a snapshot of the entire project and saves it. These snapshots are known as versions. A snapshot captures the entire state of the project at a particular time: which files the project contained and what changes had been made.

Example

We start building a web page and first create the start page, say “index.html”. Then we add an “about.html” page. Later we change “about.html” by adding some text, changing the page layout, and so on.

The VCS (version control system) detects that something new has been created and that modifications have been made. We can treat each of these modifications as a different version.

Version 1 – index.html webpage

Version 2 – Addition of “about.html” web page

Version 3 – Modification of “about.html” web page

Question. Can we go back to a previous version if we make a mistake?

Yes, that is the whole purpose of a version control system.

Sometimes we make changes and later decide we don’t want them. The VCS keeps the older versions safely stored, so we can roll back to a previous version at any time.
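The three versions and the roll-back can be sketched in a few lines of Python. This is a toy model, not how Git is implemented: the class name ToyVersionStore and its save and restore methods are hypothetical, and it naively stores a full copy of the project for every version.

```python
import copy


class ToyVersionStore:
    """Keeps a full snapshot of the project for every saved version."""

    def __init__(self):
        self.versions = []  # list of (message, snapshot) tuples

    def save(self, files, message):
        """Store a deep copy of the files and return the new version number."""
        self.versions.append((message, copy.deepcopy(files)))
        return len(self.versions)

    def restore(self, version):
        """Return a copy of the files exactly as they were in that version."""
        _, snapshot = self.versions[version - 1]
        return copy.deepcopy(snapshot)


store = ToyVersionStore()
project = {"index.html": "<h1>Home</h1>"}
store.save(project, "Version 1: index.html webpage")

project["about.html"] = "<h1>About</h1>"
store.save(project, "Version 2: addition of about.html")

project["about.html"] = "<h1>About us</h1><p>New layout</p>"
store.save(project, "Version 3: modification of about.html")

# Roll back to version 2 after deciding the last change was a mistake.
project = store.restore(2)
```

After the roll-back, the project again holds the “about.html” page as it was first added, while version 3 remains stored in case we want it later.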

Why Version Control?

Collaboration

The first thing a version control system enables is collaboration.

Say three developers are working on a project. If everyone works in isolation, or even in the same shared folder, conflicts arise whenever several people modify the same file. When we finally try to merge the work together, we end up with a lot of conflicts, and we don’t know who made which changes.

A VCS provides a shared workspace and records who changed what, and when. We can see when someone has changed the project and review everyone’s work properly. The project evolves as a whole from the start, and we save the time otherwise spent resolving conflicts.

Storing Versions

Saving a version of the project after making changes is essential. But how much should we save: the entire project, or only the part that changed?

If we save only the changes, it is hard to view the entire project at a given time. If we save the entire project every time, we store a large amount of redundant data and waste space.

Another problem arises when we name the versions. Even with a well-organized, comprehensible naming scheme, we are likely to lose track as the number of versions grows.

The third problem is knowing what the differences between versions are and what exactly was changed.

With a VCS, we don’t need to worry about any of this.
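As a rough illustration of the third problem, the following Python sketch compares two versions of a file with the standard difflib module and keeps only the added and removed lines. The file contents and the helper name describe_changes are our own inventions; a real VCS computes and presents differences in its own way.

```python
import difflib

# Two hypothetical versions of about.html, one line per list entry.
old_version = ["<h1>About</h1>", "<p>Contact us</p>"]
new_version = ["<h1>About us</h1>", "<p>Contact us</p>", "<p>New layout</p>"]


def describe_changes(old_lines, new_lines):
    """Return the removed ('- ') and added ('+ ') lines between two versions."""
    diff = difflib.ndiff(old_lines, new_lines)
    # ndiff also emits unchanged ('  ') and hint ('? ') lines; drop those.
    return [line for line in diff if line.startswith(("+ ", "- "))]


changes = describe_changes(old_version, new_version)
for line in changes:
    print(line)
```

Storing or displaying only such differences answers the question “what exactly was changed” without having to compare whole copies of the project by hand.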

Backup

A VCS provides us with a backup. All the project files are located on a central server, and every developer also has a copy of the files on their local machine, known as a local copy.

Every time developers start work, they fetch the project files from the central server and store them on their local machines. When they are done, they transfer their changes back to the central server.

If the server crashes, we don’t need to worry: a copy of the entire project is stored on the local machines. Even if one developer’s copy is out of date, someone else is likely to have an up-to-date one.

Helps to Analyze the Project

When the project is finished, we may want to know how it evolved and what its drawbacks were. To analyze the entire span of work, the VCS provides a description of exactly what was changed and when.

Version Control Tools

Four of the most popular version control tools are:

  • Git (a distributed version control system)
  • Subversion (centralized; does not provide local backup functionality)
  • CVS (centralized; does not provide local backup functionality)
  • Mercurial (distributed; very similar to Git)

Question. Is Git open source?

Yes, Git is open-source software.

Git and GitHub

For now, think of a repository as a data space where we store all the project files and related files. In a distributed version control system, we have a central repository and local repositories. Developers first make changes in their local repositories and then push those changes to the central repository. They also periodically pull from the central repository to keep their local repositories up to date, which also serves as a backup.

GitHub hosts the central repository, and Git is the tool that allows us to create a local repository.

Git is a version control tool that supports all these operations, i.e. fetching data from the central server and pushing local changes to it.

GitHub is a code hosting platform for version control and collaboration. It is a company that hosts the central repository on a remote server. In short, it can be thought of as a social network for developers, where they share their code.

In a distributed version control system, we do not need an internet connection all the time. We need it only when we push to or pull from the central server.

 

What is Git?

Git is a distributed Version Control Tool that supports distributed non-linear workflows by providing data assurance for developing quality software.

Different Features of Git

  • Distributed
  • Compatible
  • Non-Linear
  • Branching
  • Light Weight
  • Speed
  • Open Source
  • Reliable
  • Secure
  • Economical

Distributed

Git allows distributed development of code. As we already know, every developer has a local copy of the entire development history, and changes are copied from one repository to another. It does not matter whether the developers are in different geographical locations; they can still work together.

Compatible

Git is compatible with existing systems and protocols, and migrating repositories from other version control systems to Git is possible. For example, if we have an SVN or SVK repository and want to migrate to Git, we can access it directly using git-svn.

Non-Linear

Git tracks the state of the project as a graph of commit objects, each pointing to a tree of files. It also includes tools with which we can navigate and visualize all of our work.

Branching

Branching allows us to do non-linear software development. Branches in Git are lightweight, and we can have multiple independent branches. The master branch starts at the beginning of the project and, by the end, contains the entire project.

Light Weight

Git uses lossless compression to compress data on the client’s side, so the local repository does not take up much space.
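Git compresses its stored objects with zlib. The short Python sketch below is not Git’s storage code; it only shows, on an invented piece of repetitive content, what lossless compression means: the data gets smaller, and decompressing gives back exactly the original bytes.

```python
import zlib

# Hypothetical file content; repetitive text, as source files often are.
original = b"<p>Hello, version control!</p>\n" * 200

compressed = zlib.compress(original)    # smaller representation
restored = zlib.decompress(compressed)  # byte-for-byte identical to the input

print(len(original), "bytes before,", len(compressed), "bytes after")
```

How much space is saved depends heavily on the content; already-compressed files such as images shrink very little.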

Speed

Git is fast. We do not need an internet connection to work in the distributed environment, because most operations run against the local repository. Git is written in C, which reduces runtime overhead and makes it faster.

Open Source

Git was created by Linus Torvalds, the creator of the Linux kernel, who used it for the development of the Linux kernel. The source code is available, and we can modify and use it.

Reliable

Git is very reliable. Every local repository is a backup, and we can also make a duplicate copy of the central repository.

Secure

Git uses SHA-1 to name and identify its objects. Whenever we commit a change, Git creates a commit object. SHA-1 is a cryptographic hash function (not an encryption technique), so any change to an object’s content changes its name. No change is hidden from the group of developers.
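To make the naming scheme concrete, here is a minimal Python sketch of how Git derives the SHA-1 name of a file’s content (a blob object): it hashes a short header followed by the content. The function name git_blob_id is our own, and the sketch assumes a repository that uses Git’s default SHA-1 object format.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 object name Git assigns to a file's content (a blob)."""
    # Git hashes the header "blob <size in bytes>\0" followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    print(git_blob_id(b"<h1>Home</h1>\n"))
    print(git_blob_id(b"<h1>Home!</h1>\n"))  # one character differs, new name
```

For the same bytes, the result should match what `git hash-object <file_name>` prints, and even a one-character change produces a completely different name.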

Economical

Git is released under the GNU GPL and is free of charge. We also save money by not needing costly servers.

 

What is a Repository?

The repository is a directory or a storage space where our projects can live, and it can be local to a computer, or it can be a storage space on GitHub or another online host.

The types of repository are:

  • Central Repository
  • Local Repository

Locally, the repository data is stored in a “.git” folder inside the project’s root folder.

Git Operations & Commands

Creating Repository

To create a central repository, we first need a GitHub account. Create one, and then use the web interface to create a central repository.

Install Git on your local machine and run Git Bash. On Windows, go to a folder or directory, right-click, and choose the “Git Bash” option.

git init

Use this command to create a local repository.

git remote add origin <link>

Use this command to link the local repository with the central repository. To find the link, go to your GitHub account and open the central repository. Click the green “Clone/Download” button and copy the HTTPS URL.

git pull

Use this command to pull files from the central repository.

git push

Use this command to push files from the local repository to the central repository.

git clone <link>

Use this command to clone (download) an existing repository from GitHub. It copies the entire repository, not a single file.

Making Changes

There is an intermediate layer called “index” which resides between the workspace (project folder) and the local repository. When we want to commit changes or make changes to the local repository, we have to add those files to the index first.

git status

Shows the state of the workspace and the index, including the files that have been added to the index and are ready to commit.

git add <file_name>

This adds the file to the index.

git add -A

This adds all changed files to the index.

git commit

This records a snapshot of the staged changes in the local repository at a given time. Committed snapshots never change unless we change them explicitly.

git commit -m "<message>"

Commit with a commit message. The -m option is optional: if we omit it, Git opens an editor so that we can enter the commit message.

git log

Shows the commit history of the current branch.
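The relationship between the workspace, the index, and the local repository can be modeled in a few lines of Python. This is a toy model under our own assumptions, not Git’s implementation: the class ToyRepo and its methods are hypothetical and only mimic the effect of the commands above on in-memory dictionaries.

```python
class ToyRepo:
    """Toy model of the three areas: workspace, index, local repository."""

    def __init__(self):
        self.workspace = {}  # file name -> content being edited
        self.index = {}      # staged content, ready to commit
        self.commits = []    # list of (message, snapshot) tuples

    def add(self, name):
        """Like `git add <file_name>`: stage one file from the workspace."""
        self.index[name] = self.workspace[name]

    def add_all(self):
        """Like `git add -A` in this toy model: stage every workspace file."""
        self.index = dict(self.workspace)

    def commit(self, message):
        """Like `git commit -m`: record a snapshot of the index."""
        self.commits.append((message, dict(self.index)))

    def log(self):
        """Like `git log`: list commit messages, newest first."""
        return [message for message, _ in reversed(self.commits)]


repo = ToyRepo()
repo.workspace["index.html"] = "<h1>Home</h1>"
repo.workspace["about.html"] = "<h1>About</h1>"

repo.add("index.html")         # only index.html is staged
repo.commit("Add start page")  # about.html is not part of this snapshot

repo.add_all()
repo.commit("Add about page")
```

The point of the intermediate layer is visible in the first commit: a file that exists in the workspace but was never added to the index does not end up in the snapshot.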

Parallel Development – Branching

A branch is a pointer to a commit.

There are two types of branches:

  • Local Branches
  • Remote-tracking branches

git branch <branch_name>

This creates a new branch. Because we were initially on the master branch, the new branch contains all the files in the master branch.

git checkout <branch_name>

Use this command to change the current branch.
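The idea that a branch is just a pointer to a commit can be sketched as follows. The ToyHistory class, its method names, and the commit ids c1 to c3 are all hypothetical; real Git identifies commits by hash and stores much more per commit.

```python
class ToyHistory:
    """Commits form a chain via parent links; branches are names for commits."""

    def __init__(self):
        self.parents = {}                 # commit id -> parent id (or None)
        self.branches = {"master": None}  # branch name -> commit it points to
        self.current = "master"

    def commit(self, commit_id):
        """Create a commit on the current branch and advance its pointer."""
        self.parents[commit_id] = self.branches[self.current]
        self.branches[self.current] = commit_id

    def branch(self, name):
        """Like `git branch <branch_name>`: a new pointer to the same commit."""
        self.branches[name] = self.branches[self.current]

    def checkout(self, name):
        """Like `git checkout <branch_name>`: switch the current branch."""
        if name not in self.branches:
            raise KeyError(f"unknown branch: {name}")
        self.current = name


history = ToyHistory()
history.commit("c1")
history.commit("c2")
history.branch("feature")    # feature and master both point at c2
history.checkout("feature")
history.commit("c3")         # only the feature pointer moves
```

Creating a branch copies no files in this model; it only adds one more name for an existing commit, which is why branching is cheap.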

Parallel Development – Merging

git merge <branch_name>

Merges the named branch into the current branch. To merge a branch into master, we need to be on the master branch when performing this operation.

 

Parallel Development – Rebasing

Rebasing is another kind of merging. The advantage is a much cleaner project history: the commits of the current branch are replayed on top of the tip of the named branch (for example, master), so the history of both branches forms a single line.

git rebase <branch_name>
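As a very rough picture of what rebasing does to the history, the Python sketch below treats each branch as a list of commit labels. Everything here is an assumption made for illustration: the rebase function, the fork_point parameter, and the labels are invented, and real Git replays the actual changes and may stop on conflicts.

```python
def rebase(branch_commits, base_commits, fork_point):
    """Replay the branch's own commits on top of the tip of the base branch.

    Commits are plain labels here; `fork_point` is the number of commits
    the two branches share at the start of their histories.
    """
    own_commits = branch_commits[fork_point:]
    # A trailing apostrophe marks a replayed commit: same change, new identity.
    return base_commits + [commit + "'" for commit in own_commits]


master = ["A", "B", "D", "E"]     # D and E were added after the fork
feature = ["A", "B", "C1", "C2"]  # C1 and C2 are the feature's own commits

rebased_feature = rebase(feature, master, fork_point=2)
print(rebased_feature)
```

The rebased branch contains everything from master followed by its own commits, so the history reads as one straight line instead of two diverging ones.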

 

To push from local repository to Central repository

To push to a repository over SSH, we first need an SSH key.

ssh-keygen

Use this command to generate an SSH key.

Go to your GitHub account, open Settings, then SSH and GPG keys.

Add the SSH key: provide a name and paste the entire public key.

ssh -T git@github.com

Use this command to test SSH authentication with GitHub.

git push origin <name_of_local_branch>

Pushes the local branch to the central repository as a remote branch.
