Do you have information out there on the process of transitioning from a different data catalog to IBM's? - watson-knowledge-catalog

I'm looking for general information on how the process of transitioning to WKC (on cloud and/or on-prem) will be for Collibra data catalog customers, for example. Specifically if it's an easy transition, and/or if there are services dedicated to helping with transition.


Data Consistency Across Microservices

While each microservice generally will have its own data - certain entities are required to be consistent across multiple services.
For such a data consistency requirement in a highly distributed landscape such as a microservices architecture, what are the design choices? Of course, I do not want a shared database architecture, where a single DB manages the state across all the services. That violates the isolation and shared-nothing principles.
I do understand that a microservice can publish an event when an entity is created, updated or deleted. All other microservices which are interested in this event can accordingly update the linked entities in their respective databases.
This is workable; however, it requires a lot of careful, coordinated programming effort across the services.
Can Akka or any other framework solve this use case? How?
Adding the below diagram for clarity.
Basically, I am trying to understand, if there are available frameworks today that can solve this data consistency problem.
For the queue I can use any AMQP software such as RabbitMQ or Qpid etc.
For the data consistency framework, I am not sure if Akka or any other software can presently help. Or is this scenario so uncommon, and such an anti-pattern, that no framework should ever be needed?
The Microservices architectural style tries to allow organizations to have small teams own services independent in development and at runtime. See this read. And the hardest part is to define the service boundaries in a useful way. When you discover that the way you split up your application results in requirements impacting multiple services frequently that would tell you to rethink the service boundaries. The same is true for when you feel a strong need to share entities between the services.
So the general advice would be to try very hard to avoid such scenarios. However, there may be cases where you cannot avoid this. Since a good architecture is often about making the right trade-offs, here are some ideas.
Consider expressing the dependency using service interfaces (API) instead of a direct DB dependency. That would allow each service team to change their internal data schema as much as required and only worry about the interface design when it comes to dependencies. This is helpful because it is easier to add additional APIs and slowly deprecate older APIs instead of changing a DB design along with all dependent Microservices (potentially at the same time). In other words you are still able to deploy new Microservice versions independently, as long as the old APIs are still supported. This is the approach recommended by the Amazon CTO, who was pioneering a lot of the Microservices approach. Here is a recommended read of an interview in 2006 with him.
Whenever you really cannot avoid using the same DBs and you are splitting your service boundaries in a way that multiple teams/services require the same entities, you introduce two dependencies between the Microservice team and the team that is responsible for the data schema: a) Data Format, b) Actual Data. This is not impossible to solve, but only with some organizational overhead. And if you introduce too many such dependencies, your organization will likely be crippled and slowed down in development.
a) Dependency on the data schema. The entities' data format cannot be modified without requiring changes in the Microservices. To decouple this you will have to version the entities' data schema strictly and support in the database all versions of the data that the Microservices are currently using. This would allow the Microservices teams to decide for themselves when to update their service to support the new version of the data schema. This is not feasible with all use cases, but it works with many.
b) Dependency on the actual collected data. The data that has been collected and is of a known version for a Microservice is OK to use, but the issue occurs when some services produce a newer version of the data and another service depends on it but has not yet been upgraded to read the latest version. This problem is hard to solve and in many cases suggests you did not choose the service boundaries correctly. Typically you have no choice but to roll out all services that depend on the data at the same time as upgrading the data in the database. A more wacky approach is to write different versions of the data concurrently (which works mostly when the data is not mutable).
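Point a) can be illustrated with a reader that accepts every schema version still present in the store. This is only a sketch: the `schema_version` field, the two record layouts and the function name are hypothetical, chosen for illustration.

```python
from typing import Any, Dict


def read_customer(record: Dict[str, Any]) -> Dict[str, str]:
    """Normalise a stored customer record to the shape this service uses.

    The field names and both versions are assumptions made for this example.
    """
    version = record.get("schema_version")
    if version == 1:
        # Hypothetical v1 layout: one combined "name" field.
        first, _, last = record["name"].partition(" ")
        return {"first_name": first, "last_name": last}
    if version == 2:
        # Hypothetical v2 layout: the name is already split.
        return {
            "first_name": record["first_name"],
            "last_name": record["last_name"],
        }
    # Unknown versions are rejected instead of being guessed at.
    raise ValueError(f"unsupported schema_version: {version!r}")
```

A service built this way can keep reading old records while other teams move to the newer layout on their own schedule.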
To solve both a) and b), in some other cases the dependency can be reduced by hidden data duplication and eventual consistency. This means each service stores its own version of the data and only modifies it whenever the requirements for that service change. The services can do so by listening to a public data flow. In such scenarios you would be using an event-based architecture where you define a set of public events that can be queued up and consumed by listeners from the different services. Each listener processes the event and stores whatever data from it is relevant to its service (potentially creating data duplication). Some other events may indicate that internally stored data has to be updated, and it is each service's responsibility to do so with its own copy of the data. A technology to maintain such a public event queue is Kafka.
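A minimal sketch of that idea, with an in-memory bus standing in for a real broker such as Kafka. The event names, the payload fields and the `BillingService` class are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Event = Dict[str, Any]


class EventBus:
    """In-memory stand-in for a broker; delivery here is synchronous."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[Event], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, event: Event) -> None:
        for handler in self._subscribers[event_type]:
            handler(event)


class BillingService:
    """Keeps its own copy of only the customer fields it cares about."""

    def __init__(self, bus: EventBus) -> None:
        self.customers: Dict[str, str] = {}  # customer id -> email
        bus.subscribe("customer.updated", self.on_customer_updated)
        bus.subscribe("customer.deleted", self.on_customer_deleted)

    def on_customer_updated(self, event: Event) -> None:
        # Fields this service does not need (e.g. "name") are ignored.
        self.customers[event["id"]] = event["email"]

    def on_customer_deleted(self, event: Event) -> None:
        self.customers.pop(event["id"], None)


bus = EventBus()
billing = BillingService(bus)
bus.publish("customer.updated", {"id": "c1", "email": "a@example.com", "name": "Ada"})
```

With a real broker the handlers would run asynchronously, so the local copy lags the publisher by some delay. That delay is the eventual consistency mentioned above.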
Theoretical Limitations
One important caveat to remember is the CAP theorem:
In the presence of a partition, one is then left with two options:
consistency or availability. When choosing consistency over
availability, the system will return an error or a time-out if
particular information cannot be guaranteed to be up to date due to
network partitioning.
So by "requiring" that certain entities are consistent across multiple services you increase the probability that you will have to deal with timeout issues.
Akka Distributed Data
Akka has a distributed data module to share information within a cluster:
All data entries are spread to all nodes, or nodes with a certain
role, in the cluster via direct replication and gossip based
dissemination. You have fine grained control of the consistency level
for reads and writes.
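Akka Distributed Data is built around conflict-free replicated data types (CRDTs), which let replicas accept writes independently and still converge once they exchange state. The following is not Akka's API. It is a small Python sketch of one such type, a grow-only counter, to show why merging gossiped state is safe; the class and method names are made up for this example.

```python
from typing import Dict


class GCounter:
    """Grow-only counter: each node only ever increments its own slot."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Taking the per-node maximum makes merge commutative and idempotent,
        # so replicas converge regardless of gossip order or repetition.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    @property
    def value(self) -> int:
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
```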
I think there are 2 main forces at play here:
decoupling - that's why you have microservices in the first place and want a shared-nothing approach to data persistence
consistency requirement - if I understood correctly you're already fine with eventual consistency
The diagram makes perfect sense to me, but I don't know of any framework to do it out of the box, probably due to the many use-case specific trade-offs involved. I'd approach the problem as follows:
The upstream service emits events on to the message bus, as you've shown. For the purpose of serialisation I'd carefully choose the wire format that doesn't couple the producer and consumer too much. The ones I know of are protobuf and avro. You can evolve your event model upstream without having to change the downstream if it doesn't care about the newly added fields and can do a rolling upgrade if it does.
The downstream services subscribe to the events - the message bus must provide fault-tolerance. We're using kafka for this but since you chose AMQP I'm assuming it gives you what you need.
In case of network failures (e.g. the downstream consumer cannot connect to the broker) if you favour (eventual) consistency over availability you may choose to refuse to serve requests that rely on data that you know can be more stale than some preconfigured threshold.
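The last point, refusing to serve data older than a configured threshold, could look roughly like this. The class name, the error type and the idea of stamping the view on every applied event are assumptions for this sketch; a real consumer would probably also refresh the timestamp on broker heartbeats, so that a quiet topic is not mistaken for a lost connection.

```python
import time
from typing import Any, Callable, Dict, Optional


class StaleDataError(RuntimeError):
    """Raised when the local copy may be older than the allowed threshold."""


class ReplicatedView:
    def __init__(self, max_staleness_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_staleness_s = max_staleness_s
        self._clock = clock
        self._data: Dict[str, Any] = {}
        self._last_sync: Optional[float] = None

    def apply_event(self, key: str, value: Any) -> None:
        # Called by the consumer for each event received from the broker.
        self._data[key] = value
        self._last_sync = self._clock()

    def get(self, key: str) -> Any:
        if (self._last_sync is None
                or self._clock() - self._last_sync > self._max_staleness_s):
            # Prefer consistency over availability: fail instead of guessing.
            raise StaleDataError("local copy is older than the allowed threshold")
        return self._data[key]
```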
I think you can approach this issue from 2 angles, service collaboration and data modelling:
Service collaboration
Here you can choose between service orchestration and service choreography. You already mentioned the exchange of messages or events between services. This would be the choreography approach which as you said might work but involves writing code in each service that deals with the messaging part. I'm sure there are libraries for that though. Or you can choose service orchestration where you introduce a new composite service - the orchestrator, which can be responsible for managing the data updates between the services. Because the data consistency management is now extracted into a separate component, this would allow you to switch between eventual consistency and strong consistency without touching the downstream services.
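To make the orchestration option concrete, here is a rough sketch in which one component pushes an update to every downstream service and undoes the partial work if one of them fails. The service class, the `set_email` method and the undo-on-failure policy are all assumptions for illustration. A real orchestrator would call remote APIs and would need to cope with failures during the undo step as well.

```python
from typing import Dict, List, Optional, Tuple


class InMemoryService:
    """Hypothetical stand-in for a downstream service's update API."""

    def __init__(self, name: str, fail: bool = False) -> None:
        self.name = name
        self.fail = fail
        self.emails: Dict[str, str] = {}

    def set_email(self, customer_id: str, email: Optional[str]) -> None:
        if self.fail:
            raise RuntimeError(f"{self.name} is unavailable")
        if email is None:
            self.emails.pop(customer_id, None)
        else:
            self.emails[customer_id] = email


class Orchestrator:
    """Owns the consistency logic so the services themselves stay simple."""

    def __init__(self, services: List[InMemoryService]) -> None:
        self.services = services

    def update_email(self, customer_id: str, email: str) -> None:
        applied: List[Tuple[InMemoryService, Optional[str]]] = []
        try:
            for service in self.services:
                previous = service.emails.get(customer_id)
                service.set_email(customer_id, email)
                applied.append((service, previous))
        except RuntimeError:
            # Restore the previous value on every service already updated.
            for service, previous in reversed(applied):
                service.set_email(customer_id, previous)
            raise
```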
Data modelling
You can also choose to redesign the data models behind the participating microservices and to extract the entities that are required to be consistent across multiple services into relationships managed by a dedicated relationship microservice. Such a microservice would be somewhat similar to the orchestrator but the coupling would be reduced because the relationships can be modelled in a generic way.
"accordingly update the linked entities in their respective databases" -> data duplication -> FAIL.
Using events to update other databases is identical to caching, which brings the cache consistency problem, and that is the very problem you raise in your question.
Keep your local databases as separated as possible and use pull semantics instead of push, i.e. make RPC calls when you need some data and be prepared to gracefully handle possible errors like timeouts, missing data or service unavailability. Akka or Finagle give you enough tools to do that right.
This approach might hurt performance but at least you can choose what to trade and where. Possible ways to decrease latency and increase throughput are:
scale data provider services so they can handle more req/sec with lower latency
use local caches with short expiration time. That will introduce eventual consistency but really helps with performance.
use distributed cache and face cache consistency problem directly
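The pull approach combined with a short-lived local cache might be sketched as follows. `fetch` stands for whatever RPC call the data provider exposes; it, the class name and the choice to return `None` on failure are assumptions of this example.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TtlCache:
    """Pulls data on demand and keeps it only for a short time."""

    def __init__(self, fetch: Callable[[str], Any], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1]  # still fresh enough, no remote call needed
        try:
            value = self._fetch(key)
        except (TimeoutError, ConnectionError):
            # Degrade gracefully; the caller must handle missing data.
            return None
        self._entries[key] = (now, value)
        return value
```

The TTL is the knob that trades staleness against load on the provider service.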

How to think about microservices?

I am in the learning process of converting my monolithic GAE app into a Microservices Architecture.
I understand that the app is separated into services that can communicate with each other. Different categories of requests are handled by different services as designated by the dispatch.yaml file.
How do we determine what to make into a service? Consider an online job board website which has the following functionalities:
There are two user roles: JobSeeker and Company
Companies can post a Job entity
Job Seekers can create a JobApplication entity (that corresponds with a Job)
Both user roles have to be authenticated
Companies manage their own CompanyProfile
JobSeekers manage their own list of JobApplications and their JobProfiles
What is the guiding thought process for separating our app into microservices?
Some of the places where microservices are advantageous over a monolithic application are:
When you want different release cycles for different parts of your application.
When one component is used by numerous different upstream components (e.g. a shared authentication system.)
When you want to isolate failures (e.g. so an upstream component can gracefully degrade if a downstream component is down.)
When you want to limit the scope of data availability (e.g. keeping encryption keys to the smallest possible surface)
When you want to minimize the stateful parts of your application (e.g. so you can have a stateless frontend that scales easily and a stateful backend that you scale by sharding).
In other words: split up into microservices where it will actually be useful to you. Merely splitting up your application for the sake of it is going to make your life more complex than necessary.

Android + Express + Mongodb

I researched a bunch of posts here. Though some of the responses validate my questions, there were a few gaps, hence I am posting to get a complete end-to-end validation of my architecture.
I want to build an MVP and want to get some input on the fastest path to make this happen and validation of my technology choices. I can handle JavaScript, express-node.js and mongo, and I have beginner-level experience with Android programming and medium-level Java.
I want an android app that can authenticate users before using android app, then enable user to manage customers and their meta data, manage some documents pertaining to per customer like in CRM apps, send the doc back to customer, and process payments from customer via a payment gateway, and also send some notifications to customers via SMS to do some action "a" "b" "c".
I considered moving to a cloud-based solution rather than local database for obvious reasons. Here is the technology stack I was considering. I need validation that this makes sense and I am not missing something.
Client side programming
Android Studio 3.0
ButterKnife: a view binding library to generate some code
Retrofit for java interface into restful api
Android app will talk to restful api/web service powered by node.js
Express framework + node.js server
passport.js for authenticating
payment gateway: stripe
may be amazon cdn to hold some docs, templates, and then hold static urls to that data in mongodb, and use node-aws modules in my express app
Backend storage
noSQL: Mongoose ODM + mongodb
modules: pug, moment
At some point later, in 5-6 months, I do anticipate having an offline option. I know this is tricky with mongodb, and couchdb etc. might be a better fit since they have a lite option. The other option was for me to use Google Firebase. Or use something like asymmetric sync with SQLite and mongodb (though I don't like to mix SQL and NoSQL).
Again speed is very important to me.
Please share your thoughts on the technology choices I have made and if I am on the right track, and things I need to worry about. Again I am getting MVP ready, soon want a few customers to try it out, and later go GA in 6-9 months.
There is no "best" tech stack for a project. Even if we knew enough about you and the project to make a recommendation (which we don't), none of us have a crystal ball to predict the future, and the technology that seems to fit now may be a very poor fit later.
The best advice I can offer for building an MVP is this: pick the tech stack that lets you validate your riskiest assumptions as quickly/cheaply as possible.
In fact, you should expect to build many MVPs, as the MVP is not a single product, but a process.
A few examples:
The riskiest assumption for most startups is that users want their product in the first place. The quickest/cheapest way to validate this is often to avoid building anything (which can take months), and instead, just talk to potential customers (which can take hours)! The "tech stack" in this case may be no more than a napkin sketch or a few mock-ups created in Balsamiq. If users seem uninterested, you'll have to go back to the drawing board, but at least you only lost a few hours instead of months. On the other hand, if customers start salivating when you show them your sketch, you can go on to the next MVP.
The next MVP you do may be a hard-coded, static HTML prototype of your product. Your "tech stack" may be to use html5up for free HTML5 templates, Bootstrap for styling, and GitHub Pages for hosting. Perhaps when you ask customers for money at this stage, they turn you down. That's a shame, but it only cost you a few days of static prototypes instead of months of coding. On the other hand, perhaps a few users will sign a check at this stage, which is great validation!
After that, the next MVP may be to add a dynamic backend behind your HTML. In this case, you may be far better off using a Platform as a Service (PaaS) such as Heroku or Backend as a Service (BaaS) such as Firebase so you don't have to spend time deploying/managing infrastructure. Use whatever data storage they offer and don't worry about scaling or fancy features. If you're lucky enough to get some traction, and scaling is becoming an issue, that's a great problem to have!
At this point, you will know much more about your problem space, and will be able to decide if a relational DB or NoSQL DB is a good fit for your data model, whether Node is the tech your team knows best, whether PaaS or BaaS still fits your needs or if you need to move to Infrastructure as a Service (IaaS) such as AWS, and so on. Even at this stage, you still typically want to optimize for moving quickly, so you'll typically pick technologies that (a) your team knows best and (b) have a large community that has built open source libraries you can use for your problem space.
For a more detailed look at these sorts of trade-offs, check out the "Picking a Tech Stack" chapter of Hello, Startup.
I can't speak on the Android side of things, but for the backend yes.
My understanding for the backend you want the following features:
User authentication
Online payments via Stripe
Data persistence
Data synchronization
User Authentication
You specified that you are going to use Passport, but you have not specified a strategy. If you are going to use basic local auth (username/password), then you'll want to make sure you follow best practices such as hashing the password.
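As an illustration of what "hashing the password" means in practice, here is a sketch using a salted, deliberately slow key-derivation function from Python's standard library. It is written in Python purely for illustration (the backend in the question is Node.js, where a maintained password-hashing library would play the same role), and the iteration count is an assumption you should tune for your own hardware.

```python
import hashlib
import hmac
import os
from typing import Tuple

ITERATIONS = 600_000  # assumed work factor; tune for your environment


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest); store both, never the plain password."""
    salt = os.urandom(16)  # a fresh random salt per user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(digest, expected)
```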
Online payments via Stripe
Stripe is easy to work with and fast to get going. Make sure it meets all of your requirements before making the commitment to it.
Data persistence
I would strongly recommend against using a NoSQL database such as MongoDB. The reason is that NoSQL is for non-relational data. SQL databases are meant for relational data, which is what you have described above:
enable user to manage customers
This would be a 1:N or one-to-many relationship; a user can have many customers.
There will be more relationships defined such as one-to-one or many-to-many which makes SQL the clear choice. If you're looking for speed then look into PostgreSQL as a free option otherwise Amazon Aurora would be best.
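The one-to-many relationship described above maps directly onto a foreign key. The sketch below uses an in-memory SQLite database only because it ships with Python; the table and column names are assumptions, and the same schema idea carries over to PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE
    );
    CREATE TABLE customers (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),  -- the "N" side of 1:N
        name    TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO users (id, email) VALUES (1, 'owner@example.com')")
conn.executemany(
    "INSERT INTO customers (user_id, name) VALUES (?, ?)",
    [(1, "Acme"), (1, "Globex")],
)

# All customers that belong to one user, resolved with a JOIN.
rows = conn.execute(
    """SELECT c.name FROM customers c
       JOIN users u ON u.id = c.user_id
       WHERE u.email = ? ORDER BY c.name""",
    ("owner@example.com",),
).fetchall()
```

The database itself rejects a customer row that points at a non-existent user, which is the kind of guarantee the answer is arguing for.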
Data synchronization
This is something you will need to spend considerable time on. You have to decide if you want to roll your own solution or use an existing solution such as Realm Platform. Rolling your own solution comes with its own complexities. For example, what if user A and user B both have access to the same customers and edit them at the same time? How will the mobile apps respond? Many other scenarios can occur as well.
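One common way to handle the "two users edit the same customer" case when rolling your own sync is optimistic concurrency: every record carries a version number, and an update is rejected if the client edited an outdated version. This is a sketch of that technique, not something prescribed above; the class and method names are hypothetical.

```python
from typing import Any, Dict, Tuple


class ConflictError(Exception):
    """The record changed since the client last read it."""


class CustomerStore:
    def __init__(self) -> None:
        self._rows: Dict[str, Tuple[int, Dict[str, Any]]] = {}

    def create(self, customer_id: str, data: Dict[str, Any]) -> None:
        self._rows[customer_id] = (1, dict(data))

    def read(self, customer_id: str) -> Tuple[int, Dict[str, Any]]:
        version, data = self._rows[customer_id]
        return version, dict(data)

    def update(self, customer_id: str, expected_version: int,
               data: Dict[str, Any]) -> int:
        current_version, _ = self._rows[customer_id]
        if current_version != expected_version:
            # Someone else saved first; the client must re-read and retry.
            raise ConflictError(
                f"expected version {expected_version}, found {current_version}")
        new_version = current_version + 1
        self._rows[customer_id] = (new_version, dict(data))
        return new_version
```

How the mobile app reacts to a `ConflictError` (reload, merge, or ask the user) remains a product decision.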
In addition to the above:
I would not make your Node.js/Express application a traditional monolith application/REST API. Take each feature and make it its own "app" or microservice.
Be mindful of any local, state, country, etc. regulations you may have to follow if you are accepting payments. Stripe has its own to follow and you may very well have another set you need to follow.
Consider the Spring framework, specifically the Spring Boot project.

can you do multiple Salesforce deployments, putting most users in the cheaper one and a few in one with best features? [closed]

Since Salesforce requires a high per-user payment for the deployment with some advanced features, have there been attempts to get the advanced features (which are probably less likely to be needed in day-to-day work than the basic ones) in a separate deployment with few users, and to transfer in the data from the cheaper basic deployment actually used by most employees?
We use a similar scheme to sync data from a subsidiary instance before we bring them up to the full-feature instance. Jitterbit 4 has Salesforce connectors which allow very easy creation of sync tasks (query/upsert, but even without the wizards it's easy to use the SF API service); you just need to chain pairs of operations (one web service call to load data from the source, another to upsert at the destination using the source record ID as the external ID in the destination).
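The query/upsert pairing boils down to the pattern below. This is not the Salesforce or Jitterbit API; plain dictionaries stand in for both instances, and the `Id` field name is an assumption for this sketch.

```python
from typing import Any, Dict, Iterable, Tuple


def sync(source_records: Iterable[Dict[str, Any]],
         destination: Dict[str, Dict[str, Any]]) -> Tuple[int, int]:
    """Upsert source records into a destination keyed by external ID.

    Returns (created, updated). The source record's own ID is used as the
    external ID, so re-running the sync updates rows instead of duplicating.
    """
    created = updated = 0
    for record in source_records:
        external_id = record["Id"]
        fields = {k: v for k, v in record.items() if k != "Id"}
        if external_id in destination:
            destination[external_id].update(fields)
            updated += 1
        else:
            destination[external_id] = fields
            created += 1
    return created, updated
```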

Service Oriented Architecture and Loose Coupling vs SQL JOINS

Let's suppose we have an SOA infrastructure like the one pictured below and that every service can run on a different host (this is especially valid for the two extranet services, "web site" and "payment system").
Clearly we have a data (persistence) layer. Suppose it's implemented through EJB + JPA or something similar.
If we want to join data (in user UI) between the different services I see at least a couple of alternatives:
we want to do efficient JOINs at the RDBMS level, so we have a package (i.e. persistence.package) that contains all the entities and session facades (CRUD implementation), which in some way has to be shared (how?) or deployed for every service. That said, if I change something in the order schema I must redeploy this package, introducing tight coupling between pretty much everything. Moreover, the database must be unique and shared.
to avoid such issues, we keep an entity package for each different service (i.e. order.package) and let the services communicate through some protocol (SOAP, REST, ESB, etc.), so we can keep data locally on each host (shared-nothing architecture) and we don't need to redeploy the entity package. But this approach is terrible for data mining, as a query that must search and return correlated data across multiple services will be very inefficient (as we cannot do SQL joins)
Is there a better / standard approach to the issues pointed above ?
The main motivation for SOA is independent components that can change separately. A secondary motivation, as Marco mentioned, is simplifying a system into smaller problems that are easier to solve.
The upside of different services is flexibility; the downside is more management and overhead. That overhead should be justified by what you get back. See for example a SOA anti-pattern I published called Nanoservices, which talks about this balance.
Another thing to keep in mind is that a web-service API does not automatically mean that it is a service boundary. Several APIs that belong to a larger service can still connect to the same database underneath. So for example, if in your system payments and orders belong together, you shouldn't separate them just because they are different APIs. (In many systems these are indeed different concerns but, again, that's not automatic.)
When and if you do find the separation into services logical, then you should follow Marco's advice and ensure that the services are isolated and don't share databases. Having services isolated this way preserves their ability to change. You can then integrate them in the UI with a composite front end. You should note that this works well for the operational side of the application, as there you only need a few items from each service. For reporting you'd want something like aggregated reporting, i.e. export immutable copies of data into a central database optimized for reporting (e.g. a denormalized star schema etc.)
Oh my friend you're just complicating the whole scenario, but it is not your fault, companies like MSFT, Oracle and other big vendors of enterprise class software like to make a big picture of something that is way easier, and they do it for a reason: scaling services vertically means more licenses.
You can take a different approach and forget for a moment about all those big words, EJB, JPA... and like someone smart once said, split the big problem in smaller parts so that rather than having a big problem you have a couple of small problems which in theory should be easier to deal with.
So you have a couple of services in your system: people, payment system, orders, products, ERP... for a moment let's think that those boundaries are right in terms of business entities. Imagine those services are different physical departments of your company, which means that they only care about the data that belongs to them, nothing else.
You could then say that the Payments department has its own database, and the same applies to Orders. Of course they still need to communicate with each other as all departments do, and that can be made easy with a system-generated public surrogate key. What this all means is that each service maintains the referential integrity of all its internal entities using internal keys, but if records need to be made available to other services you can for example use a Guid key, e.g.:
The payments service needs the order ID and the Customer ID, but those entities belong to their own services, so instead of sharing a private key (primary key) of each record, each record will have a primary key plus a surrogate external key that the services use to refer to it among themselves. The thing is, you should aim to build loosely coupled services, each with its own "small" database. Each service should also have its own API, which should be used not only by the front end, but by the other services as well. Another thing you should avoid is using DTC or another transaction management provider as a service-wide transaction guarantor; the same result can be achieved easily with just a different architectural approach.
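A small sketch of the "internal primary key plus public surrogate key" idea, using SQLite from Python's standard library. The table layout and function names are assumptions made for this example.

```python
import sqlite3
import uuid
from typing import Dict, Optional

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id        INTEGER PRIMARY KEY,   -- private key, never leaves the service
        public_id TEXT NOT NULL UNIQUE,  -- Guid handed out to other services
        status    TEXT NOT NULL
    )
""")


def create_order(status: str = "awaiting payment") -> str:
    public_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO orders (public_id, status) VALUES (?, ?)",
        (public_id, status),
    )
    return public_id


def get_order_by_public_id(public_id: str) -> Optional[Dict[str, str]]:
    # Only the public key is exposed; the integer id stays internal.
    row = conn.execute(
        "SELECT public_id, status FROM orders WHERE public_id = ?",
        (public_id,),
    ).fetchone()
    return None if row is None else {"order_id": row[0], "status": row[1]}
```

Because other services only ever see `public_id`, the Orders service can renumber, shard or migrate its internal keys without breaking them.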
Anyway, read, if you haven't already about DDD, it will give you a different overview on how to build enterprise class software, and btw EJB, run away from them.
You could use something like event SOA, but let's keep things simple here. A registered client comes to your site to place an order. The service responsible for this is Orders. A list of external IDs for the products is submitted to the Orders service, which then registers the order. At this point the order is in an "awaiting payment" status and the service returns a public Guid order ID. For the order to be completed the customer needs to pay for the goods. The payment details are submitted to the Payments service, which tries to process a payment, but for that it needs the order details, because the only thing the frontend sent was the order ID. To get them it uses GetOrderDetails(Guid orderId) from the Orders API. Once the payment is completed, the Payments service calls yet another method of the Orders API, PaymentWasCompletedForOrder(Guid orderID). Let me know if there is anything you are not getting.
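The flow just described can be traced in a few lines. The two method names mirror GetOrderDetails and PaymentWasCompletedForOrder from the text; everything else (the in-memory storage, the "paid" status, the `card_token` parameter, direct method calls in place of network calls) is an assumption for this sketch.

```python
import uuid
from typing import Any, Dict, List


class OrdersService:
    def __init__(self) -> None:
        self._orders: Dict[str, Dict[str, Any]] = {}

    def place_order(self, customer_id: str, product_ids: List[str]) -> str:
        order_id = str(uuid.uuid4())  # public Guid returned to the caller
        self._orders[order_id] = {
            "customer_id": customer_id,
            "product_ids": list(product_ids),
            "status": "awaiting payment",
        }
        return order_id

    def get_order_details(self, order_id: str) -> Dict[str, Any]:
        return dict(self._orders[order_id])

    def payment_was_completed_for_order(self, order_id: str) -> None:
        self._orders[order_id]["status"] = "paid"


class PaymentsService:
    def __init__(self, orders_api: OrdersService) -> None:
        self._orders_api = orders_api
        self.payments: List[Dict[str, str]] = []

    def pay(self, order_id: str, card_token: str) -> None:
        # The frontend only sent the order ID, so ask Orders for the rest.
        details = self._orders_api.get_order_details(order_id)
        if details["status"] != "awaiting payment":
            raise ValueError("order is not awaiting payment")
        # Charging the card via a gateway is out of scope for this sketch.
        self.payments.append(
            {"order_id": order_id, "customer_id": details["customer_id"]})
        self._orders_api.payment_was_completed_for_order(order_id)


orders = OrdersService()
payments = PaymentsService(orders)
order_id = orders.place_order("cust-1", ["prod-a", "prod-b"])
payments.pay(order_id, card_token="tok_test")
```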